Certain types of classifiers lend themselves naturally to incremental streaming scenarios, and can perform the tokenization of text and construction of the model in one pass. Naive Bayes multinomial is one such algorithm; linear support vector machines and logistic regression learned via stochastic gradient descent (SGD) are some others. These methods have the advantage of being "any time" algorithms - i.e. they can produce a prediction at any stage in the learning process. Furthermore, they scale linearly with the amount of data and can be considered "Big Data" methods.
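In Weka these incremental learners implement the UpdateableClassifier interface. As a minimal sketch of the "any time" contract (the dataset header and the instance source here are placeholders, not part of any particular flow):

```java
import weka.classifiers.Classifier;
import weka.classifiers.UpdateableClassifier;
import weka.core.Instance;
import weka.core.Instances;

public class AnyTimeLearning {
  // Learn in a single pass over the data; the model can be queried between
  // any two updates. 'header' is the dataset structure (no rows) and 'stream'
  // is any iterable source of incoming instances -- both assumptions here.
  public static <C extends Classifier & UpdateableClassifier> C learn(
      C classifier, Instances header, Iterable<Instance> stream) throws Exception {
    classifier.buildClassifier(header);    // initialise from the structure only
    for (Instance inst : stream) {
      classifier.updateClassifier(inst);   // each instance is seen exactly once
      // The partially trained model is already usable at this point, e.g.
      // double[] dist = classifier.distributionForInstance(someNewInstance);
    }
    return classifier;
  }
}
```

Because each instance is touched once and then discarded, runtime grows linearly with the size of the stream.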
Text classification is a supervised learning task, which means that each training document needs to have a category or "class" label provided by a "teacher". Manually labeling training data is a labor-intensive process, and typical training sets are not huge. This seems to preclude the need for big data methods. Enter Twitter's endless data stream and the prediction of sentiment. The limited size of tweets encourages the use of emoticons as a compact way of indicating the tweeter's mood, and these can be used to automate the labeling of training examples [1].
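Just to illustrate the idea (a hypothetical helper, not the component Weka uses for this), automated labeling can be as simple as:

```java
public class EmoticonLabeler {
  // Hypothetical helper: tweets label themselves via the emoticons they
  // contain; tweets with no (or conflicting) emoticons stay unlabeled.
  public static String label(String tweet) {
    boolean positive = tweet.contains(":)") || tweet.contains(":-)") || tweet.contains(":D");
    boolean negative = tweet.contains(":(") || tweet.contains(":-(");
    if (positive && !negative) return "positive";
    if (negative && !positive) return "negative";
    return null;   // not usable as a training example
  }
}
```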
So how can this be implemented in Weka [2]? Some new text-processing components for Weka's Knowledge Flow, together with the new NaiveBayesMultinomialText and SGDText classifiers that learn models directly from string attributes, make it fairly simple.
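On the API side, a rough sketch of what "learning directly from string attributes" looks like (the attribute names and example tweet are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.functions.SGDText;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class SGDTextFromStrings {
  public static void main(String[] args) throws Exception {
    // Two attributes: the raw tweet text (a string attribute) and a nominal class.
    ArrayList<Attribute> atts = new ArrayList<Attribute>();
    atts.add(new Attribute("tweet", (ArrayList<String>) null));   // string attribute
    atts.add(new Attribute("sentiment",
        new ArrayList<String>(Arrays.asList("negative", "positive"))));
    Instances data = new Instances("tweets", atts, 0);
    data.setClassIndex(1);

    SGDText classifier = new SGDText();
    classifier.buildClassifier(data);   // header only; learning happens incrementally

    // One labeled tweet -- SGDText tokenizes the string value itself,
    // so no separate StringToWordVector pass is needed.
    double[] vals = new double[2];
    vals[0] = data.attribute(0).addStringValue("loving the new phone :)");
    vals[1] = data.attribute(1).indexOfValue("positive");
    DenseInstance inst = new DenseInstance(1.0, vals);
    inst.setDataset(data);
    classifier.updateClassifier(inst);
  }
}
```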
This example Knowledge Flow process incrementally reads a file containing some 850K tweets. However, using the Groovy Scripting step with a little custom code, along with the new JsonFieldExtractor step, it would be straightforward to connect directly to the Twitter streaming service and process tweets in real time. The SGDText classifier component performs tokenization, stemming, stopword removal, dictionary pruning and the learning of a linear logistic regression model, all incrementally. Evaluation is performed by interleaved testing and training, i.e. a prediction is produced for each incoming instance before it is incorporated into the model. For the purposes of evaluation, this example flow discards all tweets that don't contain any emoticons, which results in most of the data being discarded. If evaluation wasn't performed, then all tweets could be scored, with only the labeled ones being used to train the model.
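For the curious, here is a minimal Java sketch of the same interleaved test-then-train idea, assuming a local ARFF file whose tweets have already been labeled (e.g. via emoticons) and whose class attribute comes last; the file name is an assumption, and the actual example uses Knowledge Flow steps rather than code:

```java
import java.io.File;

import weka.classifiers.functions.SGDText;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class InterleavedTestThenTrain {
  public static void main(String[] args) throws Exception {
    // Read the ARFF file one instance at a time rather than loading it all.
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("tweets.arff"));          // assumed input file
    Instances header = loader.getStructure();
    header.setClassIndex(header.numAttributes() - 1); // class assumed last

    SGDText classifier = new SGDText();
    classifier.buildClassifier(header);

    int correct = 0, labeled = 0;
    Instance current;
    while ((current = loader.getNextInstance(header)) != null) {
      if (current.classIsMissing()) {
        continue;   // no emoticon => no label; discarded for evaluation
      }
      // Test first...
      if (classifier.classifyInstance(current) == current.classValue()) {
        correct++;
      }
      labeled++;
      // ...then train on the same instance.
      classifier.updateClassifier(current);
    }
    System.out.println("Interleaved accuracy: " + (double) correct / labeled);
  }
}
```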
SGDText is included in Weka 3.7.5. The SubstringReplacer and SubstringLabeler steps, along with the NaiveBayesMultinomialText classifier (not shown in the screenshot above), will be included in Weka 3.7.6 (due out April/May 2012). In the meantime, interested folks can grab a nightly snapshot of the developer version of Weka.
Options for SGDText
References
[1] Albert Bifet and Eibe Frank. Sentiment knowledge discovery in Twitter streaming data. In Proc 13th International Conference on Discovery Science, Canberra, Australia, pages 1-15. Springer, 2010.
[2] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3rd edition, 2011.