Thursday 5 July 2012

R Integration in Weka

These days it seems like every man and his proverbial dog is integrating the open-source R statistical language with his/her analytic tool. R users have long had access to Weka via the RWeka package, which allows R scripts to call out to Weka schemes and get the results back into R. Not to be left out in the cold, Weka now has a brand new package that brings the power of R into the Weka framework.

Weka

In this section I briefly cover what the new RPlugin package for Weka >= 3.7.6 offers. This package can be installed via Weka's built-in package manager.

Here is an list of the functionality implemented:

  • Execution of arbitrary R scripts in Weka's Knowledge Flow engine
  • Datasets into and out of the R environment
  • Textual results out of the R environment
  • Graphics out of R in png format for viewing inside of Weka and saving to files via the JavaGD graphics device for R
  • A perspective for the Knowledge Flow and a plugin tab for the Explorer that provides visualization of R graphics and an interactive R console
  • A wrapper classifier that invokes learning and prediction of R machine learning schemes via the MLR (Machine Learning in R) library

The following screenshot shows the execution of two separate R scripts in Weka's Knowledge Flow environment. This is accomplished by a new RScriptExecutor step for the Knowledge Flow.


The upper part of the flow loads a dataset in Weka's ARFF format and passes it to a RScriptExecutor step that first pushes the data into R as a data frame, and then learns an rpart decision tree in R. The tree, in text form, is then sent to a TextViewer component. The lower part of the flow uses a second RScriptExecutor step to load the iris data (inside of the R environment) and then create a scatter plot matrix using the "pairs" function. It also exports the iris data from R into Weka's internal "Instances" format and sends this to a second TextViewer. The scatter plot matrix produced by the R script is exported as a png and sent to an ImageSaver step. The GUI dialog for the RScriptExecutor showing the R script producing these results is shown at the bottom of the screenshot.

Any graphics produced by an RScriptExecutor step are also picked up by the "RConsole/visualize" perspective for the Knowledge Flow.


This perspective (which is also available in Weka's Explorer as a plugin tab) maintains a list of images produced by Knowledge Flow processes. It also provides an interactive R console where R commands can be typed and evaluated immediately.

In order to evaluate R machine learning models in the Weka framework, and to use Weka as a vehicle for operationalizing such models, it is necessary to go beyond just executing R scripts. The MLR wrapper classifier for Weka provides a bridge between the MLR library in R and Weka's "Classifier" API. It allows R models to be learned, evaluated and used for prediction inside of Weka's framework. It also allows the models learned in R to be persisted via serialization and encapsulated in the MLRClassifier for use at a later date. The following screenshots show the MLRClassifier at work in a Knowledge Flow process and in Weka's Explorer UI.




Pentaho

R integration, for scoring/prediction using R models, in Pentaho's PDI data integration tool is achieved with minimal effort using the existing WekaScoring plugin step for PDI. WekaScoring already handles scoring using pre-constructed Weka models (classifiers and clusterers) and PMML models. Since MLRClassifier is a Weka classifier it can be consumed immediately by the step and R models can be used for scoring inside of a PDI transformation.


It is also possible to execute R scripts and construct R predictive models from scratch as part of a PDI transformation using the existing Knowledge Flow plugin step for PDI. This allows, for example, R predictive models to be refreshed and R visualizations to be generated as part of an automated ETL process.

Technical

Weka's R integration uses the JRI library which provides JNI interface to the R native libraries. This, of course, requires that the user have R installed on their computer and that they have installed the rJava package (which includes JRI) from within the R environment. It also requires several environment variables to be set in order for the JRI native library and dependent R libraries to be found. The RPlugin package has instructions for easing this pain and a mechanism that attempts to find the JRI library in the most common installation locations under Windows and MacOS. Once JRI and R are available to the Java VM then Weka's RPlugin will install various R libraries (such as MLR) automatically.

Class loaders + native libraries (combined with the single-threaded nature of the R environment) add up to quite a headache when considering things like plugin environments, application servers and the like. Weka's RPlugin can be used in such environments where it is loaded (perhaps multiple times) by plugin class loaders. To achieve native library visibility across child class loaders, and to maintain a single point of access to R by clients, the byte code for certain key classes (from JRI, REngine and Weka) are injected into the root class loader very early in the class loading process. Many thanks to the guys over at the snappy-java project for detailing this approach.

PDI is a streaming environment and so is Weka (as far as prediction goes). R and MLR operate most efficiently in a batch fashion since the data frame is the structure that is used for both learning a model and making predictions. As the conversion and transfer of data from Weka or PDI into R is costly, the best performance is obtained by pushing over data in batches for prediction. Prediction using R models in Weka 3.7.6 is slow because each test instance has to be transfered into R as a separate data frame. The next release of Weka (3.7.7) due out in August rectifies this with a new batch prediction interface (Note that nightly snapshots of Weka already include this performance improvement).

Update, September 2013

MLR is now available as an official R package from CRAN, yay!! Many thanks to chief MLR developer Bernd Bischl for getting MLR to this point. 


I've just updated Weka's RPlugin package (version 1.1.0) to use the new MLR version 1.1-18. The MLRClassifier will now install MLR in R automatically the first time it is used. MLRClassifier is now more robust and has improved error reporting. Update your package meta data in Weka and give it a go!

Thursday 1 March 2012

Sentiment analysis with Weka

With the ever increasing growth in online social networking, text mining and social analytics are hot topics in predictive analytics. The standard approach to learning a document classifier is to convert unstructured text documents into something called the bag-of-words representation and then apply a standard propositional learning scheme to the result. Essentially this means splitting documents up into their constituent words, building a dictionary for the corpus and then converting each document into a fixed length vector of either binary word presence/absence indicators or word frequency counts. In general, this involves two passes over the data (especially if further transformations such as TF-IDF are to be applied) - one to build the dictionary, and a second one to convert the text to vectors. Following this a classifier can be learned.

Certain types of classifiers lend themselves naturally to incremental streaming scenarios, and can perform the tokenization of text and construction of the model in one pass.  Naive Bayes multinomial is one such algorithm; linear support vector machines and logistic regression learned via stochastic gradient descent (SGD) are some others. These methods have the advantage of being "any time" algorithms - i.e. they can produce a prediction at any stage in the learning process. Furthermore, they scale linearly with the amount of data and can be considered "Big Data" methods.

Text classification is a supervised learning task, which means that each training document needs to have a category or "class" label  provided by a "teacher". Manually labeling training data is a labor intensive process and typical training sets are not huge. This seems to preclude the need for big data methods. Enter Twitter's endless data stream and the prediction of sentiment. The limited size of tweets encourage the use of emoticons as a compact way of indicating the tweeter's mood, and these can be used to automate the labeling of training examples [1].

So how can this be implemented in Weka [2]? Some new text processing components for Weka's Knowledge Flow and the addition of NaiveBayesMultinomialText and SGDText for learning models directly from string attributes make it fairly simple.


This example Knowledge Flow process incrementally reads a file containing some 850K tweets. However, using the Groovy Scripting step with a little custom code, along with the new JsonFieldExtractor step, it would be straightforward to connect directly to the Twitter streaming service and process tweets in real-time. The SGDText classifier component performs tokenization, stemming, stopword removal, dictionary pruning and the learning of a linear logistic regression model all incrementally. Evaluation is performed by interleaved testing and training, i.e. a prediction is produced for each incoming instance before it is incorporated into the model. For the purposes of evaluation this example flow discards all tweets that don't contain any emoticons, which results in most of the data being discarded. If evaluation wasn't performed then all tweets could be scored, with only labeled ones getting used to train the model.

SGDText is included in Weka 3.7.5. The SubstringReplacer, SubstringLabeler and NaiveBayesMultinomialText classifier (not shown in the screenshot above) will be included with Weka 3.7.6 (due out April/May 2012). In the meantime, interested folks can grab a nightly snapshot of the developer version of Weka.

Options for SGDText

References
[1] Albert Bifet and Eibe Frank. Sentiment knowledge discovery in Twitter streaming data. In Proc 13th International Conference on Discovery Science, Canberra, Australia, pages 1-15. Springer, 2010.

[2] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3 edition, 2011.