Thursday, 5 July 2012

R Integration in Weka

These days it seems like every man and his proverbial dog is integrating the open-source R statistical language with their analytic tool. R users have long had access to Weka via the RWeka package, which allows R scripts to call out to Weka schemes and get the results back into R. Not to be left out in the cold, Weka now has a brand new package that brings the power of R into the Weka framework.

Weka

In this section I briefly cover what the new RPlugin package for Weka >= 3.7.6 offers. This package can be installed via Weka's built-in package manager.

Here is a list of the functionality implemented:

  • Execution of arbitrary R scripts in Weka's Knowledge Flow engine
  • Datasets into and out of the R environment
  • Textual results out of the R environment
  • Graphics out of R in PNG format, via the JavaGD graphics device for R, for viewing inside of Weka and saving to files
  • A perspective for the Knowledge Flow and a plugin tab for the Explorer that provides visualization of R graphics and an interactive R console
  • A wrapper classifier that invokes learning and prediction of R machine learning schemes via the MLR (Machine Learning in R) library

The following screenshot shows the execution of two separate R scripts in Weka's Knowledge Flow environment. This is accomplished by a new RScriptExecutor step for the Knowledge Flow.


The upper part of the flow loads a dataset in Weka's ARFF format and passes it to an RScriptExecutor step that first pushes the data into R as a data frame, and then learns an rpart decision tree in R. The tree, in text form, is then sent to a TextViewer component. The lower part of the flow uses a second RScriptExecutor step to load the iris data (inside of the R environment) and then create a scatter plot matrix using the "pairs" function. It also exports the iris data from R into Weka's internal "Instances" format and sends this to a second TextViewer. The scatter plot matrix produced by the R script is exported as a PNG and sent to an ImageSaver step. The GUI dialog for the RScriptExecutor showing the R script producing these results is shown at the bottom of the screenshot.

Any graphics produced by an RScriptExecutor step are also picked up by the "RConsole/visualize" perspective for the Knowledge Flow.


This perspective (which is also available in Weka's Explorer as a plugin tab) maintains a list of images produced by Knowledge Flow processes. It also provides an interactive R console where R commands can be typed and evaluated immediately.

In order to evaluate R machine learning models in the Weka framework, and to use Weka as a vehicle for operationalizing such models, it is necessary to go beyond just executing R scripts. The MLR wrapper classifier for Weka provides a bridge between the MLR library in R and Weka's "Classifier" API. It allows R models to be learned, evaluated and used for prediction inside of Weka's framework. It also allows the models learned in R to be persisted via serialization and encapsulated in the MLRClassifier for use at a later date. The following screenshots show the MLRClassifier at work in a Knowledge Flow process and in Weka's Explorer UI.




Pentaho

In Pentaho's PDI data integration tool, R integration for scoring/prediction with R models is achieved with minimal effort using the existing WekaScoring plugin step for PDI. WekaScoring already handles scoring using pre-constructed Weka models (classifiers and clusterers) and PMML models. Since MLRClassifier is a Weka classifier, it can be consumed immediately by the step, and R models can be used for scoring inside of a PDI transformation.


It is also possible to execute R scripts and construct R predictive models from scratch as part of a PDI transformation using the existing Knowledge Flow plugin step for PDI. This allows, for example, R predictive models to be refreshed and R visualizations to be generated as part of an automated ETL process.

Technical

Weka's R integration uses the JRI library, which provides a JNI interface to the R native libraries. This, of course, requires that the user have R installed on their computer and that they have installed the rJava package (which includes JRI) from within the R environment. It also requires several environment variables to be set in order for the JRI native library and dependent R libraries to be found. The RPlugin package has instructions for easing this pain and a mechanism that attempts to find the JRI library in the most common installation locations under Windows and Mac OS X. Once JRI and R are available to the Java VM, Weka's RPlugin will install various R libraries (such as MLR) automatically.
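The idea of probing common installation locations for a native library can be sketched roughly as follows. This is an illustration under stated assumptions, not the RPlugin's actual mechanism: the class NativeLibraryLocator and its methods are hypothetical, the caller supplies the candidate directories, and the short library name "jri" is assumed. Only standard Java library calls are used.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Weka or JRI: looks for a native library
// file in a list of candidate directories and returns the first match.
class NativeLibraryLocator {

    // Platform-specific file name for a library short name, as computed by
    // the JVM. Passing "jri" as the short name is an assumption of this sketch.
    static String platformFileName(String shortName) {
        return System.mapLibraryName(shortName);
    }

    // Returns the path of the first regular file called fileName found in
    // candidateDirs (searched in order), or an empty Optional if none exists.
    static Optional<Path> find(String fileName, List<Path> candidateDirs) {
        for (Path dir : candidateDirs) {
            Path candidate = dir.resolve(fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass whatever directories make sense for the platform. Note that the file name the JVM computes may differ from the one a given rJava build actually ships, so a real search would probably try more than one name.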

Class loaders + native libraries (combined with the single-threaded nature of the R environment) add up to quite a headache when considering things like plugin environments, application servers and the like. Weka's RPlugin can be used in such environments where it is loaded (perhaps multiple times) by plugin class loaders. To achieve native library visibility across child class loaders, and to maintain a single point of access to R by clients, the byte code for certain key classes (from JRI, REngine and Weka) is injected into the root class loader very early in the class loading process. Many thanks to the guys over at the snappy-java project for detailing this approach.
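To make the "single point of access" idea concrete, here is a minimal sketch of funnelling all calls through one worker thread. It illustrates only the serialization of access to a single-threaded engine within one class loader; it does not show the byte code injection described above, and the class SingleThreadGateway is hypothetical rather than anything in RPlugin.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not RPlugin code: every caller submits work to one
// worker thread, so a single-threaded engine never sees concurrent calls.
class SingleThreadGateway {
    private final ExecutorService worker = Executors.newSingleThreadExecutor(runnable -> {
        Thread t = new Thread(runnable, "engine-worker");
        t.setDaemon(true); // do not keep the JVM alive for this sketch
        return t;
    });

    // Runs the task on the worker thread and blocks until its result is ready.
    <T> T call(Callable<T> task) throws Exception {
        return worker.submit(task).get();
    }

    // Stops accepting new tasks; already submitted tasks still complete.
    void shutdown() {
        worker.shutdown();
    }
}
```

Within a single class loader this is enough to serialize access. The harder problem described above is that each plugin class loader would otherwise get its own copy of such a class, which is why the key classes need to live in the root class loader.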

PDI is a streaming environment and so is Weka (as far as prediction goes). R and MLR operate most efficiently in a batch fashion since the data frame is the structure that is used for both learning a model and making predictions. As the conversion and transfer of data from Weka or PDI into R is costly, the best performance is obtained by pushing over data in batches for prediction. Prediction using R models in Weka 3.7.6 is slow because each test instance has to be transferred into R as a separate data frame. The next release of Weka (3.7.7), due out in August, rectifies this with a new batch prediction interface. (Note that nightly snapshots of Weka already include this performance improvement.)
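The benefit of batching can be sketched in a few lines of plain Java. This is not Weka's batch prediction interface: BatchingScorer is a hypothetical class, and the batch predictor is just a function standing in for "convert the rows to a data frame, push them to R, and get predictions back". The point is only that the costly transfer happens once per batch rather than once per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: buffers incoming rows and hands them to a batch
// scoring function in fixed-size chunks, so the expensive transfer happens
// once per batch instead of once per row.
class BatchingScorer {
    private final int batchSize;
    private final Function<List<double[]>, double[]> batchPredictor;
    private final List<double[]> buffer = new ArrayList<>();
    private final List<Double> predictions = new ArrayList<>();
    private int transfers = 0;

    BatchingScorer(int batchSize, Function<List<double[]>, double[]> batchPredictor) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.batchPredictor = batchPredictor;
    }

    // Queues one streamed row; scores the buffer once it holds a full batch.
    void add(double[] row) {
        buffer.add(row);
        if (buffer.size() == batchSize) {
            flush();
        }
    }

    // Scores whatever is buffered (call at end of stream for the last rows).
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        double[] batch = batchPredictor.apply(new ArrayList<>(buffer));
        for (double p : batch) {
            predictions.add(p);
        }
        transfers++;
        buffer.clear();
    }

    List<Double> predictions() { return predictions; }

    int transfers() { return transfers; }
}
```

With a batch size of 2 and five streamed rows followed by a final flush, the predictor is invoked three times instead of five. Larger batches reduce the number of transfers further, at the cost of holding more rows in memory and delaying output.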

Update, September 2013

MLR is now available as an official R package from CRAN, yay!! Many thanks to chief MLR developer Bernd Bischl for getting MLR to this point. 


I've just updated Weka's RPlugin package (version 1.1.0) to use the new MLR version 1.1-18. The MLRClassifier will now install MLR in R automatically the first time it is used. MLRClassifier is now more robust and has improved error reporting. Update your package metadata in Weka and give it a go!

11 comments:

  1. Hi Mark,

    Very nice blog! Is there an email address I can contact you in private?

    1. Hi Nikos,

      I can be reached at mhall{[at]}pentaho{[dot]}com

      Cheers,
      Mark.

  2. Hi Mark Hall, it's a great article. I have used Weka for my intrusion detection projects and I am very much interested in it.
    I would like to know more about Weka and I am looking forward to more posts on it.

  3. Great article, Mark. I have used Weka but not R; together they make a great mixture!

  4. Hello Mark,
    My name is Amr, from Egypt, and I am new to the data mining field. I am working on an idea related to data mining and public opinion mining, so I have a lot of questions to ask, and I would also like your opinion on the project. This is my email: amrmohsen91@gmail.com
    I appreciate it
    thank you,

    Amr Mohsen

  5. Great post on new functionality of Weka.
    I was also wondering if there is a way to load a saved Weka model in R to run statistical tests on it.
    Otherwise, how is this normally done? I like using R only for statistical tests at this point.

    Thanks,
    Renaud

  6. Hi Renaud,

    There is an RWeka package for R. I'd assume that it would be possible to load a serialized Weka model using this package somehow.

    Cheers,
    Mark.

  7. Hi Mark,

    If we have to compare R and Weka for implementing machine learning on data stored in HDFS, how would you contrast 1) using R scripts in Weka's Knowledge Flow with 2) using the RWeka package for R, where a Weka model can be loaded?

    Regards,
    Harsha

  8. I guess the difference is that in one case you are using Weka models from the R environment and in the other you can access R functionality from within the Weka framework. In both cases you have access to HDFS. In the Weka case you can stream data from HDFS, which means that you can use Weka's incremental classifiers to process the data. I believe that in R you are pretty much restricted to the batch learning scenario, so all the data from an HDFS dataset would have to be materialised as an R data frame.

    Cheers,
    Mark.

  9. Hi Mark.

    What is the difference between sgdtext and sgd?

    Regards,

    ::LiRa::
