Tuesday 30 June 2015

CPython Integration in Weka

Continuing the interoperability in Weka that was started with R integration a few years ago, we now have integration with Python. Whilst Weka has had the ability to do Python scripting via Jython for quite some time, the latest effort adds CPython integration in the form of a "wekaPython" package that can be installed via Weka's package manager. This opens the door to all the highly optimised scientific libraries in Python - such as numpy, scipy, pandas and scikit-learn - that have components written in C or Fortran. Scikit-learn is a relatively new machine learning library that is increasing in popularity very rapidly (see the latest KDNuggets software poll).

Like the R integration in Weka, the CPython support allows for general scripting via a Knowledge Flow Python scripting step. This allows arbitrary scripts to be executed and one or more variables to be extracted from the Python runtime. Weka instances are transferred into Python as pandas data frames, and pandas data frames can be extracted from Python and converted back into instances. Furthermore, arbitrary variables can be extracted in textual form, and matlibplot graphics can be extracted as PNG images.


The package also provides a wrapper classifier and wrapper clusterer for the supervised and unsupervised learning algorithms implemented in scikit-learn. This allows the scikit-learn algorithms to be used and evaluated within Weka's framework, just like the MLRClassifier from the RPlugin package allows ML algorithms from R to be used. With both RPlugin and wekaPython installed it is quite cool to run comparisons between implementations in the different frameworks - e.g. here is a quick comparison on some UCI datasets (using Weka's Experiment environment to run a 10x10 fold cross-validation) between random forest implementations in Weka, R and scikit-learn. All default settings were used except for the number of trees, which was set to 500 for each implementation. Since scikit-learn only handles numeric input variables, both Weka's random forest and the MLRClassifier running R random forest were wrapped in the FilteredClassifier to apply unsupervised nominal to binary encoding (one hot encoding) so that all three implementations received the same input:


weka.classifiers.sklearn.ScikitLearnClassifier

This classifier wraps the majority of the supervised learning algorithms in scikit-learn. The wrapper supports retrieving the underlying model from python (as a pickled string) so that the ScikitLearnClassifier can be serialised and used for prediction at a later date.


weka.clusterers.ScikitLearnClusterer

This clusterer wraps clustering algorithms in scikit-learn. It basically functions in exactly the same way as the ScikitLearnClassifier, which allows it to be used in any Weka UI or from Weka's command line interface.

Under the hood

The underlying integration works via a micro-server written in python that is launched by Weka automatically. Communication is done over plain sockets and messages are stored in JSON structures. Datasets are transmitted as plain CSV and image data as base64 encoded PNG.

wekaPython works with both Python 2.7.x and 3.x. As it relies on a few new features in core Weka, a snapshot build of the development version (3.7) of Weka is required until Weka 3.7.13 is released. Numpy, pandas, matplotlib and scikit-learn must be installed in python for the wekaPython package to operate. Anaconda is a nice python distribution that comes with all the requirements (and lots more).

17 comments:

  1. Thanks for providing this great functionality. I am however unable to use Python as I get the message "Python Environment not available:" even though I have Anaconda 3 and have defined them env vars. Can you please advise? Thanks

    ReplyDelete
  2. Assuming that the python executable is in your PATH, and that is available to Weka when you launch Weka, then there may be an issue with write permissions for python. Where did you install Anaconda, and as which user? The easiest way to get things working is to install Anaconda into your own account as you.

    Cheers,
    Mark.

    ReplyDelete
    Replies
    1. Hello Mark. when i click “get frame fields”, rasing an error like this “java.net.SocketException: writed failed”.can you please advise?thanks.

      Delete
  3. OK, I reinstalled Anaconda 3 in my own account as you suggested and it now works. Thank you!

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Good information
    Best QA / QC Course in India, Hyderabad. sanjaryacademy is a well-known institute. We have offer professional Engineering Course like Piping Design Course, QA / QC Course,document Controller course,pressure Vessel Design Course, Welding Inspector Course, Quality Management Course, #Safety officer course.
    QA / QC Course
    QA / QC Course in india
    QA / QC Course in hyderabad

    ReplyDelete
  7. Nice Post
    "Yaaron media is one of the rapidly growing digital marketing company in Hyderabad,india.Grow your business or brand name with best online, digital marketing companies in ameerpet, Hyderabad. Our Services digitalmarketing, SEO, SEM, SMO, SMM, e-mail marketing, webdesigning & development, mobile appilcation.
    "
    Best web designing companies in Hyderabad
    Best web designing & development companies in Hyderabad
    Best web development companies in Hyderabad

    ReplyDelete
  8. Thanks for the article! Could I use these features with Java API? Could you please provide code example?

    ReplyDelete
  9. hi dear i make the GUI Based application on python that Classified the data.
    i send data to weka
    then weka classified these data
    then show result in python GUI application .
    need your help in these task?
    Thanks

    ReplyDelete
  10. I wanted to implement association rules mining algorithm in python and run the algorithm on weka 3.9. Then I wanted to compare my newly design algorithm with existing association rules mining algorithms found in weka. Is there any possible way to integrate my python designed algorithm to weka tool?

    ReplyDelete
  11. The underlying integration works via a micro-server written in python that is launched by Weka automatically. Communication is done over plain sockets and messages are stored in JSON structures. Datasets are transmitted as plain CSV and image data as base64 encoded PNG. afghani topi , antique gold choker , embroidered patches for clothes , Afghani Style Shirt Handmade Embroidery

    ReplyDelete
  12. Awesome blog. I enjoyed reading your articles. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work!
    internship for web development | internship in electrical engineering | mini project topics for it 3rd year | online internship with certificate | final year project for cse

    ReplyDelete
  13. Your info is really amazing with impressive content..Excellent blog with informative concept. Really I feel happy to see this useful blog, Thanks for sharing such a nice blog..

    evs full form
    raw agent full form
    full form of tbh in instagram
    dbs bank full form
    https full form
    tft full form
    pco full form
    kra full form in hr
    tbh full form in instagram story
    epc full form

    ReplyDelete
  14. Thank you for sharing such a useful post.
    Custom ERP Solution

    ReplyDelete