Mark Hall on Data Mining & Weka: 2016

On Friday 15th April NZT we released Weka 3.8.0. I've only just got around to writing about it because, after collapsing from exhaustion right after the release went out, I required immediate vacational therapy with my son at the theme parks on Australia's Gold Coast :-)

Weka uses the Linux model of releases, where an even second digit of a release number indicates a "stable" release and an odd second digit indicates a "development" release. Weka 3.8.0 is the first stable release of Weka since 2008! There are tons of new features and improvements in 3.8.0 compared to 3.6 (the previous stable release). Furthermore, it supports the newest Weka MOOC that launched on the 25th of April and the forthcoming 4th edition of the Data Mining book.

Some of the features added since Weka 3.6 include:

A package management system

The Weka software has evolved considerably since Weka 3.6. Many new algorithms and features (too many to detail here) have been added to the system, a number of which have been contributed by the community. With so many algorithms on offer we felt that the software could be considered overwhelming to the new user. Therefore a number of algorithms and community contributions were removed and placed into plugin packages. A package management system was added that allows the user to browser for, and selectively install, packages of interest. To date, there are 177 packages that can be installed via the package manager.

Another motivation for introducing the package management system was to make the process of contributing to the Weka software easier, and to ease the maintenance burden on the Weka development team. A contributor of a plugin package is responsible for maintaining its code and hosting the installable archive; while Weka simply tracks the package metadata. The package system also opens the door to the use of third-party libraries, something that we'd discouraged in the past in order to keep a lightweight footprint for Weka.

Plugin packages for Weka can be viewed online at the central package metadata repository.

A completely rewritten Knowledge Flow

Weka's Knowledge Flow user interface received a graphical makeover a few years ago, but for 3.8 it has been completely rewritten from scratch. The rewrite includes a brand new engine that is fully multithreaded and supports pluggable execution environments. There is also a radically simplified API for developers to use. New features in the knowledge flow include:

Automatic execution of individual steps in separate threads
Single-threaded execution for streaming flows
Separate executor service for resource intensive steps and tasks
Support for attribute selection and prediction boundary visualisation
JSON-based flow persistence
Support for loading legacy .kfml flow files
Settings and preferences at the application and perspective level
User-configurable logging level
New and simplified API

MTJ-based linear algebra

The old JAMA-based linear algebra routines have been replaced with MTJ. MTJ provides faster pure JVM routines than JAMA and, more importantly, can seamlessly use reference and system-optimised versions of native libraries based on BLAS, LAPACK and ARPACK if available. To that end we have provided three plugin packages - one for each of the major OS's - providing JNI native libraries. Mac OSX users are lucky because system-optimised versions are pre-installed, giving the biggest speed increases. Linux and Windows users get a native reference implementation in their respective packages (which is faster than a pure JVM implementation), but will have to install a system-optimised library if they want the ultimate speed. Multiple linear regression, Gaussian processes, PCA and LDA are some of the schemes in Weka to benefit from MTJ.

Core improvements

Numerous efficiency improvements to core data structures, filters and some classifiers have been made over the years. All of these add up to better memory utilisation and faster execution.

Workbench

Aside from the Knowledge Flow, the other major graphical user interfaces in Weka (Explorer and Experimenter) have remained largely unchanged from Weka 3.6. However, one new user interface - the Workbench - has been added in 3.8.0. This is a unified graphical interface that combines the other three (and any plugins that the user has installed) into one application. The Workbench is highly configurable, allowing the user to specify which applications and plugins will appear, along with settings relating to them.

Using the approach developed for integrating Python into Weka, Pentaho Data Integration (PDI) now has a new step that can be used to leverage the Python programming language (and its extensive package-based support for scientific computing) as part of a data integration pipeline. The step has been released to the community from Pentaho Labs and can be installed directly from PDI via the marketplace.

Python is becoming a serious contender to R when it comes to programming language choice for data scientists. In fact, many folks are leveraging the strengths of both languages when developing solutions. With that in mind, it is clear that data scientists and predictive application developers can boost productivity by leveraging the PDI + Python combo. As we all know, data preparation consumes the bulk of time in a typical predictive project. That data prep can typically be achieved more quickly in PDI, compared to developing code from scratch, thanks to its intuitive graphical development environment and extensive library of connectors and processing steps. Instead of having to write (and rewrite) code to connect to source systems (such as relational databases, NoSQL databases, Hadoop filesystems and so forth), and to join/filter/blend data etc., PDI allows the developer to focus their coding efforts on the cool data science-oriented algorithms.

CPython Script Executor

As the name suggests, the new step uses the C implementation of the Python programming language. While there are JVM-based solutions available - such as Jython - that allow a more tightly integrated experience when executing in the JVM, these do not facilitate the use of many high-powered Python libraries for scientific computing, due to the fact that such libraries include highly optimised components that are written in C or Fortran. In order to gain access to such libraries, the PDI step launches, and communicates with, a micro-service running in the C Python environment. Communication is done over plain sockets and messages are stored in JSON structures. Datasets are transmitted as CSV and the very fast routines for reading and writing CSV from the pandas Python package are leveraged.

The step itself offers maximum flexibility when it comes to dealing with data. It can act as a start point/data source in PDI (thus allowing the developer the freedom to source data directly via their Python code if so desired), or it can accept data from an upstream step and push it into the Python environment. In the latter case, the user can opt to send all incoming rows to Python in one hit, send fixed sized batches of rows, or send rows one-at-a-time. In any of these cases the data sent is considered a dataset, gets stored in a user-specified variable in Python, and the user's Python script is invoked. In the "all data" case, there is also the option to apply reservoir sampling to down-sample to a fixed size before sending the data to Python. The pandas DataFrame is used as the data structure for datasets transferred into Python.

A python script can be specified via the built-in editor, or loaded from a file dynamically at runtime. There are two scenarios for getting output from the Python environment to pass on to downstream PDI steps for further processing. The first (primary) scenario is when there is a single variable to retrieve from Python and it is a pandas DataFrame. In this case, the columns of the data frame become output fields from the step. In the second scenario, one or more non-data frame variables may be specified. In this case, their values are assumed to be textual (or can be represented as text) or contain image data (in which case they are retrieved from Python as binary PNG data). Each variable is output in a separate PDI field.

Requirements

The CPython Script Executor step will work with PDI >= 5.0. Of course, it requires Python to be installed and the python executable to be in your PATH environment variable. The step has been tested with Python 2.7 and 3.x and, at a minimum, needs the pandas, matplotlib and numpy libraries to be installed. For Windows users in particular, I'd recommend installing the excellent Anaconda python distribution. This includes the entire SciPy stack (including pandas and scikit-learn) along with lots of other libraries.

Example

The example transformation shown in the following screenshot can be obtained from here.

The example uses Fisher's classic iris data. The first python step (at the top) simply computes some quartiles for the numeric columns in the iris data. This is output from the step as a pandas DataFrame, where each row corresponds to one of the quartiles computed (25th, 50th and 75th), and each column holds the value for one of the numeric fields in the iris data. The second python step from the top uses the scikit-learn decomposition routine to compute a principal components analysis on the iris data and then transforms the iris data into the PCA space, which is then the output of the step. The third python step from the top uses the matplotlib library and plotting routines from the pandas library to compute some visualisations of the iris data (scatter plot matrix, Andrew's curves, parallel coordinates and rad-viz). These are then extracted as binary PNG data from the python environment and saved to files in the same directory as the transformation was loaded from. The two python steps at the bottom of the transformation learn a decision tree model and then use that model to score the iris data respectively. The model is saved (from the python environment) to the directory that the transformation was loaded from.

Conclusion

The new PDI CPython Script Executor step opens up the power of Python to the PDI developer and data scientist. It joins the R Script Executor and Weka machine learning steps in PDI as part of an expanding array of advanced statistical and predictive tools that can be leveraged within data integration processes.

Mark Hall on Data Mining & Weka

Wednesday, 18 May 2016

New Weka 3.8.0 Stable Release