Thursday, 27 July 2017

Integrating Spark MLlib into Weka

The distributed Weka for Spark package has been available in Weka for several years now. One nice thing about the package is that it allows any Weka classifier to be trained in Spark. However, aside from averaging being implemented for regression and naive Bayes, most classifiers are learned in the cluster by using the embarrassingly simple "Dagging" (Disjoint Aggregation) ensemble approach. Spark's infrastructure provides the disjoint data chunks (RDD partitions) automatically, and standard desktop Weka classifiers are trained on each chunk/partition and then combined into a final voted ensemble via the Vote meta classifier. This can work fairly well, but partition size is another tuning parameter to consider and available RAM on worker nodes will enforce a hard upper limit. If partitions are too small relative to the total dataset size, then individual ensemble classifiers might fail to capture enough of the structure of the problem, in turn leading to lower predictive performance.

Spark's machine learning library (MLlib) has a small set of algorithms, but each has been designed to operate efficiently in a fully distributed fashion and produces a single final model. This could have an accuracy advantage over the Dagging approach for more complicated problems, and definitely has an advantage when interpretability of the final model is important.

This blog entry takes a look at how MLlib algorithms have been integrated into the latest version of the distributedWekaSpark package. Or, to be precise, there is now a new version of this package available with a new name: distributedWekaSparkDev. This has been done so that the old version of the package can continue to be used with (and remain consistent with) what is shown in the Advanced Data Mining With Weka MOOC. The new version of the package adds support for Spark data-frame based data sources (CSV, Avro and Parquet) as well as MLlib classifiers and regressors.

MLlib in desktop Weka

The new distributedWekaSparkDev package adds Weka wrapper classifiers for the major MLlib supervised learning schemes. These are designed to work just like any other Weka classifier, and operate on datasets that fit into main memory on the desktop client. They can be run from the command line, Explorer, Knowledge Flow and Experimenter interfaces. This allows MLlib schemes to be used within Weka's standard evaluation framework, used as base classifiers in meta learners, be combined with arbitrary Weka preprocessing filters in the FilteredClassifier, dropped into standard Knowledge Flow processes and used in repeated cross-validation experiments in the Experimenter. It continues Weka's interoperability theme that started with R (MLR) and CPython (Scikit-learn) integration. Now it is possible to run an experiment in the Experimenter that involves implementations from four different ML tools, safe in the knowledge that results are fully comparable due to the fact that the same data splits and evaluation metrics are used in each case.

10-fold cross-validation of an MLlib decision tree in Weka's Explorer


Under the hood the Weka wrappers for each MLlib classifier accept a standard Weka Instances object via the Classifier.buildClassifier() method. Standard Weka filters are applied where necessary to prepare the data for the underlying MLlib algorithm. For example, the MLlibNaiveBayes wrapper automatically discretizes any numeric fields if the user has selected a Bernoulli model. A utility class is then used to extract a list of individual Instance objects and then parallelize this into an RDD[Instance] via SparkContext.parallelize(). From here, the RDD[Instance] is converted into an RDD[LabeledPoint] that the underlying MLlib implementations can work with. During this conversion process auxiliary data structures, such as maps of categorical features, are computed for schemes that require them.

Comparing Weka's NaiveBayes to MLlib NaiveBayes in the Knowledge Flow


Default options for the MLlib wrapper classifiers result in a local Spark cluster getting started on-the-fly to perform the learning. Spark's local mode runs in the same JVM as Weka and utilizes the processing cores of the CPU as workers. However, there is nothing to stop the user specifying an external cluster to perform the processing.

Comparing MLlib algorithms to native Weka implementations in Weka's Experimenter UI

One nice thing about this approach is that standard Weka filters can be used for any data transformations needed, rather that using MLlib's transformers. The last mile involves invoking just the MLlib learning algorithm which, in turn, results in a model object for the type of classifier applied. These model objects can be used to predict individual LabeledPoint instances. LabeledPoint is a simple data structure that does not require the Spark distributed processing framework, so MLlib models can be applied to score data rapidly in a streaming fashion without requiring a cluster (local mode or otherwise). The following screenshots show Weka wrapped Spark MLlib decision tree model being used to score data in Pentaho Data Integration.

Weka wrapped Spark MLlib decision tree classifier loaded into the Weka Scoring step in Pentaho Data Integration

Previewing data scored using the Spark MLlib decision tree model in Pentaho Data Integration


MLlib in distributed Weka

The MLlib classifiers can also be applied in the distributed Weka for Spark framework on a real Spark cluster. The difference, compared to the desktop case, is that Spark's data sources are used to read large datasets directly into data frames in the distributed environment (rather than parallelizing a data set that has been read into Weka on the local machine). From here, a data frame is converted to an RDD[Instance], and the to an RDD[LabeledPoint]. During this process arbitrary Weka filters can be used to preprocess the data (prior to its conversion to LabeledPoints), as long as those filters are ones that are Streamable - i.e. do not require all the data to be seen as a batch before producing output. This is because the results of transforming the data in each partition must be consistent in terms of the structure of the data, in order to facilitate aggregation. Following this, an MLlib classifier is trained as per normal.

The distributedWekaSparkDev package also implements hold-out and cross-validation evaluation for MLlib classifiers when run in the cluster. In the case of cross-valdiation, it produces training and test folds that are consistent with those used when cross-validating Weka classifiers in the Spark cluster. This entails some fancy shuffling of the data for training MLlib classifiers because maximum parallelism during cross-validation in the Dagging and model averaging approach for Weka classifiers is achieved by building all training fold classifiers in one pass over the data. To do this, distributed Weka treats each partition of the RDD as containing part of each cross-validation fold (as shown in the following figure). On the other hand, cross-validation for MLlib classifiers is basically the sequential case - ie., each fold is processed in turn, albeit in parallel fashion by the learning algorithm, as a separate training dataset. So, in order to be comparable, a sequential training fold processed by MLlib during cross-validation needs to be constructed by assembling the data for that fold that is spread across the partitions of the RDD that distributed Weka processes during its cross-validation routine.

Cross-validation phase 1 for Weka classifiers when running in distributed Weka: building models for all training folds simultaneously

Cross-validation phase 2 for Weka classifiers when running in distributed Weka: evaluating models for all test folds simultaneously

Conclusion

Integration of Spark MLlib algorithms continues Weka's interoperability theme and expands the variety of schemes available to the user. It also provides, as is the case with Weka's R and Python integration, convenient no-coding access to the machine learning algorithms from MLlib. Weka's Interoperability with different languages and tools provides a convenient unified framework for experimental comparison across different implementations of the same algorithm. This simplifies the data scientist's job and reduces their workload when considering multiple tools for solving a particular predictive problem.

23 comments:

  1. Hi
    Mark,
    I want to use weka tool on spark cluster of 4 nodes ,so will you please tell me how can we configure for this problem.

    ReplyDelete
  2. Thank you so much for such a informative and very helpful post. I am new in blogging arena. This post really helped me to learn so much thing about Data Mining.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Nice information blog
    Sanjary Kids is one of the best play school and preschool in Hyderabad,India. The motto of the Sanjary kids is to provide good atmosphere to the kids.Sanjary kids provides programs like Play group,Nursery,Junior KG,Serior KG,and provides Teacher Training Program.We have the both indoor and outdoor activities for your children.We build a strong value foundation for your child on Psychology and Personality development.
    Preschool in hyderabad

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. Good information sharing and updating I liked it
    Best QA / QC Course in India, Hyderabad. sanjaryacademy is a well-known institute. We have offer professional Engineering Course like Piping Design Course, QA / QC Course,document Controller course,pressure Vessel Design Course, Welding Inspector Course, Quality Management Course, #Safety officer course.
    QA / QC Course in Hyderabad

    ReplyDelete
  9. Excellent explanation of the topic

    Pressure Vessel Design Course is one of the courses offered by Sanjary Academy in Hyderabad. We have offer professional Engineering Course like Piping Design Course,QA / QC Course,document Controller course,pressure Vessel Design Course,Welding Inspector Course, Quality Management Course, #Safety officer course.
    Quality Management Course
    Quality Management Course in India

    ReplyDelete
  10. Great Post. Thanks for sharing this with us. Your blog posts are really very interesting and useful. Hopefully this may also help your readers to do affordable online shopping. Jockey Discount Codes

    ReplyDelete
  11. Thanks for sharing valuable information
    "Yaaron media is one of the rapidly growing digital marketing company in Hyderabad,india.Grow your business or brand name with best online, digital marketing companies in ameerpet, Hyderabad. Our Services digitalmarketing, SEO, SEM, SMO, SMM, e-mail marketing, webdesigning & development, mobile appilcation.
    "
    Best web designing companies in Hyderabad
    Best web designing & development companies in Hyderabad
    Best web development companies in Hyderabad

    ReplyDelete
  12. Great post thanks for sharing.
    We are the best waterproofing services in Hyderabad.We are providing all kinds of leakage services which includes bathroom,roof,wash area,water tank,wall cracks,kitchen leakage services in Hyderabad. With trust and honest, we solve the issue as quick as possible.We serve you better compared to others.
    Best waterproofing services in hyderabad
    bathroom leakage services in hyderabad
    roof leakage services in hyderabad
    water tank leakage services in hyderabad
    kitchen leakage services in hyderabad
    Hyderabad waterproofing services

    ReplyDelete
  13. It is truly a nice and helpful piece of info. I¡¦m happy that you just shared this helpful info with us. Please keep us up to date like this. Thanks for sharing.
    Clipping path Best

    ReplyDelete
  14. Feeling good to read such a informative blog, mostly i eagerly search for this kind of blog. I really found your blog informative and unique, waiting for your new blog to read.
    Digital marketing Service in Delhi
    SMM Services
    PPC Services in Delhi
    Website Design & Development Packages
    SEO Services Packages
    Local SEO services
    E-mail marketing services
    YouTube plans

    ReplyDelete
  15. Exploring in Yahoo I at last stumbled upon this website.

    Vikram University BA 2nd Year Result

    ReplyDelete
  16. Thanks for sharing such great information. bsc 2nd year time table Hope we will get regular updates from your side.

    ReplyDelete
  17. This is a very good blog and the information related to it is very important if you want to know something about server hosting then seeYou should know about USA VPS Hosting and how it can be important for this modern world. Thanks once again.

    ReplyDelete
  18. नमस्ते! मुझे पता है कि यह एक तरह की b a final exam date Private and Regular Students वेबसाइट है जो निश्चित रूप से सभी के लिए मददगार होगी।

    ReplyDelete
  19. This is a very good blog and the information related to it is very important if you want to know something about server hosting then seeYou should know about USA VPS Hosting and how it can be important for this modern world. Thanks once again

    Waterproofing contractors in Hyderabad

    ReplyDelete

  20. I want the world to know about where to invest their hard earned money and get fruitful returns. If one is looking forward of investing he can go into investment of crypto coins.
    You can invest in Fudxcoin company that deals in the selling and purchasing of Crypto Currency. It is a reliable company. One need not doubt in investing in it as i have also bought crypto currency from it and feeling very satisfied with their services.
    crypto currency block chain technology

    ReplyDelete
  21. Fast-track your data analytics and machine learning course with guaranteed placement opportunities. Most extensive, industry-approved experiential learning program ideal for future Data Scientists.

    ReplyDelete
  22. I thoroughly enjoyed reading your article on integrating Spark MLlib into Weka. The step-by-step guide provided valuable insights into seamlessly combining these powerful tools for enhanced data science capabilities. Your clear explanations and detailed instructions make it accessible for both beginners and experienced practitioners. This integration opens up exciting possibilities for leveraging the strengths of both Spark and Weka in tandem.

    Additionally, I would like to highlight Imarticus Learning's Data Science Course as an excellent complement to your article. Imarticus Learning has a stellar reputation for offering comprehensive and industry-relevant data science training. Their program covers a wide array of topics, including machine learning, big data analytics, and more, providing students with the skills needed to excel in the ever-evolving field of data science. I believe readers interested in your article would find Imarticus Learning's course to be a valuable resource for furthering their knowledge and expertise in this dynamic field.

    ReplyDelete