Mark Hall on Data Mining & Weka: Integrating Spark MLlib into Weka

Thursday, 27 July 2017

Integrating Spark MLlib into Weka

The distributed Weka for Spark package has been available in Weka for several years now. One nice thing about the package is that it allows any Weka classifier to be trained in Spark. However, aside from averaging being implemented for regression and naive Bayes, most classifiers are learned in the cluster by using the embarrassingly simple "Dagging" (Disjoint Aggregation) ensemble approach. Spark's infrastructure provides the disjoint data chunks (RDD partitions) automatically, and standard desktop Weka classifiers are trained on each chunk/partition and then combined into a final voted ensemble via the Vote meta classifier. This can work fairly well, but partition size is another tuning parameter to consider and available RAM on worker nodes will enforce a hard upper limit. If partitions are too small relative to the total dataset size, then individual ensemble classifiers might fail to capture enough of the structure of the problem, in turn leading to lower predictive performance.
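To make the Dagging idea concrete, here is a minimal, self-contained sketch in plain Python. It is not the code used by distributed Weka: the round-robin split stands in for RDD partitions, `NearestMean` is a toy base learner invented for this illustration, and `vote` plays the role of the Vote meta classifier using simple majority voting.

```python
from collections import Counter


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into disjoint chunks (a stand-in for RDD partitions)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


class NearestMean:
    """Toy base learner (hypothetical): predicts the class whose mean x is closest."""

    def fit(self, rows):
        sums, counts = Counter(), Counter()
        for x, label in rows:
            sums[label] += x
            counts[label] += 1
        self.means = {label: sums[label] / counts[label] for label in counts}
        return self

    def predict(self, x):
        return min(self.means, key=lambda label: abs(x - self.means[label]))


def train_dagging(rows, num_partitions):
    """Train one base model per disjoint partition."""
    partitions = split_into_partitions(rows, num_partitions)
    return [NearestMean().fit(part) for part in partitions if part]


def vote(models, x):
    """Combine the per-partition models by majority vote."""
    return Counter(model.predict(x) for model in models).most_common(1)[0][0]


# Two well-separated classes on a single numeric feature.
data = [(0.1 * i, "low") for i in range(20)] + [(5 + 0.1 * i, "high") for i in range(20)]
ensemble = train_dagging(data, num_partitions=4)
print(vote(ensemble, 0.5), vote(ensemble, 6.0))
```

With more partitions each base model sees fewer rows, which is the trade-off described above: smaller chunks are cheaper to fit but may not capture enough of the problem's structure.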

Spark's machine learning library (MLlib) has a small set of algorithms, but each has been designed to operate efficiently in a fully distributed fashion and produces a single final model. This could have an accuracy advantage over the Dagging approach for more complicated problems, and definitely has an advantage when interpretability of the final model is important.

This blog entry takes a look at how MLlib algorithms have been integrated into the latest version of the distributedWekaSpark package. Or, to be precise, there is now a new version of this package available with a new name: distributedWekaSparkDev. This has been done so that the old version of the package can continue to be used with (and remain consistent with) what is shown in the Advanced Data Mining With Weka MOOC. The new version of the package adds support for Spark data-frame based data sources (CSV, Avro and Parquet) as well as MLlib classifiers and regressors.

MLlib in desktop Weka

The new distributedWekaSparkDev package adds Weka wrapper classifiers for the major MLlib supervised learning schemes. These are designed to work just like any other Weka classifier, and operate on datasets that fit into main memory on the desktop client. They can be run from the command line, Explorer, Knowledge Flow and Experimenter interfaces. This allows MLlib schemes to be used within Weka's standard evaluation framework, employed as base classifiers in meta learners, combined with arbitrary Weka preprocessing filters in the FilteredClassifier, dropped into standard Knowledge Flow processes and run in repeated cross-validation experiments in the Experimenter. It continues Weka's interoperability theme that started with R (MLR) and CPython (Scikit-learn) integration. Now it is possible to run an experiment in the Experimenter that involves implementations from four different ML tools, safe in the knowledge that results are fully comparable because the same data splits and evaluation metrics are used in each case.

10-fold cross-validation of an MLlib decision tree in Weka's Explorer

Under the hood the Weka wrappers for each MLlib classifier accept a standard Weka Instances object via the Classifier.buildClassifier() method. Standard Weka filters are applied where necessary to prepare the data for the underlying MLlib algorithm. For example, the MLlibNaiveBayes wrapper automatically discretizes any numeric fields if the user has selected a Bernoulli model. A utility class is then used to extract a list of individual Instance objects and then parallelize this into an RDD[Instance] via SparkContext.parallelize(). From here, the RDD[Instance] is converted into an RDD[LabeledPoint] that the underlying MLlib implementations can work with. During this conversion process auxiliary data structures, such as maps of categorical features, are computed for schemes that require them.
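The conversion step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `LabeledPoint` dataclass is a hypothetical stand-in for MLlib's class of the same name, rows are simple tuples rather than Weka Instance objects, and nominal values are indexed in order of first appearance (Weka takes the indices from the dataset header instead). The second return value mimics the kind of map from feature index to number of categories that MLlib's RDD-based decision tree trainer accepts as `categoricalFeaturesInfo`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledPoint:
    """Hypothetical stand-in for MLlib's LabeledPoint: numeric label + dense features."""
    label: float
    features: List[float]


def to_labeled_points(rows, class_index):
    """Encode nominal (string) values as indices and record each categorical feature's arity."""
    # First pass: build value -> index maps for every string-valued column.
    value_maps = {}
    for row in rows:
        for col, value in enumerate(row):
            if isinstance(value, str):
                mapping = value_maps.setdefault(col, {})
                mapping.setdefault(value, len(mapping))

    # Feature positions are counted after the class column has been removed.
    feature_cols = [c for c in range(len(rows[0])) if c != class_index]
    categorical_info = {
        pos: len(value_maps[col])
        for pos, col in enumerate(feature_cols)
        if col in value_maps
    }

    def encode(col, value):
        return float(value_maps[col][value]) if col in value_maps else float(value)

    # Second pass: emit one LabeledPoint per row.
    points = [
        LabeledPoint(encode(class_index, row[class_index]),
                     [encode(col, row[col]) for col in feature_cols])
        for row in rows
    ]
    return points, categorical_info


rows = [
    ("sunny", 85.0, "no"),
    ("overcast", 83.0, "yes"),
    ("rainy", 70.0, "yes"),
    ("sunny", 72.0, "no"),
]
points, categorical_info = to_labeled_points(rows, class_index=2)
print(categorical_info)
print(points[0])
```

Here feature 0 (the outlook column) is recorded as categorical with three values, while the temperature column is passed through as a plain number.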

Comparing Weka's NaiveBayes to MLlib NaiveBayes in the Knowledge Flow

Default options for the MLlib wrapper classifiers result in a local Spark cluster getting started on-the-fly to perform the learning. Spark's local mode runs in the same JVM as Weka and utilizes the processing cores of the CPU as workers. However, there is nothing to stop the user specifying an external cluster to perform the processing.

Comparing MLlib algorithms to native Weka implementations in Weka's Experimenter UI

One nice thing about this approach is that standard Weka filters can be used for any data transformations needed, rather than using MLlib's transformers. The last mile involves invoking just the MLlib learning algorithm which, in turn, results in a model object for the type of classifier applied. These model objects can be used to predict individual LabeledPoint instances. LabeledPoint is a simple data structure that does not require the Spark distributed processing framework, so MLlib models can be applied to score data rapidly in a streaming fashion without requiring a cluster (local mode or otherwise). The following screenshots show a Weka-wrapped Spark MLlib decision tree model being used to score data in Pentaho Data Integration.

Weka wrapped Spark MLlib decision tree classifier loaded into the Weka Scoring step in Pentaho Data Integration

Previewing data scored using the Spark MLlib decision tree model in Pentaho Data Integration

MLlib in distributed Weka

The MLlib classifiers can also be applied in the distributed Weka for Spark framework on a real Spark cluster. The difference, compared to the desktop case, is that Spark's data sources are used to read large datasets directly into data frames in the distributed environment (rather than parallelizing a data set that has been read into Weka on the local machine). From here, a data frame is converted to an RDD[Instance], and then to an RDD[LabeledPoint]. During this process arbitrary Weka filters can be used to preprocess the data (prior to its conversion to LabeledPoints), as long as those filters are ones that are Streamable - i.e. do not require all the data to be seen as a batch before producing output. This is because the results of transforming the data in each partition must be consistent in terms of the structure of the data, in order to facilitate aggregation. Following this, an MLlib classifier is trained as per normal.
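The reason for the Streamable restriction is easy to see with a toy example. The plain-Python sketch below does not use Weka's filter API; both functions are hypothetical. The "batch" version derives its output columns from whatever values it happens to see in a partition, while the "streaming" version has its output structure fixed up front and can therefore process one row at a time.

```python
def one_hot_batch(partition):
    """Batch-style: the output columns depend on the values seen in this partition."""
    seen = sorted({row[0] for row in partition})
    return [[1.0 if row[0] == value else 0.0 for value in seen] for row in partition]


def one_hot_streaming(partition, categories):
    """Streamable-style: the output structure is known in advance, row by row."""
    for row in partition:
        yield [1.0 if row[0] == value else 0.0 for value in categories]


# Two partitions that happen to contain different colour values.
partitions = [[("red",), ("green",)], [("blue",), ("blue",)]]
categories = ["blue", "green", "red"]  # fixed header, assumed known up front

batch_widths = [len(one_hot_batch(p)[0]) for p in partitions]
stream_widths = [len(next(one_hot_streaming(p, categories))) for p in partitions]
print(batch_widths, stream_widths)
```

The batch-style filter produces rows of different widths in the two partitions, so their results could not be aggregated; the streaming version yields the same three columns everywhere.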

The distributedWekaSparkDev package also implements hold-out and cross-validation evaluation for MLlib classifiers when run in the cluster. In the case of cross-validation, it produces training and test folds that are consistent with those used when cross-validating Weka classifiers in the Spark cluster. This entails some fancy shuffling of the data for training MLlib classifiers because maximum parallelism during cross-validation in the Dagging and model averaging approach for Weka classifiers is achieved by building all training fold classifiers in one pass over the data. To do this, distributed Weka treats each partition of the RDD as containing part of each cross-validation fold (as shown in the following figure). On the other hand, cross-validation for MLlib classifiers is basically the sequential case - i.e., each fold is processed in turn, albeit in parallel fashion by the learning algorithm, as a separate training dataset. So, in order to be comparable, a sequential training fold processed by MLlib during cross-validation needs to be constructed by assembling the data for that fold that is spread across the partitions of the RDD that distributed Weka processes during its cross-validation routine.

Cross-validation phase 1 for Weka classifiers when running in distributed Weka: building models for all training folds simultaneously

Cross-validation phase 2 for Weka classifiers when running in distributed Weka: evaluating models for all test folds simultaneously
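The fold bookkeeping described above can be sketched in plain Python. This is an illustration only: the post does not say how distributed Weka assigns instances to folds within a partition, so the position-modulo rule in `fold_of` is an assumption, and the lists of integers stand in for RDD partitions.

```python
def fold_of(position_in_partition, num_folds):
    """Assumed rule: a row's fold is its position within its partition, modulo k."""
    return position_in_partition % num_folds


def tag_partition(partition, num_folds):
    """Every partition ends up holding a slice of every fold."""
    return [(fold_of(i, num_folds), row) for i, row in enumerate(partition)]


def assemble_fold(tagged_partitions, fold):
    """Gather one sequential train/test split from the pieces spread over all partitions."""
    train, test = [], []
    for part in tagged_partitions:
        for row_fold, row in part:
            (test if row_fold == fold else train).append(row)
    return train, test


partitions = [list(range(0, 6)), list(range(6, 12)), list(range(12, 18))]
num_folds = 3
tagged = [tag_partition(p, num_folds) for p in partitions]

# One (train, test) pair per fold, each built from all three partitions.
folds = [assemble_fold(tagged, k) for k in range(num_folds)]
for k, (train, test) in enumerate(folds):
    print(k, len(train), test)
```

Because the same fold tags drive both approaches, a training set assembled this way for fold k contains exactly the rows that the one-pass scheme would have fed to its k-th training-fold model.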

Conclusion

Integration of Spark MLlib algorithms continues Weka's interoperability theme and expands the variety of schemes available to the user. It also provides, as is the case with Weka's R and Python integration, convenient no-coding access to the machine learning algorithms from MLlib. Weka's interoperability with different languages and tools provides a convenient unified framework for experimental comparison across different implementations of the same algorithm. This simplifies the data scientist's job and reduces their workload when considering multiple tools for solving a particular predictive problem.

Comments:

neha, 29 November 2018 at 01:46
Hi Mark,
I want to use the Weka tool on a Spark cluster of 4 nodes, so could you please tell me how to configure it for this?
Unknown, 5 March 2019 at 04:45
Thank you so much for such an informative and very helpful post. I am new to the blogging arena. This post really helped me to learn so many things about data mining.