The
distributed Weka for Spark package has been available in Weka for several years now. One nice thing about the package is that it allows any Weka classifier to be trained in Spark. However, aside from averaging being implemented for regression and naive Bayes, most classifiers are learned in the cluster by using the embarrassingly simple "Dagging" (Disjoint Aggregation) ensemble approach. Spark's infrastructure provides the disjoint data chunks (RDD partitions) automatically, and standard desktop Weka classifiers are trained on each chunk/partition and then combined into a final voted ensemble via the
Vote meta classifier. This can work fairly well, but partition size is another tuning parameter to consider and available RAM on worker nodes will enforce a hard upper limit. If partitions are too small relative to the total dataset size, then individual ensemble classifiers might fail to capture enough of the structure of the problem, in turn leading to lower predictive performance.
Spark's machine learning library (MLlib) has a small set of algorithms, but each has been designed to operate efficiently in a fully distributed fashion and produces a single final model. This could have an accuracy advantage over the Dagging approach for more complicated problems, and definitely has an advantage when interpretability of the final model is important.
This blog entry takes a look at how MLlib algorithms have been integrated into the latest version of the
distributedWekaSpark package. Or, to be precise, there is now a new version of this package available with a new name:
distributedWekaSparkDev. This has been done so that the old version of the package can continue to be used with (and remain consistent with) what is shown in the
Advanced Data Mining With Weka MOOC. The new version of the package adds support for Spark data-frame based data sources (CSV, Avro and Parquet) as well as MLlib classifiers and regressors.
MLlib in desktop Weka
The new
distributedWekaSparkDev package adds Weka wrapper classifiers for the major MLlib supervised learning schemes. These are designed to work just like any other Weka classifier, and operate on datasets that fit into main memory on the desktop client. They can be run from the command line,
Explorer,
Knowledge Flow and
Experimenter interfaces. This allows MLlib schemes to be used within Weka's standard evaluation framework, used as base classifiers in meta learners, be combined with arbitrary Weka preprocessing filters in the
FilteredClassifier, dropped into standard Knowledge Flow processes and used in repeated cross-validation experiments in the Experimenter. It continues Weka's interoperability theme that started with
R (MLR) and
CPython (Scikit-learn) integration. Now it is possible to run an experiment in the Experimenter that involves implementations from four different ML tools, safe in the knowledge that results are fully comparable due to the fact that the same data splits and evaluation metrics are used in each case.
|
10-fold cross-validation of an MLlib decision tree in Weka's Explorer |
Under the hood the Weka wrappers for each MLlib classifier accept a standard Weka
Instances object via the
Classifier.buildClassifier() method. Standard Weka filters are applied where necessary to prepare the data for the underlying MLlib algorithm. For example, the
MLlibNaiveBayes wrapper automatically discretizes any numeric fields if the user has selected a Bernoulli model. A utility class is then used to extract a list of individual
Instance objects and then parallelize this into an RDD[
Instance] via
SparkContext.parallelize(). From here, the RDD[
Instance] is converted into an RDD[
LabeledPoint] that the underlying MLlib implementations can work with. During this conversion process auxiliary data structures, such as maps of categorical features, are computed for schemes that require them.
|
Comparing Weka's NaiveBayes to MLlib NaiveBayes in the Knowledge Flow |
Default options for the MLlib wrapper classifiers result in a local Spark cluster getting started on-the-fly to perform the learning. Spark's local mode runs in the same JVM as Weka and utilizes the processing cores of the CPU as workers. However, there is nothing to stop the user specifying an external cluster to perform the processing.
|
Comparing MLlib algorithms to native Weka implementations in Weka's Experimenter UI |
One nice thing about this approach is that standard Weka filters can be used for any data transformations needed, rather that using MLlib's transformers. The last mile involves invoking just the MLlib learning algorithm which, in turn, results in a model object for the type of classifier applied. These model objects can be used to predict individual
LabeledPoint instances.
LabeledPoint is a simple data structure that does not require the Spark distributed processing framework, so MLlib models can be applied to score data rapidly in a streaming fashion
without requiring a cluster (local mode or otherwise). The following screenshots show Weka wrapped Spark MLlib decision tree model being used to score data in
Pentaho Data Integration.
|
Weka wrapped Spark MLlib decision tree classifier loaded into the Weka Scoring step in Pentaho Data Integration |
|
Previewing data scored using the Spark MLlib decision tree model in Pentaho Data Integration |
MLlib in distributed Weka
The MLlib classifiers can also be applied in the distributed Weka for Spark framework on a real Spark cluster. The difference, compared to the desktop case, is that Spark's data sources are used to read large datasets directly into data frames in the distributed environment (rather than parallelizing a data set that has been read into Weka on the local machine). From here, a data frame is converted to an RDD[Instance], and the to an RDD[LabeledPoint]. During this process arbitrary Weka filters can be used to preprocess the data (prior to its conversion to LabeledPoints), as long as those filters are ones that are Streamable - i.e. do not require all the data to be seen as a batch before producing output. This is because the results of transforming the data in each partition must be consistent in terms of the structure of the data, in order to facilitate aggregation. Following this, an MLlib classifier is trained as per normal.
The
distributedWekaSparkDev package also implements hold-out and cross-validation evaluation for MLlib classifiers when run in the cluster. In the case of cross-valdiation, it produces training and test folds that are consistent with those used when cross-validating Weka classifiers in the Spark cluster. This entails some fancy shuffling of the data for training MLlib classifiers because maximum parallelism during cross-validation in the Dagging and model averaging approach for Weka classifiers is achieved by building
all training fold classifiers in one pass over the data. To do this, distributed Weka treats each partition of the RDD as containing
part of each cross-validation fold (as shown in the following figure). On the other hand, cross-validation for MLlib classifiers is basically the sequential case - ie., each fold is processed in turn, albeit in parallel fashion by the learning algorithm, as a separate training dataset. So, in order to be comparable, a sequential training fold processed by MLlib during cross-validation needs to be constructed by assembling the data for that fold that is spread across the partitions of the RDD that distributed Weka processes during its cross-validation routine.
|
Cross-validation phase 1 for Weka classifiers when running in distributed Weka: building models for all training folds simultaneously |
|
Cross-validation phase 2 for Weka classifiers when running in distributed Weka: evaluating models for all test folds simultaneously |
Conclusion
Integration of Spark MLlib algorithms continues Weka's interoperability theme and expands the variety of schemes available to the user. It also provides, as is the case with Weka's R and Python integration, convenient no-coding access to the machine learning algorithms from MLlib. Weka's Interoperability with different languages and tools provides a convenient unified framework for experimental comparison across different implementations of the same algorithm. This simplifies the data scientist's job and reduces their workload when considering multiple tools for solving a particular predictive problem.
Hi
ReplyDeleteMark,
I want to use weka tool on spark cluster of 4 nodes ,so will you please tell me how can we configure for this problem.
Thank you so much for such a informative and very helpful post. I am new in blogging arena. This post really helped me to learn so much thing about Data Mining.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteNice information blog
ReplyDeleteSanjary Kids is one of the best play school and preschool in Hyderabad,India. The motto of the Sanjary kids is to provide good atmosphere to the kids.Sanjary kids provides programs like Play group,Nursery,Junior KG,Serior KG,and provides Teacher Training Program.We have the both indoor and outdoor activities for your children.We build a strong value foundation for your child on Psychology and Personality development.
Preschool in hyderabad
This comment has been removed by the author.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteGood information sharing and updating I liked it
ReplyDeleteBest QA / QC Course in India, Hyderabad. sanjaryacademy is a well-known institute. We have offer professional Engineering Course like Piping Design Course, QA / QC Course,document Controller course,pressure Vessel Design Course, Welding Inspector Course, Quality Management Course, #Safety officer course.
QA / QC Course in Hyderabad
Excellent explanation of the topic
ReplyDeletePressure Vessel Design Course is one of the courses offered by Sanjary Academy in Hyderabad. We have offer professional Engineering Course like Piping Design Course,QA / QC Course,document Controller course,pressure Vessel Design Course,Welding Inspector Course, Quality Management Course, #Safety officer course.
Quality Management Course
Quality Management Course in India
Free Online Free HEX to RGBA Tool Click here.
ReplyDeleteGreat Post. Thanks for sharing this with us. Your blog posts are really very interesting and useful. Hopefully this may also help your readers to do affordable online shopping. Jockey Discount Codes
ReplyDeleteThanks for sharing valuable information
ReplyDelete"Yaaron media is one of the rapidly growing digital marketing company in Hyderabad,india.Grow your business or brand name with best online, digital marketing companies in ameerpet, Hyderabad. Our Services digitalmarketing, SEO, SEM, SMO, SMM, e-mail marketing, webdesigning & development, mobile appilcation.
"
Best web designing companies in Hyderabad
Best web designing & development companies in Hyderabad
Best web development companies in Hyderabad
Great post thanks for sharing.
ReplyDeleteWe are the best waterproofing services in Hyderabad.We are providing all kinds of leakage services which includes bathroom,roof,wash area,water tank,wall cracks,kitchen leakage services in Hyderabad. With trust and honest, we solve the issue as quick as possible.We serve you better compared to others.
Best waterproofing services in hyderabad
bathroom leakage services in hyderabad
roof leakage services in hyderabad
water tank leakage services in hyderabad
kitchen leakage services in hyderabad
Hyderabad waterproofing services
It is truly a nice and helpful piece of info. I¡¦m happy that you just shared this helpful info with us. Please keep us up to date like this. Thanks for sharing.
ReplyDeleteClipping path Best
Feeling good to read such a informative blog, mostly i eagerly search for this kind of blog. I really found your blog informative and unique, waiting for your new blog to read.
ReplyDeleteDigital marketing Service in Delhi
SMM Services
PPC Services in Delhi
Website Design & Development Packages
SEO Services Packages
Local SEO services
E-mail marketing services
YouTube plans
Exploring in Yahoo I at last stumbled upon this website.
ReplyDeleteVikram University BA 2nd Year Result
Thanks for sharing such great information. bsc 2nd year time table Hope we will get regular updates from your side.
ReplyDeleteThis is a very good blog and the information related to it is very important if you want to know something about server hosting then seeYou should know about USA VPS Hosting and how it can be important for this modern world. Thanks once again.
ReplyDeleteनमस्ते! मुझे पता है कि यह एक तरह की b a final exam date Private and Regular Students वेबसाइट है जो निश्चित रूप से सभी के लिए मददगार होगी।
ReplyDeleteThis is a very good blog and the information related to it is very important if you want to know something about server hosting then seeYou should know about USA VPS Hosting and how it can be important for this modern world. Thanks once again
ReplyDeleteWaterproofing contractors in Hyderabad
ReplyDeleteI want the world to know about where to invest their hard earned money and get fruitful returns. If one is looking forward of investing he can go into investment of crypto coins.
You can invest in Fudxcoin company that deals in the selling and purchasing of Crypto Currency. It is a reliable company. One need not doubt in investing in it as i have also bought crypto currency from it and feeling very satisfied with their services.
crypto currency block chain technology
Fast-track your data analytics and machine learning course with guaranteed placement opportunities. Most extensive, industry-approved experiential learning program ideal for future Data Scientists.
ReplyDeleteI thoroughly enjoyed reading your article on integrating Spark MLlib into Weka. The step-by-step guide provided valuable insights into seamlessly combining these powerful tools for enhanced data science capabilities. Your clear explanations and detailed instructions make it accessible for both beginners and experienced practitioners. This integration opens up exciting possibilities for leveraging the strengths of both Spark and Weka in tandem.
ReplyDeleteAdditionally, I would like to highlight Imarticus Learning's Data Science Course as an excellent complement to your article. Imarticus Learning has a stellar reputation for offering comprehensive and industry-relevant data science training. Their program covers a wide array of topics, including machine learning, big data analytics, and more, providing students with the skills needed to excel in the ever-evolving field of data science. I believe readers interested in your article would find Imarticus Learning's course to be a valuable resource for furthering their knowledge and expertise in this dynamic field.