The first new package is called distributedWekaBase. It provides base "map" and "reduce" tasks that are not tied to any specific distributed platform. The second, called distributedWekaHadoop, provides Hadoop-specific wrappers and jobs for these base tasks. In the future there could be other wrappers - one based on the Spark platform would be cool.
Base map and reduce tasks
distributedWekaBase version 1.0 provides tasks for:
- Determining a unified ARFF header from separate data chunks in CSV format. This is particularly important because, as Weka users know, Weka is quite particular about metadata - especially when it comes to nominal attributes. At the same time this task computes some handy summary statistics (stored as additional "meta attributes" in the header), such as count, sum, sum squared, min, max, num missing, mean, standard deviation and frequency counts for nominal values. These summary statistics come in useful for some of the other tasks listed below.
- Computing a correlation or covariance matrix. Once the ARFF header job has been run, computing a correlation matrix can be completed in just one pass over the data, given our handy summary stats. The matrix produced by this job can be read by Weka's Matrix class. Map tasks compute a partial matrix of covariance sums. The reduce task aggregates individual rows of the matrix in order to produce the final matrix. This means that parallelism can be exploited in the reduce phase by using as many reducers as there are rows in the matrix.
- Training a Weka classifier (or regressor). The map portion of this task can train any Weka classifier (batch or incremental) on a given data chunk, and the reduce portion then aggregates the individual models in various ways, depending on the type of classifier. Recently, a number of classifiers in Weka 3.7 have become Aggregateable. Such classifiers allow one final model, of the same type, to be produced from several separate models. Examples include naive Bayes, naive Bayes multinomial, various linear regression models (learned by SGD) and Bagging. Other, non-Aggregateable, classifiers can be combined by forming a voted ensemble using Weka's Vote meta classifier. The classifier task also has various handy options: reservoir sampling can be used with batch learners (so that a maximum number of instances processed by the learning algorithm in a given map can be enforced); normal Weka filters can be used for pre-processing in each map (the task takes care of using various special subclasses of FilteredClassifier for wrapping the base classifier and filters, depending on whether the base learner is Aggregateable and/or incremental); batch learning can be forced for incremental learners (if desired); and a special "pre-constructed" filter can be used (see below).
- Evaluating a classifier or regressor. This task handles evaluating a classifier using either the training data, a separate test set or cross-validation. Because Weka's Evaluation module is Aggregateable, and computes statistics incrementally, this is fairly straightforward. The process makes use of the classifier training task to learn an aggregated classifier in one pass over the data, and then evaluation proceeds in a second pass. In the case of cross-validation, the classifiers for all folds are learned in one go (i.e. one aggregated classifier per fold) and then evaluated. Here the learning phase can make use of up to k reducers (one per fold). In the batch learning case, the normal process of creating folds (using Instances.trainCV()/testCV()) is used, and the order of the instances in each map gets randomised first. In the case of incremental learning, instances are processed in a streaming fashion and a modulus operation is used to pull out the training/test instances corresponding to a given fold of the cross-validation (see the sketch after this list).
- Scoring using a trained classifier or regressor. This is fairly simple and just uses a trained model to make predictions. No reducer is needed in this case. The task outputs the input instances with predicted probability distributions appended. The user can specify which of the input attribute values to output along with the predictions. The task also builds a mapping between the attributes in the incoming instances and those the model is expecting, with missing attributes or type mismatches replaced with missing values.
- PreconstructedPCA. This is not a distributed task as such; instead it is a filter that can accept a correlation matrix or covariance matrix (as produced by the correlation matrix task) and produces a principal components analysis. The filter produces the same textual analysis output as Weka's standard PCA (in the attribute selection package) and also encapsulates the transformation for data filtering purposes. Once constructed, it can be used with the classifier building task.
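To make the incremental cross-validation mechanics concrete, here is a minimal sketch of the modulus-based fold routing mentioned in the evaluation task above (illustrative only - class and method names are not from the actual distributed Weka source):

// Illustrative sketch: an instance whose arrival index modulo the number of
// folds equals the fold number is held out as test data for that fold;
// everything else is used for training.
public class FoldRouting {
  static boolean isTestInstance(long index, int numFolds, int fold) {
    return index % numFolds == fold;
  }

  public static void main(String[] args) {
    int numFolds = 10; // k in k-fold cross-validation
    int fold = 3;      // the fold this pass is responsible for
    for (long index = 0; index < 20; index++) {
      System.out.println("instance " + index + " -> "
        + (isTestInstance(index, numFolds, fold) ? "test" : "train"));
    }
  }
}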
Hadoop wrappers and jobs
distributedWekaHadoop version 1.0 provides a number of utilities for configuration/HDFS, mappers and reducers that wrap the base tasks, and jobs to orchestrate everything against Apache Hadoop 1.x (in particular, it has been developed and tested against Hadoop 1.1.2 and 1.2.1).
Getting datasets in and out of HDFS
The first thing this package provides is a "Loader" and "Saver" for HDFS. These can batch transfer or stream data in and out of HDFS using any base Loader or Saver - so any data format that Weka already supports can be read from or written to HDFS. Because the package uses Hadoop's TextInputFormat for delivering data to mappers, we work solely with CSV files that have no header row. The CSVSaver in Weka 3.7.10 has a new option to omit the header row when writing a CSV file. The new HDFSSaver and HDFSLoader can be used from the command line or the Knowledge Flow GUI.
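For example, a command along the following lines writes a local file into HDFS as a headerless CSV (a sketch only: host, port and paths are placeholders, the -dest, -saver, -hdfs-host and -hdfs-port options are taken from HDFSSaver log output quoted in the comments below, and -i is assumed to follow the standard Weka saver convention for the input file; the CSVSaver's -N flag suppresses the header row):

java weka.Run HDFSSaver -i iris.csv \
  -dest /users/mhall/input/classification/iris.csv \
  -saver "weka.core.converters.CSVSaver -F , -M ? -N" \
  -hdfs-host palladium.local -hdfs-port 9000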
ARFF header creation job
The first job that the distributedWekaHadoop package provides is one to create a unified ARFF header + summary statistics from the input data. All Weka Hadoop jobs have an extensive command line interface (to facilitate scripting etc.) and a corresponding step in the Knowledge Flow GUI. The jobs also take care of making sure that all Weka classes (and dependencies) are available to map and reduce tasks executing in Hadoop. They do this by installing the Weka jar file (and other dependencies) in HDFS and then adding them to the distributed cache and classpath for the job.
java weka.Run ArffHeaderHadoopJob \
  -hdfs-host palladium.local -hdfs-port 9000 \
  -jobtracker-host palladium.local -jobtracker-port 9001 \
  -input-paths /users/mhall/input/classification \
  -output-path /users/mhall/output \
  -names-file $HOME/hypothyroid.names -max-split-size 100000 \
  -logging-interval 5 \
  -user-prop mapred.child.java.opts=-Xmx500m
The job has options for specifying Hadoop connection details and input/output paths. It also allows control over the number of map tasks that actually get executed via the max-split-size option (this sets dfs.block.size), as Hadoop's default of 64MB may not be appropriate for batch learning tasks, depending on data characteristics. The classifier job, covered in the next instalment of this series, has a pre-processing option to create a set of randomly shuffled input data chunks, which gives greater control over the number and size of the data sets processed by the mappers. The ARFF header job also has a set of options for controlling how the CSV input file gets parsed and processed. It is possible to specify attribute (column) names directly or have them read from a "names" file (one attribute name per line; not to be confused with the C4.5 ".names" file format) stored on the local file system or in HDFS.
Other Weka Hadoop jobs use the ARFF job internally, and it is not necessary to repeat it for subsequent jobs that process the same data set: the job can be prevented from executing by providing a path to an existing ARFF header (in or out of HDFS) to use.
This ends the first part of our coverage of the new distributed Weka functionality. In part two I'll cover the remaining Hadoop jobs for learning and evaluating classifiers and performing a correlation analysis.
The latest distributedWekaHadoop doesn't build with mvn clean package. Maven says opencsv is the problem.
-- Brian
I've just committed a fix to the pom.xml in distributedWekaBase. Thanks for pointing this out.
Cheers,
Mark.
Please, how can I install distributed Weka in my Windows environment? I am always getting an error message when I click on install.
Please, is there any step-by-step instruction on how to do it?
Which version of Weka are you using, and what is the error message? If Weka can't connect to sourceforge, then you might be behind a proxy. In this case there are instructions on configuring Weka to use a proxy at:
http://weka.wikispaces.com/How+do+I+use+the+package+manager%3F#GUI package manager-Using a HTTP proxy
If you are using Weka <= 3.8.0, then you will need to upgrade to Weka >= 3.8.1, due to an issue with sourceforge generating redirects from download links. The package manager in 3.8.1 has been fixed to deal with this.
Cheers,
Mark.
Hi Mark,
Thanks! I still don't get past 'mvn clean package'. Was this tested on a clean machine without access to any local maven repositories? There are dependency resolution problems afoot.
-- Brian
Hi - thanks for making the library available.
Might it be possible to also post an example of a short Java program running a Weka clusterer/classifier as a mapreduce job?
Thanks.
Hi Mark!
Thanks for the package. Very interesting.
One question: how can I tell the ARFF header creation job (and other jobs, especially WekaClassifierHadoopJob) that the source CSV file has a header row? I tried it, but it recognizes the first row as a data row, and so adds unnecessary values for every attribute.
I didn't find this option. Can you help me?
With best regards
Pavel Dvorkin
No, I’m afraid that it is not possible to use a CSV file with a header row. This is because Hadoop will split the file up for processing by multiple mappers and only one mapper will get the chunk that contains the header row.
You will need to remove the header row in your file before processing by any of the distributed Weka jobs. Note that the CSVSaver in Weka has an option to omit the header row when writing a CSV file, so you could try reading the file (incrementally if it is large) via the HDFSLoader+CSVLoader and then writing it back into HDFS as a new CSV file (minus header row) via HDFSSaver+CSVSaver.
Cheers,
Mark.
Hi Mark!
Can you give me installation instructions for the distributedWekaHadoop plugin? I cannot get it to connect to Weka.
Hi Sergey,
It is designed to work with Weka 3.7. If you have Weka 3.7.10 then you can install distributedWekaHadoop via the built-in package manager (GUIChooser-->Tools).
Cheers,
Mark.
Hi Mark,
Thank you very much!
I installed the plugin, but got an error: there are no JDBC modules. Where do I get them and how do I install them? Help me, please!
https://www.dropbox.com/s/7ywqrcqqkwhdfdv/Getting%20Started%20%28MacBook-Pro-Nero%27s%20conflicted%20copy%202014-03-22%29.pdf
These are just warnings to let you know that there are some missing JDBC drivers. This has no impact on distributed Weka.
Cheers,
Mark.
Thank you, Mark !!!!!!!
Hi Mark,
I don't see the "HDFSSaver".
Help me please!
Sergey
https://www.dropbox.com/s/cfvw8flvl6rqkuz/%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA_%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0_24_03_14__22_28.jpg
I can't see a "Hadoop" folder on the left-hand-side in your screenshot. Have you installed "distributedWekaHadoop" via the package manager (GUIChooser-->Tools)?
Once installed correctly, you will find HDFSLoader under "DataSources" and HDFSSaver under "DataSinks".
Cheers,
Mark.
Hey, can you please give me a solution: why is Weka not working on a large dataset using the multilayer perceptron classifier, while it works on naive Bayes using the same data?
You might need to expand a bit on "not working". What are you doing exactly and what is happening (i.e. errors, exceptions etc.)?
Cheers,
Mark.
Hi Mark,
What about Hadoop 2.2? I tried to use the HDFSSaver and got the following:
11:52:57: [Saver] HDFSSaver$254396694|-dest / -saver "weka.core.converters.CSVSaver -F , -M ? -decimal 6" -hdfs-host 10.165.140.57 -hdfs-port 8020| problem saving. org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
It will work on Hadoop 2 (at least it did for me when I ran a few quick tests a while back). You will have to swap out the Hadoop 1.2.x client libraries that come with the distributedWekaHadoop package for the 2.x versions though.
Also, the UI isn't really set up for Yarn. The job tracker settings don't apply anymore (but you will still need to have something in there so that Weka doesn't complain). In addition you will need to set a few properties in the User properties part of the dialog. In particular, I set:
yarn.nodemanager.aux-services=mapreduce_shuffle
mapreduce.framework.name=yarn
After this was set, I managed to run all Weka job types successfully - on a local single node pseudo-distributed setup at least.
Cheers,
Mark.
I just started using your package and want to make it work with my Hadoop 2.x.
Could you please specify which library files to replace and how to change the user properties?
Thanks in advance
I have managed to build distributedWekaHadoop against the Hadoop 2.2.0 libraries and test it successfully, both locally and on AWS in fully distributed mode. If Mark is OK with that I can post a link to the package.
Hi ArisKK,
Sounds great! Please share the link to the package with the community (and any other settings/properties you used). Was it necessary to re-compile against Hadoop 2.2.0? 1.2 is supposed to be binary compatible with 2.x, and it worked for me with just a straight swap of jar files.
Cheers,
Mark.
Could you specify the steps clearly on which library files to swap and where? I am new to Hadoop and need to run this job immediately using Hadoop 2.4. Kindly help.
Where do I set the properties, and in which file? Kindly help.
ArisKK, could you please supply your package? I can't get my package to work with YARN.
When I first added "-user-prop mapreduce.framework.name=yarn" the program got stuck looking for the resourcemanager at 0.0.0.0; when I added "-user-prop yarn.resourcemanager.address=IP_of_the_resourcemanager", I got it to work... though all mapper jobs fail.
I think the problem must be that the program doesn't find $HADOOP_CONF_DIR or $YARN_HOME. If I run the application without YARN it seems to work quite fine, but then it's running locally.
Another problem is that when I use a predefined header (with @relation at the top) my program freezes; if I remove @relation I get the exception that @relation is supposed to be at the top, and the program continues to create a new header.
/Peter
ReplyDeleteHi all,
When I use a CSVLoader to provide instances to HDFSSaver, the CSVLoader automatically adds a header att1..att5 like this:
att1,att2,att3,att4,att5
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
If I set noHeaderRowPresent=False I get the following error:
[Loader] CSVLoader$1560672686|-M ? -B 100 -E ",' -F ,| Attribute names are not unique! Causes: '85'.
For that reason, when I use the CSVToArffHeaderHadoopJob I get this ARFF header:
@relation 'A relation name'
@attribute att1 {att1,overcast,rainy,sunny}
@attribute att2 {64,65,68,69,70,71,72,75,80,81,83,85,att2}
@attribute att3 {65,70,75,80,85,86,90,91,95,96,att3}
@attribute att4 {FALSE,TRUE,att4}
@attribute att5 {att5,no,yes}
@attribute arff_summary_att1 {att1_1.0,overcast_4.0,rainy_5.0,sunny_5.0,**missing**_0.0}
@attribute arff_summary_att2 {64_1.0,65_1.0,68_1.0,69_1.0,70_1.0,71_1.0,72_2.0,75_2.0,80_1.0,81_1.0,83_1.0,85_1.0,att2_1.0,**missing**_0.0}
@attribute arff_summary_att3 {65_1.0,70_3.0,75_1.0,80_2.0,85_1.0,86_1.0,90_2.0,91_1.0,95_1.0,96_1.0,att3_1.0,**missing**_0.0}
@attribute arff_summary_att4 {FALSE_8.0,TRUE_6.0,att4_1.0,**missing**_0.0}
@attribute arff_summary_att5 {att5_1.0,no_5.0,yes_9.0,**missing**_0.0}
@data
So my questions are: how can I save a CSV file without a header? And how can I specify the type (numeric, nominal, etc.) of each attribute? (Every attribute appears as nominal, and when I use CorrelationMatrixHadoopJob the Hadoop log shows me this error: DistributedWekaException: No numeric attributes in the input data!)
Thanks in advance ;)
You should leave the CSVLoader options at their defaults (i.e. read the header row). The HDFSSaver uses a CSVSaver internally when it writes the data to HDFS. There is an option in the CSVSaver to omit writing the header row - you need to make sure that this is turned on so that the CSV file in HDFS does not have a header row. See the "Getting datasets in and out of HDFS" section in the post.
Cheers,
Mark.
Thanks Mark!
I didn't realize that the HDFSSaver has its own CSVSaver internally. I just changed the noHeaderRow parameter to "False" and everything works fine.
Cheers,
Yari.
Thanks for your such valuable information. But could you please kindly do me a favor?
I am a new Weka and Hadoop learner, and I've successfully installed the distributedWekaBase and distributedWekaHadoop packages using the method you mentioned before: through the GUI->Tools.
When I finished the installation I could see a "Hadoop" folder, but I cannot find either "distributedWekaBase" or "distributedWekaHadoop" in the "available" list in the PackageManager. I tried to run the "CSVtoArffHeaderHadoopJob", and it always prompted "no customer class".
p.s. I checked the configuration by clicking "more", and I noticed that in the "CSVtoArffHeaderHadoopJob" configuration panel there is no additional Weka jar item shown.
So could you kindly give me some advice?
Thanks in advance ;)
Hi,
I re-built the package using the CDH 4.3.0 version of Hadoop. I tried running a WekaClassifierHadoopJob on my cluster under YARN; the jars seem to get copied correctly into HDFS but the job fails. Looks like the same problem as faced by Peter.
Also, while trying the same thing on a separate classic map-reduce cluster, I provided the mapreduce.framework.name user property as classic, but there seemed to be some problem with the cluster initialization, and I got the generic error "Please check your configuration for mapreduce.framework.name and the correspond server addresses." Am I missing something here?
Thanks in advance!
-Kuntal
I'll have to let others comment if someone has been successful with CDH + Yarn.
I ran the package successfully (without recompilation) under mapreduce version 1 on a CDH 4.4 quick-start VM. All that was required in this case was a jar swap - CDH client jars for the Apache jars that come with distributedWekaHadoop.
I've also been successful with running against Apache Hadoop 2.x under Yarn. This was map-reduce managed by Yarn and required the Apache 2.x jars. I used:
mapreduce.framework.name=yarn
yarn.nodemanager.aux-services=mapreduce_shuffle
ArisKK has also run the package successfully against Apache Hadoop 2.x.
Cheers,
Mark.
Hi Mark,
Thanks for your quick reply. I tried running on classic as well, but with no luck. Could you tell me what user properties you provided, or what other configuration changes you made, while running the job on classic mapreduce on CDH 4?
Thanks in advance!
-Kuntal
I didn't have to set any properties at all to run under MR v1 on CDH 4.4. Admittedly, the only service I had running was MR v1, and other services (including YARN) were shut down.
What error/exceptions are you seeing?
Cheers,
Mark.
When I am trying to run a WekaClassifierHadoopJob from Knowledge Flow, on a cluster running classic MRv1 on CDH4, I am getting the following in the weka log:
Combined: -weka-jar "C:\\Program Files\\Weka-3-7\\weka.jar" -logging-interval 10 -hdfs-host 10.5.5.82 -hdfs-port 8020 -jobtracker-host 10.5.5.82 -jobtracker-port 8021 -input-paths /nbeg/kuntal/cpu.csv -output-path /nbeg/kuntal/weka -A MAA,MBB,MCC,MDD,MEE,MFF,CLLL -M ? -E ' -F , -model-file-name outputModel.model -num-iterations 1 -class last -W weka.classifiers.functions.LinearRegression -fold-number -1 -total-folds 1 -seed 1 -- -S 0 -R 1.0E-8
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using jobtracker: 10.5.5.82:8021
Setting logging interval to 10
weka.distributed.DistributedWekaException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at weka.distributed.hadoop.WekaClassifierHadoopJob.runJob(WekaClassifierHadoopJob.java:1047)
at weka.gui.beans.AbstractHadoopJob.runJob(AbstractHadoopJob.java:238)
at weka.gui.beans.AbstractHadoopJob.start(AbstractHadoopJob.java:278)
at weka.gui.beans.FlowRunner$1.run(FlowRunner.java:125)
Caused by: weka.distributed.DistributedWekaException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at weka.distributed.hadoop.ArffHeaderHadoopJob.runJob(ArffHeaderHadoopJob.java:945)
at weka.distributed.hadoop.WekaClassifierHadoopJob.initializeAndRunArffJob(WekaClassifierHadoopJob.java:815)
at weka.distributed.hadoop.WekaClassifierHadoopJob.runJob(WekaClassifierHadoopJob.java:1029)
... 3 more
Caused by: weka.distributed.DistributedWekaException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at weka.distributed.hadoop.HadoopJob.runJob(HadoopJob.java:570)
at weka.distributed.hadoop.ArffHeaderHadoopJob.runJob(ArffHeaderHadoopJob.java:920)
... 5 more
Caused by: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1239)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1235)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1234)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1263)
at weka.distributed.hadoop.HadoopJob.runJob(HadoopJob.java:543)
... 6 more
However, all other (non-weka) jobs are running fine on my cluster.
-Kuntal
Are you sure that you've copied the correct CDH client jars into ${user.home}/wekafiles/packages/distributedWekaHadoop/lib? I took the jars from:
/usr/lib/hadoop/client-0.20
There seemed to be more jars than needed in that directory (two copies of each in fact - one with *cdh4* in the name and one without), but that didn't cause a problem for me.
Of course, you'll need to make sure your cluster is running MR v1.
There are a number of posts around on various forums about the "Cannot initialize Cluster" error. Take a look at:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/HoIXbmDnAFY
http://stackoverflow.com/questions/19043970/cannot-initialize-cluster-please-check-your-configuration-for-mapreduce-framewo
Cheers,
Mark.
We copied all the jars from the CDH4-installed version of Hadoop into the distributedWekaHadoop library, but we are encountering the following error when we try to create a knowledge flow using the HDFSLoader or HDFSSaver:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataOutputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Please help me debug the error encountered, and mention the versions of Java and Hadoop compatible with the Weka Hadoop implementation.
Did you remove the existing jars from
${user.home}/wekafiles/packages/distributedWekaHadoop/lib
and then copy all jars from CDH's
/usr/lib/hadoop/client-0.20
into ${user.home}/wekafiles/packages/distributedWekaHadoop/lib?
This process has worked successfully for several people using CDH 4.4 and 4.7. Java version 1.6 or 1.7 should be fine.
Cheers,
Mark.
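In concrete terms, the jar swap described above amounts to something like the following (a sketch: paths are those used in this thread; adjust for your own home directory and CDH install):

rm ~/wekafiles/packages/distributedWekaHadoop/lib/*.jar
cp /usr/lib/hadoop/client-0.20/*.jar ~/wekafiles/packages/distributedWekaHadoop/lib/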
Thank you for your quick response
We followed the same procedure and replaced the jars in
${user.home}/wekafiles/packages/distributedWekaHadoop/lib with
jars from ../cloudera/parcels/hadoop/lib/client-0.20
After that procedure we encountered the following error during the startup of WEKA:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:190)
weka.core.ClassDiscovery.find(ClassDiscovery.java:344)
weka.gui.GenericPropertiesCreator.generateOutputProperties(GenericPropertiesCreator.java:532)
weka.gui.GenericPropertiesCreator.execute(GenericPropertiesCreator.java:629)
weka.gui.GenericPropertiesCreator.<init>(GenericPropertiesCreator.java:162)
weka.core.WekaPackageManager.refreshGOEProperties(WekaPackageManager.java:1144)
weka.core.WekaPackageManager.loadPackages(WekaPackageManager.java:1134)
weka.core.WekaPackageManager.loadPackages(WekaPackageManager.java:1047)
weka.gui.GenericObjectEditor.determineClasses(GenericObjectEditor.java:177)
weka.gui.GenericObjectEditor.<init>(GenericObjectEditor.java:247)
weka.gui.GUIChooser.<init>(GUIChooser.java:714)
weka.gui.GUIChooser.createSingleton(GUIChooser.java:260)
weka.gui.GUIChooser.main(GUIChooser.java:1573)
We also switched the Java version to 1.6 when we encountered an error with version 1.7.
We have Cloudera 4.7 running on our cluster
Regards,
Harsha
It looks like there are still core Hadoop classes missing. Can you try copying jars from
/usr/lib/hadoop/client-0.20
rather than
../cloudera/parcels/hadoop/lib/client-0.20?
Cheers,
Mark.
Hello,
I have an error when I start the Weka package manager (Weka developer version 3.7.11):
java.net.SocketTimeoutException: connect timed out weka package manager
Do you have any idea?
Best Regards,
Said SI KADDOUR
Do you have to go through a proxy for internet access? See:
http://weka.wikispaces.com/How+do+I+use+the+package+manager%3F#GUI package manager-Using a HTTP proxy
Cheers,
Mark.
Hi Mark,
Thank you for the very nice tutorial. I have a question regarding WekaClassifierEvaluationTest in 1.0.7:
When you set up the evaluator, you make it like this, using the iris dataset:
double[] priors = { 50.0, 50.0, 50.0 };
evaluator.setup(new Instances(train, 0), priors, 150, 1L, 0);
Why are you defining "count" as 150? From the Javadoc, one can read that count refers to "the total number of class values seen (with respect to the priors)". As for your iris example, I understand it should be 3: the number of classes in the dataset. I tried both with 3 and 150 and (apparently) there is no difference in the obtained results.
Am I getting something wrong?
Thank you in advance!
Alberto
Hi Alberto,
Sorry, the javadoc is a little misleading here. That value is actually the sum of the instance weights from which the class prior counts were computed. I should really just simplify the API because it's just the sum of the priors[] :-)
See weka.classifiers.evaluation.AggregateableEvaluationWithPriors and its superclass (in the main weka distribution jar) weka.classifiers.evaluation.Evaluation.
Cheers,
Mark.
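In other words, continuing Alberto's snippet from above, the third argument can simply be computed from the priors (a sketch; evaluator and train are as in the original test code):

double[] priors = { 50.0, 50.0, 50.0 };
double count = 0;   // sum of the class prior counts
for (double p : priors) {
  count += p;       // 50 + 50 + 50 = 150 for the iris data
}
evaluator.setup(new Instances(train, 0), priors, count, 1L, 0);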
Ok, I see now :)
Thanks!
Alberto
ReplyDeleteCould you specify the steps to be followed to make the distributedwekaPackage work with hadoop 2.4 ? I am unable to connect my hadoop node with weka although they are running
ReplyDeleteHi Mark
I am facing a problem.
When I try to use WekaClassifierEvaluationHadoopJob on 24 lakh (2.4 million) instances in a multi-node cluster (2 nodes), only the different types of errors come out as the output, not the accuracy percentage or the confusion matrix.
What should I do in order to get the accuracy percentage as well?
Waiting for your reply
Perhaps your class attribute is numeric? In that case only error metrics and the correlation coefficient are output.
Cheers,
Mark.
Yeah
After I converted the class attribute to nominal values I got the accuracy percentage as well.
Hi Mark, I've got another problem.
I have installed the SMOTE preprocessor through the package manager on Ubuntu and it is loaded, but when I use this preprocessor, or any other package installed through the package manager, an error is thrown:
java.io.IOException:java.lang.ClassNotFoundException:weka.filters.supervised.instance.SMOTE
How do I rectify this problem?
Non-core Weka code needs to find its way into the distributed cache and the classpath for jobs. There is a configuration option in the Weka jobs called "additionalWekaPackages" that takes a comma-separated list of package names - this can be used to make sure package jar files (and any dependent libraries) get included in the job's classpath.
Cheers,
Mark.
After including the package in the additionalWekaPackages configuration it is still showing the same error, and the classpath is pointing to the weka.jar file.
Blast! There is a bug in the GUI that's resulting in the additionalWekaPackages property not getting set. I've just released a new version of the distributedWekaHadoop package that fixes this.
DeleteCheers,
Mark.
Hi Mark.
In the classifier building stage it is not showing an error, but in the classifier evaluation stage it is showing the same error: "Can't find the class PSOSearch". And PSOSearch.jar has been copied from the packages in wekafiles to HDFS.
Also, if I want to use GeneticSearch, which is present in the attributeSelectionSearchMethods package, it shows an error in the classifier building stage, as it is not copying attributeSelectionSearchMethods.jar to HDFS.
You are correct. There is still a problem with the evaluation stage - additional package libraries are not getting into the classpath of the job at that point. I'll take a look at it when I get a chance.
DeleteThe attributeSelectionSearchMethods jar gets copied over fine for me (at least for the classifier building stage).
Cheers,
Mark.
Hi Mark
I am facing a problem. I am trying to integrate Hadoop and Weka in Java code, so I have included the Hadoop and distributedWekaHadoop jars in Eclipse and started to code. I want to store a file which is present in HDFS as an Instances object, so that I can easily pass the dataset to a Weka classification/clustering algorithm. I have created an HDFSLoader object and tried to call setSource() with a file as its parameter, but I am getting an error like "Setting file as source is not supported".
Kindly give me some idea to recover from this error.
Thanks in advance
You should call setHDFSPath() on HDFSLoader. This option can take either an absolute file path or a URL (hdfs://<host>:<port>/...); in the case of the former, you need to also specify the HDFS host and port via getConfig().setHDFSHost() and getConfig().setHDFSPort().
Cheers,
Mark.
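Putting Mark's reply together into a runnable shape, a minimal sketch (assuming HDFSLoader lives in weka.core.converters like Weka's other converters, and that the port setter takes a string - check the package javadocs for the exact signatures):

import weka.core.Instances;
import weka.core.converters.HDFSLoader;

public class HDFSLoadExample {
  public static void main(String[] args) throws Exception {
    HDFSLoader loader = new HDFSLoader();
    loader.getConfig().setHDFSHost("palladium.local"); // HDFS namenode host
    loader.getConfig().setHDFSPort("9000");            // HDFS namenode port
    loader.setHDFSPath("/users/mhall/input/iris.csv"); // absolute path in HDFS
    Instances data = loader.getDataSet();              // batch-load as Instances
    System.out.println(data.numInstances() + " instances loaded");
  }
}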
Thanks for your reply Mark. It works.
Sir, I don't know how to combine Weka with Hadoop and I want a detailed explanation of how to do it. I have just installed Hadoop and downloaded distributedWekaBase. Please give me a detailed explanation.
Hi, I am a Java/Scala developer looking for a project that will deepen my understanding of Weka. I have tinkered with Apache Spark a little, and am wondering if there is an ongoing effort (to develop a Weka Spark plugin) that I can contribute to. If not, might it be a good idea to start such a project?
Thanks for your views.
- Sanjay Dasgupta
Yes, there is a distributedWekaSpark plugin. Take a look at:
http://markahall.blogspot.co.nz/2015/03/weka-and-spark.html
The code can be found at:
https://svn.cms.waikato.ac.nz/svn/weka/trunk/packages/internal/distributedWekaSpark/
Cheers,
Mark.
Hi Mark,
We are trying to parallelize an InformationGain calculation to build decision trees.
We read the data into the Weka Instance data type to create attributes and then load the values to build the dataset in the Java program. However, it seems that Hadoop MapReduce does not recognize the Instance data type and processes everything serially(?). So, is there a way to transfer the Instance data to the HDFS format? Or some other solution which avoids reading and writing ARFF files?
Thank you and regards,
David Nettleton.
Hi David,
Yes, MapReduce in Hadoop (for plain text sources at least) does stream rows into the map tasks. MR tasks in distributed Weka process CSV data (without header rows) and convert each row streamed in by the Hadoop framework to an Instance object internally. In order to do this, an initial MR pass over the data is needed to infer data types and build an ARFF header containing the attribute information. This ARFF header is then loaded by subsequent MR jobs in order to convert the CSV data into Instances. No ARFF files, beyond the ARFF header file, are written to or read from HDFS. I'd expect that the same approach should work for your application. You should be able to leverage the existing ArffHeaderHadoopJob to accomplish the header creation.
Cheers,
Mark.
Hi Mark,
Thank you for your reply and explanation. This enabled us to get a little further, but now we have an error when trying to create the ARFF header. Below are the details. I would be very grateful if you could tell us where we are going wrong!
We are using:
Hadoop 2.5.2
Weka 3.7.12
Distributed Weka Hadoop 1.0.15
Distributed Weka Base 1.0.12
Thank you and regards, David Nettleton.
We are trying to run a ArffHeaderHadoopJob with this code:
String opts = "-hdfs-host 127.0.0.1 -hdfs-port 9000 -jobtracker-host 127.0.0.1 -jobtracker-port 8021 -input-paths hdfs://localhost:9000/user/iris/test.csv -output-path ./test -max-split-size 100000 -logging-interval 5 -user-prop mapred.child.java.opts=-Xmx500m";
ArffHeaderHadoopJob arffjob = new ArffHeaderHadoopJob();
arffjob.setOptions(weka.core.Utils.splitOptions(opts));
arffjob.setAttributeNames("id,age,sex");
arffjob.runJob();
The output error indicates a NullPointerException in weka.distributed.CSVToARFFHeaderMapTask.makeStructure:
2015-05-20 16:42:13,111 INFO [Thread-30] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2015-05-20 16:42:13,165 WARN [Thread-30] mapred.LocalJobRunner (LocalJobRunner.java:run(560)) - job_local516291587_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
at weka.distributed.CSVToARFFHeaderMapTask.makeStructure(CSVToARFFHeaderMapTask.java:1430)
at weka.distributed.CSVToARFFHeaderMapTask.getHeader(CSVToARFFHeaderMapTask.java:1183)
at weka.distributed.hadoop.CSVToArffHeaderHadoopMapper.cleanup(CSVToArffHeaderHadoopMapper.java:203)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
In the CSVToARFFHeaderMapTask.makeStructure method, m_attributeTypes is null, which causes the NullPointerException:
for (int i = 0; i < m_attributeTypes.length; i++) {
if (m_attributeTypes[i] == TYPE.UNDETERMINED) {
// type conflicts due to all missing values are handled
// in the reducer by checking numeric types against nominal/string
m_attributeTypes[i] = TYPE.NUMERIC;
}
}
Hi David,
Here is an example that runs on the iris data in my local setup. Note that input and output paths are always in HDFS (there is no local filesystem support in distributedWekaHadoop), therefore there is no need to supply a URL in the input and output paths. Furthermore, the paths are relative to your home directory in HDFS, unless fully qualified.
import weka.distributed.hadoop.*;

public class HadoopTest {
  public static void main(String[] args) {
    try {
      String opts = "-hdfs-host palladium.local -hdfs-port 9000 -jobtracker-host palladium.local -jobtracker-port 9001 -input-paths input/classification2 -output-path output -A petallength,petalwidth,sepallength,sepalwidth,class";
      ArffHeaderHadoopJob arffjob = new ArffHeaderHadoopJob();
      arffjob.setOptions(weka.core.Utils.splitOptions(opts));
      arffjob.runJob();
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
Oh, I also meant to add that the input-paths are paths to directories that contain your input csv files. This is how Hadoop's TextInputFormat works, it processes directories.
Cheers,
Mark.
Hello Mark,
I am trying to use your code in order to test distributed Weka. I receive the following error:
Exception in thread "main" java.lang.NoSuchMethodError: weka.distributed.CSVToARFFHeaderMapTask.getComputeSummaryStats()Z
at weka.distributed.hadoop.ArffHeaderHadoopJob.runJob(ArffHeaderHadoopJob.java:628)
at org.sailendra.jmilliettest.DistributedWeka.main(DistributedWeka.java:21)
Do you have any idea?
Thanks.
Hello,
Could you please confirm the compatibility between the Weka 3.7 and Hadoop 2.5.2 versions?
Best,
Iris
I have run distributedWekaHadoop against Hadoop 2.2.0 and 2.6.0 without having to recompile the Weka code. Read the discussion in previous comments for information on swapping jars and configuration properties to run against map-reduce under YARN.
Cheers,
Mark.
Hello Mark,
We have been following your advice to try to integrate distributedWekaHadoop into our Hadoop 2.5.2 installation. These are the steps we've taken:
- Changed OpenJDK to Oracle's JDK 1.8.
- Installed and verified Hadoop 2.5.2. All Hadoop components work and are able to submit and execute jobs.
- Installed a copy of the distributedWekaHadoop .jar.
- Replaced all .jar files within distributedWekaHadoop for which Hadoop 2.5.2 provides newer versions. The files are simply copied from Hadoop 2.5.2 to distributedWekaHadoop, and the old files provided by distributedWekaHadoop are deleted.
- However, there are three files that we are not sure of:
--> hsqldb, which is provided by Hadoop but lives in the 'examples' folder
--> oro; there is no trace of 'oro' anywhere in Hadoop
--> hadoop-core, which is also not present in Hadoop 2.5.2
Last, we generated an Eclipse project to run our code, and we are always getting the NullPointerException that David Nettleton posted on 20 May.
Are we missing something? Besides, in a previous post you state that 'replacing jar files' AND 'configuring properties' is enough to make distributed Weka work on Hadoop 2.5.2. What does 'configuring properties' mean? Do we need to do something else besides copying files?
Thank you,
Hi,
Basically you need all the jars in the subdirectories of share/hadoop. There are several properties that need to be set (for a simple YARN setup) for distributedWekaHadoop versions < 1.0.16:
yarn.resourcemanager.scheduler.address=<host:port>
yarn.nodemanager.aux-services=mapreduce_shuffle
mapreduce.framework.name=yarn
yarn.resourcemanager.hostname=<host>
You will still need to set a non-empty string for the jobtracker option (even though this is not needed for running against YARN) as Weka checks for it.
Note that for more complicated YARN clusters it is just easier to place the cluster config dir (etc/hadoop) in the classpath when starting Weka. This way, Weka picks up all the cluster settings when it creates Configuration objects, and you no longer have to specify the above properties via the -user-prop option.
Cheers,
Mark.
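For example, on the command line these properties can be supplied via -user-prop (a sketch: host names are placeholders, and repeating -user-prop for each property is assumed to work, as the earlier comments suggest):

java weka.Run ArffHeaderHadoopJob \
  -hdfs-host namenode.local -hdfs-port 9000 \
  -jobtracker-host resourcemanager.local -jobtracker-port 8032 \
  -input-paths input/classification -output-path output \
  -user-prop mapreduce.framework.name=yarn \
  -user-prop yarn.resourcemanager.hostname=resourcemanager.local \
  -user-prop yarn.nodemanager.aux-services=mapreduce_shuffle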
Hi Mark,
I have actually been trying to install an image of Hadoop using this link (https://nabisaheb.wordpress.com/2013/04/15/hadoop-installation-on-windows/).
I tried to configure the ArffLoader and HDFSSaver; however, the flow in Weka runs but never finishes, and I don't get any exception.
Can you help me with what is wrong if I use an image for Weka Hadoop? I successfully added the distributedWekaHadoop 1.x extensions, BTW.
Are you executing Weka from the Windows side (to talk to Hadoop in the VM), or Weka from the Hadoop VM? In either case, you should check both the Weka log file (~/wekafiles/weka.log) and the logs in Hadoop to see if there are exceptions.
Cheers,
Mark.
Hi Mark,
I am trying to convert ARFF to CSV using an ArffLoader connected to the HDFSSaver. However, every time I run the workflow it only runs the ArffLoader part; the later part never runs. Also, is there a specific way of creating a .names file in order to specify attribute values for the later step?
It sounds like there are problems connecting to HDFS on your cluster. If you start Weka from the command prompt, and then try executing, are there errors/exceptions printed? You will need to have installed the correct version of distributedWekaHadoop - distributedWekaHadoop for 1.x clusters or distributedWekaHadoop2 for 2.x clusters.
Configuration is more complicated if you are running against Hadoop 2.x. Weka attempts to set various properties for 2.x programmatically; however, for non-trivial cluster configurations it's best to actually include the cluster configuration directory in the classpath when launching Weka (this way the Hadoop classes will read your config files directly).
Cheers,
Mark.
Hi Mark,
Thank you for replying. I had permission issues on the cluster, but I got past that issue. I am now stuck on how to create the .names file. Is there a particular way to create it? Can you provide a sample of a .names file?
Hi,
The names file format is simple - just one attribute name per line, in the order that the columns occur in the CSV file.
Cheers,
Mark.
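For example, a names file for a three-column CSV with id, age and sex columns would contain just:

id
age
sex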
Hi Mark,
Is it possible to run the multilayer perceptron algorithm with the Weka Spark or Hadoop modules?
Yes, it should be possible. Note that there isn't a distributed version of the MLP, so you will get an ensemble of voted MLP classifiers as the result.
Cheers,
Mark.
Hi Mark,
I have another question: is it possible to connect the Weka Spark and Hadoop modules to the Google cloud platform? Do you see any contraindications?
I'm afraid I don't have any experience with the Google cloud platform. Given that you can run a Linux OS in their VMs, I would hazard a guess that it should be possible to get distributed Weka working. I know folks have got it working on Amazon's offering.
Cheers,
Mark.
Ok, thanks a lot for your clear answers. I am trying to run Weka using a Hadoop cloud provider; I cannot deal with the effort needed to build and manage a homemade Hadoop installation. I will follow your advice and try with Amazon. Thanks again! Alessandro
Hi Alessandro, I want to know if you happened to use Weka on Amazon, because I am in the same situation and am hesitant about using distributed Weka or another tool on Amazon.
DeleteHello Mark
I have installed distributed Weka Hadoop in Weka successfully, but when I try to do your example above I get an error:
Call to localhost/127.0.0.1:8021 failed on connection exception: java.net.ConnectException: Connection refused: no further information
Could you please help me?
Thanks
Hello Mark
Thank you so much for sharing this information. I am using Apache Hadoop 2.x; where do I get the distributedWekaHadoop2 package? Please guide me so that I can proceed with my PhD work.
Hi Mark,
ReplyDeletei have already installed Hadoop 2.6.0 on ubuntu.
Can i install Weka and distributed Weka hadoop on it.
Thanks
Jasleen
Hi Mark, can you also write a post on how to try this on Amazon EMR?
I've never experimented with EMR. However, there are some community members who have used distributed Weka on EMR. They might comment if they are reading this blog.
Cheers,
Mark.
Hi Mahdi ,I want to know, if you happened to use weka on EMR please because I am in the same case, I am hesitant to use Distributed weka or another tool on EMR
DeleteHi Mark, After integrating Hadoop with weka, I tried to save the data in HDFS. But Weka showing error like "05:24:43: [Saver] HDFSSaver$28977129|-dest /user/iris.csv -saver "weka.core.converters.CSVSaver -F , -M ? -N -decimal 6" -hdfs-host localhost -hdfs-port 8020| problem saving. java.io.IOException: Mkdirs failed to create /user"
ReplyDeletePlease help me to resolve this problem
How did u integrate hadoop with weka??
You probably don't have permissions to create files in /user in HDFS. Try writing to your own user directory.
DeleteCheers,
Mark.
How do I integrate Weka with Hadoop, and also with Spark? I want a detailed step-by-step process for this.
ReplyDeletethanks
Hello,
I am trying to run the CSVToArffHeaderHadoop job. I was successful in executing the HDFSSaver, but when I try to submit the header job it gets stuck at "ACCEPTED: waiting for AM container to be allocated, launched and register with RM.". I tried to analyse the user logs for the job, but they just contain the error "log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /usr/local/hadoop_latest/logs/userlogs/application_1462905273984_0004/container_1462905273984_0004_01_000001 (Is a directory)".
Please help me to resolve this issue
thanks
Did you install the distributedWekaHadoop2 package? This is for Apache Hadoop 2.x clusters. If you are using another distribution (such as Cloudera or Hortonworks) then you'll need to replace the libraries in the lib directory of distributedWekaHadoop2Libs with the ones that come with your Hadoop distribution. Configuration is somewhat more complicated for Hadoop 2.x clusters than it is for 1.x. The best bet is to make sure that the configuration directory for your cluster is included in the CLASSPATH when you start Weka.
Cheers,
Mark.
Hi Mark
I need to run this example as my "hadoop" user, but when I try, I get this error: Exception in thread "main" java.lang.InternalError: Can't connect to X11 window server using ':0' as the value of the DISPLAY variable.
If I run with the "comp1" user, it works perfectly. But I need to run with the "hadoop" user.
I have already tried changing the value of the DISPLAY variable, but it did not work.
Can you help me please?
Thank you
Perhaps you can ssh into the hadoop user account with X-forwarding? Or, if you are using sudo or su, then perhaps this will help:
https://debian-administration.org/article/494/Getting_X11_forwarding_through_ssh_working_after_running_su
Cheers,
Mark.
Ok, it worked!
Thank you
Hi Mark! I am getting an EOFException error for HDFSSaver and CSVToARFFHeaderHadoopJob that I am unable to resolve. The log details are as follows:
15:44:27: [FlowRunner] launching flow start points in parallel...
15:44:27: [FlowRunner] Launching flow 1...
15:44:27: [FlowRunner] Launching flow 2...
15:44:27: [FlowRunner] Launching flow 3...
15:44:27: [FlowRunner] Launching flow 4...
15:44:27: [Loader] CSVLoader$210879646|-M ? -B 100 -E ",' -F ,| loaded iris_noHeader
15:44:32: [Saver] HDFSSaver$1444764012|-dest /Users/atul/Desktop/iris_noHeader.csv -saver "weka.core.converters.CSVSaver -F , -M ? -N -decimal 6" -hdfs-host localhost -hdfs-port 17500| problem saving. java.io.EOFException: End of File Exception between local host is: "DhrumelMac.local/63.139.218.179"; destination host is: "localhost":17500; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
15:44:32: CSVToARFFHeaderHadoopJob$1507644908|ERROR: End of File Exception between local host is: "DhrumelMac.local/63.139.218.179"; destination host is: "localhost":17500; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
Hi
I'm looking for the distributed Weka documentation, to understand exactly what the options -num-folds, -total-folds, -num-nodes, -logging-interval, -max-split-size, -randomized-chunks, -user-prop and others do, but I have not been able to find information about these options yet.
Where can I find the documentation for these options?
Thank you in advance
ReplyDeleteHi Mark
We have been following your advice to try to use distributedWekaHadoop with Hadoop 2.6.0, but we get this error:
"weka.distributed.DistributedWekaException: Call From hadoopmaster/10.112.10.11 to hadoopmaster:54311 failed on connection exception: java.net.ConnectException: Conexão recusada [Connection refused];"
These are the steps we've taken:
- My etc/hosts file already has: 10.112.10.11 hadoopmaster
- Copied all the distributedweka jars to the subdirectories of share/hadoop.
- Edited the yarn-site.xml with "yarn.nodemanager.aux-services=mapreduce_shuffle", "yarn.resourcemanager.scheduler.address=", "yarn.resourcemanager.hostname=" and the mapred-site.xml with "mapreduce.framework.name=yarn".
*hadoopmaster:54311 is the value for the property "mapred.job.tracker" that is on the mapred-site.xml file.
I didn't understand, in the answer above ("You will still need to set a non-empty string for the jobtracker option (even though this is not needed for running against YARN) as Weka checks for it."), whether you mean a non-empty string in mapred-site.xml or in the commands used to run distributed Weka? We tried this configuration in both situations, but it hasn't worked yet.
Are we missing something? Please, we need your help!
Thanks in advance
Hmm. I'm not too sure what is going on here. Are you using YARN and the new mapreduce? Things are pretty confusing in Hadoop - there is the old API (org.apache.hadoop.mapred) and the new one (org.apache.hadoop.mapreduce). Weka's implementations are all on the new API.
If you are running YARN/Hadoop 2 then I'm pretty sure that mapred.job.tracker is not used, and the yarn.resourcemanager.* properties are the important ones. Weka accommodates both Hadoop 2 and 1 via the "jobTracker" field in the GUI. When the distributedWekaHadoop2 package is installed the tool tip for this field should show "jobtracker/resource manager" (or something to that effect), so you should enter the host that your resource manager is running on here. It is highly recommended that your hadoop conf directory is in the CLASSPATH when starting Weka when using YARN. There are so many more configuration properties for YARN when running non-trivial clusters that Weka does not attempt to provide options/config dialogs for everything, and instead relies on the Hadoop classes grabbing stuff that they need from the config files in the classpath.
Cheers,
Mark.
Great! It is working!
How can we solve the problem of heap space? This is the error: "[ClassifierMapTask] Memory (free/total/max.) in bytes: 10.090.400 / 200.015.872 / 200.015.872".
These are the steps we've done:
1º Added -Xmx2048m in the command (but didn't work)
"java -Xmx2048m -classpath CLASSPATH:weka.jar:/usr/local/hadoop/etc/hadoop/* weka.Run ....."
2º Added the following property in mapred-site.xml (but it didn't work):
mapred.child.java.opts=-Xmx2048m
3º Added these other properties in mapred-site.xml (but they didn't work):
mapreduce.reduce.memory.mb=1024
mapreduce.map.memory.mb=1024
Are we missing something else?
Thanks
Hi Mark
By using "mapred.child.java.opts=-Xmx1024m" in the command it worked!
The reason it wasn't working was maybe that the classpath was not configured correctly, since this same property was set in mapred-site.xml. Anyway, now it is working.
Thanks
Hi Mark
I would like to understand: if we have 8 nodes, each one with 3GB of RAM, why is it not possible to use the sum of all available memory in the cluster (24GB) with the -Xmx option?
It is only possible to use less than 3GB, e.g. -Xmx2048m (2GB). If we use, for example, -Xmx20480m (20GB) in the commands it doesn't work.
Thanks in advance
Hadoop is not a shared memory architecture. You can't just add together the available memory on each node. Each map task has to be able to execute within the confines of memory on a particular node.
Cheers,
Mark.
Ok, thanks!
Hi Mark
At first we solved this heap problem by adding "-user-prop mapred.child.java.opts=-Xmx1024m" to the command. But now it is necessary to set more than one configuration property, so we are trying to follow your recommendation "to place the cluster config dir (etc/hadoop) in the classpath when starting Weka". But we get an error, because there are no jar or zip files in this directory to put on the classpath - the config files there are .sh, .xml and .cmd files.
This is the command used: "java -Xmx2048m -classpath weka.jar;/usr/local/hadoop/etc/hadoop/*.* weka.Run ..."
What are we missing in setting this classpath? Do you have an example? Please, we need your help!
Thanks
Adding the config directory to the classpath should not cause any errors or exceptions. Classpath entries do not have to contain just class files or jar files. Hadoop reads configuration files from the classpath. If none are present in the classpath it will use default settings. This is the easiest way to ensure that Weka Hadoop jobs have the correct configuration settings for your cluster.
Cheers,
Mark.
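To make that concrete (a sketch assuming a Linux machine and the paths from the question): add the directory itself rather than a *.* glob, and use ':' as the classpath separator on Linux (';' is Windows-only):

java -Xmx2048m -classpath /usr/local/hadoop/etc/hadoop:weka.jar weka.Run ...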
Hi Mark
Is there any published paper about distributed Weka? What would be the correct citation for this tool?
Thanks
Hi Rodrigo,
There is no publication for distributed Weka specifically, I'm afraid. Probably the best thing to do is to cite the data mining book.
Cheers,
Mark.
Hi Mark
Ok, thanks!
Please give documentation or links for Hadoop integrated with Weka.
Hi Mark, can you provide an example of using the distributed Weka packages, namely distributedWekaBase, without Hadoop or Spark?
Thanks in advance
Hello, I want to run a parallel algorithm (an ensemble method, for example bagging) on cloud computing. These algorithms essentially build sets of models: subsets of the data are trained on individually, but instead of the models being merged into a single final model during the reduce phase, they are combined using voting techniques. So I want to know whether this type of algorithm can be parallelized using distributed Weka for Hadoop.
Bagging in Weka extends a base class called ParallelIteratedSingleClassifierEnhancer. Distributed Weka checks for this and then has each worker build (num iterations / num workers) base classifiers. Bagging combines the base classifiers via voting (actually by averaging predicted probability distributions in the Weka implementation).
Cheers,
Mark.
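A minimal standalone sketch of the idea (my illustration - the file name, iteration counts and worker count are assumptions, and this runs one worker's share locally rather than showing distributed Weka's actual wiring):

import weka.classifiers.meta.Bagging;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PerWorkerBagging {
    public static void main(String[] args) throws Exception {
        // Pretend this is the data chunk handed to one map task.
        Instances chunk = DataSource.read("chunk0.arff");
        chunk.setClassIndex(chunk.numAttributes() - 1);

        int totalIterations = 100; // size of the final ensemble (assumption)
        int numWorkers = 10;       // number of map tasks (assumption)

        // Because Bagging is iterated and parallelizable, each worker can be
        // asked to build just its share of the bagged base classifiers.
        Bagging bagger = new Bagging();
        bagger.setNumIterations(totalIterations / numWorkers);
        bagger.buildClassifier(chunk);

        // The per-worker ensembles are then combined in the reduce phase by
        // voting - in Weka's implementation, by averaging the predicted
        // class probability distributions.
    }
}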
Thank you for your answer. I tried to use distributedWekaSpark and distributedWekaHadoop in an EMR cluster on Amazon, but I ran into obstacles. In short, I would like to know whether there are tutorials explaining how to use distributedWekaSpark (or distributedWekaHadoop) on EMR.
Hello markahall,
I have installed Hadoop on Windows 8; is it possible to use Weka with it? If so, I have Weka 3.6 - do I need to upgrade it?
Actually, I have a modified algorithm in Weka version 3.6 and I want to test it and run it in a Hadoop environment. How can I do that?
Hi admin, I am stuck with an error, please help me out:
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Executing ARFF Job....
09:10:55: [Basic] HDFSSaver$798862737|-dest /user/abq/input/trains.csv -saver "weka.core.converters.CSVSaver -F , -M ? -N -decimal 6" -hdfs-host iksenode2 -hdfs-port 8020|Save successful
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /repository/weka-3-9-3/weka.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaHadoopCore/distributedWekaHadoopCore.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/distributedWekaBase.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/lib/opencsv-2.3.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/lib/jfreechart-1.0.13.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/lib/jcommon-1.0.16.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/lib/colt-1.2.0.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/lib/la4j-0.4.5.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Copying /home/abq/wekafiles/packages/distributedWekaBase/lib/t-digest-3.1.jar to HDFS
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|Submitting job: ARFF instances header job [-names-file, trains.names, -M, ?, -E, ', -F, ,, -compression, 50.0, -decimal-places, 2]
09:10:55: WekaClassifierEvaluationHadoopJob$461293594|ARFF instances header job [-names-file, trains.names, -M, ?, -E, ', -F, ,, -compression, 50.0, -decimal-places, 2] Setup: 0.0 Map: 0.0 Reduce: 0.0
09:11:05: WekaClassifierEvaluationHadoopJob$461293594|Unable to continue - creating the ARFF header failed!
09:11:05: [ERROR] WekaClassifierEvaluationHadoopJob$461293594|Job failed
09:11:05: [Low] WekaClassifierEvaluationHadoopJob$461293594|Interrupted
09:11:05: [Low] TextViewer$1631109379|Interrupted
09:11:05: [Low] ArffLoader$491842255|Interrupted
09:11:05: [Low] HDFSSaver$798862737|-dest /user/abq/input/trains.csv -saver "weka.core.converters.CSVSaver -F , -M ? -N -decimal 6" -hdfs-host iksenode2 -hdfs-port 8020|Interrupted