Tuesday, 15 October 2013

Weka and Hadoop Part 1

How to handle large datasets with Weka is a question that crops up frequently on the Weka mailing list and forums. This post is the first of three that outline what's available, in terms of distributed processing functionality, in several new packages for Weka 3.7. This series of posts is continued in part 2 and part 3.

The first new package is called distributedWekaBase. It provides base "map" and "reduce" tasks that are not tied to any specific distributed platform. The second, called distributedWekaHadoop, provides Hadoop-specific wrappers and jobs for these base tasks. In the future there could be other wrappers - one based on the Spark platform would be cool.

Base map and reduce tasks

distributedWekaBase version 1.0 provides tasks for:

  1. Determining a unified ARFF header from separate data chunks in CSV format. This is important because, as Weka users know, Weka is quite particular about metadata - especially when it comes to nominal attributes. At the same time this task computes some handy summary statistics (stored as additional "meta attributes" in the header), such as count, sum, sum of squares, min, max, number missing, mean, standard deviation and frequency counts for nominal values. These summary statistics come in useful for some of the other tasks listed below.
  2. Computing a correlation or covariance matrix. Once the ARFF header job has been run, computing a correlation matrix can be completed in just one pass over the data, given our handy summary stats. The matrix produced by this job can be read by Weka's Matrix class. Map tasks compute a partial matrix of covariance sums; the reduce tasks aggregate individual rows of the matrix in order to produce the final matrix. This means that parallelism can be exploited in the reduce phase by using as many reducers as there are rows in the matrix.
  3. Training a Weka classifier (or regressor). The map portion of this task can train any Weka classifier (batch or incremental) on a given data chunk, and the reduce portion then aggregates the individual models in various ways, depending on the type of classifier. Recently, a number of classifiers in Weka 3.7 have become Aggregateable. Such classifiers allow one final model, of the same type, to be produced from several separate models. Examples include: naive Bayes, naive Bayes multinomial, various linear regression models (learned by SGD) and Bagging. Other, non-Aggregateable, classifiers can be combined by forming a voted ensemble using Weka's Vote meta classifier. The classifier task also has various handy options: reservoir sampling can be used with batch learners (so that a maximum number of instances processed by the learning algorithm in a given map can be enforced); normal Weka filters can be used for pre-processing in each map (the task takes care of wrapping the base classifier and filters in the appropriate special subclass of FilteredClassifier, depending on whether the base learner is Aggregateable and/or incremental); batch learning can be forced for incremental learners (if desired); and a special "pre-constructed" filter can be used (see below).
  4. Evaluating a classifier or regressor. This task handles evaluating a classifier using either the training data, a separate test set or cross-validation. Because Weka's Evaluation module is Aggregateable, and computes statistics incrementally, this is fairly straightforward. The process makes use of the classifier training task to learn an aggregated classifier in one pass over the data, and then evaluation proceeds in a second pass. In the case of cross-validation, the classifiers for all folds are learned in one go (i.e. one aggregated classifier per fold) and then evaluated. In this case, the learning phase can make use of up to k reducers (one per fold). In the batch learning case, the normal process of creating folds (using Instances.trainCV()/testCV()) is used, and the order of the instances in each map gets randomised first. In the case of incremental learning, instances are processed in a streaming fashion and a modulus operation is used to pull out the training/test instances corresponding to a given fold of the cross-validation.
  5. Scoring using a trained classifier or regressor. This is fairly simple: it just uses a trained model to make predictions, and no reducer is needed in this case. The task outputs the input instances with predicted probability distributions appended. The user can specify which of the input attribute values to output along with the predictions. The task also builds a mapping between the attributes in the incoming instances and those that the model is expecting, with missing attributes or type mismatches replaced with missing values.
  6. PreconstructedPCA. This is not a distributed task as such; instead it is a filter that can accept a correlation matrix or covariance matrix (as produced by the correlation matrix task) and produces a principal components analysis. The filter produces the same textual analysis output as Weka's standard PCA (in the attribute selection package) and also encapsulates the transformation for data filtering purposes. Once constructed, it can be used with the classifier building task.
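
To make the map/reduce split in task 1 concrete, here is a small illustrative sketch (plain Python, not the actual Weka Java code) of how per-chunk summary statistics can be merged by a reducer, and how the mean and standard deviation fall out of the aggregated sums:

```python
import math

def chunk_stats(values):
    """'Map' step: summary statistics for one chunk of a numeric attribute."""
    return {"count": len(values), "sum": sum(values),
            "sumsq": sum(v * v for v in values),
            "min": min(values), "max": max(values)}

def merge_stats(a, b):
    """'Reduce' step: combine two partial summaries into one."""
    return {"count": a["count"] + b["count"],
            "sum": a["sum"] + b["sum"],
            "sumsq": a["sumsq"] + b["sumsq"],
            "min": min(a["min"], b["min"]),
            "max": max(a["max"], b["max"])}

def mean_and_stdev(s):
    """Derive mean and sample standard deviation from the merged sums."""
    n = s["count"]
    mean = s["sum"] / n
    variance = (s["sumsq"] - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)

# Two "chunks" of the same attribute, merged as a reducer would do it:
merged = merge_stats(chunk_stats([1.0, 2.0, 3.0]), chunk_stats([4.0, 5.0]))
mean, stdev = mean_and_stdev(merged)  # same as a single pass over all five values
```

Because each statistic is either additive (count, sum, sum of squares) or order-free (min, max), the merge can happen in any order across any number of chunks, which is exactly what makes these quantities safe to compute in parallel.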

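The one-pass covariance computation described in task 2 can be sketched in the same map/reduce style. Each map contributes a partial sum of cross-products, and a reducer combines these with the means taken from the header's summary statistics (again a hypothetical Python sketch, not Weka's implementation):

```python
def partial_crossprod(rows):
    """'Map' step: sum of x*y cross-products (and a count) for one chunk."""
    return sum(x * y for x, y in rows), len(rows)

def covariance(partials, mean_x, mean_y):
    """'Reduce' step: aggregate the partial sums into one covariance entry."""
    sxy = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return (sxy - n * mean_x * mean_y) / (n - 1)

# Chunks of (x, y) pairs; the means would come from the summary statistics.
parts = [partial_crossprod([(1.0, 2.0), (2.0, 4.0)]),
         partial_crossprod([(3.0, 6.0)])]
cov_xy = covariance(parts, mean_x=2.0, mean_y=4.0)
```

In the real job one such entry is needed for every pair of attributes, and each reducer assembles a complete row of the matrix - which is why up to one reducer per matrix row can be used.
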
Hadoop wrappers and jobs

distributedWekaHadoop version 1.0 provides a number of utilities for configuration/HDFS, mappers and reducers that wrap the base tasks, and jobs to orchestrate everything against Apache Hadoop 1.x (in particular, it has been developed and tested against Hadoop 1.1.2 and 1.2.1).

Getting datasets in and out of HDFS

The first thing this package provides is a "Loader" and "Saver" for HDFS. These can batch transfer or stream data in and out of HDFS using any base Loader or Saver - so any data format that Weka already supports can be written to or read from HDFS. Because the package uses Hadoop's TextInputFormat for delivering data to mappers, we work solely with CSV files that have no header row. The CSVSaver in Weka 3.7.10 has a new option to omit the header row when writing a CSV file. The new HDFSSaver and HDFSLoader can be used from the command line or the Knowledge Flow GUI.
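
Since the jobs expect headerless CSV input, any existing header row has to be stripped before the data lands in HDFS - either via the CSVSaver option just mentioned, or with a trivial pre-processing step outside Weka. A stand-alone sketch of that step (illustrative Python; the file content is made up):

```python
import csv
import io

def strip_header(csv_text):
    """Return the CSV content minus its first (header) row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows[1:])
    return out.getvalue()

raw = "outlook,temperature,humidity\nsunny,85,85\nrainy,70,96\n"
headerless = strip_header(raw)  # data rows only, ready for TextInputFormat
```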


ARFF header creation job

The first job that the distributedWekaHadoop package provides is one to create a unified ARFF header + summary statistics from the input data. All Weka Hadoop jobs have an extensive command line interface (to facilitate scripting etc.) and a corresponding step in the Knowledge Flow GUI. The jobs also take care of making sure that all Weka classes (and dependencies) are available to map and reduce tasks executing in Hadoop. They do this by installing the Weka jar file (and other dependencies) in HDFS and then adding them to the distributed cache and classpath for the job.


java weka.Run ArffHeaderHadoopJob \
-hdfs-host palladium.local -hdfs-port 9000 \
-jobtracker-host palladium.local -jobtracker-port 9001 \
-input-paths /users/mhall/input/classification \
-output-path /users/mhall/output \
-names-file $HOME/hypothyroid.names -max-split-size 100000 \
-logging-interval 5 \
-user-prop mapred.child.java.opts=-Xmx500m

The job has options for specifying Hadoop connection details and input/output paths. It also allows control over the number of map tasks that actually get executed via the max-split-size option (this sets dfs.block.size), as Hadoop's default of 64MB may not be appropriate for batch learning tasks, depending on data characteristics. The classifier job, covered in the next instalment of this series, has a pre-processing option to create a set of randomly shuffled input data chunks, which gives greater control over the number and size of the data sets processed by the mappers. The ARFF header job also has a set of options for controlling how the CSV input file gets parsed and processed. It is possible to specify attribute (column) names directly or have them read from a "names" file (one attribute name per line; not to be confused with the C4.5 ".names" file format) stored on the local file system or in HDFS.
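
As a rough back-of-the-envelope illustration of how max-split-size affects parallelism (the numbers here are hypothetical; the actual splitting is performed by Hadoop):

```python
import math

def approx_num_mappers(input_size_bytes, max_split_size_bytes):
    """Each split is handed to one mapper, so the split size bounds the
    number of map tasks that run over a single input file."""
    return math.ceil(input_size_bytes / max_split_size_bytes)

# A 1 MB input with the 100,000-byte split size from the example command
# gives about 11 map tasks, versus a single mapper at the 64MB default.
small_splits = approx_num_mappers(1_048_576, 100_000)
default_split = approx_num_mappers(1_048_576, 64 * 1024 * 1024)
```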

Other Weka Hadoop jobs use the ARFF job internally, and it is not necessary to re-run it for subsequent jobs that process the same data set; the job can be prevented from executing by providing a path to an existing ARFF header (in or out of HDFS) to use.

The image below shows what the job produces for the UCI hypothyroid dataset. Given the configuration for this job shown above, the header gets stored as /users/mhall/output/arff/hypothyroid.arff in HDFS. It also gets displayed by the TextViewer in the Knowledge Flow. "Class" is the last of the actual data attributes and the ones that occur after that are the summary meta attributes that correspond to each of the nominal or numeric attributes in the data.


This ends the first part of our coverage of the new distributed Weka functionality. In part two I'll cover the remaining Hadoop jobs for learning and evaluating classifiers and performing a correlation analysis.

51 comments:

  1. The latest distributedWekaHadoop doesn't build with mvn clean package. Maven says opencsv is the problem.

    -- Brian

  2. I've just committed a fix to the pom.xml in distributedWekaBase. Thanks for pointing this out.

    Cheers,
    Mark.

  3. Hi Mark,
    Thanks! I still don't get past 'mvn clean package'. Was this tested on a clean machine without access to any local maven repositories? There are dependency resolution problems afoot.

    -- Brian


  5. Hi - thanks for making the library available.
    Might it be possible to also post an example of a short java running a weka clusterer/classifier as a mapreduce job?
    Thanks.

  8. Hi Mark!
    Thanks for the package. Very interesting.
    One question: how can I tell the ARFF header creation job (and other jobs, especially WekaClassifierHadoopJob) that the source CSV file has a header row? I tried it, but it recognizes the first row as a data row, so it adds unnecessary values for every parameter.
    I didn't find this option. Can you help me ?

    With best regards
    Pavel Dvorkin

  9. No, I’m afraid that it is not possible to use a CSV file with a header row. This is because Hadoop will split the file up for processing by multiple mappers and only one mapper will get the chunk that contains the header row.

    You will need to remove the header row in your file before processing by any of the distributed Weka jobs. Note that the CSVSaver in Weka has an option to omit the header row when writing a CSV file, so you could try reading the file (incrementally if it is large) via the HDFSLoader+CSVLoader and then writing it back into HDFS as a new CSV file (minus the header row) via HDFSSaver+CSVSaver.

    Cheers,
    Mark.

  10. Hi Mark!
    Can you give me the installation steps for the distributedWekaHadoop plugin? I cannot get it to connect to Weka

  11. Hi Sergey,

    It is designed to work with Weka 3.7. If you have Weka 3.7.10 then you can install distributedWekaHadoop via the built-in package manager (GUIChooser-->Tools).

    Cheers,
    Mark.

    Replies
    1. Hi Mark,

      Thank you very much!
      I installed the plugin, but got an error: there are no (JDBC) modules. Where can I get them and how do I install them? Help me, please!

      https://www.dropbox.com/s/7ywqrcqqkwhdfdv/Getting%20Started%20%28MacBook-Pro-Nero%27s%20conflicted%20copy%202014-03-22%29.pdf

  12. These are just warnings to let you know that there are some missing JDBC drivers. This has no impact on distributed Weka.

    Cheers,
    Mark.

  13. Hi Mark,
    I don't see "HDFSSaver".
    Help me please !
    Sergey

    https://www.dropbox.com/s/cfvw8flvl6rqkuz/%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA_%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0_24_03_14__22_28.jpg

  14. I can't see a "Hadoop" folder on the left-hand-side in your screenshot. Have you installed "distributedWekaHadoop" via the package manager (GUIChooser-->Tools)?

    Once installed correctly, you will find HDFSLoader under "DataSources" and HDFSSaver under "DataSinks".

    Cheers,
    Mark.

  15. hey, can you please give me the solution... why is Weka not working on a large dataset using the multilayer perceptron classifier, but working with naive Bayes on the same data?

  16. You might need to expand a bit on "not working". What are you doing exactly and what is happening (i.e. errors, exceptions etc.)?

    Cheers,
    Mark.

  17. hi Mark,
    what about Hadoop 2.2? I tried to use the HDFSSaver and got the following:

    11:52:57: [Saver] HDFSSaver$254396694|-dest / -saver "weka.core.converters.CSVSaver -F , -M ? -decimal 6" -hdfs-host 10.165.140.57 -hdfs-port 8020| problem saving. org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

  18. It will work on Hadoop 2 (at least it did for me when I ran a few quick tests a while back). You will have to swap out the Hadoop 1.2.x client libraries that come with the distributedWekaHadoop package for the 2.x versions though.

    Also, the UI isn't really set up for Yarn. The job tracker settings don't apply anymore (but you will still need to have something in there so that Weka doesn't complain). In addition you will need to set a few properties in the User properties part of the dialog. In particular, I set:

    yarn.nodemanager.aux-services=mapreduce_shuffle
    mapreduce.framework.name=yarn

    After this was set, I managed to run all Weka job types successfully - on a local single node pseudo-distributed setup at least.

    Cheers,
    Mark.

    Replies
    1. I just started using your package and want to make it work with my Hadoop 2.x.
      Could you please specify which library files to replace and how to change the user properties?

      Thanks in advance

    2. I have managed to build distributedWekaHadoop against Hadoop 2.2.0 libraries and tested it successfully both locally and on AWS in fully distributed mode. If Mark is OK with that, I can post a link to the package.

    3. Hi ArisKK,

      Sounds great! Please share the link to the package with the community (and any other settings/properties you used). Was it necessary to re-compile against Hadoop 2.2.0? 1.2 is supposed to be binary compatible with 2.x, and it worked for me with just a straight swap of jar files.

      Cheers,
      Mark.

    4. This comment has been removed by the author.

  19. ArisKK, could you please supply your package. I can't get my package to work with yarn.
    When I first added “-user-prop mapreduce.framework.name=yarn” the program got stuck looking for the resourcemanager at 0.0.0.0; when I added “-user-prop yarn.resourcemanager.address=IP_of_the_resourcemanager” I got it to work, though all mapper jobs fail.
    I think the problem is that the program doesn't find $HADOOP_CONF_DIR or $YARN_HOME. If I run the application without yarn it seems to work quite fine, but then it's running locally..

    Another problem is that when I use a predefined header (with @relation at the top) my program freezes; if I remove @relation I get an exception saying @relation is supposed to be at the top, and the program continues to create a new header.

    /Peter

  21. This comment has been removed by the author.

  22. Hi all,
    When I use a CSVLoader to provide instances to the HDFSSaver, the CSVLoader automatically adds a header att1..att5 like this:
    att1,att2,att3,att4,att5
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    rainy,68,80,FALSE,yes

    If I set noHeaderRowPresent=False I get the following error:
    [Loader] CSVLoader$1560672686|-M ? -B 100 -E ",' -F ,| Attribute names are not unique! Causes: '85'.
    For that reason when I use CSVtoARFF HeaderHadoopJob I get this arff header:
    @relation 'A relation name'

    @attribute att1 {att1,overcast,rainy,sunny}
    @attribute att2 {64,65,68,69,70,71,72,75,80,81,83,85,att2}
    @attribute att3 {65,70,75,80,85,86,90,91,95,96,att3}
    @attribute att4 {FALSE,TRUE,att4}
    @attribute att5 {att5,no,yes}
    @attribute arff_summary_att1 {att1_1.0,overcast_4.0,rainy_5.0,sunny_5.0,**missing**_0.0}
    @attribute arff_summary_att2 {64_1.0,65_1.0,68_1.0,69_1.0,70_1.0,71_1.0,72_2.0,75_2.0,80_1.0,81_1.0,83_1.0,85_1.0,att2_1.0,**missing**_0.0}
    @attribute arff_summary_att3 {65_1.0,70_3.0,75_1.0,80_2.0,85_1.0,86_1.0,90_2.0,91_1.0,95_1.0,96_1.0,att3_1.0,**missing**_0.0}
    @attribute arff_summary_att4 {FALSE_8.0,TRUE_6.0,att4_1.0,**missing**_0.0}
    @attribute arff_summary_att5 {att5_1.0,no_5.0,yes_9.0,**missing**_0.0}

    @data

    So my questions are: how can I save a CSV file without a header? And how can I specify the type (numeric, nominal, etc.) of each attribute? (Every attribute appears as nominal, and when I use CorrelationMatrixHadoopJob the Hadoop log shows me this error: DistributedWekaException: No numeric attributes in the input data!)

    Thanks in advance ;)

    Replies
    1. You should leave the CSVLoader options at their defaults (i.e. read the header row). The HDFSSaver uses a CSVSaver internally when it writes the data to HDFS. There is an option in the CSVSaver to omit writing the header row - you need to make sure that this is turned on so that the CSV file in HDFS does not have a header row. See the "Getting datasets in and out of HDFS" section in the post.

      Cheers,
      Mark.

    2. Thanks Mark!
      I didn't realize that the HDFSSaver has its own CSVSaver internally. I just changed the noHeaderRow parameter and everything works fine.

      Cheers,
      Yari.

  23. Thanks for your valuable information. But could you please kindly do me a favor?
    I am a new Weka and Hadoop learner, and I've successfully installed the distributedWekaBase and distributedWekaHadoop packages using the method you mentioned before: through the GUI->Tools.
    When I finished my installation, I can see a "Hadoop" folder, but I cannot find either "distributedWekaBase" or "distributedWekaHadoop" in the "available" list in the PackageManager. I tried to run the "CSVtoArffHeaderHadoopJob", and it always prompted "no customer class".
    p.s. I checked the configuration by clicking "more" and I noticed that in the "CSVtoArffHeaderHadoopJob" configuration panel, there is no additionwekajar item shown.

    So could you kindly give me some advice?
    Thanks in advance ;)

  24. Hi,
    I re-built the package using CDH 4.3.0 version of hadoop. I tried running a WekaClassifierHadoopJob on my cluster in yarn, the jars seem to get copied correctly into HDFS but the job fails. Looks like the same problem as faced by Peter.
    Also, while trying the same in a separate classic map reduce cluster, I provided the mapreduce.framework.name user property as classic, but there seemed to be some problem with the cluster initialization, and I am getting the generic error "Please check your configuration for mapreduce.framework.name and the correspond server addresses." Am I missing something here?

    Thanks in advance!
    -Kuntal

  25. I'll have to let others comment if someone has been successful with CDH + Yarn.

    I ran the package successfully (without recompilation) under mapreduce version 1 on a CDH 4.4 quick-start VM. All that was required in this case was a jar swap - CDH client jars for the Apache jars that come with distributedWekaHadoop.

    I've also been successful with running against Apache Hadoop 2.x under Yarn. This was map-reduce managed by Yarn and required the Apache 2.x jars. I used:

    mapreduce.framework.name=yarn
    yarn.nodemanager.aux-services=mapreduce_shuffle

    ArisKK has also run the package successfully against Apache Hadoop 2.x.

    Cheers,
    Mark.


  26. Hi Mark,
    Thanks for your quick reply. I tried running on classic as well, but with no luck. Could you tell me what user properties you had provided, or what other configuration changes you had made, while running the job in classic mapreduce on CDH 4?

    Thanks in advance!
    -Kuntal

  27. I didn't have to set any properties at all to run under MR v1 on CDH 4.4. Admittedly, the only service I had running was MR v1, and other services (including YARN) were shut down.

    What error/exceptions are you seeing?

    Cheers,
    Mark.

  28. When I am trying to run a WekaClassifierHadoopJob from Knowledge Flow, on a cluster running classic MRv1 on CDH4, I am getting the following in the weka log:

    Combined: -weka-jar "C:\\Program Files\\Weka-3-7\\weka.jar" -logging-interval 10 -hdfs-host 10.5.5.82 -hdfs-port 8020 -jobtracker-host 10.5.5.82 -jobtracker-port 8021 -input-paths /nbeg/kuntal/cpu.csv -output-path /nbeg/kuntal/weka -A MAA,MBB,MCC,MDD,MEE,MFF,CLLL -M ? -E ' -F , -model-file-name outputModel.model -num-iterations 1 -class last -W weka.classifiers.functions.LinearRegression -fold-number -1 -total-folds 1 -seed 1 -- -S 0 -R 1.0E-8
    Combined: -weka-jar "C:\\Program Files\\Weka-3-7\\weka.jar" -logging-interval 10 -hdfs-host 10.5.5.82 -hdfs-port 8020 -jobtracker-host 10.5.5.82 -jobtracker-port 8021 -input-paths /nbeg/kuntal/cpu.csv -output-path /nbeg/kuntal/weka -A MAA,MBB,MCC,MDD,MEE,MFF,CLLL -M ? -E ' -F , -model-file-name outputModel.model -num-iterations 1 -class last -W weka.classifiers.functions.LinearRegression -fold-number -1 -total-folds 1 -seed 1 -- -S 0 -R 1.0E-8
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Using jobtracker: 10.5.5.82:8021
    Setting logging interval to 10
    weka.distributed.DistributedWekaException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    weka.distributed.hadoop.WekaClassifierHadoopJob.runJob(WekaClassifierHadoopJob.java:1047)
    weka.gui.beans.AbstractHadoopJob.runJob(AbstractHadoopJob.java:238)
    weka.gui.beans.AbstractHadoopJob.start(AbstractHadoopJob.java:278)
    weka.gui.beans.FlowRunner$1.run(FlowRunner.java:125)

    at weka.distributed.hadoop.WekaClassifierHadoopJob.runJob(WekaClassifierHadoopJob.java:1047)
    at weka.gui.beans.AbstractHadoopJob.runJob(AbstractHadoopJob.java:238)
    at weka.gui.beans.AbstractHadoopJob.start(AbstractHadoopJob.java:278)
    at weka.gui.beans.FlowRunner$1.run(FlowRunner.java:125)
    Caused by: weka.distributed.DistributedWekaException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at weka.distributed.hadoop.ArffHeaderHadoopJob.runJob(ArffHeaderHadoopJob.java:945)
    at weka.distributed.hadoop.WekaClassifierHadoopJob.initializeAndRunArffJob(WekaClassifierHadoopJob.java:815)
    at weka.distributed.hadoop.WekaClassifierHadoopJob.runJob(WekaClassifierHadoopJob.java:1029)
    ... 3 more
    Caused by: weka.distributed.DistributedWekaException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at weka.distributed.hadoop.HadoopJob.runJob(HadoopJob.java:570)
    at weka.distributed.hadoop.ArffHeaderHadoopJob.runJob(ArffHeaderHadoopJob.java:920)
    ... 5 more
    Caused by: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
    at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
    at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1239)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1235)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapreduce.Job.connect(Job.java:1234)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1263)
    at weka.distributed.hadoop.HadoopJob.runJob(HadoopJob.java:543)
    ... 6 more

    However, all other (non-weka) jobs are running fine on my cluster.

    -Kuntal

  29. Are you sure that you've copied the correct CDH client jars into ${user.home}/wekafiles/packages/distributedWekaHadoop/lib? I took the jars from:

    /usr/lib/hadoop/client-0.20

    There seemed to be more jars than needed in that directory (two copies of each in fact - one with *cdh4* in the name and one without), but that didn't cause a problem for me.

    Of course, you'll need to make sure your cluster is running MR v1.

    There are a number of posts on various forums around about the "Cannot initialize Cluster" error. Take a look at:

    https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/HoIXbmDnAFY
    http://stackoverflow.com/questions/19043970/cannot-initialize-cluster-please-check-your-configuration-for-mapreduce-framewo

    Cheers,
    Mark.

  30. We copied all the jars from the CDH4 installed version of Hadoop into the distributedWekaHadoop library,
    but we are encountering the following error when we try to create a Knowledge Flow using the HDFS Loader or HDFS Saver:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataOutputStream
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)


    Please help me debug the error encountered, and mention the versions of Java and Hadoop compatible with the Weka Hadoop implementation

  31. Did you remove the existing jars from

    ${user.home}/wekafiles/packages/distributedWekaHadoop/lib

    and then copy all jars from

    CDH's /usr/lib/hadoop/client-0.20 into
    ${user.home}/wekafiles/packages/distributedWekaHadoop/lib

    This process has worked successfully for several people using CDH 4.4 and 4.7. Java version 1.6 or 1.7 should be fine.

    Cheers,
    Mark.

  32. Thank you for your quick response

    We followed the same procedure and replaced the jars in
    ${user.home}/wekafiles/packages/distributedWekaHadoop/lib with
    jars from ../cloudera/parcels/hadoop/lib/client-0.20

    After the procedure we encounter the following error during the startup of WEKA

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
    java.lang.Class.forName0(Native Method)
    java.lang.Class.forName(Class.java:190)
    weka.core.ClassDiscovery.find(ClassDiscovery.java:344)
    weka.gui.GenericPropertiesCreator.generateOutputProperties(GenericPropertiesCreator.java:532)
    weka.gui.GenericPropertiesCreator.execute(GenericPropertiesCreator.java:629)
    weka.gui.GenericPropertiesCreator.(GenericPropertiesCreator.java:162)
    weka.core.WekaPackageManager.refreshGOEProperties(WekaPackageManager.java:1144)
    weka.core.WekaPackageManager.loadPackages(WekaPackageManager.java:1134)
    weka.core.WekaPackageManager.loadPackages(WekaPackageManager.java:1047)
    weka.gui.GenericObjectEditor.determineClasses(GenericObjectEditor.java:177)
    weka.gui.GenericObjectEditor.(GenericObjectEditor.java:247)
    weka.gui.GUIChooser.(GUIChooser.java:714)
    weka.gui.GUIChooser.createSingleton(GUIChooser.java:260)
    weka.gui.GUIChooser.main(GUIChooser.java:1573)

    We also altered the Java version to 1.6 when we encountered an error with version 1.7
    We have Cloudera 4.7 running on our cluster

    Regards,
    Harsha

  33. It looks like there are still core Hadoop classes missing. Can you try copying jars from

    /usr/lib/hadoop/client-0.20

    rather than

    ../cloudera/parcels/hadoop/lib/client-0.20?

    Cheers,
    Mark.

  34. Hello,

    I have an error when I start the Weka package manager (Weka developer version 3.7.11):

    java.net.SocketTimeoutException: connect timed out weka package manager

    Do you have any idea ?

    Best Regards,
    Said SI KADDOUR

  35. Do you have to go through a proxy for internet access? See:

    http://weka.wikispaces.com/How+do+I+use+the+package+manager%3F#GUI package manager-Using a HTTP proxy

    Cheers,
    Mark.

  36. Hi Mark,

    thank you for the very nice tutorial. I have a question regarding
    WekaClassifierEvaluationTest in 1.0.7:

    When you set up the evaluator, you make it like this, using the iris dataset:

    double[] priors = { 50.0, 50.0, 50.0 };
    evaluator.setup(new Instances(train, 0), priors, 150, 1L, 0);

    Why are you defining "count" as 150? From the Javadoc, one can read that count refers to "the total number of class values seen (with respect to the priors)". As for your Iris example, I understand it should be 3: the number of classes in the dataset. I tried both with 3 and 150 and (apparently) there is no difference in the obtained results.

    Am I getting something wrong?

    Thank you in advance!
    Alberto

  37. Hi Alberto,

    Sorry, the javadoc is a little misleading here. That value is actually the sum of the instance weights from which the class prior counts were computed. I should really just simplify the API because it's just the sum of the priors[] :-)

    See weka.classifiers.evaluation.AggregateableEvaluationWithPriors and its superclass (in the main weka distribution jar) weka.classifiers.evaluation.Evaluation.

    Cheers,
    Mark.

  38. This comment has been removed by a blog administrator.

  39. This comment has been removed by a blog administrator.

  40. Could you specify the steps to be followed to make the distributedWekaHadoop package work with Hadoop 2.4? I am unable to connect my Hadoop node with Weka although both are running
