Thursday 1 March 2012

Sentiment analysis with Weka

With the ever-increasing growth of online social networking, text mining and social analytics are hot topics in predictive analytics. The standard approach to learning a document classifier is to convert unstructured text documents into something called the bag-of-words representation and then apply a standard propositional learning scheme to the result. Essentially this means splitting documents up into their constituent words, building a dictionary for the corpus and then converting each document into a fixed-length vector of either binary word presence/absence indicators or word frequency counts. In general, this involves two passes over the data (especially if further transformations such as TF-IDF are to be applied): one to build the dictionary, and a second to convert the text to vectors. Following this, a classifier can be learned.
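
The two passes can be sketched in a few lines. This is an illustrative toy assuming simple whitespace tokenization, not Weka's actual StringToWordVector implementation:

```python
def build_dictionary(docs):
    # Pass 1: collect the vocabulary for the whole corpus.
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(vocab)}

def to_vector(doc, dictionary, binary=False):
    # Pass 2: convert one document into a fixed-length vector of either
    # presence/absence indicators or word frequency counts.
    vec = [0] * len(dictionary)
    for word in doc.lower().split():
        index = dictionary.get(word)
        if index is not None:
            vec[index] = 1 if binary else vec[index] + 1
    return vec

docs = ["good movie", "bad bad movie"]
dictionary = build_dictionary(docs)   # {'bad': 0, 'good': 1, 'movie': 2}
vectors = [to_vector(d, dictionary) for d in docs]
```

Here `vectors` is `[[0, 1, 1], [2, 0, 1]]`: frequency counts over the fixed dictionary built in the first pass.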

Certain types of classifiers lend themselves naturally to incremental streaming scenarios, and can perform the tokenization of text and construction of the model in one pass. Naive Bayes multinomial is one such algorithm; linear support vector machines and logistic regression learned via stochastic gradient descent (SGD) are others. These methods have the advantage of being "anytime" algorithms - i.e. they can produce a prediction at any stage in the learning process. Furthermore, they scale linearly with the amount of data and can be considered "Big Data" methods.
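
To illustrate the idea (this is a minimal toy, not Weka's SGDText), here is binary logistic regression trained by SGD: each incoming instance triggers one cheap gradient step, and a probability can be requested at any point during learning.

```python
import math

class SGDLogistic:
    """Toy binary logistic regression trained by SGD, one instance at a time."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_prob(self, x):
        # Can be called at any stage of learning - the "anytime" property.
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # Single log-loss gradient step for one instance (y is 0 or 1).
        err = self.predict_prob(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err

model = SGDLogistic(n_features=2)
for _ in range(200):
    model.update([1.0, 0.0], 1)  # feature 0 marks the positive class
    model.update([0.0, 1.0], 0)  # feature 1 marks the negative class
```

One update per instance is why the cost is linear in the amount of data: the model never revisits earlier instances.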

Text classification is a supervised learning task, which means that each training document needs to have a category or "class" label provided by a "teacher". Manually labeling training data is a labor-intensive process and typical training sets are not huge. This seems to preclude the need for big data methods. Enter Twitter's endless data stream and the prediction of sentiment. The limited size of tweets encourages the use of emoticons as a compact way of indicating the tweeter's mood, and these can be used to automate the labeling of training examples [1].
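
The automatic labeling idea can be sketched as follows (the emoticon lists here are illustrative, not the ones used in [1]):

```python
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")

def label_tweet(text):
    """Return 'positive' or 'negative' based on emoticons, or None when
    the tweet carries no usable label (no emoticons, or a mixture)."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None
```

Tweets that come back as None simply contribute no training example, which is why most of a raw Twitter stream goes unused for training.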

So how can this be implemented in Weka [2]? Some new text processing components for Weka's Knowledge Flow and the addition of NaiveBayesMultinomialText and SGDText for learning models directly from string attributes make it fairly simple.


This example Knowledge Flow process incrementally reads a file containing some 850K tweets. However, using the Groovy Scripting step with a little custom code, along with the new JsonFieldExtractor step, it would be straightforward to connect directly to the Twitter streaming service and process tweets in real time. The SGDText classifier component performs tokenization, stemming, stopword removal, dictionary pruning and the learning of a linear logistic regression model, all incrementally. Evaluation is performed by interleaved testing and training, i.e. a prediction is produced for each incoming instance before it is incorporated into the model. For the purposes of evaluation, this example flow discards all tweets that don't contain any emoticons, which results in most of the data being discarded. If evaluation wasn't performed then all tweets could be scored, with only the labeled ones being used to train the model.
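
The interleaved testing-and-training (prequential) loop is easy to sketch. Assuming a model object with hypothetical `predict` and `update` methods, a helper mirroring what the flow does might look like:

```python
def prequential_accuracy(stream, model):
    """Test-then-train: predict each instance before learning from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test on the instance first...
            correct += 1
        total += 1
        model.update(x, y)          # ...then train on it
    return correct / total if total else 0.0

class MajorityClass:
    """Trivial incremental model: always predicts the most common label seen."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

acc = prequential_accuracy(
    [(None, "pos"), (None, "pos"), (None, "neg"), (None, "pos")],
    MajorityClass())
```

Because each prediction is made before the instance is used for training, the running accuracy is an honest estimate of performance on unseen data.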

SGDText is included in Weka 3.7.5. The SubstringReplacer, SubstringLabeler and NaiveBayesMultinomialText classifier (not shown in the screenshot above) will be included with Weka 3.7.6 (due out April/May 2012). In the meantime, interested folks can grab a nightly snapshot of the developer version of Weka.

Options for SGDText

References
[1] Albert Bifet and Eibe Frank. Sentiment knowledge discovery in Twitter streaming data. In Proc. 13th International Conference on Discovery Science, Canberra, Australia, pages 1-15. Springer, 2010.

[2] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3rd edition, 2011.

47 comments:

  1. Hi, is the labelled twitter data available?
    Thanks.

  2. Hi Jens,

    You can grab the labelled data in ARFF format from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSd1pyTFZkdWVRdEs5Q1NiQW1mRmF1Zw

    Cheers,
    Mark.

  3. Hi Mark,
    We are final-year students doing research on twitter sentiment analysis for one of our course modules. We are planning to first separate tweets into polar or neutral and then into + and -.

    We need some assistance on how to classify in Weka and how to create the data set. Hope you will guide us on this matter.

    Cheers,
    Pri

  4. Hi, will you be making the layout available for download by any chance?

  5. Sure. You can grab the flow layout from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSTHdfWHJ1ZHpJWms

    Cheers,
    Mark.

  6. After downloading the arff file and the layout file, I started to run the application, but there were no results shown in the "TextViewer" step. Did something go wrong when I processed it?

  7. Hi Ken,

    The data that I posted earlier was the labelled data (i.e. it had passed through the part of the flow that does the automatic labeling via substring matching). The flow is designed to work with the original 850K instance unlabeled data. You can get this from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSV3dGX2huaHRpdHc

    You'll have to change the path in the ArffLoader to point to wherever you downloaded this file to.

    Cheers,
    Mark.

    Replies
    1. Thanks Mark,

      Thanks for providing the original data set; I operated on it successfully. Besides that, I have crawled some twitter data and transformed it into CSV format, but it failed to load in the ARFF viewer, where the specific error said "java.io.IOException: wrong number of values. Read 2, expected 1, read Token[EOL], line 18". So, how can I convert the CSV to ARFF format without the above error?

      Ken

    2. How do you configure the various components? Does running the .kfml file require the Weka API, or can we do it with the graphical interface?

  8. Can we export the model file created after classification in PMML format?

  9. Weka can consume certain PMML models, but doesn't have an export facility yet. This is on the roadmap for a future release.

    Cheers,
    Mark.

  10. Hi Mark,

    I have just started using Weka, and I want to use it for a simple text classification problem with features such as unigrams, word count and position, with a naive Bayes classifier. I have found some information online and in the documentation, but I am finding it difficult to understand the difference between the unigram feature set and the bag-of-words representation in arff files.

    Suppose I have an input arff file such as the one below:

    @relation text_files

    @attribute review string
    @attribute sentiment {dummy,negative, positive}

    @data
    "this is some text", positive
    "this is some more text", positive
    "different stuff", negative

    After applying the StringToWordVector and Reorder filters, the output arff will be:


    @relation 'bagOfWords'

    @attribute different numeric
    @attribute is numeric
    @attribute more numeric
    @attribute some numeric
    @attribute stuff numeric
    @attribute text numeric
    @attribute this numeric
    @attribute sentiment {dummy,negative,positive}

    @data

    {1 1,3 1,5 1,6 1,7 positive}
    {1 1,2 1,3 1,5 1,6 1,7 positive}
    {0 1,4 1,7 negative}

    Suppose I want to train the classifier with the unigram count in each class, that is, something like the following arff file:

    @relation 'bagOfWords WordCount'

    @attribute unigram string
    @attribute count numeric
    @attribute sentiment {dummy,negative,positive}

    @data
    "this",2,positive
    "is",2,positive
    "some",2,positive
    "text",2,positive
    "more",1,positive
    "different",1,negative
    "stuff",1,negative

    I understood the third representation clearly, and extending the feature set later (with unigram position etc.) seems relatively easy. But my doubt is: is this the correct way of representing the data in arff for the classifier?

    From the API, I gathered that with StringToWordVector I can set options such as -C and -T for word counts and term frequencies.

    Suppose I want to include other features in the bag-of-words arff file; how can I do that?


    Thank you in anticipation.

    Nirmala

    Replies
    1. Hi Nirmala,

      Your original input ARFF file (the one with the string attribute and class label) can have other features that you compute elsewhere. The StringToWordVector filter will only process the string attributes - other features will be left untouched.

      Cheers,
      Mark.

  11. How can I create the arff file of the current twitter data?

    Replies
    1. The original 850K tweet file in ARFF format is available for download - see the link in the earlier comments.

      Cheers,
      Mark.

  12. Hi

    I don't understand the configuration of the SubstringLabeler and SubstringReplacer filters. If you can help me I would be very grateful.

    Thanks

    Ana

    Replies
    1. If you download the example Knowledge Flow layout file you can take a look at the configuration I used for processing the tweets.

      Cheers,
      Mark.

  13. How can I incorporate the layout in Weka to configure the various components? Do I have to run it with Eclipse?

  14. Just launch the Knowledge Flow GUI from Weka's GUIChooser application. Alternatively, you can execute the flow layout from the command line or a script by invoking weka.gui.beans.FlowRunner.

    Cheers,
    Mark.

  15. This comment has been removed by the author.

  16. This comment has been removed by the author.

  17. It's alright, I ran it from the command line. My project is to do the same work but for statuses written in Arabic. Is it possible to do it with this template, and if yes, what do I have to change?

  18. Great article here.
    Could you please provide a simple example of streaming twitter data with groovy+jsonfieldextractor?

  19. Hi Mark,

    I am new to Weka and trying to use it for sentiment analysis on twitter data. Using the labelled data file given above, when I open that file in the Weka GUI and try to choose one of the Bayes classifiers, it disables/grays out all the contents under it and doesn't allow me to select one. Actually, I wanted to use labelled.arff as a training dataset with NaiveBayesMultinomialText. Can you please let me know if I am doing something terribly wrong, or is it not supposed to work that way?

  20. The twitter datasets basically contain only a single string attribute (holding the text of each tweet) and the class (in the case of the labelled data). There are only two classifiers in Weka that can operate directly on string attributes - SGDText and NaiveBayesMultinomialText. If you want to use other classifiers then you will have to wrap your chosen classifier and the StringToWordVector filter in a weka.classifiers.meta.FilteredClassifier instance. StringToWordVector performs the bag-of-words transformation along with TF-IDF and various options for tokenising, stemming and stop-word application.

    Cheers,
    Mark.

    Replies
    1. Thanks much for your response Mark! I am planning to use NaiveBayesMultinomialText for analysis.

  21. Hi Mark,
    I was trying the labelled twitter data (mentioned above) using NaiveBayesMultinomialText and SGD. I noticed some strange behaviour with SGD: it keeps running on the training data set and doesn't give any results. On the other hand, NaiveBayesMultinomialText does give results in finite time. I tried using SGD from Java code as well as through the Weka GUI. Are there any known performance issues with SGD handling large training data text files?

  22. If you were using the Explorer then it will have been training SGD in batch mode. In batch mode the default number of epochs for training is 500. Try reducing this to something smaller. To train SGD incrementally (which is basically just one pass over the data, or one epoch) use the KnowledgeFlow, the command line interface or, in your own code, call buildClassifier() with an empty set of instances (i.e. attribute information only) to initialize and then updateClassifier() for each instance to be trained on.

    Cheers,
    Mark.

    Replies
    1. hi Mark
      Can we classify the data you provided with an SVM in Weka? If yes, please tell me how.

  23. Hi Mark!

    I'm actually doing a student-level thesis on twitter sentiment analysis at a small scale, and I want a dataset for this type of analysis to complete my experimental process. I have downloaded some training sets, but they do not work in Weka 3.6. How can I find a dataset that can be used directly? As I am new to this, I need some guidance.

    Hope you understand.

    Thank you!!

  24. Hello Mark, I have a directory (Reviews) with two subdirectories (pos and neg), and I want to convert all the text files inside the subdirectories into .arff. I am using the "java weka.core.converters.TextDirectoryLoader" command for this, but it says "Access is denied". I also changed the location of the Reviews directory, but to no avail. Please help me out with this. Thanks

  25. Thank you very much Mark. I am trying to analyse twitter data and store it in three different folders: 1. positive, 2. negative, 3. neutral. Can you please help me with what changes I need to make in the Knowledge Flow?

    Replies
    1. You can use the "FlowByExpression" KnowledgeFlow step to route predicted instances to different ARFF (or CSV) Saver steps based on the predicted probability assigned to the positive (or negative) class. You can define thresholds for what constitutes a positive and negative case, and then use a small interval around 0.5 to define neutral.

      Cheers,
      Mark.

  26. Hi Mark,

    I have used WEKA when I was a student at RGU, Aberdeen. Now I want to use WEKA for sentiment analysis of 32,000 already-downloaded tweets on a given hashtag, collected using TAGS. Please guide me on how I need to go about this; it is for my research work and I will be free to share the data.

    Thank you
    Mike
    email: awoleye@yahoo.co.uk
    twitter handle: @OMAWOLEYE

  27. Hello Mark,
    Wonderful article; it will really be helpful for my thesis. I am using your labeled data in arff format, but I cannot apply the Bayes classifiers in WEKA. Please give me a way to do that.

    Thanks in advance
    Oyaliul

    Replies
    1. Try using the NaiveBayesMultinomialText classifier - this version of naive Bayes can operate directly on String attributes.

      Cheers,
      Mark.

  28. Hi Mark,
    I'm new to Weka. I am working on the recognition of handwritten digits; our teacher gave us a base of 10k images in a .txt file, but I don't know how to convert it to a .arff file.

  29. Hi,
    Can you help me understand how to make the tool interpret "not good" as negative, or "never thought that it would be good" as positive? TIA

  30. Hi Mark,

    I'm not able to download the flow and the data, as the links didn't work. Is there another way to get both of them quickly? This is my email, in case you'd like to send the data and the flow:
    philippmayer687@gmail.com

    Thanks.

  31. Hi Mark,
    This is Pranjali from Nasik City (India), working on twitter data.
    I need a twitter sentiment analysis dataset for MEKA. Could you help, please?

  32. How can I download the tweet data? Please send me the link.
