Thursday 1 March 2012

Sentiment analysis with Weka

With the ever-increasing growth of online social networking, text mining and social analytics are hot topics in predictive analytics. The standard approach to learning a document classifier is to convert unstructured text documents into something called the bag-of-words representation and then apply a standard propositional learning scheme to the result. Essentially this means splitting documents up into their constituent words, building a dictionary for the corpus and then converting each document into a fixed-length vector of either binary word presence/absence indicators or word frequency counts. In general, this involves two passes over the data (especially if further transformations such as TF-IDF are to be applied): one to build the dictionary, and a second to convert the text to vectors. Following this, a classifier can be learned.
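
The two passes can be sketched in a few lines. This is an illustrative toy assuming simple whitespace tokenization, not Weka's actual StringToWordVector implementation:

```python
def build_dictionary(docs):
    # Pass 1: collect the vocabulary for the whole corpus.
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(vocab)}

def to_vector(doc, dictionary, binary=False):
    # Pass 2: convert one document into a fixed-length vector of either
    # presence/absence indicators or word frequency counts.
    vec = [0] * len(dictionary)
    for word in doc.lower().split():
        index = dictionary.get(word)
        if index is not None:
            vec[index] = 1 if binary else vec[index] + 1
    return vec

docs = ["good movie", "bad bad movie"]
dictionary = build_dictionary(docs)   # {'bad': 0, 'good': 1, 'movie': 2}
vectors = [to_vector(d, dictionary) for d in docs]
```

Here `vectors` is `[[0, 1, 1], [2, 0, 1]]`: frequency counts over the fixed dictionary built in the first pass.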

Certain types of classifiers lend themselves naturally to incremental streaming scenarios, and can perform the tokenization of text and construction of the model in one pass. Naive Bayes multinomial is one such algorithm; linear support vector machines and logistic regression learned via stochastic gradient descent (SGD) are others. These methods have the advantage of being "anytime" algorithms - i.e. they can produce a prediction at any stage in the learning process. Furthermore, they scale linearly with the amount of data and can be considered "Big Data" methods.
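
To illustrate the idea (this is a minimal toy, not Weka's SGDText), here is binary logistic regression trained by SGD: each incoming instance triggers one cheap gradient step, and a probability can be requested at any point during learning.

```python
import math

class SGDLogistic:
    """Toy binary logistic regression trained by SGD, one instance at a time."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_prob(self, x):
        # Can be called at any stage of learning - the "anytime" property.
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # Single log-loss gradient step for one instance (y is 0 or 1).
        err = self.predict_prob(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err

model = SGDLogistic(n_features=2)
for _ in range(200):
    model.update([1.0, 0.0], 1)  # feature 0 marks the positive class
    model.update([0.0, 1.0], 0)  # feature 1 marks the negative class
```

One update per instance is why the cost is linear in the amount of data: the model never revisits earlier instances.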

Text classification is a supervised learning task, which means that each training document needs to have a category or "class" label provided by a "teacher". Manually labeling training data is a labor-intensive process and typical training sets are not huge. This seems to preclude the need for big data methods. Enter Twitter's endless data stream and the prediction of sentiment. The limited size of tweets encourages the use of emoticons as a compact way of indicating the tweeter's mood, and these can be used to automate the labeling of training examples [1].
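
The automatic labeling idea can be sketched as follows (the emoticon lists here are illustrative, not the ones used in [1]):

```python
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")

def label_tweet(text):
    """Return 'positive' or 'negative' based on emoticons, or None when
    the tweet carries no usable label (no emoticons, or a mixture)."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None
```

Tweets that come back as None simply contribute no training example, which is why most of a raw Twitter stream goes unused for training.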

So how can this be implemented in Weka [2]? Some new text processing components for Weka's Knowledge Flow and the addition of NaiveBayesMultinomialText and SGDText for learning models directly from string attributes make it fairly simple.


This example Knowledge Flow process incrementally reads a file containing some 850K tweets. However, using the Groovy Scripting step with a little custom code, along with the new JsonFieldExtractor step, it would be straightforward to connect directly to the Twitter streaming service and process tweets in real time. The SGDText classifier component performs tokenization, stemming, stopword removal, dictionary pruning and the learning of a linear logistic regression model, all incrementally. Evaluation is performed by interleaved testing and training, i.e. a prediction is produced for each incoming instance before it is incorporated into the model. For the purposes of evaluation, this example flow discards all tweets that don't contain any emoticons, which results in most of the data being discarded. If evaluation wasn't performed then all tweets could be scored, with only the labeled ones being used to train the model.
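
The interleaved testing-and-training (prequential) loop is easy to sketch. Assuming a model object with hypothetical `predict` and `update` methods, a helper mirroring what the flow does might look like:

```python
def prequential_accuracy(stream, model):
    """Test-then-train: predict each instance before learning from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test on the instance first...
            correct += 1
        total += 1
        model.update(x, y)          # ...then train on it
    return correct / total if total else 0.0

class MajorityClass:
    """Trivial incremental model: always predicts the most common label seen."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

acc = prequential_accuracy(
    [(None, "pos"), (None, "pos"), (None, "neg"), (None, "pos")],
    MajorityClass())
```

Because each prediction is made before the instance is used for training, the running accuracy is an honest estimate of performance on unseen data.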

SGDText is included in Weka 3.7.5. The SubstringReplacer, SubstringLabeler and NaiveBayesMultinomialText classifier (not shown in the screenshot above) will be included with Weka 3.7.6 (due out April/May 2012). In the meantime, interested folks can grab a nightly snapshot of the developer version of Weka.

Options for SGDText

References
[1] Albert Bifet and Eibe Frank. Sentiment knowledge discovery in Twitter streaming data. In Proc. 13th International Conference on Discovery Science, Canberra, Australia, pages 1-15. Springer, 2010.

[2] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3rd edition, 2011.

47 comments:

  1. Hi, is the labelled twitter data available?
    Thanks.

  2. Hi Jens,

    You can grab the labelled data in ARFF format from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSd1pyTFZkdWVRdEs5Q1NiQW1mRmF1Zw

    Cheers,
    Mark.

  3. Hi Mark,
    We are final-year students doing research on twitter sentiment analysis for one of our course modules. We are planning to first separate tweets into polar or neutral and then into + and -.

    We need some assistance on how to classify in Weka and how to create the data set. Hope you will guide us on this matter.

    Cheers,
    Pri

  4. Hi, will you be making the layout available for download by any chance?

  5. Sure. You can grab the flow layout from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSTHdfWHJ1ZHpJWms

    Cheers,
    Mark.

  6. After downloading the arff file and the layout file, I started to run the application, but there were no results shown in the "TextViewer" step. Did something go wrong when I processed it?

  7. Hi Ken,

    The data that I posted earlier was the labelled data (i.e. it had passed through the part of the flow that does the automatic labeling via substring matching). The flow is designed to work with the original 850K instance unlabeled data. You can get this from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSV3dGX2huaHRpdHc

    You'll have to change the path in the ArffLoader to point to wherever you downloaded this file to.

    Cheers,
    Mark.

    Replies
    1. Thanks Mark,

      Thanks for providing the original data set; I operated on it successfully. Besides that, I have crawled some twitter data and transformed it into CSV format, but it failed to load in the ARFF viewer, where the specific error said "java.io.IOException: wrong number of values. Read 2, expected 1, read Token[EOL], line 18". So, how can I convert the CSV to ARFF format without the above error?

      Ken

    2. How do you configure the various components? Does running the .kfml file require the Weka API, or can we do it with the graphical interface?

  8. Can we export the model file created after classification in PMML format?

  9. Weka can consume certain PMML models, but doesn't have an export facility yet. This is on the roadmap for a future release.

    Cheers,
    Mark.

  10. Hi Mark,

    I have just started using Weka, and I want to use it for a simple text classification problem with features such as unigrams, word count and position, with a naive Bayes classifier. I have found some information online and in the documentation, but I am finding it difficult to understand the difference between the unigram feature set and the bag-of-words representation in arff files.

    Suppose I have an input arff file such as the one below:

    @relation text_files

    @attribute review string
    @attribute sentiment {dummy,negative, positive}

    @data
    "this is some text", positive
    "this is some more text", positive
    "different stuff", negative

    After applying the StringToWordVector and Reorder filters, the output arff will be:


    @relation 'bagOfWords'

    @attribute different numeric
    @attribute is numeric
    @attribute more numeric
    @attribute some numeric
    @attribute stuff numeric
    @attribute text numeric
    @attribute this numeric
    @attribute sentiment {dummy,negative,positive}

    @data

    {1 1,3 1,5 1,6 1,7 positive}
    {1 1,2 1,3 1,5 1,6 1,7 positive}
    {0 1,4 1,7 negative}

    Suppose I want to train the classifier with the unigram count in each class, that is, something like the following arff file:

    @relation 'bagOfWords WordCount'

    @attribute unigram string
    @attribute count numeric
    @attribute sentiment {dummy,negative,positive}

    @data
    "this",2,positive
    "is",2,positive
    "some",2,positive
    "text",2,positive
    "more",1,positive
    "different",1,negative
    "stuff",1,negative

    I understood the third representation clearly, and extending the feature set later (with unigram position etc.) seems relatively easy. But my doubt is: is this the correct way of representing the data in arff for the classifier?

    From the API, I gathered that with StringToWordVector I can set options such as -C and -T for word counts and term frequencies.

    Suppose I want to include other features in the bag-of-words arff file; how can I do that?


    Thank you in anticipation.

    Nirmala

    Replies
    1. Hi Nirmala,

      Your original input ARFF file (the one with the string attribute and class label) can have other features that you compute elsewhere. The StringToWordVector filter will only process the string attributes - other features will be left untouched.

      Cheers,
      Mark.

  11. How can I create the arff file of the current twitter data?

    Replies
    1. The original 850K tweet file in ARFF format is available for download - see the link in the earlier comments.

      Cheers,
      Mark.

  12. Hi

    I don't understand the configuration of the SubstringLabeler and SubstringReplacer filters. If you can help me I would be very grateful.

    Thanks

    Ana

    Replies
    1. If you download the example Knowledge Flow layout file you can take a look at the configuration I used for processing the tweets.

      Cheers,
      Mark.

  13. How can I incorporate the layout in Weka to configure the various components? Do I have to run it with Eclipse?

  14. Just launch the Knowledge Flow GUI from Weka's GUIChooser application. Alternatively, you can execute the flow layout from the command line or a script by invoking weka.gui.beans.FlowRunner.

    Cheers,
    Mark.

  15. This comment has been removed by the author.

  16. This comment has been removed by the author.

  17. It's alright, I ran it from the command line. My project is to do the same work but for statuses written in Arabic. Is it possible to do it with this template, and if yes, what do I have to change?

  18. Great article here.
    Could you please provide a simple example of streaming twitter data with groovy+jsonfieldextractor?

  19. Hi Mark,

    I am new to Weka and trying to use it for sentiment analysis on twitter data. Using the labelled data file given above, when I open that file in the Weka GUI and try to choose one of the Bayes classifiers, it disables/grays out all the contents under it and doesn't allow me to select one. Actually, I wanted to use labelled.arff as a training dataset with NaiveBayesMultinomialText. Can you please let me know if I am doing something terribly wrong, or is it not supposed to work that way?

  20. The twitter datasets basically contain only a single string attribute (holding the text of each tweet) and the class (in the case of the labelled data). There are only two classifiers in Weka that can operate directly on string attributes - SGDText and NaiveBayesMultinomialText. If you want to use other classifiers then you will have to wrap your chosen classifier and the StringToWordVector filter in a weka.classifiers.meta.FilteredClassifier instance. StringToWordVector performs the bag-of-words transformation along with TF-IDF and various options for tokenising, stemming and stop-word application.

    Cheers,
    Mark.

    Replies
    1. Thanks much for your response Mark! I am planning to use NaiveBayesMultinomialText for analysis.

  21. Hi Mark,
    I was trying the labelled twitter data (mentioned above) using NaiveBayesMultinomialText and SGD. I noticed some strange behaviour with SGD: it keeps running on the training data set and doesn't give any results. On the other hand, NaiveBayesMultinomialText does give results in finite time. I tried using SGD from Java code as well as through the Weka GUI. Are there any known performance issues with SGD handling large training data text files?

  22. If you were using the Explorer then it will have been training SGD in batch mode. In batch mode the default number of epochs for training is 500. Try reducing this to something smaller. To train SGD incrementally (which is basically just one pass over the data, or one epoch) use the KnowledgeFlow, the command line interface or, in your own code, call buildClassifier() with an empty set of instances (i.e. attribute information only) to initialize and then updateClassifier() for each instance to be trained on.

    Cheers,
    Mark.

    Replies
    1. hi Mark
      Can we classify the data you provided with an SVM in Weka? If yes, please tell me how.

  23. Hi Mark!

    I'm actually doing a student-level thesis on twitter sentiment analysis at a small scale, and I want a dataset for this type of analysis to complete my experimental process. I have downloaded some training sets, but they do not work in Weka 3.6. How can I find a dataset that can be used directly? As I am new to this, I need some guidance.

    Hope you understand.

    Thank you!!

  24. Hello Mark, I have a directory (Reviews) with two subdirectories (pos and neg), and I want to convert all the text files inside the subdirectories into .arff. I am using the "java weka.core.converters.TextDirectoryLoader" command for this, but it says "Access is denied". I also changed the location of the Reviews directory, but to no avail. Please help me out with this. Thanks

  25. Thank you very much Mark. I am trying to analyse twitter data and store it in three different folders: 1. positive, 2. negative, 3. neutral. Can you please help me with what changes I need to make in the Knowledge Flow?

    Replies
    1. You can use the "FlowByExpression" KnowledgeFlow step to route predicted instances to different ARFF (or CSV) Saver steps based on the predicted probability assigned to the positive (or negative) class. You can define thresholds for what constitutes a positive and negative case, and then use a small interval around 0.5 to define neutral.

      Cheers,
      Mark.

  26. Hi Mark,

    I have used WEKA when I was a student at RGU, Aberdeen. Now I want to use WEKA for sentiment analysis of 32,000 already-downloaded tweets on a given hashtag, collected using TAGS. Please guide me on how I need to go about this; it is for my research work and I will be free to share the data.

    Thank you
    Mike
    email: awoleye@yahoo.co.uk
    twitter handle: @OMAWOLEYE

  27. Hello Mark,
    Wonderful article; it will really be helpful for my thesis. I am using your labeled data in arff format, but I cannot apply the Bayes classifiers in WEKA. Please give me a way to do that.

    Thanks in advance
    Oyaliul

    Replies
    1. Try using the NaiveBayesMultinomialText classifier - this version of naive Bayes can operate directly on String attributes.

      Cheers,
      Mark.

  28. Hi Mark,
    I'm new to Weka. I am working on the recognition of handwritten digits; our teacher gave us a base of 10k images in a .txt file, but I don't know how to convert it to a .arff file.

  29. Hi,
    Can you help me understand how to make the tool interpret "not good" as negative, or "never thought that it would be good" as positive? TIA

  30. Hi Mark,

    I'm not able to download the flow and the data, as the links didn't work. Is there another way to get both of them quickly? This is my email, in case you'd like to send the data and the flow:
    philippmayer687@gmail.com

    Thanks.

  31. Hi Mark,
    This is Pranjali from Nasik City (India), working on twitter data.
    I need a twitter sentiment analysis dataset for MEKA. Could you help, please?

  32. How can I download the tweet data? Please send me the link.
