General Description: This is an ongoing research project for predicting the 2012 US presidential election by analyzing conversations on Twitter.
Contributor: Kazem Jahanbakhsh
Implementation Period: September 25, 2012 - present
Problem Description, Motivation, & Methodology:
The US election is one month away, and a huge number of conversations about it are taking place on the web, particularly in social media. I wanted to tap into the data available on the web and see whether we can find any interesting patterns in these ongoing conversations. This becomes especially interesting when the aim is to use public data to predict the future. For now, I concentrate on some high-level analysis to see whether anything interesting shows up in the data. My methodology is simple: (1) collect the data, (2) use machine learning algorithms to mine and analyze the data, (3) visualize the results. In the following, I'll go through these three steps one by one. Below is the list of contents for the rest of this article:
I- Collecting Data from Twitter
II- Sentiment Analysis of Tweets
III- Political Tweets Polarity Results
IV- More on Sentiment Analysis Algorithm
V- Frequencies of Political Tweets
VI- Geographical Distribution of Political Tweets
VII- Temporal Changes in Frequent Words
I- Collecting Data from Twitter:
II- Sentiment Analysis of Tweets:
The second step is to use statistical analysis and machine learning techniques to analyze the collected tweets. I was very interested in seeing how people tweet about the two candidates: Obama and Romney. Thus, I decided to run sentiment analysis on the collected data in order to find the polarity of each tweet (whether it is positive or negative). To do so, I used the "Naive Bayes" classifier to compute the sign of each tweet (+/-), where positive means that a tweet supports the corresponding candidate and negative means the opposite. I use the NLTK Python package for the NLP phase. To train the classifier, I used the data collected by people at Cornell University, who have published polarity data for IMDb reviews.
I faced a few challenges while doing sentiment analysis. First, the vocabulary people use to review movies differs from the language people use to express an opinion about a candidate in a condensed way. Thus, the first question is how accurate our prediction will be if we use the IMDb review dataset for training the classifier. Let's check an example tweet:
Tweet: "If Obama doesn't win I'm F..D"
If I feed the above tweet to my classifier, it gives it a "-" sign for Obama; however, the tweet is actually positive for Obama. The classifier fails to detect the right polarity because the words of the sentence are negative on their own, while the grammar of the sentence makes it positive for Obama. To fix this issue, we would need a more sophisticated classifier than Naive Bayes. We need an algorithm that not only considers the polarity of individual words when computing the polarity of the whole tweet, but also takes into account the relationships between words and the roles they play in the sentence.
III- Political Tweets Polarity Results:
After running the classifier on the collected tweets (from Sep 29, 2012 to Oct 04, 2012), I got the following numbers of +/- tweets for each candidate. Almost 51% of the tweets are about Obama, whereas 49% are about Romney.
IV- More on Sentiment Analysis Algorithm:
Alex Thompson (a friend of mine) asked me to give more details on how the sentiment analysis algorithm works, and when and why it can fail. Given the limitations of computers in terms of learning and intelligence, we can always come up with new examples that the machine will fail to label correctly, at least the first time it sees them. However, we can correct its mistakes by labelling those new examples or patterns! Alex sent me a list of made-up sentences to find out how the algorithm labels them. This gives us better insight into the capabilities and limitations of the Naive Bayes classifier for sentiment analysis:
|obama better win, or else im going to live and be happy in another country.||-1|
|obama better win, or else im going to die and be angry in this country||-1|
|if obama wins, i will be happy||+1|
|if obama wins, i will not be happy||-1|
|if obama wins, i will be unhappy||+1|
|if obama wins, i will never be happy again.||+1|
|if obama wins, i will never be unhappy again.||+1|
|if obama wins, i will not be sad||+1|
|if obama wins, i will be so pissed drunk because i will party.||-1|
|i am not not going to vote for obama :)||-1|
The core of the algorithm is based on Bayes' theorem. Let's say you want to predict the label of the following sentence:
input: "obama better win, or else im going to live and be happy in another country."
Before we can determine the label of an input, we have to train the algorithm, and training the classifier requires a labelled dataset. As mentioned earlier, I used an available dataset of IMDb movie reviews for training. This dataset contains 1000 positive reviews and 1000 negative reviews. Here are two examples, one negative and one positive.
A negative review for a movie titled "Deep Impact (1998)": Deep Impact Review
A positive review for another movie named "As Good as It Gets": As Good as It Gets Review
In the training step, the algorithm goes through all labelled reviews one by one, segments each review into sentences, and tokenizes the sentences into words. Thus, from each review we obtain a list of pairs (word, label), where "word" is one word in the review and "label" is the review's label (+1 or -1). Having all of these pairs from all positive and negative reviews, for each word the algorithm counts the number of times the word occurs in positive reviews and the number of times it occurs in negative reviews. For example, for word="entertaining", it computes the following probabilities:
p("entertaining"|+) = the percentage of positive reviews that contain the word "entertaining"
p("entertaining"|-) = the percentage of negative reviews that contain the word "entertaining"
After the training phase, we have computed the above likelihoods for every word in the training data.
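The training step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used in this project: the `tokenize` and `train` helpers and the four toy reviews are hypothetical stand-ins for NLTK and the 1000 + 1000 movie reviews.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labelled_reviews):
    """Return {label: {word: fraction of that label's reviews containing the word}}."""
    doc_counts = Counter()   # number of reviews seen per label
    word_doc_counts = {}     # label -> Counter of reviews containing each word
    for text, label in labelled_reviews:
        doc_counts[label] += 1
        counts = word_doc_counts.setdefault(label, Counter())
        counts.update(set(tokenize(text)))  # count each word once per review
    return {
        label: {w: c / doc_counts[label] for w, c in counts.items()}
        for label, counts in word_doc_counts.items()
    }

# Hypothetical toy reviews, invented for this example only.
toy_reviews = [
    ("An entertaining and moving film", "+"),
    ("Entertaining cast, great script", "+"),
    ("A dull and boring plot", "-"),
    ("Boring, but oddly entertaining", "-"),
]

likelihoods = train(toy_reviews)
print(likelihoods["+"]["entertaining"])  # share of positive reviews with the word
print(likelihoods["-"]["entertaining"])  # share of negative reviews with the word
```

Counting each word once per review matches the "percentage of reviews that contain the word" definition given above.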
In the labelling phase, the algorithm takes the given input and computes its label.
input="obama better win, or else im going to live and be happy in another country."
It goes through the same steps. First, it tokenizes the sentence into its words:
w1="obama", w2="better", w3="win", and ... .
Next, it computes how probable it is that this tweet is positive and how probable it is that it is negative, as follows:
p(+|input) = probability that input is positive, proportional to p(+)*p(w1|+)*p(w2|+)*p(w3|+)*....
p(-|input) = probability that input is negative, proportional to p(-)*p(w1|-)*p(w2|-)*p(w3|-)*....
Because the training set contains equally many positive and negative reviews, the priors p(+) and p(-) are equal and do not affect the comparison. Depending on how likely the individual words of the input are under the positive and negative classes, the algorithm produces a total likelihood for the input to be positive or negative. Comparing these two likelihoods gives the label (the one with the greater likelihood).
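As a rough sketch of this comparison, the snippet below scores a tokenized input under each label and picks the larger score. The likelihood values in `model` are invented for illustration. Two details are my own assumptions and not described above: the products are computed as sums of logarithms to avoid numerical underflow, and words never seen in training get a small floor probability (`unseen`).

```python
import math

def score(words, likelihoods, prior, unseen=1e-3):
    """Log of prior * product of p(word | label); unseen words get a small floor."""
    total = math.log(prior)
    for w in words:
        total += math.log(likelihoods.get(w, unseen))
    return total

def classify(words, model, priors):
    """Return the label whose log-score is greatest."""
    return max(model, key=lambda label: score(words, model[label], priors[label]))

# Hypothetical likelihoods; real values would come from the training phase.
model = {
    "+": {"happy": 0.30, "win": 0.20, "die": 0.02, "angry": 0.03},
    "-": {"happy": 0.10, "win": 0.15, "die": 0.12, "angry": 0.15},
}
priors = {"+": 0.5, "-": 0.5}  # equal priors: 1000 positive and 1000 negative reviews

print(classify(["win", "happy"], model, priors))
print(classify(["win", "die", "angry"], model, priors))
```

With these made-up numbers, the first input is labelled "+" and the second "-", purely because of which words appear, not because of how they relate to each other.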
Why might the sentiment analysis algorithm fail?
1- The training dataset consists of IMDb reviews, while the testing dataset consists of political tweets. These two datasets have different probability distributions. The IMDb reviews are written by experts in the movie industry, who use a different vocabulary to explain their lengthy opinions about movies. Tweets, however, are written by ordinary people and are very short, usually one sentence (less descriptive). This makes it much harder for the algorithm to detect the label of tweets correctly.
2- Our algorithm considers words independently and ignores the relationships between them. This means that it ignores the role each word plays in a sentence. The example I gave at the beginning of this page highlights my point. Even if the sentence is negative by itself, considering the (grammatical) role the word "Obama" plays in the sentence, it expresses a positive sentiment for him!
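The second limitation can be made concrete with a tiny experiment. In the sketch below, two sentences that I made up for illustration say opposite things about the candidates, yet they reduce to exactly the same bag of words, so any classifier that only looks at word counts must give them the same label.

```python
from collections import Counter

def bag_of_words(sentence):
    """Reduce a sentence to word counts, discarding word order."""
    return Counter(sentence.lower().split())

# Hypothetical sentences with opposite meanings but identical words.
a = bag_of_words("obama will win and romney will not win")
b = bag_of_words("romney will win and obama will not win")

print(a == b)  # identical features, so the classifier cannot tell them apart
```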
V- Frequencies of Political Tweets:
One interesting question is how different events influence the amount of conversation among people in social media. For example, what is the effect of the debates on the number of tweets and their contents? By analyzing tweets we can find out, for instance, how well the candidates performed in a debate. In the following figure, I have plotted the number of tweets per hour since Sep 29, 2012. We can observe a continuous and periodic pattern in the tweet frequency over time, due to the 24-hour daily cycle. We also observe a gradual increase in the number of tweets as we approach Oct 03, 2012, when all of a sudden there is a drastic increase in the number of tweets. This coincides with the first US presidential debate. In particular, the maximum number of tweets on Oct 03 is around 160K, which is roughly three times the maximum of the previous day (55K). We also observe a damping trend in tweet numbers after the debate.
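Counting tweets per hour is a simple bucketing operation. The sketch below is one possible way to do it, assuming each tweet's creation time has already been converted to an ISO-8601 string; the `tweets_per_hour` helper and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

def tweets_per_hour(timestamps):
    """Count tweets in each clock hour, given ISO-8601 timestamp strings."""
    buckets = Counter()
    for ts in timestamps:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[hour] += 1
    return dict(sorted(buckets.items()))

# Hypothetical timestamps, invented for this example.
sample = [
    "2012-10-03T20:05:11",
    "2012-10-03T20:47:30",
    "2012-10-03T21:01:02",
    "2012-10-02T20:15:00",
]

hourly = tweets_per_hour(sample)
for hour, count in hourly.items():
    print(hour.isoformat(), count)
```

The resulting hour-to-count mapping is what a plot like the one described above would be drawn from.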
VI- Geographical Distribution of Political Tweets:
I selected the tweets that were posted from smartphones (e.g. the Android or iOS platform) in order to find out where those tweets are coming from. This gives us insight into the geographical distribution of tweets. You can click on GeoMap to view the geomap of political tweets. My current dataset only has location information for tweets posted after Sep 30, 2012. After finding the (city, state) pair for each posted tweet, I displayed a city/state on the map only if at least 20 tweets were posted from that location (between Sep 20 and Oct 5). The GeoMap link may load slowly because of performance issues! However, the following table lists the cities with the highest number of tweets.
|Washington D. C.,DC||575|
|New York City,NY||377|
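The 20-tweet display threshold can be expressed as a short filter over location counts. This is a sketch under the assumption that each tweet has already been resolved to a (city, state) pair; the `cities_to_display` helper and the sample counts are hypothetical.

```python
from collections import Counter

def cities_to_display(locations, min_tweets=20):
    """Return (location, count) pairs with at least min_tweets tweets, largest first."""
    counts = Counter(locations)
    return [(loc, n) for loc, n in counts.most_common() if n >= min_tweets]

# Hypothetical (city, state) pairs, one per tweet; the counts are made up.
locations = (
    [("Washington D. C.", "DC")] * 25
    + [("New York City", "NY")] * 21
    + [("Springfield", "IL")] * 3
)

print(cities_to_display(locations))
```

Locations below the threshold, such as the third one here, are dropped from the map.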
VII- Temporal Changes in Frequent Words:
One interesting subject to study is the most frequent words in tweets. I have computed the most frequent words (top 30) for three different time periods. The first time period covers the tweets posted from Sep 29, 2012 to Oct 01, 2012, which was before the first presidential debate (Oct 03, 2012). The following table shows the relative frequency distribution of the top 30 words. The results are shown as pairs "w: f", where w is the word and f is the relative frequency of w. All English stop words, punctuation symbols, and hashtags were removed before computing word frequencies. There are a few interesting observations. The most frequent words are the candidates' names, as expected, because we were explicitly searching for such words. However, "Obama" was mentioned almost twice as often as "Romney" before the debate. We also see that "Ryan" appears among the top 30 words. The verbs among the frequent words are: vote, like, get, see, think, win, want, know, and said. We also see that "debate" ranks #29.
|obama: 9.70%||romney: 5.16%||mitt: 2.25%|
|election: 1.42%||vote: 1.17%||president: 1.16%|
|amp: 0.98%||barack: 0.68%||like: 0.65%|
|get: 0.61%||us: 0.56%||would: 0.51%|
|america: 0.50%||gt: 0.48%||people: 0.48%|
|campaign: 0.47%||voting: 0.45%||ca: 0.43%|
|see: 0.42%||ryan: 0.42%||republican: 0.42%|
|one: 0.41%||think: 0.39%||time: 0.39%|
|win: 0.38%||want: 0.38%||know: 0.37%|
|said: 0.37%||debate: 0.35%||even: 0.35%|

Top 30 words, second time period:
|romney: 8.73%||obama: 8.10%||mitt: 2.55%|
|debate: 2.31%||like: 1.25%||president: 1.02%|
|amp: 0.72%||gt: 0.71%||vote: 0.71%|
|people: 0.65%||tonight: 0.63%||say: 0.60%|
|election: 0.58%||get: 0.57%||won: 0.56%|
|said: 0.54%||think: 0.53%||know: 0.52%|
|would: 0.50%||got: 0.49%||fuck: 0.45%|
|wins: 0.44%||na: 0.43%||one: 0.42%|
|black: 0.42%||debates: 0.41%||voting: 0.41%|
|saying: 0.41%||gon: 0.40%||win: 0.39%|

Top 30 words, third time period:
|obama: 7.75%||romney: 7.44%||mitt: 2.99%|
|debate: 2.38%||election: 1.40%||barack: 1.01%|
|president: 0.98%||like: 0.91%||wins: 0.80%|
|vote: 0.80%||last: 0.80%||amp: 0.75%|
|night: 0.73%||games: 0.68%||hunger: 0.68%|
|volunteer: 0.65%||tribute: 0.65%||big: 0.63%|
|fillwerrell: 0.63%||get: 0.53%||bird: 0.50%|
|people: 0.48%||said: 0.46%||wrong: 0.45%|
|would: 0.44%||says: 0.42%||gt: 0.39%|
|completely: 0.37%||go: 0.36%||got: 0.34%|
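For readers who want to reproduce this kind of table, here is a minimal sketch of the word-frequency computation. It makes several assumptions that may differ from what was actually used: tokens are split on whitespace, `STOP_WORDS` is a tiny hypothetical stand-in for a full English stop-word list, and relative frequency is taken over the words that remain after filtering.

```python
from collections import Counter
import string

# A tiny stand-in stop-word list; a real run would use a full English list.
STOP_WORDS = {"the", "a", "is", "in", "to", "and", "for", "of", "i"}

def relative_frequencies(tweets, top_n=30):
    """Return the top_n (word, relative frequency) pairs across all tweets."""
    words = []
    for tweet in tweets:
        for token in tweet.lower().split():
            if token.startswith("#"):                # drop hashtags
                continue
            token = token.strip(string.punctuation)  # drop surrounding punctuation
            if token and token not in STOP_WORDS:
                words.append(token)
    counts = Counter(words)
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(top_n)]

# Hypothetical tweets, invented for this example.
tweets = [
    "Obama is in the debate tonight #debates",
    "I vote for Obama!",
    "Romney and Obama debate",
]

for word, freq in relative_frequencies(tweets, top_n=3):
    print(f"{word}: {freq:.2%}")
```

The output uses the same "w: f" format as the tables above.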
 Twitter API
 Natural Language Toolkit
 Movie Review Data
 Visualization: Gmap
*** Please cite this article if you want to use any part of the results presented here.
You can follow me on Twitter: @kjahanbakhsh.