General description: Real-time software for measuring how healthy your place of living is.
Implementation period: January 2013
Technologies: The backend is written in Python and uses the NLTK package. The front end uses HTML5, JS, and CSS for visualization.
Event: Firefox OS App Day.
Contributors: Kazem Jahanbakhsh, Aras Balali Moghaddam, Li Yu
On January 26, 2013, I attended the Mozilla hackathon in Vancouver. This was a conference where people from Mozilla introduced their mobile operating system and the development framework for building mobile apps for Firefox OS. They will release their Firefox phones running their own OS very soon.
In the last two months we have seen a flu outbreak in the US and Canada, and a few people have died from it. So we thought it would be great if people could have a mobile/web application on their phones that tells them how likely they are to catch the flu. The spread of the flu can be modelled as a random process: a person has some chance p of catching the flu when he or she comes into close proximity with an infected person. Thus, the geographical distribution of sick people around you plays an important role in the probability that you catch the flu. The interesting question is where we can collect data about people who currently have the flu. I found a few research papers in which people used Twitter data for that. In fact, I noticed that some people tweet when they catch the flu! Here are a few examples of tweets that people posted on Twitter:
Tweet#1: "If I'm coming down with the flu, lord jesus have mercy on my soul and give that sickness to someone else" ---> p=0.70
Tweet#2: "Definitely coming down with the flu everyone's had :(" ---> p=0.9
I started collecting tweets in which people mentioned flu-related words. The first challenge is to use ML and NLP to work out, by analyzing the tweets, who really has the flu and who is just tweeting news about the flu. I trained a Naive Bayes classifier and used it to predict the chance that a tweet was posted by a person who really has the flu. For Tweet#1 and Tweet#2, my classifier computed a 70% and a 90% chance, respectively. However, the analysis is far from trivial, because people can tweet in many different ways. Here are a few interesting examples to highlight my point:
Tweet#3: "Seeing the Patriots score makes me sick and want to vomit." ---> p=0.03
Tweet#4: "The great Boston influenza scare of 2012... #really? http://t.co/tjWfQ4SA" ---> p=0.82
Tweet#5: "I'm praying that he don't have the flu" ---> p=0.97
Tweet#6: "@pegs_hanson @scascum yeh mate got back yest, we had a sick trip bitches!!! Yeh I'm keen for dinner during the week. Wednesday?............." ---> p=0.97
As we can see, Naive Bayes correctly classifies Tweet#3 as negative (it assigns a low probability, 3%, that the tweet is from an infected person). However, the classifier fails on the other examples! Although Tweet#4 contains the word "influenza", the tweet is just a news link shared by the author and doesn't imply that he or she is sick. Tweet#5 also mentions the word "flu", but the author is concerned about someone else who might have the flu, not about himself/herself. Finally, Tweet#6 contains the word "sick", but the author uses it with a completely different meaning (a positive sentiment here).
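To make the Tweet#6 failure concrete, here is a minimal bag-of-words Naive Bayes sketch. It is not the classifier used for the numbers above: the post does not show the features or the training set, so the `TinyNaiveBayes` class, the label names, and the six training sentences are all made up for illustration, and the sketch uses only the standard library instead of NLTK.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (a bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training tweets
        self.vocab = set()

    def train(self, labelled_tweets):
        for text, label in labelled_tweets:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def prob(self, text, label):
        """Posterior P(label | words), treating every word as independent."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lab, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[lab] / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            log_scores[lab] = score
        # Normalise in log space to avoid underflow on long tweets.
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        return weights[label] / sum(weights.values())


# Hypothetical, hand-written training data (NOT the labelled tweets from the project).
TRAINING = [
    ("i have the flu and feel sick", "sick"),
    ("so sick with the flu today", "sick"),
    ("coming down with a fever and cough", "sick"),
    ("flu season news report from the cdc", "not_sick"),
    ("new article about the flu vaccine", "not_sick"),
    ("great game tonight with friends", "not_sick"),
]

classifier = TinyNaiveBayes()
classifier.train(TRAINING)

# The word "sick" only ever appeared in "sick" tweets, so slang usage still scores high.
print(classifier.prob("we had a sick trip", "sick"))
print(classifier.prob("great game with friends", "sick"))
```

Because each word contributes to the score independently, "sick" pushes the probability up no matter which word it sits next to. That is the same kind of mistake as in Tweet#6.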
The examples above show that the Naive Bayes classifier performs moderately well after being trained on manually labelled tweets. However, it fails on more intricate examples because it doesn't take into account the dependencies between the words inside a tweet (that is, the context). In particular, to improve its performance, I have to augment the ML classifier's results with the part-of-speech (POS) information returned by a natural language parser. With the POS information, I can easily correct the ML mistake for Tweet#4, because "influenza" is linked with the city of Boston and doesn't relate to the tweet's author. The same goes for Tweet#5, where "flu" relates to "he", not to the author. Finally, in Tweet#6 the POS of the word "sick" is adjective, and it qualifies the word "trip". Therefore, assuming we had a perfect parser that could generate the POS for all tweets without any mistakes, we would be able to fix those false-positive examples!
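As a rough sketch of what such a correction step could look like, the snippet below applies two hand-written rules to tweets that have already been tagged. Everything here is an assumption for illustration: the rule set, the word lists, and the function name are hypothetical, and the Penn Treebank style tags are written by hand rather than produced by a real tagger or parser. The rules also lean on word adjacency, which is only a crude stand-in for the word-to-word relations described above.

```python
# Hypothetical word lists; a real system would need much larger ones.
FLU_WORDS = {"flu", "influenza", "sick"}
FIRST_PERSON = {"i", "me", "my", "we"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def looks_like_false_positive(tagged):
    """Return True if a simple rule says the flu word is not about the author.

    `tagged` is a list of (word, tag) pairs in Penn Treebank style.
    """
    for i, (word, tag) in enumerate(tagged):
        if word.lower() not in FLU_WORDS:
            continue
        # Rule 1: an adjective directly followed by a noun qualifies that noun
        # ("sick trip"), so it does not describe the author.
        if tag == "JJ" and i + 1 < len(tagged) and tagged[i + 1][1] in NOUN_TAGS:
            return True
        # Rule 2: if the nearest preceding personal pronoun is not first person
        # ("he ... flu"), the flu word refers to somebody else.
        for prev_word, prev_tag in reversed(tagged[:i]):
            if prev_tag == "PRP":
                if prev_word.lower() not in FIRST_PERSON:
                    return True
                break
    return False


# Hand-tagged fragments of the tweets discussed above (tags are assumptions).
tweet6 = [("we", "PRP"), ("had", "VBD"), ("a", "DT"), ("sick", "JJ"), ("trip", "NN")]
tweet5 = [("I", "PRP"), ("'m", "VBP"), ("praying", "VBG"), ("that", "IN"),
          ("he", "PRP"), ("do", "VBP"), ("n't", "RB"), ("have", "VB"),
          ("the", "DT"), ("flu", "NN")]
tweet2 = [("Definitely", "RB"), ("coming", "VBG"), ("down", "RP"),
          ("with", "IN"), ("the", "DT"), ("flu", "NN")]

for name, tagged in [("Tweet#6", tweet6), ("Tweet#5", tweet5), ("Tweet#2", tweet2)]:
    print(name, looks_like_false_positive(tagged))
```

These two rules do not cover Tweet#4. Deciding that "influenza" belongs to "Boston" rather than to the author needs a real parse of the sentence, which is why the argument above assumes a perfect parser.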
After classifying tweets in real time, we built a web application that consumes the data classified by Naive Bayes and computes the probability that a person will catch the flu in the near future, taking into account the geographical distribution of sick people around him/her. We built this app during the Mozilla event, and we won three Firefox smartphones! You can check out the web app at the following link:
Starling Flu Predictor Web App
You can download the Starling flu predictor source code from its GitHub repository: Starling flu predictor.
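For readers who want a feel for the risk computation without opening the repository, here is one possible way to combine the classifier output with geography. It is a sketch under stated assumptions, not the formula used in the Starling app: the per-contact chance `p_transmit`, the exponential distance weighting, the `radius_km` scale, and the function names are all hypothetical. Each geotagged tweet is treated as an independent chance of infection, in the spirit of the "chance p" model described earlier.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(min(1.0, a)))


def flu_risk(user, tweets, p_transmit=0.1, radius_km=5.0):
    """Probability of catching the flu from at least one nearby tweeter.

    user:   (lat, lon) of the person asking
    tweets: iterable of (lat, lon, p_sick), where p_sick is the classifier's
            probability that the tweet's author really has the flu
    p_transmit, radius_km: assumed model parameters, not values from the app
    """
    p_escape = 1.0  # probability of not being infected by anyone
    for lat, lon, p_sick in tweets:
        distance = haversine_km(user[0], user[1], lat, lon)
        weight = math.exp(-distance / radius_km)  # closer tweets count more
        p_escape *= 1.0 - p_transmit * p_sick * weight
    return 1.0 - p_escape


# Illustrative call with made-up tweets around downtown Vancouver.
me = (49.2827, -123.1207)
nearby_tweets = [(49.2850, -123.1150, 0.9), (49.2700, -123.1300, 0.7)]
print(flu_risk(me, nearby_tweets))
```

The product form means the risk never exceeds 1 and grows with every additional sick person nearby, while tweets from far away contribute almost nothing.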
Google also has a web app for predicting the flu. It relies heavily on the terms people search for (e.g. flu-related words) and uses those numbers to predict a flu outbreak:
Google Flu Trends App
I've also read a very recent article reporting that researchers have found a flaw in the Google Flu Trends app:
Disruptions: Data Without Context Tells a Misleading Story
I think the main problem with the Google Flu Trends app is that it doesn't consider the context of a search, as the article above points out. People may search for the flu simply because everybody is talking about it (media coverage), and that doesn't necessarily mean that they have the flu! I don't think Twitter has the same issue, since people may explicitly tweet when they have the flu. However, Twitter users might not be a very representative sample of the population, because Twitter is mostly used by young people!
Tracking Social Media Trends and Their Influence on E-Commerce Markets:
Another interesting problem would be to collect Twitter data and test whether we can detect social cascade behaviors early enough. In other words, assuming that innovators and early adopters tweet about new ideas, events, and products, one might be able to pick up indicator signals before those behaviors, practices, opinions, conventions, and technologies become epidemic.
True-positive signals about an epidemic behavior could clearly be used to give recommendations to online sellers on eBay! Through a colleague who works for Terapeak, I got access to eBay sales data for 2012/2013. With this data, we tested the correlation between the eBay sales rate of flu-related medicines and the flu rate computed from our Twitter data. We wrote an article about tracking Twitter trends to find out what to sell on eBay, which was published on Terapeak's blog. You can find the article at the following link:
Tracking Social Media Trends and Their Influence on E-Commerce Markets.
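The correlation test mentioned above can be sketched in a few lines. The exact method used for the article is not described here, so this assumes a plain Pearson correlation, plus an optional lag to check whether the Twitter signal leads sales. The function names are hypothetical, and the two weekly series are synthetic numbers chosen so that sales follow the flu rate by exactly one week. They are not the eBay or Twitter data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(signal, sales, lag):
    """Correlate signal[t] with sales[t + lag] for a non-negative lag in steps."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(signal, sales)
    return pearson(signal[:-lag], sales[lag:])


# Synthetic weekly series, for illustration only.
twitter_flu_rate = [1, 2, 4, 7, 5, 3, 2]
medicine_sales = [10, 10, 20, 40, 70, 50, 30]

for lag in (0, 1):
    print(lag, lagged_correlation(twitter_flu_rate, medicine_sales, lag))
```

A higher correlation at a positive lag than at lag 0 would suggest that the tweets act as an early indicator, which is what a seller would need in order to stock up in time.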
The photo was taken by Roland Tanglao.
You should follow me on Twitter: @kjahanbakhsh.