Data Science Applications - Twitter Sentiment Analysis of 2012 VP Debate
While we here at Kwelia spend our days (and several nights) working to bring cutting-edge techniques in data science to residential real estate, every now and again, other interesting applications of our techniques arise. It’s fall of a presidential election year, so much of America is preoccupied with the impending presidential election. Few would disagree that the most entertaining components of the candidates’ campaigns are the debates. It’s always a good time to watch them verbally joust against each other to solidify positions on issues and manifest their campaign rhetoric.
Who is the Winner?
But although the debates can be fun to watch generally, whether to poke fun at a candidate’s hair or to yell and call another a liar, they tend to get frustrating because there is often so much dissonance as to who addressed a topic better or even who won overall. While the networks determine debate victory by polling citizens, there is typically crazy variance among the networks. This variance phenomenon was amplified during last Thursday’s Vice Presidential Debate. While no one disagrees that it was a close battle, who is to say (objectively) that one candidate completely pummeled the other candidate?
Well, here is what the different networks told us based on their polling. According to the MediaMatters.org blog, “Snap polls released after the debate last night were mixed; a CBS poll of undecided voters found Biden winning 50-31, while CNN declared watchers “split” after their snap poll reported Ryan narrowly winning 48-44.” If this wasn’t biased (or utterly confusing) enough, the different sides are pointing to different (unscientific) polls as indicators of their side’s victory. For example, conservative media outlets are pointing to a CNBC.com poll that names Paul Ryan the winner – by a nose. But when you dig into the methodology behind the poll, it is nothing more than a popularity contest akin to a high school student government election. “Indeed, you can apparently vote multiple times across different browsers, and the results have fluctuated wildly over the past 15 hours. Last night, several conservative message boards and sites, including Free Republic and Tea Party Nation, posted links to the poll and encouraged their readers to vote in it. At various points, the poll has indicated that Ryan won the debate by twenty points and that Biden won the debate by 8. As of this writing, Ryan leads by 2 points with more than 190,000 votes cast.”
Let’s Find Another Way
So, to find a more objective way to determine how the Vice Presidential candidates fared on certain topics and even overall, our Chief Data Scientist decided to look beyond polling and analyze something more technologically forward…and even sexier. The answer was Twitter. His thought was that if you could measure the sentiment of all of the tweets sent during the debate, you could derive a fairly objective sense of the sentiment during certain topics. Further, it may be possible to aggregate positive sentiment and crown a victor as well.
Sentiment Analysis in Brief
For those who aren’t up to speed on sentiment analysis, Wikipedia describes it as analysis that “…aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).” Many of the techniques in sentiment analysis have been pioneered by renowned NLP professor Bing Liu. In fact, Professor Liu has authored software packages that automatically parse tweets for the words that indicate positive or negative sentiment. These more or less set the standard for which words indicate sentiment. While analyzing tweets may come off as a simple exercise, it is actually rather cumbersome. As in our normal data routines, data must be collected, cleaned, and then ultimately presented in a format that facilitates further analysis. Please check out Chris’s blog for some insights into his process behind this.
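To make the word-list idea concrete, here is a minimal sketch of lexicon-based scoring in Python. It is not Chris’s actual code: the two word lists are tiny made-up stand-ins for a full opinion lexicon, and the scoring rule (positive word count minus negative word count) is an assumption about one common way such lists are used.

```python
import re

# Hypothetical, hand-picked word lists for illustration only.
# A real analysis would load a full opinion lexicon instead.
POSITIVE_WORDS = {"great", "strong", "win", "good", "love"}
NEGATIVE_WORDS = {"liar", "bad", "weak", "lose", "hate"}

def score_tweet(text):
    """Return (number of positive words) minus (number of negative words)."""
    # Lowercase the tweet and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives
```

A score above zero suggests a positive tweet, below zero a negative one, and zero means the tweet contained no listed words or an equal number from each list. A scheme this simple cannot detect sarcasm or negation, which is part of why the real exercise is more cumbersome than it looks.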
The total data sample for this experiment was 363,163 tweets, which were collected roughly every 60 seconds throughout the course of the debate. As in our typical data collection work, we had to remove a number of tweets in order to clean the data. Duplicate tweets were removed, which left a final dataset of 81,124 unique tweets, of which Biden had 52,303 and Ryan had 28,821. Each point on the sentiment graph represents the tweets that were gathered in a given minute; intuitively, the farther above zero a point is, the more positive the sentiment of those tweets (and vice versa).
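The cleaning and aggregation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the analysis: the record fields (`candidate`, `minute`, `text`, `score`) are hypothetical, a duplicate is assumed to mean the same text for the same candidate, and each plotted point is assumed to be the mean score of that minute’s tweets.

```python
from collections import defaultdict

def dedupe(tweets):
    """Keep only the first occurrence of each (candidate, text) pair."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = (tweet["candidate"], tweet["text"])
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

def mean_score_by_minute(tweets):
    """Average the 'score' field for each (candidate, minute) bucket."""
    buckets = defaultdict(list)
    for tweet in tweets:
        buckets[(tweet["candidate"], tweet["minute"])].append(tweet["score"])
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}

# Made-up sample records; the scores are assumed to be precomputed.
sample = [
    {"candidate": "Biden", "minute": "21:08", "text": "great answer", "score": 1},
    {"candidate": "Biden", "minute": "21:08", "text": "great answer", "score": 1},  # duplicate
    {"candidate": "Biden", "minute": "21:08", "text": "weak reply", "score": -1},
    {"candidate": "Ryan", "minute": "21:08", "text": "bad start", "score": -1},
]
unique = dedupe(sample)
series = mean_score_by_minute(unique)
```

Deduplicating before averaging matters: a heavily repeated tweet would otherwise pull its minute’s point toward its own score.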
Key Movements to Note
A quick look at the sentiment graph shows that there were some interesting peaks and troughs throughout the debate. We’ve gone through some of the most drastic ones to correlate them with what was going on in the debate when sentiments rose or fell to such levels:
21:08 – This was during the foreign policy portion of the debate. You can note that Ryan’s sentiments were at lows during this early portion of the debate.
21:31 – This was during the piece when Biden accused Ryan of requesting stimulus funds. Ryan’s sentiments soared.
21:49 – This was during a Biden diatribe about what the Romney/Ryan camp may deem a small business (hedge funds perhaps?). Ryan’s positive sentiments soared again.
22:26 – This was during the closing statements for each candidate. Ryan’s negative sentiments reached lows.
Post-Debate - One interesting thing to note is that although the debate only lasted an hour and a half, Chris made sure to continue the sentiment chart for an additional 30 minutes beyond the debate’s duration. As you can note above, there were some interesting movements in each candidate’s sentiment. It’s almost as if the candidates’ respective sentiments were battling it out for post-debate positioning.
While this exercise proved more tedious than expected, we were quite pleased with the outcome. If anything, it made watching the debates more entertaining. Stay tuned for more, however. Now that the code has been written, we will run the same analysis for the next two presidential debates – starting with tomorrow’s.