How We Built TVFMO #4 – Calculating Sentiment

Last time we talked about how we calculated the top countries measure in TVFMO. This time, we’re going to look at how we calculate sentiment; more specifically, how we rate a post as being positive or negative.

There are two ways of doing this: deterministically or heuristically. For technical reasons I have used the former method, and that is what we are going to examine in this post. However, we may look at the heuristic method later, just for the sake of completeness.

The algorithm I use for deterministically scoring a Tweet as negative or positive is straightforward:

1. Take a sample of the tweets from the domain under analysis
2. Extract the adjectives
3. Weigh the adjectives as being positive (+1) or negative (-1)
4. For each tweet sent
   a. Tokenize the words
   b. For each word
      Does it appear in the weighted dictionary?
      Yes – retrieve that word’s weighting and add to accumulator
      No – Continue.
   c. If the accumulator for the Tweet is > 0 then increment positive total
   d. If the accumulator for the Tweet is < 0 then increment the negative total
5. Cache positive and negative totals.
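The scoring steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the production code: the `WEIGHTS` dictionary, the simple whitespace tokenizer, and the function names are all assumptions made for the example.

```python
import string

# Hypothetical weighted dictionary: +1 for positive adjectives, -1 for negative.
WEIGHTS = {"great": 1, "good": 1, "awful": -1, "bad": -1}


def tokenize(tweet):
    """Split on whitespace, lower-case, and strip surrounding punctuation."""
    words = (word.strip(string.punctuation).lower() for word in tweet.split())
    return [word for word in words if word]


def score_tweet(tweet, weights):
    """Sum the weights of every word in the tweet that is in the dictionary."""
    return sum(weights.get(word, 0) for word in tokenize(tweet))


def tally(tweets, weights):
    """Count positive and negative tweets; neutral ones (score 0) are ignored."""
    positive = negative = 0
    for tweet in tweets:
        score = score_tweet(tweet, weights)
        if score > 0:
            positive += 1
        elif score < 0:
            negative += 1
    return positive, negative


sample = ["Great show, really good!", "That was awful.", "Watching TV now."]
totals = tally(sample, WEIGHTS)
print(totals)  # (positive, negative)
```

Words that are not in the dictionary contribute nothing, so a tweet with no weighted adjectives scores zero and is left out of both totals.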

As you can see, in this particular analysis we are not interested in neutral tweets, so they are ignored. Notice also that this algorithm can be extended: instead of just weighting the words as positive or negative, we can weight them in terms of how positive or how negative they are, like so:

This would allow us not only to state that the sentiment of the tweet was negative or positive, but also to state how negative or positive it was.
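To make the idea of graded weights concrete, here is a small sketch. The words and their values are invented for the example and are not taken from the TVFMO dictionary.

```python
# Hypothetical graded weights: the magnitude expresses strength of feeling.
GRADED_WEIGHTS = {
    "good": 1, "great": 2, "brilliant": 3,
    "bad": -1, "awful": -2, "dreadful": -3,
}


def graded_score(tweet, weights):
    """Return the summed weight of the known words in the tweet."""
    words = tweet.lower().split()
    return sum(weights.get(word, 0) for word in words)


mild = graded_score("that was good", GRADED_WEIGHTS)
strong = graded_score("that was brilliant", GRADED_WEIGHTS)
mixed = graded_score("great cast but dreadful plot", GRADED_WEIGHTS)
print(mild, strong, mixed)
```

The sign of the score still gives the direction of the sentiment, while its size gives a rough measure of intensity.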

So the first thing we have to do is gather a sample of tweets, say all those from the last hour, extract the adjectives, and write them to a file to be weighted:

The function getTweetsForLastHour() is a helper function…

Which, in turn, calls two more helper functions, lastHourInTwitterDateFormat() and getTweetText(), which are self-explanatory…

The rest of the harvester code tokenizes the words and tags their parts of speech, then selects those that are tagged as adjectives and don’t start with // (this is how we exclude the shortened URLs that the default POS tagger in the NLTK library also tags as adjectives).
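A minimal sketch of that adjective-harvesting step might look like the following. The function names, the output layout (one adjective per line), and the pluggable `tagger` argument are assumptions for illustration; `nltk.word_tokenize` and `nltk.pos_tag` are real NLTK calls, but they need NLTK's tokenizer and tagger data to be installed before they will run.

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags


def nltk_tagger(text):
    """Tokenize and POS-tag text with NLTK (needs its tokenizer and tagger data)."""
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(text))


def extract_adjectives(tweets, tagger=nltk_tagger):
    """Collect the distinct adjectives in the tweets, skipping URL fragments."""
    adjectives = set()
    for text in tweets:
        for word, tag in tagger(text):
            # Fragments of shortened URLs start with // and are discarded.
            if tag in ADJECTIVE_TAGS and not word.startswith("//"):
                adjectives.add(word.lower())
    return sorted(adjectives)


def write_adjectives(adjectives, path):
    """Write one adjective per line, ready for a weighting column to be added."""
    with open(path, "w", encoding="utf-8") as handle:
        for adjective in adjectives:
            handle.write(adjective + "\n")
```

Passing the tagger in as an argument keeps the filtering logic easy to try out with a stub before the NLTK data is available.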

The created file can then be opened in Excel and a weighting added to the adjectives.

After the adjectives are weighted, we can score each day’s tweets.

Here we create a datetime object from the day and month passed in and get the day range in Twitter date format, before retrieving the tweets for that day. Having done that, we then score all the tweets and cache the positive and negative totals, by day, in Redis.
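The overall flow of that daily scoring step can be sketched as below. Everything here is illustrative: the `year` parameter, the `fetch_tweets` and `score_tweet` callables, and the key format are assumptions, and a plain dictionary stands in for Redis so the example runs without a server.

```python
from datetime import datetime, timedelta


def score_day(day, month, year, fetch_tweets, score_tweet, cache):
    """Score one day's tweets and store the positive/negative totals in cache."""
    start = datetime(year, month, day)
    end = start + timedelta(days=1)
    positive = negative = 0
    for tweet in fetch_tweets(start, end):
        score = score_tweet(tweet)
        if score > 0:
            positive += 1
        elif score < 0:
            negative += 1
    # Hypothetical key layout: one entry per calendar day.
    key = "sentiment:" + start.strftime("%Y-%m-%d")
    cache[key] = {"positive": positive, "negative": negative}
    return cache[key]


# Stub data source and scorer, purely for demonstration.
fake_scores = {"love it": 1, "hate it": -1, "meh": 0}
cache = {}
totals = score_day(5, 3, 2012, lambda start, end: list(fake_scores),
                   fake_scores.get, cache)
print(totals)
```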

The code to get the day range, in Twitter format, is as follows.
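As a sketch of what such a helper could look like, the example below assumes that "Twitter format" means the `created_at` style used in Twitter API responses (for example `Mon Mar 05 00:00:00 +0000 2012`) and that all times are UTC. The function names and the `year` parameter are hypothetical.

```python
from datetime import datetime, timedelta

# Fixed English names, so the output does not depend on the system locale.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def to_twitter_date(moment):
    """Format a UTC datetime like 'Mon Mar 05 00:00:00 +0000 2012'."""
    return "{} {} {:02d} {} +0000 {}".format(
        DAYS[moment.weekday()], MONTHS[moment.month - 1], moment.day,
        moment.strftime("%H:%M:%S"), moment.year)


def day_range(day, month, year):
    """Return the first and last second of the given day in Twitter format."""
    start = datetime(year, month, day)
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return to_twitter_date(start), to_twitter_date(end)


start_text, end_text = day_range(5, 3, 2012)
print(start_text)
print(end_text)
```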

Scoring the Tweets is done using the following function.

The word scores are retrieved from the cache, like so.

And, if required, the cache is built using this code.
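The retrieve-or-build pattern described in the last two steps can be sketched like this. The assumptions are that the weighted file is saved from Excel as CSV with `word,weight` rows and no header, and that a plain dictionary stands in for Redis; the function names and the `word_scores` key are hypothetical.

```python
import csv
import io


def load_weights(handle):
    """Read 'word,weight' rows from an open CSV file into a dictionary."""
    weights = {}
    for row in csv.reader(handle):
        # Skip blank rows and adjectives that were never given a weighting.
        if len(row) >= 2 and row[1].strip():
            weights[row[0].strip().lower()] = int(row[1])
    return weights


def get_word_scores(cache, open_weights_file):
    """Return the cached word scores, building the cache on first use."""
    if "word_scores" not in cache:
        with open_weights_file() as handle:
            cache["word_scores"] = load_weights(handle)
    return cache["word_scores"]


def open_sample():
    """Stand-in for opening the weighted file exported from Excel."""
    return io.StringIO("great,1\nawful,-1\nunrated,\n")


cache = {}
scores = get_word_scores(cache, open_sample)
print(scores)
```

The file is only read when the cache is empty, so repeated scoring runs reuse the same word scores.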

In this post I’ve demonstrated how we deterministically calculate the sentiment for each Tweet in a given domain. If you have any questions please add them to the comments section or email me at gary.short@gibraltarsoftware.com.