Sentiment analysis on Twitter using word2vec and keras

1 - Introduction

In this post I am exploring a new way of doing sentiment analysis. I'm going to use word2vec.

word2vec is a family of shallow neural-network models developed at Google with the aim of capturing the context of words, while at the same time proposing a very efficient way of preprocessing raw text data. The model takes as input a large corpus of documents, like tweets or news articles, and generates a vector space of typically several hundred dimensions. Each word in the corpus is assigned a unique vector in that space.

The powerful concept behind word2vec is that word vectors that are close to each other in the vector space represent words that are not only of the same meaning but of the same context as well.

What I find interesting about these vector representations is that they automatically embed several features that we would normally have to handcraft ourselves. Since word2vec is trained with neural networks to detect patterns, we can rely on it to pick up features at different levels of abstraction.

Let's look at these two charts I found in this blog. They visualize some word vectors projected onto a 2D space after a dimensionality reduction.

A couple of things to notice:

On the right chart, words of similar meaning, concept and context are grouped together. For example, niece, aunt and sister are close to each other since they all describe females and family relationships. Similarly, countess, duchess and empress are grouped together because they represent female royalty.
The second thing to notice is that the geometric distance between words translates a semantic relationship. For example, the vector woman - man is roughly collinear with the vector queen - king, something we would read as "woman is to man as queen is to king". This means that word2vec is able to infer different relationships between words, something that we humans do naturally.

The chart on the left is quite similar to the one on the right, except that it translates the syntactic relationships between words. slow - slowest = short - shortest is such an example.

On a more general level, word2vec embeds non-trivial semantic and syntactic relationships between words. This results in preserving a rich context.

In this post we'll be applying the power of word2vec to build a sentiment classifier. We'll use a large dataset of 1.5 million tweets, where each tweet is labeled 1 when it's positive and 0 when it's negative. The word2vec model will learn a representation for every word in this corpus, a representation that we'll then use to transform tweets, i.e. sentences, into vectors as well. Finally, we'll use this new representation of tweets to train a neural network classifier with Keras (since we already have the labels.)

Do you see how useful word2vec is for this text classification problem? It provides enhanced feature engineering for raw text data (not the easiest form of data to process when building classifiers.)

Ok, now let's put word2vec into action on this dataset.

2 - Environment set-up and data preparation

Let's start by setting up the environment.

To have a clean installation that would not mess up my current python packages, I created a conda virtual environment named nlp on an Ubuntu 16.04 LTS machine. The python version is 2.7.

conda create -n nlp python=2.7 anaconda

Now activate the environment.

source activate nlp

Inside this virtual environment, we'll need to install these libraries:

gensim is a Python library for natural language processing. It makes text mining, cleaning and modeling very easy. It also provides an implementation of the word2vec model.

Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano. We'll be using it to train our sentiment classifier. In this tutorial, it will run on top of TensorFlow.

TensorFlow is an open source software library for machine learning. It's been developed by Google to meet their needs for systems capable of building and training neural networks.

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets.

tqdm is a cool progress bar utility package that I use to monitor dataframe creation (yes, it integrates with pandas) and loops.
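Inside the activated environment, all of these (plus pandas and scikit-learn, which later snippets rely on) can be installed with pip, for example:

```shell
pip install gensim keras tensorflow bokeh tqdm pandas scikit-learn
```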

Here's a sample of the data (SentimentText and its Sentiment label):

SentimentText                                                                          Sentiment
@jonah_bailey Sorry about the loss. I have been there and it sucks. Have a great day!      0
I think I pulled a pectoral muscle. And no, I'm not kidding.                               1
My room is TRASHED                                                                         1
Raining.                                                                                   0
at work the stupidst job in the world LoL I can't wait until my last day YAY!
The raw format of the SentimentText column is not directly usable. It needs to be tokenized and cleaned.

We will also limit the dataset to 1 million tweets.

Here's my tokenizing function: it splits each tweet into tokens and removes user mentions, hashtags and urls. These elements are very common in tweets but unfortunately they do not provide enough semantic information for the task. If you manage to successfully integrate them in the final classification, please tell me your secret.
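A minimal sketch of such a function could look like this (the regular expressions are illustrative choices, not necessarily the exact ones used):

```python
import re

def tokenize(tweet):
    # lowercase the tweet, strip mentions, hashtags and urls, then split
    try:
        tweet = tweet.lower()
        tweet = re.sub(r'@\w+', '', tweet)          # user mentions
        tweet = re.sub(r'#\w+', '', tweet)          # hashtags
        tweet = re.sub(r'https?://\S+', '', tweet)  # urls
        return tweet.split()
    except Exception:
        # tokenization errors (e.g. broken encodings) are flagged
        # with 'NC' and filtered out in the next step
        return 'NC'

print(tokenize("@jonah_bailey what a great day! http://example.com #sunny"))
# ['what', 'a', 'great', 'day!']
```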

The results of the tokenization then need to be cleaned: we remove the rows marked 'NC', which result from tokenization errors (usually due to weird encodings).

def postprocess(data, n=1000000):
    data = data.head(n)
    # progress_map is a variant of the map function plus a progress bar.
    # Handy to monitor DataFrame creations.
    data['tokens'] = data['SentimentText'].progress_map(tokenize)
    data = data[data.tokens != 'NC']
    data.reset_index(inplace=True)
    data.drop('index', inplace=True, axis=1)
    return data

data = postprocess(data)

The data is now tokenized and cleaned, and we are ready to feed it into the word2vec model.

3 - Building the word2vec model

You can check that the vector of any word from the corpus, tweet_w2v['good'] for example, is a 200-dimensional vector. Of course, we can only get vectors for words that appear in the corpus.

Let's try something else. We spoke earlier about semantic relationships. Well, gensim's Word2Vec implementation provides a handy method named most_similar.
Given a word, this method returns the top-n most similar words. This is an interesting feature. Let's try it on some words:

For a given word, we get words that share its context, i.e. words with a high probability of appearing close to it in most of the tweets.

It's interesting to see that our model gets facebook, twitter, skype together and bar, restaurant and cafe together as well. This could be useful for building a knowledge graph. Any thoughts about that?

How about visualizing these word vectors? We first have to reduce their dimension to 2 using t-SNE.
Then, using an interactive visualization tool such as Bokeh, we can map them directly onto a 2D plane and interact with them.

Here's the script, and the bokeh chart below.

# importing bokeh library for interactive dataviz
import pandas as pd
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

# defining the chart
output_notebook()
plot_tfidf = bp.figure(plot_width=700, plot_height=600,
                       title="A map of 5000 word vectors",
                       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                       x_axis_type=None, y_axis_type=None, min_border=1)

# getting a list of word vectors. limit to 5000. each is of 200 dimensions
word_vectors = [tweet_w2v[w] for w in tweet_w2v.wv.vocab.keys()[:5000]]

# dimensionality reduction. converting the vectors to 2d vectors
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_w2v = tsne_model.fit_transform(word_vectors)

# putting everything in a dataframe
tsne_df = pd.DataFrame(tsne_w2v, columns=['x', 'y'])
tsne_df['words'] = tweet_w2v.wv.vocab.keys()[:5000]

# plotting. the corresponding word appears when you hover over a data point
plot_tfidf.scatter(x='x', y='y', source=tsne_df)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips = {"word": "@words"}
show(plot_tfidf)

Zoom in, zoom out, pan around and navigate the graph. When hovering over a point, you can see the corresponding word. Convince yourself that grouped data points correspond to words of similar context.

4 - Building a sentiment classifier

Let's now get to the sentiment classification part. So far, we have a word2vec model that converts each word of the corpus into a high-dimensional vector. It seems to work fine according to the similarity tests and the bokeh chart.

In order to classify tweets, we have to turn them into vectors as well. How could we do this? Well, this task is almost done. Since we know the vector representation of each word composing a tweet, we have to "combine" these vectors together and get a new one that represents the tweet as a whole.

A first approach consists in averaging the word vectors of a tweet. But a slightly better solution I found is to compute a weighted average, where each weight reflects the importance of the word with respect to the corpus. Such a weight could be the tf-idf score. To learn more about tf-idf, you can look at my previous article.
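To get those weights, we can fit a tf-idf model on the tokenized tweets and keep a word-to-idf dictionary. A sketch with scikit-learn (the identity analyzer is used because the tweets are already token lists; min_df is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy tokenized tweets standing in for x_train
x_train = [['good', 'day'], ['bad', 'day', 'today'], ['good', 'morning']]

# analyzer=lambda x: x skips re-tokenization, since the input is token lists
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=1)
vectorizer.fit(x_train)

# word -> idf weight; rarer words get larger weights
tfidf = {word: vectorizer.idf_[i] for word, i in vectorizer.vocabulary_.items()}
print(sorted(tfidf)[:3])  # ['bad', 'day', 'good']
```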

Now let's define a function that, given a list of tweet tokens, creates an averaged tweet vector.

def buildWordVector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:
            # handling the case where the token is not
            # in the corpus. useful for testing.
            continue
    if count != 0:
        vec /= count
    return vec

Now we convert x_train and x_test into lists of vectors using this function.
We also scale each column to have zero mean and unit standard deviation.
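Concretely, with small dictionary stand-ins for tweet_w2v and tfidf so the snippet runs on its own (in the real pipeline these come from the previous steps and n_dim is 200):

```python
import numpy as np
from sklearn.preprocessing import scale

n_dim = 4  # 200 in the real model; tiny here for illustration

# stand-ins for the trained word2vec model and the tf-idf weights
tweet_w2v = {'good': np.ones(n_dim), 'bad': -np.ones(n_dim),
             'day': np.full(n_dim, 0.5)}
tfidf = {'good': 1.5, 'bad': 1.5, 'day': 1.0}

def buildWordVector(tokens, size):
    # tf-idf-weighted average of the word vectors of a tweet
    vec = np.zeros((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:
            continue  # word unseen during training
    if count != 0:
        vec /= count
    return vec

x_train = [['good', 'day'], ['bad', 'day'], ['good'], ['bad', 'unknown']]
train_vecs_w2v = np.concatenate([buildWordVector(t, n_dim) for t in x_train])
train_vecs_w2v = scale(train_vecs_w2v)  # zero mean, unit std per column
print(train_vecs_w2v.shape)  # (4, 4)
```

The same conversion and scaling are applied to x_test.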

Almost 80% accuracy. This is not bad. We could eventually tune more parameters in the word2vec model and in the neural network classifier to reach a higher accuracy. Please tell me if you manage to do so.
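For reference, the kind of network used here can be sketched with Keras roughly as follows (random data stands in for the scaled tweet vectors; the layer width, optimizer and epoch count are my assumptions):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

n_dim = 200
rng = np.random.RandomState(0)
# stand-ins for the scaled tweet vectors and their 0/1 labels
train_vecs_w2v = rng.rand(200, n_dim)
y_train = rng.randint(0, 2, size=200)

# a simple feed-forward binary classifier
model = Sequential([
    Input(shape=(n_dim,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_w2v, y_train, epochs=3, batch_size=32, verbose=0)

loss, acc = model.evaluate(train_vecs_w2v, y_train, verbose=0)
```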

5 - Conclusion

In this post we explored different tools to perform sentiment analysis: We built a tweet sentiment classifier using word2vec and Keras.

The combination of these two tools resulted in a 79% classification model accuracy.

This Keras model can be saved and applied to other tweet data, for example streaming data collected with the tweepy library. It could also be interesting to wrap this model in a web app with a D3.js visualization dashboard.

Regarding improvements to this classifier, we could investigate the doc2vec model, which extracts vectors from sentences and paragraphs directly. I tried this model first but got a lower accuracy score of 69%, so please tell me if you manage to do better.

I hope this tutorial was a good introduction to word embeddings. Since I'm still learning my way through this awesome topic, I'm open to suggestions and recommendations.