Following up on my earlier post, as the frequency-based models were not very accurate and a good rule-based model was very hard to elaborate, we implemented what we known to be state-of-the-art methods for sentiment analysis on short sentences and make a list of the pros and cons of these methods. We train all of them on a 10.000 sentences dataset. These sentences are classified as positive, neutral, and negative by human experts. We the benchmark the models on a hold out sample of 500 sentences.

Word representations in a vector space

Feature extraction

To build a deep-learning model for sentiment analysis, we first have to represent our sentences in a vector space. We studied frequency-based methods in a previous post. They represent a sentence either by a bag-of-words, which is a list of the words that appear in the sentence with their frequencies, or by a term frequency – inverse document frequency (tf-idf) vector where the word frequencies in our sentences are weighted with their frequencies in the entire corpus.

These methods are very useful for long texts. For example, we can describe very precisely a newspaper article or a book by its most frequent words. However, for very short sentences, it’s not accurate at all. First, because 10 words are not enough to aggregate. But also because the structure of the sentence is very important to analyze sentiment and tf-idf models hardly capture negations, amplifications, and concessions. For instance, “Very good food, but bad for service…” would have the same representation as “Bad for food, but very good service!”.

Word vectors

We represent our sentences with vectors that take into account both the words that appear and the semantic structure. A first way to do this is to represent every word with an n-feature vector, and to represent our sentence with a n*length matrix. We can for instance build a vector of the same size as the vocabulary (10.000 for instance), and to represent the i-th word with a 1 in the i-th position and 0 elsewhere.

He trains this model and represents the word “ants” by the output vector of the hidden layer. The features of these word vectors we obtain capture most of the semantic information, because it captures enough information to evaluate the statistical repartition of the word that follows “ants” in a sentence.

What we do is similar. We represent every word by an index vector. And we integrate in our deep learning model a hidden layer of linear neurons that transforms these big vectors into much smaller ones. We take these smaller vectors as an input of a convolutional neural network. We train the model as a whole, so that the word vectors we use are trained to fit the sentiment information of the words, i.e. so that the features we get capture enough information on the words to predict the sentiment of the sentence.

Sentence representations

Doc2vec

We want to build a representation of a sentence that takes into account not only the words that appear, but also the sentence’s semantic structure. The easiest way to do this is to superpose these word vectors and build a matrix that represents the sentence. There is another way to do it, that was also developed by Tomas Mikolov and is usually called Doc2Vec.

He modifies the neural network we used for Word2Vec, and takes as an input both the word vectors that come before, and a vector that depends on the sentence they are in. We will take the features of this word vector as parameters of our model and optimize them using a gradient descent. Doing that, we will have for every sentence a set of features that represent the structure of the sentence. These features capture most of the useful information on how the words follow each other.

Pros and cons for sentiment analysis

These document vectors are very useful for us, because the sentiment of a sentence can be deduced very precisely from these semantic features . As a matter of fact, users writing reviews with positive or negative sentiments will have completely different ways of composing the words. Feeding a logistic regression with these vectors and training the regression to predict sentiment is known to be one of the best methods for sentiment analysis, both for fine-grained (Very negative / Negative / Neutral / Positive / Very positive) and for more general Negative / Positive classification.

We implemented and benchmarked such a method but we chose not to productionalize it. As a matter of fact, building the document vector of a sentence is not an easy operation. For every sentence, we have to run a gradient descent in order to find the right coefficients for this vector. Compared to our other methods for sentiment analysis, where the preprocessing is a very short algorithm (a matter of milliseconds) and the evaluation is almost instantaneous, Doc2Vec classification requires a significant hardware investment and/or takes much longer to process. Before taking that leap, we decided to explore representing our sentences by a matrix of word vectors and to classify sentiments using a deep learning model.

Convolutional neural networks

Convolutional neural networks

The next method we explored for sentiment classification uses a multi-layer neural network with a convolutional layer, multiple dense layers of neurons with a sigmoid activation function, and additional layers designed to prevent overfitting. We explained how convolutional layers work in a previous article. It is a technique that was designed for computer vision, and that improves the accuracy of most image classification and object detection models.

The idea is to apply convolutions to the image with a set of filters, and to take the new images it produces as inputs of the next layer. Depending on the filter we apply, the output image will either capture the edges, or smooth it, or sharpen the key patterns. Training the filter’s coefficients will help our model build extremely relevant features to feed the next layers. These features work like local patches that learn compositionality. During the training, it will automatically learn the best patches depending on the classification problem we want to solve. The features it learns will be location-invariant. It will convolve exactly the same way an object that is at the bottom of the frame and an object that is at the top of the frame. This is key not only for object detection, but for sentiment analysis as well.

Convolution used for edge detection

Applications in Natural Language Processing

As these models became more and more popular in computer vision, a lot of people tried to apply them in other fields. They had significantly good results inspeech recognition and in natural language processing. In speech recognition, the trick is to build the frequency intensity distribution of the signal for every timestamp and to convolve these images.

For NLP tasks like sentiment analysis, we do something very similar. We build word vectors and convolve the image built by juxtaposing these vectors in order to build relevant features.

Intuitively, the filters will enable us to highlight the intensely positive or intensely negative words. They will enable us to understand the relation between negations and what follows, and things like that. It will capture relevant information about how the words follow each other. It will also learn particular words or n-grams that bear sentiment information. We then feed a fully connected deep neural network with the outputs of these convolutions. It selects the best of these features in order to classify the sentiment of the sentence. The results on our datasets are pretty good.

LSTM

We also studied, implemented and benchmarked the Long Short-Term Memory Recurrent Neural Network model. It has a very interesting architecture to process natural language. It works exactly as we do. It reads the sentence from the first word to the last one. And it tries to figure out the sentiment after each step. For example, for the sentence “The food sucks, the wine was worse.”. It will read “The”, then “food”, then “sucks”, “the” and “wine”. It will keep in mind both a vector that represents what came before (memory) and a partial output. For instance, it will already think that the sentence is negative halfway through. Then it will continue to update as it processes more data.

This RNN structure looks very accurate for sentiment analysis tasks. It performs well for speech recognition and for translation. However, it slows down the evaluation process considerably and doesn’t improve accuracy that much in our application so should be implemented with care.

They implement a model called the RNTN. It represents the words by vectors and takes a class of tensor-multiplication-based mathematical functions to describe compositionality. Stanford has a very large corpus of movie reviews turned into trees by their NLP libraries. Every node is classified from very negative to very positive by a human annotator. They trained the RNTN model on this corpus, and got very good results. Unfortunately, they train it on IMDB movie reviews data. But it doesn’t perform quite as well on our reviews.

The big advantage of this model is that it is very interpretable. We can understand very precisely how it works. We can visualize which words it detects to be positive or negative, and how it understands the compositions. However, we need to build an extremely large training set (around 10.000 sentences with fine-grain annotations on every node) for every specific application. As we continue to gather more and more detailed training data, this is just one of the types of models we are exploring to continue improving the sentiment models we have in production!

We give our tenants insights about their online reputation based on their online reviews and ratings. In doing so, one thing we try to do is pull apart the text of reviews to understand what the reviews are dealing with, and tell our clients what their customers are talking about and how happy those customers are with key aspects of our clients’ business.

So for example, we might identify 100 reviews for our client mentioning price, and leveraging the star rating of those reviews, we might discern that 80% of those reviews are positive and the average rating of those reviews is 4.0 stars. However, this method could be improved: a positive review mentioning price is not necessarily positive about price. For example:

The food was awesome, and the service absolutely excellent. The price was very high for a coffee-shop style restaurant.

This 5 star review is obviously negative about the price of the restaurant. We need a model that tells us the local sentiment of a sentence or a subsentence in order to be able to understand what elements drive the rating of the review. I’ll explain some of the techniques we have studied, implemented and benchmarked in order to build our Sentiment Mining Tool.

Naive Bayes Classifier

Naive Bayes is the first and the easiest method to classify sentiment in a text. It’s based on the Bayes formula for conditional probabilities:

We’ll represent a text by a Bag of Words, which is a set of features “the word w appears f times” for each word w in the sentence and f, the frequency of w in the sentence. Assuming the Naive Bayes assumption that these features are independent, this formula helps us deduce the probability that the sentence is positive (A) knowing that w appears f times (B) for every w. In fact, we can deduce from the frequencies in a large enough dataset the probability for a sentence to be positive (A), and the probabilities of every feature and then of their intersection (B). Training the model on a training set of 10,000 annotated sentences, we get a set of informative features that are helpful to predict whether a sentence is positive or negative. Here are the 10 most informative features we get:

Naive Bayes classifier’s informative features

This method is the easiest to implement and the big advantage is that it’s completely transparent. When we process it, we know that the classifier found a set of strongly positive or of strongly negative words, and that it is why we classified the sentence in such a way.

How to improve it

However, there are several drawbacks using this method.

First, it fails to identify the neutral class. As a matter of fact, words can have a positive or a negative meaning (“good”, “awesome”, ”horrible”, …) but no word has a neutral connotation. Often, it’s all about the absence of such positively or negatively meaningful words or about the structure of the sentence that reflects the absence of strong emotion. The Bag of Words representation doesn’t address this problem.

It also fails to understand intensity and negations. Comparing “good” and “quite good” for instance, the first one is more likely to appear in a positive sentence than the second one. We tried some methods to address this: adding a list of meaningful bigrams (which mean that we would read “quite good” as a single word for instance), or training the model on bigrams instead of training it on single words, but both didn’t improve our model very much. We also fail to identify negations most of the time, because this model doesn’t take the word order into account.

Most of all, the Naive Bayes model doesn’t perform very well in solving the local sentiment analysis problem. In a long text, having a high frequency of positive words: “sensational”, “tasty”, … makes it very likely that the author is expressing positive sentiment. But as our goal is to determine the local sentiment, we want to process the tool on short sentences and subsentences. (We already have a star rating that tells us the author’s overall sentiment.) We don’t have enough words in the sentence to aggregate so we need to understand very precisely the semantic structure.

The Bag of Words representation is a very bad way to do this. For instance, the sentence “The food could have been more tasty.”, we detect the word “tasty” that is related to a positive feeling, but we don’t understand that “could have been more” is a kind of negation or nuance. Many short sentences are like that, and looking at only a small sentence dataset reduced our accuracy from around 77% to less than 65%.

Rule-based sentiment models

To improve the Naive Bayes methods and make it fit the short sentences sentiment analysis challenge, we added some rules to take into account negations, intensity markers (“more”, “extremely”, “absolutely”, “the most”, …), nuance, and other semantic structures that appear very often near sentimental phrases and change their meanings. For instance, in “The food wasn’t very tasty”, we want to understand that “not very tasty” is less negative than “not tasty” or “not tasty at all”.

We leveraged the results of the Naive Bayes training to build a large vocabulary of positive and negative words. When we process a given sentence, we attribute every word a positive and a negative score, and calculate the overall scores by a precise analysis of the semantical structure based on the open-source library spacy’s pipelines for part-of-speech tagging and dependency parsing. We get a metric for positive, negative and neutral scores, the neutral score being defined as the proportion of words that are neither positive nor negative in the sentence. We used a deep-learning technique to deduce from our training set the relation between these scores and the sentiment. Here are the graphs we obtained for negative, neutral and positive sentences:

The model helps us decide very well whether an expressive sentence is positive or negative (we get around 75% accuracy), but struggles understanding a criteria for neutrality or absence of sentiment (on our test-set, it’s wrong 80% of the time). It’s much better than the Naive Bayes, but 75% is less than the state-of-art for positive/negative decision.

A couple of months ago, we looked at the relationship between review site rankings on a business’s local SERP/SEO and the number of reviews of the business on those sites. We found a significant positive relationship between the number of reviews and how highly those sites ranked in local search results.

Most of the analysis that time around centered on automobile dealers around the country and where their Facebook and DealerRater presences ranked in Google search results targeting that dealer. This time around we expanded that analysis to multiple review sites and multiple industries and found that the relationship between review volume and local search ranking varies wildly by domain and industry. We also dug a little deeper into the data to try to estimate the value of adding reviews on these sites over time, and we found that new reviews are valuable in two ways. First, new reviews help review sites rise on search engine results pages, and second, the more reviews that site acquires, the better its chances are of staying at the top of the results page as well.

Reviews and SEO across sites and industries

First, let’s look at the relationship between review volume and domain ranking in local search for that same set of automobile dealers. (Note: None of these analyses include Google, since the Google review presence is usually anchored on the right-hand side of the page.)

The Facebook and DealerRater lines here match up pretty well to the data we presented before. We also see a correlation between review volume and domain rank for the other domains, but it is notable that the apparent impact of additional reviews varies a bit by source. For instance, for four of these domains the average rank of the review site for a location with no reviews is between 8.5 and 10. Having 100 reviews on DealerRater brings the expected rank of that domain down to the top half of the first page, whereas for cars.com and Facebook, we would still expect 100 reviews to leave that site below the fold when someone is searching for that location. Edmunds.com is even worse. It seems no matter how many reviews you get, Google is determined to pin your Edmunds presence to the top of the 2nd page.

This data would lead us to hypothesize that, on average, an additional review on DealerRater is worth considerably more for a car dealer than an additional review on one of these other sites. But before we explore that hypothesis a little more, lets look at similar data for a few other industries. Next let’s look at hospitals:

These are the review site domains that most commonly showed up when we googled over 1000 US hospitals. Again we see the expected directional relationship, more reviews means a generally better SERP/SEO ranking. However, none of these curves are as steep as the steepest curves for auto dealers. It’s also very interesting to note how much Google seems to value a healthgrades.com page regardless of whether there are any reviews on it.

And here is the data for the Self-Storage Unit industry. There isn’t as much breadth in this industry, as there aren’t as many review sites with high volume, but it is very interesting to note that in the storage industry, Facebook has a very strong correlation between review volume and SERP/SEO rank.

All of this is very interesting, but it raises several questions. Most notably, what makes the SERP/SEO ranking of particular review sites seem to be so responsive to review volume in particular industries? And is there actually causation here or does something else explain why some of these correlations are so strong?

Review volume impact on local SEO over time

Let’s address the causation question by looking at some more dynamic data, specifically by looking at how these rankings and volumes change over time. This is still a long way from a controlled experiment, but it would be more compelling if we could show that as review volumes rise for a particular location on a particular site, then the SERP/SEO ranking of that location tends to fall.

Over the last couple of months we gathered SERP/SEO data once a week for several thousand US auto dealers. We then looked at the rankings for major review sites over time and how those changes correlated with total review volume and with changes in review volume. To model this, we fit a Markov Chain that predicted the probability of any weekly SERP/SEO ranking for a review site based upon the domain, that site’s ranking the previous week, the total number of reviews for that location on that site, and whether the number of reviews went up or not.

The first thing we wanted to measure was this – Does getting new reviews positively impact your search engine rank? According to our data, the answer would appear to be yes. In the graph below we plot the predicted impact of getting a new review on one review site according to our model.

According to our data, after we normalize for domain, rank, and total number of reviews prior, review sites that got at least one new review in a given week tended to be placed higher the following week than sites that did not get new reviews. Obviously this impact is much higher when you have no reviews or very few reviews (an average improvement of 1/3 of a spot for sites getting their first review!), and it levels off pretty quickly once you have around a dozen reviews.

Our model spit out one other interesting insight. It found that review volume is important not just for getting a site ranked highly on SERP, but for keeping it there as well. Review site rankings drift from week to week, and our Markov Chain model captures that drift. But what the model also found is that for review sites with a high volume of reviews, regardless of where they ranked the week before, they tended to drift more towards the top of the page (or were more likely to stay there) than review sites with very few reviews.

This graph plots how much an auto dealer’s review volume will impact the drift of that ranking on average. In other words, if you have no reviews, your review site page will lose one spot every three weeks, on average, relative to the norm. If you have 50+ reviews, it will gain 1 spot every 5 weeks on average, relative to the norm. You might ask, “how can I gain a spot if I am already at the top?” Well, links that are in the top spot tend to lose that spot about 20% of the time. If that site has 50+ reviews, it will be much less likely to do so.

Conclusions

Hopefully this analysis shines a light on the value of generating a healthy review volume on review sites that you want your customers to be able to find on Search Engines. And makes it clear that those reviews are valuable not just because they will help those review sites climb to the top of search engine results, but because they will help those sites stay there as well. Also beware that these effects can vary considerably from domain to domain, and the most responsive domains may also vary from industry to industry.

One of the goals of the Analytics team has been to provide newer, more in-depth ways to analyze the millions of comments that Reputation aggregates from various sources for each customer. One way to do this is through natural language processing (NLP) techniques like part-of-speech(POS) tagging, named entity recognition(NER), and stemming/lemmatization. Combining these NLP techniques with our existing segmentation tools allows us to begin comparing statistics across sets defined by the language content of those comments. For example, we could look at the set of Walgreens comments that mention Rite-Aid and see that these had higher than average ratings in comparison to the total set of Walgreens comments.

These evaluations, however, initially required us to load the set of comments that we wished to analyze into Python, then run each comment through a natural language parser one at a time locally each time we wanted to run an analysis. The overhead required to parse each of these reviews began to impede our ability to rapidly test different types of analyses, so we began to look into alternative methods for achieving this goal. What we were ultimately looking for was a pre-processed database that would allow us to look up a comment by id and receive a set of POS tags, named entities, and lemmas without having to re-parse each comment each time. This natural language pre-processing would need to be done retroactively to the tens of millions of comments already stored in our database, as well as incrementally on any new comments that have been pulled in every few days.

Since much of our analysis framework was already implemented in Python, we began adding this new NLP piece in Python as well. Of the various NLP libraries available to Python at the time of this writing, the one that seemed to work best on the 2-3 sentence reviews in our database was the CoreNLP library from Stanford. Essentially CoreNLP comes with a series of models that have been trained on a large corpus of sample words for different languages (presently English, Arabic, Chinese, French and German). These models are then used to evaluate the likely part-of-speech of new inputs based on patterns learned from the original training data. The library also uses similar processes to determine which words in a given input are references to some named entity (for example an organization, individual name, or location name) and to identify the stem form of each word for easier pattern analysis.

The downside of using CoreNLP, however, is that in order to run, it starts up a new, separate Java process which is then passed one comment at a time for parsing. Starting up this Java process creates 5-10 minutes of overhead for processing a set of comments of any size, and even once this separate process is running it can take a few minutes to fully parse an average length comment (3-5 sentences). Thus to run all the millions of historical comments through CoreNLP in a serial fashion would be computationally infeasible. Instead, we decided to use Apache Spark to bring up a distributed cluster to run these comments through CoreNLP in parallel.

Spark provides a set of libraries in either Python, Scala, R, or Java that handle the hassle of creating a distributed cluster of nodes and efficiently distributing data between them. While it can be used for a wide variety of purposes, we used it to take the set of comments that we needed to evaluate and figure out how to split those comments amongst clusters of varying sizes in order to reduce the time necessary to run all of our historical data through CoreNLP. Using Spark also provided the added bonus of easily integrating with AWS’ Elastic Map-Reduce (EMR) service, which has an easy-to-use command line interface for bringing up clusters of EC2 nodes. Amazon has preconfigured settings to automatically pass the relevant information about each EMR cluster through to Spark so that we can easily bring up any number of nodes with the same code. This makes it easy to setup a cron task to automatically parse the last few days worth of reviews on a regular basis.

Additionally, while we originally set out to create a Python application to interact with Spark and CoreNLP, we eventually discovered that we needed the ability to more carefully control which information CoreNLP passed to each Spark process. Since Spark is capable of running multiple threads on each node in order to better parallelize and since each thread runs a separate version of our Spark application, we noticed that each Python application in each thread was instantiating its own CoreNLP Java process. This meant that if we had 4 threads running on the same node, we would also have 4 CoreNLP Java processes running on that node, which would slow that node’s performance to a crawl. To get around this, we had to translate our application into Scala instead. Scala allows for the existence of transient variables, which allowed us to write our code in such a way that when multiple threads are running on the same node, they all use the same CoreNLP Java process, but whenever a new node is brought up it brings up a new process. (Thanks to Databrick’s Spark/CoreNLP wrapper for this idea!)

Below is some of the code from our Scala-based Spark application. It is designed to do the following:

Pull in some number of reviews from our Vertica database.

Distribute those reviews to a cluster of independent nodes.

Run each review through the CoreNLP process for that node.

Format CoreNLP’s output so that it can uploaded back into Vertica

Upload the natural language data (POS tags, NER tags, and lemmas) back into the database

Once our Spark application was working on local developer machines, we began testing running it through EMR’s distributed clusters instead. Initially we ran into some headaches getting Spark to fully utilize the resources made available to it through EMR. There is a line in the code above that talks about pulling in the number of nodes available through the Spark Config (val num_exec = sc.getConf.get(“spark.executor.instances”).toInt). This line tells spark how many nodes it has available so that it can partition the data accordingly. Below are two screenshots of the CPU usage per node in AWS from before this change and after it:

Before proper partitioning – Notice that in this case, the node in blue is the only one that appears to be actually doing any parsing. This is because Spark defaults to assuming a single data partition, so it runs all the comments through the master node.

After proper partitioning – By explicitly telling Spark how many nodes to use, we can see that it now runs some comments through all 8 nodes. (Thanks to Cloudera for explaining this and more about how to properly tune Spark jobs!)

Additionally we ran into some trouble getting EMR to communicate with Vertica through the database’s security restrictions, which involved playing with our VPN settings. Once these hurdles were dealt with though, we were able to begin testing the scaling power of this CoreNLP/Spark/EMR solution. The following graph shows the number of minutes it took Spark to run as dependent on the number of thousands of comments run per each instance in the EMR cluster. As you can see, the time to run increases linearly as a function of how many comments each node is required to run.

Minute to Run vs. # of thousands of comments per node in cluster – This graph shows the time it takes Spark to run our process as a function of number of comments per each distributed node in the cluster. It shows a linear relationship more or less up until the point where there are more than a million comments per node.

The outlier point at 1000 on the x-axis (= 1 million comments per node) is from when we ran all of our historical comments. Further research is required to figure out why performance seems to have degraded for that point.

Interestingly, we also found that it seems when the number of comments per node increases above a about a million or so, the EMR task would fail without outputting any errors in the logs (this is what happened with the rightmost datapoint on the above graph). This may be due to insufficient resources to run the number of comments assigned to that node(we used Amazon’s m3.xlarge instances for each node on each run), but we haven’t done enough analysis to confirm this. The short-term solution to this problem was simply to provide more nodes and get the ratio of comments per node back down to around 1 million or so.