We give our tenants insights about their online reputation based on their online reviews and ratings. In doing so, one thing we try to do is pull apart the text of reviews to understand what the reviews are dealing with, and tell our clients what their customers are talking about and how happy those customers are with key aspects of our clients’ business.

So for example, we might identify 100 reviews for our client mentioning price, and leveraging the star rating of those reviews, we might discern that 80% of those reviews are positive and the average rating of those reviews is 4.0 stars. However, this method could be improved: a positive review mentioning price is not necessarily positive about price. For example:

The food was awesome, and the service absolutely excellent. The price was very high for a coffee-shop style restaurant.

This 5 star review is obviously negative about the price of the restaurant. We need a model that tells us the local sentiment of a sentence or a subsentence in order to be able to understand what elements drive the rating of the review. I’ll explain some of the techniques we have studied, implemented and benchmarked in order to build our Sentiment Mining Tool.

Naive Bayes Classifier

Naive Bayes is the first and the easiest method to classify sentiment in a text. It’s based on the Bayes formula for conditional probabilities:

We’ll represent a text by a Bag of Words, which is a set of features “the word w appears f times” for each word w in the sentence and f, the frequency of w in the sentence. Assuming the Naive Bayes assumption that these features are independent, this formula helps us deduce the probability that the sentence is positive (A) knowing that w appears f times (B) for every w. In fact, we can deduce from the frequencies in a large enough dataset the probability for a sentence to be positive (A), and the probabilities of every feature and then of their intersection (B). Training the model on a training set of 10,000 annotated sentences, we get a set of informative features that are helpful to predict whether a sentence is positive or negative. Here are the 10 most informative features we get:

Naive Bayes classifier’s informative features

This method is the easiest to implement and the big advantage is that it’s completely transparent. When we process it, we know that the classifier found a set of strongly positive or of strongly negative words, and that it is why we classified the sentence in such a way.

How to improve it

However, there are several drawbacks using this method.

First, it fails to identify the neutral class. As a matter of fact, words can have a positive or a negative meaning (“good”, “awesome”, ”horrible”, …) but no word has a neutral connotation. Often, it’s all about the absence of such positively or negatively meaningful words or about the structure of the sentence that reflects the absence of strong emotion. The Bag of Words representation doesn’t address this problem.

It also fails to understand intensity and negations. Comparing “good” and “quite good” for instance, the first one is more likely to appear in a positive sentence than the second one. We tried some methods to address this: adding a list of meaningful bigrams (which mean that we would read “quite good” as a single word for instance), or training the model on bigrams instead of training it on single words, but both didn’t improve our model very much. We also fail to identify negations most of the time, because this model doesn’t take the word order into account.

Most of all, the Naive Bayes model doesn’t perform very well in solving the local sentiment analysis problem. In a long text, having a high frequency of positive words: “sensational”, “tasty”, … makes it very likely that the author is expressing positive sentiment. But as our goal is to determine the local sentiment, we want to process the tool on short sentences and subsentences. (We already have a star rating that tells us the author’s overall sentiment.) We don’t have enough words in the sentence to aggregate so we need to understand very precisely the semantic structure.

The Bag of Words representation is a very bad way to do this. For instance, the sentence “The food could have been more tasty.”, we detect the word “tasty” that is related to a positive feeling, but we don’t understand that “could have been more” is a kind of negation or nuance. Many short sentences are like that, and looking at only a small sentence dataset reduced our accuracy from around 77% to less than 65%.

Rule-based sentiment models

To improve the Naive Bayes methods and make it fit the short sentences sentiment analysis challenge, we added some rules to take into account negations, intensity markers (“more”, “extremely”, “absolutely”, “the most”, …), nuance, and other semantic structures that appear very often near sentimental phrases and change their meanings. For instance, in “The food wasn’t very tasty”, we want to understand that “not very tasty” is less negative than “not tasty” or “not tasty at all”.

We leveraged the results of the Naive Bayes training to build a large vocabulary of positive and negative words. When we process a given sentence, we attribute every word a positive and a negative score, and calculate the overall scores by a precise analysis of the semantical structure based on the open-source library spacy’s pipelines for part-of-speech tagging and dependency parsing. We get a metric for positive, negative and neutral scores, the neutral score being defined as the proportion of words that are neither positive nor negative in the sentence. We used a deep-learning technique to deduce from our training set the relation between these scores and the sentiment. Here are the graphs we obtained for negative, neutral and positive sentences:

The model helps us decide very well whether an expressive sentence is positive or negative (we get around 75% accuracy), but struggles understanding a criteria for neutrality or absence of sentiment (on our test-set, it’s wrong 80% of the time). It’s much better than the Naive Bayes, but 75% is less than the state-of-art for positive/negative decision.

Introduction to word2phrase

When we communicate, we often know that individual words in the correct placements can change the meaning of what we’re trying to say. Add “very” in front of an adjective and you place more emphasis on the adjective. Add “york” in after the word “new” and you get a location. Throw in “times” after that and now it’s a newspaper.

It follows that when working with data, these meanings should be known. The three separate words “new”, “york”, and “times” are very different than “New York Times” as one phrase. This is where the word2phrase algorithm comes into play.

At its core, word2phrase takes in a sentence of individual words and potentially turns bigrams (two consecutive words) into a phrase by joining the two words together with a symbol (underscore in our case). Whether or not a bigram is turned into a phrase is determined by the training set and parameters set by the user. Note that every two consecutive words are considered, so in a sentence with w1 w2 w3, bigrams would be w1w2, w2w3.

In our word2phrase implementation in Spark (and done similarly in Gensim), there are two distinct steps; a training (estimator) step and application (transform) step.

*For clarity, note that “new york” is a bigram, while “new_york” is a phrase.

Estimator Step

The training step is where we pass in a training set to the word2phrase estimator. The estimator takes this dataset and produces a model using the algorithm. The model is called the transformer, which we pass in datasets that we want to transform, i.e. sentences that with bigrams that we may want to transform to phrases.

In the training set, the dataset is an array of sentences. The algorithm will take these sentences and apply the following formula to give a score to each bigram:

score(wi, wj) = (count(wiwj) – delta) / (count(wi) * count(wj))

where wi and wj are word i and word j, and delta is discounting coefficient that can be set to prevent phrases consisting of infrequent words to be formed. So wiwj is when word j follows word i.

After the score for each bigram is calculated, those above a set threshold (this value can be changed by the user) will be transformed into phrases. The model produces by the estimator step is thus an array of bigrams; the ones that should be turned to phrases.

Transformer Step

The transform step is incredibly simple; pass in any array of sentences to your model and it will search for matching bigrams. All matching bigrams in the array you passed in will then be turned to phrases.

You can repeat these steps to produce trigrams (i.e. three words into a phrase). For example, with “I read the New York Times” may produce “I read the new_york Times” after the first run, but run it again to get “I read the new_york_times”, because in the second run “new_york” is also an individual word now.

Example

First we create our training dataset; it’s a dataframe where the occurrences “new york” and “test drive” appears frequently. (The sentences make no sense as they are randomly generated words. See below for link to full dataframe.)

You can copy/paste this into your spark shell to test it, so long as you have the word2phrase algorithm included (available as a maven package with coordinates com.reputation.spark:word2phrase:1.0.1).

val wordDataFrame = sqlContext.createDataFrame(Seq(
(0, “new york test drive cool york how always learn media new york .”),
(1, “online york new york learn to media cool time .”),
(2, “media play how cool times play .”),
(3, “code to to code york to loaded times media .”),
(4, “play awesome to york .”),
.
.
.
(1099, “work please ideone how awesome times .”),
(1100, “play how play awesome to new york york awesome use new york work please loaded always like .”),
(1101, “learn like I media online new york .”),
(1102, “media follow learn code code there to york times .”),
(1103, “cool use play work please york cool new york how follow .”),
(1104, “awesome how loaded media use us cool new york online code judge ideone like .”),
(1105, “judge media times time ideone new york new york time us fun .”),
(1106, “new york to time there media time fun there new like media time time .”),
(1107, “awesome to new times learn cool code play how to work please to learn to .”),
(1108, “there work please online new york how to play play judge how always work please .”),
(1109, “fun ideone to play loaded like how .”),
(1110, “fun york test drive awesome play times ideone new us media like follow .”)
)).toDF(“label”, “inputWords”)

We set the input and output column names and create the model (the estimator step, represented by the fit(wordDataFrame) function).

Here are some of the scores (Table 1) calculated by the algorithm before removing those below the threshold (note all the scores above the threshold are shown here). The default values have delta -> 100, threshold -> 0.00001, and minWords -> 0.

Table 1

bigram

score

test drive

0.002214815139686856

work please

0.002047826661381…

new york

5.946183949006843E-4

york new

-1.64600247723372…

york york

-6.43001404062082…

york how

-6.64999302561707…

how new

-6.80666229773923…

new new

-7.42968903739342…

to new

-7.52757602015383E-5

york to

-9.25567252744992…

only showing top 10 rows

So our model produces three bigrams that will be searched for in the transform step:

test drive
work please
new york

We then use this model to transform our original dataframe sentences and view the results. Unfortunately you can’t see the entire row in the spark-shell, but in the out column it’s clear that all instances of “new york” and “test drive” have been transformed into “new_york” and “test_drive”.