Sentiment analysis with R: quality criteria for backlinks

Several patents have been filed by Google to quantify the opinions and reviews of Internet users from corpus that do not use traditional rating systems. The sentiment analysis then takes on its full meaning: how do we quantify so-called “raw” opinions?

The purpose of this article is not to go into the details of patents but to draw substantial information from them to project the mechanism of weighting sentiments at the level of backlinks: what if the sentiments analysis became an additional rating signal to measure the quality of a backlink?

A large amount of web content is subjective and reflects people’s opinions. With the rapid growth of the web, more and more Internet users are writing reviews for all types of products and putting them online. It is becoming increasingly common for a consumer to learn how others like or dislike a product before buying it or for a manufacturer to monitor customer feedback on its products to improve customer satisfaction. However, as the number of opinions available for each particular product increases, it becomes increasingly difficult for Internet users to understand and evaluate the dominant opinion on a product. This demonstrates the need for an algorithmic classification of feelings to digest this huge repository of “hidden”criticism.

From there, one can easily imagine that a classification of the backlinks by dominant feeling (ndlr. surrounding the text around the link) would make it possible to measure more precisely the real notoriety of the website (besides the aspects that we already know: citation flow, trust flow, etc.). Opinion as such would then become a discriminating signal in measuring the power of backlink.

However, this hypothesis comes up against multiple difficulties set out in the same patent:

Building feeling classifiers for web-scale processing, with all the problems of pre-processing, spelling and grammar errors, broken HTML… that result, brings about a whole host of other problems. Data sets number hundreds of thousands, sometimes millions, and are neither clean nor consistent. Algorithms must be effective and the subtle effects of language become visible.

Sentiment analysis applied to backlinks

To illustrate our example, I chose to select four backlinks surrounded by content rich enough in words to make the sentiment analysis relevant. So I deliberately set aside the others, which would obviously not be possible in a web-scale model.

Our example is based on the links to the Bang and Olufsen BeoSound Shape page and is directly inspired by the Tidy Text Mining site for the application of sentiment analysis.

Preparation of the dataset

For this part, I directly put the referral domain name (Referral) and the content surrounding the backlink (Content) in an Excel file.

Here, the relevance of the analysis is already hampered by expressions composed of two or more terms that deserve not to be isolated. A difficulty raised in the patent mentioned above:

Current work focuses mainly on the use of unigrams and bigrams – rather than higher-order grammar models – to capture feelings in the text. (2002) found that unigrams surprisingly exceed other characteristics in their ratings. Similarly, Dave et al. (2003) experimentally showed that trigrams and higher values did not show constant improvement. We assume that these experiments were hampered by the low volume of training dataset, and therefore were not able to demonstrate the effectiveness of high order n-grams in discernible subtleties in expressing feelings. Unfortunately, lower-order n-grams, such as unigrams and bigrams, are not able to capture dependencies in the longer term, which often leads to problems of accuracy in classification.

How to introduce the classification of sentiments as a measure of the quality of a backlink?

The patents “Creating a summary of feelings for local service opinions” and “Summary of feeling: Evaluation and learning of user preferences” introduce mathematical concepts to give value to positive and negative sentiment, and achieve an overall result (editor’s note: therefore per document / web page).

It is therefore conceivable that a calculation method will be decided in the future to be applied as a measure of the quality of a backlink, and thus reinforce the relevance desired by the algorithm. In this model, a negative sentiment would cause links to reverse or decrease the weight of the PageRank assigned to destination pages so that only positive links would add value. If many negative links pointed to a document, it would require many positive links of equal value to compensate for the loss of PageRank.