Background: Consumer transactions and individuals' online information exchange streams hold an enormous amount of data that is large in volume, velocity of generation, and variety. Motivation: Extracting information and trends from these data is a valuable asset for better understanding consumers' activities and preferences to guide future decision-making. Consequently, interest in analyzing customer reviews through sentiment analysis classification is growing. However, the resources and lexicons available to aid classification learning are still scarce. Aim: The present research presents a domain-specific lexicon that enhances the analysis of customer reviews on services. The Lexicon for Sentiment Analysis for Reviews (LSAR) is applied using semi-supervised SVM classification. Results: Results were encouraging, showing that the classifier based on the proposed lexicon, LSAR, achieved better accuracy (0.94) than the AFINN-based classifier (0.72).

Analyzing large amounts of data using data mining, text mining, machine learning, and natural language processing (NLP) is of great value in revealing meaning and patterns in available unstructured text.[1] Data mining is the exploration and analysis of large amounts of data.[2] Sentiment Analysis (SA) is one of the NLP concepts, also called “opinion mining,” “subjectivity analysis,” “analysis of stance,” and “evidentiality.”[2],[3],[4] According to Cambria,[1] we are gradually shifting into an era in which people's opinions will dictate the shape of final products and services. Opinion mining, or SA, is concerned with quantifying content from subjective accounts. This is done by extracting “sentiment” from a text written in a natural language, providing useful information about the communicator's views and tendencies about a specific item or event by classifying text into positive and negative classes. SA can be applied to different domains, such as customer reviews, business intelligence, service ratings, political opinions, and social media transactions and interactions, to extract tendencies, polarity of opinion, and trends.

Turning large amounts of data into valuable knowledge and learning from those data is a necessity for improving data intelligence. Data mining is “the process of discovering interesting patterns and knowledge from large amounts of data.”[2] Data mining is a major step in knowledge discovery; other steps include data cleaning, data integration, data selection, data transformation, pattern evaluation, and knowledge presentation.[2] The definition of “machine learning” is similar to that of “data mining,” with a noticeable difference in the kind of inputs and outputs: in machine learning, the mining goals cannot be directly described in advance.[5] For example, analyzing a sample of movie reviews in order to predict the likability of other movies is a task for which a machine learning algorithm would be a good candidate.

SA can be implemented using two approaches. The first is the supervised learning approach, known as the “corpus-based” approach, where both the inputs and the desired outputs of the system are known and the system learns to map inputs to outputs.[6] This is done using machine learning algorithms such as the support vector machine (SVM), Naïve Bayes (NB), decision tree (D-Tree), and K-nearest neighbors algorithms. The second approach is the unsupervised approach, also known as “knowledge-based,” “rule-based,” or “lexicon-based.”[6] In this latter approach, each word is compared to a “look-up” dictionary of terms, where each word is associated with a polarity value of +1 or −1, for positive or negative, respectively. Most previous studies do not use the neutral class (0), making the classification easier and more striking, but possibly biased.[3] In the present study, data are classified into three classes: positive, negative, and neutral.
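As an illustration of this look-up idea, a minimal sketch in Python follows (the classifiers in this study were implemented in R; the toy dictionary here is hypothetical and is not LSAR or AFINN):

```python
# Minimal sketch of a lexicon look-up: each known word maps to +1 or -1,
# and any word absent from the dictionary is treated as neutral (0).
toy_lexicon = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def word_polarity(word):
    """Return +1, -1, or 0 for a single word via dictionary look-up."""
    return toy_lexicon.get(word.lower(), 0)

print(word_polarity("Great"))  # 1
print(word_polarity("awful"))  # -1
print(word_polarity("table"))  # 0
```

Unknown words falling back to 0 is what makes the neutral class possible in the first place; lexicons that force every word into +1/−1 cannot express it.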

This research presents a new lexicon for review analysis. It was tested with reviews from the Feed-Finder application (http://feed-finder.co.uk), which helps mothers locate places amenable to breastfeeding outside the home. Feed-Finder was developed in the Open Lab, Newcastle University, UK. It promotes breastfeeding, as opposed to bottle feeding, by providing a location-based social network to interested mothers. The application allows mothers to tag, review, or search for breastfeeding-friendly venues. Utilizing the available 1956 reviews about different venues, the present research aims to design a model that classifies the reviews into positive, negative, and neutral using a semi-supervised approach, or what is called the “Bag-of-Words” method.[7],[8],[9] That is a mixed approach between supervised learning, using machine learning classification, and an unsupervised approach, using lexicon-based learning. This method has proved to be much faster than the purely supervised approach.[7] Exercising this method, two different classifiers are implemented: the first is built on the Lexicon for Sentiment Analysis for Reviews (LSAR) and the other upon the AFINN lexicon.[10] The next section presents a review of the related work, followed by the methodology, the results and discussion, and finally the conclusion and acknowledgments.

Literature review

The body of knowledge about SA is large; however, works that classify reviews are the main focus of the present research, and they are discussed here in chronological order [Table 1]. Additionally, I take into consideration papers that represent a marked methodology in classifying unstructured data entries. Across the literature, a general methodology comprising several steps is followed: preprocessing and data cleaning, data normalization and tokenization, data annotation, and data classification and analysis. The specification details of these steps depend on three factors: first, the source of the data under consideration; second, how noisy the data are; and third, the aim of the classification. Classification can be done using different algorithms; among the notable ones are SVM,[11],[12],[13] NB,[14],[15] and Maximum Entropy (MaxEnt).[14] The SVM algorithm was first used to classify reviews by Pang et al.,[16] who combined supervised learning with the Bag-of-Words lexicon method for text classification and demonstrated its superiority over other algorithms such as NB.[3],[16],[17] Pang et al.[16] used a scoring system in combination with the SVM classifier: a review is classified word-by-word into 1, −1, or 0, that is, positive, negative, or neutral, respectively, and the sum of the words' scores is then calculated. One of the most-cited articles about SA, by Sindhwani and Melville, used both a sentiment lexicon and a sentiment pattern database.[17] The classification was applied to online product-review articles and to more general documents, including general web pages and news articles.
Moreover, the sentiment analyzer was developed using NLP techniques to extract topic-specific features and the key sentiment from each sentiment-bearing phrase, and to make topic and feature sentiment associations.[17] Another interesting piece of research, by Yi et al.,[18] used a framework of adjectives, verbs, and adverbs to intensify the weight of sentiments in a sentence. For example, the sentence “the movie was good” is positive, but “the movie was very good” is extra positive. Asur and Huberman[19] studied the microblogging application Twitter in particular, analyzing sentiments with the aim of predicting a new movie's revenues. They were trying to answer this question specifically: “Using the tweets referring to movies before their release, can we accurately predict the box-office revenue generated by the movie in its opening weekend?” The classifier was built using the LingPipe linguistic analysis package,[20] in particular the DynamicLMClassifier. The results of this study scored higher accuracy in predicting revenues than those of the Hollywood Stock Exchange.

Shi and Li[12] performed sentiment analysis on 4000 hotel reviews, half of which were positive and the other half negative, using SVM. They segmented the documents using the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).[21] The main focus was on unigram features, with 12745 unigrams across all documents. The evaluation was measured using recall, precision, and F-score for two approaches: term frequency and term frequency-inverse document frequency. The results are comparable for both approaches, 86% and 87% for the former and the latter, respectively, with a slight improvement for the latter. Colbaugh and Glass[22] proposed a sentiment orientation (SO) algorithm to classify a number of documents n utilizing a labeled subset of n1. The algorithm was studied using the publicly available dataset from the Cornell NLP Group[23] with 2000 reviews equally divided into positive and negative. Furthermore, they built a lexical vector of 1400 sentimental domain-independent words from Ramakrishnan et al.,[24] comparing the proposed SO algorithm to three other schemes: lexicon only, NB classifier, and RLS classifier. The results show that the proposed algorithm's accuracy outperforms the other schemes for all dataset sizes up to 1000 data entries. Zheng and Ye[11] used SVM to classify Chinese hotel reviews extracted from Ctrip (www.ctrip.com). The results were then compared to the results of an English SA of reviews on a different dataset carried out by one of the authors. Although the comparison carries questionable value, due to the difference between the compared languages and hence the classification features, the results of the proposed classifier are quite high: 94.87% and 91.15% for recall and accuracy, respectively. De Albornoz et al.
analyzed 60 hotel reviews from booking.com.[13] Each review contained hotel information, a numerical score from 0 to 10 for five different aspects, and a textual comment. The authors noticed that the polarity of the review score had no relation to the polarity of the reviewer's comment, as people tend to be tactful when writing comments. The study was carried out using three Weka classifiers, a logistic regression model, SVM, and a functional tree, to classify data into three classes: good, fair, and poor. Using movie reviews to predict sales, Yu et al.[25] developed two algorithms to model the prediction: first, a sentiment analyzer, Sentiment Probabilistic Latent Semantic Analysis (PLSA), based on the traditional algorithm;[26] and second, a product sales prediction model, the Autoregressive Sentiment Aware model, based on a previously presented autoregressive model.[27] In the study, 45,046 blog entries concerning 30 movies were collected using Apache Lucene. Furthermore, the daily gross revenues for those movies were collected from IMDB.

To measure the prediction accuracy of the fitted time series, the mean absolute percentage error (MAPE)[28] was used. The study shows that using SA for prediction gives a better indicator than volume alone.
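For reference, the standard textbook definition of MAPE (not reproduced from the cited work) over n time steps, with actual values A_t and forecasts F_t, is:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|
```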

Sixto et al.[14] aimed at classifying hotel reviews using the MaxEnt model and the NB model. The features had three components: tokenization of the text, Stanford lemmatization, and Part-of-Speech (POS) tagging.[29] Lemmas, their POS tags, and their raw frequencies are used as classification features. Words are annotated using Q-WordNet, a lexical resource based on WordNet, to find the words' polarities. Classifying 1000 reviews, the F-score with features (0.82) is comparable to that of the basic classifier without features (0.79).

Dalal and Zaveri[30] proposed a feature-based sentiment analyzer for product reviews implementing fuzzy linguistic hedges, taking into account the product features, descriptors, and modifying hedges. Hedges, or ‘intensifiers’,[3],[31],[32] are words like ‘very’, ‘highly’, and ‘extremely’. Specifying a product's features is a manual task that needs to be done for each product beforehand, so the study considered only four products, tablets, e-book readers, smartphones, and laptops, of different brands. The reviews were classified into one of five classes: very positive, positive, neutral, negative, and very negative. Compared to two other approaches, the valence points adjustment approach[33] and Vo and Ock's fuzzy adjustment approach,[34] the proposed approach achieved better accuracy.

Rajput et al.[35] used a sentiment dictionary, the MPQA corpus,[36] containing 8221 records, where each record consists of six features that describe the word in the record. Students' reviews are analyzed using this corpus. The sentiment of a word is calculated by multiplying its sentiment value by its frequency, and the overall sentiment score of a review is the sum of the sentiments of each word in that review. The analysis is done using the KNIME open-source data analytics platform.[37] The proposed approach was able to achieve an accuracy of 91.2%. The paper suggests that SA of reviews gives more insight than a Likert-based score with five predefined options from which the student selects.

Paredes-Valverde et al.[38] implemented a Spanish sentiment analyzer to improve products and services using a deep learning approach, specifically a Convolutional Neural Network (CNN) and word2vec[39],[40] classification model built with TensorFlow. More than 130k Spanish tweets were collected, of which only 50k positive and 50k negative tweets were used to create the classifier, with 80% of the tweets used for training and the rest for testing. To evaluate the proposed approach, precision, recall, and F-measure metrics were used. The study shows that the CNN algorithm outperformed SVM and NB with a gain of around 5%, though with no clear justification.

Martins et al.[15] analyzed hotel reviews in Portuguese available on TripAdvisor, a website for tourist attraction reviews. The dataset comprised 69,075 reviews, each consisting of a review title of about 10 words, a review body with an average of 60 words, and a star rating. For the sake of simplicity, the rating was mapped into three classes: positive, neutral, and negative. The corpus was divided into 69% for training and the rest for testing. The normalization process comprised lemmatization, polarity inversion for negation, and the creation of vibe tables, that is, tables of word frequency in each class. For classification, Naïve Bayes with Laplace smoothing was used. The F-measure of 87% was comparable to other related work.

Methodology

As explained in the previous sections, classification of reviews can be done using SA with supervised classification or with unsupervised learning using a lexicon. In this research, we implement a hybrid classification using “Bag-of-Words,” taking advantage of both techniques by using SVM classification and a lexicon.[41] The problem can be described as follows. Given a set of reviews R and a set of classes PX = {positive, neutral, negative}, the aim is to map each review to its appropriate class, R → PX, describing the review's polarity. The process of analyzing the Feed-Finder reviews involved different stages, including data extraction, data annotation, data normalization, classification, and results analysis, as in [Figure 1]. Moreover, in this research, two classifiers are trained using two different lexicons, LSAR and AFINN. The LSAR lexicon was created manually by observing the top sentimental terms used in location reviews of several location-based applications such as Foursquare (https://foursquare.com), Yelp (https://www.yelp.com), and Google Maps. Incorporating the Bag-of-Words method,[7] each word of the review is regarded as a feature. The overall sentiment of a review is decided based on the sum of the numerical values of its words, where a positive word counts as 1, a negative word as −1, and any other word as 0. If the sum is >0, the review is regarded as positive; if <0, it is negative; and if it is equal to 0, it is neutral.
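The scoring rule just described can be sketched as follows (an illustrative Python sketch with a hypothetical toy lexicon; the study itself used the full LSAR and AFINN lexicons and was implemented in R):

```python
# Sketch of the Bag-of-Words scoring rule: sum the per-word values
# (+1 positive, -1 negative, 0 otherwise), then threshold the total.
toy_lexicon = {"lovely": 1, "friendly": 1, "dirty": -1, "rude": -1}

def classify_review(review):
    """Map a review to 'positive', 'negative', or 'neutral'."""
    score = sum(toy_lexicon.get(w.lower(), 0) for w in review.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_review("lovely friendly staff"))    # positive
print(classify_review("dirty tables rude staff"))  # negative
print(classify_review("an ordinary place"))        # neutral
```

Because every out-of-lexicon word contributes 0, reviews built entirely from unlisted words naturally fall into the neutral class.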

The lexicons are incorporated in the learning process as an annotated corpus to train the classifier. The LSAR lexicon, created by the present author, contains 768 keywords labeled positive, negative, or neutral that are likely to describe service reviews; those words were inspired by a subset of the available reviews. The AFINN lexicon contains 2477 words and phrases[10] with ordinal integer scores between −5 for the most negative words and 5 for the most positive, giving ten different classes. In both cases, the overall polarity of a review is calculated as the average weight of the sum of each word's score in the review.

Review analysis started with data extraction, recovering the data from the log file of the Feed-Finder application. Each review was then annotated with a label: positive, negative, or neutral. After that, the reviews were normalized: all emoticons were replaced with words, for example, “:)” or “:-)” with the word “Happy,” and all numbers and punctuation marks were deleted. Finally, the reviews were classified using the SVM classifier, written and programmed in R.
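The normalization step can be sketched as follows (a Python illustration of the emoticon replacement and number/punctuation removal described above; the original pipeline was written in R, and the emoticon table here is only a hypothetical fragment):

```python
import re

# Sketch of review normalization: map emoticons to words, then strip
# digits and punctuation, collapsing any leftover whitespace.
EMOTICONS = {":)": "Happy", ":-)": "Happy"}  # illustrative subset only

def normalize(review):
    for emoticon, word in EMOTICONS.items():
        review = review.replace(emoticon, word)
    review = re.sub(r"[0-9]", "", review)       # delete numbers
    review = re.sub(r"[^\w\s]", "", review)     # delete punctuation marks
    return re.sub(r"\s+", " ", review).strip()  # collapse whitespace

print(normalize("Great place :) open till 9pm!"))  # Great place Happy open till pm
```

Replacing emoticons before stripping punctuation matters: doing it in the opposite order would destroy the “:)” token before it could be mapped to a word.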

The resultant data contain a classification of each review as positive, negative, or neutral. Positive reviews are those with positive comments and connotations describing the venue, as opposed to negative reviews. Neutral reviews are reviews with no sentiment or reviews outside the scope of the Feed-Finder data; the latter include offensive reviews and “null” and “test” reviews. Removing offensive reviews that include swear words and carry no substance is necessary for classification accuracy.[25]

The classification phase mentioned earlier is composed of two parts: the training phase and the testing phase. The training phase trained two classifiers using the mentioned lexicons. The testing phase used the two trained classifiers to classify the list of unlabeled reviews. The results of the classifiers were then compared to the labeled reviews and are analyzed in the next section.
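The two phases can be sketched end to end as follows (illustrative Python only: a simple nearest-centroid model stands in for the SVM actually used, and the toy lexicon and reviews are hypothetical; the point is the pipeline shape, where the lexicon labels the training data and the learned model then labels unseen reviews):

```python
from collections import Counter

toy_lexicon = {"lovely": 1, "clean": 1, "dirty": -1, "rude": -1}

def lexicon_label(review):
    """Lexicon step: label a review from the summed word polarities."""
    score = sum(toy_lexicon.get(w, 0) for w in review.split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def vectorize(review, vocab):
    counts = Counter(review.split())
    return [counts[w] for w in vocab]

def train(reviews):
    """Training phase: lexicon-label each review, average vectors per class."""
    vocab = sorted({w for r in reviews for w in r.split()})
    sums, ns = {}, {}
    for r in reviews:
        label = lexicon_label(r)
        vec = vectorize(r, vocab)
        acc = sums.setdefault(label, [0] * len(vocab))
        sums[label] = [a + b for a, b in zip(acc, vec)]
        ns[label] = ns.get(label, 0) + 1
    centroids = {c: [v / ns[c] for v in s] for c, s in sums.items()}
    return vocab, centroids

def predict(review, vocab, centroids):
    """Testing phase: assign the class whose centroid is nearest."""
    vec = vectorize(review, vocab)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c]))
    return min(centroids, key=dist)

vocab, centroids = train(["lovely clean cafe", "dirty rude staff", "ok place"])
print(predict("clean lovely spot", vocab, centroids))  # positive
```

This is the semi-supervised shortcut the paper exploits: no hand-labeled training set is needed, because the lexicon supplies the labels from which the classifier then generalizes.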

Results and Discussion

The number of reviews analyzed was 1951, classified using the SVM classifier trained with two lexicons: LSAR and AFINN. A description of the two lexicons is given in the previous section. The numbers of positive, negative, and neutral reviews were 1639, 138, and 174, respectively.

To measure the performance of both classifiers, the LSAR-based and the AFINN-based, the confusion matrix of both techniques is depicted in [Table 2]. True positive (TP) and true negative (TN) represent the correctly classified positive and negative reviews, respectively, while false positive (FP) and false negative (FN) represent reviews wrongly classified as positive or negative when they were otherwise. A third class, neutral, is also considered in the classification, and true neutral (TE) represents the correctly classified neutral reviews. With the LSAR-based classifier, 1834 reviews were classified correctly [Table 3], an accuracy of 0.94, while the AFINN-based classifier classified 1412 reviews correctly [Table 3], an accuracy of 0.72. One reason for the superiority of the LSAR-based classifier is that although the AFINN lexicon has 2477 entries and 10 degrees of sentiment, its entries are general terms that describe positive and negative emotions and states but are not necessarily used by consumers. For example, a word like “agog” in the AFINN lexicon expresses positive tendencies but is almost never used to express a sentimental reaction to a service. Moreover, some keywords that are used frequently to describe services, or feelings toward those services, are missing from the AFINN lexicon, for example, “exceptional,” “welcoming,” “expensive,” and “suitable.” LSAR is specifically created to analyze consumer reviews of services, incorporating 768 weighted terms. As the classes are of different sizes, the F1 score gives a better indication of the classifier performance, since it is the harmonic mean of precision and recall. The value of the F1 score is calculated from the precision (P) and recall (R) metrics, as shown in equation 1.
Precision (equation 2) calculates the ratio of correctly predicted positive reviews to the total predicted positive reviews, answering the question: among all reviews that are classified as positive, how many were actually positive?

Recall, or sensitivity (equation 3), calculates the ratio of correctly predicted positive reviews to all reviews in the actual positive class, answering the question: of all the reviews that are positive, how many did we actually label as such? Calculating precision and recall, the F1 measure for both classifiers is depicted in [Table 4]. As the results show, the F1 score of the LSAR-based classification is better than that of the AFINN-based one, scoring 0.96 and 0.82, respectively.
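Since equations 1-3 are referenced but not reproduced in this text, the standard definitions they correspond to, stated in terms of TP, FP, and FN as defined above, are:

```latex
P   = \frac{TP}{TP + FP}               \quad \text{(equation 2)} \\
R   = \frac{TP}{TP + FN}               \quad \text{(equation 3)} \\
F_1 = 2 \cdot \frac{P \cdot R}{P + R}  \quad \text{(equation 1)}
```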

This article presented the analysis of a social application review dataset, Feed-Finder, by modeling a classifier that automatically assigns a numerical value, or polarity, to each venue based on the reviews of that venue. This classifier incorporates a new lexicon for SA of consumer reviews, LSAR. The implemented work is compared against a classifier based on an already available lexicon, AFINN. Using SVM implemented in R, the results show that the classifier based on the LSAR lexicon achieved better accuracy (0.94) than the AFINN-based classifier (0.72). This is attributed to the LSAR domain-specific lexicon incorporating consumer review terms as opposed to general sentimental terms. As a continuation of this work, implementing a visualization system that maps review polarity on a map would be a key aim, showing location pleasantness in a fraction of the time compared to the textual output.

Acknowledgments

This research is supported by a grant from King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia, with letter number 3213/15. It was hosted by Professor Patrick Olivier, head of the Open Lab in the School of Computing Science at Newcastle University, United Kingdom.

Nielsen F. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In: Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big Things Come in Small Packages. Heraklion, Crete; 2011. p. 93-8.