CS8761 - Natural Language Processing
Date : 12/16/2002
Algorithm for sentiment classification
======================================
UNION MINAS
Bridget Thomson McInnes, bthomson@d.umn.edu
Deodatta Y Bhoite, bhoi0001@d.umn.edu
Yanhua Li, lixx0333@d.umn.edu
Nitin Agarwal, agar0054@d.umn.edu
Kailash Aurangabadkar, aura0011@d.umn.edu
The problem is to design an algorithm that performs sentiment classification
of reviews. The algorithm to be implemented by Union Minas employs three
main resources: Longman's Machine Readable Dictionary (LDOCE), Macquarie's
Machine Readable Dictionary (Big Mac), and the World Wide Web (WWW). LDOCE
is accessed using the Perl module Minas::LDOCE, Big Mac is accessed using
the module Minas::BigMac, and the WWW is accessed using the module
Minas::WebReader.
The algorithm for sentiment classification consists of three stages, one
stage per resource. Each stage may be broken down into parts, and each part
returns one of three results: good, bad or unknown. The results are then
combined using an ensemble approach to obtain the final classification.
Stage 1: (Big Mac)
The first stage of the algorithm consists of two parts, Part I and
Part II. Part I uses the thesaurus functionality provided by the
BigMac interface, whereas Part II uses the words-in-definition
functionality provided by the same interface.
Part I : Each class in the Big Mac thesaurus has been tagged with a
positive, negative or neutral sentiment. A review is evaluated by
inspecting each adjective in the review and determining its class in
the thesaurus; the tag on that class gives the sentiment for the
word. Once every adjective in the review has been assigned a sentiment,
the positive and negative sentiments are tallied to determine the
possible sentiment for the review.
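The tally in Part I can be sketched as follows. This is a minimal Python
illustration, not the project's Perl code: the WORD_SENTIMENT table is a
hypothetical stand-in for the lookup through the tagged thesaurus classes,
and returning "unknown" on a tie is an assumption.

```python
from collections import Counter

# Hypothetical stand-in for "look up the adjective's thesaurus class and
# read the sentiment tag of that class". The entries are made up.
WORD_SENTIMENT = {
    "wonderful": "positive",
    "dull": "negative",
    "long": "neutral",
}


def classify_by_tally(adjectives, word_sentiment=WORD_SENTIMENT):
    """Return 'good', 'bad' or 'unknown' from a tally of word sentiments."""
    # Words that are not in the table are treated as neutral (assumption).
    counts = Counter(word_sentiment.get(w.lower(), "neutral") for w in adjectives)
    if counts["positive"] > counts["negative"]:
        return "good"
    if counts["negative"] > counts["positive"]:
        return "bad"
    return "unknown"  # tie or no sentiment-bearing words (assumption)
```

For example, classify_by_tally(["wonderful", "long", "dull", "wonderful"])
counts two positive words against one negative word.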
Part II : Part II partially implements the distance vector concept
described in [NN1994]. A word is at distance one from an origin word
if the origin word occurs in the definition of that word. We choose
a small set of positive and negative origin words and find all the words
at unit distance from them. This builds two lists, one of positive words
and one of negative words. We scan the review for these words and
classify it as positive or negative according to the list that
contributes the larger number of matches.
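A rough Python sketch of this list building and matching is given here.
The definitions mapping (headword to the words in its definition) is a
hypothetical stand-in for what the BigMac interface returns, and the
"unknown" result on a tie is an assumption.

```python
def words_at_unit_distance(seed_words, definitions):
    """Collect headwords whose definition contains at least one seed word.

    definitions: dict mapping a headword to the list of words in its
    definition (hypothetical stand-in for the dictionary lookup).
    """
    seeds = {s.lower() for s in seed_words}
    return {
        headword
        for headword, definition_words in definitions.items()
        if seeds & {w.lower() for w in definition_words}
    }


def classify_by_lists(review_words, positive_list, negative_list):
    """Classify a review by which list matches more of its words."""
    tokens = [w.lower() for w in review_words]
    positive_hits = sum(1 for w in tokens if w in positive_list)
    negative_hits = sum(1 for w in tokens if w in negative_list)
    if positive_hits > negative_hits:
        return "good"
    if negative_hits > positive_hits:
        return "bad"
    return "unknown"  # tie or no matches (assumption)


# Toy definitions, made up for illustration only.
toy_definitions = {
    "superb": ["very", "good"],
    "awful": ["very", "bad"],
    "table": ["a", "piece", "of", "furniture"],
}
positive_list = words_at_unit_distance(["good"], toy_definitions)
negative_list = words_at_unit_distance(["bad"], toy_definitions)
```

With the toy data, positive_list contains only "superb" and negative_list
contains only "awful".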
Stage 2: (LDOCE)
The second stage of the algorithm consists of two parts, Part I and
Part II. Each part utilizes a different aspect of LDOCE.
Part I : Part I of Stage 2 utilizes the active codes located in
LDOCE. Each active code is tagged with a positive, negative or
neutral sentiment. A review is evaluated by inspecting each word in
the review and determining its active code, which in turn gives the
sentiment for that word. Once every word in the review has been
assigned a sentiment, the positive and negative sentiments are tallied
to determine the possible sentiment for the review.
Part II : Part II of Stage 2 uses LDOCE to determine whether or not
a word in the review is an adjective. If the word is an adjective,
the file containing the classification of adjectives is consulted to
determine the word's sentiment.
After each word in the review has been examined, the positive and
negative sentiments are tallied to determine the possible sentiment
for the review. A basic negation algorithm is also implemented to support
negation of adjectives: if an adjective is preceded by "not", its
class is negated.
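The negation rule can be illustrated with a short Python sketch. The
adjective_sentiment mapping is a hypothetical stand-in for the LDOCE
adjective check combined with the adjective classification file, and only
an immediately preceding "not" is handled, as described above.

```python
def tally_with_negation(tokens, adjective_sentiment):
    """Tally positive/negative adjectives, flipping those preceded by 'not'.

    adjective_sentiment: dict mapping an adjective to 'positive' or
    'negative' (hypothetical stand-in for the classification file).
    """
    flipped = {"positive": "negative", "negative": "positive"}
    counts = {"positive": 0, "negative": 0}
    for i, token in enumerate(tokens):
        sentiment = adjective_sentiment.get(token.lower())
        if sentiment not in counts:
            continue  # not a classified adjective
        if i > 0 and tokens[i - 1].lower() == "not":
            sentiment = flipped[sentiment]
        counts[sentiment] += 1
    return counts


# Made-up classification entries for illustration.
toy_adjectives = {"good": "positive", "great": "positive", "bad": "negative"}
tokens = "the plot was not good but the acting was great".split()
counts = tally_with_negation(tokens, toy_adjectives)
```

Here "not good" is counted as negative and "great" as positive, so the
toy review ends with one sentiment of each kind.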
Stage 3: (WebReader)
The third stage of the algorithm utilizes the World Wide Web through
the Minas::WebReader Perl module. A review is evaluated by inspecting
each adjective in the review, called the 'review word', and
determining its association with the words 'excellent' and 'horrible',
called the 'set words'. This will be implemented by querying
the web for:
query1. the set word (np1)
query2. the review word (n1p)
query3. the set word and the review word (n11)
The value of npp is taken to be the number of hits for 'excellent' plus
the number of hits for 'horrible'. It could instead be taken as the total
number of documents indexed by the search engine, but since that number
is difficult to find out, we use the former.
The number of documents returned by each of these queries
acts as a count, allowing the following contingency table to
be created:
                set word | !set word
              -----------------------
              |          |          |
review word   |   n11    |   n12    |  n1p
              |          |          |
              -----------------------
              |          |          |
!review word  |   n21    |   n22    |  n2p
              |          |          |
              -----------------------  ----
                  np1        np2    |  npp

where n11 = query3,
      n1p = query2,
      np1 = query1.
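The remaining cells follow from the three query counts and npp by
subtraction. The Python sketch below shows that arithmetic; the function
name and the example counts are made up for illustration.

```python
def contingency_table(n11, n1p, np1, npp):
    """Fill in the 2x2 table from the joint count, the marginals and the total."""
    n12 = n1p - n11  # review word without the set word
    n21 = np1 - n11  # set word without the review word
    n2p = npp - n1p  # documents without the review word
    np2 = npp - np1  # documents without the set word
    n22 = n2p - n21  # neither word
    return {
        "n11": n11, "n12": n12, "n1p": n1p,
        "n21": n21, "n22": n22, "n2p": n2p,
        "np1": np1, "np2": np2, "npp": npp,
    }


# Made-up hit counts, for illustration only.
table = contingency_table(n11=120, n1p=1000, np1=5000, npp=9000)
```

Because npp is only an approximation of the collection size, a very
frequent review word can make n2p or n22 negative; the sketch does not
guard against that.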
With this contingency table, different measures of association
can be calculated. This is similar to the PMI-IR measure described
by [Tur2002] for sentiment classification. However, we use the t-score,
Poisson's measure and the Dice coefficient to measure the association of
the review word with "excellent" or "horrible".
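As an illustration, two of these measures can be computed directly from
the table counts. The Python sketch below uses the usual textbook
definitions (Dice = 2*n11 / (n1p + np1), t-score = (n11 - m11) / sqrt(n11)
with expected count m11 = n1p*np1 / npp), which are assumed to match the
ones intended here. Poisson's measure is left out because its exact form
is not given in this document. Returning 0.0 for degenerate counts is
also an assumption.

```python
import math


def expected_n11(n1p, np1, npp):
    """Expected joint count if the two words were independent."""
    return n1p * np1 / npp


def t_score(n11, n1p, np1, npp):
    """t-score: (observed - expected) / sqrt(observed)."""
    if n11 == 0:
        return 0.0  # undefined for n11 = 0; treated as no evidence (assumption)
    return (n11 - expected_n11(n1p, np1, npp)) / math.sqrt(n11)


def dice(n11, n1p, np1):
    """Dice coefficient: 2 * n11 / (n1p + np1)."""
    if n1p + np1 == 0:
        return 0.0  # no occurrences at all (assumption)
    return 2 * n11 / (n1p + np1)


def closer_set_word(score_excellent, score_horrible):
    """Pick the set word with the higher association score, or None on a tie."""
    if score_excellent > score_horrible:
        return "excellent"
    if score_horrible > score_excellent:
        return "horrible"
    return None
```

Each measure is computed once per set word, with np1 set to the hit count
of that set word, and the two scores are then compared.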
We also use a variation of the ensemble technique described in [Var2002]
to combine the results generated by the various tests/measures of
association. Each measure/test of association returns an association
class (excellent/horrible) for the word. The class on which the most
measures agree is assigned to the word.
The review is then classified according to the number of its words that
belong to each class.
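The two voting steps can be sketched in Python as follows. This is a plain
majority vote written for illustration; it is not claimed to reproduce the
technique of [Var2002], and the handling of ties (no class for the word,
"unknown" for the review) is an assumption.

```python
from collections import Counter


def vote_word(measure_classes):
    """Return the class most measures agree on for one word, or None on a tie.

    measure_classes: list with one 'excellent' or 'horrible' per measure.
    """
    counts = Counter(measure_classes)
    if counts["excellent"] > counts["horrible"]:
        return "excellent"
    if counts["horrible"] > counts["excellent"]:
        return "horrible"
    return None


def classify_review(per_word_measure_classes):
    """Classify a review from the per-word votes of all its review words."""
    word_classes = Counter(vote_word(m) for m in per_word_measure_classes)
    if word_classes["excellent"] > word_classes["horrible"]:
        return "good"
    if word_classes["horrible"] > word_classes["excellent"]:
        return "bad"
    return "unknown"


# Three review words, three measures each (made-up outcomes).
review_votes = [
    ["excellent", "excellent", "horrible"],
    ["horrible", "horrible", "horrible"],
    ["excellent", "horrible", "excellent"],
]
result = classify_review(review_votes)
```

In the toy input two of the three words are assigned to 'excellent', so
the review is classified as good.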
The results from each of the stages will be weighted, and a final result
will be tallied and sent to standard output.
The weight given to each stage's results will be determined by the
accuracy of that stage's individual results. The movie data will be used as a
baseline for determining the accuracy of each stage's results.
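One way such a weighted tally could look is sketched below in Python. The
stage names and weights are hypothetical placeholders, not measured
accuracies, and the signed-sum scheme (good adds the weight, bad subtracts
it, unknown contributes nothing) is an assumption about how the results
might be combined.

```python
def combine_stages(stage_results, stage_weights):
    """Combine per-stage results into one label using a weighted signed sum.

    stage_results: dict mapping stage name to 'good', 'bad' or 'unknown'.
    stage_weights: dict mapping stage name to a non-negative weight.
    """
    score = 0.0
    for stage, result in stage_results.items():
        weight = stage_weights.get(stage, 0.0)
        if result == "good":
            score += weight
        elif result == "bad":
            score -= weight
        # 'unknown' contributes nothing
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "unknown"


# Hypothetical weights, for illustration only.
weights = {"bigmac": 0.6, "ldoce": 0.7, "web": 0.65}
final = combine_stages(
    {"bigmac": "good", "ldoce": "bad", "web": "unknown"}, weights
)
print(final)
```

With these placeholder weights the LDOCE stage outweighs the Big Mac
stage, so the combined score is negative.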
More information on this version of the design can be found in the README
file.
References
------------------------------------------------------------------------------
[NN1994] Y. Niwa and Y. Nitta. Co-occurrence vectors from corpora vs. distance
vectors from dictionaries. Proceedings of COLING'94. 1994.
[Tur2002] P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews. Proceedings of ACL'02. 2002.
[Var2002] N. Varma. Identifying Word Translations in Parallel Corpora using
Measures of Association. Master's thesis, University of Minnesota,
December 2002.
COPYRIGHT AND LICENSE
------------------------------------------------------------------------------
Copyright (C) 2002 Bridget Thomson, Deodatta Y Bhoite, Kailash Aurangabadkar,
Nitin Agarwal, Yanhua Li.
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.