I am a noob to sentiment analysis and found a good resource on Bayesian opinion mining, along with a way to make it self-improving. I was wondering, though: if the optimum analysis depends on the supplied data set, and self-improvement means adding known patterns to that data set (my understanding), wouldn't the application become overloaded with a huge data set over time, with more and more patterns being added every day? What would be the proper approach to making the application scalable (if I'm using the right term in the right place)?

2 Answers

It sounds to me that you are building a text classifier with a supervised
training stage at the beginning, where you assign labels manually. Your model
is performing well (high precision and recall), so you want to supplement the
initial training model with an unsupervised training process over new input
strings.
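For concreteness, here's a minimal sketch of that supervised stage, assuming a bag-of-words Naive Bayes classifier with Laplace smoothing. The class name, toy data, and labels are all illustrative, not taken from the article:

```python
from collections import Counter
import math

class NaiveBayes:
    """Minimal bag-of-words Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.label_counts = Counter(labels)
        self.word_counts = {l: Counter() for l in self.labels}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def _log_prob(self, doc, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # log P(label) + sum of log P(word | label), smoothed
        lp = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for w in doc.split():
            lp += math.log((counts[w] + 1) / (total + len(self.vocab)))
        return lp

    def predict(self, doc):
        return max(self.labels, key=lambda l: self._log_prob(doc, l))

clf = NaiveBayes().fit(
    ["great movie loved it", "terrible plot boring",
     "loved the acting", "boring and terrible"],
    ["pos", "neg", "pos", "neg"],
)
print(clf.predict("loved it"))  # prints: pos
```

The manually assigned labels ("pos"/"neg" here) are the supervised part; precision and recall are then measured against a held-out labeled set.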

These new inputs will have some known signals (words you've seen before) so your
model can do its job well, but they will also have unknown signals (words you
haven't seen before). You want your unsupervised training process to associate those new words with known ones, i.e. to "learn". In doing so, you are trusting that each association between a new word and a known word is correct. Because language processing is so difficult, you will probably generate false positive associations automatically that would have been excluded or corrected in a supervised environment. Thus, by doing unsupervised learning you are risking lowering your precision.

Your question is about being "overloaded" with lots of data. This is a fair
concern, and depends very much on your data size, implementation choice, and
system behavior expectations. While responsiveness and the tractability of handling large quantities of data are one concern, I feel that the precision and recall of your sentiment-labeling algorithm are of greatest importance.

In the article you linked, the author uses a confidence score so that unsupervised associations are considered only when confidence is high. This is good, but there's still a risk that your overall precision will drop over time. Your system would have to be periodically evaluated for precision and recall, and re-trained; the "Bad Santa" example in the comments illustrates this well. I suggest you read about semi-supervised training and get this labeling right on small data sets before trusting it to work well on much larger ones. Language processing is hard!
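A confidence-thresholded self-training loop can be sketched as follows. This is my own illustration, not the article's implementation: the threshold value, function names, and toy data are all made up, and the model is a tiny Laplace-smoothed Naive Bayes:

```python
from collections import Counter
import math

def train(docs, labels):
    """Return (label counts, per-label word counts, vocab)."""
    priors, words = Counter(labels), {l: Counter() for l in set(labels)}
    for d, l in zip(docs, labels):
        words[l].update(d.split())
    vocab = {w for c in words.values() for w in c}
    return priors, words, vocab

def posterior(doc, priors, words, vocab):
    """Normalized P(label | doc) under smoothed Naive Bayes."""
    logp = {}
    for l, counts in words.items():
        total = sum(counts.values())
        lp = math.log(priors[l])
        for w in doc.split():
            lp += math.log((counts[w] + 1) / (total + len(vocab)))
        logp[l] = lp
    z = max(logp.values())
    norm = sum(math.exp(v - z) for v in logp.values())
    return {l: math.exp(v - z) / norm for l, v in logp.items()}

def self_train(docs, labels, unlabeled, threshold=0.7):
    """Pseudo-label unlabeled docs, keeping only high-confidence guesses."""
    docs, labels = list(docs), list(labels)
    for doc in unlabeled:
        model = train(docs, labels)          # re-train on data so far
        probs = posterior(doc, *model)
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:                # only trust confident guesses
            docs.append(doc)
            labels.append(label)
    return docs, labels

docs = ["great movie loved it", "terrible plot boring"]
labels = ["pos", "neg"]
new_docs, new_labels = self_train(docs, labels, ["loved the movie", "odd film"])
# "loved the movie" clears the threshold and is added as "pos";
# "odd film" (all unseen words) does not, and is rejected.
print(new_labels)  # prints: ['pos', 'neg', 'pos']
```

The risk discussed above lives in that `if conf >= threshold` line: a confidently wrong pseudo-label gets baked into the training set, which is why periodic re-evaluation matters.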

For other tasks, such as part-of-speech tagging, condensation after self-training has made the model both smaller and better! If you identify a scalability issue, look into this approach before trying to optimize your code.

The idea is that after self-training, you iteratively build a new model that starts out empty: you add a data point to the new model only if the current model classifies it incorrectly. This avoids overfitting and keeps your model as small as it can be.
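That condensation pass can be sketched like this, in the spirit of condensed nearest neighbour. The similarity measure (word overlap), function names, and toy data are my own illustrative choices, not from the answer above:

```python
def similarity(a, b):
    """Word-overlap (Jaccard) similarity between two documents."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / (len(wa | wb) or 1)

def predict(model, doc):
    """1-nearest-neighbour label under word overlap; None if model is empty."""
    if not model:
        return None
    return max(model, key=lambda item: similarity(item[0], doc))[1]

def condense(docs, labels):
    model = []                               # the new model starts empty
    for doc, label in zip(docs, labels):
        if predict(model, doc) != label:     # keep only misclassified points
            model.append((doc, label))
    return model

full = [
    ("great movie loved it", "pos"),
    ("loved the movie", "pos"),       # already classified correctly: dropped
    ("terrible plot boring", "neg"),
    ("boring and terrible", "neg"),   # already classified correctly: dropped
]
condensed = condense([d for d, _ in full], [l for _, l in full])
print(len(condensed))  # prints: 2
```

Redundant points that the partial model already gets right are never stored, so the condensed model stays as small as the decision boundary allows.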