Unfortunately, the data is in Windows CP1252, and the Stanford Classifier doesn't yet support specifying a character encoding (unlike most of the rest of our packages). So, we'll convert it to UTF-8.
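One way to do the conversion is sketched below in Python. This is only an illustration, not part of the Stanford Classifier: the function names are made up for this example, and any file paths you pass in are your own.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode CP1252 bytes as UTF-8 bytes.

    Python's cp1252 codec is strict: the few byte values that CP1252
    leaves undefined (such as 0x81) raise UnicodeDecodeError here rather
    than being silently guessed at.
    """
    return data.decode("cp1252").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as CP1252 and write the same text to dst_path as UTF-8.

    Both paths are supplied by the caller; nothing here assumes a
    particular file name. Binary mode leaves line endings untouched.
    """
    with open(src_path, "rb") as fin:
        data = fin.read()
    with open(dst_path, "wb") as fout:
        fout.write(cp1252_to_utf8(data))
```

Calling convert_file on each data file, writing to a new path each time, keeps the original CP1252 files around in case something goes wrong.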

Short sentiment snippets (the Kaggle competition version of the Stanford Sentiment Treebank)

This example uses the same Rotten Tomatoes data, but in the form of sentiment judgments on the constituents of a parse of each example. These judgments were originally collected for the Stanford Sentiment Treebank, and were also distributed as a Kaggle competition.

If you download and unpack the data, you will have files train.tsv and test.tsv [really a devtest set, hopefully]. These both have a header row, which the Stanford Classifier doesn't by default know how to ignore, so you should edit the two files and delete the first row entirely.
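If you would rather not edit the files by hand, the header row can be dropped with a few lines of Python. This is a minimal sketch: the function names are invented for this example, and it assumes the files are UTF-8 text with the header on the first line.

```python
def drop_header(text: str) -> str:
    """Return text with everything up to and including the first newline removed."""
    _header, _newline, rest = text.partition("\n")
    return rest


def strip_header_row(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path without its first line.

    newline="" stops Python from translating line endings, so the
    remaining rows are written back exactly as they were read.
    """
    with open(src_path, encoding="utf-8", newline="") as fin:
        text = fin.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as fout:
        fout.write(drop_header(text))
```

Writing to a separate output path, rather than overwriting in place, means that running the script twice does not delete a second row by accident.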

The data is a TSV file with 4 columns: columns 0 and 1 are the phrase ID and sentence ID, and columns 2 and 3 give the phrase and its sentiment score (from 0 through 4, ranging from negative to positive).
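To make the column layout concrete, here is a small Python sketch that parses one labeled row and tallies the score distribution. The Row type and function names are hypothetical, chosen for this example, and the sample phrases in the usage note are made up rather than taken from the data.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    phrase_id: str    # column 0
    sentence_id: str  # column 1
    phrase: str       # column 2
    sentiment: int    # column 3, from 0 (negative) to 4 (positive)


def parse_row(line: str) -> Row:
    """Split one tab-separated line into its four columns."""
    phrase_id, sentence_id, phrase, sentiment = line.rstrip("\r\n").split("\t")
    score = int(sentiment)
    if not 0 <= score <= 4:
        raise ValueError(f"sentiment score out of range: {score}")
    return Row(phrase_id, sentence_id, phrase, score)


def score_distribution(lines: Iterable[str]) -> Counter:
    """Count how many rows carry each sentiment score."""
    return Counter(parse_row(line).sentiment for line in lines)
```

For example, score_distribution(["1\t1\ta fine film\t3\n", "2\t1\tfine\t3\n", "3\t2\tdull\t1\n"]) counts two rows with score 3 and one with score 1. A quick look at this distribution on the training file is a cheap sanity check before training anything.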

Since this data was already tokenized (with the Stanford Tokenizer), we can probably just use whitespace tokenization. (Unless maybe we wanted to try being clever like splitting on some hyphens.)
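As an illustration of what these two options mean, here is a short Python sketch of plain whitespace tokenization plus one possible "clever" hyphen rule. It is not how the Stanford Classifier tokenizes internally, and the rule itself (split only hyphens that sit between two ASCII letters) is an assumption made for this example.

```python
import re
from typing import List

# A hyphen with an ASCII letter on each side, e.g. the one in "well-made".
# Tokens such as "--" or "-LRB-" do not match and are left intact.
_INNER_HYPHEN = re.compile(r"(?<=[A-Za-z])(-)(?=[A-Za-z])")


def whitespace_tokenize(phrase: str) -> List[str]:
    """Split on runs of whitespace, which suits already-tokenized text."""
    return phrase.split()


def split_inner_hyphens(tokens: List[str]) -> List[str]:
    """Break tokens at letter-hyphen-letter boundaries, keeping the hyphen as a token."""
    out: List[str] = []
    for tok in tokens:
        out.extend(part for part in _INNER_HYPHEN.split(tok) if part)
    return out
```

Whether the extra hyphen splitting helps is an empirical question; it gives the classifier access to the parts of compounds at the cost of losing the compound as a single feature.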

So, here's our first attempt at a properties file. We will do the initial experiments using 10-fold cross-validation on the training data, since we don't have answers for the devtest data. This properties file gives us a Naive Bayes model: