ggplot2: How to adequately capture spread of data in plot

Monday, May 5, 2014

Using Stanford's sentiment classification routine (part of the CoreNLP toolkit), I am trying to plot the "sentiment" of each sentence in a given document. The output of the classifier consists of five columns and n rows.

Each row represents a sentence in the file fed to the classifier, and each column in that row holds the algorithm's confidence value for one sentiment class. The first column is the confidence that the sentence has "very negative" sentiment, the second that it has "somewhat negative" sentiment, the third that it has "no sentiment" (i.e. is descriptive), the fourth that it is "positive", and the fifth that it is "very positive".

For each row in the data, it's easy enough to identify the column with the maximum value and then plot those values in sequential order, using negative values if the maximum falls in the first or second column, zero if it falls in the third column, and positive values if it falls in the fourth or fifth column.
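The mapping just described can be sketched in a few lines. This is only an illustration of the logic, written in Python rather than R; the function name `signed_sentiment` and the sample confidence values are made up for the example and are not taken from my actual classifier output.

```python
def signed_sentiment(row):
    """Collapse one row of five confidence values into a single signed score."""
    if len(row) != 5:
        raise ValueError("expected five confidence values per sentence")
    # Index of the largest confidence value (the first one wins on ties).
    top = max(range(5), key=lambda i: row[i])
    if top < 2:       # "very negative" or "somewhat negative"
        return -row[top]
    if top == 2:      # "no sentiment"
        return 0.0
    return row[top]   # "positive" or "very positive"


# Hypothetical confidence values, one row per sentence.
rows = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.50, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.20, 0.60],
]

scores = [signed_sentiment(r) for r in rows]
print(scores)
```

Each sentence ends up as one number, which is exactly why the other four confidence values in the row disappear from the plot.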

If I only plot the sentiment score with the largest confidence value in each row, though (which is what I've done in the plot above), I end up throwing out four of the five columns of data for each row. Is it possible to represent all the columns of this data in a reasonably intuitive fashion using ggplot2? I realize that this question is borderline off-topic on SO, but I thought that others with more familiarity with ggplot2 (and dataviz more broadly) might be able to point me towards a better visualization method for my data structure. In any event, I would be eager to hear others' thoughts on this question.