More data isn't always a good thing.

One of the questions we face in mining text data is: how much data do we really need to draw useful conclusions?

In text mining it seems obvious that we should use all the data we can get our hands on when drawing conclusions. The temptation is always to use the broadest possible query to select the data set, because we don't want to miss anything that might be important. The problem with such an all-inclusive strategy is that it often adds noise that obscures the signal we are trying to detect.
So, for instance, if I'm doing a study for a chocolate candy manufacturer and simply enter the query "chocolate", the vast majority of the data I collect will have nothing whatever to do with chocolate candy. This will make it much harder to detect the relevant trends and themes related to chocolate candy, because they will be obscured by unrelated topics such as the color chocolate or chocolate ice cream. So the query "chocolate candy" might actually make more sense, even though it leaves out a lot of relevant data. As long as we have enough data, adding more that is mostly irrelevant can actually make our analysis less effective.

But how much data is enough? The answer may surprise you. It doesn't take as much data as you might think to spot a potentially interesting trend or correlation. To see why, let's try a simple thought experiment. Say we are given a coin and told that it may or may not be "loaded", where a loaded coin is one that nearly always comes up heads when flipped, whereas a fair coin comes up heads only half the time. How many flips of the coin does it take to conclude, with 99% confidence, that the coin is loaded? The answer is seven: a fair coin produces one head with probability 1/2, two heads in a row with probability 1/4, and so on, so after seven heads in a row the probability is 1/128, or about 0.8%, which is below 1%. So in this simple experiment I only needed seven data points to tell that something was probably amiss with the coin.
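To make the arithmetic concrete, here is a minimal sketch (in Python, which I'm using here purely for illustration) that finds the smallest run of consecutive heads a fair coin would produce with probability below 1%:

```python
# Probability that a fair coin produces k heads in a row is (1/2)**k.
# Find the smallest k where that probability drops below 1%,
# i.e. where we can be at least 99% confident the coin is loaded.

def flips_needed(confidence: float = 0.99) -> int:
    k = 0
    p_fair = 1.0  # probability a fair coin matches the observed run of heads
    while p_fair > 1.0 - confidence:
        k += 1
        p_fair = 0.5 ** k
    return k

if __name__ == "__main__":
    print(flips_needed())  # 7, since (1/2)**7 = 1/128 ≈ 0.0078 < 0.01
```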

But if seven examples are enough to draw a conclusion from a simple experiment, why do we typically use thousands of examples to draw conclusions from text? There are actually a couple of reasons. Partly it's because we frequently don't get to design our experiment before the data is generated. We basically have to take whatever data is given to us, and some of it is certain to be redundant or irrelevant for our purposes. The other issue is that we usually aren't trying to answer a single yes/no question (e.g., "is the coin loaded or not?") but rather are looking across thousands of potential features and correlations to find a handful that are potentially interesting. When you have to cover more bases, you naturally need more data to do it with.
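A rough way to see why covering more bases demands more data: when we screen many features at once, the chance that at least one of them looks interesting purely by luck grows quickly. The sketch below (again just a Python illustration of that point, not anything from an actual study) compares a single test against a screen of 1,000 features at the same 1% threshold:

```python
# With a single test at a 1% threshold, the chance of a false positive is 1%.
# Screening many features at once multiplies the opportunities for noise
# to look like signal, so we need stricter thresholds or more data.

def prob_any_false_positive(n_features: int, alpha: float = 0.01) -> float:
    """Chance that at least one of n independent, uninteresting features
    crosses the significance threshold purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_features

if __name__ == "__main__":
    print(prob_any_false_positive(1))     # 0.01   -> 1% for a single test
    print(prob_any_false_positive(1000))  # ~0.99996 -> a spurious "finding" is nearly certain
```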

So the better and more relevant the data, and the more focused the subject of the analysis, the less data you actually need to get an accurate picture. Typically, when I get a fairly focused set of short documents (paragraphs) that are relevant to the subject under study, I can usually get a pretty good picture of between 25 and 50 themes using between 1,000 and 10,000 documents. Around 500 documents usually turns out to be too small a set to be interesting (it might even be easier to read the documents one by one than to analyze them using text mining techniques). Once I get above 100,000 documents, I will usually either sample the data or divide it into smaller chunks using some other feature of interest.
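When a collection does grow past that point, a simple random sample is often enough to bring it back into a workable range. Here is a minimal sketch; the 100,000 threshold and 10,000 target are just the rough working numbers from above, not a fixed rule:

```python
import random

# If the collection is larger than we can usefully analyze, take a random
# sample; otherwise keep everything as-is.

def maybe_sample(documents: list[str], max_docs: int = 100_000,
                 sample_size: int = 10_000, seed: int = 42) -> list[str]:
    if len(documents) <= max_docs:
        return documents
    random.seed(seed)
    return random.sample(documents, sample_size)
```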

The moral of the story is that adding more data is not a panacea. Being thoughtful about what you want to study and why, and then carefully selecting data that is relevant to those objectives, will produce much better results in the end.