To Big Or Not To Big

Big data is everywhere. Not surprisingly, it has come to our neck of the woods, too: research in software engineering, programming languages, and computer science in general. I’ve done a fair amount of work with it, and I suspect that will not stop. But do we really need big data? When is “big” really necessary and when is it, well, just a showoff? Here’s a reflection on this topic, just to reinforce the point that more is not always better. It all depends on what we are trying to achieve.

The Truth, the Whole Truth, and Nothing but the Truth

When it comes to deriving knowledge from data, three concepts are important: accuracy, precision, and recall. Accuracy gets at how close to the truth we are; precision gets at how certain we are about what we are seeing; and recall gets at how much of the truth we are seeing. These three concepts are independent: one can have all sorts of combinations among them.

If you have a very large dataset to work with, and you process it in its entirety in search of some facts in it, then you can max out on all three concepts: you literally get the truth (100% accuracy), the whole truth (100% recall), and you can also get nothing but the truth (100% precision) if you remove everything that’s not relevant.

The Truth, but not the Whole Truth

But that may not always be necessary. Sometimes, one can get high accuracy with low recall, meaning that one doesn’t need to process the whole truth in order to get the truth. As an example, look at the picture above. The statement “there’s a giraffe” is 100% accurate without having 100% recall, because we’re not seeing the whole giraffe. But we don’t need to see the whole giraffe to know, with a very small margin of error, that there’s a giraffe there.

That’s the intuition.

Statistics captures this intuition through the concept of sampling, which is used extensively in many fields. Sampling is the art of selecting only a small number of data points out of a much larger population in order to derive knowledge that generalizes to the entire population. There are all sorts of techniques for sampling the right way, including for making sure that the sample is representative of the whole population.
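To make that concrete, here is a minimal sketch of simple random sampling in Python. Everything in it is an assumption made for illustration: the population is synthetic (random "file sizes"), and the function name and sample size are hypothetical. It only shows that a small, randomly chosen sample can land close to a number computed over the whole population.

```python
import random

def sample_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample (no replacement)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    return sum(sample) / len(sample)

# Hypothetical population: 100,000 synthetic values standing in for real data.
gen = random.Random(42)
population = [gen.randint(1, 10_000) for _ in range(100_000)]

true_mean = sum(population) / len(population)   # requires a full pass over the data
estimate = sample_mean(population, 1_000)       # looks at only 1% of the data

print(f"true mean: {true_mean:.1f}, sampled estimate: {estimate:.1f}")
```

The estimate will not match the true mean exactly, but with a representative sample it should be close, and it gets there by touching a small fraction of the data.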

Typically, sampling is used because there is no practical way of collecting data on the whole population. That’s the case, for example, with social and market studies, political polling, medical research, etc. But if we have all the data, as is the case with the huge datasets that are now available everywhere, then why sample it? Why not process the whole thing anyway?

When To Big

One reason to process huge datasets is that they exist. Hey, they’re there, so we can just do it. But that’s not a very good reason, especially because we know (or, at least, we should know!) that we can sample them instead.

A good reason to do it is when we need high recall. That is, when we really need to know the whole truth in order to do something with it.

Here’s an example from my own research group. We are working on a tool that, given a file URL, returns all the duplicates of that file on GitHub. This tool requires 100% recall. There’s no way around it: we need to process the whole thing.

When Not To Big

A good reason to sample, instead of processing the entire dataset, is the good old time-money concern. Processing a terabyte dataset takes a very long time, and requires the attention of some human(s). It also usually requires hardware and software resources that go well beyond commodity computing. It’s very hard to share very large datasets.

When high recall is not necessary, then sampling is a much better alternative. Sampling the data properly can get us the truth much faster. (This is hard for experimental minds to believe, but statistical sampling really works!)

For example, if you are a bank and you want to know the rate of suspicious transactions, sampling the data correctly will give you an accurate answer much faster, and without needing a whole big data infrastructure. Or if you’re a researcher, and you want to know how much duplication there is on GitHub, sampling the data will also get you an accurate answer with a fraction of the overhead required to achieve full recall.
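The bank example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recipe for real fraud analysis: the transactions are synthetic flags generated with an assumed 2% suspicious rate, the function name is hypothetical, and the margin of error uses the textbook normal approximation for a proportion at roughly 95% confidence (z = 1.96).

```python
import math
import random

def estimate_rate(flags, sample_size, seed=0, z=1.96):
    """Estimate a proportion from a simple random sample.

    Returns (rate, margin), where margin is the half-width of an approximate
    95% confidence interval (normal approximation, an assumption here).
    """
    rng = random.Random(seed)
    sample = rng.sample(flags, sample_size)
    rate = sum(sample) / sample_size            # True counts as 1, False as 0
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, margin

# Hypothetical data: 500,000 transactions, each flagged suspicious with
# probability 0.02. Real data would come from the bank's own records.
gen = random.Random(7)
transactions = [gen.random() < 0.02 for _ in range(500_000)]

true_rate = sum(transactions) / len(transactions)   # full pass, for comparison only
rate, margin = estimate_rate(transactions, 10_000)  # 2% of the data

print(f"true rate: {true_rate:.4f}, estimate: {rate:.4f} +/- {margin:.4f}")
```

The margin shrinks with the square root of the sample size, so past a certain point, reading more data buys very little extra accuracy for this kind of question.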

There is another reason for not “bigging,” or for at least being careful about what “big” entails. Sometimes datasets are big for the wrong reasons, resulting in data that is highly skewed towards certain properties. So more is not always better. When it comes to data, we really need to know what’s inside the package.

How To Sample

Sampling is an Art. Like all Arts, it requires a solid set of techniques, in this case statistical techniques. But first and foremost, it requires a clear understanding of what the analytical goal is. That goal drives the sampling.

More on sampling: I liked this paper discussing the Art of sampling in software engineering research.