The Big Data Long Tail

Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.

The Long Tail: the large number of occurrences far from the 'head' or central part of a distribution of popularities, probabilities or such.

Take these two buzz phrases and put them together, and what do you get?

I ran into a discussion of the Big Data Long Tail at a recent conference, where scientists were sharing their Big Data challenges and successes. In data-centric sciences such as genomics or astronomy, they routinely deal with the 'head' of Big Data: large data sets that they intentionally collect, store, manage, and analyze. Harder to analyze, but every bit as much a Big Data problem, is the Long Tail of Big Data: data on individual researchers' laptops or other systems scattered about in offices, under desks and in laptop bags.

Taken individually, the data on these systems are important but not plentiful enough to be Big Data. But consider all such systems en masse, and now you have an especially knotty Big Data problem. Not only are the data formats a jumble, but so are the metadata, not to mention the challenge of constructing and running algorithms across all these systems, potentially scattered around the world.

The business world faces its own Big Data Long Tail problems. Yes, you may have large data sets that you intentionally collect and analyze for business purposes, whether you are analyzing customer purchasing behavior, the movement of the stock market, or whatever Big Data are important to your business. But what about the data your business keeps in, say, Excel spreadsheets?

Virtually every computer in your enterprise has a passel of spreadsheets on it. People upload some to your portal, while other spreadsheets remain hidden away on individuals' laptops. Is there value in those spreadsheets? Indubitably. So, what would it take to consider all the spreadsheets everywhere in your organization as a single Big Data set, so that you can gain intelligence from such a collection? Now we're talking about a true Big Data problem: one that today's tools are woefully inadequate to solve.