HPC Wire Posted blog, “Big Data Challenge: Data Paring”

I think a lot about big data and the challenges it poses. I started thinking about it long before I ever heard the term, when I heard and read one of the conclusions of the post-mortem for the events on and around September 11th, 2001: that we had the information to realize that this attack was coming, but we simply didn’t analyze the data fast enough. That has always stuck with me and kept the problem on my mind.

One of the well-known challenges for big data is that you have to pay a lot more attention to where the data is now. This isn’t a new challenge—I remember a customer laughing as he told me how 15ish years ago his team would express mail hard drives from one site to another because the postal service was faster than the network transfer—but the size of data is growing so quickly that this is changing from a fringe concern to a core concern.

You always hear about the need to localize computational resources and make intelligent data staging decisions, but one dimension of this problem that deserves more discussion is data paring. The need for this is fairly obvious: data is growing exponentially, and growing your compute resources exponentially to keep up will require budgets that aren’t realistic. One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t; I would bet that for most purposes the interesting data is a tiny percentage of the new data added to the overall pool.

To illustrate, let’s suppose I’m an online media streaming provider attempting to predict what you’d be interested in seeing based on what you’re watching now. This is an incredibly difficult machine-learning problem. Every time you watch some content, it has to be cross-referenced with everything else you have watched, potentially creating hundreds, thousands, or even more new combinations that can be used to predict what else you might like to see.

These are then compared with all of the other empirical data from all other customers to determine the likelihood that you might also want to watch the sequel, other work by the director, other work from the stars in the movie, things from the same genre, etc. As I perform these calculations, how much data should be ignored? How many accounts aren’t using multiple user profiles, so that their history doesn’t represent any one person’s interests? How many data points aren’t related to other data points and therefore shouldn’t be given the same weight as a meaningful combination?
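As a rough illustration of what paring could look like for the streaming example above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the thresholds, the function names, the sample titles, and especially the heuristic that a profile spanning many genres is probably shared by several people. It drops those suspect profiles first, then counts co-watched pairs and discards pairs that too few profiles support.

```python
from collections import Counter
from itertools import combinations

# Hypothetical thresholds, chosen only for illustration.
MAX_GENRES_PER_PROFILE = 3  # more distinct genres than this hints at a shared profile
MIN_PAIR_SUPPORT = 2        # pairs co-watched by fewer profiles are treated as noise


def pare_profiles(history, genres, max_genres=MAX_GENRES_PER_PROFILE):
    """Keep only profiles whose viewing spans at most max_genres genres."""
    kept = {}
    for profile, titles in history.items():
        if len({genres[t] for t in titles}) <= max_genres:
            kept[profile] = titles
    return kept


def co_watch_counts(history, min_support=MIN_PAIR_SUPPORT):
    """Count profiles per co-watched title pair, dropping rarely seen pairs."""
    counts = Counter()
    for titles in history.values():
        # sorted(set(...)) gives each unordered pair one canonical form
        for pair in combinations(sorted(set(titles)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Made-up sample data.
genres = {"Alien": "sci-fi", "Aliens": "sci-fi", "Blade Runner": "sci-fi",
          "Cars": "kids", "Heat": "crime", "Chef": "comedy"}
history = {
    "p1": ["Alien", "Aliens"],
    "p2": ["Alien", "Aliens", "Blade Runner"],
    "p3": ["Alien", "Cars", "Heat", "Chef"],  # four genres: treated as shared
    "p4": ["Alien", "Blade Runner"],
}

pared = pare_profiles(history, genres)
pairs = co_watch_counts(pared)
print(pairs)
```

The point of the sketch is the order of operations: the cheap filters run before the expensive pairwise step, so the combinations that never get generated are the ones that would have been noise anyway.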

Answering these questions through paring and sifting algorithms is a dimension of Big Data that will only grow in significance over time. Data capture will always be fundamentally faster and easier than data analysis, and data will continue to multiply faster than bunny rabbits. Not wasting time on irrelevant data will be one of the keys to staying ahead of the competition.

The scientific community has been working out how to remove irrelevant data for a long time, so long that the term “outlier” has gone mainstream. As Big Data moves to the forefront, organizations that can adapt these techniques to ignore outliers and draw intelligent conclusions from more highly correlated data are going to lead the way.
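For a concrete taste of one such technique, the sketch below uses the modified z-score, a common rule of thumb built on the median and the median absolute deviation (MAD), with the conventional 0.6745 scaling constant and 3.5 cutoff. It is only one of many outlier tests and is not tied to any particular product or dataset; the function name and the sample numbers are made up for illustration.

```python
from statistics import median


def drop_outliers(values, threshold=3.5):
    """Return the values whose modified z-score is within threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median, so nothing can be scored; keep everything.
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


# Made-up watch times in minutes; 600 might be a player left running overnight.
watch_minutes = [42, 45, 47, 44, 46, 43, 600]
kept = drop_outliers(watch_minutes)
print(kept)
```

Using the median instead of the mean matters here: a single extreme value like 600 drags the mean and standard deviation toward itself, while the median and MAD barely move.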