The data scalability challenge

John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.

Interview conducted by Vinod Baya and Alan Morrison.

John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines TransUnion’s considerable requirements for less-structured data analysis, shedding light on the many data-related technology challenges TransUnion faces today, challenges that he says more companies will face in the near future.

PwC: In your role at TransUnion, you’ve evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce?

JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it.

PwC: What class of problem would that be?

JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it’s very amenable to highly parallelized execution, which many of the other filtering algorithms we’ve used aren’t.
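Editor's illustration: the class of problem Parkinson describes, approximate matching and categorization over many rows, can be sketched in a few lines of plain Python. This is a minimal, single-machine sketch and not TransUnion's method or the Hadoop API. The function names (`map_record`, `shuffle`, `reduce_group`, `run`), the 0.8 similarity threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def map_record(record, pattern, threshold=0.8):
    """Map step: emit (pattern, record) when the record approximately matches.

    Each call depends only on its own inputs, which is what makes the
    map step easy to spread across many workers.
    """
    score = SequenceMatcher(None, record.lower(), pattern.lower()).ratio()
    if score >= threshold:
        yield (pattern, record)


def shuffle(pairs):
    """Shuffle step: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse each group to a count of approximate matches."""
    return key, len(values)


def run(records, patterns, threshold=0.8):
    """Run map, shuffle, and reduce in sequence on in-memory data."""
    mapped = (
        pair
        for record in records
        for pattern in patterns
        for pair in map_record(record, pattern, threshold)
    )
    return dict(reduce_group(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    rows = ["John Parkinson", "Jon Parkinson", "Jane Doe"]
    print(run(rows, ["john parkinson"]))
```

In a real deployment the map calls would run in parallel across a cluster and the shuffle would move data between machines; here everything runs in one process so the three phases stay visible.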

The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn’t what Google runs in production—it’s an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They’ve done a good job, but it’s like a lot of open-source software—80 percent done. The 20 percent that isn’t done—those are the hard parts.

From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It’s difficult to manage. It has some bugs in it, and it doesn’t behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.

We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit.

Now, the good news of course is that you get source code. But that’s also the bad news. You need to get the source code, and that’s not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there’s a pony there, but it’s going to be a while before it stabilizes to the point that I want to bet revenue on it.
