Interview: How massive parallel-processing compute power is changing the face of large graph analysis

With Big Data on the lips of technologists and marketers alike, there is a strong temptation for many to dive headlong into the world of analytics without a coherent strategy. In a revealing interview, HPCC Systems Data Architect Jo Prichard discusses the evolution of businesses around Big Data and the value in creating a sustainable, scalable solution.

Hi there Jo. Could we start off with a little bit about your background and how you came to be at LexisNexis?

I come from a programming background primarily. I started over 20 years ago working on mainframes and Vax COBOL. Since then I've progressed through to working with third and fourth generation programming languages, ending up working with an R&D team in London just over 15 years ago. Parts of that R&D team were the guys behind Borland C++, from a company called JPI who wrote compilers. In 1999, someone from the USA contacted me and asked me and my colleagues to join his company to create a massive parallel-processing supercomputer. So that's where it all started: we had some specific needs we wanted to solve, but doing so would require some completely new sets of technologies to meet those business needs from a Big Data perspective. The team tackled that and created what is now HPCC Systems, which stands for High Performance Cluster Computing from LexisNexis.

From my perspective I do less and less platform coding and much more people data and Big Data work. What fascinates me personally is what Big Data can tell you about people from the perspective of how they interact with each other and how you can infer what is going on behind the scenes. That could be for many reasons. When I first started out over 20 years ago I worked for a short-term insurance company and one of the biggest challenges was to figure out how people would game the system from an auto-insurance and short-term insurance perspective. It was very difficult because you had no idea how they connected behind the scenes. In today's world you have social media and you can kind of figure that out, but even back then we were able to notice trends, such as the fact that every time someone had their car radio stolen they just happened to have a pair of Ray-Bans in the car. It became an epidemic, and it seemed to just happen out of nowhere. People communicate that knowledge and there was no way to pierce that. So for a long time I've been fascinated by what you can learn from people in a general sense by leveraging big data. How can you understand what's normal, and when people step out of the boundaries of that and start to socially engineer stuff? That fascinates me. Most of my work focuses on understanding what is normal behaviour and which data points can tell us if there is something else going on.

Has there been a clear evolution of the power of analytics since you have been working in this field, or has it really been over the last couple of years that you have seen a big data explosion?

I think from an enterprise perspective, Big Data has been around for a long time. At LexisNexis we have really had to shape our business around Big Data for well over a decade. The difference is that for many years our internal platform code and our supercomputing technology was proprietary, it was our competitive edge. We protected that very carefully. We are primarily an information solutions company so we have all of these massive data assets, around 50 Terabytes of data across 10,000 data sources. This has been growing over the past decade but even 10 years ago we were faced with the realisation that the only way we could grow our business was to wrap our hands around the Big Data and come to terms with the fact that we had to have new and innovative ways to help us deal with it and get value from it without getting hamstrung by the fact that we couldn't scale. If we could solve the scaling issue then that would give us a massive edge.

The difference was that we watched very carefully as the rest of the world had to catch up with that. After the MapReduce paper came out, there was a tipping point about 24 months ago when the rest of the world woke up to the fact that it wasn't impossible to leverage that amount of data, that it didn't necessarily belong to the realms of the big social media companies, and that it could be accessible for enterprises.

So for many years we had this head start over everyone else. Having said that, we did have massive data stores and those brought along a number of their own issues. One such key problem was from an entity resolution point of view. We have all of these entity fragments, but to do really successful entity resolution you need to take effectively every fragment for every person (around 300 million people) and compare every fragment against every other person. It becomes an enormous n² problem. To do that you needed to be able to create a platform that could help you solve that problem. And that has been one of the key things that has driven the growth of our business. There is a point at which the world has finally realized that there is enormous value in data. If you can string the right tools together and architect the right solution inside your business you can successfully create value out of it.
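To give a sense of the scale behind that n² remark, here is a minimal sketch of all-pairs comparison. It assumes, for simplicity, a single record per person (the interview describes several fragments per person, so the real figure would be higher), and the function names and the toy matching rule are hypothetical, not part of any HPCC Systems API.

```python
from itertools import combinations


def pairwise_comparisons(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def naive_match(fragments, same_entity):
    """Compare every fragment against every other one (quadratic work)."""
    return [(a, b) for a, b in combinations(fragments, 2) if same_entity(a, b)]


# Toy rule (an assumption for illustration): two fragments match
# when their names are equal ignoring case.
fragments = ["Ann Lee", "ann lee", "Bob Ray"]
matches = naive_match(fragments, lambda a, b: a.lower() == b.lower())

# Roughly 4.5 x 10^16 comparisons for 300 million single-record people.
total = pairwise_comparisons(300_000_000)
```

Even at one comparison per nanosecond on a single machine, that count would take well over a year, which is why the work has to be spread across a cluster.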

What was the main argument for open-sourcing the platform?

There are a whole bunch of reasons. Primarily it was no longer a competitive edge for us. When nobody else has the power to scale out data-intensive computing like that then you hold on to it very carefully. Now, however, you have so many vendors appearing in that space that it is no longer what we would consider our competitive advantage, it is just one of our tools. We are keen to grow a smart community around our platform and our open-source product, so that people can contribute and we can both extend the platform for general use and extend it to create lift within our own community.

Do you feel that it is more difficult to be as agile as some of the start-ups who have only been in the market for a year or so?

Agility is a double-edged sword. As an analogy, when dealing with my kids it is very hard to tell them "trust me, I have experience in this!" You don't want to do it that way because whilst you might have quick results right now, there is a downside in a year or two's time because you've created something that is hard to manage. From our perspective it is often difficult to communicate that you need to get past the initial proof of concept phase where all you want to do is solve one technical challenge. It's a bit like Farmville. You see a little patch of land that you want to cultivate, you throw everything into it for free and you do really well. But if you want to start growing it then you have to start paying people money, and then more money, and then you start roping your friends in to help out. What we try to do is help people get past the novelty of Big Data and get past the novelty of being able to do things massively in scale. Once you understand that it is possible you need to step back, take a breath and understand what it is that you really want to accomplish, as well as understanding what the downside is of stringing up this stuff very carefully without a long term plan.

We've always had a long-term plan because we've had so much leeway to figure stuff out and make mistakes which has ensured that we don't have to do that at high-speed now because we've crossed that threshold a long time ago. So after crossing that threshold the first thing we realised was that Big Data is really hard work. The bigger it gets the more clunky it gets in terms of having to manage your data flow. One of our issues is that we have so many different sources reporting data to us that it creates an administrative nightmare in terms of formats and multiple landing zones. So we very quickly realised that even if you can execute it in three milliseconds but are relying on an army to keep it up and running then it simply isn't scalable, it's people scaling that becomes the problem. So what we really focused on was creating workflow products that built into the open-source platform so that you can track your jobs as they run through and you can see how your data is flowing through the system. One of the key things is versioning your data because it really is like software. You have to QA it at the end, you have to have this process where you are rolling in updates and date stamping your indexes as you build them so that you can flip them back and forth if there are data issues. You have to have technologies that help you scale your people and not just the platform; you need to be able to do more with less.
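The idea of date-stamping indexes so they can be flipped back and forth can be sketched as a small registry. This is an illustration only: the class, its method names, the index name and the dates are all hypothetical and do not reflect how the HPCC Systems workflow tools are implemented.

```python
from datetime import date


class IndexVersions:
    """Hypothetical registry of date-stamped index builds."""

    def __init__(self, base_name):
        self.base_name = base_name
        self._builds = []    # build names, oldest first
        self._active = -1    # position of the build currently serving queries

    def publish(self, build_date):
        """Register a new build named with its date stamp and make it active."""
        name = f"{self.base_name}_{build_date:%Y%m%d}"
        self._builds.append(name)
        self._active = len(self._builds) - 1
        return name

    def rollback(self):
        """Flip back to the previous build, e.g. after a data QA failure."""
        if self._active <= 0:
            raise RuntimeError("no earlier build to roll back to")
        self._active -= 1
        return self._builds[self._active]

    @property
    def active(self):
        """Name of the build currently serving queries."""
        if self._active < 0:
            raise RuntimeError("no build has been published yet")
        return self._builds[self._active]


# Arbitrary example dates: publish two weekly builds, then flip back.
idx = IndexVersions("people")
idx.publish(date(2013, 1, 7))
idx.publish(date(2013, 1, 14))
restored = idx.rollback()
```

Keeping old builds addressable by date is what makes the flip cheap: a bad update is handled by repointing queries rather than rebuilding data.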

The speed of the platform is just one small piece of the entire puzzle. If you are a start-up it is easy to get three bright guys together and string up a demo to take a whole bunch of Big Data and predict who the next President of the United States is going to be. If you had to make it repeatable however, if you are getting frequent data updates and have to manage that data flow, it becomes a people problem as opposed to a computing problem. Imagine if you had the next thousand US Presidents that you had to predict with a whole bunch of data and you have to manage that data, along with the essential code encapsulations. You don't want to have a rats' nest of code that you have to have more and more people on board to manage. It's very similar to software development in that sense. You want to be able to grow and scale-out your business without having to continuously hire new people or invest in more technology.

So it becomes a business strategy problem as much as anything else?

Absolutely. It's also about recognising that, whilst data warehousing, predictive modelling and supercomputing technology are nothing new, it is all about how agile you are in stringing them together in the most efficient way so that you can very quickly scale your business without having to hire new staff exponentially as well. Finding that balance is really important, as is finding the technology that can help you achieve that. I was at a conference recently and I was overwhelmed. I understand the world I work in really well, but I was staggered by the number of vendors out there. It's a bit like a bazaar! Behind all of that you have to keep asking how you can put it all together so that you don't need specialists in everything under the sun. You need simplicity and you need a small number of moving parts. Ultimately you don't want to be building the Concorde, you need to have a long-term plan that helps you understand where you want to end up.

So what do the next 12 months hold for you guys?

We have two sides to the business. From the point of view of HPCC Systems we're investigating a whole bunch of new technologies. At the moment we have a front-end, high-speed data delivery engine. We're looking to close the gap on putting in more technology that will get you end-to-end. Our two main technologies will allow you to put most of your end-to-end products together both in terms of doing your large data mash-ups on Thor in batch mode to your data delivery and queries. We're looking to expand and add some more functionality in terms of doing more of the pieces that other vendors are offering. Going to multiple vendors just adds complexity. There are places internally that we went to to shorten that loop and create extra lift.

From a LexisNexis position we have a very large push on data graph technologies. We're keen to really leverage our internal social graph to do more for enterprises and businesses. We help companies assess and manage risks, and then mitigate against those risks. We use our vast data assets to tell them more about their own population from a different perspective. In many cases that helps them with things like anti-money laundering and fraud protection across industries. We've really found this to be a rich vein of value in terms of leveraging social graph to figure out really complex things. It's a huge data challenge but it does yield extraordinary results.

You're going to be joining us at Big Data World Europe. What are you most excited about in terms of attending and speaking at the event?

Whenever I get back from a conference one of the things that I always look back on and find fascinating are the strangest meetings that I tend to have out of the blue. Some of my best ideas come from such meetings. I was at a conference and I met with a wellness and disease management firm. We just happened to be talking about blue zones and longevity, nothing to do with anything that I do. They were explaining to me that they really have a problem trying to figure out where to do community based intervention. What they want to achieve is find neighbourhoods that have the best receptiveness to community based intervention. The key things from a blue zone perspective are things like neighbourhoods that have a good social makeup, where they have a well connected community, where they have good walk-ability and so on. And during this discussion I was thinking about a friend within LexisNexis who is 'Mr Neighbourhood'. He has variables that describe every neighbourhood in the USA. He loves this stuff.

So I called him up afterwards and said "If I help you map out how cohesively connected every neighbourhood is along with your other variables, could we create an index that could help wellness companies pinpoint which communities would be most receptive to community based interventions?" He agreed and the process really only took two weeks to code. That was just one of those random things that happen when you talk to people about the challenges they are currently facing. If you can help them then that can often be mutually beneficial. I really enjoy talking to people across industries and getting my creative juices going!
