I work for Red Hat, where I lead JBoss technical direction and research/development. Prior to this I was SOA Technical Development Manager and Director of Standards. I was Chief Architect and co-founder at Arjuna Technologies, an HP spin-off (where I was a Distinguished Engineer). I've been working in the area of reliable distributed systems since the mid-80's. My PhD was on fault-tolerant distributed systems, replication and transactions. I'm also a Professor at Newcastle University and Lyon.

Monday, June 18, 2012

Worried about Big Data

I've been spending quite a lot of time thinking about Big Data over the past year or two and I'm seeing a worrying trend. I understand the arguments made against traditional databases and I won't reiterate them here. Suffice it to say that I understand the issues behind transactions, persistence, scalability etc. I know all about ACID, BASE and CAP. I've spent over two decades looking at extended transactions, weak consistency, replication etc. So I'm pretty sure I can say that I understand the problems with large scale data (size and physical locality). I know that one size doesn't fit all, having spent years arguing that point.

As an industry, we've been working with big data for years. A bit like time, it's all relative. Ten years ago, a terabyte would've been considered big. Ten years before that it was a handful of gigabytes. At each point over the years we've struggled with existing data solutions and made compromises or rearchitected them. New approaches, such as weak consistency were developed. Large scale replication protocols, once the domain of research, became the industrial reality.

However, throughout this period there were constants in terms of transactions, fault tolerance and reliability. For example, whatever you can say against a traditional database, if it's been around for long enough then it'll represent one of the most reliable and performant bits of software you'll use. Put your data in one and it'll remain consistent across failures and concurrent access with a high degree of probability. And several implementations can cope with several terabytes of informations.

We often take these things for granted and forget that they are central to the way in which our systems work (ok you could argue chicken-and-egg). They make it extremely simple to develop complex applications. They typically optimise for the failure case, though, adding some overhead to enable recovery. There are approaches which optimise for the failure free environment, but they impose and overhead on the user who typically has a lot more work to do in the hopefully rare case of failures.

So what's this trend I mentioned at the start around big data? Well it's the fact that some of the more popular implementations haven't even thought about fault tolerance, let alone transactions of whatever flavour. Yes they can have screaming fast performance, but what happens when there's a crash or something goes wrong? Of course transactions, for example, aren't the solution to every problem, but if you understand what they're trying to achieve then at some point somewhere in your big data solution you'd better have an answer. And "roll your own" or "DIY" isn't sufficient.

This lack of automatic or assistive fault tolerance is worrying. I've seen it before in other areas of our industry or research and it rarely ends well! And the argument about it not being possible to provide consistency (whatever flavour) and fault tolerance at the same time as performance doesn't really cut it in my book. As a developer I'd rather trade a bit of performance, especially these days when cores, network, memory and disk speed are all increasing. And again, these are all things we learnt through 40 years of maintaining data in various storage implementations, albeit mostly SQL in recent times. I really hope we don't ignore this experience in the rush towards the next evolution.

1 comment:

Mark, I do not think structured data (customer information) is growing at a pace to be termed Big Data. It is the unstructured information or semi-structured information - twitter posts, facebook posts, web clicks that the industry is deeming valuable, from marketing/analytical perspective. It definitely is synonymous with "seeking a needle in a haystack". Just like the wave of Cloud Computing as a marketing term has stabilized, I feel "Big Data" will become less attractive (but a reality) in 2-3 years.

What really scares me is that unstructured data in pieces may not have privacy concerns. But a collection of these pieces may yield private information (Eg: Target knowing that a teen was pregnant via her buying pattern). That definitely scares me.