Tuesday, 21 February 2012

I've been going through the Gartner hype cycle these last few weeks, regarding Big Data. I've been through the Trough of Disillusionment and back, and rocked back and forth a bit even, and I have now figured out what's been bugging me so much about it.
It reminds me a bit of #E20... First, I didn't think much of Big Data. Then, I thought it would be a great next new thing and bandwagon to jump onto: like data warehousing it's closely related to Integration, so I might get some spin-off.
But that whole last idea quickly faded.
And then I went to Pervasive's Integration World and got sucked into it all by Mike Hoskins's enthusiasm. Petabytes, exabytes, zettabytes - if you count all the bits and bytes, there will be an awful lot of data to crunch in the next decades.

I see the largest growth in machine-generated data: machines measuring and spitting out data at ever shorter intervals. I've seen it happen in my own profession: Enterprise Integration.
We used to do things in batch, once a day. We'd synchronise databases across space and / or time: our own company's, or those of partners, suppliers, customers.

Then, the batch windows would grow smaller, or rather: the demand for updates would increase. So next to pushing data out in the evening once a day, we'd also allow inbound pulls during the day. We'd change the batch job at the end of the day into an hourly job that would aggregate changed data, add it to a file and mark it as processed. Our eager friends would call in (via a perfectly secure hard-coupled leased line) at regular times during the day, gather what little intel was present, and figure out for themselves the pointer to their last piece of information.
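For the technically inclined, here is a minimal sketch of that hourly pull set-up. It is only an illustration: the class and method names (ChangeFeed, record_change, run_hourly_job, pull) are made up, and an in-memory list stands in for the file on the gateway. The point it shows is that the provider only appends and marks as processed, while each partner keeps its own pointer.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeFeed:
    """Provider side (hypothetical): changes are collected and exported hourly."""
    pending: list = field(default_factory=list)   # changed records not yet exported
    exported: list = field(default_factory=list)  # stands in for the file partners pull from

    def record_change(self, record):
        # Something changed in the source system during the hour.
        self.pending.append(record)

    def run_hourly_job(self):
        # Add all changed data to the file, then mark it as processed.
        self.exported.extend(self.pending)
        self.pending.clear()

    def pull(self, pointer):
        # The partner supplies its own pointer; the provider keeps no per-partner state.
        return self.exported[pointer:]


feed = ChangeFeed()
feed.record_change("order 1 created")
feed.record_change("order 2 created")
feed.run_hourly_job()

# Partner side: call in, take what is new, remember where we left off.
partner_pointer = 0
batch = feed.pull(partner_pointer)
partner_pointer += len(batch)

feed.record_change("order 1 cancelled")
feed.run_hourly_job()

second_batch = feed.pull(partner_pointer)
partner_pointer += len(second_batch)
```

The partner only ever sees what was added since its last call, which is exactly why it has to remember where it left off.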

Then, demand increased even more and our hourly batch job turned into a real-time job, with new data being added whenever a data-changing event triggered it. Still, connectivity was kind of costly, so that new info was still pulled from the outside.
A little later, pull changed to push: for the really time-critical stuff we'd not only build up new data event-driven, but also push it out on the spot to that very select group of the best of our friends and partners.
Just a little after that, connectivity costs dropped to almost zero and all of a sudden we considered pretty much everyone to be our best friend.

From batch, we went to event-driven. It's unclear whether the increased time-to-market demand scaled up connectivity so much that its cost flat-lined, or vice versa - probably both. The end result? The same data being available to everyone else within seconds or minutes, versus once a day after close of business or just before that.

What changed was the speed at which information became available - nothing else. But decisions could be made sooner, and there was a minor trade-off there of course: we'd now also send out e.g. orders in the morning that would get cancelled in the afternoon, whereas in the old situation this would have left no record whatsoever.

I envision something similar with Big Data - yet very, very different. It's not called Big Information, it's called Big Data: you now get data at the speed of light, or you can process it at a ridiculously high speed (for the record, I do drool at the showcases where millions of records get processed per second). But you'll still have to turn all that data into information yourself.

I can picture the typical vendors smiling brightly. I can also see new vendors rise and shine, and preach the gospel of Big Data and how it will save you from purgatory. I see hardware sales increase, software sales and licenses explode, and a whole new service see the light: BIaaS.

That's right: Big Information as a Service (coined on the spot, btw LOL). Why is it going to be the next big thing?

Turning Big Data into Big Information? No, that's not going to happen. Business Intelligence hasn't been successful at all, neither has data warehousing, and Business Process Management suffers the same ill fate: it takes ages to structure unstructured anything - especially if both the unstructured and the structured keep changing, which is going to happen increasingly faster, paralleling what we witnessed in Enterprise Integration.
With change coming from both sides, the sandwich image is clear.

Dragging Big Data indoors even? Hell no, the new bottleneck here is bandwidth. All fair and square that you can analyse petabytes of information within hours or even minutes, but where do those gazillionbytes come from? Outside our data centre, wherever that may be.
Currently considered okay bandwidth for those? 2-3 Gbps, given a few thousand users. Big Data? Coming to you at a few Terabytes at least, preferably Petabytes. Bytes versus bits there (a factor 8), and Giga versus Tera (a factor 1,000) or even Peta (a factor 1,000,000).
Looks like DHL and UPS might make a good buck from transporting Big Data - redefining the meaning and use of the carrier pigeon, hey?

Getting Big Data? Well, I don't know about you, but it seems to me that what we find most interesting is that which doesn't belong to us: houses, jobs, cars, women - the list is endless. And while we're on the subject: that involves money, right? A lot of it, usually? Yes.

So, what's the business model going to be? Like in the old-fashioned days of Integration, will batches of Big Data be waiting for us to pick up from some gateway or server, at 2 Gbps, while the data is 2 TB? Even at the full line rate that means more than two hours of downloading for just one file, and closer to 24 hours if we only get a tenth of a link we share with a few thousand users - and that's if absolutely nothing goes wrong. And then we analyse it within 5 minutes - where's the gain here?
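A back-of-the-envelope sketch of that download time, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and an idealised link without any protocol overhead. The function name and the usable_share parameter are made up for illustration. At the full 2 Gbps the 2 TB file takes a bit over two hours; a 24-hour download corresponds to getting a little under a tenth of the link.

```python
TB = 10**12    # bytes, decimal terabyte (assumption)
GBPS = 10**9   # bits per second, decimal gigabit (assumption)


def transfer_seconds(size_bytes, link_bits_per_second, usable_share=1.0):
    """Idealised transfer time: no overhead, no retries, constant throughput."""
    size_bits = size_bytes * 8  # bytes versus bits: a factor 8
    return size_bits / (link_bits_per_second * usable_share)


# One 2 TB file over a 2 Gbps link that we have entirely to ourselves.
full_rate_seconds = transfer_seconds(2 * TB, 2 * GBPS)
full_rate_hours = full_rate_seconds / 3600

# Which share of the link would stretch the same download to 24 hours?
day_share = full_rate_seconds / (24 * 3600)

print(f"full line rate: {full_rate_hours:.1f} hours")
print(f"share of the link for a 24-hour download: {day_share:.1%}")
```

Either way, the download dwarfs the five minutes of analysis that follow it.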

My guess is, we'll quickly follow the path travelled by Integration: forget batch, we'll go straight to real-time. Instead of big batches of data, we'll get very small real-time bits and pieces: Small Data, not Big Data.
Maybe even Tiny Data - but that doesn't sound so sexy, now does it?

I see a good and sensible solution to bandwidth spoiling the Big Data game: owners of Big Data providing the service of Big Information. After all, one Terabyte of Data will give you only a few Megabytes of Information at best. Possibly incredibly valuable Information, but extremely limited in size.

Big Data? We need a whopping 2,000-lane highway in order to make that happen without constantly being stuck, waiting to make it to our destination - only to spend another few days or weeks on turning that same data into information. Oh, and act upon it...