The rise and rise of big data

Data warehousing software sparks into life

By Andy Hayler

July 25, 2011

CIO

Over the years there has been an explosion in the growth of data. As recently as 2000, digital media accounted for just 25 per cent of all information in the world, but by 2007 it was 94 per cent, according to a study by the University of Southern California.

Although processing power has, in accordance with Moore’s Law, also seen exponential growth, memory and disk storage access speeds have not kept pace.

The last few years have seen this problem exacerbated by the growth in social media and by the increase in the amount of data automatically collected by sensors and devices like RFID tags and clickstream tracking software.

The consequence of this divergence is that many enterprises find that traditional database approaches are struggling to keep up with their need to analyse increasingly huge volumes of data.

To complicate things, more of this data is unstructured (such as documents and web pages, rather than just numbers), which traditional databases have never been especially good at dealing with.

Industries which have found this a problem include internet marketing companies, social media web sites and financial institutions like hedge funds who want to test trading strategies on historical trading data.

The term Big Data has been coined to describe this issue, and a number of interesting approaches have arisen to tackle it. For one thing there has been an explosion of entrants to the previously staid data warehouse market.

Approaching the numbers

Two approaches have come to the fore.

First, traditional relational databases have been optimised for transaction update, and are row-oriented, designed to have tables with a few columns (name, address or product number) and large numbers of rows.

This is what you want for update processing, but in the case of largely read-only processing it can be more efficient to flip this on its head to column-oriented storage.

Pioneered by Sybase, this approach has been taken up by many of the recent analytic database market entrants.

It is easier to compress this style of data, though there is a price to pay in terms of load times and it is not well suited to frequent transactional updates.

However for analytic processing this is not really an issue, and columnar databases can, for certain analytic queries, deliver query performance an order of magnitude faster than traditional approaches.
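To make the row-versus-column distinction concrete, here is a minimal sketch in Python. The sales records and the function names (to_columns, run_length_encode) are invented for illustration and do not reflect any vendor's actual storage engine. It shows two things: an aggregate over one column only needs to touch that column's values, and a column with repeated values compresses easily, here with simple run-length encoding.

```python
# Hypothetical sales records, used only for illustration.
rows = [
    {"product": "A", "region": "north", "amount": 10},
    {"product": "B", "region": "north", "amount": 20},
    {"product": "A", "region": "south", "amount": 30},
    {"product": "C", "region": "south", "amount": 40},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An analytic query such as "total sales" reads a single column
# instead of scanning every field of every row.
total = sum(columns["amount"])

# Values within a column tend to repeat, which helps compression.
region_rle = run_length_encode(columns["region"])
```

The same layout also hints at the cost mentioned above: inserting or updating one record means touching every column list rather than one contiguous row.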

A second approach, often allied to columnar storage, has been to use massively parallel processing, where instead of one processor dealing with all queries, software parcels the tasks out across an array of processors operating in parallel.

This is tricky to program, but a number of vendors have succeeded in providing such an environment, the pioneer being Teradata.
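The idea of parcelling a query out across processors can be sketched as a scatter-and-gather pattern. The Python below is an illustration only: the function names are invented, and a thread pool stands in for the separate processors or nodes a real massively parallel database would use. Each worker computes a partial result on its own slice of the data, and the partial results are then combined.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_workers):
    """Split values into n_workers slices of roughly equal size."""
    return [values[i::n_workers] for i in range(n_workers)]


def partial_sum(chunk):
    """Work done independently by one worker on its own slice."""
    return sum(chunk)


def parallel_sum(values, n_workers=4):
    """Scatter the data, compute partial sums in parallel, then gather."""
    chunks = partition(values, n_workers)
    # Threads are an assumption here; they stand in for separate processors.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


amounts = list(range(1, 101))
result = parallel_sum(amounts)
```

A sum is the easy case because partial results combine trivially. Queries that join or sort across slices need data to move between workers, which is a large part of what makes this style of software hard to write.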

High volumes of unstructured data challenge even these newer approaches, and have led to a parallel track of technology development that eschews traditional databases.

Google patented MapReduce as a framework to allow highly parallel processing of huge datasets distributed over large numbers of computers.

Hadoop is an open-source implementation of MapReduce pioneered by Yahoo and picked up by a growing set of software developers interested in tackling the issue of ballooning quantities of unstructured data.
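The MapReduce programming model can be illustrated with the customary word-count example. The following is a single-machine sketch in plain Python, not Hadoop's actual API, and the function names and sample documents are invented for illustration. A map step emits key/value pairs, a shuffle step groups the values by key, and a reduce step combines each group.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input documents.
documents = ["big data big problems", "data warehouse"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map calls would run on whichever machines hold each piece of the input and the reduce calls would be spread across machines by key. Because each call depends only on its own input, the work distributes naturally.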

The Hadoop and the database worlds are starting to connect as a number of analytic database vendors have introduced support for Hadoop programming within their databases, to a lesser or greater extent.

Traditional SQL programmers find their experience ill-suited to the very different programming paradigm needed for parallel processing frameworks such as Hadoop, so skills are at a premium.

Moreover it is worth emphasising that these esoteric techniques only apply to certain types of analytic requirements.

What is certain is that the recent very rapid growth in database size presents some significant challenges to an industry used to relying on faster processing power to solve its problems, and that this challenge has spawned some genuine innovation in a previously rather staid software sector.

It will be extremely interesting to see how this area develops in the coming year or two, and which vendors prosper.

Given the ever-increasing demands placed on analytic processing by the factors outlined earlier, there will be plenty of opportunities for innovation, and we can expect to see some unfamiliar vendors causing a stir, doubtless triggering further merger and acquisition activity as the giants seek to keep up.

Andy Hayler is founder of research company The Information Difference. Previously, he founded data management firm Kalido after commercialising an in-house project at Shell.