Corporations have a big advantage when it comes to ‘big data’

Computers are spewing forth data at astronomical rates about everything from astrophysics to internet shopping. And it could be hugely valuable, writes John Naughton

One of the most famous quotes in the history of the computing industry is the assertion that “640KB ought to be enough for anybody”, allegedly made by Bill Gates at a computer trade show in 1981 just after the launch of the IBM PC. The context was that the original PC, built around the Intel 8088 processor, could make only 640 kilobytes of Random Access Memory (RAM) available to programs, and people were questioning whether that limit wasn’t a mite restrictive.

Gates has always denied making the statement and I believe him; he’s much too smart to make a mistake like that. He would have known that just as you can never be too rich or too thin, you can also never have too much RAM. The computer on which I’m writing this has four gigabytes (GB) of it, which is roughly 6,000 times the working memory of the original PC, but even then it sometimes struggles with the software it has to run.

But even Gates could not have foreseen the amount of data computers would be called upon to handle within three decades. We’ve had to coin a whole new set of multiples to describe the explosion – from megabytes to gigabytes to terabytes to petabytes, exabytes, zettabytes and yottabytes (a yottabyte is two to the power of 80 bytes, or roughly 1 followed by 24 noughts).
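For readers who want to check the arithmetic, here is a minimal sketch of the ladder of multiples. It assumes the two common conventions, decimal (powers of 1,000) and binary (powers of 1,024); the function names are purely illustrative. It shows why the two figures usually quoted for a yottabyte are close but not identical.

```python
# Prefixes in ascending order; each step multiplies by 1000 (decimal)
# or by 1024 (binary), depending on the convention in use.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def decimal_bytes(prefix):
    """Bytes in one <prefix>byte, using powers of 1000."""
    return 1000 ** (PREFIXES.index(prefix) + 1)

def binary_bytes(prefix):
    """Bytes in one <prefix>byte, using powers of 1024."""
    return 1024 ** (PREFIXES.index(prefix) + 1)

if __name__ == "__main__":
    for p in PREFIXES:
        d, b = decimal_bytes(p), binary_bytes(p)
        # len(str(d)) - 1 counts the noughts after the leading 1;
        # bit_length() - 1 gives the exponent of the power of two.
        print(f"{p}byte: 10^{len(str(d)) - 1} decimal, 2^{b.bit_length() - 1} binary")
```

On the binary reading a yottabyte is about 21% larger than on the decimal one, which is why "two to the power of 80" and "1 followed by 24 noughts" are only approximately the same number.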

This escalating numerology has been necessitated by an explosion in the volume of data surging round our digital ecosystem from developments in science, technology, networking, government and business. From science, we have sources such as astronomy, particle physics and genomics. The Sloan Digital Sky Survey, for example, began amassing data in 2000 and collected more in its first few weeks than all the data collected before that in the history of astronomy. It’s now up to 140 terabytes and counting, and when its successor comes online in 2016 it will collect that amount of data every five days. Then there’s the Large Hadron Collider (LHC), which in 2010 alone spewed out 13 petabytes – that’s 13m gigabytes – of data.
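The conversions in that paragraph are easy to verify. The sketch below assumes decimal units (powers of 1,000) and uses only the figures quoted above; the helper name `convert` is illustrative.

```python
# Decimal byte units (powers of 1000), an assumption made for this sketch.
GB = 10 ** 9
TB = 10 ** 12
PB = 10 ** 15

def convert(amount, from_unit, to_unit):
    """Convert an amount from one byte unit to another."""
    return amount * from_unit / to_unit

# LHC output for 2010 as quoted: 13 petabytes, expressed in gigabytes.
lhc_2010_gb = convert(13, PB, GB)

# The Sloan archive is quoted at 140 terabytes; a successor gathering
# that much every five days implies this average daily rate.
successor_tb_per_day = 140 / 5

if __name__ == "__main__":
    print(f"LHC 2010: {lhc_2010_gb:,.0f} GB")
    print(f"Successor survey: {successor_tb_per_day:.0f} TB per day")
```

In other words, 13 petabytes is indeed 13m gigabytes, and the successor survey would be averaging about 28 terabytes a day.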

The story is the same wherever you look. Retailers such as Walmart, Tesco and Amazon do millions of transactions every hour and store all the data relating to each in colossal databases they then “mine” for information about market trends, consumer behaviour and other things. The same goes for Google, Facebook and Twitter et al. For these outfits, data is the new gold.

Meanwhile, out in the non-virtual world, technology has produced sensors of all descriptions that are cheap and small enough to be placed anywhere. And IPv6, the new internet addressing protocol, provides an address space that is big enough to give every one of them a unique address, so they can feed back daily, hourly or even minute-by-minute data to a mother ship somewhere on the net.

To call what’s happening a torrent or an avalanche of data is to use entirely inadequate metaphors. This is a development on an astronomical scale. And it’s presenting us with a predictable but very hard problem: our capacity to collect digital data has outrun our capacity to archive, curate and – most importantly – analyse it. Data in itself doesn’t tell us much. In order to convert it into useful or meaningful information, we have to be able to analyse it. It turns out that our tools for doing so are currently pretty inadequate, in most cases limited to programs such as Matlab and Microsoft Excel, which are excellent for small datasets but cannot handle the data volumes that science, technology and government are now producing.

Does this matter? Yes – for two reasons. One is that hidden in those billions of haystacks there may be some very valuable needles. We saw a glimpse of the possibilities when Google revealed that by analysing the billions of queries it handles every hour it could predict flu epidemics way ahead of conventional epidemiological methods. There’s a lot more where that came from.

More importantly, we need to recognise that Big Data (as it’s now called) could tip the balance between society’s need for independent scientific research and the corporate world’s use of data-mining to further its own interests. Tooling up to handle this stuff requires major investment in computer hardware and software and you can bet that most of the world’s big corporations are making those investments now. But most PhD students working in data-intensive fields are still having to write their own analytical software and cadge computing cycles wherever they can find them.

Research Councils and other funding bodies should be investing in the IT infrastructure needed to level this particular playing field. In the kingdom of Big Data, the guy who only has Excel will be blind.