How Big is “Big Data”?

6 October 2012

“Big Data” is getting a lot of attention lately as a key computing area for the coming years. Even the White House has gotten involved with this year’s announcement of a Federal Big Data initiative. But exactly how big is “big data”? It’s a moving target of course, shifting with our growing ability to generate, store, and process ever larger volumes of data.

IBM 2314 Disk Drives

The IBM 2314 disk drives, introduced between 1965 and 1970, were a technical wonder in their day. But it took a whole row of large, appliance-sized units to crack 200 MB, and the “big data” of that day was mostly stored on tape, accessible only via slow sequential processing of carts full of tape reels. Megabytes clearly qualified as big data.

Today I can beat that string of 2314 disks by an order of magnitude with a USB stick for under $20. Clearly the economics are radically different. But where does that leave the qualifying level for “big data”?

Wikipedia, that font of modern knowledge, provides an interesting perspective. A quick browse of the entries for gigabyte, terabyte, petabyte, and exabyte provides all the scale we need without even worrying about a yottabyte. The system and storage examples in those entries are informative:

Megabytes clearly don’t make a blip on the Big Data horizon. The Big Data of yesteryear is a routine unit for the size of individual files today.

Gigabytes are covered by the modest amounts of image, audio, or video data that most computer users deal with routinely. A few music CDs or the video on a DVD break into gigabyte territory. There’s not much here that will impress as Big Data.

Terabytes are just one step up the scale, but things start to get much more interesting. The examples deal with data capacities and system sizes from the last 10 to 15 years. They include the first one terabyte disk drive in 2007, about six terabytes of data for a complete dump of Wikipedia in 2010, and 45 terabytes to store 20 years of observations from the Hubble telescope. Clearly, at this point we are entering “big data” territory.

Petabytes start to move beyond the range of single systems. Netflix stores one petabyte of video to stream. World of Warcraft uses 1.3 petabytes of game storage. The Large Hadron Collider experiments are producing 15 petabytes of data per year. IBM and Cray are pushing the boundary of storage arrays with systems in the 100 to 500 petabyte range.

Exabyte examples start to leave individual systems behind and mostly describe global-scale capacities. Global Internet traffic was 21 exabytes per month in 2010. Worldwide information storage capacity was estimated at 295 exabytes of compressed information in 2007. On the other hand, astronomers are expecting 10 exabytes of data per hour from the Square Kilometre Array (SKA) telescope, although full operation is not projected until 2024.
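The factor-of-1,000 jump between each of these units is easy to lose track of, so here is a minimal Python sketch that makes the scale concrete. The example figures are taken from the entries discussed above; the variable names are just my own labels for them.

```python
# Each SI prefix step up the scale is a factor of 1,000 bytes.
UNITS = {
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}

# Example sizes drawn from the figures above.
hubble_20_years = 45 * UNITS["terabyte"]  # 20 years of Hubble observations
netflix_video = 1 * UNITS["petabyte"]     # Netflix's streaming library

# How many Hubble-sized archives fit in Netflix's one petabyte?
print(netflix_video // hubble_20_years)   # → 22
```

Twenty years of a flagship space telescope fits more than twenty times over into a single company’s video library, which is a fair illustration of how quickly each step up the prefix ladder dwarfs the one before it.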

750 Gigabyte 3.5” disk

So this scale would seem to put gigabytes squarely in the yawn category, below the threshold of Big Data. Terabytes clearly qualify, and probably account for much of the Big Data effort at the moment. Petabytes cover the really impressive data collections of today and seem to mark the upper boundary of what even the most ambitious Big Data projects can handle. Exabytes are rushing at us, but mostly lie beyond what anyone will be able to address in the next few years.

So the bottom line: Big Data today has moved beyond gigabytes. It is squarely in terabytes and edging up into petabytes. Exabytes and beyond lie in the future. And we still don’t need to try to comprehend what a yottabyte is.