What is Big Data?

I can’t remember the last time I made it through an airport terminal without
seeing a giant abstract poster up on the wall with a vague heading like “Big
Data is here: are you ready?”, or “Big Data: the new natural resource”. Often,
the heading is in front of a photo of clouds, or a gazelle, or a server room,
or some other such thing.

So, for the weary business traveler who at this point is too afraid to ask what
“Big data” refers to, here’s a quick breakdown.

What is big data?

The term “Big Data” became popular when businesses started to regularly
interact with datasets that were big enough that you couldn’t load them into
your computer’s RAM, which these days is generally somewhere between 1GB and
16GB.

This has been a problem for a long time, but this particular name for it didn’t really
take off until around 2004.

If your dataset doesn’t fit in RAM, you can’t do anything with it that
requires looking at all of it at the same time, which makes it a pain to work
with. You need to load a chunk of it off a hard drive, do something with it,
and then load the next chunk.

Alternatively, you’re dealing with big data if whatever you’re doing with your
data takes so long that you need multiple computers to work on
parts of it at the same time.

That’s basically the gist of it. “Big data” refers to the set of techniques and
solutions that deal with those two kinds of scale—data that there’s a lot of,
or computations that take too long to run. In other words, when data achieves
an inconvenient scale in either space or time.

You may also have a big data problem if you need to do real-time processing of data
that is generated or modified at a high enough rate that you need multiple
servers to handle the flow.

Together, those two reasons are the “Volume” and “Velocity” in Gartner’s
popular “3 Vs” characterization of big data—“Volume, Velocity, Variety.” I
think “Variety” is neither necessary nor sufficient for you to have a big data problem.

A big data example

Imagine you’ve acquired a giant CSV file from Facebook that contains a billion
rows of data about Facebook users. Let’s say there are only two
columns—“Name” and “Age”.

Let’s also say the average length of a name is about 10 characters, and
the average length of an age is about two characters. So each row has about
12 characters (ignoring the comma and newline separators), which makes for a
total of 12 billion characters.

A single character takes up about a single byte of space on a hard drive. That
means your dataset is about 12 billion bytes, or 12 gigabytes.
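That arithmetic is easy to reproduce; the row count and character lengths below are just the hypothetical figures from the example:

```python
# Back-of-the-envelope size estimate for the hypothetical dataset above.
rows = 1_000_000_000        # one billion rows
bytes_per_row = 10 + 2      # ~10 characters for a name, ~2 for an age, 1 byte each
total_bytes = rows * bytes_per_row

print(total_bytes)          # → 12000000000
print(total_bytes / 10**9)  # → 12.0 (gigabytes)
```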

Suppose you want to compute the average age of Facebook users. Computing an
average means summing up all the values of something and dividing by the number
of things you added up. So if you can’t load this dataset into your RAM all at
once, you’ll need to load up chunks of it at a time, add up the ages in that
chunk, store the resulting sum, and also store the number of rows you just
summed over. If you split up your data into, say, ten chunks, at the end,
you’ll have ten sums, and ten counts. Then you just add up the sums, and add up
the counts, and divide the total sum by the total count.
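Sketched in Python, with a small synthetic list standing in for the billion-row file (the chunk size and user names here are purely illustrative):

```python
import random

# Synthetic stand-in for the giant CSV: (name, age) rows.
random.seed(0)
rows = [("user{}".format(i), random.randint(13, 90)) for i in range(100_000)]

CHUNK_SIZE = 10_000  # the number of rows we pretend fit in RAM at once

chunk_sums = []
chunk_counts = []
for start in range(0, len(rows), CHUNK_SIZE):
    chunk = rows[start:start + CHUNK_SIZE]  # "load" one chunk
    chunk_sums.append(sum(age for _, age in chunk))
    chunk_counts.append(len(chunk))

# Combine: add up the sums, add up the counts, and divide.
average_age = sum(chunk_sums) / sum(chunk_counts)
```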

Most big data problems are solved by some variation of this
split-process-combine approach. Of course, things get tricky when you want to
do more complex things, or when you have strict requirements for how fast you
need a response. But this is the core idea.
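The same split-and-combine pattern is what lets multiple machines share the work. Here is a rough single-machine sketch using a thread pool; a real system would ship each chunk to a separate server, and the worker function and worker count are just illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_stats(ages):
    # Each worker independently sums its own slice of the data.
    return sum(ages), len(ages)

ages = list(range(13, 91)) * 1_000  # hypothetical ages: 78,000 rows
n_workers = 4
step = -(-len(ages) // n_workers)   # ceiling division: rows per worker
chunks = [ages[i:i + step] for i in range(0, len(ages), step)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(chunk_stats, chunks))

# Combine the partial results exactly as in the chunked example.
total = sum(s for s, _ in results)
count = sum(c for _, c in results)
average_age = total / count
```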

Why you probably don’t have big data

You probably only have a big data problem if whatever you’re doing with your
data requires you to look at every single item in the dataset.

That’s because statisticians have spent the better part of the last two
centuries learning how to work with samples of large populations. Most of the
time, the answer you get by looking at a random sample of your data is good
enough, if you just apply the techniques of classical statistics.

In the example above, you could have easily just taken a random sample of
100,000 rows from your 1 billion row dataset, and computed the average of those
rows. The answer would be basically the same, and the degree to which
you might be off can be quantified rigorously using statistics.
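A sketch of that sampling shortcut, with a synthetic population standing in for the real dataset; the 1.96 factor gives a rough 95% confidence interval under classical assumptions:

```python
import math
import random

random.seed(1)
population = [random.randint(13, 90) for _ in range(1_000_000)]  # stand-in ages

sample = random.sample(population, 100_000)
sample_mean = sum(sample) / len(sample)

# The standard error quantifies how far off the sample mean is likely to be.
variance = sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1)
std_error = math.sqrt(variance / len(sample))

# Rough 95% confidence interval for the true average age:
low, high = sample_mean - 1.96 * std_error, sample_mean + 1.96 * std_error
```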

You only need to look at every single row in specific cases. At Facebook,
engineers regularly need to modify data in their databases in a very particular
way for every single user, often in complex and interrelated ways. They also
need to deal with lots of updates and deletions by lots of people in real-time.
In these cases, you can’t just use statistics to simplify away the problem.

How to talk about big data

If you’re discussing your organization’s data requirements and scale with someone,
don’t hesitate to get into the weeds. Avoid talking in abstractions and generalities. Ask
specific questions:

What’s the total magnitude of data on disk? A few gigabytes? A few terabytes?

Is the data you’re working with growing? If so, by how much, and how often?

What do you need to do with it? Do you need immediate responses to complex queries across multiple datasets? Or is it sufficient to run offline batch jobs once a week?