A Big Dud on Big Data

Terrible Definition for Big Data

Not everything that happens online or in the cloud is related to Big Data. In particular, the
article seems to equate YouTube videos with Big Data. The word "data" has essentially become
a meaningless epithet applied to all things digital. This is a horrible practice perpetrated
by people who lack a tech background. So here's a handy guide for all the journalists who
shunned techie courses in favor of fun times in the humanities.

The way to decide whether something constitutes "Big Data" is to ask whether it's data and, if so, how big it is. Just because something is digital, or is the input to a computer program, does not make it "data" in the "Big Data" sense. The question is: what is the information content here, and how big is it?

Quantifying information content is
actually a complex and deep subject, the domain of an intriguing field called Information Theory,
but I'll skip the complexities of Info Theory, not mention Shannon, avoid defining entropy, and describe a simpler
technique that anyone can use: simply summarize in English what information
is encoded in the data.

For instance, imagine the universe, so rich and full
of information that it defies a summary. The universe, or even
smallish fractions of it whose inner workings we cannot currently summarize with our existing knowledge, would be the subject of Big Data. Imagine
the data collected by the Square Kilometre Array: petabytes collected
per day through thousands of radio receivers sprawled over an enormous chunk of land. That's Big Data; the information collected by these radio telescopes might even contain traces of other civilizations.
Imagine all the biomolecular information collected at laboratories around the world; again, Big Data,
holding secrets to countless drugs. Imagine the information encoded in all books published since the 1600s; Big Data. Imagine
the information embodied in the movements of humans, and how it encodes all sorts of complex phenomena; Big Data.

Now imagine a dumb cat video on YouTube. It can be summarized in 14
bytes as "dumb cat video." It's not Big Data, no matter how many times it is downloaded.
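The summarize-it-in-English heuristic can be sketched in a few lines of Python. This is a toy illustration, not a rigorous measure: the file size is made up, and the function name is mine, not from any library.

```python
def summary_ratio(summary: str, raw_size_bytes: int) -> float:
    """Ratio of an English summary's size to the raw digital size.

    A vanishingly small ratio suggests the bits are mostly
    redundant pixels, not information in the "Big Data" sense.
    """
    return len(summary.encode("utf-8")) / raw_size_bytes

# A cat video: tens of megabytes of pixels (size assumed here),
# but only 14 bytes of actual information content.
ratio = summary_ratio("dumb cat video", 50_000_000)
print(ratio)  # a tiny fraction: almost none of the bits carry new information
```

By contrast, an honest "summary" of the Square Kilometre Array's daily output would be nearly as large as the data itself, which is exactly what makes it Big Data.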

Terrible Metric for Impact

The metric that people use for economic impact is often GDP. Indeed, the article measures Big Data's
impact by how much it grows the economy. Yet even economists will freely admit that GDP is at best a misleading metric for progress.

The textbook case illustrating the failings of GDP as a progress metric is the "broken window" example:
if someone were to go around breaking perfectly fine windows, the GDP of a country would increase as everyone had to buy replacement glass, yet almost no one, save for a few glass vendors, would be any happier or better off.

The GDP is an especially misleading
metric for Big Data, as Big Data is often used to improve business efficiency, which is more likely to
shrink the economy than to grow it in the short term. Imagine that Target studies the buying patterns of its customers
so well that it only ships precisely as many items as will be sold, precisely on time -- the net effect
will be a reduction in GDP!

Terrible Product Placement

Below the fold there is a "submarine": a reference to a small player in the
Big Data space. Even though the company is actually all about improving efficiency, the
article doesn't pick up on this fact and goes on to talk about GDP growth, leaving the
reader wondering why this reference was dropped in in the first place.

There are countless companies in the Big Data space. Surely, the NYT can afford a few phone calls.

Terrible Framework

Overall, the discussion falls far short of the mark. Big Data,
properly defined, is clearly not a dud -- it has the potential to
improve our lives in immeasurable ways, through drug discovery,
individualized medicine, more efficient business practices, and many
others in all aspects of science, every branch of engineering and
even, with the resurgence of quantitative methods in the humanities,
in liberal arts. But if we let the word get misappropriated, if we
blithely apply it to everything digital until the term loses its
meaning, much like IBM's erstwhile "autonomic computing," then it
is guaranteed to become a bandwagon term that denotes whatever
the author feels like that day, and guaranteed to fall short of
expectations.