The backlash against big data, continued

Ignore the hype. Learn to be a data skeptic.

Yawn. Yet another article trashing “big data,” this time an op-ed in the Times. This one is better than most, and ends with the truism that data isn’t a silver bullet. It certainly isn’t.

I’ll spare you all the links (most of which are much less insightful than the Times piece), but the backlash against “big data” is clearly in full swing. I wrote about this more than a year ago, in my piece on data skepticism: data is heading into the trough of a hype curve, driven by overly aggressive marketing, promises that can’t be kept, and spurious claims that, if you have enough data, correlation is as good as causation. It isn’t; it never was; it never will be. The paradox of data is that the more data you have, the more spurious correlations will show up. Good data scientists understand that. Poor ones don’t.
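To make that paradox concrete, here is a minimal sketch, not a rigorous analysis. It assumes that "more data" means more measured variables, and it fills every column with independent random noise, so any strong correlation it finds is spurious by construction. The helper names (`pearson`, `count_spurious`), the 30 observations per variable, and the 0.4 cutoff are all illustrative choices, not anything taken from the Times piece or the Pantheon project.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_vars, num_obs=30, threshold=0.4, seed=0):
    """Count pairs of pure-noise variables whose |r| meets the threshold.

    Every column is independent Gaussian noise, so each pair counted
    here is a correlation with no underlying relationship at all.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(num_obs)]
        for _ in range(num_vars)
    ]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) >= threshold
    )


# The number of pairs grows roughly with the square of the number of
# variables, and the count of "impressive" correlations grows with it.
for num_vars in (10, 50, 200):
    print(num_vars, "variables ->", count_spurious(num_vars), "spurious pairs")
```

The exact counts depend on the seed and the cutoff, but the trend does not: with 200 variables there are 19,900 pairs to test, and a small fraction of a very large number is still a lot of "findings" that mean nothing.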

It’s very easy to say that “big data is dead” while you’re using Google Maps to navigate downtown Boston. It’s easy to say that “big data is dead” while Google Now or Siri is telling you that you need to leave 20 minutes early for an appointment because of traffic. And it’s easy to say that “big data is dead” while you’re using Google, or Bing, or DuckDuckGo to find material to help you write an article claiming that big data is dead.

Big data isn’t dead, though I only use the word “big” under duress. It’s just data. There’s more of it around than there used to be; we have better tools to generate, capture, and store it. As I argued in the beginning of 2013, the mere existence of data will drive the exploration and analysis of data. There’s no reason to believe this will stop.

That said, let’s look at one particular point from the Times op-ed: successful data analysis depends critically on asking the right question. It’s not so much a matter of “garbage in, garbage out” as it is “ask the wrong question, you get the wrong answer.” And here, the author of the Times piece is at least as uncritical as the data scientists he’s criticizing. He criticizes Steven Skiena and Charles Ward, authors of Who is Bigger, along with MIT’s Pantheon project, for the claim that Francis Scott Key was the 19th most important poet in history, and Jane Austen was only the 78th most important writer, and George Eliot the 380th.

Of course, this hinges on the meaning of “important.” If “important” means “central to the musical or literary canon,” then yes, the data-driven results are nonsense. But I wouldn’t expect data analysis to give me the same results I could get by talking to musicologists or literature professors. If by important, we mean that the works somehow drove historical events, I would expect the author of “The Star-Spangled Banner” (to say nothing of the author of “La Marseillaise”) to outrank Keats. People don’t fight wars citing Keats’ Ode on a Grecian Urn.

The Pantheon project doesn’t use the word “important”; it measures global historical popularity, which is something quite different. And its result just isn’t very surprising. It is easy to forget how many authors there are; coming in 78th is not a bad showing when you’re competing with Homer, Shakespeare, and Dante. I am certainly not in a position to debate whether Austen is more or less popular than the 17th-century Japanese author Basho (52) or, for that matter, Nostradamus (20).

What do we mean by importance? What do we mean by influence? What do we mean by popularity? These are the sorts of questions you have to ask before doing any data analysis. I haven’t read Who is Bigger, but the Pantheon site does an excellent job of discussing its methodology, biases and limitations. And it provides an excellent foundation for a more important, nuanced discussion of popularity, influence, and importance.

There is a lot of hype about “big data,” and much of it is ridiculous. Ignore the hype. Learn to be a data skeptic. That doesn’t mean becoming skeptical about the value of data; it means asking the hard questions that anyone claiming to be a data scientist should ask. Think carefully about the questions you’re asking, the data you have to work with, and the results that you’re getting. And learn that data is about enabling intelligent discussions, not about turning a crank and having the right answer pop out.

Data is data. It was valuable 50 years ago, when IBM released the first model 360. It’s more valuable today.

Scott Berkun

A clear sign of hyperbole is when everyone is arguing about a term without defining what it means. Big Data is well into its hype cycle: at the same time, we have people lauding it as the future and decrying it as over, without any of them being clear about what it is. I’ve yet to see any of these articles bother to explain what they mean by the term, which signals all on its own how groundless their overly polarized opinions are.

datachick

Those of us who have been doing this for a while saw the hype cycle of BigData(TM) coming. But that’s just the buzzword term. We still see the potential and need for the tools and methods associated with Big Data.

I think we will see the same thing with NoSQL. Both are terms that are mostly defined by what they *aren’t*, which is why the profession is struggling to define these things.

Ernie Davis

Pantheon aims to be a measure of “cultural contribution,” not “global historical popularity.” Judging by Amazon sales, Nostradamus is a lot less popular than Austen or Eliot. Middlemarch (Penguin edition) is #2873; Pride and Prejudice (paperback) is #5531; the most popular edition of Nostradamus that I’ve found is #157,099.

The example of Francis Scott Key was convenient for the Times article, but there are lots of terrible comparative judgments of “historical impact” in “Who’s Bigger”. For instance, Queen Victoria, who had no significant political power, is ranked #16, whereas Bismarck, who unified Germany, is ranked 97. I wrote a full review of this book at https://www.siam.org/news/news.php?id=2132

— Ernie Davis

Ernie Davis

In response to Scott Berkun: We certainly didn’t say that “Big Data is over”; quite the contrary, we wrote “Big data is here to stay, as it should be.”

Let me suggest the following definition of Big Data: The use of statistical models of surface characteristics of very large data corpora, often collected unsystematically. As with most definitions of vague categories, it fits some examples better than others. It certainly includes some very legitimate and successful projects.

Ernie Davis

By the way, Keats is ranked _above_ Francis Scott Key at WhoIsBigger.

Vincent Granville

What about fake big data? This article has 239 tweets as of today, 25 Google+ likes, and 83 LinkedIn shares. Yet it has only 6 comments. This does not make sense: with so many tweets (and given how easy it is to post a comment here), you should have 400 comments by now. Something is wrong here.