Steamrolled by Big Data

Five years ago, few people had heard the phrase “Big Data.” Now, it’s hard to go an hour without seeing it. In the past several months, the industry has been mentioned in dozens of New York Times stories, in every section from metro to business. (Wired has even already declared it passé: “STOP HYPING BIG DATA AND START PAYING ATTENTION TO ‘LONG DATA’.”) At least one corporation, the business-analytics firm SAS, has a Vice-President of Big Data. Meanwhile, nobody seems quite sure exactly what the phrase means, beyond a general impression of the storage and analysis of unfathomable amounts of information, but we are assured, over and over, that it’s going to be big. Last summer, Jon Kleinberg, a computer scientist at Cornell, said in the Times that “The term itself is vague, but it is getting at something that is real… Big Data is a tagline for a process that has the potential to transform everything.”

Most of what’s written about Big Data is enthusiastic, like Kenneth Cukier and Viktor Mayer-Schönberger’s gushing ode “Big Data: A Revolution That Will Transform How We Live, Work, and Think,” which is currently selling briskly on Amazon, or the recent Times article on Mayor Bloomberg’s geek squad, and how “Big Data’s moment, especially in the management of cities, has powerfully and irreversibly arrived.” But despite the sense of excitement and promise surrounding the industry, Big Data isn’t nearly the boundless miracle that many people seem to think it is.

The reason scarcely anybody used to talk about Big Data is that, until very recently, it didn’t exist; most data had been, by current standards, small potatoes. Now, Big Data is mainly measured in terabytes (trillions of bytes) and petabytes (quadrillions of bytes); within a decade, even those numbers may seem quaint.

As companies like Google have shown, more data often means newer and better solutions to old problems. Last year, I wrote about how Google significantly improved spell-checkers by using massive databases of users’ self-corrections to do work that previously required hand-crafted algorithms focussed on the intricacies of English spelling and the psychology of typing. Google’s new trick wouldn’t work if you only had a few user searches to draw on, but if you have trillions of searches from many millions of users, across a hundred and forty-six languages, it’s pure genius, and a technique that can be rapidly applied to different languages with relatively little manual labor. And it’s just one of hundreds or perhaps thousands of innovations driven by the sheer mass of data that we’re capable of storing, wrangling, and manipulating. Cukier and Mayer-Schönberger’s book, for example, explains how the artificial-intelligence researcher Oren Etzioni created Farecast (eventually sold to Microsoft, and now part of Bing Travel), which scraped data from the Web to make good guesses about whether airline fares would rise or fall. Coupled with some advances in statistical techniques, Big Data is now de rigueur, to the point, almost, of being a kind of new religion, nicely parodied in Dilbert last summer: “In the past, our company did many evil things,” Dilbert’s pointy-haired boss says, “but if we store Big Data in our servers we will be saved.”
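
To make the spell-checking idea above concrete, here is a minimal sketch of learning corrections from users’ own retyped queries. It is not Google’s actual method, which the article does not describe in detail; the function names, the toy log, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_correction_table(query_pairs):
    """Count how often users retyped `typed` as `corrected`.

    `query_pairs` is a hypothetical log of (typed, corrected) pairs.
    """
    counts = defaultdict(Counter)
    for typed, corrected in query_pairs:
        if typed != corrected:
            counts[typed][corrected] += 1
    return counts

def suggest(counts, word, min_count=2):
    """Return the most frequent self-correction for `word`.

    Falls back to the word itself when there is too little evidence,
    which is why the approach needs a very large log to be useful.
    """
    candidates = counts.get(word)
    if not candidates:
        return word
    best, seen = candidates.most_common(1)[0]
    return best if seen >= min_count else word

# Hypothetical toy log; a real one would hold vastly more entries.
LOG = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "relieve"),
    ("teh", "the"),
]
TABLE = build_correction_table(LOG)

print(suggest(TABLE, "recieve"))
print(suggest(TABLE, "teh"))
```

With only one observation for “teh,” the sketch declines to correct it, which mirrors the point that the trick depends on the sheer volume of searches rather than on any knowledge of English spelling.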

Companies like the PalmPilot co-founder Jeff Hawkins’s Numenta offer the promise of something even more transformational: universal, one-size-fits-all, real-time predictive analytics. According to Numenta’s Web site, their software, Grok, “finds complex patterns in data streams and generates actionable predictions in real time…. Feed Grok data, and it returns predictions that generate action. Grok learns and adapts automatically.” Numenta boasts that “As the age of the digital nervous system dawns, Grok represents the type of technology that will convert massive data flows into value.” The company doesn’t lack for press, either; the Times (“JEFF HAWKINS DEVELOPS A BRAINY BIG DATA COMPANY”), Technology Review, Forbes, and Bloomberg News have all covered Numenta with gusto.

According to Quentin Hardy of the Times, “Jeff Hawkins has been a pioneer of mobile devices, a distinguished lecturer in neuroscience, and a published author of a revolutionary theory of how the brain works. If he’s right about Big Data, a lot of people are going to wish he’d never gone into that field.” Why? “From initially observing the data flow, [Grok] begins making guesses about what will happen next. The more data, the more accurate the predictions become.” (It also promises to obviate the need for massive hard drives: by analyzing incoming data so quickly, there will be no need to store the old information.)

Read a few paragraphs further down, though, and the article reveals that, for all of Numenta’s billing as a one-size-fits-all automatic solver, Grok is “still in limited release, with just a few customers in the fields of energy, media, and video processing.” Numenta champions data but so far gives little concrete support for its own claims. (A company spokesperson noted that Grok “is in private beta and pilots with select customers in a variety of vertical markets, including electric energy, I.T. management, online advertising, and finance.”)

If fifty years of research in artificial intelligence has taught us anything, it’s that every problem is different, that there are no universally applicable solutions. An algorithm that is good at chess isn’t going to be much help parsing sentences, and one that parses sentences isn’t going to be much help playing chess. A faster computer will be better than a slower computer at both, but solving problems will often (though not always) require a fair amount of what some researchers call “domain knowledge”—specific information about particular problems, often gathered painstakingly by experts. So-called machine learning can sometimes help (spell-checking is a case where it can help a lot, and ditto for speech recognition), but nobody has ever, for example, built a world-class chess program by taking a generally smart machine, endowing it with enormous data, and letting it learn for itself. If Grok really did what its Web site promises, a complex problem like chess would be grist for its mill; there are tons of chess games online, often live, and the rules of chess could be programmed into Grok in an hour. But I’ll eat my hat, and send Hawkins a personal apology, if he can get Grok to mine that stream of chess games well enough to beat Magnus Carlsen or Garry Kasparov without relying on a whole lot of expert knowledge of the game, even using whatever high-end hardware Numenta presumably has access to. (Nevertheless, a Numenta spokesperson noted that “Grok is ideally suited for working with fast data streams.”)

Of course, Numenta is not the only company working with Big Data. Almost every expert I spoke to for this story mentioned interesting work being done elsewhere in the field. I.B.M., for example, used Big Data (along with many other techniques) to great effect in its “Jeopardy”-winning Watson, and products such as Siri and Google search depend heavily on it, without quite making the lavish promises of human-free automaticity that Numenta implies.

Some problems do genuinely lend themselves to Big Data solutions. The industry has made a huge difference in speech recognition, for example, and is also essential in many of the things that Google and Amazon do; the Higgs Boson wouldn’t have been discovered without it. Big Data can be especially helpful in systems that are consistent over time, with straightforward and well-characterized properties, little unpredictable variation, and relatively little underlying complexity.

But not every problem fits those criteria; unpredictability, complexity, and abrupt shifts over time can lead even the largest data astray. Big Data is a powerful tool for inferring correlations, not a magic wand for inferring causality. The field has thus far apparently yielded only modestly improved weather prediction, and had little, if any, impact on challenges such as getting computers to program themselves. “Jeopardy” is a feasible application because most of the required knowledge derives from titles on Wikipedia pages; it’s largely an exercise in data retrieval, to which Big Data is well-suited. Chess, in contrast, is an exercise in novelty that demands an enormous amount of precision. Every position is different and has its own best move, often a function of a great number of interdependent pieces with complex relationships that are highly sensitive to exact details. In Google Translate, gist often suffices; in chess, nobody can win at a grandmaster level by picking moves that are only roughly correct.
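
The gap between correlation and causality mentioned above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables `hidden`, `a`, and `b` are invented, and the noise levels are arbitrary. Two measurements are both driven by a hidden factor, so they correlate strongly even though neither causes the other.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(n=10_000, seed=0):
    """Generate a hidden factor and two measurements that both depend on it.

    Neither `a` nor `b` influences the other; each is the hidden factor
    plus its own independent noise.
    """
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(n)]
    a = [h + rng.gauss(0, 0.5) for h in hidden]
    b = [h + rng.gauss(0, 0.5) for h in hidden]
    return hidden, a, b

hidden, a, b = simulate()

# Strong correlation, despite there being no causal link between a and b.
raw = pearson(a, b)

# Subtracting the hidden factor leaves only independent noise,
# so the apparent relationship disappears.
adjusted = pearson(
    [x - h for x, h in zip(a, hidden)],
    [y - h for y, h in zip(b, hidden)],
)

print(f"raw correlation: {raw:.2f}, after removing hidden factor: {adjusted:.2f}")
```

More rows would sharpen the estimate of the correlation, but no quantity of data on `a` and `b` alone would reveal that the hidden factor is doing the work.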

In fact, one could see the entire field of artificial intelligence as an inadvertent referendum on Big Data, because nowadays virtually every problem that has ever been addressed in A.I., from machine vision to natural language understanding, has also been attacked from a data perspective. Yet most of those problems are unsolved, Big Data or no. Even with the world’s largest databases, for example, the challenge of machine vision remains largely open. Last summer, the “cat detector” at Google (which I’ve mentioned before) was trained on ten million images, using a thousand machines, for three days, and although the program managed to “learn” what a cat looks like, its overall visual performance was poor. Cats that were bigger than usual, smaller than usual, or slightly out of frame caused significant decrements in performance. It’s also a good bet that the system would do a lot worse in more complex scenes with many objects. That’s not because Big Data isn’t useful; it’s because, in vision, as in so many things, Big Data is only a small part of a solution.

As one skeptical industry insider, Anthony Nyström, of the Web software company Intridea, put it to me, selling Big Data is a great gig for charlatans, because they never have to admit to being wrong. “If their system fails to provide predictive insight, it’s not their models, it’s an issue with your data.” You didn’t have enough data, there was too much noise, you measured the wrong things. The list of excuses can be long.

In reality, most computational models of most things have, historically speaking, been wrong—or at least incomplete, effective in some circumstances, not all. Even Google, which likely has the biggest data of anyone, still uses humans to hand-curate some of it, because unanalyzed gobs of information are no guarantee of anything, and giant servers still can’t serve as fully trustworthy replacements for human judgment.

For perspective, it might help to consider the challenge of inferring the structure of a protein from its underlying DNA sequence, a problem with an enormous number of applications in medicine, and throughout biology. Hundreds if not thousands of researchers have worked on the problem for fifty years, and for the last decade have had large databases to help; yet, in the words of a review published a few months ago in Science, “no single group [of researchers] yet consistently produces accurate models,” especially with more complex DNA sequences that don’t closely resemble genes that are already well understood. The more complex a problem is, and the more particular instances differ from those that came before, the less likely Big Data is to be a sure thing.

In the years to come, scientists and engineers will develop a clearer picture of the circumstances in which Big Data can and can’t make a big difference; for now, hype needs to be tempered with caution and a sensitivity to when humans should and should not remain in the loop. As Alexei Efros, one of the leaders in applying Big Data to machine vision, put it, Big Data is “a fickle, coy mistress,” inviting, yet not without risk.

Illustration by Joost Swarte

Gary Marcus is a professor of cognitive science at N.Y.U. and the author of “Guitar Zero.”