Why Big Data Isn’t Necessarily Better Data


Larry Greenemeier is the associate editor of technology for Scientific American, covering a variety of tech-related topics, including biotech, computers, military tech, nanotech and robots. Follow on Twitter @lggreenemeier.

Google Flu Trends exemplifies the best and worst of research culled from vast amounts of data available on the Web. Image courtesy of Google Flu Trends.

Tech companies—Facebook, Google and IBM, to name a few—are quick to tout the world-changing powers of “big data” gleaned from mobile devices, Web searches, citizen science projects and sensor networks. Never before has so much data been available covering so many areas of interest, whether it’s online shopping trends or cancer research. Still, some scientists caution that particularly when it comes to data, bigger isn’t necessarily better.

Context is often lacking when information is pulled from disparate sources, leading to questionable conclusions. A case in point is the difficulty Google Flu Trends (GFT) has had at times in accurately measuring influenza levels since Google launched the service in 2008. A team of researchers explains where this big-data tool falls short—and where it has much greater potential—in a Policy Forum published Friday in the journal Science.

Google designed its flu data aggregator to provide real-time monitoring of influenza cases worldwide based on Google searches that matched terms for flu-related activity. Despite some success, GFT has overestimated peak flu cases in the U.S. over the past two years. GFT overestimated the prevalence of flu in the 2012-2013 season, as well as the actual levels of flu in 2011-2012, by more than 50 percent, according to the researchers, who hail from the University of Houston, Northeastern University and Harvard University. Additionally, from August 2011 to September 2013, GFT over-predicted the prevalence of flu in 100 out of 108 weeks.

Nature reported in a February 2013 news article that GFT predicted more than twice as many doctor visits for influenza-like illness as the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from a number of U.S. laboratories. (Scientific American is part of the Nature Publishing Group.)

Google’s software “relies on data mining records of flu-related search terms entered in Google’s search engine, combined with computer modeling,” Nature reported. Even though the researchers who wrote this week’s Policy Forum for Science cite several instances where GFT has faltered, Nature pointed out that GFT’s overall body of work has “almost exactly matched the CDC’s own surveillance data over time—and it delivers them several days faster than the CDC can.”

Google itself concluded in a study last October that its algorithm for flu (as well as the one behind its more recently launched Google Dengue Trends) was “susceptible to heightened media coverage” during the 2012-2013 U.S. flu season. “We review the Flu Trends model each year to determine how we can improve—our last update was made in October 2013 in advance of the 2013-2014 flu season,” according to a Google spokesperson. “We welcome feedback on how we can continue to refine Flu Trends to help estimate flu levels.”

The Policy Forum researchers recognize that increased traffic to flu-related online resources could have factored into the problem, but they question whether “a media-stoked panic last flu season” fully explains “why GFT has been missing high by wide margins for more than [two] years. A more likely culprit is changes made by Google’s search algorithm itself.”

This is key to the researchers’ argument and they contend that two issues have contributed far more to GFT’s mistakes: algorithm dynamics and “big data hubris.”

Big data hubris is the “often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.” The mistake of many big data projects, the researchers note, is that they are not built on technology designed to produce valid and reliable data amenable to scientific analysis. The data come from sources such as smartphones, search results and social networks rather than carefully vetted participants and scientific instruments.

Other studies have shown the value of big data, the researchers acknowledge, yet “we are far from a place where they can supplant more traditional methods or theories.”

They note that “greater value can be obtained by combining GFT with other near–real-time health data.” For example, “by combining GFT and lagged CDC data, as well as dynamically recalibrating GFT, we can substantially improve on the performance of GFT or the CDC alone.” Big data could likewise be an effective tool for better understanding the unknown, in areas where CDC data does not work well, such as presenting flu prevalence at very local levels.
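The recalibration idea the researchers describe can be sketched with a toy simulation. Everything below is invented for illustration—the synthetic flu levels, the 50 percent overshoot, the two-week CDC reporting lag and the eight-week correction window are assumptions, not the researchers' actual model. The sketch simply shows how lagged ground-truth data can be used to correct a systematically biased real-time estimate:

```python
import random

random.seed(42)

# Synthetic stand-ins: weekly flu levels, a biased "GFT-like" estimate
# that overshoots by roughly 50 percent, and CDC data arriving 2 weeks late.
weeks = 104
true_flu = [5 + 3 * random.random() for _ in range(weeks)]
gft = [t * 1.5 + random.gauss(0, 0.3) for t in true_flu]
cdc_lag = 2  # assumed reporting delay, in weeks

recalibrated = []
for w in range(weeks):
    # Use recent weeks whose lagged CDC figures are already available
    # to estimate a correction factor for the current GFT reading.
    hist = range(max(0, w - 8), w - cdc_lag)
    pairs = [(gft[i], true_flu[i]) for i in hist]
    if pairs:
        scale = sum(t for _, t in pairs) / sum(g for g, _ in pairs)
    else:
        scale = 1.0  # no CDC history yet: fall back to raw GFT
    recalibrated.append(gft[w] * scale)

def mae(est):
    """Mean absolute error against the true flu levels."""
    return sum(abs(e - t) for e, t in zip(est, true_flu)) / weeks

print(f"raw GFT error:          {mae(gft):.2f}")
print(f"recalibrated GFT error: {mae(recalibrated):.2f}")
```

In this toy setup the dynamically recalibrated series tracks the true levels far more closely than the raw biased estimate, which is the gist of the researchers' point: the lagged but reliable source corrects the fast but biased one.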

Projects would also benefit from more transparency by improving others’ ability to replicate them, according to the researchers. Platforms such as Google, Twitter and Facebook are always re-engineering their software, and whether studies based on data collected at one time could be re-done with data collected from earlier or later periods is an open question.


Lots of people still naively believe the big-data advertising pitch: “We can learn everything from smartphones and search engines.” You failed? “We will build other models, and then we can learn everything from smartphones and search engines!”

Things to remember:
Adding new data is an advantage only if the amount of useful information it brings is bigger than the amount of noise it adds.

Humans are already highly skilled machines for browsing big data. All those millennia of spotting lurking tigers in fields of tangled grass, etc. If you see no pattern by eye, very likely no algorithm can find anything either.

It is impossible to predict many things even with all the information available. Consider your spouse: somebody you have known for years, about whom you have any amount of information, yet you still often predict wrongly what birthday gift to buy.

Do not underestimate the power of big data (rightly or wrongly used) to influence behavior. We are heading into the Internet of Everything, and much of that influence will be psychologically transparent: we will be oblivious to it.

The fact is: even if your process or algorithm has a very high Q factor (the ability to filter out noise, or in database terms to select only the relevant data), adding massive amounts of input data is still going to pass much more noise through into the results.
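The commenter's point can be seen in a minimal toy experiment (all numbers invented): pooling in a flood of data that carries no signal makes a simple estimate worse, even though the total amount of data grows a hundredfold.

```python
import random

random.seed(0)

# Toy illustration: estimating a quantity by averaging measurements.
true_value = 10.0
signal = [true_value + random.gauss(0, 1) for _ in range(100)]  # noisy but informative
noise_only = [random.gauss(0, 5) for _ in range(10_000)]        # "big data" with no signal

est_small = sum(signal) / len(signal)          # small, relevant sample
pooled = signal + noise_only
est_big = sum(pooled) / len(pooled)            # 100x more data, mostly noise

print(f"error from 100 informative points:  {abs(est_small - true_value):.2f}")
print(f"error after pooling in 10,000 noise points: {abs(est_big - true_value):.2f}")
```

The small, relevant sample lands near the true value; the hundred-times-larger pooled sample is dragged far off by the uninformative data, which is exactly the "more noise in the results" effect described above.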

On the other hand, I must say that I am impressed that Google and others can process the massive amounts of data they have at all. A megabyte was once considered a massive amount of data. These days, pocket-sized devices that hold gigabytes are considered limited in storage.