Over the past few decades, researchers in a variety of fields have had to come to grips with analyzing massive data sets. These can be generated intentionally, through things like astronomy surveys and genome sequencing, or they can be generated incidentally—through things like cell phone records or game logs.

The development of algorithms that successfully pull information from these masses of data has led some of the more enthusiastic proponents of big data to argue that it will completely change the way science is done (one even argued that big data made the scientific method obsolete). In today's issue of Science, however, a group of scientists throws a bit of cold water on the big-data hype, in part by noting that one of the most publicly prominent examples of massive data analysis, Google Flu Trends, isn't actually very good.

Not so trendy

Their analysis builds on an earlier report from Nature News that highlights a few clear failures of Google Flu Trends. The service is meant to give real-time information on seasonal flu outbreaks by tracking a series of search terms that tend to be used by people who are currently suffering from the flu. This should provide a bit of lead time over the methods used in the US and abroad, which aggregate monitoring data from a large number of healthcare facilities. Those are considered the definitive measurements, but the testing and data aggregation take time, while Flu Trends can be updated in near real time.

The problem is that Flu Trends has gotten it badly wrong in at least two cases. The reason for these errors is remarkably simple: the flu was in the news, and people were therefore more interested in and/or concerned about its symptoms. Use of the key search terms rose, and, at some points, Google Flu Trends predicted double the number of infected people that Centers for Disease Control data later showed. (One of these cases was the global pandemic of 2009; the second was an early and virulent start to the flu season in 2013.)

On its own, this isn't especially damning. But the authors note that Flu Trends has consistently overestimated actual cases, coming in high in 93 percent of the weeks in one two-year period. You can do just as well by taking the lagging CDC data and feeding it into a model that contains information about past flu dynamics. And, they point out, this sort of model can be improved, unlike the Flu Trends algorithm.
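To make "a model that contains information about past flu dynamics" concrete, here is a minimal sketch, assuming weekly data, a roughly two-week reporting lag, and an ordinary-least-squares fit. The numbers are synthetic and purely illustrative; this is not the model the Science authors actually built, only the general shape of one.

```python
# A toy "lag plus seasonality" model: predict this week's flu level from the
# last CDC number the model could have seen plus a fixed annual cycle.
# All data below is synthetic and for illustration only.
import numpy as np

rng = np.random.default_rng(0)

weeks = np.arange(260)                       # five years of weekly data
ili = 2.0 + 1.5 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 0.2, weeks.size)

LAG = 2                                      # assume CDC reports run ~2 weeks behind

X = np.column_stack([
    ili[:-LAG],                              # latest available (lagged) CDC value
    np.sin(2 * np.pi * weeks[LAG:] / 52),    # seasonal terms
    np.cos(2 * np.pi * weeks[LAG:] / 52),
    np.ones(weeks.size - LAG),               # intercept
])
y = ili[LAG:]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
pred = X @ coef
print("in-sample RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```

The authors' point is simply that something this modest, built on the lagging official data alone, already does about as well as Flu Trends.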

In describing their system, the Flu Trends engineers have said that they started by identifying a series of search terms that correlated with CDC data. They then had to exclude a bunch of search terms that correlated with the flu data simply because they follow the same seasonality (high school basketball was apparently one of them). And the remaining terms? They've never actually been described in full, even as Google engineers have revised the system. That means that Flu Trends results are fundamentally irreproducible, and nobody outside of Google could ever improve the system.
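Since the actual term list and selection code have never been published, any reconstruction is guesswork, but a screen of the kind described might look roughly like this: correlate each candidate term with the CDC series only after regressing a generic seasonal cycle out of both, so that terms like "high school basketball" that merely share the winter peak drop out. The function names and threshold below are assumptions for illustration.

```python
# Guess at the general shape of the term screen described above: keep terms
# whose search volume still tracks the CDC series after a generic seasonal
# cycle has been regressed out of both.
import numpy as np

def deseasonalize(series, season):
    """Subtract the best linear fit of a seasonal template from a series."""
    X = np.column_stack([season, np.ones_like(season)])
    coef, *_ = np.linalg.lstsq(X, series, rcond=None)
    return series - X @ coef

def select_terms(term_volumes, cdc_ili, season, threshold=0.5):
    """term_volumes: dict of term -> weekly volume array, aligned with cdc_ili."""
    cdc_resid = deseasonalize(cdc_ili, season)
    keep = []
    for term, volume in term_volumes.items():
        r = np.corrcoef(deseasonalize(volume, season), cdc_resid)[0, 1]
        if r > threshold:          # "high school basketball" should fail here
            keep.append(term)
    return keep
```

A term that correlates with raw flu counts only through the shared seasonal cycle should see its deseasonalized correlation collapse toward zero, which is the kind of behavior the engineers say they filtered on.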

Complicating matters further, Google changes its search behavior and results in various ways for reasons that have nothing to do with flu trends, and those feed back into user behavior in complicated ways. The company has also engaged in constant warfare with people who want to game its system, a problem it shares with other commercial sources of big data. All of these factors can make the real goal of big data analyses—getting at some underlying feature of reality—a tricky prospect.

Thinking big

The researchers note that none of this means that Google Flu Trends is useless. It would be more useful if it were reproducible, but even without that, it serves as a helpful addition to the CDC numbers. And, although the piece reads like a bit of a takedown of Flu Trends, the authors' target is something larger, something they call "big data hubris."

This they define as "the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis."

The problem they identify resulting from this form of hubris is that it's relatively easy to use big data to identify eye-catching and publicity-generating correlations. It's much harder to turn these correlations into something that's scientifically actionable, and harder still to do the actual experiments that reach a scientifically valid conclusion. (As the authors put this, "The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.")

Put another way, it's not uncommon to hear the argument that "computer algorithms have reached the point where we can now do X." Which is fine in and of itself, except, as the authors put it, it's often accompanied by an implicit assumption: "therefore, we no longer have to do Y." And Y, in these cases, is the scientific grunt work involved in showing that a given correlation is relevant, general, driven by a mechanism we can define, and so forth.

And the reality is that the grunt work is so hard that a lot of it is never going to get done. It's relatively easy to use a computer to pick out thousands of potentially significant differences between the human and mouse genomes. Testing the actual relevance of any one of those could occupy a grad student for a couple of years and cost tens of thousands of dollars. Because of this dynamic, a lot of the insights generated using big data will remain stuck in the realm of uncertainty indefinitely.

Recognizing this is probably the surest antidote to the problem of big data hubris. And it might help us think more clearly about the sorts of big data work that are most likely to make a lasting scientific impact.

In cases where they're analyzing search terms, would it not be possible to "poison the well" by having a botnet run a massive amount of searches that would skew the results of such analysis?

If people hack a bunch of machines just to make people think it's flu season, then I think we are lucky they don't have any real aspirations. Also, the point of trending would be to pinpoint where it's worst. The botnet would have to be concentrated.

I've got a better example: the NSA and the war on terror. They collect massive amounts of data that could potentially be used to develop extensive profiles of basically everybody in the world, yet they can't even stop terrorists too dumb to blow themselves up.

Disclaimer: my PhD thesis was in machine intelligence and signal processing having to do with large image datasets.

Seems like a key point here is not that big data (I really hate this term) is wrong, but that having the wrong data doesn't work.

In the Flu Trends case the point is that the available information from just search terms is not sufficient to predict with high accuracy. This suggests that further data sources are necessary. I would think that adding the most recent CDC report (which will lag reality) and having the system learn the connection to search terms would probably improve accuracy significantly.
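A minimal sketch of the hybrid this commenter describes, assuming weekly data and a two-week reporting lag: regress the current CDC number on the latest report the model could actually have seen plus that week's search-term volumes. The lag and all names are illustrative assumptions, not anything Google has published.

```python
# Sketch of the suggested hybrid: latest available (lagged) CDC report plus
# current search-term volumes as predictors of the current CDC number.
import numpy as np

def fit_hybrid(cdc, searches, lag=2):
    """cdc: (T,) weekly CDC rates; searches: (T, k) weekly term volumes."""
    X = np.column_stack([
        cdc[:-lag],               # most recent report the model could have seen
        searches[lag:],           # search activity for the week being predicted
        np.ones(len(cdc) - lag),  # intercept
    ])
    coef, *_ = np.linalg.lstsq(X, cdc[lag:], rcond=None)
    return coef

def predict_hybrid(coef, last_cdc_report, this_week_searches):
    x = np.concatenate([[last_cdc_report], np.asarray(this_week_searches), [1.0]])
    return float(x @ coef)
```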

The problem I have seen in the past 2 years especially is this sudden hype train of "big data" going on, which is great for my job prospects (though bizarrely enough I took a job in wireless comm) but has made every crackpot think that running the latest algorithm of the week on any big pile of data will pop up gold like magic.

The data source needs to be reliable. Any errors in the data will cause errors in the models generated.

This is also a problem for things like Google Translate, which uses data from translated web pages to improve its translations. The problem is that several websites are using Google Translate to create translated pages, which then get fed back into the Google Translate model, causing it to degrade.

Wasn't "big data" as a buzzword preceded by "data mining", itself preceded by "spreadsheet management" ... all seemingly premised on the idea of hey - I've got all this data, surely there's some amazing knowledge to be extracted from it!

More data is not always better. In many cases, the amount of signal to be found remains the same, so you end up having to find that needle in a much bigger haystack of noise. Nassim Nicholas Taleb wrote a bit about the topic in one of his books, explaining that people are more likely to jump at false signals the more data they have. Something like "fooled by noise". In the end, it doesn't mean that big data is inherently bad, just that it has to have the right application.

I think part of the hype around big data is that, in the past, collecting enough data to verify large theories about social phenomena or healthcare effects was impossible. This kind of research was always limited by the amount of data that could reasonably be collected in a reasonable amount of time. Now big data eliminates this problem but presents the opposite one: cutting the data down with the right tools to find what is needed. And that's still very difficult.

The problem with the Flu Trends example is that there is not a correlation between the search terms and flu cases. You cannot know why someone is doing a particular search (they may be doing it for biology homework, or as research for a novel). Just because someone searches for "signs of pregnancy" does not mean they are pregnant (or implies that they are female!).

Every increase in flu searches needs to be compared to the news results at the time so they can apply a correction. The other problem is that if someone gets the flu one year and again the next, there would be no additional search done unless they infect others who had not recently had the flu.

Wasn't "big data" as a buzzword preceded by "data mining", itself preceded by "spreadsheet management" ... all seemingly premised on the idea of hey - I've got all this data, surely there's some amazing knowledge to be extracted from it!

I think you mean an "automatic, hassle-free way" of extracting knowledge from it.

You mean to say there isn't a correlation of 1 between searches and flu cases. Saying there is no correlation implies that people suffering flu symptoms are no more or less likely to look them up than people without flu symptoms, which is rather difficult to believe.

The problem you describe is noise: that is, the data contains a random scattering of things that look like flu cases but are not. The goal of the data-mining algorithm, then, is to be robust against this noise, and a huge amount of the work in the field is about how to do that for different sources of noise.

The reliable sources here are low-noise sources, which are great but come with their own problems. Data from the CDC, NIH, NHS, etc. will lag reality: they have to survey, collect results, process them, and report on them. This can take a significant amount of time, so often by the time the CDC says something, you're already sick. This lag, though, can be worked around using a predictive model, but that model will need another data source to drive predictions. This is why I suggest that combining search terms with CDC records could be used to learn a good system.

Hospital records are private information, and the agencies that deal with them have to collect them carefully anonymized (which is part of that lag time), and people would have a fit if Google started snatching up hospital records. (I know I would.)

I'm one of the people who use "big data" for medical research, and I can tell you that this term is already overused in the media by people who don't know what they're talking about. It's as if the self-appointed "trend-setters" have suddenly discovered that there are large databases out there with which scientific analyses are made. They never sat in a science class (at least not while awake) and are merely searching for a new buzzword for click-bait articles.

In many ways the confluence of statistical and database software, computing power, and a decade or so of accumulated medical/scientific data has created a 'golden age' of data analysis and scientific discovery in the biomedical field, beginning in the late 1990s. This trend continues today and is nothing new. What is new is some of the gigantic data needs ushered in by new instrumentation and studies on things like genomes, and occasionally some very large databases resulting from combining hospital data, insurance and social security health utilization data, census data, and things of this sort. As the article points out, science must still make sense of data and apply the scientific method to address various theories, analyses and research fronts. Also, data storage is an issue in some mega-data science projects because of the terabyte quantities of information which can be accumulated from just one study.

A typical health study does not require such masses of data in order to show statistically meaningful results. Some analyses, however, such as Geographical Information System analysis, genomic analysis, proper environmental monitoring and others, require a lot of datapoints.

Google's flu tracker is not the only real-time flu tracking system out there. Several large cities track near real-time drug store sales data of over-the-counter medications for sentinel detection of outbreaks of various illnesses, including food-borne illnesses (think diarrhea meds), flu, and other ailments. This is a practical outgrowth of funding and research on bioterrorism. There are also sentinel air sampling sites to specifically monitor for bioterrorism attacks.

The term "big data" is a fad term that will hopefully pass, as the ones who spout such hype go on to talk about the next "app that will change the world."

One big problem is that lots of diseases have plenty of flu-like symptoms, or even symptoms that are nothing like the flu at all but make people think they have the flu. There are also people who have the flu but think they have something else. This data would be much more useful in tracking hypochondria.

Right, big data is corrupt data. There must be academic, scientific oversight of the collection of data from people. Google and Facebook need to open the kimono and actually start sharing. Dunno if the DNA at G and F has been lost in the effort to reach the stratosphere of stock prices and money. Sadly, it would seem so.

The upside is that internet users are waking up to the deep behavior inspection Google and Facebook use on people. Every keystroke, mouse move, hesitation, attention shift, motion, and location can be, and often is, collected.

Hell, Samsung wanted to watch your eyes to see what you are looking at. Sheez. Glass could have been really good, but it's so packed with Google spyware that they should *pay* people to wear it. And when any sensor is on, a rotating beacon on a beanie should alert everyone nearby. The fact that it was *designed* with the assumption that data collection would be accepted shows how arrogant Google is: to them, it's natural to track everything. It would only be natural if, along the way, everything you took from people was demonstrably given back in some tangible form.

The notion that being watched is natural is truly problematic. Watching people via sensors and taps they don't know exist is just wrong. A new paradigm for creating big data is needed.

Well, sometimes you encounter people who claim that there should be a fourth V besides Volume, Velocity and Variety, namely Veracity. They do have a point.

But in general, it does not seem to me that the possibilities of Big Data are overhyped. Rather, people (salespeople?) confuse possible with easy. All the developments around Big Data have made things possible, but not necessarily easy.

To take the flu example: you could further enrich it by crawling media for mentions of flu and the context in which they appear, in order to allow corrections for that. You could also start monitoring a whole bunch of other terms that have been mostly stable, to have a better chance of detecting changes in Google's logic and partially compensating for those as well, etc. Naturally, such additions make it much more involved than just monitoring flu searches.
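A rough sketch of both corrections, under the assumption that you have weekly counts of flu-related news mentions and a basket of stable, flu-unrelated "control" terms; neither the data sources nor the method reflects anything Google has actually described.

```python
# Sketch of the two enrichment ideas above: (1) regress news coverage of flu
# out of the flu-search signal, and (2) normalize by flu-unrelated "control"
# terms so that platform-wide shifts in search behavior cancel out.
import numpy as np

def media_corrected(flu_searches, news_mentions):
    """Remove the component of flu searches explainable by news volume."""
    X = np.column_stack([news_mentions, np.ones_like(news_mentions)])
    coef, *_ = np.linalg.lstsq(X, flu_searches, rcond=None)
    residual = flu_searches - X @ coef
    return residual + flu_searches.mean()     # keep the overall level

def platform_normalized(flu_searches, control_terms):
    """control_terms: (T, k) volumes of stable, flu-unrelated terms.
    Divide by their average so site-wide changes in search behavior cancel."""
    baseline = control_terms.mean(axis=1)
    return flu_searches / (baseline / baseline.mean())
```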

Seems like if Flu Trends is just overpredicting, it could be tweaked to provide more accurate feedback.

Seems like a weird complaint. Sounds like the Google tracking can be useful and provide a picture more quickly; they just need to tweak their analysis of the data. In other words, seems like a lot of nothing.

Wasn't "big data" as a buzzword preceded by "data mining", itself preceded by "spreadsheet management" ... all seemingly premised on the idea of hey - I've got all this data, surely there's some amazing knowledge to be extracted from it!

And even before that there was "cybernetics". It was a popular buzzword in postwar socialist science.

The meaning was some kind of blend of computer science and self-organizational models. "With all this data, how could we not achieve socialism!"

In 1971, the socialist government of Salvador Allende in Chile constructed a cybernetic operations room with 500 telex machines linked to a computer for Bayesian analysis.

Science, especially in the healthcare arena, is susceptible to trends, usually not driven by the grunts who do the work, whereby the latest technique will supposedly put an end to the need to actually put thought into the process. Usually this is driven by commercial considerations, as hiring thinking people is expensive, but there are always a few zealots in the vanguard of the latest greatest thing since sliced bread.

The cycle is always the same. Zealot cries in the wilderness, with all of the skeptics quite rightly pointing out the obvious holes in the idea. Zealot gets high-up supporters, who suddenly see how, if all the claims are true, all those highly paid, science- rather than command-and-control-beholden nuisances could be replaced by a machine and a monkey, who would be paid in peanuts. New technique forced on most from above. General disillusion, followed by a few of the original skeptics realizing the true use of the new technique and adding it to current techniques, which allows data to be used to help solve even more complex puzzles than before. Zealots hounded out by higher-ups; technique now discouraged, as no money saved. Ex-skeptics now have to fight to retain what is now a useful supplement to all other techniques, one that requires new hiring to deal with the extra workload. Wash, rinse, repeat.

The flu is a virus, so there's little your primary healthcare provider can do for you that you can't self-treat with over-the-counter medications. Many people don't go to the doctor when they have the flu because of this. Additionally, as you develop more symptoms, you may again google flu symptoms to check whether symptom X is consistent with that self-diagnosis. I'm suggesting that a good percentage of cases involve multiple searches, and other cases still will never be officially reported via a medical facility. The Google numbers may be inflated relative to the official number, but I wonder if they properly account for the unreported cases plus the multiple search hits. Maybe the Google numbers are closer than we think.

Without the why/how part of the process, relying on big data can be positively dangerous. Remember that (roughly speaking) for every 20 long sequences of coin flips there's going to be one that shows "statistically significant" anomalies at the 95% level. So the more analyses you do looking for correlations, the more spurious ones you'll find. And you don't know which are which until you do the work.
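A quick simulation makes the 1-in-20 point concrete: generate batches of 20 fair-coin sequences, test each against p = 0.5 at the 95% level, and count the spurious "signals." The batch and sample sizes here are arbitrary choices.

```python
# Multiple-comparisons illustration: every coin is fair, yet roughly one
# sequence per batch of 20 looks "statistically significant" by chance.
import numpy as np

rng = np.random.default_rng(42)
n_batches, n_sequences, n_flips = 10_000, 20, 1_000

heads = rng.binomial(n_flips, 0.5, size=(n_batches, n_sequences))
z = (heads - n_flips * 0.5) / np.sqrt(n_flips * 0.25)   # z-test vs. p = 0.5
significant = np.abs(z) > 1.96                           # two-sided, 95% level

print("average spurious 'signals' per 20 sequences:", significant.sum(axis=1).mean())
print("share of batches with at least one:", significant.any(axis=1).mean())
# Expect about 1 spurious hit per batch, and roughly 1 - 0.95**20 ≈ 64% of
# batches flagging something, even though no coin is biased.
```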

The flu is a virus, so there's little your primary healthcare provider can do for you that you can't self-treat with over-the-counter medications. Many people don't go to the doctor when they have the flu because of this. Additionally, as you develop more symptoms, you may again google flu symptoms to check whether symptom X is consistent with that self-diagnosis. I'm suggesting that a good percentage of cases involve multiple searches, and other cases still will never be officially reported via a medical facility. The Google numbers may be inflated relative to the official number, but I wonder if they properly account for the unreported cases plus the multiple search hits. Maybe the Google numbers are closer than we think.

That doesn't make sense. You're suggesting that Google has somehow developed a magic formula that correlates searches to the "real" number of flu cases, without having any way of knowing what that real number is. That would be a pretty good trick, but is effectively impossible.

What Google claims to have done is basically a regression analysis against the official data. At best, that would give a predictor of the official data. It's not going to give you a predictor of some other kind of related data set.

I think Google is unique among businesses in that they really do have hundreds of world-class engineers who can vet the results. But that is extremely rare for a business.

I have worked with so many different types of reporting for various businesses that I always find this "big data" thing to be a little tedious, really. I wish that instead of "Big Data" people would call it "Applied Statistics" or something more accurate. It's not that there isn't data to be looked at, but for the most part decision makers are not in a position to make informed decisions about the magic that pops out, and as far as I have ever experienced, almost no one in any company I have ever worked for has the technical background required to do anything like "peer review" of the results that are presented. Unless you have a team of legitimate scientists who are trained in genuine statistical methodologies and an organizational culture that can vet the results objectively, then you probably are just wasting time following this fad. At the same time, the whole idea plays to the hubris of C-level managers, because everybody likes to push reports up the ladder that are based on "data". It's almost always total bs.

Bob just spent three weeks working with this data set and now we are all convinced that the "buy now" button on our website should be blue. Fantastic.

I imagine Google's methodology is a bit more complicated than counting the number of people who type "flu" into Google. And as a chaos theory fan, the idea the researchers put forth of forecasting flu epidemics from previous data doesn't sit well with me.

Wasn't "big data" as a buzzword preceded by "data mining", itself preceded by "spreadsheet management" ... all seemingly premised on the idea of hey - I've got all this data, surely there's some amazing knowledge to be extracted from it!

Fact-based decision making goes back much farther than that. Even Deming's ideas weren't "new"; he just helped manufacturing companies integrate the concepts.

I think that qvgamer's concern is that the CDC will never see all of the subclinical cases of flu. That's true, and if you care about the "true" number of flu cases it is a serious problem. For a given level of virulence there will be some fraction of cases serious enough to get into the CDC's database. Backing out the number of "true" cases would be very difficult work because the team would need to gauge the virulence of a strain and survey a big sample of the population over multiple flu seasons to build a model of subclinical flu cases.

The question is: do we really care? The CDC is concerned primarily with serious flu cases, and serious flu cases end up seeking care in exactly the places that CDC collects data. What Google is doing is predicting the numbers that CDC will publish. This may help hospitals and clinics do a better job responding to peaks and troughs of demand.

An economist would care about sick days during flu outbreaks. That model can be built using BLS data and the CDC data. Maybe each serious case is accompanied by an average of eight lost work days (average of four for the serious case plus an average of four more from not-quite-as-serious cases). In this case, Google's ability to predict the CDC numbers may make things move a bit faster but economics doesn't exactly move at a breakneck pace anyway.

A pharma company would care about symptomatic cases, clinical and subclinical. They have at their disposal their own past sales data and the CDC data. Google's ability to predict CDC numbers might help with marketing or shifting inventories a bit, but the lead time on manufacturing/packaging/distributing medicines prevents week-by-week tweaking. It'd actually be more useful to ramp up production of healthcare gear (masks, gloves, specula, tongue depressors, etc.), though not especially quickly. These are non-perishable items that were stocked up ahead of time and need to be replenished at some point.

The real challenge is predicting the prevalence of flu strains (for vaccine manufacturing) at the season level. As far as I know, it's no more sophisticated than "what we had last year is our best predictor of this year." Improving on that is way outside the scope of Flu Trends.

Yes. But my point is that the type of model Google is building needs to be tuned using real-world data. It simply cannot magically pull the "actual" number of cases out of its figurative ass, just because it sees a bunch of "flu" searches come through. qvgamer was suggesting that Google's model was doing exactly that -- getting a better estimate of actual flu cases than the data set it was tuned against. Statistical analysis simply doesn't work that way.

The notion that the algorithm could not be improved seems wrongheaded. It *is* predictive. The correlation may not be high, but it does appear to be statistically significant. So one would think that a combination of norming and the use of a genetic algorithm to better explore the "variable space" might be very fruitful. Big data analytics is really still in its infancy; why would one expect it to achieve predictive ability fully equivalent to vastly more mature statistical and epidemiological methods?
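For what a genetic search over that "variable space" might look like, here is a toy sketch in which each individual is a bit-mask over candidate search terms and fitness is the (negative) out-of-sample error of a least-squares fit on the selected terms. Population size, mutation rate, and the fitness function are all illustrative choices, not anything from the paper or from Google.

```python
# Toy genetic algorithm for selecting a subset of candidate search terms.
import numpy as np

rng = np.random.default_rng(1)

def fitness(mask, X_train, y_train, X_test, y_test):
    """Negative test MSE of an OLS fit on the columns selected by mask."""
    if not mask.any():
        return -np.inf
    A = np.column_stack([X_train[:, mask], np.ones(len(y_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([X_test[:, mask], np.ones(len(y_test))])
    return -np.mean((B @ coef - y_test) ** 2)

def evolve(X_train, y_train, X_test, y_test, pop=40, gens=50, p_mut=0.05):
    n_terms = X_train.shape[1]
    population = rng.integers(0, 2, size=(pop, n_terms)).astype(bool)
    for _ in range(gens):
        scores = np.array([fitness(m, X_train, y_train, X_test, y_test)
                           for m in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[:pop // 2]]          # keep the best half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_terms)              # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_terms) < p_mut          # random bit mutation
            children.append(child ^ flip)
        population = np.vstack([parents, children])
    best_idx = np.argmax([fitness(m, X_train, y_train, X_test, y_test)
                          for m in population])
    return population[best_idx]
```

Whether a search like this would actually beat the simple lagged-CDC baseline is exactly the kind of question that can only be answered against the official data.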

I agree that many evangelists have oversold the current value of big data; however that appears to be much more a problem with the evangelists themselves (who are often hawking consulting services) than with the promise of big data itself.

Hospital records are private information, and the agencies that deal with this have to collect it carefully anonymized

And then we laughed, and we laughed....

As one who has written scripts for scraping hospital records and shipping them to state syndromic surveillance, I can tell you they are FAR from "carefully anonymized." They are, in fact, easy to backfill with names and addresses by combining these records with a simple public search--as I showed more than one person time and time again. What did I get for my trouble of pointing this out? I was threatened with termination.