Big Google can be Benign

An article in today’s New York Times reports on Google Flu Trends, which aspires to detect regional outbreaks of the flu before they are reported by the Centers for Disease Control and Prevention. As reported in the article:

Google Flu Trends is based on the simple idea that people who are feeling sick will probably turn to the Web for information, typing things like “flu symptoms” or “muscle aches” into Google. The service tracks such queries and charts their ebb and flow, broken down by regions and states.

It’s a clever idea, though obviously it raises privacy concerns. Google mitigates those concerns by “relying only on aggregated data that cannot be used to identify individual searchers.”
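To make the aggregation idea concrete, here is a minimal sketch of how flu-related queries could be counted per region and week while discarding who issued them. This is not Google's method, which the article does not describe in detail. The record layout, the `FLU_TERMS` list, the function name, and the `min_count` suppression threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword list; the article mentions only these two example queries.
FLU_TERMS = ("flu symptoms", "muscle aches")


def aggregate_flu_queries(records, min_count=2):
    """Count flu-related queries per (region, week), dropping user identifiers.

    records: iterable of (user_id, region, week, query) tuples (assumed schema).
    min_count: cells with fewer matching queries are suppressed, an assumed
        safeguard so that very small groups are not exposed.
    """
    counts = Counter()
    for _user_id, region, week, query in records:
        text = query.lower()
        if any(term in text for term in FLU_TERMS):
            # Only the region and week are kept; the user id is never stored.
            counts[(region, week)] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Made-up sample data, for illustration only.
sample = [
    ("u1", "NY", 45, "flu symptoms"),
    ("u2", "NY", 45, "muscle aches and fever"),
    ("u3", "NY", 45, "cheap flights"),
    ("u4", "CA", 45, "Flu Symptoms in kids"),
]
trend = aggregate_flu_queries(sample)
```

With this sample, the single California query falls below the threshold and is dropped from the output, which is the kind of trade-off between detail and privacy that any such service has to make.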

It will be interesting to see popular reaction to this offering in the United States and in more privacy-conscious Western Europe. On one hand, health-related search logs are the bête noire of privacy activists–and with good reason, since people are terrified of losing their health insurance. On the other hand, Google seems to have only the best intentions here, and the service they provide may do a lot of good.

I personally hope we can see efforts like these succeed. Of course, it’s essential that Google and anyone else who pursues such efforts be transparent about what data they collect and how they protect individuals from inadvertent disclosure. Ideally, they don’t collect more data than is needed–especially when that data is dangerous in the wrong hands.

Though I have to wonder: might anyone try to game such a system? Maybe I have an over-active imagination, but systems like these seem to be ripe targets for denial-of-insight attacks. Whit, another one for your files?

Indeed, I don’t think even they are claiming that the idea is especially novel. But utility trumps originality.

The good news is that Google has this data and can analyze it with very low latency, evidently much lower than the epidemiologists at the CDC. The disconcerting news is that they also have a lot of identifying information about users (at least for folks like me who log on to use services like Gmail).

AFAIK, Google and Yahoo store a *minimal* amount of information for their Health-related offerings. This is unintuitive, as tracking Internet activity related to health can provide a sort of loosely structured medical record that could prove useful in the future (or, from the perspective of a company, ridiculously *profitable*). The reason for not storing this information is essentially a reaction to privacy concerns, which are paramount when it comes to health.

Obviously, users are not protected if they rely on search and e-mail for their information, but Google seems to be keeping things under wraps. They don’t provide anything remotely close to what AOL or Netflix have released in terms of raw search data. Everything they are releasing is post-processed.

Given how unscientific their analyses are, flu trends probably won’t be useful beyond encouraging folks to get flu shots. Additionally, I would hope Google is sophisticated enough at this point to recognize any “DoI” attacks.

I think it’s absolutely a risky element. If one could figure out how the Google system was capturing the data and how it was being jacked into the CDC analytic system, one could do a bunch of things:

1. Create the impression of illness for some purpose such as “fooling” a competitor to make supply decisions. (Seems unlikely.)

2. Establish a covering strategy to obscure what’s really happening. Again, this seems less likely — but in the unfortunate and unlikely event that Google became a major element of the early-warning system, it would be fairly easy to establish a searcher network that did obscuring searches to create the impression that an illness was widespread when it wasn’t, or the reverse.

3. Any time there’s visibility, there’s a reason to hack. If a network of interested people thought they could get publicity by pwning the CDC, they might. Sad but true.

The NY Times article speculates that “The C.D.C. reports are slower because they rely on data collected and compiled from thousands of health care providers, labs and other sources.” By that reasoning, Google, by adding yet another provider, will slow them down even more!

We’re working with a company called Health Monitoring Systems, whose Epicenter product tracks emergency room chief complaints from hundreds of hospitals in real time, with statistical time series trend detection.

The real questions are: (1) how reliable are Google’s reports and how much do people search ahead of going to the ER, and (2) [from Whit Andrews’s comment above] would the CDC and other public health providers trust them given Google’s track record of being susceptible to link spam?

How does the world’s largest advertising company track the flu? The same way it’s tracking everyone’s searches as proxies for purchases of consumer goods like cars or shampoo.

Adam: I assumed they were getting search queries from regular Google users. And, as you might recall from an uproar earlier this year, there’s some debate as to what constitutes personal information (http://googlepublicpolicy.blogspot.com/2008/02/are-ip-addresses-personal.html). I think Google’s motives are good here, but I think privacy rights activists have legitimate concerns about the potential for harm.

Whit: it is #2 that concerns me–not necessarily in this application, but more if someone got the idea of using an application like this to detect an anthrax attack.

Bob: I don’t suppose that HMS plans to offer that information to the general public for free? Google may be offering an inferior solution, but the free access plus their brand ensures far more press coverage.