The peril of anonymized data

The blogosphere is justifiably abuzz over AOL’s release of “anonymized” search query histories for over 500,000 AOL users, an attempt to be helpful to the research community. After the furor, AOL pulled the data and issued a decently strong apology, but the damage is done.

Many people have pointed out obvious risks, such as the fact that searches often contain text that reveal who you are. Who hasn’t searched on their own name? (Alas, I’m now the #7 “brad” on Google, a shadow of my long stint at #1.)

But some others browsing the data have discovered something far darker. There are searches in there for things like “how to kill your wife” and child porn. Once that’s discovered, isn’t that now sufficient grounds for a court order to reveal who that person was? It seems there is probable cause to believe user 17556639 is thinking about killing his wife. And knowing this very specific bit of information, who would impede efforts to investigate and protect her?

But we can’t have this happening in general. How long before sites are forced to look for evidence of crimes in “anonymized” data, and warrants then nymize it? (Did I just invent a word?)

After all, I recall that a year ago I wanted to see whether Google would sell AdWords on various nasty searches, and if so, which ads would show. So I searched for “kiddie porn” and other nasty things. (To save you the stigma: Google clearly has a system designed to spot such searches and not show ads, since people who bought the word “kiddie” may not want to advertise on those results.)

So had my Google results been in such a leak, I might have faced one of those very scary kiddie porn raids, which in the end would find nothing after tearing apart my life and confiscating my computers. (I might hope they would have a sanity check on doing this to somebody from the EFF, but who knows. And you don’t have that protection even if somebody would accord it to me.)

I expect we’ll be seeing the repercussions from this data spill for some time to come. In the end, if we want privacy from being data mined, deletion of such records is the only way to go.

If the Kiddieporn Police were stupid enough to raid you based on your search, you can be sure they wouldn't find "nothing." An "unproductive" raid is very embarrassing to those crusading officers and their politician-prosecutor bosses, so they'd do whatever it takes to avoid coming up empty-handed. If they scoured your hard disk and found no child pornography, they'd probably just plant some of what they had lying around the lab and commit serial perjury (they'd get away with that because it's your word against theirs -- a statement by an esteemed vice officer is always believable while that of a presumed pedophile never is). Failing that, they can easily trump up some other "genuine" violations sufficient to justify the raid and persuade you to accept their plea bargain, so they can get credit for both the raid and the conviction on this month's status report. With the proliferation of laws, a prosecutor who wants to get you can always find some convincing evidence of a violation, even if it's a law you never knew existed (or something Alberto Gonzales dreamed up at Dick Cheney's request). That's how Justice works in Bushi Amerika.

In all the protests against AOL's sharing of the query-log data, there
has been little discussion of the importance of such data to research
on information retrieval. In addition to the real privacy concerns, a
key point is that if usable data is not made available to the wider
research community, only the big
search companies will be able to analyze that data. We academic
researchers are increasingly dependent upon industry for this sort of
data to do research; the sort of small-scale data that can be gathered
in a university-based setting is simply insufficient for obtaining
reliable experimental results.

Should companies be prevented from sharing data with the research
community (either by law or public outcry), research progress will be
greatly reduced, as it will be impossible to compare different studies
with one another, since each study's data will be proprietary, and
thus no one will be able to trust any research result from another
lab. All non-industrial research in this area will more-or-less dry
up, and search technology will tend more and more to be developed in
"closed-shop" efforts within the large firms; innovative startups and
open-source hacking will not exist, since the research projects that
serve as launching pads for such technological innovation will not
exist. This prospect should disturb us all, as search technology
(broadly construed) is more and more the vehicle that people use to
gain information about their society and the world.

All of this is not meant to ignore the real privacy issues that can be
involved in the preparation and release of such data. It appears to
me that there was little real privacy risk in the data released by
AOL, but it is clear that policies and practices need to be debated
and developed that accomplish two essential goals: (a) to protect the
privacy of individuals in any sharing of research data, and (b) to
ensure that as much useful data as possible can be shared by
companies with the greater research community.

Not just to researchers. Release for researchers can be done, but the researchers should sign confidentiality contracts, keep the data on secured machines (not connected to the internet), and destroy it after use (they can always get it again in a pinch).

Perhaps one of the early research projects can be toward better anonymization tools, which preserve privacy but don’t destroy the information content. Clearly, whatever was used here didn’t get the job done.
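To see why this is hard, consider what naive pseudonymization actually does. The following is a toy sketch (hypothetical data loosely echoing the press coverage of this release, not AOL's actual pipeline): replacing each user ID with a hash or random number keeps all of one user's queries linked under a single pseudonym, so a single self-identifying query (a vanity search, a hometown) unmasks every other query that user made.

```python
import hashlib

# Hypothetical mini query log of (user_id, query) pairs.
log = [
    ("alice@example.com", "best pizza in lilburn ga"),
    ("alice@example.com", "thelma arnold"),   # a vanity search on her own name
    ("bob@example.com",   "numb fingers"),
]

def pseudonymize(records, salt=b"secret"):
    """Replace each user ID with a salted hash, leaving the query text
    untouched -- the same style of 'anonymization' as substituting a
    random number per user."""
    return [
        (hashlib.sha256(salt + uid.encode()).hexdigest()[:8], query)
        for uid, query in records
    ]

anon = pseudonymize(log)

# The weakness: queries from one user still share a pseudonym, so we can
# group them -- and one identifying query exposes the whole group.
by_user = {}
for pid, query in anon:
    by_user.setdefault(pid, []).append(query)
```

Grouping `by_user` here links the pizza search to the vanity search under one pseudonym, which is essentially how reporters re-identified users in the released data. Real anonymization research has to break this linkage (or generalize the queries) without destroying the statistical signal researchers need.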

No real privacy risk??? The New York Times has already located
and interviewed one 62-year-old woman who verified her search data.
And no doubt they picked her because she was less self-incriminating
than others.