Friday, January 27, 2006

The Google Search Subpoena in Perspective

"During Supreme Court oral argument about the COPA law, the Solicitor General made the following claim:

MR. OLSON: .... But the problem with respect to the children is the material that is so widely available on the Internet that doesn’t reach the definition of – that is not as bad as obscenity. It is a wide amount of information. The legislative history described 28,000 pornographic sites in a – this is also outside the record, but if an individual goes to their Internet and – and uses an Internet search engine and – and types in the word, free porn, I did this this weekend, the – your – your computer will say that there are 6,230,000 sites available. Now that’s available now.

This was a ludicrous abuse of statistics. He had searched Google for all pages containing the word “free” and the word “porn” somewhere on the page (not even strictly the phrase “free porn”), and then offered the meaningless number returned as if it were somehow relevant. Indeed, this very article will now increase that number, merely by quoting him. It was (or at least, should have been) an embarrassing display of ignorance.
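
To make the distinction concrete, here is a minimal sketch in Python. The five "pages" are invented for illustration, and the function names `count_all_words` and `count_phrase` are hypothetical. Real search engines index and count hits very differently, so this shows only the logical difference between matching both words anywhere on a page and matching the exact phrase.

```python
import re

# Hypothetical toy "web": these pages are invented for illustration only.
PAGES = [
    "Download free porn here",
    "Court transcript: counsel typed the words free porn into a search engine",
    "Free speech advocates say filters block far more than porn",
    "Porn filter software, free trial available",
    "Free recipes for the weekend",
]


def words(text):
    """Lowercase the text and split it into a list of alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def count_all_words(pages, terms):
    """Count pages containing every term somewhere, in any order."""
    return sum(1 for p in pages if set(terms) <= set(words(p)))


def count_phrase(pages, terms):
    """Count pages where the terms appear adjacent and in the given order."""
    n = len(terms)

    def has_phrase(page):
        w = words(page)
        return any(w[i:i + n] == terms for i in range(len(w) - n + 1))

    return sum(1 for p in pages if has_phrase(p))


terms = ["free", "porn"]
print("pages with both words:", count_all_words(PAGES, terms))
print("pages with the phrase:", count_phrase(PAGES, terms))
```

On this toy data the all-words count is double the phrase count, and one of the two phrase matches is a transcript that merely quotes the query, much as this article does. Neither number says anything about how many sites actually offer the material.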

With such poor quality of evidence being proffered to justices of the Supreme Court, it’s easy to see why the government wanted to be better prepared in future trials. Indeed, the eventual decision about the COPA law required more investigation, particularly of censorware:

Second, there are substantial factual disputes remaining in the case. As mentioned above, there is a serious gap in the evidence as to the effectiveness of filtering software. See supra, at 9. For us to assume, without proof, that filters are less effective than COPA would usurp the District Court’s factfinding role. By allowing the preliminary injunction to stand and remanding for trial, we require the Government to shoulder its full constitutional burden of proof respecting the less restrictive alternative argument, rather than excuse it from doing so.

So for this evidence, a statistics professor working with the Department of Justice decided to try to use search engine queries as a basis for various estimates:

Reviewing URLs available through search engines will help us understand what sites users can find using search engines, to estimate the prevalence of harmful-to-minors (HTM) materials among such sites, to characterize those sites, and to measure the effectiveness of content filters in screening HTM materials from those sites.

Reviewing user queries to search engines will help us understand the search behavior of current web users, to estimate how often web users encounter HTM materials through searches, and to measure the effectiveness of filters in screening those materials.

Many problems have been pointed out with these ideas, to put it gently. It’s important, however, to keep in mind that the previous state of the art in research evidence here was typing in the words “free” and “porn”. It’s hard to imagine a better case for never attributing to malice what can be explained by stupidity.

But when the keywords “Google”, “government”, “pornography”, and “privacy” were all mixed together, the result was an explosive reaction due to the volatility of the components. Many news reports gave the false impression that the government intended to go on a fishing expedition, sifting through personal search records in order to track down seekers of child pornography. The recent NSA wiretapping scandal provided yet another framework for suspicion.

Pragmatically, if the government were going to data-mine search engines as a source for investigations of terrorism, or even child pornographers, those actions would be surrounded by secrecy, and the public would only find out about them through leaks, not open court action. For example, an ACLU lawsuit over the PATRIOT Act was subject to extensive gag orders. And if there were a fishing expedition, the net had already been spread far and wide, since other search engines had complied with the government data requests. So from a very narrow perspective, any privacy damage had already mostly been done.

However, the relatively minor goal of statistical studies has ended up raising public awareness of the overall issues with personal data stored by search engines, and of the fear that it could be misused for criminal investigations. Search engines are almost an outsourced surveillance system, and a completely unaccountable one, since they’re private companies. Perhaps the overall lesson of the story is that information collected for business purposes could easily be abused. We should start thinking about mandating privacy protection before the abuses imagined in this case become reality."