August 26, 2006

Blog Update (June 29, 2007): Note that despite any rumors to the contrary, the law discussed in this entry takes effect on July 1, 2008, NOT 2007. It will then, as in other states with similar or even stricter laws, be increasingly ignored as time goes by, while law enforcement rightly concentrates instead on serious issues.

Greetings. As you know, I frequently speak out against what I view as silly laws that fly in the face of logic, science, or just plainly observable facts.

In yet another proof that reality and politics often don't mix, lawmakers here in California are poised (after many years of refusing to go along with the bill's main sponsor) to approve a ban on handheld cell phones when driving. This may happen as soon as next week. You can count on Arnold, desperate for popular actions he can take so close to election day, to sign the bill.

All of us have been annoyed by the gabbing cell phone user who seems to be driving oblivious to everything around them. So without a doubt this law will have wide appeal. And if experience in other states holds, the law will have little or no long-term positive safety effects, and handheld cell phone use will quickly rise back to pre-law levels after a brief initial reduction.

The reasons are obvious. Study after study shows that distracted driving of any kind is a key factor in accidents. While someone holding a cell phone clamped to their ear is easy to spot, we're less aware of the radio manipulators, people screaming at their children in the back seat, makeup applicators, food eaters, and myriad other distracted drivers. In fact, studies have shown that the most common distractions leading to accidents when driving are other people inside the vehicle or things seen outside the vehicle.

Even worse, research shows quite clearly that talking on hands-free cell phones (still permitted under the bill) is just as distracting as using a handheld device. It's the remote conversation itself that is the real distraction, not the act of holding the cell phone -- plus there are all the situations where people fumble around to answer or dial a call even on a hands-free cell phone.

When proponents of this legislation are presented with these inconvenient facts, they tend to reply with, "Oh well, at least we're doing something..."

"Something" isn't good enough when it's based on bad science. If you really want to remove cell phones as a distraction, you need to ban them totally when driving -- handheld or hands-free, as has been done in some other countries. I'm not advocating this, nor do I think that politicians here have the guts for such actions anyway. In fact, banning children from cars might be far more effective in terms of reducing accidents, however unlikely the prospect.

To a certain extent this law will be a paper tiger. Major California cities don't have enough police to deal with serious crime, much less pulling over people for illegal cell phone use. And the bill's penalties -- $20 for first offense, $50 for subsequent -- will hardly be seen as an onerous burden by most drivers in an era of $3+ gasoline.

But this law itself is still primarily pandering to voters in a manner that flies in the face of science. Perhaps laws officially recognizing astrology will be next here in the Golden State.

August 22, 2006

Greetings. A very recent New York Times story neatly encapsulates the overall state of search engine query data retention issues.

The observant reader will note that despite the rising tide of concerns regarding search query privacy, the industry as a whole is still pretty much in a state of denial, made all the more confusing by various signals from the U.S. Department of Justice.

This is turning into such a mess that it's becoming difficult to even keep the various participants and their positions completely clear. There is every reason to believe that without heroic action by the players involved, we may be heading toward a privacy, legislative, and judicial nightmare. But maybe there's a way out.

Let's review:

AOL's release of search query data made obvious to everyone what many of us knew all along -- that such data contains all manner of personal information, even when the identity of the party making the query is not immediately known directly from usage logs. In the AOL case, the individual query entries were linked by "anonymized" user IDs, but even without such linkages the query items alone can be highly privacy-invasive. The AOL release triggered (as did DOJ vs. Google) broad calls for mandated search query data destruction policies.

The personal nature of the AOL query data serves nicely to demolish the DOJ's arguments (again, as in DOJ vs. Google) that such data is not privacy-invasive so long as the query source is unidentified. The expressed DOJ reasoning in this regard is obviously faulty.

Search engine companies have been reluctant to voluntarily dispose of query data on a regular basis. This data has considerable R&D, marketing, and other value. Since the incremental cost of keeping all queries archived forever is so low, there is little incentive within the normal business structure to dispose of this resource, absent overriding considerations.

Even while laudably expressing concerns about the potential for third-party misuse of query data, search engine firms (e.g. Google) have proclaimed their intention to keep collecting and saving this data indefinitely. If AOL actually sets in place an aggressive data destruction schedule, it will be something of a watershed event that may (or may not) have broad impacts across the search engine industry. Fears of being placed at a competitive disadvantage will tend to make unilateral moves toward query data destruction difficult to propose or implement.

Meanwhile, DOJ is moving in exactly the opposite direction, apparently preparing to propose long-term (perhaps measured in years) mandated data retention schedules, requiring the saving of the very data for which destruction demands are being made in other quarters. DOJ is using child abuse (and as of late anti-terrorism efforts) as their hooks to justify such legislation (please see this entry for more).

This situation has all the elements of a painful and wasteful deadlock, potentially triggering years of litigation while the overall search engine issues continue to fester and become even bigger privacy, business, and political problems.

If we wish to avoid this scenario -- or at least have a good shot of avoiding it -- we need to act now, and we need to do so cooperatively. There are policy and technological approaches to the search query dilemma that can be applied in ways that will serve the interests of all stakeholders. Cooperation and compromise mean that nobody is likely to get everything that they'd ideally want, but to paraphrase the great philosopher Mick Jagger, perhaps we can all get much of what we need.

Therefore, I propose the formation of a high-level Internet working group/consortium dedicated specifically to the cooperative discussion of these issues and the formulation of possible policy and technology constructs that can be applied toward their amelioration. Such a working group would be as open as possible, though proprietary concerns would likely necessitate some closed aspects if progress is to be accelerated as much as possible.

Participation by all stakeholders would be invited. Representatives of the major search engine firms and concerned government agencies, outside technologists and other persons involved in privacy and search issues, and other entities as appropriate, would all play important roles.

Of course, it's easy -- especially for large corporate enterprises -- to simply ignore such efforts and just plow ahead independently. Obviously, without the participation of the key players, the effort that I'm proposing would be useless, and I will not continue to promote it if that situation ensues.

However, I suggest that it will be in the long-term best interests, both financially and in terms of corporate and organizational responsibility, for major stakeholders to actively join such a project, since the alternative seems ever more likely to be somewhere between highly disruptive and extremely draconian.

August 19, 2006

Greetings. I've noted an extremely disturbing tactic being used deliberately by some telco/cable company supporters in the ongoing network neutrality controversy. Republican Senator John Sununu is claiming that the forces promoting network neutrality are engaging mainly in a partisan debate: "What the liberal left have hung their hat on," he says. In a meeting with regulators earlier this month, AT&T CEO Ed Whitacre said of network neutrality, "It's a made-up issue."

Neither of these remarks advances the debate in a meaningful way. Sununu is engaging in the time-honored technique of tarring his opponents with a broad political brush, to try to avoid public consideration of the real issues. Whitacre is using a similar tactic, by claiming that the concerns of network neutrality advocates aren't even real and thus belittling their efforts.

My strong position in favor of network neutrality is well known. But I hope that most persons on the other side of this debate will agree that statements such as those by Sununu and Whitacre are demeaning of the process and decidedly unhelpful in moving us toward positive outcomes regarding these very real and very important technical and policy issues.

August 10, 2006

Web site privacy issues in general, and search engine privacy concerns in particular, are turning into a three-ring circus of ironies.

I discuss these issues until I'm figuratively blue in the face and yet it's deja vu over and over again.

The article referenced below in fact failed to mention the key aspect of the search engine data situation that makes this all so bizarre. We have Rep. Markey, et al. pushing data destruction laws in the wake of DOJ's push (in support of their Child Online Protection Act case) to get Google's query data -- which Google wisely resisted, though ultimately they had to turn some of that data over to DOJ. I do agree with some observers who feel that Markey's proposal is so encompassing that it remains unlikely to ever become law -- I'd much prefer to see more highly targeted and focused legislation.

But meanwhile, as some of us had been predicting for ages, DOJ/Gonzales are out there pushing for broad Web site data retention laws -- ostensibly (do we see a pattern emerging?) using child abuse investigations as the hook.

Gang, we can't have it both ways in any kind of simplistic scenario. The simple choices are (1) Burn the data to prevent abuse -- and also prevent any other non-abusive uses of that data, or (2) Retain the data, along with major internal and external abuse potentials.

The simplistic scenarios are each highly problematic. We need to advance these issues in more sophisticated directions.

The only research and policy paths I see that could possibly lead toward better outcomes in this area are being largely ignored by the major players, so we have this repeating cycle of events and reactions banging back and forth.

A few months ago, in: "An Open Letter to Google: Concepts for a Google Privacy Initiative", I set forth a proposal urging Google, as the global search leader, to apply its formidable resources toward advancing these issues -- both for Google's own benefit and ultimately for the benefit of the entire global community. In light of the whole series of recent events relating to the Web site data retention/destruction sphere, I assert that such efforts are needed now, on a priority basis.

As I've noted previously, we must demand that our data be protected. Accomplishing this properly requires serious thinking, hard work, and in the real world more than a little compromise. We need to develop effective and reasonable technology and policy paths toward management of the vast amounts of personally-related data that Web sites are collecting. AOL's search query data screw-up is bad enough, but it's only a drop in the bucket compared with the sorts of abuses and problems that could take place if we don't move forward appropriately. We can be enriched by data, or we can be enslaved by it. The choice remains ours.

August 07, 2006

Greetings. I've written and spoken many times about the sensitivity of search engine query data. We all know about Google's stance in DOJ vs. Google early this year, where Google wisely attempted (for several reasons) to prevent release of such data to a government fishing expedition related to "child protection" legislation. We also know that Gonzales, et al. are merrily pushing mandated data retention laws -- again mainly in the name of child protection -- that would leave Internet users vulnerable to all manner of unreasonable surveillance of their Internet activities. All of this is already enough to be sounding alarm bells regarding the lack of reasonable legislated protections for such data.

The AOL action in releasing the search records of a reported 500K AOL users -- assuming it took place as outlined below -- is probably the most egregious violation of users' search privacy in the history of the Internet, despite the half-hearted attempt at crude anonymization. The unbelievable lack of responsibility or good judgment shown by AOL in this case should be enough to cause any remaining AOL subscribers (or users of their free services) to strongly consider ceasing any further contact with AOL.

Furthermore, we need to accept the fact that search query data is incredibly sensitive and often contains extremely personal information that does not lose its potential for abuse via simplistic forms of anonymization. Nor can we necessarily depend indefinitely on some individual search engines' (e.g. Google) honest and praiseworthy desires to protect such data in the face of intense competition and intrusive government actions.

Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears.

We must demand that this data be protected.

--Lauren--

P.S.

Subsequent information has revealed that more than 600K users' search data was included in the AOL release.

I have altered the URL reference (3) from the forwarded message below. Anyone who tried to forward that original message to an AOL user may have been in for a surprise.

At least in my experiments just now, AOL rejects that message since URL reference (3) contained a numeric IP address rather than a domain address.

Ironic, isn't it? AOL "protects" users by blocking messages with IP addresses in URLs (can such addresses be suspect? Yeah, but they can easily be legit, too) -- yet they happily release the most private aspects of users' search activities.
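For readers wondering what "a numeric IP address rather than a domain address" means in practice, here is a minimal Python sketch of that distinction. It is not AOL's filter, whose actual rules I don't know; the function name and the sample URLs (which use reserved documentation addresses) are purely illustrative assumptions.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_literal(url: str) -> bool:
    """Return True if the URL's host is a numeric IP address, not a domain name."""
    # .hostname gives the host without any port or IPv6 brackets
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
        return True
    except ValueError:
        return False


# Hypothetical example URLs, not the ones from the forwarded message
print(host_is_ip_literal("http://192.0.2.10/some/page.html"))    # True
print(host_is_ip_literal("http://example.com/some/page.html"))   # False
```

A check like this says nothing about whether the link is actually suspect, which is exactly the point: plenty of legitimate URLs would trip it.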

AOL just released the logs of all searches done by 500,000 of their users over the course of three months earlier this year. That means that if you happened to be randomly chosen as one of these users, everything you searched for from March to May (2006) is now public information on the internet.

This was not a leak - it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: "Please reference the following publication when using this collection..." is the message before the download.

This is a blatant violation of users' privacy. The data is "anonymized", which to AOL means that each screenname was replaced with a unique number. "It is still a research question how much information needs to be anonymized to protect users," [9]says Abdur from AOL. Here are some examples of what you can find in the data:

User 491577 searches for "florida cna pca lakeland tampa", "emt school training florida", "low calorie meals", "infant seat", and "fisher price roller blades". Among user 39509's hundreds of searches are: "ford 352", "oklahoma disciplined pastors", "oklahoma disciplined doctors", "home loans", and some other personally identifying and illegal stuff I'm going to leave out of here. Among user 545605's searches are "shore hills park mays landing nj", "frank william sindoni md", "ceramic ashtrays", "transfer money to china", and "capital gains on sale of house". Compared to some of the data, these examples are on the safe side. I'm leaving out the worst of it - searches for names of specific people, addresses, telephone numbers, illegal drugs, and more. There is no question that law enforcement, employers, or friends could figure out who some of these people are.
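To make the weakness of this kind of "anonymization" concrete, here is a small Python sketch of how queries that share a number can be regrouped into a per-user profile. The record layout (a user number paired with a query string) is an assumption made for illustration and is not necessarily the format of the released file; the queries are the ones quoted above.

```python
from collections import defaultdict

# (pseudonymous user number, query) pairs in arbitrary order.
# The tuple layout is hypothetical; the queries are quoted from the text above.
records = [
    (491577, "florida cna pca lakeland tampa"),
    (545605, "shore hills park mays landing nj"),
    (491577, "emt school training florida"),
    (545605, "capital gains on sale of house"),
    (491577, "infant seat"),
]


def profiles_by_user(rows):
    """Group queries under the number that replaced each screen name."""
    profiles = defaultdict(list)
    for user_id, query in rows:
        profiles[user_id].append(query)
    return dict(profiles)


profiles = profiles_by_user(records)
for user_id, queries in profiles.items():
    print(user_id, queries)
```

Because every query from one person carries the same number, a few lines of grouping are enough to rebuild a combined picture of location, occupation, and family circumstances, even though no screen name appears anywhere.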

I hope others can find more examples in the data, which is up for [10]download over here. The data set is very large when uncompressed which makes it pretty hard to work with, but someone should set up a web interface so people can browse it (or even 10% of it) without having to download the 400mb file. If you make a mirror or better interface to the data, or find other examples, let me know and I'll put a link up here.

This is the same data that the DOJ wanted from Google back in March. [11]This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.

It's unclear if this is the type of data AOL released to the government [12]back when Google refused to comply. If nothing else, this should be a good example of why search history needs strong privacy protection.

Thanks to Greg Linden for pointing this out [13]here.

Update 2: The md5 of the file AOL posted (and now removed) is 31cd27ce12c3a3f2df62a38050ce4c0a. I'm posting it so you can make sure you have a valid copy, but so far none of the copies I've seen are fake.
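Checking a copy against that checksum could look roughly like the Python sketch below. The function names are illustrative assumptions, and the file path is whatever your local copy is called. Note that MD5 is adequate for spotting a corrupted or truncated download but is not a strong defense against deliberate tampering.

```python
import hashlib

# Checksum quoted in the update above
EXPECTED_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a very large file never has to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_expected() with the path of a downloaded copy returns True only if that copy hashes to the value given above.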

Update: Seems like AOL took it down. There are some mirrors of the data in the comments of the digg story, linked below. I estimate about 1000 people have the file, so it's definitely going to be circulated around. The [2]main AOL research page is still up, with some other data collections. The [3]google cache of the download page is still up, but you can't get the data. Here's discussion at other sites: