Researchers Yearn to Use AOL Logs, but They Hesitate

When AOL researchers released three months’ worth of users’ query logs to a publicly accessible Web site late last month, Jon Kleinberg, a professor of computer science at Cornell, downloaded the data right away. But when a firestorm over privacy breaches erupted, he decided against using it.

“Now it’s sitting there, in cold storage,” said Professor Kleinberg, who works on algorithms for understanding the structure of the Web and searching it. “The number of things it reveals about individual people seems much too much. In general, you don’t want to do research on tainted data.”

The data was released for academic researchers like Professor Kleinberg to work with. But many were torn, loath to use it as they weighed a chronic thirst for useful data against concerns over individual privacy.

It is one of the frustrations of being an academic researcher in a world that has grown highly commercial. Data is everywhere, but there is precious little of it for university researchers to work with. Raw data about people’s online behavior — the grist for many an academic researcher’s mill — remains locked up inside large companies, accessible only to a subset of corporate researchers.

Some see the data as too valuable to withhold altogether. “One of the biggest problems is trying to get real data,” said Christopher Manning, an assistant professor of computer science and linguistics at Stanford University.

Although the 650,000 AOL users were not personally identified in the data, the logs contained enough information to discern an individual’s identity in some cases.

AOL quickly withdrew the data from its research Web site, but not before it had been downloaded, reposted and made searchable at a number of Web sites. And on Monday, the company dismissed Abdur Chowdhury, the researcher who posted the data, along with another employee. Maureen Govern, AOL’s chief technology officer, resigned.

Academia has a longstanding disadvantage when it comes to raw data. While fresh data sets are routinely made available to researchers at large companies like Google, academia has largely made do with the same two sets of search data — one from Excite and one from Alta Vista — for nearly 10 years.

Meanwhile, a virtual eternity has elapsed and those two data sets have long outlived their usefulness. “The way people use search engines now is totally different,” Professor Kleinberg said. “Partly because what you expected to get out of a search engine back then was much less, so people didn’t try anything too fancy.”

While acknowledging “genuine privacy concerns,” Professor Manning said, “I think it’s fair to say that given researchers’ craving for data, having the AOL data available is a great boon for research.”

Professor Manning, who has downloaded a copy of the AOL data set but has no immediate plans to work with it, added that the trove of raw research material represented by the AOL data “obviates the need for anyone to worry about getting data for a while.”

William W. Cohen, an associate research professor in the machine learning department at Carnegie Mellon University, said the AOL query logs could be invaluable for researchers working in the field of personalization.

“Someone’s past search history can tell you a lot about what they’re interested in,” he said. Professor Cohen, who takes annual vacations near Charleston, S.C., used himself as an example.

“By knowing what someone searched for in the past, you can do a lot better at answering a query,” he said. “If you look at my recent searches, they might have something to do with vacation homes, Folly Beach and car rentals. So if I search for seafood restaurants, it’s more likely I’ll be looking for one in the Charleston area, and if I say ‘Charleston,’ it’s much more likely to be South Carolina than West Virginia.”

At the same time, Professor Cohen shares Professor Kleinberg’s unease about working with the AOL query logs.

“I would feel personally uncomfortable looking too closely at searches showing things like marriages breaking up,” he said. “I don’t want to do research in order to see if my algorithms are working correctly, while delving into the details of people’s lives.”

Professor Cohen said that although he might eventually do research on the data set, “the privacy issues are something I’d have to think through very carefully.”

Companies occasionally mete data out to academic researchers. Microsoft has done this, but in a controlled fashion. Yahoo shares some statistical data with researchers who are approved case by case through an internal vetting program, according to Joanna Stevens, a company spokeswoman, but query data, she said, has never been distributed.

Asked about its policy, Google issued a statement that said, “Our current policy is not to release queries or personal data to researchers outside of the company.”

Oren Etzioni, a professor of computer science at the University of Washington, said that shortly after the news of the privacy violations surfaced, he had lunch with Dr. Chowdhury and two of Dr. Chowdhury’s colleagues.

Professor Etzioni said Dr. Chowdhury was horrified by what had happened. “He didn’t anticipate that this kind of data could be used to track down individuals.” Dr. Chowdhury declined to comment, at the advice of his lawyer.

The release of the AOL query logs coincided with a global conference in Seattle of information retrieval researchers, and it prompted heated discussions among those in attendance.

Among them was Jamie Callan, an associate professor in the school of computer science at Carnegie Mellon and chairman of the Association for Computing Machinery’s special-interest group on information retrieval, which organized the conference.

Professor Callan said that although no one disagreed on the importance of protecting privacy, “there’s also a strong belief that it is very important for the scientific community to have access to data of this kind in some anonymized form.”

The last similar case involved a set of hundreds of thousands of internal e-mail messages from Enron, posted in 2003 on the Federal Energy Regulatory Commission’s Web site, in connection with the agency’s investigation into the company.

Although some of the e-mail was relevant to the investigation, most of it was not. So hungry were researchers for a coherent body of e-mail messages to work with that they were able to set aside their concerns that the privacy of many people who had nothing to do with the Enron scandal was severely compromised.

“Researchers glommed onto it,” Professor Manning said. “It was enormously good for academic research for all kinds of things you might want to do, like social network analysis.”

Several research papers have emerged in the years since the Enron e-mail corpus was released, and it remains the only large body of actual e-mail in the public domain.

“It seems this AOL data is creating more of a stir,” Professor Manning said. Where the release of the Enron e-mail was justified as shedding light on corporate wrongdoers, “the AOL data is more like a real violation.”

Professor Etzioni and others say one partial solution to heightened privacy concerns could lie in more stringent “scrubbing” of data in a way that does not diminish its quality as a research tool. This could entail, for instance, replacing numbers that carry identifying information, like Social Security numbers and ZIP codes, with zeros, or replacing a place name like “New York” with a code like “X17.” To a researcher, Professor Etzioni said, “it doesn’t matter so much that it’s New York as X17.”
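To make the scrubbing idea concrete, here is a minimal sketch in Python. It is an illustration only, not a method attributed to Professor Etzioni or AOL: the `Scrubber` class, the two regular expressions and the token numbering are all hypothetical, and the patterns cover only the simplest formats of Social Security numbers and ZIP codes. It zeroes out those digits and swaps each listed place name for a stable code, so that the same place always maps to the same code.

```python
import re

# Hypothetical patterns; a real scrubber would need far broader coverage.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def zero_digits(match):
    """Replace every digit in the matched text with 0, keeping punctuation."""
    return re.sub(r"\d", "0", match.group(0))


class Scrubber:
    """Zero out identifying numbers and replace place names with opaque codes."""

    def __init__(self, place_names):
        # Longest names first so "New York City" would win over "New York".
        places = sorted(place_names, key=len, reverse=True)
        self._tokens = {}
        self._place_re = None
        if places:
            pattern = "|".join(re.escape(p) for p in places)
            self._place_re = re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)

    def _token_for(self, match):
        # Same place (case-insensitive) always gets the same code.
        key = match.group(0).lower()
        if key not in self._tokens:
            self._tokens[key] = f"X{len(self._tokens) + 1}"
        return self._tokens[key]

    def scrub(self, query):
        query = SSN_RE.sub(zero_digits, query)
        query = ZIP_RE.sub(zero_digits, query)
        if self._place_re is not None:
            query = self._place_re.sub(self._token_for, query)
        return query


scrubber = Scrubber(["New York", "Charleston"])
print(scrubber.scrub("hotels in new york 10001"))
```

Because the codes are consistent, a researcher can still see that two queries concern the same place without learning which place it is. Scrubbing of this kind is only partial protection, since a combination of otherwise innocuous queries may still point to one person.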

Professor Kleinberg said he hoped that over time, the AOL incident would lead to “a richer, more informed discussion about what it means to create data sets that are clean and anonymized.”

Still, a freeze on all data distribution is likely to be in effect for the foreseeable future.

Professor Etzioni said that over lunch with the AOL researchers, he had mentioned that for his own research, he was interested in a data set containing queries starting with “Wh,” to signify that a question was being asked. Such data need not be tied to an individual to be useful as a research tool.

“We build technology that answers questions,” he said. “So we want to test it on actual questions people are asking.”
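A question set of the kind Professor Etzioni described could be sketched as follows. This is a hedged illustration, not AOL's procedure: the `question_queries` function and the `(user_id, query)` row layout are assumptions made for the example. It keeps only queries that begin with “wh” and discards the user identifier entirely.

```python
def question_queries(rows):
    """Yield the text of queries that start with 'wh', without user IDs.

    `rows` is assumed to be an iterable of (user_id, query) pairs.
    """
    for _user_id, query in rows:
        text = query.strip()
        if text.lower().startswith("wh"):
            yield text


sample = [
    ("u1", "what is the capital of ohio"),
    ("u2", "cheap flights"),
    ("u1", "  Where to eat in Charleston "),
]
for q in question_queries(sample):
    print(q)
```

The prefix test is a crude heuristic: it would also keep non-questions such as “whale watching,” and dropping the identifier does not remove names or addresses typed into the query itself.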

The AOL researchers told Professor Etzioni they would get approval from the company and send him a compact disc containing the question set.

But Professor Etzioni is not holding out much hope of receiving the data. “I don’t think that CD is in the mail,” he said, “and that’s too bad.”