Friday, August 04, 2006

A chance to play with big data

A couple fun new data sets are being made available by the search giants.

First, in a humorously titled post, "All Our N-gram are Belong to You", folks at Google Research announced that they "processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times." Very cool.
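For a sense of what "counts for all five-word sequences that appear at least 40 times" means, here is a rough sketch on a toy corpus. It assumes plain whitespace tokenization and a lowered threshold; the function name `ngram_counts` and its parameters are made up for illustration and say nothing about how Google actually built the data set.

```python
from collections import Counter

def ngram_counts(tokens, n=5, min_count=40):
    """Count every n-word sequence and keep those seen at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus: one seven-word sentence repeated three times.
tokens = "all our n-gram are belong to you".split() * 3

# Threshold lowered from 40 to 3 so the toy corpus yields something.
frequent = ngram_counts(tokens, n=5, min_count=3)
for gram, count in sorted(frequent.items()):
    print(" ".join(gram), count)
```

Sequences that straddle the repeats occur only twice here, so the threshold drops them, which is the same pruning the 40-occurrence cutoff does at much larger scale.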

The massive Googly data set is available on six DVDs -- probably about 30G of compressed data -- but not as a download.

Second, the new AOL Research site has posted a list of APIs and data collections from AOL.

Of most interest to me is the data set of "500k User Queries Sampled Over 3 Months" that apparently includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl} for each of 20M queries. Drool, drool!

You know, just the other day, I was watching a Google Tech Talk where a researcher was lamenting the difficulty of getting access to big data. It is exciting to see two of the giants, Google and AOL, making this kind of data available.

Update: Sadly, AOL has now taken the 500k data set offline. This is a loss to the academic research community which, until now, has had no access to this kind of data.

The move seems to be a response to a bunch of inflammatory blog posts ([1][2][3][4][5]) that make outlandish claims like:

AOL has released very private data about its users without their permission ... The ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

So much for the privacy of AOL's users ... This is identity theft just waiting to happen, that's what this is.

I expect it's a matter of time before a major national newspaper prints an interview with somebody identified and embarrassed in this manner.

Never mind that no one actually has come up with an example where someone could be identified. Just the theoretical possibility is enough to create a privacy firestorm in some people's minds.

If someone comes up with a clear example of a privacy violation from this AOL data, I would be convinced. Until then, this looks to me like the mob of the blogosphere getting distracted in the shadows and missing the big privacy picture.

Unfortunately, the research community now will be denied a tool that could have helped push forward the state of information retrieval. Research that could have been accelerated will now be stalled. We all will suffer from the loss.

Update: This morning, in a front page article, New York Times reporters track down a specific AOL user using the released data set and ask her about her AOL searches.

Update: It now appears the entire AOL Research site has been taken offline, including access to their publications, other data sets, and APIs. That is disappointing.

Update: The NYT reports that AOL's CTO, a senior researcher, and a senior manager have been dismissed. It appears AOL Research is being shut down.

Some may cheer AOL getting a firm spanking over this privacy issue, but I think the long-term costs are grave. I suspect this pretty much eliminates any future access for the academic research community to large scale data sets. After this, the only work on big data will be at the search giants.

Hindering academic research will slow progress on building the next generation of search. It is hard to measure the cost of difficulty finding the information you need -- the productivity loss of a few minutes a day over millions of people is difficult to measure -- but it is a cost we will all be paying.

Update: Seven weeks later, the NYT also adds perspective by writing about recent examples of severe privacy violations including "2.6 million current and former Circuit City credit card account holders", "names, Social Security numbers, and dates of birth on roughly 28 million veterans", "names, addresses and credit and debit card numbers of some 243,000 customers of Hotels.com", "names and Social Security numbers — and in some instances medical histories — of some 51,000 current and former patients of PSA HealthCare", ChoicePoint's leak of 145,000 accounts, and CardSystems' leak of 40M credit card numbers. Read more at the "Chronology of Data Breaches".

They don't sell your name--just your clicks -- but the clicks are tied to you as a specific user (User 1, User 2, etc.).

How much are your clicks worth? According to [Compete CTO] David [Cancel], about 40 cents a month per user (per customer) ... and he estimates that there are 10-12 big buyers of this data. In other words, your ISP is probably making about $5 a month ($60 a year) off your clickstreams.
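The arithmetic behind that estimate can be checked quickly. This is only a back-of-the-envelope sketch using the two figures quoted above (40 cents per user per buyer, 10 to 12 buyers); the variable names are made up, and it assumes every buyer pays the same rate.

```python
PRICE_PER_USER_PER_BUYER = 0.40  # dollars per month, as quoted
BUYERS = (10, 12)                # low and high end of "10-12 big buyers"

# Monthly and yearly revenue per user at each end of the range.
monthly = [PRICE_PER_USER_PER_BUYER * b for b in BUYERS]
yearly = [m * 12 for m in monthly]

print(f"per month: ${monthly[0]:.2f} to ${monthly[1]:.2f}")
print(f"per year:  ${yearly[0]:.2f} to ${yearly[1]:.2f}")
```

Under those assumptions the range is $4.00 to $4.80 a month, so "about $5 a month ($60 a year)" rounds up from the high end.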

Someone [in the audience] points out that this is just as bad as the AOL search thing. It's much more! David says -- his excited eyes indicating that he's a happy customer. Someone else observes that the benefits/drawbacks of this are in the eye of the beholder: for the ISPs it's awesome.

I actually wasn't trying to be inflammatory about the privacy issue in my post...here's what I said:

---I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.---

As you've pointed out, the initial blogosphere reaction is now almost entirely on the privacy issue. For research purposes I'd like to see more search data made available, but this is also making me quite happy not to be an AOL customer today.

“does anyone know stanley who might have a sister named [full name removed from here] i have been looking for my family for a very long time please lets learm about each other becfore it is too late. i will leave you my e-mail [full email removed from here]. i so want to hear from you” (You can see that even without providing a user ID like AOL did, publishing non-aggregated queries is a privacy breach.)

I've removed the name and email myself -- they appear, in full, in the sample.

Based on reading a timestamped history of search terms sorted by anonymized user, is it possible to deduce who that user is, especially if any individual's name was one of the search terms? I think the answer is obviously yes.

This doesn't even get into the odd things entered into search forms by "normal" users (like SSNs, credit cards, messages that they think they are sending in a private email). Technical people have a hard time imagining someone doing that.

I think one of the reasons this mistake happened in the first place is that the engineers failed to see the big picture. Instead, they had tunnel vision about how wonderful this would be for the research community and came up with a technical solution that solves the wrong problem (replacing usernames with numbers, whoopee).

Greg: The excerpt itself is fine, as you say it's from a public forum. The problem is that I can now tie a name to a number -- so I can filter the results and see everything she searched for. It's not *that* search term that violates her privacy, it's the fact that I can now identify *all* her searches in the data set.

If these were a random sample without a consistent identification scheme the problem would be less severe. The problem is this data is effectively not anonymous. You can identify users (as per the example I gave) and see 3 months of their search history -- that's a massive privacy violation.
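To make the point about a consistent identifier concrete, here is a minimal sketch on entirely synthetic rows. The IDs, queries, and the helper names `histories` and `users_mentioning` are all invented for illustration; only the (UserID, Query) shape comes from the description of the data set.

```python
from collections import defaultdict

# Synthetic rows in the (UserID, Query) shape described in the post.
rows = [
    (101, "cookie recipes"),
    (202, "wyoming rodeo location"),
    (101, "jane q. example"),          # made-up name standing in for a self-search
    (101, "plumber near springfield"),
    (202, "used car prices"),
]

def histories(rows):
    """Group every query under the ID it was logged with."""
    by_user = defaultdict(list)
    for user_id, query in rows:
        by_user[user_id].append(query)
    return dict(by_user)

def users_mentioning(rows, text):
    """IDs that issued at least one query containing the given text."""
    return {uid for uid, query in rows if text in query.lower()}

# One identifying query is enough to pull out the whole history for that ID.
for uid in users_mentioning(rows, "jane q. example"):
    print(uid, histories(rows)[uid])

# With the ID stripped, the same queries can no longer be tied together.
unlinked = histories([(None, query) for _, query in rows])
print(len(unlinked))
```

The second half is the "random sample without a consistent identification scheme" case: the individual queries are still there, but nothing links them into a per-person history.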

I've now had a chance to spend 5 minutes browsing the data myself. There are a couple hundred examples of people pasting a phishing email into the search box; each email begins: 'Dear [user's *full* name]'.

I picked a name at random and was quickly able to see this guy lives in Ohio but is moving to Georgia, he drives a Chevy van (which he's looking to 'pimp'), he may have stomach cancer, he desperately wants to win the lottery -- and is considering enlarging his...

Ahh, in any case I know his full name and where he lives, combined with *very* personal details.

Ok, there are two points you seem to have missed that make it rather easier than you suggest to identify a lot of these users. The risk is not just perceived.

Number 1: Each entry is time stamped to the second, and includes the web site the user clicked on. It is incredibly easy for anyone who owns a website that one of these AOL users clicked on to access their logs, and correlate the time and HTTP Referer headers to that specific user ID. Therefore you can attach an IP address and time to a UserID in the AOL data - this makes identifying the person a lot easier. Now, and this isn't hypothetical at all, assume you are the nytimes.com website, where people have to log on to read certain stories, or any other website that requires registration. You have the person's email address, IP address, possibly their name, and so on.

This is not impossible, hell it is not even hard. A significant percentage of these AOL users could be identified like this by websites which a number of the users clicked on.

Now point 2: Each UserID is linked to all the queries over 3 months, so you can confirm your IP and possible name data by triangulating possibly personally identifying queries, along with all kinds of other information you wouldn't want someone to know.

Thanks, Reto. Those are privacy violations. I appreciate you pointing out specific examples.

I think there are a couple of mitigating factors with this privacy violation that should be considered.

First, the scope and scale of the violation. The number of people potentially impacted is small. The impact on those people is likely small. This is nothing like the release of millions of credit card numbers or social security numbers. The depth of our outrage should be proportional to the damage done.

Second, AOL Research's motivations in releasing this data appear to be pure. They sought to help the IR research community improve the state of search. You may consider that naive, but I believe that the fact that they were trying to help should lessen our anger.

There's a lot more than that too. That's only what I was able to find with a quick text search of just the first 3 of the 10 text documents. Also I'm sure that cross-referencing the user IDs will reveal further information. I did find someone's name, location, and MySpace, along with what was presumably their Visa #.

It's rather obtuse to say that this is insignificant just because there is so much other ID theft that has happened recently. Plus, this data wasn't even stolen--AOL voluntarily gave it out! ID theft was the first thing that popped into my head when I heard about this and the fact that nobody at AOL realized this before the data were released shows that they seriously need to get their heads out of their butts.

I'm reading through all the comments here, and I have to say that while I agree in theory that some of this data may contain privacy violations, in practice a lot of the examples given by the rest of you commenters are simply false. Just because someone types in a name does not mean it was an "ego surf" query, and therefore does not identify the querier as the named person. I type all sorts of names into the search box all the time.. business contacts, old college friends I'd like to get in touch with again, people whose papers I've read, etc. It is a bit delusional to think that you could infer someone's identity from all the names they typed in.

Now, as for the people who typed in names and soc sec. numbers.. while that is not good, it still does not tie that person's name to the querier. So it identifies one person with one soc#, but it still does not identify the querier.

I mean, for all we know, the querier could be some unscrupulous identity thief who has already managed to get a hold of a soc#, and is trying to find out more info on the web. Ok, that may be a bit of a stretch, but it could also be one of the many businesses that ask for your soc, typing the name in to try to find you, because you are delinquent on your payments. The point is, the querier is still not necessarily the person named in the query.

And even when you really, really think that someone has issued a query that lets you identify them, i.e.: "Hi, my name is Gary D. Sloquist. Where is my homepage?", how can you be certain that all the other queries that follow with the same ID are all -that same- person? People have families. People share computers. People have friends that come visit, and temporarily "borrow" their logons. How do you really know it is the same person?

In general, data like this makes me nervous, too. But I have to agree with Greg that the benefits outweigh the costs. At least that's how I feel today :-)

And Greg has an excellent point when he talks about the scope and scale of the violation. I mean, if you really wanted to find out as much information on some random person as you do in these queries, you could just drive somewhere and steal someone's trash. That is about the scale and scope that we're talking about here.

Actually, I think the trash stealer has the potential for even worse privacy violations than from this AOL data. Think about it a minute, please. From this AOL data, let's suppose an identity thief wanted to do something nasty. Well, chances are (because this is a big world we live in) there are dozens of identity thieves all around the world, thinking the exact same thing. And don't you think the credit card companies are going to be highly suspicious when, in two days from now, credit card charges start appearing simultaneously in Kiev, in Sofia, in Lagos, and in Ft. Lauderdale, plus two dozen more cities around the world? Because all the identity thieves will not be coordinated with each other, and they'll all be attempting it at the same time. Easy to detect, easy to shut down.

Contrast that with the trash thief. One person. Working small. You probably won't detect it until your next credit card cycle. If you even get the bill, because that one person has redirected that mail. It could take weeks before you know, and by then, lots of damage will be done. MUCH more than through the release of the AOL queries.

There is risk everywhere. And while you would be correct if you said "well, YOU wouldn't want your name and soc to appear in this data, would you?", I think the overall scale and scope of this is much smaller than everyone is making this out to be.

But since no one's bank account has been compromised no biggie right? Take a step back, try to perceive the "big picture" here; maybe review the concept behind privacy laws as understood in the US; take a walk in the sun, get some fresh air; sleep on it, see how you feel.

On Sunday the news broke that AOL purposefully released 20 million partially anonymized search queries. On Monday AOL apologized, and later that evening the first web interface to the data went up.

Today the first person was positively identified from the data - Thelma Arnold, a 62-year-old widow who lives in Lilburn, Georgia.

Based on searches ranging from “numb fingers” to “60 single men” to “dog that urinates on everything,” the New York Times was able to quickly determine and confirm her identity. Ms Arnold is AOL searcher no. 4417749.

I have one more comment to make. Anonymous wrote: Each entry is time stamped to the second, and includes the web site the user clicked on. It is incredibly easy for anyone who owns a website that one of these AOL users clicked on to access their logs, and correlate the time and HTTP Referer headers to that specific user ID. Therefore you can attach an IP address and time to a UserID in the AOL data - this makes identifying the person a lot easier.

Incredibly easy, you say? Maybe. But I foresee all sorts of obstacles. First, you are assuming that the AOL timestamp and the website owner's timestamp are perfectly synchronized. What happens when the clocks are two seconds off? 30 seconds? 2 minutes? With all the traffic to your website, esp. from AOL to your website, can you still just as easily tell who was who? What if you are getting dozens of hits per second on your site? Or more?
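The clock-skew question can be put in rough numbers. The sketch below is purely hypothetical: it assumes a site receiving one hit every two seconds and asks how many server-log entries fall within a given tolerance of a single click time. The `candidates` helper and all timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

def candidates(click_time, server_hits, tolerance_s):
    """Server-log hits whose timestamp is within tolerance_s seconds of click_time."""
    window = timedelta(seconds=tolerance_s)
    return [hit for hit in server_hits if abs(hit - click_time) <= window]

base = datetime(2006, 8, 1, 12, 0, 0)

# Hypothetical site log: one hit every 2 seconds for 5 minutes.
server_hits = [base + timedelta(seconds=2 * i) for i in range(150)]

# One click time taken from the released data (also made up).
click_time = base + timedelta(seconds=100)

# How many log entries could plausibly be "the" click, as allowed skew grows?
for tolerance in (0, 2, 30, 120):
    print(tolerance, len(candidates(click_time, server_hits, tolerance)))
```

With perfectly synchronized clocks there is a single candidate; as the allowed skew grows, the number of plausible matches grows with the site's traffic, so the timestamp alone stops being enough to pick out one visitor.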

More importantly, does it even matter, for all the people who search for "cookie recipes" and "wyoming rodeo location"?

So in this whole process, for it to even matter, you have to have a website owner that is determined to find someone who is doing something "bad". Then you have to hope that this person actually has a query that is personally identifiable. Then, you have to hope that they clicked on _your_ website, in response to your query, instead of some other website.

Then, even if all those coincidences match up, and you can get the clock times to synch, and separate it out from all your other traffic, you still have the problem of knowing whether or not it was actually the same person issuing that query, as had issued all the other queries. You still have the problem of knowing whether or not it was someone's friend or child or spouse borrowing the computer.

I do agree, there are some privacy concerns. But the blogosphere also just needs to calm down and get some perspective. Scale and scope.

Ok, great, person #4417749 was "exposed". But let's read a little more closely. The searches led reporters to make a pretty reasonable guess. But, in the end, how was she actually discovered?

“Those are my searches,” she said, after a reporter read part of the list to her.

She self-identified. No one actually proved it was her.

Furthermore, the article goes on to say that many of the things you think you learned about her were, in fact, false:

Her search history includes “hand tremors,” “nicotine effects on the body,” “dry mouth” and “bipolar.” But in an interview, Ms. Arnold said she routinely researched medical conditions for her friends to assuage their anxieties. Explaining her queries about nicotine, for example, she said: “I have a friend who needs to quit smoking and I want to help her do it.”

So, wow, we have been able to identify 1 person out of over 657,000. And she's a little old lady with no bank account numbers revealed. And half of what we think we learned about her was actually about her friends. And, without actually asking her, we have no way of knowing which of those ailments are hers, and which are her friends'. Which puts everything into suspect territory, and thus useless knowledge. Just as I suspected above.

Scale and scope. 1 out of 657,000. You have a greater chance of dying by falling out of a building (see here) than you do being identified as an AOL querier. Sorry, not even identified... educatedly guessed.

I still stand by feeling that you have more of a privacy risk by getting your trash stolen than you do with the AOL queries in this dataset.

My point, arethusa, uke, and anonymous, is that if the reporter had gone to this lady's house, and showed her the list of searches, and she had said "No, those are not my searches", could the reporter still have positively and conclusively identified her?

We'd need more details on the reporter's methods to know for sure. But from this article, it doesn't sound like it.

This whole "expose" hinges on this lady's cooperativeness.

And, even with her cooperativeness, the other point still stands about not knowing which queries were actually hers, and which were for her friends.

Hi Jeremy, I'm not sure why this problem is so hard to understand. The fact that it's not possible to conclusively prove that an IP address or a set of searches belong to a person is irrelevant. For a moment, stop focusing on the technical details of this issue like synchronizing timestamps, and look at it from the perspective of a real user.

Jeremy, let's say your full name appeared as part of a set of searches done by user 12345. The next search done by that same user was about repairing a car of the model and year that you own. Another search is for a pizza place in your neighborhood, a hotel in your last vacation spot that you told all your friends about, and also searches about your favorite programming language B# (although I would wonder about a programmer using AOL?), and a few acquaintances. Interspersed with these are searches for "transexual teen escorts who take credit cards" and "free syphilis clinic".

So, let's say your geeky co-worker / boss / prospective employer searches this data for your name "Jeremy FullName" just out of curiosity and finds those bits of information all nicely grouped together.

Now, do you think that it matters one bit that they cannot prove with 100% certainty that you were the person behind the machine? Would you really just shrug it off and say "you can't prove it was me, I just won't admit to it."

If you really think you would, then feel free to post all your future web postings using your full name and home address, since no one could prove 100% that it was you who are sitting at that machine. That's the same lack of common sense that these researchers demonstrated in the first place.

Actually, I think the damage here is much worse than getting your credit card # posted online, because in that case, who cares, just close those accounts. But if real people get associated with some of these searches, rightly or wrongly, then as they say "where can they go to get their reputations back"?

Good point, Kevin Murphy. I apologize for characterizing your prediction as "outlandish".

I continue to believe some of the other claims are outlandish -- that this release of data will lead to identity theft and exposure of very private data at anywhere near the scale of past scandals -- but I apologize for describing your words in that manner.

There were privacy issues with this AOL data release that need to be addressed. We have to find a way to facilitate information retrieval research -- to build the next generation of search and help people all over the world quickly find the information they need -- without risking individual privacy.

"If someone comes up with a clear example of a privacy violation from this AOL data, I would be convinced."

the clear examples are growing, people only need to look at the data to serve them up to you. now you should eat your humble pie and announce your "convincedness." instead you are proposing "mitigating factors" in a could-be-satire denial of solid...err...data. Ironic how personal agenda clouds the ability to datamine.

we won't hate you if you say "yeah i'm a researcher. i love playing with data. therefore i support this type of release of information despite any (perceived or real) consequences." but we (i) will respect you less if you make a challenge and then refuse to accept the evidence you've demanded

personally, i love the idea of digging into such a big, real dataset, gonna start tonight. but to echo Ho John Lee, happy to not be one of those AOL customers...

to be honest with myself i say that ultimately it's wrong, but i will participate anyway.

You didn't take that walk outside in the sun and then nap, did you? I swear to God if you mention "but it wasn't a bank account #!" one more time I'm going to wave a magic wand and send you back to junior high civics class.

Privacy laws are not in place solely to protect your bank account #, or else almost any kind of site that asks you to input personal info wouldn't have a "Privacy Agreement" (ever read one? AOL's, for example?). And there's no magic "million" number that companies have to reach before they have ethically and legally compromised a customer's privacy.

Parker, you make a good point, but there are still problems with what you are saying. Regarding your name/car/pizza example: Guess what? I actually do search for the names of my co-workers, so I can find other papers and projects they've done in the past. I also happen to admire some of the cars they drive, such as the Toyota Prius, and have searched for that car as a result of riding in the passenger seat with them. Finally, since we work/live in the same general area (after all, we are co-workers), I know that I've searched for pizza joints near where they live.

According to you, you would see this search history, and automatically think the searcher is my co-worker. It has his name, his car, and his local pizza joint.

But you would be wrong.

I admit sometimes, though, you could be right. And sometimes you could make a pretty informed guess.

So at this point, I go back to the scope and scale argument. So far you have a 1 in 657,000 chance of randomly being identified. If you turn it around, and look at the probability of someone actually going out and looking for you, specifically, and being able to piece together all the pieces they need to really nail it down to you, and not to an employer, co-worker, friend, family member, random fan, random web freak, despised arch-enemy, or whoever happened to be looking for you, then I think the chances are very low of someone actually succeeding.

If someone is really looking for dirt on you, on specifically you, then there are methods that are both much more effective and much more efficient than trawling the AOL data. They could steal your garbage. They could try logging in to your gmail account using the name of your pet dog. Whatever.

There are serious violations of our privacy occurring all the time in the U.S. Think of the recent AT&T/government scandal. That, IMO, is much worse than this AOL data.

But if this move by AOL starts a good dialogue, then I am all for that. I just think there is a lot of overreaction at the moment.

arethusa writes: "I swear to God if you mention "but it wasn't a bank account #!" one more time I'm going to wave a magic wand and send you back to junior high civics class."

Um, dude, with all respect: Are you high? I mentioned "bank account" once and only once. And it was only after you brought it up.

But let's go to what you are really saying: Privacy is not just about bank accounts.

I guess my question to you is: Of all the information that you give to other people and/or companies, what constitutes private data? Certainly when you give a company your soc security number, that is private. But how about when you type a query? Is that really "private information", as defined by law or even ethics?

Here is another example: let us suppose you are ordering a meal in a restaurant. You issue your query: "Can I have the salad?". Is that considered private data? Is the restaurant required not to disclose that data to anyone? Will they be legally brought to court if they publish the fact that you ordered a salad? Is your ordering of the salad "private data"?

Uh, Jeremy, that was a rather silly example, wasn't it? A more realistic analogy might be that someone calls Domino's and orders an extra large meat lovers with extra cheese, and the phone company releases the call details. Furthermore, maybe that person had just previously talked to their insurance company, who get the data set, correlate the call times, find out that our friend is at higher risk for coronary disease due to his diet and decide to up his premium ...

Jeremy: Yes. Anything I tell a company that I don't explicitly give them permission to tell someone else should be considered private.

Do I realistically expect this? No. Would I be justifiably angry if I found this not to be the case? Yes.

Deciding what should be private is easy for some -- health and finance spring to mind -- but more difficult to pin down for others. My food order? *I* don't care, but maybe if my insurance company was adjusting my premium based on my diet I would. Point is, while it's good practice to assume nothing you do or say will remain private (particularly online), we should still demand privacy from those who we trust with our information.

Anonymous, yes, my example was silly. But it still illustrates a point. Think about all the actions you perform out in the world. What is, what should be, and what possibly can be, private?

As far as your example about your food choices and your insurance company (something Reto also mentioned), certainly you have seen the "Ordering Pizza in 2010" video from the ACLU. It's a good one. I myself have those same fears.

But let us compare what AOL has done, with what is already happening. Take, for example, the Google privacy policy: "We may combine personal information collected from you with information from other Google services or third parties [emphasis mine] to provide a better user experience, including customizing content for you"

In other words, Google admits that they could very well "mashup" your data with data that they have received from (third party) insurance companies.

So your nightmare scenario is already out there. The release of this AOL data didn't add anything new. You've already got companies that you use preparing themselves to "mashup" your data. The AOL data is the least of our worries.

it's not too difficult to identify the more verbose users. i've identified several just today, including names, addresses, phone, email, myspace, and in some cases credit and ssn. there are definitely some identified queries that would cause a great deal of embarrassment if known, and the potential for identity theft or blackmail is high. i could be more understanding of aol if this data was stolen, but to be so incredibly dense as to release it to the public without considering the ramifications frankly borders on mental retardation. surely aol has data mined this stuff before. i just can't imagine anyone would be that stupid.