AOL, Netflix and the end of open access to research data

The authors of the Netflix de-anonymization study contacted me to point out that they originally published a draft of their results a mere two weeks after Netflix released its dataset. Netflix has known about their study for over a year.

Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally exposed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized user data with academic researchers, at least not without requiring contracts and nondisclosure agreements. For users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.

In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to its online DVD rental service. The company then offered $1 million to anyone who could improve its DVD recommendation system. To protect its customers’ privacy, Netflix anonymized the dataset by removing any personal details.

{This story demonstrates the incredible ease of re-identifying anonymized data. Consider the implications for the nation’s treasure trove of health data: anonymized or de-identified health records are clearly not safe either. Electronic health records contain far more pieces of identifiable information than Netflix ratings, making them far easier to re-identify. The researchers gave an example of what they could learn by re-identifying the data of one Netflix user: ‘First, we can immediately find his political orientation based on his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11.” Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”. He did not like “Super Size Me” at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, “Bent” and “Queer as folk” were rated one star out of five. He is a cultish follower of “Mystery Science Theater 3000”. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.’ See Bill Yasnoff’s blog about Netflix: he argues that the ease of re-identifying health data is why we need health trusts. If we consent, research can be done safely inside the health trust, and we don’t have to risk releasing sensitive data. ~Dr. Deborah Peel, Patient Privacy Rights}