The Second Netflix Challenge and Privacy Research

Okay, if you’re just catching up with this story, start with Netflix’s announcement that it was canceling its second Netflix Prize challenge over privacy concerns.

Next, head over to 33bits.org, blog of one of the co-authors of the paper on de-anonymizing Netflix users from the first Netflix Prize challenge data, to read the authors’ open letter to Netflix about the canceled second challenge.

“Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas. You will be able to move forward with the next competition, and just as importantly, it will enable you to become a leader in privacy-preserving data analysis. One potential outcome could be an enterprise-ready system which would be useful to any company or organization that outsources analysis of sensitive customer data.”

I find that paragraph from the post particularly interesting. This seems similar to the conversations between security researchers and the companies whose products they find ways to exploit. That has often been a very hostile conversation, but it seems (speaking from outside the security community) to have improved over time. (For instance, check out this security research guidelines document from PayPal.) Is there a way for privacy research to head in a similar direction, so that companies come to view external researchers as beneficial? If anything, that seems like a bigger challenge to me; at least everyone usually agrees that security holes should be fixed, while most companies do not agree, publicly at least, that privacy breaches are really a problem (e.g., “Get over it.”).

David Karger

I have believed for some time, and argued to several search engine providers, that a good way to balance the need for research access against the need for privacy would be for companies to offer a sandbox inside the company where researchers could run their code over sensitive data and get measurements (but not data) out. It isn’t a panacea as various tricks could be used to encrypt data in the results, but I think that with supervision it could work well enough.

Don McIntosh

Seems like a good opportunity for Netflix. Pity they gave up on it.

@David Karger: You may already be aware of this… Many national stats offices take this approach, which they call data laboratories. Some also have solutions (or are in the process of developing them) that allow researchers to run some queries online through apps and/or Web services that run against sensitive data but confidentialize the results before passing them back to the researcher. The confidentialization techniques used are designed by data privacy researchers (many of whom work in stats offices).
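
To make the idea of confidentialized query results a little more concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular stats office does it: the function name, the `min_cell` threshold, and the `noise_scale` parameter are all assumptions made up for this example. It combines two commonly described ideas, suppressing small counts and perturbing the rest with random noise, and it makes no claim to a formal privacy guarantee.

```python
import random


def confidentialized_count(records, predicate, min_cell=5, noise_scale=2.0, rng=None):
    """Return a perturbed count of matching records, or None if the cell is too small.

    Hypothetical helper for illustration; thresholds and noise are assumptions.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))

    # Suppress small cells: a count of one or two people is close to raw data.
    if true_count < min_cell:
        return None

    # Laplace-distributed noise, built as the difference of two exponentials
    # that each have mean `noise_scale`.
    noise = rng.expovariate(1 / noise_scale) - rng.expovariate(1 / noise_scale)

    # The researcher only ever sees the rounded, perturbed value.
    return max(0, round(true_count + noise))


# Usage: the researcher supplies a query, the data never leaves the server.
records = [{"age": age} for age in range(100)]
print(confidentialized_count(records, lambda r: r["age"] >= 50))
print(confidentialized_count(records, lambda r: r["age"] < 3))
```

The second query matches only three records, so the sketch refuses to answer rather than release a noisy number about a handful of people.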

bowerbird

stupid! just give people a chance to opt-in to the experiment.

-bowerbird

Marcin Wojnarski

@David: there is a competition platform, TunedIT Challenges (http://tunedit.org/challenges), which allows for just this: keeping all sensitive data secret on a server and letting participants evaluate their algorithms against these data, without releasing them. This approach was used recently in a data mining contest related to DNA microarray data analysis: http://tunedit.org/challenge/RSCTC-2010-A
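
As a rough illustration of that pattern (data stays on the server, only a score comes back), here is a small Python sketch. It is not TunedIT's actual interface; the `HiddenTestSet` class and its `evaluate` method are hypothetical names invented for this example. A real system would also need process isolation and supervision, since an underscore-prefixed attribute in Python hides nothing from code running in the same process.

```python
class HiddenTestSet:
    """Holds sensitive examples server-side and releases only an accuracy score.

    Hypothetical class for illustration; not any real platform's API.
    """

    def __init__(self, features, labels):
        self._features = list(features)
        self._labels = list(labels)
        if len(self._features) != len(self._labels):
            raise ValueError("features and labels must have the same length")

    def evaluate(self, predict):
        """Run a participant's `predict` function and return accuracy only."""
        predictions = [predict(x) for x in self._features]
        correct = sum(1 for p, y in zip(predictions, self._labels) if p == y)
        # A single number goes back to the participant, never the records.
        return correct / len(self._labels)


# Usage: the organizer builds the test set; participants only submit functions.
hidden = HiddenTestSet(features=[1, 2, 3, 4], labels=[0, 1, 0, 1])
print(hidden.evaluate(lambda x: 0))
```

Even a single score can leak a little information if a participant is allowed unlimited submissions, which is one reason supervision and submission limits matter in this kind of setup.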
