Why you can't really anonymize your data

One of the joys of the last few years has been the flood of real-world datasets being released by all sorts of organizations. These usually involve some record of individuals’ activities, so to assuage privacy fears, the distributors will claim that any personally-identifying information (PII) has been stripped. The idea is that this makes it impossible to match any record with the person it’s recording.

Something that my friend Arvind Narayanan has taught me, both with theoretical papers and repeated practical demonstrations, is that this anonymization process is an illusion. Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records. Arvind first demonstrated this when he and a fellow researcher took the “anonymous” dataset released as part of the first Netflix Prize and showed how they could correlate the movie rentals listed with public IMDb reviews. That let them identify some named individuals, and then gave them access to those individuals’ complete rental histories. More recently, he and his collaborators used the same approach to win a Kaggle contest by matching the topology of an anonymized version and a publicly crawled version of the social connections on Flickr. They were able to take two partial social graphs and, like piecing together a jigsaw puzzle, figure out which fragments matched and represented the same users in both.
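To make the cross-referencing idea concrete, here is a minimal sketch of a linkage attack on toy data. It is not the method used in the research described above; the datasets, the names, the `link` function and the `min_overlap` threshold are all invented for illustration. It simply pairs each public profile with the "anonymous" ID that shares the most records, and only accepts the match when there is a clear winner.

```python
# Hypothetical "anonymized" release: opaque user ID -> set of (movie, rating) records.
anonymized = {
    "u1": {("Brazil", 5), ("Heat", 3), ("Alien", 4), ("Fargo", 5)},
    "u2": {("Heat", 4), ("Alien", 4), ("Big", 2)},
}

# Hypothetical public reviews posted under real names.
public = {
    "alice": {("Brazil", 5), ("Fargo", 5), ("Alien", 4)},
    "bob": {("Big", 2), ("Heat", 4)},
}


def link(anonymized, public, min_overlap=2):
    """Map each public name to the anonymized ID that shares the most records."""
    matches = {}
    for name, reviews in public.items():
        # Score every anonymized ID by how many records it has in common.
        scores = {uid: len(reviews & records) for uid, records in anonymized.items()}
        best = max(scores, key=scores.get)
        runner_up = max((s for uid, s in scores.items() if uid != best), default=0)
        # Accept only a sufficiently large and unambiguous overlap.
        if scores[best] >= min_overlap and scores[best] > runner_up:
            matches[name] = best
    return matches


matches = link(anonymized, public)
for name, uid in matches.items():
    # Everything in the "anonymous" record is now tied to a name,
    # including entries the person never posted publicly.
    print(name, uid, sorted(anonymized[uid] - public[name]))
```

Even this crude overlap count exposes records the named person never published, which is the core of the problem: a handful of public data points is enough to unlock the rest of the record.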

All the known examples of this type of identification are from the research world — no commercial or malicious uses have yet come to light — but they prove that anonymization is not an absolute protection. In fact, it creates a false sense of security. Any dataset that has enough information on people to be interesting to researchers also has enough information to be de-anonymized. This is important because I want to see our tools applied to problems that really matter in areas like health and crime. This means releasing detailed datasets on those areas to researchers, and those are bound to contain data more sensitive than movie rentals or photo logs. If just one of those sets is de-anonymized and causes a user backlash, we’ll lose access to all of them.

So, what should we do? Accepting that anonymization is not a complete solution doesn’t mean giving up; it just means we have to be smarter about our data releases. Below I outline four suggestions.

Keep the anonymization

Just because it’s not totally reliable, don’t stop stripping out PII. It’s a good first step, and makes the reconstruction process much harder for any attacker.
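As a rough sketch of what this first step can look like, the snippet below drops direct identifiers and swaps the internal user ID for a random pseudonym. The field names, the `PII_FIELDS` set and the `strip_pii` helper are assumptions made up for this example; a real release would need its own audit of which fields count as PII.

```python
import secrets

# Hypothetical list of fields treated as direct identifiers.
PII_FIELDS = {"name", "email", "phone"}


def strip_pii(rows, id_field="user_id"):
    """Drop PII fields and replace the user ID with a random pseudonym."""
    pseudonyms = {}  # original ID -> random token, never released
    cleaned = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in PII_FIELDS}
        original_id = out.pop(id_field)
        if original_id not in pseudonyms:
            # A random token, not a hash of the ID, so it cannot be recomputed.
            pseudonyms[original_id] = secrets.token_hex(8)
        out["pseudonym"] = pseudonyms[original_id]
        cleaned.append(out)
    return cleaned


rows = [
    {"user_id": 17, "name": "Ann Lee", "email": "ann@example.com", "movie": "Heat", "rating": 4},
    {"user_id": 17, "name": "Ann Lee", "email": "ann@example.com", "movie": "Alien", "rating": 5},
    {"user_id": 42, "name": "Bo Chan", "email": "bo@example.com", "movie": "Heat", "rating": 2},
]
released = strip_pii(rows)
```

Note that the remaining fields (here, which movies were rated and how) are exactly what a cross-referencing attack works from, which is why this is only a first step.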

Acknowledge there’s a risk of de-anonymization

Don’t make false promises to users about how anonymous their data is. Make the case to them that you’re minimizing the risk and possible harm of any data leaks, sell them on the benefits (either for themselves or the wider world) and get their permission to go ahead. This is a painful slog, but the more organizations that take this approach, the easier it will be. A great model is Reddit, which asked its users to opt in to sharing their data. They got a great response.

Limit the detail

Look at the records you’re getting ready to open up to the world, and imagine that they can be linked back to named people. Are there parts of them that are more sensitive than others, and maybe less important to the sort of applications you have in mind? Can you aggregate multiple people together into cohorts that represent the average behavior of small groups?
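One way to picture the cohort idea is the sketch below, which coarsens each record into an age band and a zip-code prefix, publishes only per-cohort averages, and suppresses any cohort that is too small to hide an individual. The records, the banding rules and the `min_size` threshold are all hypothetical choices for illustration, not a recommendation for any particular dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual records: (age, zip_code, visits_per_year).
records = [
    (23, "94110", 2), (27, "94110", 4), (29, "94110", 3),
    (41, "94110", 6), (44, "94110", 8),
    (35, "10001", 1),
]


def cohort_key(age, zip_code):
    """Coarsen a person into a ten-year age band and a three-digit zip prefix."""
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3])


def aggregate(records, min_size=3):
    """Return per-cohort averages, dropping cohorts smaller than min_size."""
    groups = defaultdict(list)
    for age, zip_code, visits in records:
        groups[cohort_key(age, zip_code)].append(visits)
    return {
        key: {"n": len(values), "mean_visits": mean(values)}
        for key, values in groups.items()
        if len(values) >= min_size
    }


cohorts = aggregate(records)
```

With this toy data only the 20-29 cohort in the 941 area survives; the other two groups contain one or two people and are withheld. The trade-off is visible immediately: the stricter the threshold, the less detail is left for researchers.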

Learn from the experts

There are many decades of experience in dealing with highly sensitive and personal data in sociology and economics departments across the globe. They’ve developed techniques that could prove useful to the emerging community of data scientists, such as subtle distortions of the information to prevent identification of individuals, or even the sort of locked-down clean-room conditions that are required to access detailed IRS data.
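The text above doesn't say which distortion techniques are meant, so the following is just one plausible example: adding small random noise to each value so that no released number matches the true one exactly, while averages over many records stay close to the truth. The Laplace noise, the `scale` value and the income figures are assumptions chosen for this sketch; picking a noise level that actually protects people is where the experts' experience comes in.

```python
import random
from statistics import mean


def laplace_noise(scale, rng):
    """Draw Laplace-distributed noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def perturb(values, scale=1.0, seed=None):
    """Return a copy of values with independent noise added to each entry."""
    rng = random.Random(seed)
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical incomes, repeated to give a sample large enough to average.
incomes = [42_000, 58_500, 61_200, 75_000, 39_900] * 200
noisy = perturb(incomes, scale=500.0, seed=0)

# Individual values are distorted, but the aggregate is nearly unchanged.
print(round(mean(incomes)), round(mean(noisy)))
```

Noise like this makes exact-value matching against other datasets harder, but it does not by itself rule out re-identification when a record has many distinctive fields.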

There’s so much good that can be accomplished using open datasets that it would be a tragedy if we let this slip through our fingers with preventable errors. With a bit of care up front, and an acknowledgement of the challenges we face, I really believe we can deliver concrete benefits without destroying people’s privacy.

Pete Warden is the tech lead of the TensorFlow Mobile team, and was formerly the CTO of Jetpac, which was acquired by Google in 2014 for its deep learning technology optimized to run on mobile and embedded devices. He's previously worked at Apple on GPU optimizations for image processing, and has written several books on data processing for O'Reilly.