No silver bullet: De-identification still doesn't work

Paul Ohm’s 2009 article Broken Promises of Privacy spurred a debate in legal and policy circles on the appropriate response to computer science research on re-identification techniques. In this debate, the empirical research has often been misunderstood or misrepresented. A new report by Ann Cavoukian and Daniel Castro is full of such inaccuracies, despite its claims of “setting the record straight.”

In a response to this piece, Ed Felten and I point out eight of our most serious points of disagreement with Cavoukian and Castro. The thrust of our arguments is that (i) there is no evidence that de-identification works either in theory or in practice and (ii) attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.

Specifically, we argue that:

There is no known effective method to anonymize location data, and no evidence that it’s meaningfully achievable.

Computing re-identification probabilities based on proof-of-concept demonstrations is silly.

Cavoukian and Castro ignore many realistic threats by focusing narrowly on a particular model of re-identification.

Cavoukian and Castro concede that de-identification is inadequate for high-dimensional data. But nowadays most interesting datasets are high-dimensional.

Penetrate-and-patch is not an option.

Computer science knowledge is relevant and highly available.

Cavoukian and Castro apply different standards to big data and re-identification techniques.

Quantification of re-identification probabilities, which permeates Cavoukian and Castro’s arguments, is a fundamentally meaningless exercise.

Data privacy is a hard problem. Data custodians face a choice between roughly three alternatives: sticking with the old habit of de-identification and hoping for the best; turning to emerging technologies like differential privacy that involve some trade-offs in utility and convenience; and using legal agreements to limit the flow and use of sensitive data. These solutions aren’t fully satisfactory, either individually or in combination, nor is any one approach the best in all circumstances.

Change is difficult. When faced with the challenge of fostering data science while preventing privacy risks, the urge to preserve the status quo is understandable. However, this is incompatible with the reality of re-identification science. If a “best of both worlds” solution exists, de-identification is certainly not that solution. Instead of looking for a silver bullet, policy makers must confront hard choices.

Rather than wait to write a complete response to the other points in the “No silver bullet” paper, I thought I would address the key ones here directly in the interest of timeliness.

(1) Heritage Health Prize (HHP)
As noted, Arvind Narayanan performed a re-identification attack on the HHP data set before it was released for the competition. Admittedly the techniques he used were well thought out and attempted to draw additional inferences from the data to make them more easily match external data. But at the end of the day _not a single patient in that data set was re-identified during the attack_. That is arguably quite compelling evidence that the assumptions we made during the de-identification were reasonable.

(2) NYC Taxi Data
The authors rightly criticize the NYC taxi data release. I think most people working in this space would agree that this data was poorly anonymized by any standard. It ought not to be taken as an example of what should be done; it should be dismissed as obviously poor practice. The practical challenge is that many data custodians who release data are not using best practices to de-identify it. We would be much better off if we could disseminate effective de-identification techniques to these organizations and help them improve their practices; otherwise there will be more stories like this.

(3) De-identifying Location Data
A paper on this topic is in the process of publication. Stand by …

(4) Other Controls
In practice, data is not always disclosed with no controls. Often contractual and other administrative controls are imposed on the data recipients. In the case of the HHP, the terms of use prohibited re-identification attacks (on the final data set), and I know that this acted as a deterrent. Risks are managed through a number of controls, and one of those controls is to modify the data itself. Public data are a special case where the range of controls is limited, but it is not the only case: a lot of data sharing is not for public-use files.

Regarding point (1): this is a very strange claim. In the referenced paper, Narayanan himself claims that about one eighth of the patients in the dataset would be identifiable with the addition of a small amount of information. That he did not proceed to specifically identify any single person cannot be used as a counter-claim, nor does the fact that he stopped there say anything about the security of the dataset.

Felten and Narayanan make some very good points. Additional research into de-identification is a fine thing, but at the moment it is quite clear that if you are releasing data that contains location information, or is high-dimensional, on a population of significant size, you should assume that a significant number of the records in the dataset can be de-anonymized. As Felten and Narayanan quite correctly state, real attackers do not accept logical constraints on the class of methods they use. Inferences drawn under such constraints therefore provide little or no assurance about what real attackers are likely to be able to do.

“Narayanan himself claims that about one eighth of the patients in the dataset would be identifiable with the addition of a small amount of information.”

That “small amount of information” he is referring to is knowing seven diagnosis codes attributed to 17,000 patients. It was meant as an exercise to show the sensitivity of the results to changes in the assumptions. Fair enough. But is it reasonable to assume an adversary would know this much information? All models require a reasonable set of assumptions, and this is a conversation worth having.

But let’s not forget that the released data set was also a random subsample of the Heritage Health Prize data set, which is in itself a subset of the population of California. We can’t reliably claim that the one in eight were identifiable without attempting to match them to the population (not that Narayanan was trying; he was doing a sensitivity analysis). Just knowing who is in the Heritage Provider Network’s data would be difficult (I don’t know my neighbor’s insurance provider).

Why would you need to know who is in the dataset to make an inference? If you suspect that your neighbor is in the dataset, you can look in the dataset and see if there is somebody who matches what you know about your neighbor. If nobody matches, then you know your neighbor wasn’t in the dataset. If somebody matches, then you can infer a significant probability that the matching person is your neighbor. (The exact probability depends on circumstances.)
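The inference described above can be made concrete with Bayes’ rule. The sketch below uses invented illustrative numbers (the prior that the neighbor is in the sample, and the chance of a coincidental match are both assumptions, not figures from the discussion), and simplifies by assuming that if the neighbor is present, their record always matches the known attributes:

```python
# Sketch of the matching inference above, with made-up illustrative numbers.
# Simplifying assumptions: the neighbor's record always matches if present,
# and at most one unrelated record matches by coincidence.

def posterior_match_probability(p_in: float, p_coincidence: float) -> float:
    """P(the matching record is the neighbor | a match was found).

    p_in: prior probability the neighbor is in the released sample.
    p_coincidence: probability that some *other* record matches the
                   known attributes by chance.
    """
    # A match is found either because the neighbor is present, or because
    # the neighbor is absent and an unrelated record matches by chance.
    p_true_match = p_in
    p_any_match = p_in + (1 - p_in) * p_coincidence
    if p_any_match == 0.0:
        return 0.0
    return p_true_match / p_any_match

# Example: a 10% prior that the neighbor is in the sample and a 1% chance
# of a coincidental match already yield a posterior above 90%.
print(round(posterior_match_probability(0.10, 0.01), 3))  # ≈ 0.917
```

The point the calculation illustrates is the one made in the comment: even a modest prior plus a rare attribute combination makes a found match highly informative.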

Yes! Determining whether or not an individual is present in a sample dataset, given a known identifier or two, is the relevant use case, from my perspective. It isn’t a rule-out scenario, of course. The point is that truly effective de-identification needs to work from both directions. I’m not convinced that it can.

Khaled El Emam is correct in pointing out that measurement of re-identification probabilities has been the subject of extensive research attention for the past 30 years. Statistics and probability haven’t changed, not that you’d know given all the Bayesian (fan club) noise! What has changed is access to computational resources. Try running SAS jobs on a nice IBM mainframe – it is amazing! My understanding is that distributed systems and R programs have made computationally intensive analysis widely accessible. That is why I am concerned: data privacy is very important! I have worked as a statistical analyst in settings subject to HIPAA privacy rules for PHI (protected health information) – one position was for active and retired military personnel, the other for children with special health care needs. That is where I learned that a risk-based approach to healthcare data anonymization is tempting but dangerous.

This conversation would do well to move beyond a binary attitude toward de-identification as a privacy solution and toward a risk-based outlook. We’re finally moving along that trajectory in information security; imagine if we had decided, writ large, to toss all those models in light of the fact that there’s no such thing as perfect security. At the end of the day, the ‘performance standards’ against which de-identification/anonymization are measured – laws and regulations – are risk-based, not strict liability (note that I’m not talking about the release of direct identifiers, as may be prohibited by, e.g., HIPAA or state data breach laws).

While we’re at it, for many kinds of attack, re-identification need not be fully successful to have a significant impact. If you can narrow the set of possible matches by a few orders of magnitude, or the likelihood that you’re looking at your desired target by a few sigmas, that will do fine.
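The arithmetic behind “a few orders of magnitude” is simple multiplication. As a minimal sketch with invented numbers (the attribute retention fractions are assumptions chosen for illustration, and attributes are assumed independent), each quasi-identifier an attacker learns shrinks the candidate pool multiplicatively:

```python
# Illustrative sketch (made-up numbers): how quickly a few independently
# applied quasi-identifiers shrink a candidate pool, long before it
# reaches a single record.

def remaining_candidates(pool_size: int, retention_fractions: list[float]) -> float:
    """Expected number of records still matching after each attribute is applied."""
    remaining = float(pool_size)
    for fraction in retention_fractions:
        remaining *= fraction
    return remaining

# A city of 1,000,000 people; suppose ZIP code keeps ~1/50 of candidates,
# birth year ~1/70, and sex ~1/2 (all hypothetical fractions).
pool = remaining_candidates(1_000_000, [1 / 50, 1 / 70, 1 / 2])
print(round(pool, 1))  # ≈ 142.9 -- nearly four orders of magnitude
                       # below the starting million
```

Even without a unique match, a pool of a few hundred candidates can be devastating when combined with one further piece of outside information, which is the comment’s point.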

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.