De-identification methods for open health data: the case of the Heritage Health Prize
claims dataset.

Abstract

BACKGROUND: There are many benefits to open datasets. However, privacy concerns have hampered
the widespread creation of open health data. There is a dearth of documented methods
and case studies for the creation of public-use health data. We describe a new methodology
for creating a longitudinal public health dataset in the context of the Heritage Health
Prize (HHP). The HHP is a global data mining competition to predict, from claims
data, the number of days patients will be hospitalized in a subsequent year. The winner
will be the team or individual with the most accurate model past a threshold accuracy,
and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends
on April 3, 2013.

OBJECTIVE: To de-identify the claims data used in the HHP competition and ensure that it meets
the requirements in the US Health Insurance Portability and Accountability Act (HIPAA)
Privacy Rule.

METHODS: We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard
for disclosing the competition dataset. Three plausible re-identification attacks
that can be executed on these data were identified. For each attack the re-identification
probability was evaluated. If it was deemed too high, a new de-identification
algorithm was applied to reduce the risk to an acceptable level. We performed an actual
evaluation of re-identification risk using simulated attacks and matching experiments
to confirm the results of the de-identification and to test sensitivity to assumptions.
The main metric used to evaluate re-identification risk was the probability that a
record in the HHP data can be re-identified given an attempted attack.

RESULTS: An evaluation of the de-identified dataset estimated that the probability of re-identifying
an individual was 0.0084, below the 0.05 probability threshold specified for the competition.
The risk was robust to violations of our initial assumptions.

CONCLUSIONS: It was possible to ensure that the probability of re-identification for a large longitudinal
dataset was acceptably low when it was released for a global user community in support
of an analytics competition. This is an example of, and methodology for, achieving
open data principles for longitudinal health data.
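
As a rough illustration of the kind of risk check described in the Methods, the sketch below estimates an average re-identification probability and compares it with the 0.05 threshold reported above. It is not the authors' algorithm: the abstract does not state the estimator, so the code assumes a common one (each record's risk is 1 divided by the size of its equivalence class on the quasi-identifiers an attacker might match on). The function names, field names, and toy records are hypothetical.

```python
from collections import Counter

# Threshold taken from the abstract; everything else here is illustrative.
RISK_THRESHOLD = 0.05


def average_reid_risk(records, quasi_identifiers):
    """Average per-record re-identification probability.

    Assumption: an attacker matches on the given quasi-identifiers, so a
    record's risk is 1 / (size of its equivalence class).
    """
    if not records:
        raise ValueError("records must be non-empty")
    # Group records that share the same quasi-identifier values.
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    class_sizes = Counter(keys)
    return sum(1.0 / class_sizes[k] for k in keys) / len(keys)


def meets_threshold(records, quasi_identifiers, threshold=RISK_THRESHOLD):
    """True if the average risk is at or below the chosen threshold."""
    return average_reid_risk(records, quasi_identifiers) <= threshold


# Toy, made-up records (not HHP data): two equivalence classes of size 2.
toy = [
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "60-69", "sex": "M"},
    {"age_group": "60-69", "sex": "M"},
]
risk = average_reid_risk(toy, ["age_group", "sex"])
print(risk, meets_threshold(toy, ["age_group", "sex"]))
```

With this estimator, coarsening the quasi-identifiers merges records into larger equivalence classes and lowers the average risk, which is the general effect a de-identification step aims for when the risk is deemed too high.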