Khaled El Emam noted that de-identification has been simplified through automation. The process of de-identification in practice involves assessing risk, classifying the variables in the file, and mapping the data. These contribute to specifications in an automated anonymization engine through which the original data are run to produce the anonymized data for release.

Adversaries (that is, those who might re-identify the data) may include academia, the media, a person’s acquaintances, the data recipient, and malicious actors. Interestingly, there is no apparent economic case for malicious re-identification of health data; the bigger concern is the media.

An identifier must satisfy three general criteria: it must be replicable, distinguishable, and knowable. Replicable means that the identifier is sufficiently stable over time and has the same values for the data subject in different data sources (for example, blood glucose level is not replicable, but date of birth is replicable). A potential identifier is distinguishable if there is sufficient variation in the values of the field that it can distinguish among data subjects (for example, a diagnosis field will have low distinguishability in a database of only breast cancer patients but high distinguishability in a claims database). An identifier must be knowable by an adversary, and how much an adversary knows will depend on whether the adversary is an acquaintance of the data subject or not. If an adversary is not an acquaintance, the types of information that are available include inferences from existing identifiers, such as date of hospital discharge at birth, and public data, such as voter registration lists.
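One illustrative way to quantify the distinguishability criterion is the normalized entropy of a field's value distribution; the metric below and the sample diagnosis codes are assumptions for illustration, not a method described at the workshop:

```python
from collections import Counter
import math

def distinguishability(values):
    """Normalized Shannon entropy of a field's values: 0.0 when every
    record shares one value, 1.0 when distinct values are uniformly
    spread across records. (An illustrative metric, not a standard.)"""
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))

# A diagnosis field in a breast-cancer-only registry carries no
# distinguishing power; the same field in a mixed claims file does.
breast_cancer_registry = ["C50"] * 1000          # score 0.0
mixed_claims = ["C50", "E11", "I10", "J45"] * 250  # score 1.0
```

On the registry the score is 0.0 and on the mixed claims file it is 1.0, matching the intuition in the paragraph above.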

Determining risk is a solvable computational problem: make assumptions about the adversary's knowledge and which quasi-identifiers they are likely to know, consider all combinations of those quasi-identifiers, and then manage the risk for every combination.
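The procedure above can be sketched in a few lines; this is a minimal illustration using the worst-case "prosecutor" risk (one over the smallest equivalence-class size), with toy records and quasi-identifier names assumed for the example, not taken from any actual anonymization engine:

```python
from collections import Counter
from itertools import combinations

def max_reid_risk(records, quasi_identifiers):
    """Worst-case ("prosecutor") risk: one over the size of the
    smallest equivalence class induced by the chosen quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(groups.values())

def risk_by_combination(records, assumed_knowledge):
    """Evaluate the risk for every non-empty combination of the
    quasi-identifiers the adversary is assumed to know."""
    return {
        combo: max_reid_risk(records, combo)
        for k in range(1, len(assumed_knowledge) + 1)
        for combo in combinations(assumed_knowledge, k)
    }

# Toy file: four records, three assumed quasi-identifiers.
records = [
    {"zip": "02139", "yob": 1980, "sex": "F"},
    {"zip": "02139", "yob": 1980, "sex": "M"},
    {"zip": "02139", "yob": 1975, "sex": "F"},
    {"zip": "02139", "yob": 1975, "sex": "M"},
]
risks = risk_by_combination(records, ["zip", "yob", "sex"])
```

Here the ZIP code alone yields a risk of 0.25 (all four records share it), while year of birth plus sex makes every record unique (risk 1.0), so risk would have to be managed for that combination.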

Some special types of data require specialized techniques. There are good techniques to de-identify geo-spatial information (including movement trajectories), dates and long sequences of dates (for example, transactional data), and streaming data—that is, data that is continuously being updated.

If de-identified properly, open data is not particularly useful for further attacks because it has no identifiable information, and the success rate of linking these data to other data should be small. Decent data can be created for public release, and we can add terms of use or conditions in order to release higher quality data.

Brad Malin described the de-identification system for DNA sequence data that his team constructed. The database contains records for 2 million patients and biospecimens for 200,000 patients, and the data is being used by 200 researchers (subject to a data use agreement, or DUA, with the National Institutes of Health).

His team published a paper in June 2014 on a probabilistic model for patient disclosure based on estimating population uniqueness across datasets (Sattar et al. 2014). One needs to be cognizant of data over time: if a data holder anonymizes someone in different ways at different points in time, this may actually make that person easier to identify.

Research has shown the variety of characteristics and behaviors that can distinguish an individual. These characteristics and behaviors include demographics, diagnosis codes, lab tests, DNA, health survey responses, location visits, pedigree structure, movie reviews, social network structure, search queries, Internet browsing, and smart utility meter usage. A study he conducted found that re-identification risk was substantially greater for a HIPAA limited dataset than for a dataset protected with HIPAA Safe Harbor methods.

A simplified view of risk is that the probability of re-identification is approximately equal to the product of the probability of an attack and the probability of re-identification conditional on an attack. Deterrents to attack include DUAs, access gateways, unique login IDs and passwords, and audits. Data characteristics that affect the conditional probability of re-identification include uniqueness, replicability, availability, and cost.
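The simplified view reduces to a single product; the sketch below uses assumed, illustrative probabilities to show how deterrents that lower the attack probability lower the overall risk:

```python
def overall_reid_risk(p_attack, p_reid_given_attack):
    """P(re-identification) ~= P(attack) * P(re-identification | attack)."""
    return p_attack * p_reid_given_attack

# Illustrative (assumed) numbers: deterrents such as DUAs, access
# gateways, and audits lower P(attack); data characteristics such as
# uniqueness and replicability drive P(re-identification | attack).
without_deterrents = overall_reid_risk(0.4, 0.1)   # 0.04
with_deterrents = overall_reid_risk(0.05, 0.1)     # 0.005
```

The conditional probability is unchanged by the deterrents; only the chance that an attack is mounted at all goes down.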

Latanya Sweeney began her remarks by noting that this conversation is not much different than it was in 1997, but the world has changed a lot since then. The Data Privacy Lab at Harvard University initiated the DataMap project (thedatamap.org) to document where personal health data goes outside of the doctor-patient relationship. Maps show the flow of data from the patient to various entities and from the physician and hospital back to the patient. Flows that do not directly involve the patient are numerous, and less than half of the documented data flows are covered by HIPAA, including inpatient discharge data transmitted without explicit identifiers.

A study she led found that only three of the 33 states that sell or share de-identified versions of their hospital inpatient discharge data are using HIPAA standards to protect the data. In a separate study, her team purchased a public use version of patient-level hospital discharge data from Washington State, and using accounts of accidents published in newspapers in 2011, the team was able to re-identify 43 percent of a sample of 81 accident victims in the hospital discharge data based on characteristics reported in both sources.

With colleagues she submitted a FOIA request to determine who buys publicly available health data, and found that predictive analytics companies are the biggest buyers. They are producing data products that exploit publicly available health data.

There are four ways to add transparency to the system: (1) public notice of privacy breaches should be required; (2) data holders should be required to list publicly those with whom they share data; (3) each person should be able to acquire copies of their personal data from any entity holding their data; and (4) each person should also be able to acquire an audit trail of the organizations with which the data was shared.

Re-identification is a key part of the cycle of improving the protection of data. We improve protective techniques only after protections fail. For example, encryption techniques have improved because they were used, problems were identified, and better techniques were developed. We now have strong encryption, and we need the prevention of re-identification to advance to that stage as well.

Denise Love explained that the National Association of Health Data Organizations (NAHDO) has been involved for years in discussions regarding these issues with states. The state data agencies have come up with solutions to balance transparency and confidentiality.

The state inpatient discharge and all-payer claims data systems are essential to public health and multiple other purposes, including public safety, injury and disease surveillance, health planning, market share analyses, quality assessments and improvement, and identification of overuse/underuse/misuse of health care services.

There is a critical “iron triangle” to public data, representing three principles of data policy: transparency, data utility, and data safety. There must be a balance among all three. Over-emphasis on any one of the three does not serve the public good.

DUAs can mitigate the risk of inappropriate use. The Washington State case is the first such breach that has come to light. NAHDO spent a year developing guidelines for data release by states, which were published in January 2012, but Washington State was not following these guidelines.

Daniel Barth-Jones discussed his recent work using uncertainty analysis through a flow chart that lays out several components including intrusion scenarios and information on what variables are needed by an intruder for re-identification. Adding an uncertainty distribution at each step of the flowchart gives a sense of how the data protection and disclosure avoidance techniques can reduce re-identification risk.

Intrusion scenarios include a "nosy neighbor" attack, a mass marketing-type attack to re-identify as many individuals as possible for marketing purposes, and a demonstration attack by an academic researcher or a journalist. There could be as many as 3,000 potential variables/data elements. However, because the data are often inaccurate and an intruder typically cannot build a complete population register, re-identification attempts frequently produce false positives. Each step in the flow chart has a probabilistic distribution; one can then sample across the scenario with a hyper-grid multiple times, which gives a robust picture of the re-identification risk. Dependencies among the steps in the chain determine the economic motivation, or benefit, to the intruding entity.
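A minimal Monte Carlo sketch of this kind of uncertainty analysis follows; the steps, distribution choices, and numeric ranges are illustrative assumptions, not Barth-Jones's actual model:

```python
import random

def simulate_reid_risk(n_draws=10_000, seed=1):
    """Sample an uncertainty distribution at each step of an intrusion
    flow chart and propagate the draws through, yielding a distribution
    of re-identification risk rather than a single point estimate.
    All ranges below are assumed for illustration."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        p_attempt = rng.uniform(0.01, 0.20)   # an intrusion is attempted
        p_accurate = rng.uniform(0.50, 0.95)  # quasi-identifiers recorded correctly
        p_register = rng.uniform(0.10, 0.60)  # intruder's population register coverage
        p_unique = rng.uniform(0.01, 0.30)    # target unique on the known variables
        draws.append(p_attempt * p_accurate * p_register * p_unique)
    draws.sort()
    return {"median": draws[n_draws // 2], "p95": draws[int(n_draws * 0.95)]}

result = simulate_reid_risk()
```

Reporting a median and an upper percentile, rather than one number, is the point of the exercise: it shows how much the risk estimate depends on the assumptions at each step.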

It is important to consider the impact of de-identification on statistical analysis. Poorly implemented de-identification can distort multivariate relationships and hide heterogeneities. Data reduction through sampling and other means can destroy the ability to identify heterogeneity among the races, or by educational level, for example.

A forthcoming paper by T.S. Gal et al. evaluates the impact of four different anonymization methods on the results obtained from three different types of regression models estimated with colon cancer and lung cancer data. For each combination the authors calculated the percentage of coefficients that changed significance between the original data and the anonymized data.
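The paper's comparison metric can be sketched as follows; the function and the p-values are hypothetical illustrations of the calculation, not results from Gal et al.:

```python
def pct_significance_flips(p_original, p_anonymized, alpha=0.05):
    """Percentage of coefficients whose significance verdict (p < alpha)
    differs between a model fit on the original data and the same model
    fit on the anonymized data."""
    flips = sum(
        (p_original[name] < alpha) != (p_anonymized[name] < alpha)
        for name in p_original
    )
    return 100.0 * flips / len(p_original)

# Hypothetical p-values for three coefficients; only "stage" flips.
p_orig = {"age": 0.01, "stage": 0.03, "smoker": 0.20}
p_anon = {"age": 0.04, "stage": 0.12, "smoker": 0.18}
flip_rate = pct_significance_flips(p_orig, p_anon)  # 1 of 3 coefficients
```

A high flip rate for a given anonymization method signals that the method is distorting the inferences researchers would draw from the data.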

HIPAA imposes no penalty on a user who re-identifies data, even when the claimed matches are false positives; currently, false positive identifications carry no cost. Imposing a cost on false positive identification would change the economic incentives for re-identification attempts.

Moderator Steve Cohen identified the following themes during these presentations: game theory, data resources that allow for a breach, and how to simulate the threat of disclosure. He asked the panelists to address where we are heading in the next five years to address these threats.

Malin responded that social media is a serious threat: people self-disclose and disclose about others. His team is doing research on how Twitter is used, and preliminary findings are that people talk more about others than about themselves, and it is a minefield of potential disclosures (for example, “pray for my mom, she has breast cancer”). Another challenge is that electronic health records are becoming commercialized, and start-ups are using data without regulation, which is a big loophole.

Sweeney added that no one is really studying the predictive analytics industry, so we do not know how big an industry it is. Re-identification is a way of illustrating risk; the risk is large although unquantified. We do not know how much re-identification really goes on, because DUAs do not stop it; they merely hide it, since the penalties are so draconian. Federal agencies should try to figure out how to link data in a secure way in the cloud to produce aggregated data for the public.

Barth-Jones stated that the future concern is harm from bad de-identification practice—from bad science and inefficiency. We should focus on reducing bad de-identification practices.

Love is concerned that data will be too protected, and opt-in/opt-out will be disastrous for public health and for population health (an example is when parents do not vaccinate their children).

El Emam noted that techniques are becoming more sophisticated, including techniques for protecting data. Risks can be managed with appropriate de-identification practices.

Malin recommended that data holders have a dialogue with the community regarding use of data for research purposes. They should create an advisory board, keep it in place, and make its members partners. This will reduce the risk of research being shut down in the event of a breach.
