Anonymized genetic research data still carries privacy risks

New technology has allowed scientists to thoroughly comb the genome for …

Until recently, looking for the DNA changes that contribute to human genetic diseases was a laborious process that involved tracking those changes through generations of individual families. The completion of the human genome sequence has changed all of that, allowing researchers to check for hundreds of thousands of individual DNA changes in large populations and to identify those associated with specific genetic diseases. As the number of people genotyped grows, sharing data could increase the statistical power of these experiments. But researchers are now cautioning that sharing the data might allow someone to learn about the people who contribute DNA samples to these studies.

The research involves a technique called a genome-wide association study, or GWAS. A GWAS relies on a catalog of over half a million individual locations within the human genome that commonly vary between individuals. Technology like DNA chips allows researchers to take a DNA sample from a volunteer and obtain a genotype at all 500,000-plus sites (called SNPs, for single-nucleotide polymorphisms) in a single experiment. Because of this convenience, it's possible to obtain data from thousands of participants, providing genetic mapping with far more statistical power than a few family pedigrees can.

That power allows a GWAS to identify rare or weak genetic changes. By comparing two large panels (DNA donors with a specific genetic disorder, and those without), it's possible to detect specific SNPs that are more frequently associated with the disorder. These SNPs can either be a causal genetic change themselves, or simply reside near enough to one that the two get inherited together.
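As a rough illustration of that case/control comparison, here is a minimal Python sketch of a standard allelic chi-square test at a single SNP. It is not taken from any of the studies discussed here; the function names and the allele counts are made up for the example, and a real GWAS would also correct for testing hundreds of thousands of SNPs at once.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Rows are the case and control panels, columns are the two alleles.
    Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_sums)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to the disorder.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: 1,000 cases and 1,000 controls, two alleles each.
chi2 = allelic_chi_square(case_alt=600, case_ref=1400, ctrl_alt=500, ctrl_ref=1500)
print(f"chi-square = {chi2:.2f}, p = {p_value_1df(chi2):.2g}")
```

The same calculation would be repeated at every SNP on the chip, which is why larger panels matter: small differences in allele frequency only stand out from chance when the counts are large.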

The importance of having large numbers of individuals has led to calls for the sharing of GWAS data. So, for example, anyone who runs a new study of a specific disorder, like diabetes, could combine their new data with those from earlier studies to provide ever greater genetic resolution. It would allow scientists to build upon the work of others to an unprecedented degree.

The new work, however, suggests that this sort of data sharing poses a significant privacy risk for the individuals that have donated their DNA for the research. The first paper appeared in PLoS Genetics back in August. The researchers in this case were interested in the use of SNPs for forensic purposes, and they tested their ability to determine whether an individual's DNA was present in a sample that contains a mixture of DNA from multiple individuals. Using data from publicly available sources, the authors determined that an individual's presence in a sample could be determined even if it was contaminated with DNA from 100 other people.

Although they focused on forensics, the authors recognized that their work had wider implications. Researchers had been anonymizing genetic data for sharing by pooling it, which they expected would obscure the contribution of any individual. But the forensic researchers clearly demonstrated that, given an individual's SNP data, they could determine whether that individual was in a pool.
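To make the idea concrete, here is a small simulated sketch in Python. The article does not spell out the authors' exact method, so the scoring function below is an assumption: a simple distance-based statistic that asks, SNP by SNP, whether a person's genotype sits closer to the pooled allele frequencies than to a separate reference sample. All names, panel sizes, and allele frequencies are invented for the illustration.

```python
import math
import random


def membership_t(person, pool_freq, ref_freq):
    """T-like score: positive when `person` is closer to the pool than to the reference."""
    diffs = [abs(y - r) - abs(y - m)
             for y, m, r in zip(person, pool_freq, ref_freq)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def simulate_person(allele_freqs, rng):
    """Per-SNP allele fraction (0, 0.5, or 1) for one simulated diploid individual."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in allele_freqs]


def pooled_frequencies(people):
    """Per-SNP average across individuals, i.e. what a pooled data release exposes."""
    return [sum(column) / len(people) for column in zip(*people)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
allele_freqs = [rng.uniform(0.1, 0.9) for _ in range(10_000)]  # made-up SNP frequencies
pool = [simulate_person(allele_freqs, rng) for _ in range(50)]       # the shared panel
reference = [simulate_person(allele_freqs, rng) for _ in range(50)]  # unrelated sample
outsider = simulate_person(allele_freqs, rng)                        # not in the pool

pool_freq = pooled_frequencies(pool)
ref_freq = pooled_frequencies(reference)
member_t = membership_t(pool[0], pool_freq, ref_freq)
outsider_t = membership_t(outsider, pool_freq, ref_freq)
print(f"member: {member_t:.1f}, outsider: {outsider_t:.1f}")
```

Each person shifts the pooled frequency at any one SNP only slightly, but the shift is in a consistent direction, so summing over thousands of SNPs pulls a member's score well away from zero while an outsider's score stays near it. That is why pooling alone does not hide an individual once an attacker has their genotype.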

That apparently caught the attention of a different group of researchers, who designed tools specifically with the intention of identifying individuals within GWAS data. Given an individual's SNP genotype, their software could identify not only whether that individual was present in a given GWAS panel, but also which population they belonged to: those affected by the genetic disorder, or the unaffected control population. The authors could also identify, with a reasonable degree of certainty, whether a close family member, such as a sibling or parent, was in the experimental population.

The authors also quantified some of the other factors that can influence successful identification, such as the size of the experimental group and the ethnic homogeneity of the control population. And it's important to emphasize that the techniques would require obtaining a DNA sample from someone to work, which is an invasion of privacy in and of itself.

But the general message is pretty clear: the current methods for sharing genetic data pose a real privacy risk for those who have generously volunteered to allow their DNA to be used for medical research, and the degree of risk is likely to increase with the ever-expanding volume of genetic data available. Researchers not only have an ethical obligation to protect these volunteers; in many cases, they're required to do so by law or federal policy, which set strict rules for the protection of privacy in regard to human samples.

11 Reader Comments

Unfortunately, this will also be abused by the federal government and medical insurers at the first opportunity regardless of what laws get passed against said usage...there will always be some exemption for "subpoenaed" data or some mishandling that allows it out in the wild.

If it's anonymized as poorly as internet search and the like, it's not anonymized at all.

I'm not getting the risk here. To do any of the things mentioned in the article doesn't the "attacker" already have to have the genetic material of the given person? So what kind of privacy is being lost? Just the fact that the person volunteered for this study? I guess that is something, but that isn't the basis for any of the issues described. I must be missing something.

The article says that one needs DNA from a specific person in order to identify if that person is in the pool... but the article also leaves open a side door. Suppose two siblings participate in a study. One of them has a particular genetic disorder and that sibling gives a DNA sample to their insurance company. (And let's suppose this was consensual and the insurance company agreed not to use the information to deny coverage.)

It seems from the article that the insurance company could then determine if the other sibling was in the study and whether or not that second sibling had any other genetic disorders. (I say "other" because the insurance company would already know the second sibling was at high risk for it because the first sibling had it, without needing to go through DNA analysis.)

If I'm interpreting the article correctly, that means that a decision by a person to give DNA information to an insurance company could infringe on the privacy of their relatives if they also participated in genetic studies.

quote:

Unfortunately, this will also be abused by the federal government and medical insurers at the first opportunity regardless of what laws get passed against said usage...there will always be some exemption for "subpoenaed" data or some mishandling that allows it out in the wild.

If it's anonymized as poorly as internet search and the like, it's not anonymized at all.

I am confused by what you say about how the Government will "abuse" this information at the first available opportunity. My question is, how so? You honestly think the government cares about being able to identify you through "discrete data"? You obviously must not pay your taxes....

Yes, you need someone's DNA in order to know whether they are included in one of the datasets, although there was an interesting article in the New Scientist a few months ago where they showed you could surreptitiously obtain samples of other people's DNA and have it sent to 23andMe or another company and get usable data back.

quote:

The article says that one needs DNA from a specific person in order to identify if that person is in the pool... but the article also leaves open a side door. Suppose two siblings participate in a study. One of them has a particular genetic disorder and that sibling gives a DNA sample to their insurance company. (And let's suppose this was consensual and the insurance company agreed not to use the information to deny coverage.)

It seems from the article that the insurance company could then determine if the other sibling was in the study and whether or not that second sibling had any other genetic disorders. (I say "other" because the insurance company would already know the second sibling was at high risk for it because the first sibling had it, without needing to go through DNA analysis.)

This isn't nearly as much of a concern as you might think. Last year GINA was signed into law, and it specifically prohibits insurance companies from using genetic information (including family history) to deny coverage etc.

The problem is not so much insurance company snooping (for once...), but the fact that for all the HIPAA protections and de-identifications that are being used, the NIH is not immune from subpoenas and law enforcement inquiries. This was confirmed to me when I asked the people who wrote the guidelines for GWAS data sharing arrangements. The latter are unfortunately binding for anyone who is requesting federal funding, and private foundations typically follow the same rules.

This is probably pretty unlikely to be a common issue, but the fact is that SNP datasets generated by GWAS studies are basically "super-fingerprints" of persons, making unambiguous identification easy if you have some DNA you want to match.

The damage is not so much what law enforcement or Congress might do, but what you are forced to tell patients when you are asking them to donate samples for scientific studies: "NO, I cannot guarantee that no one will be able to grab your data..."

quote:

The importance of having large numbers of individuals has led to calls for the sharing of GWAS data.

Once a dataset is expanded beyond the original PIs, you start losing control of who can do what with the data, or who just gets the data. Just think how fast one's personal data gets disseminated on the net.

Legal safeguards really only work when people are willing to follow the rules, and as we know, there are plenty of individuals willing to circumvent the law to their own ends. We need better methods for controlling anonymity. Maybe DRM has a place after all...

Quite a confusing write-up. There is absolutely zero privacy risk in sharing genetic data of individuals. While genetic data describes and indeed refers to how you are made, it bears no relation to who you are as a person. The privacy concern is directly linked to the sharing of personally identifying information associated with the genetic data.

Take the scenario where police have a genetic sample, and can identify that this sample is without doubt the same person as sample XXXXXXX in the registry of millions. This only distinguishes one individual from millions of other individuals; it does not reveal who the sample came from. It only becomes a privacy risk when the personally identifying information (name, address, SIN) is brought into the picture, which is not part of your genetic information.

As for improper medical uses, e.g., refusing insurance, that is simply due to the fact that many countries, including the USA, have chosen a system of non-universal/private insurance for coverage. When you give a faceless company the ability to make profit by denying coverage, you can expect no less.

Originally posted by SgtCupCake: I am confused by what you say about how the Government will "abuse" this information at the first available opportunity. My question is, how so? You honestly think the government cares about being able to identify you through "discrete data"? You obviously must not pay your taxes....

Sorry, I'll put on my tin foil hat now. Don't trust the government as far as I could throw the Washington Monument.

Originally posted by mrjk: I'm not getting the risk here. To do any of the things mentioned in the article doesn't the "attacker" already have to have the genetic material of the given person? So what kind of privacy is being lost? Just the fact that the person volunteered for this study? I guess that is something, but that isn't the basis for any of the issues described. I must be missing something.

You aren't missing anything. The problem is both overblown and misstated. If I run a genetics study of volunteers and then send their genetic profiles to a second research center, those profiles will have no personal identifiers (just code numbers). The researchers who receive the genetic profiles will have no idea who is enrolled in my study, and they will not be able to identify anyone just by looking at the data.

Theoretically, if someone at the second research center had a copy of your genetic profile, he could scan the database of my volunteers and find your pattern. But that could only happen if you had previously volunteered to be genetically profiled by that research center. Therefore, they could find out that you participated in my study only if you had already participated in theirs. That's not much of a privacy risk.

The deidentified data (data stripped of unique identifiers like names, phone numbers, and SSN) commonly used in research is not protected by the HIPAA federal privacy law, which raises serious concerns. Without strong privacy protections, it is hard to track and ensure oversight of this data, and above all, prevent the risk of reidentification. A big challenge is figuring out how to deidentify data to reduce or eliminate privacy risks, while still keeping enough of it so it can be used for research purposes. The American Recovery and Reinvestment Act of 2009 (ARRA) provides opportunities to strengthen current deidentification standards and expand ways the data can be anonymized. These issues are discussed in greater detail in CDT paper http://www.cdt.org/healthpriva...90625_deidentify.pdf.

Despite the issues facing identification, we now have greater protections in place for genetic data and how it is used than we did before. The Genetic Information Nondiscrimination Act provides individuals with new federal protections against genetic discrimination in health care coverage and employment situations. More on this at: http://blog.cdt.org/2009/10/08...ndiscrimination-act/