This post is part of Bill of Health‘s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

Scientists should share. Methods, samples, and data — sharing these is a foundational aspect of the scientific method. Sharing enables researchers to replicate, validate, and build upon the work of colleagues. As Isaac Newton famously wrote: “If I have seen further it is by standing on the shoulders of giants.”

When scientists study humans, however, this impulse to share runs into another motivating force — respect for individual privacy. Clinical research has traditionally been conducted using de-identified data, and participants have been assured privacy. As digital information and computational methods have increased the ability to re-identify participants, researchers have become correspondingly more restrictive with sharing. Solutions are proposed in an attempt to maximize research value while protecting privacy, but these can fail — and, as Gymrek et al. have recently confirmed, biological materials themselves contain highly identifying information through their genetic material alone.

When George Church proposed the Personal Genome Project in 2005, he recognized this inherent tension between privacy and data sharing. He proposed an extreme solution, cutting the Gordian knot by removing assurances of privacy:

If the study subjects are consented with the promise of permanent confidentiality of their records, then the exposure of their data could result in psychological trauma to the participants and loss of public trust in the project. On the other hand, if subjects are recruited and consented based on expectation of full public data release, then the above risks to the subjects and the project can be avoided.

After this, the PGP protocol was relaxed to no longer request public sharing of names, but the high likelihood of re-identification was consistently communicated. This possibility is discussed in the consent form. Participants are given a study guide and must pass an online exam demonstrating their knowledge of potential risks. Participants are then invited to share as much (or as little) as they wish on their public profile (e.g. health history, ZIP code, and direct-to-consumer genetic testing data). Genome sequencing is performed for some participants, and the results are shared publicly on the same profile pages. As our database has grown, it should come as no surprise that many participants within it are, in fact, identifiable.

I’d also like to invite suggestions, here and elsewhere, for data use policies the PGP could request of those using participant data. Our project is committed to sharing data without restriction — but we could also publish some recommended behavior for working with this data.

Some participants may be unhappy having their identity connected to sensitive information on their profile. Some groups have suggested modifying data to minimize details (while maintaining scientific utility — e.g. removing the last two digits of the ZIP code). I would emphasize a different response. If participants are concerned about being connected to sensitive data, they should remove that data — not obfuscate the profile in the hopes of maintaining anonymity. The Erlich lab has demonstrated that Y-chromosomes can be matched to surnames, and we’re sure many other genetic data identification methods will be found in the future. One could imagine researchers in the future making facial predictions from genetic data. We expect all participants will eventually be identifiable through genome data alone.

In other words, my advice to participants is to treat your PGP profile as if your name were already on it. Sharing sensitive personal information can greatly benefit society, but don’t share because you think it’s anonymous. If you’re worried about people learning that you’ve used cocaine or had an abortion, remove those details — not your ZIP code.

I’m wondering if you would elaborate on your concluding advice for PGP participants: “If you’re worried about people learning that you’ve used cocaine or had an abortion, remove those details — not your ZIP code”? I know that the microbiome is all the rage now, making all research local, and so I can see where full zip code, and not just the truncated three-digit zip code that Latanya advises, would be useful to researchers. But is there any reason to think that that information is necessarily more scientifically valuable than trait information pertaining to drug addiction, genetic disease, or other sensitive health information, such that if participants wish to reduce (not eliminate) the risk of re-identification, they should scrub trait data rather than zip code?

Okay, not enough coffee, apparently. I think I see what you mean: It’s not about the relative scientific value of zip codes versus phenotype data. It’s that because genomic info is inherently identifying and you predict that participants will eventually be identifiable through that data alone, the only way to preclude people learning sensitive facts about them is not to include them in their research data in the first place; truncating one’s zip code won’t preclude their being re-identified directly from genomic data, and hence associated with the sensitive trait data they choose to share. So they should simply not include any sensitive information about themselves that they wouldn’t want known, rather than truncating their zip code, which would have no effect on that method of re-identification (and would limit some research to boot). Is that right?

I posted this on the DNA Art post, but since you linked to the same article about the artist making “3D masks” of anonymous individuals from their discarded DNA, and since it does have to do with re-identification, I thought I’d repost it here.

The art project strikes me as misleading and as likely to lead (indeed, already has led) to overblown reactions. It appears that she’s basing these masks on very little data: (1) presence or absence of the SRY gene, which tells her whether the cigarette butt or gum, or whatever, was in the mouth of a male or female; (2) mtDNA Haplotype, which tells you something pretty broad (e.g., “Eastern European”) about only half of the person’s ancestry (the half on his or her maternal line); and (3) the person’s genotype at the HERC2 gene, which *predicts,* probabilistically, one’s eye color — and only for those of European ancestry, because, alas, that’s the data we currently have. (So, for instance, in Europeans, those who are homozygous for HERC2 (AA), as all four of her samples are, have an 85% chance of having brown eyes; a 14% chance of having green eyes; and a 1% chance of having blue eyes. To take another example, I am heterozygous (AG), which gives me a 56% chance of having brown eyes; a 37% chance of having green eyes; and only a 7% chance of having blue eyes. Although she would have predicted that I would have brown eyes, I in fact have blue eyes.)
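To make the arithmetic in that parenthetical concrete, here is a minimal sketch in Python. It uses only the percentages quoted above for people of European ancestry; the table name and the two helper functions are hypothetical names chosen for illustration, not part of any real genotyping tool. It shows that picking the single most likely eye color still leaves a substantial chance of being wrong.

```python
# Probabilities as quoted in the comment above (European ancestry only).
# The dictionary and function names are illustrative assumptions.
EYE_COLOR_PROBS = {
    "AA": {"brown": 0.85, "green": 0.14, "blue": 0.01},
    "AG": {"brown": 0.56, "green": 0.37, "blue": 0.07},
}


def most_likely_color(genotype):
    """Return the eye color with the highest quoted probability."""
    probs = EYE_COLOR_PROBS[genotype]
    return max(probs, key=probs.get)


def chance_prediction_is_wrong(genotype):
    """Probability that the single most likely color is not the true one."""
    probs = EYE_COLOR_PROBS[genotype]
    return 1.0 - max(probs.values())


if __name__ == "__main__":
    for genotype in EYE_COLOR_PROBS:
        guess = most_likely_color(genotype)
        miss = chance_prediction_is_wrong(genotype)
        print(f"{genotype}: guess {guess}, chance of a wrong guess {miss:.0%}")
```

On these figures the best guess for both genotypes is brown, yet for the heterozygous (AG) case that guess misses a little under half the time, which matches the blue-eyed counterexample given above.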

All of which is just to say that, so far as I can tell, she’s working with sex; ancestral groups that are usually very broad, and in any event only reflect half of the individual’s DNA (from which she presumably guesses hair color and texture and bone structure); and a decent guess at eye color. There are hundreds of thousands (at least) of people who would fit these descriptions even if each of her phenotype predictions were accurate, and in many cases, one or more of those predictions are probably going to be wrong.

And yet her masks, and the publicity they’ve generated, suggest that simply genotyping the saliva on a leftover dinner glass could easily re-identify someone by creating a 3D mask that resembled the proband’s unique face. As Big Think disappointingly puts it, “imagine walking into your local coffee shop and seeing your face up on the wall without ever having posed for a photo or portrait. . . . The only thing that [DNA] can’t tell you, apparently, is the specific age of the person.” Oh brother.

I appreciate that she acknowledges that the masks may look more like a “cousin” of the proband than the proband him- or herself, but even that is misleading unless we think of each person as having 100,000 “cousins.” Given that she intends her art to “spark a dialogue over genetic surveillance,” that’s a little troubling.

(Also, with respect to her “self-portrait,” which is being used to suggest how accurate her method is, I’m pretty sure that there are no genes for plucked eyebrows.)

But you, and others in the symposium, may have a different sense of how accurate these masks are likely to be. Curious to hear others’ thoughts.