Identifying People from their DNA

The genetic data posted online seemed perfectly anonymous - strings of billions of DNA letters from more than 1,000 people. But all it took was some clever sleuthing on the Web for a genetics researcher to identify five people he randomly selected from the study group. Not only that, he found their entire families, even though the relatives had no part in the study - identifying nearly 50 people.

[...]

Other reports have identified people whose genetic data was online, but none had done so using such limited information: the long strings of DNA letters, an age and, because the study focused on only American subjects, a state.

Comments

Interesting article, but note that the real risk here was due to the individuals publishing their DNA twice - once in a DNA study and the second time in a genealogy database. It was then possible to correlate the two publications.

Probably not a good idea to become a criminal if you or any of your extended family are into genealogy: presumably the police can track you (or your family members) down using the same technique.

The process the article describes is somewhat similar to what is known as “familial DNA”. The difference seems to be that the author went from thousands of SNP base pairs to the STR markers that are commonly associated with the FBI CODIS system. A true familial DNA search just skips the SNP step and does a one-to-many search of STR markers. California has done these kinds of searches in high-visibility crime cases, but some states are beginning to regulate this.

Let me explain this again. Typical fingerprint searches are one-to-many – e.g., “run those prints against the entire data base and see who pops up”. Typical CODIS DNA searches are one-to-one – “Does the perp’s DNA match Bruce’s DNA?”. In theory, the search the author did (one-to-many) is no different than what is being done today with fingerprints.
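To make the distinction concrete, here is a minimal Python sketch of the three kinds of search mentioned above. Everything in it is an assumption for illustration: the profiles, the names and the repeat counts are made up, and the "shares at least one allele at every locus" rule is only the simplest possible stand-in for a familial search (real systems rank candidates statistically).

```python
# Hypothetical STR profiles: locus name -> pair of repeat counts (made-up values).
sample = {"TH01": (6, 9), "FGA": (21, 24), "D8S1179": (12, 13)}

database = {
    "alice": {"TH01": (9, 6), "FGA": (21, 24), "D8S1179": (13, 12)},
    "bob":   {"TH01": (6, 7), "FGA": (24, 25), "D8S1179": (10, 12)},
    "carol": {"TH01": (7, 8), "FGA": (19, 22), "D8S1179": (14, 15)},
}

def one_to_one(sample, suspect):
    """True if both profiles carry the same allele pair at every locus."""
    return all(sorted(sample[locus]) == sorted(suspect[locus]) for locus in sample)

def one_to_many(sample, database):
    """Run one sample against the whole database and return exact matches."""
    return [name for name, profile in database.items() if one_to_one(sample, profile)]

def familial_candidates(sample, database):
    """Names sharing at least one allele at every locus, as a parent and child would."""
    return [
        name
        for name, profile in database.items()
        if all(set(sample[locus]) & set(profile[locus]) for locus in sample)
    ]

print(one_to_many(sample, database))
print(familial_candidates(sample, database))
```

The point of the toy example is that the familial query returns people who never matched exactly: the hypothetical "bob" shows up only because he shares one allele per locus with the sample.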

However, there are two rubs that should bother privacy advocates and John Q. Public.

First, DNA is being collected today as a normal part of many arrests in many states. This involuntary collection of DNA of all people – whether innocent or guilty – is creating a massive DNA data base. States will say that this collection is similar to taking fingerprints at the time of arrest and is perfectly reasonable. States also say that if found innocent, people can file to have their DNA removed from the data base, but few ever follow up.

Second, DNA is not like a fingerprint. When a person voluntarily (e.g., job application) or involuntarily (e.g., arrest) gives up his fingerprints, he is consciously providing personal information only about himself. When the same individual provides a DNA sample for medical testing or because he is a suspect in a crime, he has now provided personal information about himself as well as his siblings, parents, and future progeny.

The nut of the problem is that giving up DNA waives not only your rights, but other people’s rights as well. This problem is compounded by the fact that this data lives forever and will be impacting children long after the DNA provider is gone.

This is why the law needs to limit the use of “familial DNA searches” and require the destruction of all DNA upon the death of the donor.

As someone who takes part in genealogy and has submitted data to similar DNA databases, I am not in any way bothered by the possibility that criminals in my near or extended family may be identified through such methods. My DNA is my property and I shall do with it as I please, and what pleases me is using it to track down my genealogy.

I am greatly concerned though that laws limiting familial DNA searches could significantly impact my ability to do so.

Just as people that are not affiliated with the police can ask questions and perform searches that the police aren't permitted to do, there is no violation of any criminal's rights due to voluntary submission of DNA data by non-criminals.

If the data is meaningful, it's probably not anonymous. The re-identification of this data was predictable and predicted.

"The PII fallacy has important implications for health-care and biomedical datasets. The “safe harbor” provision of the HIPAA Privacy Rule enumerates 18 attributes whose removal and/or modification is sufficient for the data to be considered properly de-identified, with the implication that such data can be released without liability. This appears to contradict our argument that PII is meaningless. The “safe harbor” provision, however, applies only if the releasing entity has “no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information.” As actual experience has shown, any remaining attributes can be used for re-identification, as long as they differ from individual to individual. Therefore, PII has no meaning even in the context of the HIPAA Privacy Rule."

Brian, I appreciate your opinion that “your DNA is your property and you can do with it what you please.” However, imagine you start a new job and are turned down for life insurance because of DNA your ancestor provided for some legitimate reason, or you are put on administrative leave from your job after you’ve been identified as one of a number of “people-of-interest” by police doing a familial search on an ancestor’s DNA.

Your ancestor gave up his DNA for some reason and now your rights are impacted.

I’m also an advocate for genetic genealogy, but this area has a great deal of long term potential for medical and law enforcement abuse.

The GINA act of 2008 attempts to address this from a medical and employment standpoint, but doesn’t preclude use of DNA data for determination of life insurance eligibility.

I'm not in any way bothered by the possibility that criminals in my near or extended family may be identified through such methods. My DNA is my property and I shall do with it as I please, and what pleases me is using it to track down my genealogy.

First off, under US law, and as case law shows, your DNA is not your property.

Secondly whilst you might not care about any "criminals" in your family, how about the rest of them who you could in effect turn into outcasts from society?

DNA tells us many things, and in the near future it will tell us much, much more: for instance, the predisposition to various diseases and the likelihood of an early death.

In the US health care is not blind and it is almost certainly avaricious. We know of people who are effectively unemployable because their health risk is too high for an insurer to willingly take on.

Do you really want to announce to all who find your DNA that your family may not be good "breeding stock"?

And don't think insurance companies won't go looking and exploit it irrespective of what legislation says. They will simply "off shore" it to another country that does not have such legislation.

We have seen these illegal "Don't employ" registers in the UK, with the building industry "blackballing" workers who stand up for their and other workers' right not to work in significantly dangerous conditions simply because some manager, to cut costs, says it's "safe", even though it's obvious to anybody with eyes and a brain that it is very far from safe.

So I suggest you actually have a think and then go and discuss the risks of your reckless behaviour with the rest of your family before they come and talk to you because they've been hurt by your hobby.

I am not in any way bothered by the possibility that criminals in my near or extended family may be identified through such methods.

Remember, it's not just breakers of laws you agree with, or that apply in your jurisdiction, that can be identified via DNA.

Are you bothered that people who are "criminals" because they break the law of a repressive dictatorship that prohibits distributing pro-democracy literature, might be identified? Or is that fine too because "a criminal is a criminal is a criminal"?

I wonder how familial DNA searches deal with bastards, and whether the existence of bastards can be used to undermine their admissibility in court (or to get warrants, or...). Bonus points if there are correlations between illegitimacy and criminality to make the appearance of the bastard problem even likelier in situations where someone might want to use familial DNA.

I, for one, think we should all have our DNA on file, which can then be used for such things as credit card purchases & identity cards. Goodbye passports. And password problems? Fugghedaboutit! DNA scans on every terminal. ^_^ ^_^

Related tangent/mindblow, using DNA to store digital info.
--The perfectly uniform fragment lengths and absence of homopolymers make it obvious that the synthesized DNA does not have a natural (biological) origin, and so imply the presence of deliberate design and encoded information.

So, would you use information about your APOE status to make a decision about the purchase of LTC insurance? Do you worry that the distribution and use of knowledge about APOE4 will impact the availability and cost of LTC insurance?

First, my DNA only reflects that of my parents. My APOE status (for example) wouldn't match that of a first cousin, much less that of a more distant relative. I have no siblings so I'm not putting them at risk either. I also have no children and will not have any, so no future risk there. I don't have the APOE4 risk alleles so I have no risk there.

Next, it surprises me that commenters here would essentially suggest security through obscurity to protect others by not making DNA available. We shed DNA constantly, everywhere. Anyone could collect DNA just by walking behind you. Within X number of years we will have cheap, hand held consumer grade full genome sequencers. Your DNA will be the private key used to start your car, unlock your tablet and so on. Like encryption a DNA sequence is easy to store but requires "cycles" to "compute" (produce in quantity at high fidelity as we shed skin cells). We are in a brief interregnum where DNA identification is a technology available to police agencies (at some expense) but not yet in our grubby paws. It'll get there. And yes one can clone another's DNA just as we clone a key, but locks exist to keep honest people out, not criminals, as our host here has said so often.

It's worth noting that only Y-DNA is useful for identifying a large enough set of people with any level of specificity so it's only men whose data could be an issue here. Mitochondrial DNA is not helpful - a 100% mitochondrial sequence match has a 99% chance of sharing a common direct female ancestor within 16 generations. That's a lot of people and those of us that TRY to track such things down can't do it. Autosomal DNA is a little better, but only very close matches stick out. A ninth cousin of mine shares a single 9cM, couple thousand SNP half match on one chromosome. Less than 0.1% match overall. Hardly over statistical noise but we both descend from Nicholas Pelletier, early 1600s France-Canada immigrant. As do about 95% of French Canadians.

Y DNA is better but still has to be done at high resolution to be very useful. Much old data in these databases only had a couple dozen STRs tested. The people in the story posted had their full sequence posted and very few people have full Y DNA sequences available. Getting to a surname is the whole point, and it's very useful for that.

But guess what - your great grandfather was adopted? You don't have that surname and escape that dragnet. Many surnames are overloaded, having been adopted multiple times in different geographic areas and won't match in the genetic sense. It's hard work and the family ties aren't always obvious, as those working in the field know. It's not as though governments have perfect knowledge of everyone's family tree. Even dedicated researchers don't. Many sites make efforts to protect others by masking the names of known living individuals or those born after 1930.

About the criminal to innocent job applicant to freedom fighter slippery slope several folks brought up... Perhaps I'd feel different were I an oppressed minority or had relatives living in repressive regimes. I don't, and I've done enough genealogy to be confident of that.

Security remains an arms race, and just as those seeking anonymity from fingerprints learned to wear gloves, and those seeking anonymity from drones wear magic hoodies, those seeking anonymity from this will develop appropriate measures to defend and defect.

GINA won't last forever. Genetic discrimination will eventually be the rule, not a crime or what if. Not saying that's good or bad, just inevitable.

While I share all your concerns regarding the misuse and abuse of DNA I believe we are all trying to glue shut Pandora's box after the secret is already out.

I remember 1996 discussions with high level officials at Sony about MP3 music players and how this would ruin their industry if they did not get ahead of the curve. They decided to simply try to make downloading illegal; they succeeded in getting the laws but failed miserably in constraining the growth of the MP3 market.

DNA sequencing will be similar. There are portable cheap products (under $10) in development which will sequence DNA in less than a minute. As Brian points out it will be effectively impossible to stop anyone with an interest from sequencing your DNA.

As we all know data once collected NEVER really dies, it always lives on in snap-shots and backups. Eventually even the most secure systems will leak data so we all have to expect that everyone will be instantly identifiable from any DNA they shed unintentionally. Even if the LEO does not have your exact DNA they can now narrow it down to a certain family. Frankly this worries me because so many of our civil rights really just come down to the degree of anonymity that living in a large group affords us.

I know that I have committed crimes in the past, or more correctly actions that would be considered criminal within the jurisdiction in which I was living.

To give a specific example: one time I was held up at knife point in a city with a curfew, in a place where I, as a foreigner, should not have been. Well, things went very badly for the attacker. I know I absolutely left DNA evidence at the site (courtesy of a large gash). Now, I would probably be the last person that LEOs would be looking at for this incident, yet with the help of DNA they could zoom in directly on my family. Frankly, I find this morally wrong.

Think about it: how could I justify what I was doing at that location? It was 25 years ago, my boss at the time is dead, and any notes probably only ever existed on paper, yet the DNA, if properly collected, still lives on.

I personally think that this individual anonymity, that residing in a large group provides, is an essential part of the group's social contract, making it essential to all our lives that anonymity is maintained.

We may not own our DNA, but ever since the DMCA was passed, and biotech companies started patenting and copyrighting various genes or other genetic strings, I've wondered if the DMCA's no-reverse-engineering-a-copyrighted-and-encrypted-code clause would prevent LEOs from sequencing your DNA, provided you held a "poor man's copyright."

@Brian: "As someone who takes part in genealogy and has submitted data to similar DNA databases, [...] and what pleases me is using it to track down my genealogy."

Publishing fingerprints of DNA, rather than the DNA itself, would be enough for genealogy (and for LEO) without raising health insurance concerns.

My idea is to compare blocks of letters (A, C, T and G).
Replace A by 1, C by 2, T by 3 and G by 4. Convolve the resulting series of integers (i1, i2, i3, ...) with a published series of 100000 (arbitrary choice) random double-precision float numbers n1, n2, n3, ... between 0 and 1.

The N-th term of the convolution, denoted pN, is pN = n1*i_{N-1} + n2*i_{N-2} + n3*i_{N-3} + ... (you get the idea). This long sum stops at i1 or at n100000, whichever comes first, depending on N.

Then, for each integer k, publish the number max(p_{100000*k+1}, p_{100000*k+2}, p_{100000*k+3}, ..., p_{100000*k+99999}).

Two people whose DNA sequences have a common subsequence of 300000 consecutive letters will publish at least one common number, provided they used the same random numbers n1, n2, ...

This may also be used to make rsync work for binary files like tar, ...
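For anyone who wants to play with the idea, here is a rough Python sketch of the scheme as described above, scaled down to a window of 4 weights instead of 100000. The function name, the seed and the test sequences are all hypothetical. Note that in this toy example the shared run starts at the same position in both sequences, so it only checks the aligned case; when the shared run sits at different offsets the block maxima need not coincide, so the "at least one common number" guarantee should be read as the proposal's claim, not something this sketch verifies.

```python
import random

# Letter-to-integer mapping as described in the comment.
CODE = {"A": 1, "C": 2, "T": 3, "G": 4}

def fingerprints(seq, weights):
    """Block maxima of p_N = n1*i_{N-1} + n2*i_{N-2} + ... for one sequence.

    Assumes at least two weights, so that every block is non-empty.
    """
    w = len(weights)
    ints = [CODE[c] for c in seq]
    p = [0.0] * (len(ints) + 2)  # p[N] for N = 1..len(ints)+1; p[0] is unused
    for n in range(1, len(ints) + 2):
        total = 0.0
        # The sum stops at i_1 or at the last weight, whichever comes first.
        for j in range(1, min(w, n - 1) + 1):
            total += weights[j - 1] * ints[n - j - 1]  # n_j * i_{N-j}
        p[n] = total
    result = []
    k = 0
    # Block k covers p_{w*k+1} .. p_{w*k+w-1}, following the description.
    while w * k + w - 1 <= len(ints) + 1:
        result.append(max(p[w * k + 1 : w * k + w]))
        k += 1
    return result

# Shared "published" weights; a fixed seed stands in for the published series.
rng = random.Random(0)
weights = [rng.random() for _ in range(4)]

shared = "ACGTTGCAACGT"
seq_a = "AAAAAAAA" + shared + "CCCC"
seq_b = "GGGGGGGG" + shared + "TTTT"

fp_a = fingerprints(seq_a, weights)
fp_b = fingerprints(seq_b, weights)
common = set(fp_a) & set(fp_b)
print(sorted(common))
```

A scheme that is robust to shifted matches would probably need the maxima taken over sliding windows rather than fixed blocks, but that is a different design from the one proposed here.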

What you did not mention is that we know how to take single examples of DNA, chop them up and in effect multiply them millions or more times. This is currently standard practice in labs and a small part of course work of undergraduates. The last time I looked all you need to do this was available with a delivery address, Internet connection and Credit Card. With the sums involved being considerably less than many people spend on their minor hobbies.

Thus it is possible currently to replicate fragments of a chosen person's DNA almost endlessly and contaminate a crime scene with them. Simple DNA testing will not show this up and will thus provide a false indication that a person was at the crime scene.

I drew attention to this many years ago and later put the same issue up on this blog. Eventually an Australian researcher came to the same conclusion and published their findings and created a bit of noise in the media. The result was the addition of a prescreening stage and other changes in DNA testing to either detect or remove the replicated DNA fragments.

However this adds additional direct and indirect costs, and many labs still do not see the need to do it because their LEA customers don't want to pay extra for it. Thus, as in any cost-sensitive free market, the race to the bottom occurs.

Further, there have been improvements in the replication processes. Whilst I'm not aware of somebody getting to the point of "cloning human DNA" (because I've not bothered looking), it may not be required, as a mixture of replicating fragments and then synthesizing them back into full strands may prove to be a more viable route.

But these cheap DNA scanners, when they arrive, will in all likelihood have significant failings, as nearly all new technology does. Thus they might, just like the simple DNA test, be susceptible to spoofing by replicated DNA fragments etc. And further, as with other biometrics, trying to fix the faults will just open other avenues of attack.

But as you pointed out the real DNA is fairly easy to collect, how much would be needed to make a one time attack on the cheap scanner?

That is, would a plastic coating over your finger with a few flakes of dandruff from your selected victim be sufficient to activate the scanner and let you in?

I've known for close to half a century how to fake fingerprints, it was fun to do as a child, I realised a few years later that anyone reading a particular Sherlock Holmes story would have been given sufficient clues on how to do it.

However, fun though it was, it got me into trouble in my later life when working as a design engineer for a company making fingerprint scanners in the early days when they were very expensive. Basically my boss did not want to hear how simple it was to spoof the technology (and still is). And when I showed it to the other engineers... well I was toast. It was again only when another academic researcher showed, many years later, a much more difficult method using photo-etched PCB and gelatin to make "Gummy Bear Prints" that the industry was forced to address the issue (but only for a while; cheap vulnerable scanners are still being put into security products).

So how long do you think it will be after the cheap DNA scanners become available and announced as the ultimate "high security", before somebody finds a 50cent attack against them?

10 years? 5 years? 3 years? 1 year? Or how about before it even gets to production, as it was with me and the fingerprint scanners? And then how long after that before criminals get to do it for significant gain?

I guess you did not think about it before you said "Your DNA will be the private key...". And why your statement of,

It surprises me that commenters here would essentially suggest security through obscurity

Is a nonsense by your own reasoning of the easy availability of peoples DNA. DNA is not going to be "security" only "liability" obscurity or not.

Further you have failed to realise there is a significant difference between the idea of "Anyone could collect DNA just by walking behind you" and having your own DNA sequenced and then putting it into the public domain.

You, not they, bear the cost and time of doing it, and you provide it to them for the price of downloading it, which is as close to zero as makes little difference when talking Government and LEA budgets.

As we are seeing with GPS tracking, it is the cost of doing it en masse that is the ultimate deterrent to those in power committing wholesale surveillance; by willfully removing the cost you are simply saying to them "bring it on".

But worse, your actions can be held up by unscrupulous others as a demonstration of "normal behaviour" and used to falsely sway others' opinions. As such I don't think you realise what harm you are doing to society in general.

So in the future burglars will first vacuum the seats on buses or the mattresses in hotels to collect a bag of dust into which they dip their gloved hands before committing the crime.
You find my signal amongst that noise, CSI!

You ducked or misunderstood the APOE question. The question was whether knowledge of APOE status would factor into decisions about purchase of LTC insurance. We actually know that people who have APOE4 do purchase LTC at higher rates. This impacts insurance companies and everyone else because the overall risk of people in the insurance pool is higher. Either rates go up or LTC companies risk going out of business. The point is that your knowledge of your DNA impacts others.

Also, interesting that because you don't have an APOE4 allele you think you have "no risk". Think again.

Your DNA will be the private key used to start your car, unlock your tablet and so on

I sincerely hope not because it would be a stupid thing to do.

I'm not advocating or recommending this, but I am predicting it. The convenience outweighs the need for security. We live in a world where people can set up their iPhones to "unlock" with a simple swipe; most people won't be concerned by the flaws.

What you did not mention is that we know how to take single examples of DNA, chop them up and in effect multiply them millions or more times. This is currently standard practice in labs and a small part of course work of undergraduates. The last time I looked all you need to do this was available with a delivery address, Internet connection and Credit Card. With the sums involved being considerably less than many people spend on their minor hobbies.

Fragments of DNA won't cut it, in my imagination. I think you're talking about PCR amplification, which produces fragments but not full intact genomic strands of DNA. I foresee full genome sequences used as locks. Replicating portions of DNA is easy; producing a full set of 23 chromosomes with a minimal error rate remains "hard". Like factoring the product of two large prime numbers -- possible yes, but out of reach of most. Fragments aren't sufficient for positive identification in a lock sense, though they may suffice for statistical identification for LEO purposes.

as a mixture of replicating fragments and then synthesizing them back into full strands may prove to be a more viable route.

The exact same flaw exists with encryption. 512-bit RSA was once good enough; 2048-bit DSA remains very difficult. Synthesizing them back into full strands is NOT easy or within our capabilities. Producing the DATA contained in the DNA is easy, producing the DNA itself is not.

But as you pointed out the real DNA is fairly easy to collect, how much would be needed to make a one time attack on the cheap scanner?

Not much. Just lobbing a rock through the window beats every door lock. Locks keep honest people out and do not stop a determined attacker.

So how long do you think it will be after the cheap DNA scanners become available and announced as the ultimate "high security", before somebody finds a 50cent attack against them?

Not long. Then the tech will be enhanced. It will rely on the methylation patterns of your epigenetics rather than the sequence of your DNA. Nobody is (currently) publishing full epigenome data so that remains as private as people think their DNA is at the moment. A continued arms race. Also please note I did NOT refer to DNA authentication as "high security". People will always claim (see the new MEGA upload site) more security than something has, but most people don't care about security anyway, where it interferes with convenience.

Further you have failed to realise there is a significant difference between the idea of "Anyone could collect DNA just by walking behind you" and having your own DNA sequenced and then putting it into the public domain.

Of course there is. Witness the Personal Genome Project, designed *specifically* to identify the legal, ethical and moral issues associated with publishing one's full genetic data. You would be impressed with the hour-long entrance exam and detailed informed consent process required to join.

As we are seeing with GPS tracking, it is the cost of doing it en masse that is the ultimate deterrent to those in power committing wholesale surveillance; by willfully removing the cost you are simply saying to them "bring it on".

The technology exists and the cat cannot be put back in the bag. If society seeks to deter this capability, society will change. Simply choosing to have surnames inherited from one's mother instead of father would destroy the utility of Y-DNA for LEO and genealogists. Dropping inherited surnames entirely would serve the same purpose.

But worse, your actions can be held up by unscrupulous others as a demonstration of "normal behaviour" and used to falsely sway others' opinions. As such I don't think you realise what harm you are doing to society in general.

I'm confident that the post-neo-Luddite DNA-phobics will always be out there fighting the good fight.

You ducked or misunderstood the APOE question. The question was whether knowledge of APOE status would factor into decisions about purchase of LTC insurance. We actually know that people who have APOE4 do purchase LTC at higher rates. This impacts insurance companies and everyone else because the overall risk of people in the insurance pool is higher. Either rates go up or LTC companies risk going out of business. The point is that your knowledge of your DNA impacts others.

You are bringing up the exact issues Bruce covered in Liars and Outliers. One can choose to "defect", have this tested, and purchase insurance appropriately, or not. Of course people will take advantage of knowing their APOE4 status to purchase insurance. And a ton of people WITHOUT such knowledge, but carrying deleterious alleles buy insurance too. They're also adding to the risk pool, it's just not known a priori. If this bothers society we can implement "shall issue" for LTC insurance or replace LTC insurance entirely with alternatives. In countries with socialized medicine people will be seeking treatment anyway, whether they know their status up front or not.

Also, interesting that because you don't have an APOE4 allele you think you have "no risk". Think again.

I'm not sure I 100% understand your point here. I am not sure if you are referring to medical risk or societal/financial risk, so no comment.

Some female burglars have adopted a different tactic to achieve the same outcome.

Male burglars who are frequently caught tend to have a "smash and grab" attitude that makes it abundantly clear to you they've visited as soon as you step into the property. You in turn touch as little as possible and call the police, who in turn touch as little as possible and get the scene of crime officers (forensics) in with a nice pristine crime scene to gather evidence from.

What some female burglars have done is not smash their way in, and only take stuff that is not out on display or going to be quickly noticed by its absence. Thus you come home and nothing appears amiss, and by the time you do miss one or more items, days if not months may have gone by and the crime scene is so contaminated that any signals left by the burglar are going to be difficult at best to detect, let alone process. Further, as a time/date for the crime will be almost impossible to establish, the use of alibis becomes moot.

Another upside of this behaviour appears to be that the women can shift their ill-gotten gains with little or no difficulty because the chances are the item has not been reported as stolen, and may never be...

It was pointed out to me a few years ago by an LEO I used to know quite well that a very significant number of small low value antiques have been stolen property at one time or another, and likewise the bric-a-brac at "salvage yards" and "car boot sales". It might well have been bought and sold a half dozen times since it was stolen, but even if the original owner could identify it the chances are they wouldn't, and at best would make a comment along the lines of "I used to have one just like this"...

In the Times article, Dr. Jeffrey R. Botkin of the University of Utah shows the same reactions we see over and over and over...

First, he questions why any bad person would do such a thing?

Then, he blames the authors of the identification study for increasing the risk.

Not to blame Dr. Botkin, these are classic mistakes made by almost everyone who doesn't "get" security. Bad guys will use vulnerabilities in ways that most honest people wouldn't have thought of. And sooner or later, they will figure out how to do such things, especially a technique as elementary as that presented in the paper.

I think you are fantasizing solutions, LTC insurance, socialized ones or otherwise, to delivering care to the ballooning numbers of people with AD.

You bring up defection. Read what I wrote. You only think you can defect in this case.

On APOE4 risk you wrote "I don't have the APOE4 risk alleles so I have no risk there". You were obviously discussing genetic risk and I was responding to that. It would be more accurate to write something like "neither of my alleles is APOE4 so I do not have a greater than average risk of AD as a result of my APOE status."
