During the recent glut of genome-wide association studies, many researchers were compelled (or chose) not to release their full data sets after publication, citing vague privacy concerns. Instead, they often released only genotype frequencies for the sets of cases and controls, the idea being that individual-level information is lost once the data are pooled.

A new paper in PLoS Genetics shows that this assumption is wrong. The idea, obvious in retrospect, goes like this: assume some individual has genotype AA at a particular locus, while the frequency of A is 10% in the general population and 11% in the pooled sample. Because every member of a pool nudges the pool's frequencies toward his or her own genotype, this gives you a (very slight) amount of evidence that your individual is in the sample. If your individual is TT at another locus where the frequency of T is 25% in the general population and 27% in the pooled sample, this gives you another (again, tiny) bit of evidence that the individual is in the sample. Summed over thousands of loci, these tiny bits add up to quite a lot of information. In fact, the authors are able to reliably tell whether an individual contributed DNA to a pooled sample, even if that contribution is around 0.1%.
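
To make the "summing over loci" step concrete, here is a toy simulation. It is only a sketch of the general idea, not the paper's actual pipeline: the data are simulated, a finite reference panel stands in for the "general population", the per-locus score (distance to the reference minus distance to the pool) is one simple choice of statistic, and every name in it (`draw_genotype`, `membership_z`, `simulate`) is made up for illustration. The pool here has only 50 people, so each member contributes 2% rather than 0.1%; a smaller contribution would need many more loci.

```python
import math
import random


def draw_genotype(rng, p):
    """Allele fraction (0.0, 0.5 or 1.0) of one diploid person at a locus
    where the allele has population frequency p."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def membership_z(person, reference_freq, pool_freq):
    """Per locus, score how much closer the person is to the pool than to
    the reference, then turn the mean score into a t-style statistic."""
    d = [abs(y - r) - abs(y - m)
         for y, r, m in zip(person, reference_freq, pool_freq)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)


def simulate(n_loci=10_000, pool_size=50, seed=1):
    """Simulated data only: returns (z for a pool member, z for an outsider)."""
    rng = random.Random(seed)
    true_freq = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]

    def cohort(size):
        return [[draw_genotype(rng, p) for p in true_freq] for _ in range(size)]

    def freqs(people):
        # Average allele fraction at each locus across the cohort.
        return [sum(column) / len(people) for column in zip(*people)]

    pool = cohort(pool_size)        # the published case or control group
    reference = cohort(pool_size)   # stand-in for the general population
    outsider = cohort(1)[0]         # someone in neither group
    pool_freq, ref_freq = freqs(pool), freqs(reference)
    return (membership_z(pool[0], ref_freq, pool_freq),
            membership_z(outsider, ref_freq, pool_freq))


z_member, z_outsider = simulate()
print(f"pool member: z = {z_member:.1f}")
print(f"outsider:    z = {z_outsider:.1f}")
```

At any single locus the score is dominated by sampling noise, but for a pool member the average score is slightly positive, so the statistic grows roughly with the square root of the number of loci. For an outsider it should hover around zero.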

In theory, then, police with an unknown person's DNA could match it against all published case/control studies to find out whether that person took part in any of them. A more immediate application could be to determine whether a suspect contributed DNA to a crime scene where the mixing of DNA (e.g., blood) from a large number of individuals has muddied the forensic evidence.