Forensic mathematics of DNA matching

A typical DNA case involves the comparison of two samples: an
unknown or evidence sample, such as semen from a rape, and a
known or reference sample, such as a blood sample from a
suspect.

If the DNA profiles obtained from the two samples are
indistinguishable (they "match"), that of course is evidence for the
court that the samples have a common source; in this case, that the
suspect contributed the semen.

How strong is the evidence? If the DNA profile consists of a
combination of traits that figure to be extremely rare, the evidence
is very strong that the suspect is the contributor. To the extent
that the DNA profile is not so rare, it is easier to imagine that the
suspect might be unrelated to the crime and that he matches only by
chance.

DNA profile probability

Therefore it is essential to have some idea as to the probability
that a match would occur by chance. It is easiest to illustrate by
example how the probability is determined:

DNA profile        Allele frequency from database                         Genotype frequency for locus
Locus    Alleles   Times allele observed   Size of database   Frequency   Formula   Number
CSF1PO   10        109                     432                p = 0.25    2pq       0.16
         11        134                                        q = 0.31
TPOX     8         229                     432                p = 0.53    p^2       0.28
         8
THO1     6         102                     428                p = 0.24    2pq       0.07
         7         64                                         q = 0.15
vWA      16        91                      428                p = 0.21    p^2       0.05
         16

Profile frequency = 0.00014

The allele 10 at the locus CSF1PO was observed 109 times in a
population sample of 432 alleles (216 people). Therefore it is
reasonable to estimate that there is a chance p=0.25 that
any particular CSF1PO allele, selected at random, would be a 10.
Similarly, the chance is about q=0.31 for a random CSF1PO
allele to be 11. Prior to typing the suspect, if we assume that he is
not the donor of the evidence then we can think of him as someone who
received a CSF1PO allele at random from each of his parents. The
chance to receive 10 from his mother and 11 from his father is
therefore pq, and to receive 11 from mother and 10 from
father is another pq, so the probability to be 10,11 by
chance is 2pq. Hence about 16% of people have the 10,11
genotype at the CSF1PO locus.
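The arithmetic in this paragraph can be sketched in a few lines of Python. The counts are the CSF1PO figures quoted above; the function and variable names are made up for this illustration.

```python
# Counts for the CSF1PO locus as quoted in the text above.
DATABASE_SIZE = 432      # alleles in the population sample (216 people)
COUNT_ALLELE_10 = 109    # times allele 10 was observed
COUNT_ALLELE_11 = 134    # times allele 11 was observed


def allele_frequency(times_observed: int, database_size: int) -> float:
    """Estimate an allele frequency as a simple proportion of the database."""
    return times_observed / database_size


def heterozygote_frequency(p: float, q: float) -> float:
    """Chance of the genotype with two different alleles: pq + qp = 2pq."""
    return 2 * p * q


p = allele_frequency(COUNT_ALLELE_10, DATABASE_SIZE)
q = allele_frequency(COUNT_ALLELE_11, DATABASE_SIZE)
csf1po_frequency = heterozygote_frequency(p, q)

print(f"p = {p:.2f}, q = {q:.2f}, 2pq = {csf1po_frequency:.2f}")
# prints: p = 0.25, q = 0.31, 2pq = 0.16
```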

At the TPOX locus, since both alleles are the same there is only
one term, pp or p^2, which
represents the combined probability of inheriting the allele 8 from
each parent. Hence about 28% of people have the same TPOX genotype
as does the evidence. It is to be expected that the proportion of
TPOX 8,8 people is still 28% even if attention is restricted only to
people who have a particular CSF1PO genotype such as 10,11. Therefore
the chance for a person to have the combined genotype in the two loci
is 28% of 16%, or about 4%.

The calculations for the THO1 and vWA loci are similar, and
taking them into account whittles the overall chance for a random
person to have the combined genotype from 4% down to about 1/7000.

product rule

In summary, the probability of a particular multiple-locus genotype
is obtained by multiplication, that is, by multiplying together the
frequencies of the per-locus genotypes, which is to say, by
multiplying together the frequencies of all the individual alleles
and including in addition a factor of 2 for each heterozygous locus.
This way to obtain the frequency of a DNA profile is called the
product rule.
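As a rough sketch, the product rule for the four-locus example can be written out as follows. The allele counts and database sizes are those of the example profile; the data layout and function names are assumptions made for this illustration. Frequencies are computed from the unrounded counts, so the result comes out at about 0.00014 rather than the slightly larger figure one would get by multiplying the rounded per-locus values.

```python
from math import prod

# Example profile: alleles typed at each locus, how often each allele was
# seen in the database, and the database size (in alleles).
PROFILE = {
    "CSF1PO": {"alleles": (10, 11), "counts": (109, 134), "database_size": 432},
    "TPOX": {"alleles": (8, 8), "counts": (229, 229), "database_size": 432},
    "THO1": {"alleles": (6, 7), "counts": (102, 64), "database_size": 428},
    "vWA": {"alleles": (16, 16), "counts": (91, 91), "database_size": 428},
}


def genotype_frequency(alleles, counts, database_size):
    """Per-locus genotype frequency: p^2 if homozygous, 2pq if heterozygous."""
    p = counts[0] / database_size
    q = counts[1] / database_size
    if alleles[0] == alleles[1]:
        return p * p
    return 2 * p * q


def profile_frequency(profile):
    """Product rule: multiply the per-locus genotype frequencies together."""
    return prod(genotype_frequency(**locus) for locus in profile.values())


for name, locus in PROFILE.items():
    print(f"{name}: {genotype_frequency(**locus):.2f}")

frequency = profile_frequency(PROFILE)
print(f"profile frequency = {frequency:.5f}")         # about 0.00014
print(f"roughly 1 in {round(1 / frequency, -3):.0f}")  # about 1 in 7000
```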

The profile frequency is sometimes referred to as the random
match probability, or the chance of a random match.

verbal explanation

In the example case, the overall profile frequency is 0.00014 or
about 1/7000. Therefore, a summary of the evidence is that

either the suspect contributed the evidence, or an unlikely
coincidence happened: the once-in-7000 coincidence that an
unrelated person would by chance have the same DNA profile as that
obtained from the evidence.

A shorter summary is "common source, or unlikely coincidence."

Fallacies

"Prosecutor's fallacy"

correct statement vs. prosecutor's fallacy

Correct statement: The chance is 1/7000 that some (particular) person
other than the suspect would leave a stain like the actual stain.

Prosecutor's fallacy: The chance is 1/7000 that someone (anyone) other
than the suspect left the stain.

These two statements are obviously different when shown side by side,
but there is some similarity. For example, both statements might
carelessly be paraphrased by the ambiguous statement

The chance is 1/7000 for someone other than the suspect to
produce the observed evidence.

Maybe this is how the "prosecutor's fallacy" got started.

Newspapers almost always write, incorrectly, that a profile frequency
of 1/7000 means there is only 1 chance in 7000 that a person other
than the suspect left the semen. (Why? See box.) To make such a statement
is to commit the prosecutor's fallacy. It is a fallacy
because it pretends that the probability that the suspect might be
the donor can be computed from the DNA evidence alone, which implies
illogically that other evidence in the case (even if the "suspect" is
a dead woman, or even if the suspect was filmed in the act) makes no
difference at all.

It seems logical therefore that DNA evidence alone cannot be a
proof; some additional information is necessary. However, the
amount of additional information that is necessary might be a very
small amount. For example, add to the DNA matching evidence (of 7000
to one) the mere knowledge that the suspect was arrested
before his DNA type was known, and you have something like a
proof.
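One common way to make this dependence on other evidence explicit is the odds form of Bayes' rule: posterior odds = prior odds x likelihood ratio. The sketch below is an illustration under stated assumptions, not a method taken from this article. It assumes the true donor would certainly match (no laboratory error), so the likelihood ratio is 1 divided by the random match probability, and the prior probabilities are invented numbers standing in for "the other evidence in the case".

```python
def posterior_probability(prior: float, random_match_probability: float) -> float:
    """Probability that the suspect is the donor after seeing a DNA match.

    prior: probability from the non-DNA evidence alone (hypothetical input).
    Assumes the donor always matches, so the likelihood ratio is 1 / RMP.
    """
    if prior >= 1.0:
        return 1.0
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = 1.0 / random_match_probability
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


RMP = 1 / 7000  # profile frequency from the example case

# Hypothetical priors: impossible (e.g. the "suspect" is a dead woman),
# very weak other evidence, and even odds before the DNA result.
for prior in (0.0, 0.0001, 0.5):
    print(f"prior {prior:<7} -> posterior {posterior_probability(prior, RMP):.4f}")
```

The same 7000-to-one match leaves an impossible suspect impossible, and gives very different answers for a weak and a moderate prior, which is why the DNA figure alone cannot be read as the probability that someone else left the stain.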

"Defense attorney's fallacy"

Sometimes the defense tries to minimize the impact of 7000 to one
matching odds by saying, "Since that means that there are hundreds of
men in this city with the same profile, there is only one chance in
several hundred that my client is the donor of the semen." That would
be good logic if the other evidence suggests that every man in the
city had the same access to the crime scene as did the suspect; not
otherwise.
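The defense argument, and the condition under which it holds, can be sketched numerically. The city size below is a hypothetical figure chosen so that about 300 other men are expected to share the profile, and `suspect_weight` is an invented parameter expressing how much more strongly the non-DNA evidence points at the suspect than at an average other man in the city.

```python
def expected_other_matches(number_of_men: int, random_match_probability: float) -> float:
    """Expected number of men other than the suspect who share the profile."""
    return (number_of_men - 1) * random_match_probability


def donor_probability(number_of_men: int,
                      random_match_probability: float,
                      suspect_weight: float = 1.0) -> float:
    """Chance the suspect is the donor, given that he matches.

    suspect_weight = 1 means every man in the city is equally suspect
    before the DNA result (the defense attorney's implicit assumption).
    """
    others = expected_other_matches(number_of_men, random_match_probability)
    return suspect_weight / (suspect_weight + others)


MEN_IN_CITY = 2_100_001  # hypothetical city
RMP = 1 / 7000           # profile frequency from the example case

print(expected_other_matches(MEN_IN_CITY, RMP))                 # about 300
print(donor_probability(MEN_IN_CITY, RMP))                      # about 1/301
print(donor_probability(MEN_IN_CITY, RMP, suspect_weight=1000))  # about 0.77
```

With equal access the "one chance in several hundred" figure does follow; once the other evidence singles out the suspect, it no longer does.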

Laboratory error

Besides "common source" and "unlikely coincidence", a third possible
explanation for a match between suspect and evidence is error. The
chance of an error that would cause a spurious match (mishandling the
evidence, PCR contamination), although unquantifiable, is probably
very small. Nonetheless, it seems likely that the chance of error is
often much larger than the extremely small random match chances (such
as 1 in 10^8) that occur, so it may be more realistic and more fair in
such cases to say "same source, or (unlikely) error" rather than to
say "same source, or unlikely coincidence."
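To see why error can dominate, consider a small sketch that combines the two ways a non-donor could be reported as matching. The text stresses that the error rate is unquantifiable, so the value used here is purely hypothetical, as is the very small random match probability; only the comparison between the two matters.

```python
def spurious_match_probability(error_rate: float,
                               random_match_probability: float) -> float:
    """Chance a non-donor is reported as matching, by error or by coincidence.

    Either an error produces the match, or no error occurs and the
    profiles coincide by chance.
    """
    return error_rate + (1.0 - error_rate) * random_match_probability


HYPOTHETICAL_ERROR_RATE = 1e-4   # invented for illustration only
RANDOM_MATCH_PROBABILITY = 1e-8  # an extremely small match chance, for illustration

total = spurious_match_probability(HYPOTHETICAL_ERROR_RATE, RANDOM_MATCH_PROBABILITY)
share_from_error = HYPOTHETICAL_ERROR_RATE / total

print(f"total chance of a spurious match: {total:.3e}")
print(f"share of that chance due to error: {share_from_error:.4f}")
```

Under these assumed numbers nearly all of the chance of a spurious match comes from error, not coincidence.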

Microvariants

Sometimes the defense points out that there are sequence
variations in most alleles, so the suspect's allele 10 and the
evidence allele 10, which were reported by the analyst as matching,
may in reality be different. That's true, but irrelevant. The
analysis and statistics are consistent in treating "match" merely to
mean "same category," so the statistical conclusion of "either
common source, or once-in-7000 coincidence" is still
correct.

Limitations

The method of calculation described above makes several
assumptions, and in some cases some of those assumptions may be false,
so it is important to be aware of them. There is a more thorough
discussion of all these issues in "The Evaluation of Forensic DNA
Evidence".

relatives

The analysis above assumes that if the suspect is not the donor, he is
unrelated to the donor. But common sense shows immediately that if
the suspect can make a case that a relative of his, especially his
brother, is the donor, then that goes a long way towards explaining
away the coincidental similarity between the suspect and the evidence.
The defense always needs to be aware of this possibility. There are
other computations that can be made to deal with situations where
relatives of the suspect (even distant relatives) may be worth
considering.
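As one illustration of such a computation, the sketch below uses the standard population-genetics result for full siblings, who share 0, 1 or 2 alleles by descent at a locus with probabilities 1/4, 1/2 and 1/4. This is not necessarily the computation the author has in mind; the function name and data layout are assumptions made for the example, and the allele counts are those of the example profile.

```python
from math import prod

# (locus, count of first allele, count of second allele, database size, homozygous?)
PROFILE = [
    ("CSF1PO", 109, 134, 432, False),
    ("TPOX", 229, 229, 432, True),
    ("THO1", 102, 64, 428, False),
    ("vWA", 91, 91, 428, True),
]


def sibling_match_probability(p: float, q: float, homozygous: bool) -> float:
    """Chance that a full sibling of the donor has the donor's genotype at one locus."""
    if homozygous:
        no_shared = p * p        # sibling is like an unrelated person
        one_shared = p           # the one non-shared allele must also be this allele
    else:
        no_shared = 2 * p * q
        one_shared = (p + q) / 2
    # Two shared alleles means a certain match (probability 1).
    return 0.25 * no_shared + 0.5 * one_shared + 0.25 * 1.0


unrelated = prod(
    (a / n) ** 2 if hom else 2 * (a / n) * (b / n)
    for _, a, b, n, hom in PROFILE
)
brother = prod(
    sibling_match_probability(a / n, b / n, hom)
    for _, a, b, n, hom in PROFILE
)

print(f"unrelated person: about 1 in {1 / unrelated:.0f}")
print(f"full brother:     about 1 in {1 / brother:.0f}")
```

For this four-locus profile the chance that a brother matches is a few percent, far larger than the chance for an unrelated person.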

heterogeneous population

The application of the product rule presumes that the relevant
loci and population are in Hardy-Weinberg equilibrium and linkage
equilibrium. These population genetic concepts have been found to hold
to a reasonable degree of accuracy for major populations and typical
forensically used loci. For mixed populations and inbred populations
the product rule is not as accurate. To the extent that the product
rule is inaccurate, the error usually works against the suspect,
unfairly exaggerating the strength of the evidence.