Significance

We describe a method for identifying, across distinct genetic datasets, observations that represent the same person. By using correlations
among genetic markers close to one another in the genome, the method can succeed even if the datasets contain no overlapping
markers. We show that the method can link a dataset similar to those used in genomic studies with another dataset containing
markers used for forensics. Our approach can assist in maintaining backward compatibility with databases of existing forensic
genetic profiles as systems move to new marker types. At the same time, it illustrates that the privacy risks that can arise
from the cross-linking of databases are inherent even for small numbers of markers.

Abstract

Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the
challenge of record matching—the identification of dataset entries that represent the same individual. We show that records
can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing
in different datasets. Using two datasets for the same 872 people—one with 642,563 genome-wide SNPs and the other with 13
short tandem repeats (STRs) used in forensic applications—we find that 90–98% of forensic STR records can be connected to
corresponding SNP records and vice versa. Accuracy increases to 99–100% when ~30 STRs are used. Our method expands the potential
of data aggregation, but it also suggests privacy risks intrinsic to the maintenance of databases containing even small numbers
of markers—including databases of forensic significance.
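
To make the idea of record matching without shared markers concrete, the following is a minimal sketch and not the authors' implementation. It assumes that linkage disequilibrium has already been used to impute, from each SNP record, a probability for each possible genotype at each STR locus; that imputation step is not shown. Each STR record is then assigned to the SNP record under which its observed genotypes are most likely. The function names, locus names, probability floor, and toy probabilities are all hypothetical.

```python
import math

def match_score(imputed_profile, str_profile, floor=1e-6):
    """Log-likelihood of an observed STR profile under imputed genotype probabilities.

    imputed_profile: {locus: {genotype: probability}}, assumed to come from a
                     separate LD-based imputation step applied to one SNP record.
    str_profile:     {locus: genotype} observed in one STR record.
    floor:           hypothetical lower bound that avoids log(0) for genotypes
                     given no imputed probability.
    """
    total = 0.0
    for locus, genotype in str_profile.items():
        p = imputed_profile[locus].get(genotype, 0.0)
        total += math.log(max(p, floor))
    return total

def best_matches(imputed, str_records):
    """For each STR record, return the index of the highest-scoring SNP record."""
    matches = []
    for str_profile in str_records:
        scores = [match_score(profile, str_profile) for profile in imputed]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches

# Toy data (invented for illustration): three SNP records with imputed STR
# genotype probabilities at two hypothetical loci.
imputed = [
    {"locus_A": {(12, 13): 0.7, (12, 12): 0.3}, "locus_B": {(8, 9): 0.6, (9, 9): 0.4}},
    {"locus_A": {(12, 12): 0.8, (12, 13): 0.2}, "locus_B": {(9, 9): 0.7, (8, 9): 0.3}},
    {"locus_A": {(14, 15): 0.9, (12, 13): 0.1}, "locus_B": {(8, 8): 0.5, (8, 9): 0.5}},
]

# The same three people in a separate STR dataset, listed in a different order.
str_records = [
    {"locus_A": (14, 15), "locus_B": (8, 8)},
    {"locus_A": (12, 13), "locus_B": (8, 9)},
    {"locus_A": (12, 12), "locus_B": (9, 9)},
]

print(best_matches(imputed, str_records))  # prints [2, 0, 1]
```

Choosing the best-scoring SNP record independently for each STR record is the simplest possible rule; a one-to-one assignment over the whole score matrix would be a natural alternative when both datasets are known to contain the same individuals.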