Surveillance: Risk assessment server

Problem Statement: Given de-identified person-specific data, construct a method for
predicting the number of subjects whose information can be re-identified.

Description: One solution is Dr. Sweeney's Risk Assessment Server. Its architecture (a) combines a
population model, a meta-level database describing available databases, and an inference engine.
Its output (b) is an "identifiability report" that plots estimates of the number of explicitly
known individuals whose information can be identified in the data. Re-identifications (c) are
reported in graduated groupings termed "binsizes". The inference engine finds shortest paths from
the given data to data containing explicit identifiers for the same populations. Dr. Sweeney's
paper [cite] provides a real-world example from bioterrorism surveillance (d), in which
re-identifications result from linking to hospital discharge data on medical history. A surprise
is that releasing age ranges cannot thwart these re-identifications, no matter how aggregated
(5-year ages shown).

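[Figure panels: (a) architecture; (b) identifiability report; (c) re-identifications by binsize;
(d) bioterrorism surveillance example]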
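
The shortest-path search attributed to the inference engine can be illustrated with a small sketch. This is not the server's actual implementation. The catalog contents, the field names, the rule that two databases are linkable when they share at least two fields, and the function names are all assumptions made for illustration; only the idea of searching from the released data to data holding explicit identifiers comes from the description above.

```python
from collections import deque

# Hypothetical meta-level catalog: database name -> fields it contains.
# All names and fields here are illustrative assumptions.
CATALOG = {
    "surveillance_release": {"age_range", "zip", "diagnosis", "visit_date"},
    "hospital_discharge": {"zip", "diagnosis", "visit_date", "date_of_birth", "gender"},
    "voter_list": {"name", "date_of_birth", "gender", "zip"},
    "pharmacy_log": {"drug", "store_id"},
}
EXPLICIT_IDENTIFIERS = {"name"}
MIN_SHARED_FIELDS = 2  # assumed threshold for treating two databases as linkable


def linkable(a, b):
    """Two databases are assumed linkable if they share enough fields."""
    return len(CATALOG[a] & CATALOG[b]) >= MIN_SHARED_FIELDS


def shortest_link_path(start):
    """Breadth-first search from `start` to the nearest database that
    holds an explicit identifier; returns the path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if CATALOG[current] & EXPLICIT_IDENTIFIERS:
            return path
        for other in sorted(CATALOG):
            if other not in seen and linkable(current, other):
                seen.add(other)
                queue.append(path + [other])
    return None


print(shortest_link_path("surveillance_release"))
```

In this toy catalog the release cannot be linked to the identified list directly, so the search passes through the discharge data first. A shorter path would suggest fewer linking steps are needed to reach named individuals.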

Scientific Influence and Impact:
Dr. Sweeney's Risk Assessment Server originated with her study of the
identifiability of basic demographics, leading to her highly cited result "87% of the population of
the United States is uniquely identified by {date of birth, gender, ZIP}". Researchers replicated
these experiments. [Golle et al.] found 64% were uniquely identified in the US using more
recent information and a different model. [Malin et al.] explained the difference as model
artifacts and demonstrated that at binsizes >= 5 there is no difference.
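
As a rough illustration of how counts by binsize might be tallied, the sketch below groups records by a set of fields and reports, for each k, how many records fall in groups of k or fewer. It assumes that "binsize k" means a record shares its field values with at most k records in total, which is one reading of the graduated groupings and not a definition taken from the source. The function name and the sample records are hypothetical.

```python
from collections import Counter


def binsize_report(records, fields, max_binsize=5):
    """For each k in 1..max_binsize, count the records whose combination
    of `fields` values is shared by k or fewer records (assumed meaning
    of binsize k)."""
    bins = Counter(tuple(r[f] for f in fields) for r in records)
    return {
        k: sum(size for size in bins.values() if size <= k)
        for k in range(1, max_binsize + 1)
    }


# Made-up sample: one unique record, one pair, and one group of three.
sample = (
    [{"dob": "1970-01-01", "gender": "F", "zip": "02138"}]
    + [{"dob": "1980-05-05", "gender": "M", "zip": "02139"}] * 2
    + [{"dob": "1990-09-09", "gender": "F", "zip": "02140"}] * 3
)

report = binsize_report(sample, ["dob", "gender", "zip"], max_binsize=3)
print(report)
```

Under this reading, binsize 1 counts uniquely identified records, and larger binsizes count records that are narrowed to a small group even if not uniquely identified.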