Abstract

Background

Electronic health records (EHRs) provide enormous potential for health research but
also present data governance challenges. Ensuring de-identification is a prerequisite
for use of EHR data without prior consent. The South London and Maudsley NHS Trust
(SLaM), one of the largest secondary mental healthcare providers in Europe, has developed,
from its EHRs, a de-identified psychiatric case register, the Clinical Record Interactive
Search (CRIS), for secondary research.

Methods

We describe the development, implementation and evaluation of a bespoke de-identification
algorithm used to create the register. It is designed to create dictionaries using
patient identifiers (PIs) entered into dedicated source fields and then identify,
match and mask them (with ZZZZZ) when they appear in medical texts. We judged that this
approach would be effective, given the high coverage of PIs in the dedicated fields and
the effectiveness of the masking combined with elements of a security model. We conducted
two separate performance tests: (i) to assess the performance of the algorithm in masking
individual true PIs entered in dedicated fields and then found in text (using 500 patient notes); and
(ii) to compare the performance of the CRIS pattern matching algorithm with a machine
learning algorithm, the MITRE Identification Scrubber Toolkit (MIST), using
70 patient notes (50 notes for training, 20 notes for testing). We also report any instances
of potential breaches, defined as the occurrence of three or more true or apparent PIs in the same patient’s notes
(and in an additional set of longitudinal notes for 50 patients); and we consider
the possibility of inferring information despite de-identification.
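
The dictionary-and-mask approach described above can be illustrated with a minimal sketch. This is not the CRIS implementation: the field names, the sample note and the helper functions build_dictionary and mask_text are hypothetical, and the matching rule (whole-word, case-insensitive) is an assumption made for illustration.

```python
import re

MASK = "ZZZZZ"  # masking token named in the Methods


def build_dictionary(identifier_fields):
    """Collect non-empty identifier strings from a patient's dedicated fields.

    `identifier_fields` is a hypothetical mapping of field name -> value.
    """
    terms = {value.strip() for value in identifier_fields.values() if value.strip()}
    # Longest first, so a longer identifier is tried before any shorter one.
    return sorted(terms, key=len, reverse=True)


def mask_text(text, terms):
    """Replace whole-word, case-insensitive occurrences of each term with MASK.

    Returns the masked text and the number of replacements made.
    """
    if not terms:
        return text, 0
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.subn(MASK, text)


# Invented example data, not taken from any real record.
fields = {"forename": "Jane", "surname": "Smith", "postcode": "SE5 8AZ"}
note = "Jane Smith attended clinic. Ms SMITH lives at SE5 8AZ. Janet was present."
masked, n_masked = mask_text(note, build_dictionary(fields))
print(masked)
# ZZZZZ ZZZZZ attended clinic. Ms ZZZZZ lives at ZZZZZ. Janet was present.
```

Exact matching of this kind leaves misspelled identifiers unmasked (a note containing "Smyth" would pass through unchanged), which is consistent with the residual potential PIs reported in the Results.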

Results

True PIs were masked with 98.8% precision and 97.6% recall. As anticipated, potential
PIs did appear, owing to misspellings entered within the EHRs. We found one potential
breach. In a separate performance test, with a different set of notes, CRIS yielded
100% precision and 88.5% recall, while MIST yielded 95.1% and 78.1%, respectively.
We discuss how we mitigate the realistic, albeit low-probability, risk
of potential breaches through implementation of the security model.
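
For readers less familiar with the metrics, the following sketch shows how precision and recall are conventionally computed from counts of masked and missed identifiers, and how the potential-breach criterion from the Methods (three or more true or apparent PIs in one patient's notes) could be checked. The function names and all counts are hypothetical and chosen for illustration; they are not figures from the study.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions.

    tp: true PIs correctly masked
    fp: non-PI text masked in error
    fn: true PIs left unmasked
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def potential_breaches(unmasked_per_patient, threshold=3):
    """Return patients whose notes retain `threshold` or more unmasked PIs."""
    return [pid for pid, n in unmasked_per_patient.items() if n >= threshold]


# Illustrative counts only, not results from the paper.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.900 recall=0.750
print(potential_breaches({"A": 0, "B": 3, "C": 1}))  # ['B']
```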

Conclusion

CRIS is a de-identified psychiatric database sourced from EHRs, which protects patient
anonymity and maximises data available for research. CRIS demonstrates the advantage
of combining an effective de-identification algorithm with a carefully designed security
model. The paper advances much-needed discussion of EHR de-identification, particularly
in relation to the criteria used to assess de-identification and the need to consider
the context of de-identified research databases when assessing the risk of breaches of
confidential patient information.