Database Deduplication

In our work, we merge many databases to figure out how many people have been killed in violent conflict. Merging is a lot harder than you might think.

Many of the database records refer to the same people–the records are duplicated. We want to identify and link all the records that refer to the same victims so that each victim is counted only once, and so that we can use the structure of overlapping records to do multiple systems estimation.

Database deduplication has been an active research area in statistics and computer science for decades. If records lack a unique identifier, like a social security number, finding the same person among many records may be hard because records that refer to the same person may have slightly different information. Sometimes names are spelled a little bit differently, sometimes birth or death dates are slightly different, sometimes different people have the same names, and so forth. In our work studying lists of people killed in violent conflicts, the records usually have imprecise information.

Database deduplication is also hard because there can be a lot of records involved. In our work on homicides in Colombia 2003-2011, we processed over 460,000 records. In our current work on Syria, we’re managing about 360,000 records. It’s a lot of data.

Database deduplication is so hard that in my five-part post, I’m only going to talk about the first step in record linkage, called blocking or indexing. In blocking, the goal is to reduce the number of records we consider as possible pairs so that we can calculate the likely matches without running for weeks. I show how we reduce the comparisons among the Syria datasets from 65 billion possible pairs of records to about 43 million pairs, a thousand-fold reduction. It’s a very geeky post in which I dig into the technical details, link to some of the key academic and applied work in this area, and I show actual code (in python and pandas) that does the work. It’s written in the jupyter notebook, which is a computing environment I’m in love with right now. Let’s dig in!

In our database deduplication work, we’re trying to figure out which records refer to the same person, and which other records refer to different people.

We write software that looks at tens of millions of pairs of records. We calculate a model that assigns each pair of records a probability that the pair of records refers to the same person. This step is called pairwise classification.

However, there may be more than just one pair of records that refer to the same person. Sometimes three, four, or more reports of the same death are recorded.

So once we have all the pairs classified, we need to decide which groups of records refer to the same person; together, the records that refer to a single person are called a cluster.

There may be 1, 2, or lots of records in a cluster. But heres’ a complication: if record A matches to record B, and record B matches record C, do all three match (A, B, C)? When you look at the cluster, you’ll find that maybe they do, and maybe they don’t.

Principled Data Processing

This post describes how we organize our work over ten years, twenty analysts, dozens of countries, and hundreds of projects: we start with a task. A task is a single chunk of work, a quantum of workflow. Each task is self-contained and self-documenting; I’ll talk about these ideas at length below. We try to keep each task as small as possible, which makes it easy to understand what the task is doing, and how to test whether the results are correct.

In the example I’ll describe here, I’m going to describe work from our Syria database matching project, which includes about 100 tasks. I’ll start with the first thing we do with files we receive from our partners at the Syrian Network for Human Rights (SNHR).

Multiple Systems Estimation

At HRDAG, we worry about what we don’t know. Specifically, we worry about how we can use statistical techniques to estimate homicides that are not observed by human rights groups. Based on what we’ve seen studying many conflicts over the last 25 years, what we don’t know is often quite different from what we do know.

The technique we use most often to estimate what we don’t know is called “multiple systems estimation.” In this medium-technical post, I explain how to organize data and use three R packages to estimate unobserved events.