In our database deduplication work, we write software that looks at tens of millions of pairs of records. The model assigns each pair a probability that that pair of records refers to the same person. This is called pairwise classification.

Once we have the records classified, we need to decide which groups of records refer to the same person; together, the records that refer to a single person are call a cluster.

There may be 1, 2, or lots of records in a cluster. But herein is the complexity: if record A matches to record B, and record B matches record C, do all three match (A, B, C)? When you look at clusters, you'll find that maybe they do, and maybe they don't.

This part of deduplication is called clustering, grouping "like" records and separating "unlike" records. The problem we encountered, for reasons that we understood only dimly, was that the computation was getting stuck.

I've done clustering with our deduplication framework since about 2008, and I've tried tons of different approaches. For the early years, we had a clustering implementation written by HRDAG consultant Jeff Klingner that was tuned for the kinds of linkage information we had at the time. But over time, our problems got bigger, and for a variety of reasons, we have tried to write less code that implements algorithms directly. That's what the larger open source community does, and usually pretty well, so we try to focus our code on the pieces that tie together other people's foundational work.

We tend to use solutions that solve most of the problem with models, then we bring lots of human attention to bear on the hard parts. Machine learning solutions tend to be good at most of the problem, but they almost always have a small part of the data which they get wrong. Clustering has similar strengths and weaknesses. The error is part of a probabilistic solution, and we accept it because eliminating every last bit of error tends to make the model fragile and overly specific (don't overfit! don't chase the noise!). A model with no error will do badly with new data.

Nonetheless, we need really precise answers from our deduplication so we can do multiple systems estimation. So we use the machine learning tools to do most of the work, and to highlight the parts of the problem that are likely to be wrong. Then we blanket that part of the problem with human classification, and force the outcomes to match the human decisions. In short, we brute-force the last few percent of the problem.

How many percent are the last few? Sometimes we brute-force most of it. But as problems get bigger, human beings can do proportionally less of them.

In this case, the particular combination of tools we put together for our dedeplication weren't working: the clustering would hang. At first, I enthusiastically began solving the wrong problem: I tried various parallel solutions (joblib, multiprocessing.Pool, CUDA-based hclust). And my favorite, process-based parallelism driven by database lookups using redis; I call this "real-dumb-parallelism." Of course none of these helped -- at all -- because I was solving the wrong problem.

The fundamental rule of making software faster is to "optimize what needs optimizing". What part of the code is slow, and why? If you don't know why the code is slow, then fixing things that aren't broken won't help. In the example below, I won't show how I learned which pieces are slow (it's already a very long post), but let me assure you that I put timing statements throughout the code to figure out what was wrong. Of course, I only did that after wasting a day or so with the more-fun parallelism approaches.

This turns out to be a common problem in computer science. Let's return to the example: if record A matches to record B, and record B matches record C, do all three match (A, B, C)? The solution (A, B, C) is called transitive closure (in the graph theory sense).

Unfortunately, there are some records that for weird reasons seem to be classified as probable matches to lots and lots of other records. Sometimes this occurs because the bad record has a lot of missing information, other times it's a clump of very common names and dates. These giant lumps of connected records are called black holes or snakes eating tails.

We can unpack the black holes (and all the clusters with more than two records) by using hierarchical agglomerative clustering (called HAC). The problem with HAC is that it's very slow, and prefers to deal with small datasets--ideally a few hundred records, though it can grind through 10-20K records if it has to.

Hybrid solution! First partition the data with transitive closure, then break the clusters up using HAC. Hooray, an answer. As a former colleague of mine used to say, now it's just a small matter of programming.

I planned to solve this problem using networkx for the transitive closure and scipy and fastcluster for HAC. Over the last few years, I've read, adapted, and written my own versions of a half-dozen-odd approaches to this problem. Some worked well in a specific context but badly in others. Other approaches never worked all that well. Sometimes they occasionally produced weird answers, sometimes they didn't scale very well. I've learned to let other people do the hard, super-general work on the algorithms, and there are great tools out there.

However, getting our data into these tools isn't entirely simple, and that's what this post is about. The networkx part is really easy, but HAC isn't. Allow me to explain.

We need to cluster. The two scipy functions we want to use are linkage and fcluster.[1] The trick is that linkage requires that we create a condensed distance matrix (cdm). Imagine a $n x n$ matrix mapping every record $i$ to every other record $j$, with the distance between $(i,j)$ in each cell. The cdm is the upper triangular of the distance matrix flattened into a single vector. Making the distances into a straight line makes all the subsequent calculations a lot easier for the computer, but it's hard for a person.

[1] If we used other clustering packages in other languages, like R's hclust, they also use a condensed distance matrix, so if you're going to do clustering, you probably want to figure this out.

Imagine that we have six records, (A, B, C, D, E, F), and for some of them, we have classification scores (recall that the score is the probability that two records represent the same person). The scores may come from any one of several algorithms: a random forest, a logistic regression, boosted stumps, or some other classification algorithm that produces probability scores.

There are 6 choose 2 = 15 possible pairs, but as a result of blocking, we only looked at seven of them (this is a Good Thing). They have the following pairwise similarity scores (this is classified pairs):

h1

h2

score

A

B

0.9

A

C

0.4

A

D

0.6

A

E

0.3

B

C

0.6

C

F

0.1

E

F

0.97

D

E

0.95

D

F

0.65

The model really thinks (A, B), (D, E), and (D, F) belong together because we usually start with the idea that a score greater than 0.5 means a match. It's not sure about (A, C), (A, D), or (B, C), they're all a little close to the threshold at 0.5. And it thinks (C, F) is a non-match. The corresponding distance matrix represents all the $ij$ pairs, like this:

A

B

C

D

E

F

A

inf

.9

.4

.6

.3

na

B

.9

inf

.6

na

na

na

C

.4

.6

inf

na

na

0.1

D

.6

na

na

inf

.95

.65

E

.3

na

na

.95

inf

.97

F

na

na

0.1

.65

.97

inf

All the na values mean that we didn't classify them; we'll treat their similarities as zero. And when $i == j$, the value of the simlilarity is infinite. Note that the matrix is symmetric around the diagonal. That means the information is duplicated, and that's inefficient. Software transforms this into a condensed distance matrix, which looks like this:

In [39]:

cdm_t=[.9,.4,.6,.3,0,.6,0,0,0,0,0,0.1,.95,.65,.97]

The condensed distance matrix is created by taking each row to the right of the diagonal (shown by the inf value) and appending it to the previous rows. It contains the same information as the full matrix, but it takes less than half the memory and it's easier to do some computations with it.

Oops, I left out one little part. We classify pairs in terms of their similarity. However, we cluster in terms of their distance. We have to invert the similarity probabilities into distances (i.e., take the complements), so that the cdm we're actually going to use looks like this:

defmake_indexer(hashptr):''' the returned indexer function takes two hashes and returns the cdm index location. the hashptr dict is a dict of all the hashids into their sorted sequence position. note that the closure binds the hashptr dict and the dlen calculation into the function's enclosing scope. this is a way to bind data to the function, very much like a class. the index calculation is explained in fgregg's answer to this stackoverflow question: https://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs '''d=len(hashptr)dlen=d*(d-1)/2defd_indexer(k1,k2):i,j=hashptr[k1],hashptr[k2]offset=(d-i)*(d-i-1)/2index=dlen-offset+j-i-1# index *is* an int, but some of the calcs above may # make it a float which triggers a DeprecationWarning# when it's used to index an array (which is the point)returnint(index)returnd_indexer

The make_indexer function creates a function, indexer, which finds the cdm position for two keys. It starts with a hashptr, which is just a dict that remembers all the keys and their sequence position. All the crazy adding of $d$, $i$, and $j$ is to navigate between the pairwise matrix and the linear matrix.

In [59]:

hashptr_t={'A':0,'B':1,'C':2,'D':3,'E':4,'F':5}indexer_t=make_indexer(hashptr_t)print("the index position of pair (A, B) is {}".format(indexer_t('A','B')))print("the index position of pair (E, F) is {}".format(indexer_t('E','F')))print("the cdm value for pair (A, E) is {}".format(cdm_t[indexer_t('A','E')]))print("the cdm value for pair (D, E) is {}".format(cdm_t[indexer_t('D','E')]))

the index position of pair (A, B) is 0
the index position of pair (E, F) is 14
the cdm value for pair (A, E) is 0.3
the cdm value for pair (D, E) is 0.95

How do we make the cdm without making the full pairwise matrix? First off, we can partition the data using transitive closure, then split the partitions with HAC. The partitions are called connected components.

While most components are pretty small, some have more than 100 records. And sometimes a really big "black hole" component appears, like the one in the current round of data that includes over 12K records. How do we make cdm's for all these components? Efficiently? Let's dig deeper, with some data.

The data are in two parts: the original records (input-records.csv) and the pairwise classifications (classified-pairs). I'm sorry that I can't include real data in this post (as I've noted before): my application is to find pairs of records about the deaths of people who have died in the Syrian conflict since 2011. Some of these records are public, but others are not.

This turns out to be really easy, thanks to networkx. Set a threshold over which a pair is a match to another pair (this is a parameter to be revisited later), then add edges to the graph; this is a graph as the idea is used in mathematical graph theory which thinks about nodes connected to edges; this is not in terms of a statistical graph. The resulting connected components are the pieces we will process with HAC.

In the example above, the (A, B, ..., F) is a connected component because there is a path with a score >= 0.5 from every point to every other point, like this:

Earlier I said that transitive closure sometimes makes clusters that are much too big. That's clear below. We've got plenty of clusters with > 11 records, and one with > 12K.

In [10]:

# I know I can use Counter for this kind of thing:# Counter([len(c) for c in connected_components])# but it produces a row for every value, and that's a mess. lens=OrderedDict([('1',0),('2',0),('3-10',0),('11-100',0),('101-1000',0),('1000+',0)])forccinconnected_components:cclen=len(cc)ifcclen==1:lens['1']+=1elifcclen==2:lens['2']+=1elifcclen<11:lens['3-10']+=1elifcclen<101:lens['11-100']+=1elifcclen<1001:lens['101-1000']+=1else:lens['1000+']+=1pprint(lens)

Now we can make a cdm. Before we dive all the way into real data, I'll walk through the example we worked on above with (A, B, ..., F).

In [11]:

defmake_cdm0(hashids,cp,prob_col):''' hashids is the list of hashes we need to cluster cp is the list of classified pairs with their distances probs_col tells us which column of cp has the distances we want. our measure in cp is of *similarity*, but the cdm wants distance, i.e., *dissimilarity*. We'll invert the measure below. '''hashids=sorted(list(hashids))# hashptr points into the list, keeping track of each hashid's positionhashptr={k:ifori,kinenumerate(hashids)}num_ids=len(hashids)# CAUTION! this creates a memory object (n choose 2) long. cdm_len=num_ids*(num_ids-1)/2cdm=np.ones([cdm_len])# assume max dissimilarity cpsub=cp.loc[(cp.hash1.isin(hashptr))&(cp.hash2.isin(hashptr))]cpsub=cpsub.ix[:,['hash1','hash2',prob_col]]indexer=make_indexer(hashptr)forrowincpsub.itertuples():index=indexer(row.hash1,row.hash2)# change similarity into distancecdm[index]=1.0-getattr(row,prob_col)returncdmprint('ready')

ready

In [41]:

# Here's how it looks with the example data#hashids_t=['A','B','C','D','E','F']cp_t=pd.DataFrame.from_records([('A','B',0.9),('A','C',0.4),('A','D',0.6),('A','E',0.3),('B','C',0.6),('C','F',0.1),('E','F',0.97),('D','E',0.95),('D','F',0.65)],columns=['hash1','hash2','prob'])make_cdm0(hashids_t,cp_t,'prob')

Ok, that looks familiar. It seems like it's working so far. Let's try it with real data.

In [9]:

# let's pick a few connected_component elements for testing at different sizestest_components=OrderedDict()foriin(3,7,11,52,106,174,12477):test_components[i]=-1forccinconnected_components:iftest_components.get(len(cc),None)==-1:test_components[len(cc)]=cc

In [13]:

# Ok, let's run it with real data, and time it. #fori,ccintest_components.items():start=time.time()cdm=make_cdm0(cc,cp,'xgb_prob')elapsed=time.time()-startprint("w len(cc)={}, time = {:3.1f}s".format(i,elapsed))cdm0=make_cdm0(test_components[52],cp,'xgb_prob')print('done.')

The make_cdm function has a pretty constant run time for up to 174 (and probably a lot more) records in a cluster. It only slows down, and not by much, when we give it a giant, 12K cluster to make the cdm.

But those 11 second times for clusters with only 3 records will kill us. There are more than 64K small clusters to process, and at 11 seconds each, well, that would take over a week to run. There must be a better way.

And there is. Look at the code again. Here's the line that makes it slow:

cpsub = cp.loc[(cp.hash1.isin(hashptr)) & (cp.hash2.isin(hashptr))]

What I'm trying to do is reduce the classified pairs to a subset I can iterate over. What are the pairs we need to add to the cdm? I was really focused on those pairs coming from the pairs that we'd classified. I had it stuck in my head that the optimization we did in the blocking step would somehow help us here. Nope.

For some reason, I was trying to avoid looking at all the possible pairs in a connected component -- remember, all the pairs means (n choose 2) $= n(n-1)$, and that can be really large. But these $n$ are small! And that's one of the two key pieces I missed.

The second piece I missed is that by subsetting the full list of classified pairs for each connected component, I was creating a constant cost. There are 55 million classified pairs, so this constant per-component cost is 11 seconds.

Instead of this approach, let's remember that pandas has a tool for fast lookups. Indexing rows is like creating a dict() to each row. Let's use that, and python's "easier to ask forgiveness than permission" (EAFP) try ... except statement. We'll generate all the pairs dynamically and look them up in the dataframe as we go along.

In [15]:

%%time
# The indexing does impose a fixed cost -- but only 14 seconds,
# and it's only done once. no problem.
cp.set_index(['hash1', 'hash2'], drop=False, inplace=True)

First note that the cdm made by the previous approach is the same as the one created here. The second line checks that the cdm for the test_component[52] is the same. Fast solutions are useless if they get the answer wrong.

This is a great example of an algorithm that does really well on small problems, but then fails to scale to a larger problem. Notice that it does small problems faster than can be timed. Around 50 records in the component, the time starts increasing: time quadruples when n doubles from 52 to 106; then time triples in the increase from 106 to 174. How many triples and quadruples are there when we add 12K more records? It's not going to finish. I killed it after 630 seconds.

Here's the takeaway: Every non-trivial problem has a mixed solution. The professor who taught me game theory in graduate school used to say this about 100x in every lecture, and it's burned into my brain.

The row-subset approach in make_cdm0 has a high fixed cost to select a subset out of 55M rows, but then it runs in time proportional to the number of records we give it, i.e., $O(n)$. The enumerate-all-pairs approach in make_cdm1 has no fixed cost and runs very fast with small inputs. But with even a few hundred records, it has to consider n choose 2 records $n(n-1)$ (in comp sci speak, that's $O(n^2)$ scaling). This increases a lot faster than $n$, so with more than a few records, it won't finish in reasonable time.

Let's do both: let's look at the number of inputs and use that to decide which approach to take.

In [27]:

defmake_cdm(hashids,cp,prob_col):''' given a list of hashids, a dataframe with the similarity scores of some of pairs of hashids and the dataframe column containing the score values return a condensed distance matrix for use in clustering '''hashids=sorted(list(hashids))hashptr={k:ifori,kinenumerate(hashids)}num_ids=len(hashids)indexer=make_indexer(hashptr)cdm_len=num_ids*(num_ids-1)/2cdm=np.ones([cdm_len])# assume max dissimilarity ifnum_ids<500:# if the cluster is small, look at all the pairs # This is O(n**2), but there's no setup time. So when # n is small, this is good. forhash1,hash2init.combinations(hashids,2):try:similarity=cp.at[(hash1,hash2),prob_col]exceptKeyError:continueindex=indexer(hash1,hash2)cdm[index]=1.0-similarity# change similarity into distanceelse:# if the cluster is not small, subset and check the known pairs # the cp subset (over the 55M pairs) takes a constant 11 seconds.cpsub=cp.loc[(cp.hash1.isin(hashptr))&(cp.hash2.isin(hashptr))]# then iterating over all of the pairs found is O(n) forrowincpsub.itertuples():index=indexer(row.hash1,row.hash2)# change similarity into distancecdm[index]=1.0-getattr(row,prob_col)returncdmprint('ready')

# now cluster this ccdefcluster_one_cc(cc,cp,prob_col,verbose=False):defnow():return'{:%M:%S}'.format(datetime.datetime.now())ifverbose:msg='at {} on pid={}: starting HAC on new cluster ({} recs)'print(msg.format(now(),os.getpid(),len(cc)))# handle the simple cases immediatelyiflen(cc)<=2:returncccdm=make_cdm(cc,cp,prob_col)ifverbose:msg='at {} on pid={}: made CDM, heading into HAC'print(msg.format(now(),os.getpid(),len(cc)))clustering=linkage(cdm,method='average')cl=fcluster(clustering,t=0.5,criterion="distance")clustersd=defaultdict(list)fori,mgiinenumerate(cl):clustersd[mgi].append(list(cc)[i])ifverbose:msg='{} on pid={}: finished HAC on new cluster ({} clusters)'print(msg.format(now(),os.getpid(),len(clustersd)))return[vforvinclustersd.values()]

First off, it's clear that scipy's clustering function, linkage works really well with the small clusters. However, it turns out that the algorithm for clustering that scipy uses are $O(n^3)$, and for 12K records, that's not going to work as the number of records increases past $10^5$ (in our case here). I stopped it after about 15 minutes.

Other people have solved this problem, in particular fastcluster was written with the best possible algorithms and really good code. It scales at $O(n^2)$, and the operations are so fast that we can cluster the really big component in about 8 seconds. There's still a lot of work to be done (the cluster flattening step is also a little slow), so in total, the big cluster runs in about 27 seconds.

I recommend the article introducing fastcluster, by Daniel Müllner. I learned a lot from the discussion of different implementations of the basic clustering algorithm. The data shown in Figure 1 is what we need to know here. It only shows performance on datasets up to $10^4$, but even there, the linkage step in scipy will take $> 10^3$ seconds while fastcluster is still less than 10 seconds. Extend that exponential trend ten-fold, and, well, this is why we need better code.

To use fastcluster, I changed one line in the code, shown below in fcluster_one_cc.

In [60]:

# now cluster this ccdeffcluster_one_cc(cc,cp,prob_col,verbose=False):defnow():return'{:%M:%S}'.format(datetime.datetime.now())ifverbose:msg='at {} on pid={}: starting HAC on new cluster ({} recs)'print(msg.format(now(),os.getpid(),len(cc)))# handle the simple cases immediatelyiflen(cc)<=2:returncccdm=make_cdm(cc,cp,prob_col)ifverbose:msg='at {} on pid={}: made CDM, heading into HAC'print(msg.format(now(),os.getpid(),len(cc)))# from http://danifold.net/fastcluster.html. clustering=fastcluster.linkage(cdm,method='average',preserve_input=False)cl=fcluster(clustering,t=0.5,criterion="distance")clustersd=defaultdict(list)fori,mgiinenumerate(cl):clustersd[mgi].append(list(cc)[i])ifverbose:msg='{} on pid={}: finished HAC on new cluster ({} clusters)'print(msg.format(now(),os.getpid(),len(clustersd)))return[vforvinclustersd.values()]

at 37:42 on pid=50196: starting HAC on new cluster (3 recs)
at 37:42 on pid=50196: made CDM, heading into HAC
37:42 on pid=50196: finished HAC on new cluster (1 clusters)
w len(cc)=3, time = 0.0s
at 37:42 on pid=50196: starting HAC on new cluster (7 recs)
at 37:42 on pid=50196: made CDM, heading into HAC
37:42 on pid=50196: finished HAC on new cluster (2 clusters)
w len(cc)=7, time = 0.0s
at 37:42 on pid=50196: starting HAC on new cluster (11 recs)
at 37:42 on pid=50196: made CDM, heading into HAC
37:42 on pid=50196: finished HAC on new cluster (3 clusters)
w len(cc)=11, time = 0.0s
at 37:42 on pid=50196: starting HAC on new cluster (52 recs)
at 37:42 on pid=50196: made CDM, heading into HAC
37:42 on pid=50196: finished HAC on new cluster (14 clusters)
w len(cc)=52, time = 0.0s
at 37:42 on pid=50196: starting HAC on new cluster (106 recs)
at 37:42 on pid=50196: made CDM, heading into HAC
37:42 on pid=50196: finished HAC on new cluster (31 clusters)
w len(cc)=106, time = 0.1s
at 37:42 on pid=50196: starting HAC on new cluster (174 recs)
at 37:43 on pid=50196: made CDM, heading into HAC
37:43 on pid=50196: finished HAC on new cluster (35 clusters)
w len(cc)=174, time = 0.3s
at 37:43 on pid=50196: starting HAC on new cluster (12477 recs)
at 38:03 on pid=50196: made CDM, heading into HAC
38:10 on pid=50196: finished HAC on new cluster (3365 clusters)
w len(cc)=12477, time = 27.2s

Even after fast, efficient HAC, some of the clusters will still be "overclustered," and here's where brute force comes back into play. All of our deduplication work is a dialogue between machine learning parts and lots and lots of human attention to the data.

In another post, I'll explain how we assess the quality of the clustering, and from that, how we post-process the clusters so that the resulting deduplication reflects the combination of the machine learning and human attention. So keep in mind: every non-trivial problem has a mixed solution. It's true even when it's not.