The naive implementation would examine every pair of clusters to find
the one with the best score. On the next iteration, it would do it all
again. This adds up quickly. Keeping these scores in a table would
make it much quicker, but then the size of the table becomes the issue.
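For reference, the naive approach can be sketched like this. The `score` function is a hypothetical stand-in (term overlap); the real scoring function depends on the data being clustered:

```python
def score(a, b):
    # Hypothetical similarity: number of shared terms (higher is better).
    return len(a & b)

def naive_cluster(clusters, target=1):
    clusters = [frozenset(c) for c in clusters]
    while len(clusters) > target:
        # Examine every pair to find the best-scoring merge -- O(n^2)
        # work, repeated from scratch on every iteration.
        a, b = max(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]), key=lambda p: score(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters

merged = naive_cluster([{"x", "y"}, {"y", "z"}, {"p", "q"}], target=2)
```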

But we're on the right track. The problem with a table, besides its
size, is that most of the entries are useless. If clusters A and B
score well together, we certainly don't care how poorly A scores
against Q, or Z, or J, and so on. So we'll keep a smaller table
with a single entry for each cluster: the cluster it scores best
against. This table will start with the same number of entries as we
have leaf nodes, and it will gradually grow smaller as we create
bigger and bigger clusters.

Now the issue becomes how to cleverly maintain that table. First, let
us assume the steady state that each entry in the table has a key
which is a cluster and a value which is another cluster that produces
the best score when these are combined. This will be our invariant.
When we add a new cluster to the table, we have to do the following:

1. Find the other cluster that is the best match for the new cluster.

2. Update any cluster that would get a better score with the new
   entry than with the best it already had.

These operations can proceed in a single pass. We walk the table and
test the score of merging the existing key with the new entry. We
track the best we find as we walk the table. At each entry, however,
we also check to see if the test score exceeds the score of that
entry. If it does, then the new cluster becomes the best match for
that entry, otherwise we leave the entry alone.

We start by putting two leaf nodes in the table, each of which matches the
other as the best combination. Then we incrementally add each leaf
node of our starting set. Once all the leaf nodes are added, we know
the best match for each leaf.
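The single-pass insert and the initialization might be sketched as follows, assuming each entry maps a cluster to a `(best_partner, best_score)` pair. The `score` function and all names are illustrative, not taken from the original:

```python
def score(a, b):
    # Hypothetical similarity: number of shared terms.
    return len(a & b)

def insert(table, new):
    """Add `new` to the table in one pass, maintaining the invariant."""
    best, best_score = None, float("-inf")
    for key, (partner, key_score) in table.items():
        s = score(key, new)
        # Track the best partner for the new cluster as we walk.
        if s > best_score:
            best, best_score = key, s
        # If the new cluster beats this entry's current best, the new
        # cluster becomes that entry's best match.
        if s > key_score:
            table[key] = (new, s)
    table[new] = (best, best_score)

# Initialization: seed with two leaves, then add the rest one at a time.
leaves = [frozenset(c) for c in [{"a", "b"}, {"b", "c"}, {"c", "d"}]]
table = {leaves[0]: (leaves[1], score(leaves[0], leaves[1])),
         leaves[1]: (leaves[0], score(leaves[0], leaves[1]))}
for leaf in leaves[2:]:
    insert(table, leaf)
```

Once the loop finishes, every leaf has its best match recorded without ever scoring the full set of pairs more than once per insert.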

After initializing the table, we'll iterate by selecting the best
match in the table, deleting the matching elements from the table,
actually performing the merge, and then inserting the merged cluster
back into the table. Deleting the elements is a little tricky.
Obviously, we will remove the two entries that are keyed by those
elements, but it may be the case that other entries would be the best
match against the ones we are deleting. We have to fix these up if we
are to maintain the invariant. An easy way to maintain the invariant
is to simply remove and then re-insert any entry that matches one of
the two elements we are deleting.
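One merge iteration under these rules might look like the sketch below. The `score` function, the `(partner, score)` table layout, and all names are illustrative assumptions, not the author's actual code:

```python
def score(a, b):
    # Hypothetical similarity: number of shared terms.
    return len(a & b)

def insert(table, new):
    best, best_score = None, float("-inf")
    for key, (partner, key_score) in table.items():
        s = score(key, new)
        if s > best_score:
            best, best_score = key, s
        if s > key_score:
            table[key] = (new, s)
    table[new] = (best, best_score)

def merge_step(table):
    # Select the best-scoring entry in the table and its partner.
    a = max(table, key=lambda k: table[k][1])
    b, _ = table[a]
    del table[a], table[b]
    # Fix up entries whose best match was one of the deleted clusters:
    # the easy (if wasteful) way is to remove and re-insert them.
    stale = [k for k, (p, _) in table.items() if p in (a, b)]
    for k in stale:
        del table[k]
    for k in stale:
        insert(table, k)
    insert(table, a | b)  # the merged cluster goes back in
    return a | b

leaves = [frozenset(c) for c in
          [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}, {"y", "z"}]]
table = {leaves[0]: (leaves[1], score(leaves[0], leaves[1])),
         leaves[1]: (leaves[0], score(leaves[0], leaves[1]))}
for leaf in leaves[2:]:
    insert(table, leaf)
merged = merge_step(table)
```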

This works pretty well, but we end up rescoring a lot of entries.
Since each rescore involves scanning the entire table, that still adds
up to quite a bit of work.

I'm somewhat lazy and I don't want to rescore entries unless I really
need to. When an entry in the table has to be rescored because its
best-matching cluster is being deleted, there are only two things that
can happen: either the new cluster about to be added will score even
better than the one we just deleted, in which case the entry will be
paired with it when it gets inserted, or whatever the entry ends up
paired with will score equal to or worse than what it had before.

We'll change the invariant in this way. Each entry either contains
the actual best cluster that the entry key should merge with, or it
contains an upper bound on the best score that the key could have if
we knew what it was. So rather than rescore an entry when it becomes
invalid, we simply drop the matching element and retain the old score.

Now when we insert an element into the table, we still have to match
it against the remaining elements. If an entry has a deferred score,
and we equal or exceed that score when the new cluster is merged with
the entry, we can simply replace the entry and update the score. If
not, we retain the deferred score.
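The relaxed invariant might be sketched like this: an entry holds either its actual best partner, or `None` plus an upper bound on the best score it could have. The `score` function and all names are illustrative assumptions:

```python
def score(a, b):
    # Hypothetical similarity: number of shared terms.
    return len(a & b)

def defer(table, deleted):
    # Instead of rescoring entries that pointed at a deleted cluster,
    # drop the partner and keep the old score as an upper bound.
    for key, (partner, key_score) in table.items():
        if partner == deleted:
            table[key] = (None, key_score)

def insert(table, new):
    best, best_score = None, float("-inf")
    for key, (partner, key_score) in table.items():
        s = score(key, new)
        if s > best_score:
            best, best_score = key, s
        if partner is None:
            # Deferred entry: claim it only if we meet or beat the bound.
            if s >= key_score:
                table[key] = (new, s)
        elif s > key_score:
            table[key] = (new, s)
    table[new] = (best, best_score)

s3, s4 = frozenset({"x", "y"}), frozenset({"y", "z"})
table = {s3: (s4, 1), s4: (s3, 1)}
defer(table, s4)       # pretend s4 was just merged away...
del table[s4]
insert(table, frozenset({"y", "q"}))  # ...and its successor arrives
```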

When we go to find the best entry in the table, we use the deferred
scores as upper bound estimates of the actual score for that entry.
If the best entry exceeds all the deferred scores, we can use it
without determining the real match of the deferred scores. However,
if we find some deferred scores that have an upper bound that exceeds
the best we find in the table, then we really need to find out the
actual best match for the deferred score. We do this by finally
scanning the table for that entry.
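The selection step might look like the sketch below: if the top entry is only an upper bound, pin down its real match by scanning, then try again. As before, the `score` function and names are illustrative assumptions:

```python
def score(a, b):
    # Hypothetical similarity: number of shared terms.
    return len(a & b)

def resolve(table, key):
    # Finally scan the table to find the real best match for a
    # deferred entry, replacing the upper bound with the actual score.
    others = [k for k in table if k != key]
    best = max(others, key=lambda k: score(key, k))
    table[key] = (best, score(key, best))

def select_best(table):
    while True:
        best = max(table, key=lambda k: table[k][1])
        partner, _ = table[best]
        if partner is not None:
            return best, partner  # a concrete entry beats every bound
        # The top score is only an upper bound; resolve it and retry.
        resolve(table, best)

A, B, C = frozenset({"a", "b", "c"}), frozenset({"a", "b"}), frozenset({"p", "q"})
table = {A: (None, 5), B: (C, 0), C: (B, 0)}  # A carries a deferred bound
pair = select_best(table)
```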

Recall that each time we iterate, we make the table smaller by one.
(We delete two nodes and add a single combined one). The longer we
defer scoring an entry, the smaller the table gets and the fewer
entries need to be scanned. This reduces the number of scoring
calculations that we have to do.

Finally, we can use a little bit of knowledge about the kind of things
we want to merge. In the problem I'm working on, each data point has
only a handful of terms from a selection of thousands. I *know* that
the data points won't score well against others unless they have
common terms. By keeping a table that maps terms to clusters, I can
skip computing scores for matches that simply won't happen. On the
other hand, some data points are duplicates of others and obviously
should be grouped together. There is no need to search for matches
and invoke the full clustering machinery. We'll just take the
duplicates and form initial clusters out of those before even
populating the table.
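These two shortcuts might be sketched as follows, with toy data and hypothetical names: exact duplicates are grouped up front, and a term index limits scoring to clusters that share at least one term:

```python
from collections import defaultdict

points = [("a", "b"), ("a", "b"), ("b", "c"), ("x", "y")]

# Duplicates form initial clusters before the table is ever populated.
groups = defaultdict(list)
for i, p in enumerate(points):
    groups[frozenset(p)].append(i)
clusters = list(groups)  # one cluster per distinct term set

# Term index: maps each term to the clusters that contain it.
index = defaultdict(set)
for c in clusters:
    for term in c:
        index[term].add(c)

def candidates(cluster):
    # Clusters sharing at least one term with `cluster`; anything else
    # would score zero and can be skipped entirely.
    out = set()
    for term in cluster:
        out |= index[term]
    out.discard(cluster)
    return out
```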

So how much does all this help? I can now cluster 20000 data points
by examining only 25 million possible clusters. This takes several
minutes, but we're a long way from the 20 or so data points that we
started with.