Libraries have faced the problem of duplicate records for
more than 20 years; as they developed computer systems for managing
library catalogs and, in particular, for sharing bibliographic records
in a networked environment, duplicate records were identified as a
problem and techniques for eliminating them were developed.
Traditional de-duplication, however, is significantly different from
identifying all the citations of a particular work; in many ways,
identifying works is easier.

This chapter describes in detail the algorithm for identifying
author-title clusters and the characteristics of the source records
that make this problem hard. It presents a novel algorithm that uses
randomized full-text searches to detect potential duplicates. The
algorithm is analyzed and two kinds of failures are
examined: including a record in a cluster when it cites a different
work (a false match), and creating two different clusters whose members
describe the same work (a missed match). The chapter also surveys
other approaches to identifying related records: two early systems
used for de-duplication in library catalogs and a recent
multidatabase integration system.

The record comparison test addresses three sources of variation
between records that describe the same document: entry errors, which
are primarily typographic, differences in cataloging standards, and
abbreviations.

Entry errors are unintentional errors made during the creation and
transcription of bibliographic records. The most common entry errors
are simple misspellings and typographical errors, but improperly
formatted records and records which omit required fields are also
common. Typical format errors include placing information in the
wrong field, such as putting the month and year in the year field, or
not using the required syntax for a field, e.g. entering the year as a
two-digit number instead of a four-digit number.

Misspelling and typographical errors are common to most text databases
and the literature about them is substantial. O'Neill and
Vizine-Goetz's overview of quality control in text databases
[29] provides a thorough, though somewhat dated, discussion
of the sources of errors and some techniques for correcting them;
Kukich [20] surveys techniques for automatically detecting
and correcting spelling errors.

While entry errors are an unintentional source of variability,
differing cataloging standards are often intentional. Many
bibliographies apply a consistent set of cataloging rules, but the
standards are quite different from one bibliography to another.

Cataloging standards affect what kind of information is
recorded in a citation and where it is recorded. If a citation
describes an extended abstract, for example, some bibliographies
record that fact in the title field and others record it in the
note or annote field.

Cataloging standards also determine what document type the
citation is assigned or whether a new citation is created for each
instance of publication. When an article is published more than once,
some databases include a record that describes one of the publications
fully and includes information about the other publication in one of
the comment fields. Similarly, many conference proceedings are
published as issues of a journal; sometimes these articles are
cataloged as Bibtex InProceedings citations and other times as
Article citations.

Authors' names are a particular source of trouble. The same name can
be cataloged many ways: last name only, first initial and last name,
all initials and last name, first name and last name, etc. In a list
of authors, some bibliographies include only the first author (only
sometimes indicating that there are other authors) and others include
all the names.

Abbreviations are another source of problems caused by differences in
cataloging. Journal and conference names and publishers are
abbreviated in many non-standard ways, making comparisons using these
fields difficult. (Bibtex allows users to define abbreviations within
a Bibtex file; the abbreviations are expanded when a bibliography is
processed. These are not the abbreviations discussed here; rather, the
problem lies with abbreviations in the actual text of the field.)

The problems posed by abbreviations are not addressed
here. Abbreviations are seldom used in the author or title fields, so
they do not present a problem for identifying related records. The
problem is discussed in somewhat more detail in Chapter 6
in the section on authority control.

The algorithm for identifying clusters of related records considers
each record in the collection and determines whether it has
(approximately) the same author and title field as any other
record. There are two primary concerns for the algorithm's design.
First, the algorithm should avoid comparing each record against every
other record, an $O(n^2)$ proposition. Second, the algorithm, and in
particular the test for similarity, must cope with the kinds of
entry errors described above.

The algorithm uses two rounds of comparisons to identify author-title
clusters. The first round creates a pool of potentially matching
records using a full-text search of the entire collection; the pool
consists of the results of three queries, with words randomly selected
from the author and title of the source record. (The index, however,
includes words that appear anywhere in a record, not just in the
author and title fields.) In the second round, each potential match is
compared to the source record, and an author-title cluster is created
for the matching records.
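
In outline, a single pass of the algorithm can be sketched in a few lines of Python. This is only an illustrative sketch: the full-text index object and its search method, and the helpers make_queries and same_author_title, are assumed names standing in for the components described in the rest of this section.

def cluster_pass(record, index, clusters):
    # Round one: build the pool of potential matches from up to three
    # randomized full-text queries over the whole collection.
    pool = set()
    for query in make_queries(record):
        pool.update(index.search(query))
    # Round two: compare the author and title fields of each candidate
    # against the source record and merge the matches into a cluster.
    matches = [candidate for candidate in pool
               if same_author_title(record, candidate)]
    clusters.add(record, matches)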

Three different queries are used to construct the potential match
pool. Each query includes one of the authors' last names and two words
from the title field. Title words that are only one or two letters long,
or that are in the stoplist, which includes the 50 most common words
in the index, are not used. In cases where there are not enough words
to construct three full queries, queries use fewer words. (And in some
extreme cases only one or two queries are performed.)
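
The query construction might be sketched as follows, assuming a record object with authors and title attributes and a stoplist loaded elsewhere; the names here are illustrative, not taken from the actual implementation.

import random

STOPLIST = {"the", "of", "a", "and", "for"}   # stand-in for the 50 most
                                              # common words in the index

def make_queries(record, n_queries=3):
    last_names = [name.split()[-1] for name in record.authors]
    title_words = [w for w in record.title.lower().split()
                   if len(w) > 2 and w not in STOPLIST]
    queries = []
    for _ in range(n_queries):
        # One author last name plus two usable title words; fall back
        # to fewer words when the record does not have enough.
        words = random.sample(title_words, min(2, len(title_words)))
        if last_names:
            words = [random.choice(last_names)] + words
        if words:
            queries.append(words)
    return queries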

The second phase of testing compares the author and title field of
each record in the potential match pool against the source record. It
uses an n-gram-based approximate string matching algorithm to
compare the two fields. The string matching algorithm is explained in
detail below.

The algorithm is applied to every source record, regardless of whether
it has been placed in a cluster already; thus, a cluster with five
records will be checked by five different passes. The algorithm
enforces transitivity between records: If record A matches record B
and record B matches C, all three are placed in the same cluster,
whether or not A matches C.
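
Transitivity can be enforced with a standard union-find structure over record identifiers; the sketch below shows one way to do it and is not drawn from the original implementation.

class ClusterSet:
    """Union-find over record identifiers: if A matches B and B matches
    C, find() returns the same root for all three."""

    def __init__(self):
        self.parent = {}

    def find(self, key):
        self.parent.setdefault(key, key)
        root = key
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[key] != root:      # path compression
            self.parent[key], key = root, self.parent[key]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a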

The algorithm is tolerant of format errors and typographic
errors. Given a pair of records that cite documents with the same
author and title, the potential match pool for one record will contain
the other as long as one of the three queries contains words that
match the second record; a small number of errors is unlikely to cause
a problem. The approximate string match will tolerate variations as
large as transposed or missing words in a long title string.

Comparing two entries' author fields is more complex than comparing
title fields, because there is much wider variation in how names are
entered than in how titles are entered. Instead of comparing author
strings directly, the algorithm performs a more complicated
comparison.

To compare two author entries, each author string is separated into a
list of individual authors and each name is compared; the name
comparison considers each part of the name-first name, last name,
etc.-separately. If the names and the order of the names both match,
then the two author fields are considered to be the same. When one
author list contains matching names, in the same order as the first
list, but omits names from the end of the list, the two strings are
also considered the same; testing suggested that such cases usually
reflect either a cataloging error or the same work published with
slightly different author lists. Finally, an author list that
includes a single name and the notation ``and others'' will match any
list of two or more names that has a matching first author. (The
Bibtex format specifies the use of ``and others,'' so variants like
``et al.'' are not handled.)
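
The author-list comparison reduces to the following sketch, in which same_name stands in for the per-name test described next and "others" marks a Bibtex ``and others'' entry; the representation is an assumption made for illustration.

def same_author_list(names1, names2):
    # A single name followed by ``and others'' matches any list of two
    # or more names with a matching first author.
    for a, b in ((names1, names2), (names2, names1)):
        if len(a) == 2 and a[1] == "others":
            return len(b) >= 2 and same_name(a[0], b[0])
    if not names1 or not names2:
        return False
    # Otherwise the names must match pairwise and in the same order;
    # the longer list may add names at the end.
    shorter, longer = sorted((names1, names2), key=len)
    return all(same_name(x, y) for x, y in zip(shorter, longer))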

Individual author names are compared by separating the two name
strings into four parts: first name, middle name, last name, and
suffix (e.g. Jr.). Two parts of a name match if any of three
conditions is met:

A trigram comparison reports the strings are the same.

One of the strings is an initial and the other string is
a name that starts with that initial, or both strings are
the same initial.

Either of the strings is blank.

Thus, ``G. Steele'' and ``Guy L. Steele Jr.'' are considered the
same. However, ``B. Clifford Neuman'' and ``Clifford Neuman'' are not
considered the same (even though Neuman is cited both ways).
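
As a sketch, the per-name test might look like the following; the four-part tuple representation and the function names are assumptions, and trigrams_same is the trigram comparison described below.

def parts_match(part1, part2):
    # Either part blank: the parts match.
    if not part1 or not part2:
        return True
    # A trigram comparison reports the strings are the same.
    if trigrams_same(part1, part2):
        return True
    # One part is an initial of the other, or both are the same initial.
    p1, p2 = part1.rstrip("."), part2.rstrip(".")
    if len(p1) == 1 or len(p2) == 1:
        return p1[0].lower() == p2[0].lower()
    return False

def same_name(name1, name2):
    # name1 and name2 are (first, middle, last, suffix) tuples.
    return all(parts_match(a, b) for a, b in zip(name1, name2))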

An n-gram is a vector representation that includes all the
n-letter combinations in a string. The n-gram vector has a component
for every possible n-letter combination. (For an alphabet that
contains the letters 'a' to 'z' and the digits '0' to '9', the vector
space is $36^n$-dimensional.) The n-gram representation of a string
has a non-zero component for each n-letter substring the string
contains, where the magnitude of the component is equal to the number
of times the substring occurs. The string comparison algorithm uses
3-grams, or trigrams, so I will limit further discussion to trigrams.

Formally, the trigram vector $T(s)$ for a string $s$ has one
component $T_g(s)$ for each possible trigram $g$, defined as the
number of times $g$ occurs in $s$.

The trigram vector for the string ``record'' contains four non-zero
components: ``rec'', ``eco'', ``cor'', and ``ord.'' If the trigram
``eco'' appeared twice in the string, then $T_{\mathrm{eco}} = 2$.
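
A direct way to build the vector is to count the three-letter substrings. The sketch below follows the definition above, with no padding at the ends of the string.

from collections import Counter

def trigram_vector(s):
    # Map each trigram to the number of times it occurs; the first
    # trigram starts at the first letter of the string.
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

# trigram_vector("record") -> {"rec": 1, "eco": 1, "cor": 1, "ord": 1}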

The string comparison algorithm forms trigram vectors for the two
input strings and subtracts one vector from the other. The magnitude
of the resulting vector difference is compared to a threshold value;
if the magnitude of the difference is less than the threshold, the two
strings are declared to be the same. If each trigram appears only
once in an input string, the difference vector is the same as the list
of trigrams that appears in one string but not the other. In general,
the magnitude of the difference vector for two inputs $s_1$ and $s_2$ is

\[ |T(s_1) - T(s_2)| = \sqrt{\sum_g \big(T_g(s_1) - T_g(s_2)\big)^2}. \]

The threshold value was determined experimentally, using several dozen
pairs of strings which had been labelled the same or different, based
on how the algorithm should compare them. The threshold that worked
best varied linearly with the total number of distinct trigrams in the
two input strings. The threshold used for comparing two strings
with a total of $n$ trigrams is therefore a linear function
$t(n) = a + b\,n$, with the constants fit to the labelled pairs.

The variable threshold allowed the test to tolerate several errors in
a long title string, such as missing words or multiple misspellings,
and still work well with short title strings. A higher, fixed
threshold would have admitted many short strings that had completely
different words.

The threshold was chosen using a collection of about 50 pairs of
strings, each of which was labelled a priori as the same or
different. The computed threshold minimized the number of pairs that
were incorrectly identified. The sample strings, however, were not
chosen with great care and the effects of small variations in the
threshold were not studied.

Some trigram comparisons pad the ends of the string with blanks, so
that the first and last letters each appear in three trigrams. The
comparison used here, however, begins the first trigram with the first
letter and ends with the trigram that includes the last three letters;
this test is slightly more tolerant of errors at the beginning and end
of strings.

Figure: N-gram comparison of two strings with a letter-pair transposition

Consider the example comparison in Figure 3-1 between two strings
``Machine Vision'' and ``Machien Vision.'' The difference vector
contains eight trigrams, each appearing once, so the
magnitude of the difference is 2.828. The threshold, based on a count
of 15 trigrams in the input, is 2.861. The strings are declared to
be the same, because the magnitude of the difference is just less than
the threshold.
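
The comparison itself reduces to a few lines, using trigram_vector from the sketch above. The way n is counted and the constants of the linear threshold are placeholders here, since the fitted values are not given in the text; only the difference magnitude of 2.828 for this example follows directly from the definitions.

from math import sqrt

def trigram_difference(s1, s2):
    # Euclidean magnitude of the difference of the two trigram vectors.
    v1, v2 = trigram_vector(s1), trigram_vector(s2)
    return sqrt(sum((v1[g] - v2[g]) ** 2 for g in set(v1) | set(v2)))

def trigrams_same(s1, s2, a=2.5, b=0.025):
    # The threshold grows linearly with the trigram count n; a and b
    # are illustrative placeholders, not the experimentally fitted values.
    n = sum(trigram_vector(s1).values()) + sum(trigram_vector(s2).values())
    return trigram_difference(s1, s2) < a + b * n

print(trigram_difference("Machine Vision", "Machien Vision"))  # 2.828...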

The trigram string comparator worked well enough to create
author-title clusters that tolerated small errors, but did not match
fields that a human cataloger would consider distinct. There were some
problems, however, and it seems likely that it could be improved. Many
similarity functions for string and document vectors are discussed in
the information retrieval literature; Salton [33] offers a
brief overview.

The first round, which identifies a pool of potential matches, greatly
reduces the number of comparisons that must be made to create
author-title clusters. The brute force approach, comparing each record
to every other record, requires $N(N-1)/2$ comparisons for a collection
of $N$ records.

The pool of potential matches tends to be very small; the average pool
contained fewer than 30 records when the entire 240,000 record
collection was processed. The savings over pairwise comparison is
substantial: 7.5 million comparisons instead of 31.2 billion.

The number of comparisons that must be performed grows with the
average pool size, but it is difficult to judge how the collection
size affects the average pool size. The pools are the results of a
full-text search that includes one of the authors' last names (unless
no authors are cited); although the number of occurrences of any word
tends to grow linearly with collection size, names behave differently.
The number of times ``Szolovits'' appears, for example, is a function
of two things: how many papers Szolovits has written and how many
duplicate citations for these papers are in the collection.

A closer examination of the potential pools that were produced while
processing the full collection helps to map out the possibilities for
collections up to hundreds of thousands of records, but does not
provide a definitive answer.

I recorded the size of the potential match pools for the first 5,800
records, based on collections of 50,000, 125,000, and 240,000
records. (Only a single pool was created, but records outside the
first 50,000 were ignored for the first sample and records outside the
first 125,000 were ignored for the second sample.) The distribution of
pool sizes is roughly exponential. The majority of the pools
contained fewer than 13 records in each sample. The tail of the
exponential grows longer as the collection size increases; the largest
pool increases from 14,294 for the 50,000-record sample to 45,750 for
the 240,000-record sample.

Table: Statistics for size of potential match pools

Most of the pools are quite small, but occasionally large pools are
returned; the latter are often caused by books with some missing
information, e.g. no author, or very common title words,
e.g. ``introduction to algorithms'' or, worst of all, both. A
reasonable optimization might be to ignore records that contain
extremely large potential pools, and treat them as failures for which
related records cannot be recognized.

Several papers describe specific systems for identifying duplicate
records. Two early duplicate detection projects are surveyed
below: the IUCS scheme [47] and the original OCLC
On-Line Union Catalog [18]; O'Neill et al. [28]
present some characteristics of duplicate records in the OCLC
catalog. Toney [42] describes a more recent effort, and
provides a good overview of the design space for library duplicate
detection systems.

The OCLC and IUCS projects were completed in the late 1970s, when the
library community had just begun sharing bibliographic records among
institutions and building large online union catalogs. Subsequent work
reflects the general approach mapped out in these two seminal studies.

The database merge/purge problem, described by Hernández and
Stolfo [17], is very similar to the duplication
detection problem. In the merge/purge problem, several different
databases with similar records must be merged and the duplicates
purged; an example is joining several mailing lists, which may
contain inconsistent or incorrect addresses.

The recent information retrieval literature contains several reports
from Stanford on duplicate detection in a Usenet awareness service
[49] and several schemes for copy detection in a digital
library [36][35]. Duplicate detection in
information retrieval is a substantially different problem, because
the actual content of two documents is compared; the reported work
discusses several different approaches to identifying overlapping
content at the granularity of words and sentences.

The key difference between the OCLC and IUCS algorithms and my
algorithm is that the former identify duplicate records. Records
are duplicates when they describe exactly the same document and not
just the same work. Although the precise rules for defining a
duplicate vary somewhat between the algorithms, they both compare many
fields. The OCLC rules require that the following fields match for a
pair of records to be considered duplicates: author, title, publisher,
copyright date, date of impression (if text differs), edition, medium,
series, page or number of volumes, illustrator, and translator.

Most techniques for duplicate identification share the same gross
structure: They use two rounds of comparisons, identifying possible
duplicates on the first round and making more careful examination on
the second round. The first round uses a fixed-length key, created
from one or more field values; records with the same key are
considered potential duplicates. A longer key, incorporating several
fields, is used for the second round. (The second-round keys seem to
be used primarily to avoid loading entire records into memory for
comparison, a concern more important in 1979 than today.)

Duplicate detection keys summarize the information contained in the
record; they consist of a series of fixed-length encodings of field
values. In the OCLC and IUCS schemes, different keys are used for
round one and round two of duplicate identification. Fields are
typically encoded in one of three ways:

Selecting certain characters from the field. The OCLC scheme uses
characters 1, 2, 3, 5, 8, 13, 21, and 34 from a normalized title
string to construct one of its keys. Character selection rules can be
more complicated, specifying certain characters from specific words,
such as the first and fourth characters of the second word (a sketch of
this kind of key appears below).

For fields like reproduction code, which has a limited range of
values, each possible value can be assigned a specific code.

More complicated encodings construct hashes of normalized field
values. The second round of the OCLC scheme hashes the entire title
into a 109-bit key. The key is built by converting the trigrams that
do not span words to integers and applying a hash function to the
sequence of integers. The IUCS scheme builds a Harrison key
[14] from trigrams. The Hamming distance between two keys
can be compared, allowing tolerance of typographical errors and other
small variations between two strings.
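
A character-selection key of the first kind can be sketched as follows; the normalization and the handling of short titles are assumptions, but the selected positions are the ones listed above.

def title_selection_key(title, positions=(1, 2, 3, 5, 8, 13, 21, 34)):
    # Normalize the title (an assumed normalization: lower-case, letters
    # and digits only) and pick out the listed 1-based character
    # positions, padding with blanks when the title is too short.
    normalized = "".join(c for c in title.lower() if c.isalnum())
    return "".join(normalized[p - 1] if p <= len(normalized) else " "
                   for p in positions)

# title_selection_key("Introduction to Algorithms") -> "intoctt "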

The IUCS scheme uses a short, fixed-length title-date key for its
first round, taking eight characters from the beginning and end of
titles plus some date information. (MacLaury [27]
described the choice of title characters.) Records that have the same
title-date key are compared using three more keys: the first five
characters of the author field, a 72-bit Harrison key of the title
field, and the highest number taken from the various page fields.

The OCLC scheme is similar. The first-round key looks for exact
matches of four fields: the date, record type, reproduction code, and
the 8-character title selection. The second round compares the entire key,
but defines 16 different conditions under which the keys can be
declared a match. These conditions examine different parts of the key,
looking for either an exact match or partial match. Partial matches
are defined for only a few fields of the key; a partial match of
Harrison keys occurs when the bits set in one key field are a subset
of those set in the other key. One of the conditions for matching
illustrates the approach: Two records are duplicates if there is an
exact match of the Government Document Number and page fields, and
partial matches of the LCCN, ISBN, edition, and series fields; partial
matches occur when one record has no value in a particular
field.

The QUALCAT expert system [31] is an interesting
alternative to the key-based comparisons. A team of catalogers
developed a set of rules that describe whether a certain combination
of field values makes it more or less likely that two records are
duplicates. Each record pair is described by a possibility that it is
a duplicate (poss) and the certainty it is a duplicate (cert). Initially, the cert is 0 and the poss is 100. Each
rule increases the cert or lowers the poss. Once all
possible rules have been applied, a pair of records with sufficiently
high cert and poss are automatically labelled duplicates,
records with low values are not, and records in between are referred
to a human cataloger for review.
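
The control structure of such a rule system can be sketched as below; the rule signature, the thresholds, and the score adjustments are illustrative assumptions rather than values taken from the QUALCAT paper.

def qualcat_decision(pair, rules, high_cert=80, high_poss=80, low=40):
    cert, poss = 0, 100
    for rule in rules:
        delta_cert, delta_poss = rule(pair)   # each rule raises cert or
        cert += delta_cert                    # lowers poss (or neither)
        poss -= delta_poss
    if cert >= high_cert and poss >= high_poss:
        return "duplicate"
    if cert <= low or poss <= low:
        return "not duplicate"
    return "refer to cataloger"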

The expert system was used to compare records during the second round
of QUALCAT testing. The first round was a fairly typical key-based
filter using Universal Standard Bibliographic Code (USBC). The USBC is
similar to the OCLC and IUCS first-round keys, but Toney [42]
notes that it is more dependent on clean data. Quantitative results of
the QUALCAT system were not provided.

In the OCLC study, Hickey and Rypka [18] observe three
major causes of failure to match duplicates. A difference in the first
34 characters of the title string, from which the initial date-title key
is drawn, causes 12 percent of the missed matches. The two other
causes are differences in place of publication and pagination.

One question that remains to be answered is whether the probabilistic
first round proposed here offers any improvement over the key-based
approach. Because neither system was available for testing, we can
at best hypothesize that the randomized text-search approach is more
tolerant of errors. Word order errors, differences in cataloging, and
typographical errors would all cause problems for the date-title
key. The randomized text-search approach can tolerate these errors as
long as one of the three queries uses words that do not contain
typographical errors. It is not affected by word-order errors or by
cataloging variations that affect which words are included in the
title.

The approach to the merge/purge problem used by Hernández and Stolfo
[17] is similar to the one used for duplicate
detection. Their algorithm uses keys to partition their data into sets
small enough that they are willing to apply computationally intensive
comparisons. The source data varies widely, so any particular key is
likely to miss many records; to overcome this problem, their algorithm
makes several independent runs over the data with different keys and
computes the transitive closure of the results. The accuracy of their
test jumped from 70 percent with a single pass to 90 percent with
three passes.
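
The multi-pass structure described above can be sketched as follows; key_functions and compare stand in for the application-specific keys and the expensive record comparison, and ClusterSet is the union-find structure sketched earlier.

from collections import defaultdict

def multi_pass_merge(records, key_functions, compare, clusters):
    for key_fn in key_functions:               # one independent pass per key
        partitions = defaultdict(list)
        for idx, rec in enumerate(records):
            partitions[key_fn(rec)].append(idx)
        for group in partitions.values():      # expensive comparison only
            for i, a in enumerate(group):      # within each small partition
                for b in group[i + 1:]:
                    if compare(records[a], records[b]):
                        clusters.union(a, b)   # transitive closure across passes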

The algorithm for creating author-title clusters can fail in two ways.
It can create clusters that include citations for more than one work
(a false merge) or it can create two separate clusters that both
contain citations for the same document (a missed match). The
author-title cluster algorithm keeps these failures to a reasonable
minimum; analysis of a sample set indicates that less than 1 percent
of the clusters contained false merges and that missed matches
occurred about 5 percent of the time.

The clusters were analyzed using a 937-record sample. The sample was
selected so that each record shares at least three title words with one or more
other records; this characteristic was intended to increase the
possibility of false merges. The clustering algorithm was run on the
sample, and its results were examined by hand. The algorithm
determined that the sample contained 554 unique clusters.

After the algorithm was run, I checked each cluster to verify that
each record described the same document. I checked for missed matches
by looking for records in other clusters that contained similar title
fields. I identified potential missed matches of a record by doing
independent searches for each word in the title. (The searches used
agrep, an approximate string searching program that does not use an
n-gram approach.) If a different record was returned by at least two-thirds
of the agrep searches, it was compared by hand with the source record.

According to my manual check of the records, the sample actually
contained 554 unique works, 250 of which were cited by more than one
record. The author-title clusters contained four false merges and 30
missed matches.

Figure: Errors in author-title clusters for 937-record sample

The four false merges all involved closely related bibliographic
records. In three of the four clusters, the two different works were
part of a series, where each paper was labelled part one or part
two. Figure 3-2 shows an example. The papers had the same authors and
the approximate string match reported that the titles were the same
because they differed only in the final trigram. The final false
merge involved two different papers with very similar
titles (``Towards Dataflow Analysis of Communicating Finite State
Machines'' and ``Dataflow Analysis of Communicating Finite State
Machines'') and the same authors.

Figure: Two falsely matched records

The sample set is small enough that it is difficult to extrapolate the
results to a large collection with great precision, but it appears
that the failure rate would be acceptably low. Measured in
percentages, false merges occurred in 0.7 percent of the clusters and
5.7 percent of the clusters were missed matches that should have been
merged with another cluster.

I also used the results of two other algorithms to gauge the
relative success of the author-title clustering algorithm presented
here. The bibmerge program [1] creates title-date
clusters; if two records have the same title and date, they are placed
in the same cluster. Bibmerge has only
limited utility as a benchmark, because it uses a very simple
detection scheme that is not tolerant of formatting and typographical
errors. The program's source contains the comment: ``This script is
written in the most unprofessional manner available to me, the only
reason why I have not scrapped it is that is works.'' Nonetheless, it
was the only other duplicate detection system available for testing on
the same sample set.

Bibmerge found 650 unique clusters; it missed 126 matches and made one
false merge. The title-date clusters identified 10 record pairs that
were missed by the author-title clusters, because the title and date
fields were the same but the authors were different. In seven cases, the
author fields were improperly formatted. In two cases, the authors
were listed in a different order in each record. The final case was
bibmerge's lone false match: An issue of Computing Surveys
contained two responses to an earlier article, both of which were
titled ``The File Assignment Problem.''

A second check of the quality of the algorithm presented here is the
results reported for the OCLC duplicate detection scheme described
earlier, although the OCLC clusters were created under substantially
different rules. The analysis of the OCLC algorithm is the most
substantial reported in the literature. Using a set of 184 pairs of
records that had been identified in advance as duplicates, the OCLC
system successfully identified 127 pairs, a 69 percent success rate,
compared to a 94 percent success rate for my author-title algorithm.
The false match rate was measured by selecting 1,000 record pairs and
applying the duplicate test to each pair; only some of the pairs were
duplicates. The error rate on this sample was 1.3 percent (13 false
matches), which is comparable to the 4 false merges out of 554 (0.7
percent) in the author-title clusters.