Abstract

Although terminology differs, there is considerable overlap between record linkage methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (Mitchell 1997). Both are based on formal probabilistic models that can be shown to be equivalent in many situations (Winkler 2000). When no missing data are present in identifying fields and training data are available, then both can efficiently estimate parameters of interest. When missing data are present, the EM algorithm can be used for parameter estimation in Bayesian Networks when there are training data (Friedman 1997) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Belin and Rubin

...which representative training data are available. Naïve Bayes methods have been extended to situations in which a mixture of labeled training data and unlabeled data are used for text classification (=-=Nigam et al. 2000-=-). Parameter estimation was done using a version of the EM algorithm that is effectively identical to that used by Winkler (2000) and Larsen and Rubin (2001) when training data are not available. In t...

...e these two tails of distributions, then we can accurately estimate error rates at differing levels. This is known to be an exceptionally difficult problem (e.g. Vapnik 1999, Hastie, Tibshirani, and =-=Friedman 2001-=-). Our comparisons consist of a set of figures in which we compare a plot of the cumulative distribution of estimates of matches versus the true cumulative distribution with the truth represented by t...

...yes Nets, the large dimensionality (from 1,000 to 200,000) often rules out using methods that account for dependencies between identifying variables. Accounting for two-way dependencies (Sahami 1996, =-=Dumais et al. 1998-=-) did not yield improved classification rules for Bayesian Networks. Accounting for selected interactions involving two or more interactions did improve classification rules (Winkler 2000). Record lin...

...ble, then both can efficiently estimate parameters of interest. When missing data are present, the EM algorithm can be used for parameter estimation in Bayesian Networks when there are training data (=-=Friedman 1997-=-) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Beli...

...cations of Bayes Nets, the large dimensionality (from 1,000 to 200,000) often rules out using methods that account for dependencies between identifying variables. Accounting for two-way dependencies (=-=Sahami 1996-=-, Dumais et al. 1998) did not yield improved classification rules for Bayesian Networks. Accounting for selected interactions involving two or more interactions did improve classification rules (Winkl...

...s are concentrated. Multiple blocking passes are needed to find duplicates in a subsequent blocking pass that are not found on a prior pass. Due to high typographical error rates in most files (e.g., =-=Winkler 1994-=-, 1995), it is quite unusual to find all matches in just one blocking pass. Unlike general text classification, in record linkage it is quite feasible to use an initial guess of parameters associated ...

...ind additional relationships that have not been previously conceived and modeled. Generally, accounting for partial agreement with string comparators makes dramatic improvements in matching efficacy (=-=Winkler 1990-=-b, 1995). From one pair of files to the next, typographical error rates can dramatically affect the probabilities P(agree field | M). For instance, in an urban area or a rural area, the P(agree first ...

...ameter estimation algorithms. Unsupervised learning methods have typically performed very poorly for general machine learning classification rules. The unsupervised learning methods of record linkage (=-=Winkler 1988-=-, 1993) performed relatively well because they were applied in a few situations that were extremely favorable. Five conditions favor application of the unsupervised EM methods. The first is th...

...e not available. In the latter situations, assumption CI was not needed. In record linkage, it is known that dropping assumption CI can yield better classification rules and estimates of error rates (=-=Winkler 1993-=-, Larsen and Rubin 2001). In text classification and other general applications of Bayes Nets, the large dimensionality (from 1,000 to 200,000) often rules out using methods that account for dependenc...

...nkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Belin and Rubin 1995, =-=Larsen and Rubin 2001-=-). Keywords: likelihood ratio, Bayesian Nets, EM Algorithm 1. INTRODUCTION Record linkage is the science of finding matches or duplicates within or across files. Matches are typically delineated using...

.../ Mtl; and, if k ≠ j, p_{t+1}(i,k) = p_t(i,k) E_k / M_k. 3. Repeat 1 and 2 for all classes C_j and all patterns i in P_j. Then each F_t is one cycle of iterative proportional fitting (e.g., Winkler 1989, 1993, =-=Meng and Rubin 1993-=-) and increases the likelihood. The last equation in step 2 assures that the new estimates add to a proper probability. If necessary, the procedure can be extended to general I-Projections that also in...

... P(γ | M) and P(γ | U) using the EM algorithm. The EM algorithm is useful because it provides a means of optimally separating M and U. Better separation between M and U is possible with a general EM (=-=Winkler 1989-=-, 1993, Larsen 1996) that does not use assumption CI. The advantage of assumption CI is that it yields computational speed-ups on orders between 100 and 10,000 in contrast to methods that use dependen...

...1997) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (=-=Belin and Rubin 1995-=-, Larsen and Rubin 2001). Keywords: likelihood ratio, Bayesian Nets, EM Algorithm 1. INTRODUCTION Record linkage is the science of finding matches or duplicates within or across files. Matches are typ...

...ble. Five conditions favor application of the unsupervised EM methods. The first is that the EM must be applied to sets of pairs in which the proportion of matches M is greater than 0.05 (see =-=Yancey 2002-=- for related work). The second is that one class (matches) must be relatively well-separated from the other classes. The third is that typographical error must be relatively low. For instance, if twen...

...U) using the EM algorithm. The EM algorithm is useful because it provides a means of optimally separating M and U. Better separation between M and U is possible with a general EM (Winkler 1989, 1993, =-=Larsen 1996-=-) that does not use assumption CI. The advantage of assumption CI is that it yields computational speed-ups on orders between 100 and 10,000 in contrast to methods that use dependencies between variab...

... DC 20233-9100 Although terminology differs, there is considerable overlap between record linkage methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (=-=Mitchell 1997-=-). Both are based on formal probabilistic models that can be shown to be equivalent in many situations (Winkler 2000). When no missing data are present in identifying fields and training data are avai...

...ta may help in finding better estimates of error rates. If high quality, current geographic identifiers are associated with records, then accounting for frequency may not help matching (Winkler 1989, =-=Yancey 2000-=-). Across larger geographic regions (e.g., an entire ZIP code or County or State), accounting for frequency may improve matching efficacy. 3. METHODS AND DATA Our main theoretical method is to use the...
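
Several of the excerpts above refer to estimating P(γ | M) and P(γ | U) with the EM algorithm under the conditional independence assumption (CI) when no training data are available. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each record pair is reduced to a tuple of 0/1 field agreements, and the function names, starting values, and simulated data are hypothetical choices made only for illustration.

```python
from collections import Counter
import random


def em_match_probabilities(patterns, n_iter=200, p=0.1,
                           m_init=0.9, u_init=0.1, eps=1e-6):
    """Unsupervised EM for a two-class (M, U) mixture under assumption CI.

    patterns: iterable of equal-length tuples of 0/1 agreement indicators.
    Returns (p, m, u): the estimated proportion of matches, and per-field
    estimates of P(agree | M) and P(agree | U).
    """
    counts = Counter(patterns)          # work with distinct patterns only
    n = sum(counts.values())
    k = len(next(iter(counts)))
    m = [m_init] * k
    u = [u_init] * k

    def clamp(x):
        # Keep probabilities away from 0 and 1 so likelihoods stay positive.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        match_mass = 0.0
        agree_m = [0.0] * k
        agree_u = [0.0] * k
        for gamma, c in counts.items():
            # E-step: posterior probability that this pattern is a match.
            like_m, like_u = p, 1.0 - p
            for j, a in enumerate(gamma):
                like_m *= m[j] if a else 1.0 - m[j]
                like_u *= u[j] if a else 1.0 - u[j]
            w = like_m / (like_m + like_u)
            match_mass += c * w
            for j, a in enumerate(gamma):
                if a:
                    agree_m[j] += c * w
                    agree_u[j] += c * (1.0 - w)
        # M-step: re-estimate parameters from the expected counts.
        p = clamp(match_mass / n)
        m = [clamp(a / match_mass) for a in agree_m]
        u = [clamp(a / (n - match_mass)) for a in agree_u]
    return p, m, u


def simulate_pairs(n, p, m, u, seed=0):
    """Draw synthetic agreement patterns that satisfy CI (illustration only)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


if __name__ == "__main__":
    pairs = simulate_pairs(20000, 0.2, [0.9, 0.85, 0.95, 0.8],
                           [0.1, 0.05, 0.2, 0.15], seed=1)
    p_hat, m_hat, u_hat = em_match_probabilities(pairs)
    print("proportion of matches:", round(p_hat, 3))
    print("P(agree | M):", [round(x, 3) for x in m_hat])
    print("P(agree | U):", [round(x, 3) for x in u_hat])
```

The starting values place M in the high-agreement class, which fixes the labeling of the two classes. The simulated data are deliberately favorable in the sense described in the excerpts: the proportion of matches is well above 0.05 and the two classes are well separated. Dropping CI, as in the general EM mentioned above, would require modeling interactions between fields and is not attempted here.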