Bookmark

Computer Science > Databases

Title:
A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution

Abstract: Entity resolution (ER) is the task of identifying all records in a database
that refer to the same underlying entity, and are therefore duplicates of each
other. Due to inherent ambiguity of data representation and poor data quality,
ER is a challenging task for any automated process. As a remedy, human-powered
ER via crowdsourcing has become popular in recent years. Using crowd to answer
queries is costly and time consuming. Furthermore, crowd-answers can often be
faulty. Therefore, crowd-based ER methods aim to minimize human participation
without sacrificing the quality and use a computer generated similarity matrix
actively. While, some of these methods perform well in practice, no theoretical
analysis exists for them, and further their worst case performances do not
reflect the experimental findings. This creates a disparity in the
understanding of the popular heuristics for this problem. In this paper, we
make the first attempt to close this gap. We provide a thorough analysis of the
prominent heuristic algorithms for crowd-based ER. We justify experimental
observations with our analysis and information theoretic lower bounds.