Journal: Journal of Data and Information Quality [Association for Computing Machinery] Date: 2025-03-15
Identifier
DOI:10.1145/3721985
Abstract
Entity resolution is the problem of identifying records that refer to the same entity within one database or across multiple databases. Applications of entity resolution range from health and social science research to national security and online commerce. Entity resolution can be viewed as a classification task where pairs of records are classified as matches (referring to the same entity) or non-matches (referring to different entities). Alternatively, clustering-based entity resolution methods generate clusters of records such that each cluster refers to one entity, and each entity is represented by one cluster. If ground truth data in the form of known matches and non-matches are available, then performance measures such as precision, recall, and the F-measure are commonly used to evaluate the quality of entity resolution methods. In practical applications, however, ground truth data are often unavailable, incomplete, or biased, making quality evaluation challenging. To address this gap, we develop multiple methods to evaluate the quality of an entity resolution result without the need for ground truth data by estimating the numbers of true matches, false matches, and missed matches. These estimates allow the calculation of estimated precision, recall, and F-measure. Our methods are based on analysing either the links (classified record pairs) or the clustering structure provided by an entity resolution method. We validate our methods on multiple data sets from diverse domains, showing that they can obtain precision and recall estimates close to their true values.
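As an illustration of the last step only, the sketch below shows how precision, recall, and the F-measure follow from counts of true, false, and missed matches using the standard definitions. It does not reproduce the paper's estimation methods, which the abstract does not detail; the function name `quality_estimates` and the input counts are hypothetical and chosen purely for illustration.

```python
def quality_estimates(true_matches: float,
                      false_matches: float,
                      missed_matches: float) -> dict:
    """Compute precision, recall, and F-measure from match counts.

    The counts may be estimates (hence floats). How they are estimated
    without ground truth is NOT shown here; that is an assumption left
    to whichever estimation method is used.
    """
    predicted = true_matches + false_matches   # all pairs classified as matches
    actual = true_matches + missed_matches     # all pairs that truly match

    # Guard against division by zero when no matches exist.
    precision = true_matches / predicted if predicted > 0 else 0.0
    recall = true_matches / actual if actual > 0 else 0.0

    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure}


if __name__ == "__main__":
    # Hypothetical estimated counts, not taken from the paper.
    print(quality_estimates(true_matches=90, false_matches=10, missed_matches=30))
```

With these hypothetical counts, precision is 90 / (90 + 10) and recall is 90 / (90 + 30); the F-measure is their harmonic mean.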