
Research at St Andrews

Using metric space indexing for complete and efficient record linkage

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Record linkage is the process of identifying records that refer to the same real-world entities, in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their degree of similarity. Record linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing; pairs within the same group are then compared in more detail; and finally each pair is classified. Even state-of-the-art indexing techniques, such as Locality Sensitive Hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity. Conversely, they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing to perform complete record linkage, which results in a parameter-free record linkage process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. Our experimental evaluation on real-world datasets from several domains shows that linkage using metric space indexing can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
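
To illustrate the idea in the abstract, the following is a minimal Python sketch of linkage by exact range search in a metric space. It is not the authors' implementation: the pivot-table index, the choice of Levenshtein distance, the names `PivotIndex` and `link`, and the `radius` value are all assumptions made for illustration. The point it shows is that the triangle inequality lets the index discard candidates without ever missing a pair whose distance is within the radius.

```python
from typing import Callable, List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings; a true metric."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


class PivotIndex:
    """Hypothetical pivot-table index: stores each record's distance to a few pivots."""

    def __init__(self, records: Sequence[str], pivots: Sequence[str],
                 dist: Callable[[str, str], int]) -> None:
        self.dist = dist
        self.pivots = list(pivots)
        self.rows = [(r, [dist(r, p) for p in self.pivots]) for r in records]

    def range_query(self, query: str, radius: int) -> List[str]:
        """Return every indexed record within `radius` of `query` (exact, no misses)."""
        q_to_p = [self.dist(query, p) for p in self.pivots]
        hits = []
        for rec, r_to_p in self.rows:
            # Triangle inequality: |d(q,p) - d(r,p)| <= d(q,r).
            # If that lower bound already exceeds the radius, skip the comparison.
            if any(abs(qp - rp) > radius for qp, rp in zip(q_to_p, r_to_p)):
                continue
            if self.dist(query, rec) <= radius:
                hits.append(rec)
        return hits


def link(a_records: Sequence[str], b_records: Sequence[str],
         radius: int) -> List[Tuple[str, str]]:
    """Link each record in A to all records in B within the distance radius."""
    index = PivotIndex(b_records, b_records[:2], levenshtein)  # first two as pivots (assumption)
    return [(a, b) for a in a_records for b in index.range_query(a, radius)]


links = link(["jon smith", "mary jones"],
             ["john smith", "marie jones", "peter brown"], radius=2)
```

In this sketch, indexing, comparison and classification collapse into the single `range_query` call: any pair returned is a link, and any pair not returned is guaranteed to lie outside the radius. A production index such as a tree-based metric index would prune far more aggressively than this flat pivot table.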

Details

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining
Subtitle of host publication: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III
Editors: Dinh Phung, Vincent S. Tseng, Geoff Webb, Bao Ho, Mohadeseh Ganji, Lida Rashidi
Place of Publication: Cham
Publisher: Springer
Pages: 89-101
Number of pages: 12
ISBN (Electronic): 9783319930404
ISBN (Print): 9783319930398
DOIs
State: Published - 2018
Event: 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining - Melbourne, Australia
Duration: 3 Jun 2018 - 6 Jun 2018
Conference number: 22
http://prada-research.net/pakdd18/

Publication series

Name: Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence)
Publisher: Springer
Volume: 10939
ISSN (Print): 0302-9743

Conference

Conference: 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining
Abbreviated title: PAKDD 2018
Country: Australia
City: Melbourne
Period: 3/06/18 - 6/06/18

    Research areas

  • Entity resolution, Data matching, Similarity search, Blocking


Related by author

  1. Linking Scottish Vital Event Records Using Family Groups

    Akgün, Ö., Dearle, A., Kirby, G. N. C., Garrett, E., Dalton, T. S., Christen, P., Dibben, C. J. L. & Williamson, L. E. P. 15 Jan 2019 (Accepted/In press) In: Historical Methods: a Journal of Quantitative and Interdisciplinary History.

    Research output: Contribution to journal > Article

  2. Learning From Past Links: Understanding the Limits of Linkage Quality

    Akgun, O., Dearle, A., Garrett, E. & Kirby, G. N. C. 6 Sep 2017

    Research output: Contribution to conference > Abstract

  3. Probabilistic linkage of vital event records in Scotland using familial groups

    Akgun, O., Dalton, T. S., Dearle, A., Garrett, E. & Kirby, G. N. C. 11 May 2017

    Research output: Contribution to conference > Abstract
