Research at St Andrews

Record linking using metric space similarity search

Research output: Contribution to conference (Abstract)

Standard

Record linking using metric space similarity search. / Dearle, Alan; Kirby, Graham Njal Cameron; Akgun, Ozgur; Dalton, Thomas Stanley.

2017. Abstract from UK Administrative Data Research Network Annual Research Conference, Edinburgh, United Kingdom.

Harvard

Dearle, A, Kirby, GNC, Akgun, O & Dalton, TS 2017, 'Record linking using metric space similarity search', UK Administrative Data Research Network Annual Research Conference, Edinburgh, United Kingdom, 1/06/17 - 2/06/17.

APA

Dearle, A., Kirby, G. N. C., Akgun, O., & Dalton, T. S. (2017). Record linking using metric space similarity search. Abstract from UK Administrative Data Research Network Annual Research Conference, Edinburgh, United Kingdom.

Vancouver

Dearle A, Kirby GNC, Akgun O, Dalton TS. Record linking using metric space similarity search. 2017. Abstract from UK Administrative Data Research Network Annual Research Conference, Edinburgh, United Kingdom.

Author

Dearle, Alan ; Kirby, Graham Njal Cameron ; Akgun, Ozgur ; Dalton, Thomas Stanley. / Record linking using metric space similarity search. Abstract from UK Administrative Data Research Network Annual Research Conference, Edinburgh, United Kingdom.

BibTeX

@conference{5caabedd32764dbbbe8c81724c8cd9e2,
title = "Record linking using metric space similarity search",
abstract = "Record linking often employs blocking to reduce the computational complexity of full pairwise comparison. A key is formed from a subset of record attributes. Those records with the same key values are blocked together for detailed comparison. Use of a single blocking key fails to detect many true matches if records contain missing values or errors, since only those records with the same key values are compared. To address missing values, it is common to repeat the matching process using multiple blocking keys, to match records that are identical in a subset of the fields. The presence of erroneous values may be addressed by blocking using key values mapped to a canonical form (e.g. Soundex). However, this does not address other problems such as single digit transcription errors in dates. Blocking is used to categorise records that are candidate matches, in preparation for a pairwise comparison phase which may use various distance metrics, depending on the domain of the values being compared. Each blocking process defines a partition of records. The comparison operations are only applied to pairs of records within the same category. In some contexts, it may be useful to have flexible control over the precision/recall trade-off, depending on the intended use for the matched data, and the degree of conservatism required of the identified links. With blocking, this flexibility is limited by the number of sensible blocking keys that can be identified. In this talk, we describe experiments with a technique based on similarity searching over metric spaces, which appears to offer greater flexibility, and describe some preliminary results using an historic Scottish dataset.",
keywords = "record linkage",
author = "Alan Dearle and Kirby, {Graham Njal Cameron} and Ozgur Akgun and Dalton, {Thomas Stanley}",
year = "2017",
month = "4",
day = "2",
language = "English",
note = "UK Administrative Data Research Network Annual Research Conference : Social science using administrative data for public benefit, ADRN2017 ; Conference date: 01-06-2017 Through 02-06-2017",
url = "http://www.adrn2017.net",

}

RIS (suitable for import to EndNote)

TY - CONF

T1 - Record linking using metric space similarity search

AU - Dearle,Alan

AU - Kirby,Graham Njal Cameron

AU - Akgun,Ozgur

AU - Dalton,Thomas Stanley

PY - 2017/4/2

Y1 - 2017/4/2

N2 - Record linking often employs blocking to reduce the computational complexity of full pairwise comparison. A key is formed from a subset of record attributes. Those records with the same key values are blocked together for detailed comparison. Use of a single blocking key fails to detect many true matches if records contain missing values or errors, since only those records with the same key values are compared. To address missing values, it is common to repeat the matching process using multiple blocking keys, to match records that are identical in a subset of the fields. The presence of erroneous values may be addressed by blocking using key values mapped to a canonical form (e.g. Soundex). However, this does not address other problems such as single digit transcription errors in dates. Blocking is used to categorise records that are candidate matches, in preparation for a pairwise comparison phase which may use various distance metrics, depending on the domain of the values being compared. Each blocking process defines a partition of records. The comparison operations are only applied to pairs of records within the same category. In some contexts, it may be useful to have flexible control over the precision/recall trade-off, depending on the intended use for the matched data, and the degree of conservatism required of the identified links. With blocking, this flexibility is limited by the number of sensible blocking keys that can be identified. In this talk, we describe experiments with a technique based on similarity searching over metric spaces, which appears to offer greater flexibility, and describe some preliminary results using an historic Scottish dataset.

AB - Record linking often employs blocking to reduce the computational complexity of full pairwise comparison. A key is formed from a subset of record attributes. Those records with the same key values are blocked together for detailed comparison. Use of a single blocking key fails to detect many true matches if records contain missing values or errors, since only those records with the same key values are compared. To address missing values, it is common to repeat the matching process using multiple blocking keys, to match records that are identical in a subset of the fields. The presence of erroneous values may be addressed by blocking using key values mapped to a canonical form (e.g. Soundex). However, this does not address other problems such as single digit transcription errors in dates. Blocking is used to categorise records that are candidate matches, in preparation for a pairwise comparison phase which may use various distance metrics, depending on the domain of the values being compared. Each blocking process defines a partition of records. The comparison operations are only applied to pairs of records within the same category. In some contexts, it may be useful to have flexible control over the precision/recall trade-off, depending on the intended use for the matched data, and the degree of conservatism required of the identified links. With blocking, this flexibility is limited by the number of sensible blocking keys that can be identified. In this talk, we describe experiments with a technique based on similarity searching over metric spaces, which appears to offer greater flexibility, and describe some preliminary results using an historic Scottish dataset.

KW - record linkage

M3 - Abstract

ER -
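
Illustrative sketch (not part of the published abstract)

The contrast the abstract draws between exact-key blocking and similarity search over a metric space can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sample records, the three-field layout, the summed Levenshtein distance, and the single-pivot index are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (a true metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def record_distance(r, s) -> int:
    """Sum of per-field edit distances; a sum of metrics is itself a metric."""
    return sum(levenshtein(x, y) for x, y in zip(r, s))


# Hypothetical records: (surname, forename, date of birth).
records = [
    ("macdonald", "john", "1861-03-14"),
    ("macdonald", "john", "1861-03-17"),  # single digit transcription error
    ("mcdonald", "john", "1861-03-14"),   # spelling variant
    ("fraser", "mary", "1870-11-02"),
]

# Blocking on an exact (surname, date) key: only records sharing the key
# are ever compared, so both near-duplicates of record 0 are missed.
blocks = defaultdict(list)
for idx, rec in enumerate(records):
    blocks[(rec[0], rec[2])].append(idx)
blocked_pairs = {pair for members in blocks.values()
                 for pair in combinations(members, 2)}


def range_search(query, data, pivot, pivot_dists, threshold):
    """Return indices of records within `threshold` of `query`.

    The triangle inequality gives d(query, r) >= |d(query, pivot) - d(r, pivot)|,
    so a record can be skipped without computing its distance to the query
    whenever that lower bound already exceeds the threshold.
    """
    dq = record_distance(query, pivot)
    hits = []
    for i, (rec, dp) in enumerate(zip(data, pivot_dists)):
        if abs(dq - dp) > threshold:
            continue  # pruned by the pivot bound
        if record_distance(query, rec) <= threshold:
            hits.append(i)
    return hits


# Assumed pivot choice: the last record. Distances to it are computed once.
pivot = records[3]
pivot_dists = [record_distance(pivot, rec) for rec in records]

print(blocked_pairs)                                            # set()
print(range_search(records[0], records, pivot, pivot_dists, 1))  # [0, 1, 2]
print(range_search(records[0], records, pivot, pivot_dists, 0))  # [0]
```

Raising or lowering `threshold` widens or narrows the candidate set, which is one way to obtain the kind of control over the precision/recall trade-off that the abstract refers to, without having to define additional blocking keys.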

Related by author

  1. Probabilistic linkage of vital event records in Scotland using familial groups

    Akgun, O., Dalton, T. S., Dearle, A., Garrett, E. & Kirby, G. N. C. 11 May 2017

    Research output: Contribution to conference (Abstract)

  2. Evaluating population data linkage: assessing stability, scalability, resilience and robustness across many data sets for comprehensive linkage evaluation

    Dalton, T. S., Akgun, O., Al-Sediqi, A., Christen, P., Dearle, A., Garrett, E., Gray, A., Kirby, G. N. C. & Reid, A. 2 Apr 2017

    Research output: Contribution to conference (Abstract)

  3. An identifier scheme for the Digitising Scotland project

    Akgun, O., Al-Sidiqi, A., Christen, P., Dalton, T. S., Dearle, A., Dibben, C. J. L., Garrett, E., Gray, A., Kirby, G. N. C. & Reid, A. 2 Apr 2017

    Research output: Contribution to conference (Abstract)

ID: 250036196