What is record linkage

Record linkage (RL) refers to the task of linking records from different data sources. Specifically, the goal is to identify all the records (present in different sources) belonging to the same individuals. Different related data sources may contain records belonging to the same individuals. Records belonging to the same individual may not be identical since sometimes errors may be introduced unintentionally. As a result, the record linkage problem is very challenging. In the literature, record linkage is also referred to as Data linkage.

Our record linkage tools

We have developed two different algorithms for record linkage and both of them use hierarchical clustering. The first one is based on single linkage clustering and the latter one is based on complete linkage clustering. Currently the first one is available. We are still working on the complete linkage clustering based record linkage. More information can be found in the publication section.

How to use our record linkage tools

Our tools need some information about data sets provided by you. They use data files having .csv or .txt format. Therefore fields of records should be comma separated. Please download input_file_01.csv, input_file_02.csv, input_file_03.csv and input_file_04.csv. These files will serve as examples to understand the following terms.

  1. Number of input data sets: The number of data sets you want to link or merge. This is a required field. In this example we have used four datasets. Therefore we select 4 from the drop down menu.

  2. Select data files: Browse and upload your data files (these must be in .csv or .txt files). In our example, we browse and upload these 4 data files: input_file_01.csv, input_file_02.csv, input_file_03.csv and input_file_04.csv from our computer. You can choose any file names you want.

  3. Number of attribute types: This is the total number of distinct attributes present in all the input files collectively. In our example, input_file_01.csv and input_file_02.csv have ID, FirstName, LastName, SSN, DateOfBirth, Gender, SchoolID, MiddleName and SSID as attributes; input_file_03.csv has ID, FirstName, LastName and SSN as attributes; input_file_04.csv has ID, FirstName, LastName, DateOfBirth, Gender, SchoolID, MiddleName and SSID as attributes. Totally, we have 9 distinct attribute types, namely, ID, FirstName, LastName, SSN, DateOfBirth, Gender, SchoolID, MiddleName and SSID. So we select 9 as the number of attributes.
    Our single linkage clustering tool works on attributes that are common to all the input files. In our example, these attributes are ID, FirstName, and LastName.

  4. Names of attribute fields: Input distinct names for the unique attributes. In our example, we use ID, FirstName, LastName, SSN, DateOfBirth, Gender, SchoolID, MiddleName and SSID as names for these attributes. These attributes are used in the output file.

  5. Indices of attribute fields: This input connects the attributes present in each file with the unique attribute names input in the previous line. Indices start with 0 and end with (total number of attributes - 1). For our example: input_file_01.csv and input_file_02.csv should have 0,1,2,3,4,5,6,7,8. But input_file_03.csv should have 0,1,2,3,-1,-1,-1,-1,-1 as ID is at the 0th index, FirstName is at the 1st, LastName is at the 2nd and SSN is at the 3rd column of this file. No other attribute is present in this file. In the same way input_file_04.csv should have 0,1,2,-1,3,4,5,6,7. This file has no SSN column. So we put -1 in that field. DateOfBirth is in the 3rd column, Gender is in the 4th column, and so on.

  6. Number of comparisons: The number of comparisons you want. Given two records, the similarity (or the difference) between them is computed using the similarity (or difference) between some of the common attributes. In this input the user specifies the number of attributes that should be employed in similarity (or difference) computation. Distance between two attributes can be computed in a number of ways. Each such way is referred to as a distance metric. For our example, we have included 2 comparisons. More explanation is in the following point.

  7. Comparison methods: In this input the user specifies the distance metric to be used for every comparison. There are at most 3 fields for each comparison. The first one is the comparison method (i.e., the distance metric). Second one is the index of the field. The index should follow the ordering of the names of attribute fields you have input before. The third field (truncation count) is only applicable to Truncation Distance method.
    We employ 3 types of comparison methods. Edit distance method calculates the minimum number of operations required to transform one string to another. For example, if A="school" and B="shooms", we see that B has 'c' missing at the 1st index, 'l' has been changed to 'm' at index 4 and 's' has been added at the end. So if we insert 'c', change 'm' to 'l' and delete 's', then we recover the string "school". So the edit distance between A and B is 3. Another method is reversal distance method in which 2 attribute indices are used. For example if A="abdullah,mamun" and B="mamun,abdullah", the edit distance between A and B is large. In reversal distance calculation method, we first calculate the edit distance between A="abdullah,mamun" and B="mamun,abdullah" and then between A="abdullah,mamun" and B="abdullah,mamun" and take the minimum distance. So it finds no error here. Sometimes the first name and the last name are interchanged mistakenly. Reversal distance method solves this problem. The last method is the truncation distance method. This is the same as the edit distance method. But it only considers the amount of truncation count characters. Assume that A="jones" and B="jon" and the truncation count is 2. Then this method calculates the edit distance between A="jo" and B="jo".
    In our example we have used 2 comparison methods. You may use any number of comparison methods. We have used the edit distance on last name column (index 2) and the truncation method with truncation count 1 on first name (index 1) as some records only have first name initials.

  8. Block field index: Blocking is done to speedup the computation. Blocking can be thought of as a preliminary coarse clustering. The idea is to use one of the attributes for blocking. We generate every k-mer of this attribute and hash the records based these k-mers. A reliable common attribute should be used for blocking. Block field (i.e., attribute) is important in our algorithm for clustering efficiently. We have used LastName as the block field whose index is 2. In this input the user specifies the index of the attribute that has to be used for blocking.

  9. Block field length: This is an optional field. In this input the user specifies the value of k. This field is necessary for better performance. If you do not input any length, the default value is 7. We have also used 5 as the block field length in our example.

  10. Threshold value: This is an optional field. This threshold specifies when two records should be treated as the same. Specifically, this input is an integer value which indicates how many errors you tolerate (between two 'identical' records). If you input 0, then the tool finds exact matches. For real life applications, 1 or 2 is enough. If you do not provide any value, we select 1 as the default value. In our example, we have used 1 as the threshold value.

  11. Please enter your valid email address which is required to inform you when the output is ready. Filling out other user information is highly appreciated.

How to find output

We will send a download link for the output file to your provided email address. Please check this output file for the above four input data files and discussed configuration. You may also track your submission history.

Recommend our online tools