knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(diyar)
Consolidating information from multiple sources is often the first step in many investigations.
This vignette gives a brief introduction to the basics of record linkage as implemented by diyar.
Let's begin by reviewing missing_staff_id, a sample dataset containing incomplete staff information.
data(missing_staff_id)
missing_staff_id
A unique identifier that distinguishes one entity (staff) from another is often unavailable or incomplete, as is the case with staff_id in this example.
links() can be used to create one.
The identifier is created as an S4 class (pid) with useful information about each group in its slots.
The simplest strategy would be to select one attribute as a distinguishing characteristic for each entity. This is the simple deterministic approach to record linkage.
In the example below, we use initials and hair_colour as distinguishing characteristics.
missing_staff_id$p1 <- links(criteria = missing_staff_id$initials)
missing_staff_id$p2 <- links(criteria = missing_staff_id$hair_colour)
missing_staff_id[c("initials", "hair_colour", "p1", "p2")]
Unsurprisingly, the uniqueness of identifiers p1 and p2 corresponds to the uniqueness of initials and hair_colour respectively.
The two identifiers give different outcomes: p1 identifies records 3 and 4 as the same person, while p2 identifies records 4 and 5 as the same person.
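As a rough illustration of the idea, and not of how links() is implemented, exact matching on a single attribute can be sketched in base R. The values below are hypothetical, and giving every missing value its own group is an assumption made for this sketch.

```r
# Hypothetical toy data; these are not the values in missing_staff_id
initials <- c("G.D.", "B.G.", "X.P.", "X.P.", NA, "G.D.")

# Records with the same value share a group id
group_id <- match(initials, unique(initials))

# Assumption: a missing value matches nothing, so each NA gets its own id
is_na <- is.na(initials)
group_id[is_na] <- max(group_id[!is_na]) + seq_len(sum(is_na))

data.frame(initials, group_id)
```

Records 1 and 6, and records 3 and 4, end up in the same group because their initials are identical.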
To maximise coverage, links() can implement an ordered multistage deterministic approach to record linkage. For example, we can say that records with matching initials should be linked to each other first, and that other records with a matching hair_colour should then be added to each group. This is referred to as group expansion.
missing_staff_id$p3 <- links(
  criteria = as.list(missing_staff_id[c("initials", "hair_colour")])
)
missing_staff_id[c("initials", "hair_colour", "p1", "p2", "p3")]
We see that p3 now identifies records 3, 4 and 5 as the same person.
The logic here is that since record 4 has the same initials as record 3 and also has the same hair_colour as record 5, all three are linked as part of the same entity.
Note that records 3 and 5 have only been linked due to their shared link with record 4. If record 4 is removed from this dataset, and the analysis repeated, records 3 and 5 will not be linked and therefore remain separate entities.
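The chaining described above can be sketched in base R. This is a simplified illustration under stated assumptions: link_stages() is a hypothetical helper, the data are made up, and the sketch ignores any priority or tie-breaking rules that links() applies during group expansion.

```r
# Hypothetical toy records: 3 and 4 share initials, 4 and 5 share hair colour
initials    <- c("A.B.", "C.D.", "E.F.", "E.F.", "G.H.")
hair_colour <- c("red", "black", "brown", "green", "green")

# Hypothetical helper: merge groups stage by stage
link_stages <- function(...) {
  attrs <- list(...)
  id <- seq_along(attrs[[1]])           # every record starts in its own group
  for (a in attrs) {                    # one stage per attribute, in order
    for (v in unique(a[!is.na(a)])) {
      members <- which(a == v)          # records sharing this value
      merged <- id %in% id[members]     # all groups those records belong to
      id[merged] <- min(id[merged])     # merge them under the lowest id
    }
  }
  match(id, unique(id))                 # relabel groups as 1, 2, 3, ...
}

p_all <- link_stages(initials, hair_colour)
p_without_4 <- link_stages(initials[-4], hair_colour[-4])
p_all
p_without_4
```

With all five records, records 3, 4 and 5 share a group. With record 4 dropped, the remaining records 3 and 5 have nothing in common and stay in separate groups.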
During group expansion the following rules are applied.
tie_sort argument.
At each stage, additional match criteria can be specified.
This is done through a sub_criteria object.
This is an S3 class containing attributes to be compared and functions for the comparisons.
A sub_criteria object is used for evaluated, fuzzy and/or nested matches.
For example, we can compare hair_colour and branch_office without giving either one priority over the other.
This is the equivalent of requiring a matching hair colour OR branch office (scri_1 below), or a matching hair colour AND branch office (scri_2 below).
scri_1 <- sub_criteria(
  missing_staff_id$hair_colour, missing_staff_id$branch_office,
  operator = "or"
)
scri_2 <- sub_criteria(
  missing_staff_id$hair_colour, missing_staff_id$branch_office,
  operator = "and"
)
missing_staff_id$p4 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = scri_1),
  recursive = TRUE
)
missing_staff_id$p5 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = scri_2),
  recursive = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5")]
There is no limit to the number of sub_criteria that can be specified, but each sub_criteria must be paired to a criteria.
Any unpaired sub_criteria will be ignored.
As mentioned, a sub_criteria can be nested.
For example, scri_3 below is the equivalent of saying (scri_1; matching hair colour OR branch office) AND (matching initials OR branch office).
scri_3 <- sub_criteria(
  scri_1,
  sub_criteria(
    missing_staff_id$initials, missing_staff_id$branch_office,
    operator = "or"
  ),
  operator = "and"
)
missing_staff_id$p6 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = scri_3),
  recursive = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6")]
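The nested logic can be checked by hand with plain logical vectors. The sketch below uses hypothetical comparison results for three record-pairs. It only illustrates the boolean expression and does not call sub_criteria().

```r
# Hypothetical comparison results for three record-pairs
same_hair     <- c(TRUE,  FALSE, FALSE)
same_office   <- c(FALSE, FALSE, TRUE)
same_initials <- c(FALSE, TRUE,  FALSE)

# Same logic as scri_1: hair colour OR branch office
scri_1_like <- same_hair | same_office

# Same logic as scri_3: scri_1 AND (initials OR branch office)
scri_3_like <- scri_1_like & (same_initials | same_office)

data.frame(same_hair, same_office, same_initials, scri_1_like, scri_3_like)
```

Only the third pair satisfies both halves of the nested condition, because a shared branch office satisfies each inner OR.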
Evaluated matches can be implemented with user-defined functions. The only requirement is that they take two arguments, x and y, where y is the value for one observation and x is the values of all other observations it is compared against, and that they return TRUE or FALSE.
For example, there are variations of the same hair_colour and branch_office values in missing_staff_id.
A quick look shows that comparing only the last word of each value will improve the linkage result.
We can create and pass a function to the sub_criteria object that will make this comparison. After doing this below (p7), we see that record 6 has now been linked with records 1 and 7, which was not the case earlier.
# A function to extract the last word in a string
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
# A logical test using `last_word_wf`.
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)

scri_4 <- sub_criteria(
  missing_staff_id$hair_colour, missing_staff_id$branch_office,
  match_funcs = c(last_word_cmp, last_word_cmp),
  operator = "or"
)
missing_staff_id$p7 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = scri_4),
  recursive = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6", "p7")]
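To see what a comparison function of this kind returns, it can be called directly on a few values. The sketch below repeats the same last-word logic under the hypothetical names last_word() and same_last_word(), with made-up strings rather than values from missing_staff_id.

```r
# Same logic as the comparison above, under hypothetical names
last_word <- function(x) tolower(gsub("^.* ", "", x))
same_last_word <- function(x, y) last_word(x) == last_word(y)

# Made-up values to compare element by element
x <- c("Dark Brown", "light brown", "Green", "Green")
y <- c("brown", "Brown", "Dark Green", "Blue")

res <- same_last_word(x, y)
data.frame(x, y, match = res)
```

The function is vectorised, so it returns one TRUE or FALSE per pair of values. Differences in case and in leading words are ignored.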
A sub_criteria can provide a lot of flexibility in how attributes are compared; however, it comes at the cost of processing time. This is because links() is an iterative function, comparing batches of record-pairs in iterations.
This generally leads to a lower maximum memory usage but longer run times needed to analyse the multiple batches. There are three modes of a batched analysis with links(): "yes", "semi" and "no". These help manage the maximum memory usage or the maximum number of iterations needed to complete the analysis.
For instance, below is a match criterion for a rolling match of records that are no more than two days apart, that is, within the same three-day window. With print(), we can see the record-pair batches compared at each iteration.
dfr <- data.frame(x = 1:5)
roll_window_funx <- function(x, y){
  match <- abs(x - y) <= 2
  print(data.frame(y, x, match))
  cat("\n")
  return(match)
}
roll_window_scri <- sub_criteria(
  dfr$x,
  match_funcs = roll_window_funx
)
With the "yes" option, the linkage takes 5 iterations (run time) but only 5 record-pairs (maximum memory usage) are compared at each iteration.
dfr$b.p1 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = roll_window_scri),
  batched = "yes",
  recursive = TRUE
)
Conversely, the "no" option completes the linkage in 1 iteration but creates 15 record-pairs in that single iteration.
dfr$b.p2 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = roll_window_scri),
  batched = "no"
)
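One way to arrive at 15 record-pairs from 5 records is to count every unordered pair, including each record paired with itself. Whether links() enumerates pairs in exactly this way is an assumption. The base R sketch below only illustrates the count and the rolling-window test.

```r
x <- 1:5

# Every unordered pair of records, including each record with itself
idx <- which(upper.tri(diag(length(x)), diag = TRUE), arr.ind = TRUE)
pairs <- data.frame(x = x[idx[, "row"]], y = x[idx[, "col"]])

# The same rolling-window test used above
pairs$match <- abs(pairs$x - pairs$y) <= 2

n_pairs <- nrow(pairs)          # number of record-pairs held at once
n_matches <- sum(pairs$match)   # pairs that pass the test
pairs
```

Holding all pairs at once trades memory for fewer passes, which is the trade-off the batched argument controls.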
The "semi" option is a balance between the "yes" and "no" options. The number of record-pairs increases as matches are identified. This generally leads to a lower maximum memory usage than the "no" option and fewer iterations than the "yes" option.
dfr$b.p3 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = roll_window_scri),
  batched = "semi",
  recursive = TRUE
)
dfr
There are variations of links() such as links_wf_probabilistic(), links_af_probabilistic() and links_wf_episodes() for specific use cases such as probabilistic record linkage and grouping temporal events.
The implementation of probabilistic record linkage is based on the Fellegi and Sunter (1969) model for deciding if two records belong to the same entity. In summary, m_probabilities and u_probabilities, which are the probabilities of a true and false match respectively, are used to calculate a final match score for each record-pair. Record-pairs with a score above or below a certain score_threshold are considered matches or non-matches respectively. See help(links_wf_probabilistic) for a more detailed explanation of the method. Below we see the same analysis as above but as a probabilistic record linkage.
missing_staff_id$p9 <- links_wf_probabilistic(
  attribute = list(missing_staff_id$hair_colour, missing_staff_id$branch_office),
  cmp_func = c(last_word_cmp, last_word_cmp),
  probabilistic = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p7", "p9")]
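The scoring idea can be sketched with the textbook form of the Fellegi and Sunter weights. This is an illustration under stated assumptions and not diyar's code: fs_score() is a hypothetical helper, the probabilities and threshold are made up, and the use of base-2 logarithms is an assumption.

```r
# Hypothetical m- and u-probabilities for two attributes
m <- c(hair_colour = 0.95, branch_office = 0.90)
u <- c(hair_colour = 0.20, branch_office = 0.10)

# Hypothetical helper: sum the per-attribute weights for one record-pair
fs_score <- function(agree, m, u) {
  w_agree    <- log2(m / u)               # weight when an attribute agrees
  w_disagree <- log2((1 - m) / (1 - u))   # weight when it disagrees
  sum(ifelse(agree, w_agree, w_disagree))
}

score_both <- fs_score(c(TRUE, TRUE), m, u)     # both attributes agree
score_one  <- fs_score(c(TRUE, FALSE), m, u)    # only hair_colour agrees
score_none <- fs_score(c(FALSE, FALSE), m, u)   # neither agrees

threshold <- 0   # hypothetical cut-off
c(both = score_both, one = score_one, none = score_none) >= threshold
```

Agreement on an attribute raises the score and disagreement lowers it, so pairs that agree on more attributes are more likely to clear the threshold.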