knitr::opts_chunk$set( collapse = T, comment = "#>" )
plot_pid <- diyar:::plot_pid
Linking datasets to consolidate information is a common task in research, particularly in those using "big data". Deterministic record linkage is the simplest and most common method of doing this. However, its accuracy relies on data quality. links()
provides a convenient and flexible approach of doing this while handing missing or incorrect data.
links()
can be used to link datasets in a variety of ways. Examples include;
criteria
criteria
and one (or more) matching sub_criteria
. See record matchingData matching is done in ordered stages. A match at each stage is considered more relevant than those at subsequent stages. Matched records are assigned a record group identifier. This is the record identifier (sn
) of one of the matching records therefore, a familiar sn
can prove useful.
Each matching stage has an attribute or characteristic to match (criteria
). These must be passed as a list
of atomic
vectors.
Additional matching conditions (sub_criteria
) can also be specified at each stage. This is done by supplying a list
of sub_crieria()
. Each sub_criteria()
must be paired to a stage (criteria
). This is done by naming each element in the list, using the syntax - "cr[n]"
. Where n
is the corresponding stage (criteria
) of interest (See examples). Any unpaired sub_criteria
will be ignored. Records will only be assigned to the same group if they have the same criteria
, and at least one matching attribute in each sub_criteria()
. See examples below.
library(diyar); data(patient_list); dbs <- patient_list[c("forename","surname","sex")]; dbs # 1 stage <- Matching surname only dbs$pids_a <- links(criteria = dbs$surname, display = "none", group_stats = T) # 2 stage - Matching surname, then matching sex dbs$pids_b <- links(criteria = list(dbs$surname, dbs$sex), display = "none") dbs
Matched records are assigned to record groups. These are stored as pid
objects (S4
object class). A pid
object is a group identifier with slots
for additional information about each group.
For a pid
object created in multiple stages (criteria
), the following information will be displayed (format.pid
);
r substr(format(dbs$pids_a[1]),1,3)
- group identifierr substr(format(dbs$pids_a[1]),6,12)
- stage when that record was linkedSee ?links
for further details.
to_df() transforms pid
objects to data.frames
.
cat("to_df(`pid`)") to_df(dbs$pids_a)
Attributes are checked for a case sensitive exact match. Those supplied via a sub_criteria()
can also be compared by any logical
test applicable to two atomic
vectors. The logical test must be supplied as a function
. This function must have two named arguments - x
and y
. Where y
is the value for one observation to be compared against all other observations (x
). See some examples below.
dbs2 <- patient_list[c("forename","surname","sex")]; dbs2 # Matching `sex`, and `forename` initials lgl_1 <- function(x, y) substr(x, 1, 1) == substr(y, 1, 1) dbs2$pids_1 <- links(criteria = list(dbs2$sex), sub_criteria = list( cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_1)), display = "none") # Matching `sex`, and `forename` with character length of 5 lgl_2 <- function(x, y) nchar(x) == nchar(y) & nchar(x) == 5 dbs2$pids_2 <- links(criteria = list(dbs2$sex), sub_criteria = list( cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_2)), display = "none") # Matching `sex`, and `forename` that ends with "y" lgl_3 <- function(x, y) substr(x, length(x) - 1, length(x)) == "y" dbs2$pids_3 <- links(criteria = list(dbs2$sex), sub_criteria = list( cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_3)), display = "none") # Matching `sex` and at least one `forename` that ends with "y" lgl_4 <- function(x, y) substr(y, length(y) - 1, length(y)) == "y" dbs2$pids_4 <- links(criteria = list(dbs2$sex), sub_criteria = list( cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_4)), display = "none")
User-defined logical tests can be very useful for implementing complex matching conditions. For example, you can match observations if they are within a range or list of values. See the example below.
Opes_c <- Opes[c("date_of_birth", "name", "hair_colour")] Opes_c <- Opes["date_of_birth"] # approximate age Opes_c$age <- as.numeric(round((Sys.Date() - as.Date(Opes_c$date_of_birth, "%d/%m/%Y"))/365.5)) # Match individuals between ages 35 and 55 rng1 <- function(x, y) x %in% 35:55 Opes_c$pids_a <- links(criteria = "place_holder", sub_criteria = list(cr1 = sub_criteria(Opes_c$age, match_funcs = rng1)), display ="none") # Match individuals with a 5-year age gap between themselves rng2 <- function(x, y) abs(y - x) %in% 0:5 Opes_c$pids_b <- links(criteria = "place_holder", sub_criteria = list(cr1 = sub_criteria(Opes_c$age, match_funcs = rng2)), display ="none") # Match individuals with more than a 5-year age gap between each other rng3 <- function(x, y) (y - x) > 5 Opes_c$pids_c <- links(criteria = "place_holder", sub_criteria = list(cr1 = sub_criteria(Opes_c$age, match_funcs = rng3)), display ="none") Opes_c[c("age","pids_a", "pids_b", "pids_c")]
links()
goes through repeated checks for matches based on the logical test specified. At each iteration, one observation (y
) is compared with every other observation (x
). As a result, two observations (i.e. x[2]
and x[3]
) can pass the same logical test compared with a third observation (i.e. y
). But not pass the same test when compared with each other. As an example, in Opes_c$pids_c
above, records 1, 2, 4, 5 and 8 are within a 5-year gap of each other. This has happened because there's more than a 5-year gap between themselves and record 7.
Always aim for the combination of conditions that'll give more true positive matches. Linking records on matching surnames and then sex (pid_b
) led to a different result. Compared to linking on sex before surnames (pid_c
below) (See group expansion). In pid_b
, a match on the individual's surname was considered more relevant than a match on their sex. The opposite was the case for pid_c
.
dbs$pids_c <- links(criteria = list(dbs$sex, dbs$surname), display = "none") dbs
Both results are logically correct but pids_b
is not the most practical option. For instance, records 3 and 6 which were linked but could be cousins and not the same individual. A better combination would be forename at stage 1, followed by surname and sex at stage 2. See pid_d
below;
dbs_2 <- patient_list; dbs_2 dbs_2$cri_2 <- paste(dbs_2$surname, dbs_2$sex,sep="-") dbs_2$pid_d <- links(sn = dbs_2$rd_id, list(dbs_2$forename, dbs_2$cri_2), display = "none") dbs_2
plot_pid(pid = dbs_2$pid_d)
schema(dbs_2$pid_d)
Below are additional examples using different combinations of the same criteria
and sub_criteria
data(Opes); Opes # 1 stage linkage # stage 1 - name AND (department OR hair_colour OR date_of_birth) Opes$pids_a <- links(criteria = Opes$name, sub_criteria = list(cr1 = sub_criteria(Opes$department, Opes$hair_colour, Opes$date_of_birth)), display = "none" ) Opes[c("name","department","hair_colour","date_of_birth","pids_a")] # 1 stage linkage # stage 1 - name AND ((department OR hair_colour) AND (date_of_birth)) Opes$pids_b <- links(criteria = Opes$name, sub_criteria = list(cr1 = sub_criteria(Opes$department, Opes$hair_colour), cr1 = sub_criteria(Opes$date_of_birth)), display = "none") Opes[c("name","department","hair_colour","date_of_birth","pids_b")] # 1 stage linkage # stage 1 - name AND ((department OR hair_colour) AND (dd-mm OR dd-yyyy OR mm-yyyy)) Opes$pids_c <- links(criteria = Opes$name, sub_criteria = list(cr1 = sub_criteria(Opes$department, Opes$hair_colour), cr1 = sub_criteria(Opes$db_pt1, Opes$db_pt2, Opes$db_pt3)), display = "none") Opes[c("name","department","hair_colour","date_of_birth","pids_c")] # 1 stage linkage # stage 1 - name AND ((department) AND (hair_colour) AND (dd-mm OR dd-yyyy OR mm-yyyy)) Opes$pids_d <- links(criteria = Opes$name, sub_criteria = list(cr1 = sub_criteria(Opes$department), cr1 = sub_criteria(Opes$hair_colour), cr1 = sub_criteria(Opes$db_pt1, Opes$db_pt2, Opes$db_pt3)), display = "none") Opes[c("name","department","hair_colour","date_of_birth","pids_d")]
By default, matches at different stages are linked together. The more stages (criteria
) there are in an instance of links()
, the larger a record group becomes. Priority is always given to matches that occurred at earlier stages. As a result, it's important that criteria
are listed in decreasing order of certainty. You can disable the link between stages by changing the expand
argument to FALSE
. See a simple example below.
data(Opes); Opes # 2 stage linkage # stage 1 - link "Opes" in departments that start with the letter "P" THEN # stage 2 - bridge these to "Opes" whose hair colour starts with the letter "B" dept_func <- function(x, y) substr(x, 1, 1) == substr(y, 1, 1) & substr(y, 1, 1) == "P" hair_func <- function(x, y) substr(x, 1, 1) == substr(y, 1, 1) & substr(y, 1, 1) == "B" Opes$p_exp_a <- links(criteria = list(Opes$name, Opes$name), sub_criteria = list( cr1 = sub_criteria(Opes$department, match_funcs = dept_func), cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)), display = "none") # 2 stage linkage # stage 1 - link "Opes" in departments that start with the letter "P" THEN # stage 2 - link "Opes" whose hair colour starts with the letter "B" Opes$p_nexp_a <- links(criteria = list(Opes$name, Opes$name), sub_criteria = list( cr1 = sub_criteria(Opes$department, match_funcs = dept_func), cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)), expand = F, display = "none") Opes[c("name","department","hair_colour", "p_exp_a", "p_nexp_a")]
Note that changing expand
to TRUE
is not the same as reversing the relevance of your matching conditions. See below.
# The same as `p_exp_a` Opes$p_exp_b <- links(criteria = list(Opes$name, Opes$name), sub_criteria = list( cr2 = sub_criteria(Opes$department, match_funcs = dept_func), cr1 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)), display = "none") # Not the same as `p_nexp_a` Opes$p_nexp_b <- links(criteria = list(Opes$name, Opes$name), sub_criteria = list( cr2 = sub_criteria(Opes$department, match_funcs = dept_func), cr1 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)), expand = F, display = "none") Opes[c("name","department","hair_colour", "p_exp_a", "p_exp_b", "p_nexp_a", "p_nexp_b")]
Example 1 of group expansion
data(patient_list_2); patient_list_2 patient_list_2$pids_a <- links( sn = patient_list_2$rd_id, criteria = list(patient_list_2$forename, patient_list_2$surname, patient_list_2$sex), display = "none") patient_list_2
plot_pid(pid = patient_list_2$pids_a)
schema(patient_list_2$pids_a)
Example 2 of group expansion
df <- data.frame( forename = c("John", "John", "Jon", NA_character_, "Mary", "Mary", "Mariam", "John", "Henry", "Obinna", "Tomi"), surname = c("Swan", "Swanley", "Swanley", NA_character_, "Jane", "Jan", "Janet", "Swan", "Henderson", "Nelson", "Abdulkareem"), age = c(12, 11, 10, 10, 5, 6, 6, 12, 30, 31, 2500) ) df$pids_a <- links(criteria = list(df$forename, df$surname, df$age), display = "none") df
plot_pid(df$pids_a)
schema(df$pids_a)
You can also specify the opposite behaviour. Where a record group becomes smaller with each subsequent stage. If there is no sub_criteria
, the easiest way of doing this is to concatenate the attributes in each stage. See the example below.
data(Opes); Opes # 1 stage linkage # stage 1 - link "Opes" that are in the same department AND have the same day and month of birth Opes$cri <- paste(Opes$name, Opes$department, Opes$db_pt1, sep = " ") Opes$p_shk_c <- links(criteria = list(Opes$cri), display = "none") Opes[c("name","department","db_pt3", "p_shk_c")]
If there's a sub_criteria
, you'll need to use the shrink
argument. See the example below. The same analysis is repeated using sub_criteria
.
# 3 stage linkage # stage 1 - link "Opes" that have the same name, THEN WITHIN THESE # stage 2 - link "Opes" that are in the same department, THEN WITHIN THESE # stage 3 - link "Opes" have the same day and month of birth Opes$p_shk_d <- links(criteria = list(Opes$name, Opes$department, Opes$db_pt1), sub_criteria = list( cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match), cr2 = sub_criteria("place_holder", match_funcs = diyar::exact_match), cr3 = sub_criteria("place_holder", match_funcs = diyar::exact_match) ), shrink = T, display = "none") Opes[c("name","department","db_pt3", "p_shk_c", "p_shk_d")]
Although p_shk_c
and p_shk_d
are identical, the pid_cri
slot for both are different. This reflects the different ways they were created. p_shk_d
is the equivalent of three compounding (using strata
) instances of links()
. See below.
# 1 stage linkage # stage 1 - link "Opes" that have the same name Opes$p_shk_e1 <- links(criteria = list(Opes$name), sub_criteria = list( cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match)), shrink = T, display = "none") # Another attempt is made always made at the next for records with no links Opes$e1 <- ifelse(Opes$p_shk_e1@pid_cri == 0, "No Hits", as.character(Opes$p_shk_e1)) # 1 stage linkage # stage 1 - link "Opes" that are in the same department Opes$p_shk_e2 <- links(criteria = list(Opes$department), strata = Opes$e1, sub_criteria = list( cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match)), shrink = T, display = "none") # Another attempt is made always made at the next for records with no links Opes$e2 <- ifelse(Opes$p_shk_e1@pid_cri == 0, "No Hits", as.character(Opes$p_shk_e1)) # 1 stage linkage # stage 3 - link "Opes" have the same day and month of birth Opes$p_shk_e3 <- links(criteria = list(Opes$db_pt1), strata = Opes$e2, sub_criteria = list( cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match)), shrink = T, display = "none") Opes[c("name","department","db_pt3", "p_shk_e1", "p_shk_e2", "p_shk_e3", "p_shk_c", "p_shk_d")]
Note that the expand
functionality is not always interchangeable with that of shrink
. See an example of the difference below.
data(Opes); Opes Opes$p_cmp1 <- links(criteria = list("place_holder", "place_holder"), sub_criteria = list( cr1 = sub_criteria(Opes$department, match_funcs = dept_func), cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)), expand = T, shrink = F, display = "none") Opes$p_cmp2 <- links(criteria = list("place_holder", "place_holder"), sub_criteria = list( cr1 = sub_criteria(Opes$department, match_funcs = dept_func), cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)), expand = F, shrink = T, display = "none") # `p_cmp1` is not the same as `p_cmp2` Opes[c("name", "department", "hair_colour", "p_cmp1", "p_cmp2")]
Records with missing values are excluded from particular stages of the linkage process. If a record has missing values at every stage, it's assigned to a unique group where it's the only member.
It's common for databases to use specific characters or numbers to represent missing or unknown data e.g. N/A
, Nil
, 01/01/1100
, 111111
e.t.c. These pseudo-missing values must need to be re-coded to NA
, so that links()
recognises them as missing values. If this is not done, it can lead to a cascade of false matches as seen below.
patient_list_2$forename <- ifelse(patient_list_2$rd_id %in% 1:3, "Nil", patient_list_2$forename) # 2 stage linkage # Stage 1 - forename # Stage 2 - surname patient_list_2$pids_b <- links(criteria = list(patient_list_2$forename, patient_list_2$surname), display = "none") patient_list_2[c("forename","surname","pids_b")]
In the example above, records 1-3 are assigned to the same record group even though record 3 seems to be a different person. However, changing "Nil"
to NA
let links()
know that "Nil"
is not a valid forename
, and it is skipped. See below.
# `NA` as the proxy for missing value patient_list_2$forename <- ifelse(patient_list_2$forename == "Nil", NA, patient_list_2$forename) patient_list_2$pids_d <- links(sn = patient_list_2$rd_id, criteria = list(patient_list_2$forename, patient_list_2$surname), display = "none") patient_list_2[c("forename","surname", "pids_b", "pids_d")]
criteria
over sub_criteria
sub_criteria
are required for user-defined logical tests. However, it takes more time to compare sub_criteria
than it does criteria
. You can save some time by only using a sub_criteria
when the matching conditions can not be specified as a criteria
. For example, the logical test lgl_1
(above) was supplied as a sub_criteria
. links()
will create that group identifier faster, if it's supplied as a criteria
. See below.
# Matching `sex`, and `forename` initials dbs2 dbs2$initials <- substr(dbs2$forename, 1, 1) dbs2$pids_1b <- links(criteria = dbs2$initials, display = "none") dbs2[c("forename", "initials", "pids_1", "pids_1b")]
Similarly, avoid using single-attribute sub_criteria
that are paired to the same criteria
. Instead, concatenate these together as one criteria
. For example, the two implementations below (pids_e
and pids_f
) will have the same result. But pids_f
will take less time.
# 1 stage linkage # stage 1 - name AND ((department) AND (hair_colour) AND (year_of_birth)) Opes$month_of_birth <- substr(Opes$date_of_birth, 4, 5) Opes$pids_e <- links(criteria = Opes$name, sub_criteria = list(cr1 = sub_criteria(Opes$department), cr1 = sub_criteria(Opes$hair_colour), cr1 = sub_criteria(Opes$month_of_birth)), display = "none") Opes$cri <- paste(Opes$name, Opes$month_of_birth, Opes$department, Opes$hair_colour, sep="-") # 1 stage linkage # stage 1 - name AND department AND hair_colour AND date_of_birth Opes$pids_f <- links(criteria = Opes$cri, display = "none") Opes[c("name","department","hair_colour","month_of_birth","pids_e","pids_f")]
This time difference could become more noticeable with larger datasets.
links()
is a convenient tool for routine data linkage. But it's more useful in handling complex matching conditions, missing data and multistage linkage. The more complex the scenario, the more likely its convenience outweighs the complexity of scripting them. Simple matching conditions or one-stage linkages are possible with links()
. But this takes slightly more time than base R
alternatives. See the example below.
Opes[c("department")] Opes$links_a <- links(Opes$department, display = "none") Opes$links_b <- match(Opes$department, Opes$department[!duplicated(Opes$department)]) Opes[c("department", "links_a", "links_b")]
This time difference could become more noticeable with larger datasets.
As a general rule, the more unique the criteria
, the earlier it should be used in the linkage process. Always use a reasonable order of relevance for each dataset. Not too strict and not too lax. links()
will try to reduce false mismatches due to random data entry errors or missing values. Always balance the use of matching attributes with their availability and accuracy.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.