knitr::opts_chunk$set(
  collapse = T,
  comment = "#>"
)
plot_pid <- diyar:::plot_pid

Introduction

Linking datasets to consolidate information is a common task in research, particularly in those using "big data". Deterministic record linkage is the simplest and most common method of doing this. However, its accuracy relies on data quality. links() provides a convenient and flexible approach of doing this while handing missing or incorrect data.

Uses

links() can be used to link datasets in a variety of ways. Examples include;

Implementation

Data matching is done in ordered stages. A match at each stage is considered more relevant than those at subsequent stages. Matched records are assigned a record group identifier. This is the record identifier (sn) of one of the matching records therefore, a familiar sn can prove useful.

Each matching stage has an attribute or characteristic to match (criteria). These must be passed as a list of atomic vectors.

Additional matching conditions (sub_criteria) can also be specified at each stage. This is done by supplying a list of sub_crieria(). Each sub_criteria() must be paired to a stage (criteria). This is done by naming each element in the list, using the syntax - "cr[n]". Where n is the corresponding stage (criteria) of interest (See examples). Any unpaired sub_criteria will be ignored. Records will only be assigned to the same group if they have the same criteria, and at least one matching attribute in each sub_criteria(). See examples below.

library(diyar);
data(patient_list); 
dbs <- patient_list[c("forename","surname","sex")]; dbs

# 1 stage <- Matching surname only
dbs$pids_a <- links(criteria = dbs$surname, display = "none", group_stats = T)

# 2 stage - Matching surname, then matching sex
dbs$pids_b <- links(criteria = list(dbs$surname, dbs$sex), display = "none")

dbs

Matched records are assigned to record groups. These are stored as pid objects (S4 object class). A pid object is a group identifier with slots for additional information about each group.

For a pid object created in multiple stages (criteria), the following information will be displayed (format.pid);

See ?links for further details.

to_df() transforms pid objects to data.frames.

cat("to_df(`pid`)")
to_df(dbs$pids_a)

Record matching

Attributes are checked for a case sensitive exact match. Those supplied via a sub_criteria() can also be compared by any logical test applicable to two atomic vectors. The logical test must be supplied as a function. This function must have two named arguments - x and y. Where y is the value for one observation to be compared against all other observations (x). See some examples below.

dbs2 <- patient_list[c("forename","surname","sex")]; dbs2

# Matching `sex`, and `forename` initials
lgl_1 <- function(x, y) substr(x, 1, 1) == substr(y, 1, 1) 
dbs2$pids_1 <- links(criteria =  list(dbs2$sex), 
                     sub_criteria = list(
                       cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_1)),
                     display = "none")

# Matching `sex`, and `forename` with character length of 5
lgl_2 <- function(x, y) nchar(x) == nchar(y) & nchar(x) == 5
dbs2$pids_2 <- links(criteria =  list(dbs2$sex), 
                    sub_criteria = list(
                      cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_2)),
                    display = "none")

# Matching `sex`, and `forename` that ends with "y"
lgl_3 <- function(x, y) substr(x, length(x) - 1, length(x)) == "y"
dbs2$pids_3 <- links(criteria =  list(dbs2$sex), 
                    sub_criteria = list(
                      cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_3)),
                    display = "none")

# Matching `sex` and at least one `forename` that ends with "y"
lgl_4 <- function(x, y) substr(y, length(y) - 1, length(y)) == "y"
dbs2$pids_4 <- links(criteria =  list(dbs2$sex), 
                    sub_criteria = list(
                      cr1 = sub_criteria(dbs2$forename, match_funcs = lgl_4)),
                    display = "none")

User-defined logical tests can be very useful for implementing complex matching conditions. For example, you can match observations if they are within a range or list of values. See the example below.

Opes_c <- Opes[c("date_of_birth", "name", "hair_colour")]
Opes_c <- Opes["date_of_birth"]
# approximate age
Opes_c$age <- as.numeric(round((Sys.Date() - as.Date(Opes_c$date_of_birth, "%d/%m/%Y"))/365.5))

# Match individuals between ages 35 and 55 
rng1 <- function(x, y) x %in% 35:55
Opes_c$pids_a <- links(criteria = "place_holder",
                       sub_criteria = list(cr1 = sub_criteria(Opes_c$age, match_funcs = rng1)),
                       display ="none")

# Match individuals with a 5-year age gap between themselves
rng2 <- function(x, y) abs(y - x) %in% 0:5
Opes_c$pids_b <- links(criteria = "place_holder",
                       sub_criteria = list(cr1 = sub_criteria(Opes_c$age, match_funcs = rng2)),
                       display ="none")

# Match individuals with more than a 5-year age gap between each other 
rng3 <- function(x, y) (y - x) > 5
Opes_c$pids_c <- links(criteria = "place_holder",
                       sub_criteria = list(cr1 = sub_criteria(Opes_c$age, match_funcs = rng3)),
                       display ="none")

Opes_c[c("age","pids_a", "pids_b", "pids_c")]

links() goes through repeated checks for matches based on the logical test specified. At each iteration, one observation (y) is compared with every other observation (x). As a result, two observations (i.e. x[2] and x[3]) can pass the same logical test compared with a third observation (i.e. y). But not pass the same test when compared with each other. As an example, in Opes_c$pids_c above, records 1, 2, 4, 5 and 8 are within a 5-year gap of each other. This has happened because there's more than a 5-year gap between themselves and record 7.

Always aim for the combination of conditions that'll give more true positive matches. Linking records on matching surnames and then sex (pid_b) led to a different result. Compared to linking on sex before surnames (pid_c below) (See group expansion). In pid_b, a match on the individual's surname was considered more relevant than a match on their sex. The opposite was the case for pid_c.

dbs$pids_c <- links(criteria =  list(dbs$sex, dbs$surname), display = "none")

dbs

Both results are logically correct but pids_b is not the most practical option. For instance, records 3 and 6 which were linked but could be cousins and not the same individual. A better combination would be forename at stage 1, followed by surname and sex at stage 2. See pid_d below;

dbs_2 <- patient_list; dbs_2

dbs_2$cri_2 <- paste(dbs_2$surname, dbs_2$sex,sep="-")
dbs_2$pid_d <- links(sn = dbs_2$rd_id, list(dbs_2$forename, dbs_2$cri_2), display = "none")

dbs_2
plot_pid(pid = dbs_2$pid_d)
schema(dbs_2$pid_d)

Below are additional examples using different combinations of the same criteria and sub_criteria

data(Opes); Opes

# 1 stage linkage
  # stage 1 - name AND (department OR hair_colour OR date_of_birth)
Opes$pids_a <- links(criteria = Opes$name,
                     sub_criteria = list(cr1 = sub_criteria(Opes$department, 
                                                            Opes$hair_colour, 
                                                            Opes$date_of_birth)),
                     display = "none"
                     )

Opes[c("name","department","hair_colour","date_of_birth","pids_a")]

# 1 stage linkage 
  # stage 1 - name AND ((department OR hair_colour) AND (date_of_birth)) 
Opes$pids_b <- links(criteria = Opes$name, 
                     sub_criteria = list(cr1 = sub_criteria(Opes$department, 
                                                            Opes$hair_colour),
                                         cr1 = sub_criteria(Opes$date_of_birth)),
                     display = "none")

Opes[c("name","department","hair_colour","date_of_birth","pids_b")]

# 1 stage linkage 
  # stage 1 - name AND ((department OR hair_colour) AND (dd-mm OR dd-yyyy OR mm-yyyy))
Opes$pids_c <- links(criteria = Opes$name, 
                     sub_criteria = list(cr1 = sub_criteria(Opes$department, 
                                                            Opes$hair_colour),
                                         cr1 = sub_criteria(Opes$db_pt1, Opes$db_pt2, Opes$db_pt3)),
                            display = "none")

Opes[c("name","department","hair_colour","date_of_birth","pids_c")]

# 1 stage linkage 
  # stage 1 - name AND ((department)  AND (hair_colour) AND (dd-mm OR dd-yyyy OR mm-yyyy))
Opes$pids_d <- links(criteria = Opes$name, 
                     sub_criteria = list(cr1 = sub_criteria(Opes$department),
                                         cr1 = sub_criteria(Opes$hair_colour),
                                         cr1 = sub_criteria(Opes$db_pt1, Opes$db_pt2, Opes$db_pt3)),
                            display = "none")

Opes[c("name","department","hair_colour","date_of_birth","pids_d")]

Multistage linkage

Group expansion

By default, matches at different stages are linked together. The more stages (criteria) there are in an instance of links(), the larger a record group becomes. Priority is always given to matches that occurred at earlier stages. As a result, it's important that criteria are listed in decreasing order of certainty. You can disable the link between stages by changing the expand argument to FALSE. See a simple example below.

data(Opes); Opes

# 2 stage linkage
  # stage 1 - link "Opes" in departments that start with the letter "P" THEN 
  # stage 2 - bridge these to "Opes" whose hair colour starts with the letter "B"
dept_func <- function(x, y) substr(x, 1, 1) == substr(y, 1, 1) & substr(y, 1, 1) == "P" 
hair_func <- function(x, y) substr(x, 1, 1) == substr(y, 1, 1) & substr(y, 1, 1) == "B"

Opes$p_exp_a <- links(criteria = list(Opes$name, Opes$name),
                     sub_criteria = list(
                       cr1 = sub_criteria(Opes$department, match_funcs = dept_func),
                       cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)),
                     display = "none")

# 2 stage linkage
  # stage 1 - link "Opes" in departments that start with the letter "P" THEN 
  # stage 2 - link "Opes" whose hair colour starts with the letter "B"
Opes$p_nexp_a <- links(criteria = list(Opes$name, Opes$name),
                     sub_criteria = list(
                       cr1 = sub_criteria(Opes$department, match_funcs = dept_func),
                       cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)),
                     expand = F,
                     display = "none")

Opes[c("name","department","hair_colour", "p_exp_a", "p_nexp_a")]

Note that changing expand to TRUE is not the same as reversing the relevance of your matching conditions. See below.

# The same as `p_exp_a` 
Opes$p_exp_b <- links(criteria = list(Opes$name, Opes$name),
                     sub_criteria = list(
                       cr2 = sub_criteria(Opes$department, match_funcs = dept_func),
                       cr1 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)),
                     display = "none")

# Not the same as `p_nexp_a`
Opes$p_nexp_b <- links(criteria = list(Opes$name, Opes$name),
                     sub_criteria = list(
                       cr2 = sub_criteria(Opes$department, match_funcs = dept_func),
                       cr1 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)),
                     expand = F,
                     display = "none")

Opes[c("name","department","hair_colour", "p_exp_a", "p_exp_b", "p_nexp_a", "p_nexp_b")]

Example 1 of group expansion

data(patient_list_2); patient_list_2

patient_list_2$pids_a <- links(
  sn = patient_list_2$rd_id,
  criteria = list(patient_list_2$forename, 
                  patient_list_2$surname,
                  patient_list_2$sex), 
  display = "none")

patient_list_2
plot_pid(pid = patient_list_2$pids_a)
schema(patient_list_2$pids_a)

Example 2 of group expansion

df <- data.frame(
  forename = c("John", "John", "Jon", 
               NA_character_, "Mary", "Mary",
               "Mariam", "John", "Henry",
               "Obinna", "Tomi"),
  surname = c("Swan", "Swanley", "Swanley",
              NA_character_, "Jane", "Jan",
              "Janet", "Swan", "Henderson", 
              "Nelson", "Abdulkareem"), 
  age = c(12, 11, 10, 10, 5, 6, 6, 12, 30, 31, 2500)
)

df$pids_a <- links(criteria = list(df$forename, df$surname, df$age), display = "none")

df
plot_pid(df$pids_a)
schema(df$pids_a)

Group shrinkage

You can also specify the opposite behaviour. Where a record group becomes smaller with each subsequent stage. If there is no sub_criteria, the easiest way of doing this is to concatenate the attributes in each stage. See the example below.

data(Opes); Opes

# 1 stage linkage
  # stage 1 - link "Opes" that are in the same department AND have the same day and month of birth 
Opes$cri <- paste(Opes$name, Opes$department, Opes$db_pt1, sep = " ")
Opes$p_shk_c <- links(criteria = list(Opes$cri),
                     display = "none")

Opes[c("name","department","db_pt3", "p_shk_c")]

If there's a sub_criteria, you'll need to use the shrink argument. See the example below. The same analysis is repeated using sub_criteria.

# 3 stage linkage
  # stage 1 - link "Opes" that have the same name, THEN WITHIN THESE
  # stage 2 - link "Opes" that are in the same department, THEN WITHIN THESE
  # stage 3 - link "Opes" have the same day and month of birth 
Opes$p_shk_d <- links(criteria = list(Opes$name, Opes$department, Opes$db_pt1),
                      sub_criteria = list(
                        cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match),
                        cr2 = sub_criteria("place_holder", match_funcs = diyar::exact_match),
                        cr3 = sub_criteria("place_holder", match_funcs = diyar::exact_match)
                      ),
                      shrink = T,
                     display = "none")

Opes[c("name","department","db_pt3", "p_shk_c", "p_shk_d")]

Although p_shk_c and p_shk_d are identical, the pid_cri slot for both are different. This reflects the different ways they were created. p_shk_d is the equivalent of three compounding (using strata) instances of links(). See below.

# 1 stage linkage
  # stage 1 - link "Opes" that have the same name
Opes$p_shk_e1 <- links(criteria = list(Opes$name),
                      sub_criteria = list(
                        cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match)),
                      shrink = T,
                     display = "none")

# Another attempt is made always made at the next for records with no links 
Opes$e1 <- ifelse(Opes$p_shk_e1@pid_cri == 0, "No Hits", as.character(Opes$p_shk_e1))

# 1 stage linkage
  # stage 1 - link "Opes" that are in the same department
Opes$p_shk_e2 <- links(criteria = list(Opes$department),
                      strata = Opes$e1,
                       sub_criteria = list(
                        cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match)),
                      shrink = T,
                     display = "none")

# Another attempt is made always made at the next for records with no links
Opes$e2 <- ifelse(Opes$p_shk_e1@pid_cri == 0, "No Hits", as.character(Opes$p_shk_e1))

# 1 stage linkage
  # stage 3 - link "Opes" have the same day and month of birth 
Opes$p_shk_e3 <- links(criteria = list(Opes$db_pt1),
                      strata = Opes$e2,
                       sub_criteria = list(
                        cr1 = sub_criteria("place_holder", match_funcs = diyar::exact_match)),
                      shrink = T,
                     display = "none")

Opes[c("name","department","db_pt3", "p_shk_e1", "p_shk_e2", "p_shk_e3", "p_shk_c", "p_shk_d")]

Note that the expand functionality is not always interchangeable with that of shrink. See an example of the difference below.

data(Opes); Opes

Opes$p_cmp1 <- links(criteria = list("place_holder", "place_holder"),
                      sub_criteria = list(
                        cr1 = sub_criteria(Opes$department, match_funcs = dept_func),
                        cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)),
                     expand = T,
                     shrink = F,
                     display = "none")

Opes$p_cmp2 <- links(criteria = list("place_holder", "place_holder"),
                      sub_criteria = list(
                        cr1 = sub_criteria(Opes$department, match_funcs = dept_func),
                        cr2 = sub_criteria(Opes$hair_colour, match_funcs = hair_func)),
                     expand = F,
                     shrink = T,
                     display = "none")

# `p_cmp1` is not the same as `p_cmp2`
Opes[c("name", "department", "hair_colour", "p_cmp1", "p_cmp2")]

Handling missing values

Records with missing values are excluded from particular stages of the linkage process. If a record has missing values at every stage, it's assigned to a unique group where it's the only member.

It's common for databases to use specific characters or numbers to represent missing or unknown data e.g. N/A, Nil, 01/01/1100, 111111 e.t.c. These pseudo-missing values must need to be re-coded to NA, so that links() recognises them as missing values. If this is not done, it can lead to a cascade of false matches as seen below.

patient_list_2$forename <- ifelse(patient_list_2$rd_id %in% 1:3, "Nil", patient_list_2$forename)
# 2 stage linkage
    # Stage 1 - forename
    # Stage 2 - surname

patient_list_2$pids_b <- links(criteria = list(patient_list_2$forename, 
                                               patient_list_2$surname), 
                               display = "none")

patient_list_2[c("forename","surname","pids_b")]

In the example above, records 1-3 are assigned to the same record group even though record 3 seems to be a different person. However, changing "Nil" to NA let links() know that "Nil" is not a valid forename, and it is skipped. See below.

# `NA` as the proxy for missing value
patient_list_2$forename <- ifelse(patient_list_2$forename == "Nil", NA, patient_list_2$forename)

patient_list_2$pids_d <- links(sn = patient_list_2$rd_id, 
                               criteria = list(patient_list_2$forename, 
                                               patient_list_2$surname), 
                               display = "none")

patient_list_2[c("forename","surname", "pids_b", "pids_d")]

Tips

criteria over sub_criteria

sub_criteria are required for user-defined logical tests. However, it takes more time to compare sub_criteria than it does criteria. You can save some time by only using a sub_criteria when the matching conditions can not be specified as a criteria. For example, the logical test lgl_1 (above) was supplied as a sub_criteria. links() will create that group identifier faster, if it's supplied as a criteria. See below.

# Matching `sex`, and `forename` initials
dbs2
dbs2$initials <- substr(dbs2$forename, 1, 1)
dbs2$pids_1b <- links(criteria = dbs2$initials,
                     display = "none")

dbs2[c("forename", "initials", "pids_1", "pids_1b")]

Similarly, avoid using single-attribute sub_criteria that are paired to the same criteria. Instead, concatenate these together as one criteria. For example, the two implementations below (pids_e and pids_f) will have the same result. But pids_f will take less time.

# 1 stage linkage 
  # stage 1 - name AND ((department)  AND (hair_colour) AND (year_of_birth))
Opes$month_of_birth <- substr(Opes$date_of_birth, 4, 5)
Opes$pids_e <- links(criteria = Opes$name, 
                     sub_criteria = list(cr1 = sub_criteria(Opes$department),
                                         cr1 = sub_criteria(Opes$hair_colour),
                                         cr1 = sub_criteria(Opes$month_of_birth)),
                            display = "none")

Opes$cri <- paste(Opes$name, Opes$month_of_birth, Opes$department, Opes$hair_colour, sep="-")

# 1 stage linkage 
  # stage 1 - name AND department AND hair_colour AND date_of_birth
Opes$pids_f <- links(criteria = Opes$cri,  display = "none")

Opes[c("name","department","hair_colour","month_of_birth","pids_e","pids_f")]

This time difference could become more noticeable with larger datasets.

Convenience vs efficiency

links() is a convenient tool for routine data linkage. But it's more useful in handling complex matching conditions, missing data and multistage linkage. The more complex the scenario, the more likely its convenience outweighs the complexity of scripting them. Simple matching conditions or one-stage linkages are possible with links(). But this takes slightly more time than base R alternatives. See the example below.

Opes[c("department")]

Opes$links_a <- links(Opes$department, display = "none")

Opes$links_b <- match(Opes$department, Opes$department[!duplicated(Opes$department)])

Opes[c("department", "links_a", "links_b")]

This time difference could become more noticeable with larger datasets.

Conclusion

As a general rule, the more unique the criteria, the earlier it should be used in the linkage process. Always use a reasonable order of relevance for each dataset. Not too strict and not too lax. links() will try to reduce false mismatches due to random data entry errors or missing values. Always balance the use of matching attributes with their availability and accuracy.



OlisaNsonwu/diyar documentation built on April 22, 2024, 6:27 p.m.