linkr_multi: Locally Optimal Linking Of Many Data Frames


View source: R/linkr_multi.R

Description

Links a series of data frames sequentially. At each iteration, the function selects one element from each already matched tuple (found by linking data frames 1,...,d) and links it to the next data frame d+1, until no data frames remain. All elements of a tuple are assigned the same identifier in the stacked data frame. Each tuple includes at most one element from each data frame d. The result is a locally optimal approximation to the globally optimal solution.

Usage

linkr_multi(
  df,
  by,
  slice,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  na_matches = "na",
  pool = "last",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

Arguments

df

data frame to link.

by

character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b.

slice

name of the variable used to split df into a list of data frames.

strata

character vector of variables to join on exactly, if any. Can be a named vector, as for by.

method

the name of the distance metric to measure the similarity between the key columns.

assignment

should one-to-one assignments be constructed?

na_matches

should NA and NaN values match one another for any exact join defined by strata?

pool

one of four string values: "previous", "average", "last" or "random" (see details).

caliper

caliper value on the same scale as the distance matrix (before it is multiplied by C).

C

scaling parameter for the distance matrix.

verbose

should a summary statistic of the distances be printed?

...

parameters passed to distance metric function.

Details

Splits df by slice into a list of data frames (indexed 1,...,d,...,D) and applies linkr to every element of this list. Each data frame d is linked to a pool of candidates. The candidate pool consists of one observation from each matched tuple (which may have only a single element, i.e. a singleton) found in the data frames indexed 1,...,(d-1). By default, the last observation of each matched tuple is used (pool='last'). The other options for constructing the candidate pool are pool='previous', pool='average' and pool='random'.

For more details see the help file of linkr.
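
The sequential scheme described above can be illustrated with a small base R sketch. This is not the package's implementation: link_sequential is a hypothetical helper, it uses utils::adist (generalized Levenshtein distance) instead of the package's distance methods, and it matches greedily rather than solving an assignment problem. It only mimics the pool='last' behaviour, assuming that "last" means the most recent observation of each tuple.

```r
# Hypothetical helper (not part of lychee): greedy sequential linking
# that mimics pool = "last" using base R only.
link_sequential <- function(df, by, slice, caliper = Inf) {
  parts <- split(df, df[[slice]])  # one data frame per slice value, sorted
  pool <- data.frame(key = character(0), match_id = integer(0),
                     stringsAsFactors = FALSE)
  next_id <- 1L
  out <- vector("list", length(parts))
  for (d in seq_along(parts)) {
    cur <- parts[[d]]
    cur$match_id <- NA_integer_
    if (nrow(pool) > 0 && nrow(cur) > 0) {
      # rows: observations in the current slice, columns: pool candidates
      dist <- utils::adist(as.character(cur[[by]]), pool$key)
      used <- rep(FALSE, nrow(pool))
      # visit rows from smallest to largest minimum distance (greedy)
      for (i in order(apply(dist, 1, min))) {
        cand <- dist[i, ]
        cand[used] <- Inf  # each pool candidate is matched at most once
        j <- which.min(cand)
        if (length(j) == 1 && is.finite(cand[j]) && cand[j] <= caliper) {
          cur$match_id[i] <- pool$match_id[j]
          used[j] <- TRUE
        }
      }
    }
    # unmatched observations start new tuples (singletons for now)
    is_new <- is.na(cur$match_id)
    cur$match_id[is_new] <- seq.int(next_id, length.out = sum(is_new))
    next_id <- next_id + sum(is_new)
    # pool = "last": the newest observation of a tuple replaces the older one
    pool <- pool[!(pool$match_id %in% cur$match_id), , drop = FALSE]
    pool <- rbind(pool, data.frame(key = as.character(cur[[by]]),
                                   match_id = cur$match_id,
                                   stringsAsFactors = FALSE))
    out[[d]] <- cur
  }
  do.call(rbind, out)  # stacked data frame with a shared identifier
}

# Toy data (made up for illustration)
toy <- data.frame(
  city = c("Muenchen", "Koeln", "Munchen", "Koln", "Bonn"),
  year = c(2009, 2009, 2013, 2013, 2013),
  stringsAsFactors = FALSE
)
linked <- link_sequential(toy, by = "city", slice = "year", caliper = 2)
```

In this toy example the two spellings of each city share a match_id across the two years, while "Bonn" has no candidate within the caliper and starts its own tuple.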

See Also

assignment, linkr

Examples

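library(dplyr)
library(lychee)
data(greens3)

# Link city names across years for the election == "BTW" subset,
# then sort so that members of the same tuple appear together
linkr_multi(
  df = filter(greens3, election == "BTW"),
  by = "city",
  slice = "year",
  method = "lcs",
  caliper = 15
) %>%
  arrange(match_id, year) %>%
  data.frame()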

sumtxt/lychee documentation built on July 15, 2020, 1:51 a.m.