linkr_multi: Locally Optimal Linking Of Many Data Frames
In sumtxt/lychee: Optimal Linking and Joining of Data Frames in R


View source: R/linkr_multi.R

Links a series of data frames sequentially. At each iteration, the function selects one element from each tuple already matched (found by linking data frames 1...d) and links it to the next data frame, d+1, until no data frames remain. All elements of a tuple are assigned the same identifier in the stacked data frame, and each tuple includes at most one element from each data frame d. Because the data frames are linked one at a time, the solution is a local approximation to the globally optimal solution.
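The sequential procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the `slices` list and its city names are made up, `utils::adist` (generalized Levenshtein distance) stands in for the package's distance metrics, and a simple greedy nearest-candidate rule replaces the optimal one-to-one assignment.

```r
# Hypothetical input: three slices of city names (invented for illustration)
slices <- list(
  `2002` = c("Berlin", "Hamburg"),
  `2005` = c("Hamburg", "Berlln"),
  `2009` = c("Berlin", "Bremen")
)
caliper <- 2  # maximum distance accepted as a match (assumed value)

# Every element of the first slice starts its own tuple
out <- data.frame(slice = names(slices)[1], key = slices[[1]],
                  match_id = seq_along(slices[[1]]),
                  stringsAsFactors = FALSE)

for (d in 2:length(slices)) {
  # Candidate pool: the last observation of each tuple found so far
  pool <- out[!duplicated(out$match_id, fromLast = TRUE), ]
  new_keys <- slices[[d]]
  dist <- utils::adist(new_keys, pool$key)

  ids <- integer(length(new_keys))
  used <- logical(nrow(pool))
  for (i in seq_along(new_keys)) {
    cand <- dist[i, ]
    cand[used] <- Inf            # each tuple gets at most one element per slice
    j <- which.min(cand)
    if (cand[j] <= caliper) {
      ids[i] <- pool$match_id[j] # join the existing tuple
      used[j] <- TRUE
    } else {
      ids[i] <- max(out$match_id, ids) + 1  # start a new singleton tuple
    }
  }
  out <- rbind(out, data.frame(slice = names(slices)[d], key = new_keys,
                               match_id = ids, stringsAsFactors = FALSE))
}

print(out[order(out$match_id, out$slice), ])
```

In this toy run the misspelled "Berlln" joins the "Berlin" tuple, while "Bremen" finds no candidate within the caliper and starts a new tuple.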

linkr_multi(
  df,
  by,
  slice,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  na_matches = "na",
  pool = "last",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

`df`	data frame to link.
`by`	character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b.
`slice`	name of the variable in `df` used to split `df` into a list of data frames.
`strata`	character vector of variables to join exactly if any. Can be a named vector as for `by`.
`method`	the name of the distance metric to measure the similarity between the key columns.
`assignment`	should one-to-one assignments be constructed?
`na_matches`	should NA and NaN values match one another for any exact join defined by `strata`?
`pool`	one of four string values: "previous", "average", "last" or "random" (see details).
`caliper`	caliper value on the same scale as the distance matrix (before it is multiplied by `C`).
`C`	scaling parameter for the distance matrix.
`verbose`	should a summary statistic of the distances be printed?
`...`	parameters passed to distance metric function.
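To make the relationship between `caliper` and `C` concrete, here is a minimal base R sketch. It assumes that pairs whose raw distance exceeds the caliper are treated as inadmissible (marked `Inf` here) and that scaling by `C` happens afterwards; how the package represents inadmissible pairs internally is not stated in this help page. The keys are invented, and `utils::adist` is a stand-in for the package's distance metrics.

```r
# Toy keys from two hypothetical data frames
x <- c("Berlin", "Hamburg", "Muenchen")
y <- c("Berlin", "Hamburg-Mitte", "Munchen")

# Generalized Levenshtein distance from base R (stand-in for method = "osa")
d <- utils::adist(x, y)
dimnames(d) <- list(x, y)

caliper <- 2   # threshold on the raw distance scale
C <- 10        # scaling factor, applied after the caliper check

# Pairs beyond the caliper become inadmissible; the rest are scaled by C
admissible <- d <= caliper
d_scaled <- ifelse(admissible, d * C, Inf)

print(d_scaled)
```

Because the caliper is compared against the unscaled distances, changing `C` in this sketch never changes which pairs are admissible.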

Splits df by slice into a list of data frames (indexed 1,...,d,...,D) and applies linkr to every element of this list. Each data frame d is linked to a pool of candidates. The candidate pool is defined by one observation from each matched tuple (which might only have a single element, i.e. a singleton) found in the data frames indexed 1...(d-1). By default, the last observation for each matched tuple is used (pool='last'). Other options to construct the candidate pool include:

pool='random': pool includes a randomly drawn element from each matched tuple.
pool='previous': pool includes all observations from the data frame indexed d-1.
pool='average': pool includes a new observation with the average value per key variable for every matched tuple. This option will only work when the variable(s) defined by the parameter by are numeric.
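The four pool options can be illustrated with a small base R sketch. It reflects one reading of the descriptions above and is not the package's code: the `linked` data frame, its values, and the numeric `key` column are made up, and only `match_id` and `year` mirror names used in the example on this page.

```r
# Hypothetical result after linking two slices (2002 and 2005)
linked <- data.frame(
  match_id = c(1, 1, 2, 3, 3),
  year     = c(2002, 2005, 2002, 2002, 2005),
  key      = c(10, 12, 50, 30, 31)
)
linked <- linked[order(linked$match_id, linked$year), ]

# pool = "last": the most recent observation of each tuple
pool_last <- linked[!duplicated(linked$match_id, fromLast = TRUE), ]

# pool = "random": one randomly drawn observation per tuple
set.seed(1)
pool_random <- do.call(rbind, lapply(
  split(linked, linked$match_id),
  function(g) g[sample(nrow(g), 1), ]
))

# pool = "previous": every observation from the preceding slice only
pool_previous <- linked[linked$year == max(linked$year), ]

# pool = "average": one synthetic observation per tuple with the mean key
pool_average <- aggregate(key ~ match_id, data = linked, FUN = mean)

print(pool_last)
print(pool_previous)
print(pool_average)
```

In this toy data the singleton tuple 2, observed only in 2002, appears in the "last", "random" and "average" pools but not in the "previous" pool.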

For more details see the help file of linkr.

See also: assignment, linkr

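library(lychee)
library(dplyr)
data(greens3)

# Keep the rows with election == "BTW" and link cities across years
linkr_multi(
  df = filter(greens3, election == "BTW"),
  by = "city",
  slice = "year",
  method = "lcs",
  caliper = 15
) %>%
  arrange(match_id, year) %>%
  data.frame()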