match_xs_ys: Collapse multiple possible matches between two data.tables

View source: R/match_xs_ys.R

match_xs_ys    R Documentation

Collapse multiple possible matches between two data.tables

Description

Cascading character matching for sets of columns across two data.tables

Usage

match_xs_ys(
  dt_x,
  dt_y,
  xs,
  ys,
  reverse = FALSE,
  lower = TRUE,
  incomparables = NA,
  use_fastmatch = FALSE,
  ...
)

Arguments

dt_x

A data.table where you wish to find corresponding index values in dt_y

dt_y

A data.table where matching values in dt_x should be found, and indices from dt_y returned

xs

A chr vector of names in dt_x to match upon. Intrinsic order is potentially important, and must correspond to order and length of ys. Non-character columns will be coerced to character via as.character, with a warning.

ys

A chr vector of names in dt_y that correspond to xs in scope and length. Non-character columns will be coerced to character via as.character, with a warning.

reverse

Logical. Perform the match in reverse, that is, find indices in dt_x where corresponding values in dt_y match? Defaults to FALSE.

lower

Should the values in the columns named by xs and ys be lowercased before matching? Defaults to TRUE. When use_fastmatch = TRUE and fmatch is available, keeping this default means no hash indices are set on the original columns, since matching then runs on lowercased copies; for the same reason, repeat runs gain no speedup.

incomparables

Passed to match as incomparables. Defaults to NA so that NA values are never matched to each other.

use_fastmatch

Use fmatch() from the fastmatch package if installed? Defaults to FALSE. See details, and also note interaction with the lower parameter (above).

...

Additional arguments to pass to Reduce()

Details

Addresses a common merge use case between two tables where a single, robust key is not available, and one must rely on one or more fields between the two tables to make a best-attempt merge. In this scenario, the order of values in xs, and the corresponding order in ys is critical, and should correspond to one's best-guess (expectation) of specificity, since this function calls Reduce to collapse the list of match results into a single vector.
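
The collapsing step described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the helper first_non_na and the toy vectors are hypothetical, and the sketch assumes that earlier (more specific) pairs take precedence over later ones.

```r
# Hypothetical helper: keep values from `a`, fill its NA slots from `b`.
first_non_na <- function(a, b) {
  fill <- is.na(a)
  a[fill] <- b[fill]
  a
}

keys  <- c("a", "b", "c", "d")
col_1 <- c("a", NA, "c")   # most specific field, tried first
col_2 <- c(NA, "b", NA)    # fallback field, used only where col_1 failed

# One match() result per (x, y) column pair, ordered by specificity.
matches <- list(
  match(keys, col_1, incomparables = NA),
  match(keys, col_2, incomparables = NA)
)

# Collapse the list of match vectors into a single index vector.
collapsed <- Reduce(first_non_na, matches)
print(collapsed)
```

Swapping the order of the two list elements changes which field wins wherever both produce a match, which is why the ordering of xs and ys matters.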

If use_fastmatch is TRUE and the fastmatch package is available, fastmatch::fmatch is used for performance; otherwise base::match is used. The opt-in is intentional: fmatch technically modifies dt_y in place by attaching a hash index to the fields named in ys. To keep this function side-effect-free by default, you must enable the option explicitly. With the default setting of lower = TRUE there is no side effect, but also no repeat-run benefit.
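
The side effect mentioned above can be inspected directly on plain vectors. The snippet below is a sketch that only runs if fastmatch is installed; the vectors x and y are made up for illustration, and the attribute names are printed as found rather than assumed.

```r
# Sketch: does the lookup table pick up a hash attribute after fmatch()?
if (requireNamespace("fastmatch", quietly = TRUE)) {
  x <- c("a", "b", "z")
  y <- c("B", "A", "C")

  # Matching against a lowercased copy: any hash is attached to the
  # temporary result of tolower(y), so y itself is left untouched.
  idx_lower <- fastmatch::fmatch(x, tolower(y))
  attrs_after_lower <- names(attributes(y))

  # Matching against y directly: fmatch may now attach a hash to y.
  idx_direct <- fastmatch::fmatch(toupper(x), y)
  attrs_after_direct <- names(attributes(y))

  print(idx_lower)
  print(attrs_after_lower)
  print(attrs_after_direct)
}
```

Comparing the two printed attribute listings shows whether the table vector was altered by the direct call.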

Value

An integer vector of matching indices, with NA where no match was found. By default it has one element per row of dt_x, and each element is the position of the matching row in dt_y; with reverse = TRUE the roles of dt_x and dt_y are swapped. Note that length(xs) == length(ys) is required.

Additionally, a console message containing match statistics is printed. If called with accumulate = TRUE, statistics are printed for each step cumulatively, showing the additional coverage (if any) provided by each successive pair of match elements.
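
As a rough illustration of cumulative coverage, the sketch below uses base R only. It is not the package's reporting code: the helper fill_na and the stand-in match vectors are hypothetical. It relies on the documented behavior of Reduce(), which returns every intermediate result when accumulate = TRUE.

```r
# Hypothetical helper: fill NA slots of `a` with values from `b`.
fill_na <- function(a, b) {
  fill <- is.na(a)
  a[fill] <- b[fill]
  a
}

# Stand-ins for the per-pair match() results (one vector per xs/ys pair).
per_pair <- list(
  c(1L, NA, NA, 4L, NA),
  c(NA, 2L, NA, 4L, NA),
  rep(NA_integer_, 5)
)

# accumulate = TRUE keeps the running result after each pair.
steps <- Reduce(fill_na, per_pair, accumulate = TRUE)

# Number of matched rows after each step, and the gain from each step.
coverage <- vapply(steps, function(v) sum(!is.na(v)), integer(1))
gained   <- diff(c(0L, coverage))

print(coverage)
print(gained)
```

A pair whose gain is zero adds nothing beyond the pairs listed before it.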

Note

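By default, this function coerces the values in the columns named by xs and ys to lowercase via tolower. match is called with nomatch = NA_integer_; this is enforced, with no plans to make it optional. The incomparables value is taken from the argument of the same name, which defaults to NA here, whereas base::match itself defaults to incomparables = NULL.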
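
The effect of lowercasing and of incomparables can be seen with base::match alone. The vectors below are made up for illustration.

```r
x <- c("A", NA, "b")
y <- c(NA, "a", "B")

# base::match defaults: case-sensitive, and NA is allowed to match NA.
with_defaults <- match(x, y)

# incomparables = NA stops NA from being treated as a match.
no_na_match <- match(x, y, incomparables = NA)

# Lowercasing both sides first makes the comparison case-insensitive.
lowered <- match(tolower(x), tolower(y), incomparables = NA)

print(with_defaults)
print(no_na_match)
print(lowered)
```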

The length requirement for formals xs and ys applies to the arguments themselves, not to the length of the columns they name. In other words, a standard call to match does not require equal-length inputs, and neither does this function. What is required is that if, for example, xs is a vector of length 2 naming two fields in dt_x, then ys must also have length 2, even if you wish to match a single field in dt_y against each field in dt_x (no recycling is performed on xs or ys).

Examples

library(data.table)
set.seed(10)
# dt_x is the table you want to append to
dt_x <- data.table(
  key_a = sample(LETTERS[1:15], replace = FALSE)
)
# dt_y has one or more target fields you wish to pull, using the keys
# from dt_x.
dt_y <- data.table(
  col_a = unlist(Map(c, LETTERS[seq(1, 10, by = 2)], list(NA_character_))),
  col_b = unlist(Map(c, list(NA_character_), LETTERS[seq(2, 11, by = 2)])),
  targ = sample(1:100L, 10, replace = FALSE)
)
# this is used for indexing results out
mvec <- match_xs_ys(dt_x, dt_y, c("key_a", "key_a"), c("col_a", "col_b"))
# pull over results
dt_x[, targ := dt_y$targ[mvec]]

# also useful for quick tests:
match_xs_ys(dt_x, dt_y, c("key_a", "key_a"), c("col_a", "col_b"),
            accumulate = TRUE)

slin30/wzMisc documentation built on Jan. 27, 2023, 1 a.m.