match_xs_ys | R Documentation |
Cascading character matching for sets of columns across two data.tables
match_xs_ys( dt_x, dt_y, xs, ys, reverse = FALSE, lower = TRUE, incomparables = NA, use_fastmatch = FALSE, ... )
dt_x |
A data.table where you wish to find corresponding index values in dt_y |
dt_y |
A data.table where matching values in dt_x should be found, and indices from dt_y returned |
xs |
A chr vector of names in dt_x to match upon. Intrinsic order is potentially
important, and must correspond to order and length of ys. Non-character columns
will be coerced to character via |
ys |
A chr vector of names in dt_y that correspond to xs in scope and
length. Non-character columns will be coerced to character via |
reverse |
Logi. Do you wish to perform the match in reverse, that is find indices
in dt_x where corresponding values in dt_y match? Defaults to |
lower |
Should values within xs and ys be lowercased? Defaults to |
incomparables |
For |
use_fastmatch |
Use |
... |
Additional arguments to pass to |
Addresses a common merge use case between two tables where a single, robust key is not available,
and one must rely on one or more fields between the two tables to make a best-attempt merge.
In this scenario, the order of values in xs, and the corresponding order in ys, is
critical, and should correspond to one's best guess (expectation) of specificity, since this
function calls Reduce to collapse the list of match results into a single vector.
If the fastmatch package is loaded and use_fastmatch is TRUE, the function uses
fastmatch::fmatch for performance; otherwise it uses base::match. This is intentional,
because fmatch technically modifies dt_y in place by appending a hash index to the fields
flagged within ys. To ensure this function is truly side-effect-free by default, you must set
this option explicitly. Also, there is no repeat-run benefit (but also no side effect) with the
default setting of lower=TRUE.
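To make the cascading idea concrete, here is a minimal base R sketch. It is not the package source: the helper name cascade_match and the rule "later pairs only fill positions that are still NA" are assumptions made for illustration, and plain data frames stand in for data.tables.

```r
# Hypothetical helper (an assumption, not the package implementation):
# run base::match once per (xs, ys) pair, then collapse with Reduce.
cascade_match <- function(x, y, xs, ys, lower = TRUE) {
  stopifnot(length(xs) == length(ys))
  prep <- function(v) {
    v <- as.character(v)            # coerce non-character columns
    if (lower) tolower(v) else v    # optional case folding
  }
  # One vector of positions in y per pair of column names
  hits <- Map(function(xn, yn) match(prep(x[[xn]]), prep(y[[yn]])), xs, ys)
  # Earlier pairs take priority; later pairs fill only the remaining NAs
  Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), hits)
}

x <- data.frame(id = c("A1", "zz", "C3"),
                name = c("alpha", "Beta", "gamma"),
                stringsAsFactors = FALSE)
y <- data.frame(code = c("c3", "a1"),
                label = c("BETA", "alpha"),
                stringsAsFactors = FALSE)

# Row 2 of x has no match on id/code, so it falls through to name/label
idx <- cascade_match(x, y, c("id", "name"), c("code", "label"))
idx
```

Under this assumed rule, putting the most specific pair first matters: a position matched by an earlier pair is never overwritten by a later, looser one.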
An integer vector of length xs or length ys (since it is required that
length(xs) == length(ys)) containing matching indices, else NA. The indices by
default denote the positions of values in dt_y that match dt_x, unless
reverse = TRUE, in which case the reverse.
Additionally, a console message containing match statistics. If called with accumulate=TRUE,
statistics are printed for each step cumulatively, displaying any additional
coverage provided by each additional pair of match elements.
By default, this function coerces both xs and ys to lowercase via tolower.
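The effect of lowercasing can be seen with base functions alone. The vectors below are made-up illustration data, not taken from the package.

```r
# Illustration data (hypothetical values)
x_vals <- c("Acme", "GLOBEX")
y_vals <- c("globex", "acme")

# Case-sensitive comparison finds nothing
raw_idx <- match(x_vals, y_vals)

# Folding both sides to lowercase first lets the values line up
low_idx <- match(tolower(x_vals), tolower(y_vals))
```

Here raw_idx is all NA, while low_idx points each element of x_vals at its counterpart in y_vals.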
match is called with defaults, i.e. match(x, y, nomatch = NA_integer_, incomparables = NULL),
and this is enforced, with no plans to make it optional.
The length requirement for formals xs and ys applies to the arguments themselves,
and not to the length of the variables represented by the values within each. In other words, a standard
call to match does not require equal-length inputs, nor does this function. What is required,
though, is that if e.g. xs is a vector of length 2, representing two fields within
dt_x, then the length of ys must also be 2, even if you wish to match a single
field within dt_y to each field within dt_x (no recycling is performed on argument
length for xs or ys).
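A short sketch of the distinction, using only base R. The column names primary_id, legacy_id and id are hypothetical; the point is that the name vectors must have equal length, while the data being matched need not.

```r
# Two fields in dt_x to be matched against one field in dt_y
# (hypothetical column names)
xs <- c("primary_id", "legacy_id")
ys <- rep("id", length(xs))   # repeat the single name explicitly; no recycling

# The name vectors must have the same length
same_len <- length(xs) == length(ys)

# The underlying data may differ in length, exactly as with base::match
pos <- match(c("a", "b", "c", "d"), c("d", "a"))
```

Repeating the column name by hand, as the Examples do with c("key_a", "key_a"), is what satisfies the length requirement.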
library(data.table)
set.seed(10)

# dt_x is the table you want to append to
dt_x <- data.table(
  key_a = sample(LETTERS[1:15], replace = FALSE)
)

# dt_y has one or more target fields you wish to pull, using the keys
# from dt_x.
dt_y <- data.table(
  col_a = unlist(Map(c, LETTERS[seq(1, 10, by = 2)], list(NA_character_))),
  col_b = unlist(Map(c, list(NA_character_), LETTERS[seq(2, 11, by = 2)])),
  targ = sample(1:100L, 10, replace = FALSE)
)

# this is used for indexing results out
mvec <- match_xs_ys(dt_x, dt_y, c("key_a", "key_a"), c("col_a", "col_b"))

# pull over results
dt_x[, targ := dt_y$targ[mvec]]

# also useful for quick tests:
match_xs_ys(dt_x, dt_y, c("key_a", "key_a"), c("col_a", "col_b"), accumulate = TRUE)