calculate_minmax_pairwise | R Documentation |
calculate_minmax_pairwise
will, as the name suggests, calculate the pairwise differences among the elements of a provided vector and determine either
the maximum or minimum distance among the combinations. By default, all groups with the min/max value are returned in columns, with the index pair
provided on each row. If you only care about the first match or group, the returned data can be subset with the first_group and first_index parameters. The typical use case
for this function is to determine which dates among several sources are closest in alignment; for instance, if some dates of birth are discrepant
between various data systems, it may be useful to determine which pair is closest and take that as the 'true' value. For more complex record validation,
there are entire R packages dedicated to this topic that have better coverage of tools.
calculate_minmax_pairwise(
x,
method = min,
only_distance = FALSE,
first_group = FALSE,
first_index = FALSE,
...
)
x |
Vector of numeric type (e.g. numeric, integer, date). |
method |
Either 'min' or 'max' (provide unquoted). |
only_distance |
Return just the min or max distance discovered (default, FALSE). |
first_group |
Return just the first group that matched the min/max distance (default, FALSE). |
first_index |
Return just the first index of the pairwise matches (default, FALSE). |
... |
Additional parameters passed to |
The core calculation is performed via outer()
, which applies a function to every pair of elements and is very useful for outer-product operations. This (helper) function simply provides some
additional formatting to find the indices at which the max and min differences occurred in the original vector.
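To illustrate the idea, the following is a minimal base R sketch of how outer() can be used to find the closest pair in a vector. It is not the source of calculate_minmax_pairwise; the variable names (d, min_dist, pairs) and the use of upper.tri() and which() are assumptions made for illustration only.

```r
# Hypothetical sketch, not the package's implementation.
x <- c(100, 101, 9999)

# Absolute pairwise differences: d[i, j] = |x[i] - x[j]|
d <- abs(outer(x, x, FUN = "-"))

# Keep each unordered pair once by masking the diagonal and lower triangle
d[!upper.tri(d)] <- NA

# Smallest distance and the index pair(s) at which it occurs
min_dist <- min(d, na.rm = TRUE)
pairs <- which(d == min_dist, arr.ind = TRUE)

min_dist
pairs
```

Replacing min() with max() in the sketch gives the most distant pair instead. Each row of pairs holds the positions in x of one matching pair.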
Index values from the provided vector that have the min/max distance.
## Not run:
# Create long formatted test data, as if dates came from different data sources
test_data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  # Various sources
  dob_type = c('source1', 'source2', 'source3', 'source1', 'source2', 'source1', 'source2', 'source3', 'source4'),
  dob = c(100, 101, 9999, 22, 222, 100, 1000, 900, 901)
)
# Find the matrix for each ID
lapply(split(test_data$dob, f = factor(test_data$ID)), calculate_minmax_pairwise)
lapply(split(test_data$dob, f = factor(test_data$ID)), calculate_minmax_pairwise, first_index = TRUE, first_group = TRUE)
# Return as a vector only for max/min distances found by ID (base R)
vapply(split(test_data$dob, f = factor(test_data$ID)), calculate_minmax_pairwise, only_distance = TRUE, FUN.VALUE = numeric(1), USE.NAMES = FALSE)
# Use with dplyr
library(dplyr)
library(magrittr)
test_data %>%
  group_by(ID) %>%
  mutate(newdate = dob[calculate_minmax_pairwise(dob, first_group = TRUE, first_index = TRUE)])
## End(Not run)