calculate_minmax_pairwise: Calculate the min/max distance among vector values

calculate_minmax_pairwise R Documentation

Calculate the min/max distance among vector values

Description

calculate_minmax_pairwise calculates the pairwise differences among the values of a provided vector and determines either the minimum or maximum distance among the combinations. By default, all groups with the min/max value are returned in columns, with the index pair provided on each row. If you only care about the first match or group, the returned data can be subset with the first_group and first_index parameters. A typical use case is to determine which dates from several sources are in closest agreement; for instance, if dates of birth are discrepant between data systems, it may be useful to determine which pair is closest and take that as the 'true' value. For more complex record validation, there are entire R packages dedicated to the topic that offer better coverage of tools.

Usage

calculate_minmax_pairwise(
  x,
  method = min,
  only_distance = FALSE,
  first_group = FALSE,
  first_index = FALSE,
  ...
)

Arguments

x

Vector of numeric type (e.g. numeric, integer, date).

method

Either min or max, provided unquoted (default, min).

only_distance

Return just the min or max distance discovered (default, FALSE).

first_group

Return just the first group that matched the min/max distance (default, FALSE).

first_index

Return just the first index of the pairwise matches (default, FALSE).

...

Additional parameters passed to method, i.e. to the min or max function (for example, na.rm).

Details

The core calculation is performed by outer(), which applies a function to every pair of elements drawn from two vectors (here, the provided vector against itself). This helper function simply adds some formatting to find the indices in the original vector at which the min or max difference occurred.
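
As a rough illustration of the outer() approach described above, the following sketch computes pairwise distances for a small vector and locates the closest pair. It is not the package's actual implementation: the input values and the variable names (d, min_dist, pairs) are hypothetical, and masking the upper triangle with NA is only one assumed way of skipping self-comparisons and mirrored pairs.

```r
# Hypothetical input values; not taken from the package source.
x <- c(100, 101, 9999)

# Absolute pairwise differences: d[i, j] = |x[i] - x[j]|
d <- abs(outer(x, x, FUN = "-"))

# Keep each unordered pair once by blanking the diagonal and upper triangle
# (an assumption for this sketch; the diagonal would otherwise always be 0)
d[upper.tri(d, diag = TRUE)] <- NA

# Smallest distance, and the row/column index pair(s) where it occurs
min_dist <- min(d, na.rm = TRUE)
pairs <- which(d == min_dist, arr.ind = TRUE)

min_dist        # the minimum pairwise distance
pairs           # one row per index pair sharing that distance
x[pairs[1, ]]   # values belonging to the first matching pair
```

For this input the closest pair is at positions 2 and 1 (values 101 and 100), a distance of 1. Swapping min() for max() would give the most distant pair instead.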

Value

Index values from the provided vector that have the min/max distance or, when only_distance = TRUE, the min/max distance itself.

Examples

## Not run: 
# Create long formatted test data, as if dates came from different data sources
test_data <- data.frame(ID = c(1,1,1,2,2,3,3,3,3),
                        dob_type = c('source1', 'source2', 'source3', 'source1', 'source2', 'source1', 'source2', 'source3', 'source4'), # Various sources
                        dob = c(100,101,9999,22,222,100,1000,900,901))

# Find the matrix for each ID
lapply(split(test_data$dob, f = factor(test_data$ID)), calculate_minmax_pairwise)
lapply(split(test_data$dob, f = factor(test_data$ID)), calculate_minmax_pairwise, first_index = TRUE, first_group = TRUE)

# Return only the min/max distance found for each ID, as a vector (base R)
vapply(split(test_data$dob, f = factor(test_data$ID)), calculate_minmax_pairwise, only_distance = TRUE, FUN.VALUE = numeric(1), USE.NAMES = FALSE)

# Use with dplyr
library(dplyr)
library(magrittr)
test_data %>%
   group_by(ID) %>%
   mutate(newdate = dob[calculate_minmax_pairwise(dob, first_group = TRUE, first_index = TRUE)])

## End(Not run)

al-obrien/farrago documentation built on April 14, 2023, 6:20 p.m.