# find.matches: Find Close Matches In Hmisc: Harrell Miscellaneous

 find.matches R Documentation

## Find Close Matches

### Description

Compares each row in `x` against all the rows in `y`, finding rows in `y` with all columns within a tolerance of the values a given row of `x`. The default tolerance `tol` is zero, i.e., an exact match is required on all columns. For qualifying matches, a distance measure is computed. This is the sum of squares of differences between `x` and `y` after scaling the columns. The default scaling values are `tol`, and for columns with `tol=1` the scale values are set to 1.0 (since they are ignored anyway). Matches (up to `maxmatch` of them) are stored and listed in order of increasing distance.
The `summary` method prints a frequency distribution of the number of matches per observation in `x`, the median of the minimum distances for all matches per `x`, as a function of the number of matches, and the frequency of selection of duplicate observations as those having the smallest distance. The `print` method prints the entire `matches` and `distance` components of the result from `find.matches`.
`matchCases` finds all controls that match cases on a single variable `x` within a tolerance of `tol`. This is intended for prospective cohort studies that use matching for confounder adjustment (even though regression models usually work better).

### Usage

``````find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10)
## S3 method for class 'find.matches'
summary(object, ...)
## S3 method for class 'find.matches'
print(x, digits, ...)

matchCases(xcase,    ycase,    idcase=names(ycase),
xcontrol, ycontrol, idcontrol=names(ycontrol),
tol=NULL,
maxobs=max(length(ycase),length(ycontrol))*10,
maxmatch=20, which=c('closest','random'))
``````

### Arguments

 `x` a numeric matrix or the result of `find.matches` `y` a numeric matrix with same number of columns as `x` `xcase` numeric vector to match on for cases `xcontrol` numeric vector to match on for controls, not necessarily the same length as `xcase` `ycase` a vector or matrix `ycontrol` `ycase` and `ycontrol` are vectors or matrices, not necessarily having the same number of rows, specifying a variable to carry along from cases and matching controls. If you instead want to carry along rows from a data frame, let `ycase` and `ycontrol` be non-overlapping integer subscripts of the donor data frame. `tol` a vector of tolerances with number of elements the same as the number of columns of `y`, for `find.matches`. For `matchCases` is a scalar tolerance. `scale` a vector of scaling constants with number of elements the same as the number of columns of `y`. `maxmatch` maximum number of matches to allow. For `matchCases`, maximum number of controls to match with a case (default is 20). If more than `maxmatch` matching controls are available, a random sample without replacement of `maxmatch` controls is used (if `which="random"`). `object` an object created by `find.matches` `digits` number of digits to use in printing distances `idcase` vector the same length as `xcase` `idcontrol` `idcase` and `idcontrol` are vectors the same length as `xcase` and `xcontrol` respectively, specifying the id of cases and controls. Defaults are integers specifying original element positions within each of cases and controls. `maxobs` maximum number of cases and all matching controls combined (maximum dimension of data frame resulting from `matchControls`). Default is ten times the maximum of the number of cases and number of controls. `maxobs` is used to allocate space for the resulting data frame. `which` set to `"closest"` (the default) to match cases with up to `maxmatch` controls that most closely match on `x`. Set `which="random"` to use randomly chosen controls. In either case, only those controls within `tol` on `x` are allowed to be used. `...` unused

### Value

`find.matches` returns a list of class `find.matches` with elements `matches` and `distance`. Both elements are matrices with the number of rows equal to the number of rows in `x`, and with `k` columns, where `k` is the maximum number of matches (`<= maxmatch`) that occurred. The elements of `matches` are row identifiers of `y` that match, with zeros if fewer than `maxmatch` matches are found (blanks if `y` had row names). `matchCases` returns a data frame with variables `idcase` (id of case currently being matched), `type` (factor variable with levels `"case"` and `"control"`), `id` (id of case if case row, or id of matching case), and `y`.

### Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

### References

Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455–463.

Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: trade-offs. J Clin Epidemiology 56:230-237. Note: These papers were not used for the functions here but probably should have been.

`scale`, `apply`

### Examples

``````y <- rbind(c(.1, .2),c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09,.21), c(.29,.39))
y
x
w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05))

set.seed(111)       # so can replicate results
x <- matrix(runif(500), ncol=2)
y <- matrix(runif(2000), ncol=2)
w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03))
w\$matches[1:5,]
w\$distance[1:5,]
# Find first x with 3 or more y-matches
num.match <- apply(w\$matches, 1, function(x)sum(x > 0))
j <- ((1:length(num.match))[num.match > 2])
x[j,]
y[w\$matches[j,],]

summary(w)

# For many applications would do something like this:
# attach(df1)
# x <- cbind(age, sex) # Just do as.matrix(df1) if df1 has no factor objects
# attach(df2)
# y <- cbind(age, sex)
# mat <- find.matches(x, y, tol=c(5,0)) # exact match on sex, 5y on age

# Demonstrate matchCases
xcase     <- c(1,3,5,12)
xcontrol  <- 1:6
idcase    <- c('A','B','C','D')
idcontrol <- c('a','b','c','d','e','f')
ycase     <- c(11,33,55,122)
ycontrol  <- c(11,22,33,44,55,66)
matchCases(xcase, ycase, idcase,
xcontrol, ycontrol, idcontrol, tol=1)

# If y is a binary response variable, the following code
# will produce a Mantel-Haenszel summary odds ratio that
# utilizes the matching.
# Standard variance formula will not work here because
# a control will match more than one case
# WARNING: The M-H procedure exemplified here is suspect
# because of the small strata and widely varying number
# of controls per case.

x    <- c(1, 2, 3, 3, 3, 6, 7, 12,  1, 1:7)
y    <- c(0, 0, 0, 1, 0, 1, 1,  1,  1, 0, 0, 0, 0, 1, 1, 1)
case <- c(rep(TRUE, 8), rep(FALSE, 8))
id   <- 1:length(x)

m <- matchCases(x[case],  y[case],  id[case],
x[!case], y[!case], id[!case], tol=1)
iscase <- m\$type=='case'
# Note: the first tapply on insures that event indicators are
# sorted by case id.  The second actually does something.
event.case    <- tapply(m\$y[iscase],  m\$idcase[iscase],  sum)
event.control <- tapply(m\$y[!iscase], m\$idcase[!iscase], sum)
n.control     <- tapply(!iscase,      m\$idcase,          sum)
n             <- tapply(m\$y,          m\$idcase,          length)
or <- sum(event.case * (n.control - event.control) / n) /
sum(event.control * (1 - event.case) / n)
or

# Bootstrap this estimator by sampling with replacement from
# subjects.  Assumes id is unique when combine cases+controls
# (id was constructed this way above).  The following algorithms
# puts all sampled controls back with the cases to whom they were
# originally matched.

ids <- unique(m\$id)
idgroups <- split(1:nrow(m), m\$id)
B   <- 50   # in practice use many more
ors <- numeric(B)
# Function to order w by ids, leaving unassigned elements zero
align <- function(ids, w) {
z <- structure(rep(0, length(ids)), names=ids)
z[names(w)] <- w
z
}
for(i in 1:B) {
j <- sample(ids, replace=TRUE)
obs <- unlist(idgroups[j])
u <- m[obs,]
iscase <- u\$type=='case'
n.case <- align(ids, tapply(u\$type, u\$idcase,
function(v)sum(v=='case')))
n.control <- align(ids, tapply(u\$type, u\$idcase,
function(v)sum(v=='control')))
event.case <- align(ids, tapply(u\$y[iscase],  u\$idcase[iscase],  sum))
event.control <- align(ids, tapply(u\$y[!iscase], u\$idcase[!iscase], sum))
n <- n.case + n.control
# Remove sets having 0 cases or 0 controls in resample
s             <- n.case > 0 & n.control > 0
denom <- sum(event.control[s] * (n.case[s] - event.case[s]) / n[s])
or <- if(denom==0) NA else
sum(event.case[s] * (n.control[s] - event.control[s]) / n[s]) / denom
ors[i] <- or
}
describe(ors)
``````

Hmisc documentation built on Sept. 12, 2023, 5:06 p.m.