# find.matches: Find Close Matches In Hmisc: Harrell Miscellaneous

## Description

Compares each row in `x` against all the rows in `y`, finding rows in `y` with all columns within a tolerance of the values in a given row of `x`. The default tolerance `tol` is zero, i.e., an exact match is required on all columns. For qualifying matches, a distance measure is computed: the sum of squares of differences between `x` and `y` after scaling the columns. The default scaling values are `tol`, and for columns with `tol=0` the scale values are set to 1.0 (since they are ignored anyway). Matches (up to `maxmatch` of them) are stored and listed in order of increasing distance.
The `summary` method prints a frequency distribution of the number of matches per observation in `x`, the median of the minimum distances for all matches per `x`, as a function of the number of matches, and the frequency of selection of duplicate observations as those having the smallest distance. The `print` method prints the entire `matches` and `distance` components of the result from `find.matches`.
`matchCases` finds all controls that match cases on a single variable `x` within a tolerance of `tol`. This is intended for prospective cohort studies that use matching for confounder adjustment (even though regression models usually work better).
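
The `find.matches` rule described above (every column within its tolerance, then a scaled sum-of-squares distance) can be sketched in base R for a single row of `x`. This is only an illustration of the documented behavior, not the package's implementation; the helper name `close_matches` is hypothetical, and the data are the small matrices used in the Examples section.

```r
# Hypothetical helper (not part of Hmisc): close matches for one row of x.
close_matches <- function(xrow, y, tol, scale = tol, maxmatch = 10) {
  scale[scale == 0] <- 1                     # avoid dividing by a zero tolerance
  d  <- abs(sweep(y, 2, xrow))               # |y[i, k] - xrow[k]| for every row i
  ok <- which(rowSums(sweep(d, 2, tol, "<=")) == ncol(y))  # all columns within tol
  dist <- rowSums(sweep(d[ok, , drop = FALSE], 2, scale, "/")^2)
  ord  <- head(order(dist), maxmatch)        # closest first, at most maxmatch
  list(matches = ok[ord], distance = dist[ord])
}

y <- rbind(c(.1, .2), c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09, .21), c(.29, .39))

r1 <- close_matches(x[1, ], y, tol = c(.05, .05))
r2 <- close_matches(x[2, ], y, tol = c(.05, .05))
r1$matches   # rows 1 and 2 of y
r2$matches   # rows 3 and 4 of y
```

Row 5 of `y` is never selected because its second column is far outside the tolerance, even though its first column is close.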

## Usage

```r
find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10)
## S3 method for class 'find.matches'
summary(object, ...)
## S3 method for class 'find.matches'
print(x, digits, ...)

matchCases(xcase,    ycase,    idcase=names(ycase),
           xcontrol, ycontrol, idcontrol=names(ycontrol),
           tol=NULL,
           maxobs=max(length(ycase),length(ycontrol))*10,
           maxmatch=20, which=c('closest','random'))
```

## Arguments

| Argument | Description |
|---|---|
| `x` | a numeric matrix or the result of `find.matches` |
| `y` | a numeric matrix with the same number of columns as `x` |
| `xcase`, `xcontrol` | vectors, not necessarily of the same length, specifying a numeric variable used to match cases and controls |
| `ycase`, `ycontrol` | vectors or matrices, not necessarily having the same number of rows, specifying a variable to carry along from cases and matching controls. If you instead want to carry along rows from a data frame, let `ycase` and `ycontrol` be non-overlapping integer subscripts of the donor data frame. |
| `tol` | for `find.matches`, a vector of tolerances with number of elements the same as the number of columns of `y`. For `matchCases`, a scalar tolerance. |
| `scale` | a vector of scaling constants with number of elements the same as the number of columns of `y` |
| `maxmatch` | maximum number of matches to allow. For `matchCases`, maximum number of controls to match with a case (default is 20). If more than `maxmatch` matching controls are available, a random sample without replacement of `maxmatch` controls is used (if `which="random"`). |
| `object` | an object created by `find.matches` |
| `digits` | number of digits to use in printing distances |
| `idcase`, `idcontrol` | vectors the same length as `xcase` and `xcontrol` respectively, specifying the id of cases and controls. Defaults are integers specifying original element positions within each of cases and controls. |
| `maxobs` | maximum number of cases and all matching controls combined (maximum dimension of the data frame resulting from `matchCases`). Default is ten times the maximum of the number of cases and number of controls. `maxobs` is used to allocate space for the resulting data frame. |
| `which` | set to `"closest"` (the default) to match cases with up to `maxmatch` controls that most closely match on `x`. Set `which="random"` to use randomly chosen controls. In either case, only those controls within `tol` on `x` are allowed to be used. |
| `...` | unused |
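
How `tol`, `maxmatch`, and `which` interact for `matchCases` can be sketched in base R for a single case. The helper `pick_controls` is hypothetical and only mirrors the behavior described above; it is not the package's code, and the order in which it returns control indices is a choice made for this sketch.

```r
# Hypothetical helper (not part of Hmisc): choose controls for one case value xc.
pick_controls <- function(xc, xcontrol, tol, maxmatch = 20,
                          which = c("closest", "random")) {
  which <- match.arg(which)
  gap <- abs(xcontrol - xc)
  eligible <- base::which(gap <= tol)        # only controls within tol are allowed
  if (which == "closest") {
    head(eligible[order(gap[eligible])], maxmatch)    # nearest on x first
  } else if (length(eligible) > maxmatch) {
    eligible[sample.int(length(eligible), maxmatch)]  # sample without replacement
  } else {
    eligible
  }
}

xcontrol <- 1:6
pick_controls(3,  xcontrol, tol = 1)                 # controls 3, 2, 4
pick_controls(12, xcontrol, tol = 1)                 # no eligible controls
pick_controls(3,  xcontrol, tol = 2, maxmatch = 2, which = "random")
```

Indexing with `sample.int` rather than calling `sample(eligible, ...)` avoids the base-R pitfall where `sample()` on a single number draws from `1:n`.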

## Value

`find.matches` returns a list of class `find.matches` with elements `matches` and `distance`. Both elements are matrices with the number of rows equal to the number of rows in `x`, and with `k` columns, where `k` is the maximum number of matches (`<= maxmatch`) that occurred. The elements of `matches` are row identifiers of `y` that match, with zeros if fewer than `maxmatch` matches are found (blanks if `y` had row names).

`matchCases` returns a data frame with variables `idcase` (id of case currently being matched), `type` (factor variable with levels `"case"` and `"control"`), `id` (id of the case for a case row, or id of the matched control for a control row), and `y`. The example output on this page also shows a column `x` holding the matching variable.
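
Because unused slots in `matches` are padded with zeros, the number of matches per row and the matched row indices can be recovered as sketched here. The matrix is typed in by hand from the first five rows of `w$matches` in the example output on this page, so no package call is needed; the sketch assumes `y` had no row names.

```r
# First five rows of w$matches, copied from the example output
matches <- rbind(c(999, 694,   0, 0, 0),
                 c(  0,   0,   0, 0, 0),
                 c(235,   0,   0, 0, 0),
                 c(964, 139,   0, 0, 0),
                 c(906, 427, 204, 0, 0))

n_matches <- rowSums(matches > 0)            # matches found per row of x
matched   <- lapply(seq_len(nrow(matches)),  # row indices of y, zeros dropped
                    function(i) matches[i, matches[i, ] > 0])
n_matches      # 2 0 1 2 3
matched[[5]]   # 906 427 204
```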

## Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

## References

Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455–463.

Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: trade-offs. J Clin Epidemiology 56:230–237.

Note: These papers were not used for the functions here but probably should have been.

## See Also

`scale`, `apply`

## Examples

```r
library(Hmisc)

y <- rbind(c(.1, .2),c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09,.21), c(.29,.39))
y
x
w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05))


set.seed(111)       # so can replicate results
x <- matrix(runif(500), ncol=2)
y <- matrix(runif(2000), ncol=2)
w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03))
w$matches[1:5,]
w$distance[1:5,]
# Find first x with 3 or more y-matches
num.match <- apply(w$matches, 1, function(x)sum(x > 0))
j <- which(num.match > 2)[1]
x[j,]
y[w$matches[j,],]


summary(w)


# For many applications would do something like this:
# attach(df1)
# x <- cbind(age, sex) # Just do as.matrix(df1) if df1 has no factor objects
# attach(df2)
# y <- cbind(age, sex)
# mat <- find.matches(x, y, tol=c(5,0)) # exact match on sex, 5y on age


# Demonstrate matchCases
xcase     <- c(1,3,5,12)
xcontrol  <- 1:6
idcase    <- c('A','B','C','D')
idcontrol <- c('a','b','c','d','e','f')
ycase     <- c(11,33,55,122)
ycontrol  <- c(11,22,33,44,55,66)
matchCases(xcase, ycase, idcase,
           xcontrol, ycontrol, idcontrol, tol=1)


# If y is a binary response variable, the following code
# will produce a Mantel-Haenszel summary odds ratio that
# utilizes the matching.
# Standard variance formula will not work here because
# a control will match more than one case
# WARNING: The M-H procedure exemplified here is suspect
# because of the small strata and widely varying number
# of controls per case.

x    <- c(1, 2, 3, 3, 3, 6, 7, 12,  1, 1:7)
y    <- c(0, 0, 0, 1, 0, 1, 1,  1,  1, 0, 0, 0, 0, 1, 1, 1)
case <- c(rep(TRUE, 8), rep(FALSE, 8))
id   <- 1:length(x)

m <- matchCases(x[case],  y[case],  id[case],
                x[!case], y[!case], id[!case], tol=1)
iscase <- m$type=='case'
# Note: the first tapply only ensures that event indicators are
# sorted by case id (there is one case row per matched set).
# The second actually sums events over the matched controls.
event.case    <- tapply(m$y[iscase],  m$idcase[iscase],  sum)
event.control <- tapply(m$y[!iscase], m$idcase[!iscase], sum)
n.control     <- tapply(!iscase,      m$idcase,          sum)
n             <- tapply(m$y,          m$idcase,          length)
or <- sum(event.case * (n.control - event.control) / n) /
      sum(event.control * (1 - event.case) / n)
or

# Bootstrap this estimator by sampling with replacement from
# subjects.  Assumes id is unique when combining cases+controls
# (id was constructed this way above).  The following algorithm
# puts all sampled controls back with the cases to whom they were
# originally matched.

ids <- unique(m$id)
idgroups <- split(1:nrow(m), m$id)
B   <- 50   # in practice use many more
ors <- numeric(B)
# Function to order w by ids, leaving unassigned elements zero
align <- function(ids, w) {
  z <- structure(rep(0, length(ids)), names=ids)
  z[names(w)] <- w
  z
}
for(i in 1:B) {
  j <- sample(ids, replace=TRUE)
  obs <- unlist(idgroups[j])
  u <- m[obs,]
  iscase <- u$type=='case'
  n.case <- align(ids, tapply(u$type, u$idcase,
                              function(v)sum(v=='case')))
  n.control <- align(ids, tapply(u$type, u$idcase,
                                 function(v)sum(v=='control')))
  event.case <- align(ids, tapply(u$y[iscase],  u$idcase[iscase],  sum))
  event.control <- align(ids, tapply(u$y[!iscase], u$idcase[!iscase], sum))
  n <- n.case + n.control
  # Remove sets having 0 cases or 0 controls in resample
  s <- n.case > 0 & n.control > 0
  denom <- sum(event.control[s] * (n.case[s] - event.case[s]) / n[s])
  or <- if(denom==0) NA else
    sum(event.case[s] * (n.control[s] - event.control[s]) / n[s]) / denom
  ors[i] <- or
}
describe(ors)
```
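
The odds ratio printed for the Mantel-Haenszel example (1.666667 in the example output) can be checked by hand in base R. The sketch below rebuilds the matched sets directly from the tolerance rule instead of calling `matchCases`. It assumes, as the `matchCases` demonstration output suggests, that a case with no eligible control is dropped. Names such as `sets`, `xctl`, and `yctl` are illustrative only.

```r
x    <- c(1, 2, 3, 3, 3, 6, 7, 12, 1, 1:7)
y    <- c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
case <- c(rep(TRUE, 8), rep(FALSE, 8))

xcase <- x[case];  ycase <- y[case]
xctl  <- x[!case]; yctl  <- y[!case]

# For each case, indices of the controls within tol = 1 on x
sets <- lapply(xcase, function(xc) which(abs(xctl - xc) <= 1))
keep <- lengths(sets) > 0                    # drop cases with no control

event.case    <- ycase[keep]
n.control     <- lengths(sets)[keep]
event.control <- sapply(sets[keep], function(s) sum(yctl[s]))
n             <- 1 + n.control               # one case per matched set

or <- sum(event.case * (n.control - event.control) / n) /
      sum(event.control * (1 - event.case) / n)
or
```

The numerator sums to 0.75 and the denominator to 0.45, giving 5/3, which agrees with the printed value. The set sizes (one case with 0 controls, one with 2, five with 3, one with 4) also agree with the frequency table in the output.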

### Example output

```
Loading required package: lattice

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

     [,1] [,2]
[1,] 0.10 0.20
[2,] 0.11 0.22
[3,] 0.30 0.40
[4,] 0.31 0.41
[5,] 0.32 5.00
     [,1] [,2]
[1,] 0.09 0.21
[2,] 0.29 0.39
     Match #1 Match #2 Match #3 Match #4 Match #5
[1,]      999      694        0        0        0
[2,]        0        0        0        0        0
[3,]      235        0        0        0        0
[4,]      964      139        0        0        0
[5,]      906      427      204        0        0
     Distance #1 Distance #2 Distance #3 Distance #4 Distance #5
[1,]   0.1042884   0.1562084          NA          NA          NA
[2,]          NA          NA          NA          NA          NA
[3,]   0.7272258          NA          NA          NA          NA
[4,]   0.2815041   0.7973284          NA          NA          NA
[5,]   0.6135293   0.7162828   0.7189297          NA          NA
[1] 0.3776632 0.6833354
          [,1]      [,2]
[1,] 0.3708767 0.7045144
[2,] 0.3687917 0.7049588
[3,] 0.3821378 0.7078709
Frequency table of number of matches found per observation

m
 0  1  2  3  4  5
27 53 64 54 32 20

Median minimum distance by number of matches

        1         2         3         4         5
0.5859325 0.3376432 0.1917933 0.1407859 0.1398928

Observations selected first more than once (with frequencies)

 57  73  91 101 116 165 191 251 256 292 415 422 438 443 467 552 592 593 650 691
  2   2   2   2   2   3   2   2   2   3   2   2   2   2   2   2   2   2   2   2
719 733 747 754 818 820 824 849 871 926 945 964 970
  2   2   2   2   2   2   3   2   2   2   2   2   2

Frequencies of Number of Matched Controls per Case:

matches
0 2 3
1 1 2

   idcase    type id x  y
1       A    case  A 1 11
2       A control  a 1 11
3       A control  b 2 22
4       B    case  B 3 33
5       B control  b 2 22
6       B control  c 3 33
7       B control  d 4 44
8       C    case  C 5 55
9       C control  d 4 44
10      C control  e 5 55
11      C control  f 6 66

Frequencies of Number of Matched Controls per Case:

matches
0 2 3 4
1 1 5 1

[1] 1.666667
ors
       n  missing distinct     Info     Mean      Gmd      .05      .10
      31       19       14    0.835   0.9076    1.384    0.000    0.000
     .25      .50      .75      .90      .95
   0.000    0.000    1.143    2.917    3.107

0 (17, 0.548), 0.375 (1, 0.032), 0.384615384615385 (1, 0.032), 0.8 (1, 0.032),
0.9375 (1, 0.032), 1.09090909090909 (1, 0.032), 1.14285714285714 (2, 0.065),
1.66666666666667 (1, 0.032), 1.71428571428571 (1, 0.032), 2.25 (1, 0.032),
2.91666666666667 (1, 0.032), 3 (1, 0.032), 3.21428571428571 (1, 0.032), 7.5 (1,
0.032)
```

Hmisc documentation built on Feb. 28, 2021, 9:05 a.m.