find.matches: Find Close Matches In harrelfe/Hmisc: Harrell Miscellaneous

Description

Compares each row in x against all the rows in y, finding rows in y with all columns within a tolerance of the values of a given row of x. The default tolerance tol is zero, i.e., an exact match is required on all columns. For qualifying matches, a distance measure is computed. This is the sum of squares of differences between x and y after scaling the columns. The default scaling values are tol, and for columns with tol=0 the scale values are set to 1.0 (since they are ignored anyway: an exact match on such a column contributes zero to the distance). Matches (up to maxmatch of them) are stored and listed in order of increasing distance.
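The matching rule and distance described above can be sketched in base R for a single row of x. This is only an illustration of the description, not the package's implementation: match_one_row is a hypothetical helper, and it assumes that scale entries equal to zero are replaced by 1 so that the division is defined.

```r
# Hypothetical helper (not part of Hmisc): find matches in y for ONE row of x.
match_one_row <- function(xrow, y, tol = rep(0, ncol(y)), scale = tol,
                          maxmatch = 10) {
  # Assumption: a zero scale value is replaced by 1 to avoid dividing by zero.
  scale <- ifelse(scale == 0, 1, scale)

  # Column-wise differences y[j, ] - xrow for every row j of y.
  diffs <- sweep(y, 2, xrow)

  # A row of y qualifies only if every column is within its tolerance.
  ok  <- sweep(abs(diffs), 2, tol, "<=")
  idx <- which(rowSums(!ok) == 0)

  # Distance: sum of squared differences after scaling each column.
  d <- unname(rowSums(sweep(diffs[idx, , drop = FALSE], 2, scale, "/")^2))

  # Keep at most maxmatch matches, ordered by increasing distance.
  keep <- head(order(d), maxmatch)
  list(matches = idx[keep], distance = d[keep])
}

y   <- rbind(c(.1, .2), c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
res <- match_one_row(c(.09, .21), y, tol = c(.05, .05))
res$matches   # row numbers of y that qualify, closest first
res$distance  # corresponding scaled squared distances
```

Only the first two rows of y lie within 0.05 of the x row on both columns, so they are the only candidates for which a distance is computed.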
The summary method prints a frequency distribution of the number of matches per observation in x, the median of the minimum distances for all matches per x, as a function of the number of matches, and the frequency of selection of duplicate observations as those having the smallest distance. The print method prints the entire matches and distance components of the result from find.matches.
matchCases finds all controls that match cases on a single variable x within a tolerance of tol. This is intended for prospective cohort studies that use matching for confounder adjustment (even though regression models usually work better).

Usage

    find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10)

    ## S3 method for class 'find.matches'
    summary(object, ...)

    ## S3 method for class 'find.matches'
    print(x, digits, ...)

    matchCases(xcase, ycase, idcase=names(ycase),
               xcontrol, ycontrol, idcontrol=names(ycontrol),
               tol=NULL,
               maxobs=max(length(ycase),length(ycontrol))*10,
               maxmatch=20, which=c('closest','random'))

Arguments

x: a numeric matrix or the result of find.matches

y: a numeric matrix with the same number of columns as x

xcase, xcontrol: vectors, not necessarily of the same length, specifying a numeric variable used to match cases and controls

ycase, ycontrol: vectors or matrices, not necessarily having the same number of rows, specifying a variable to carry along from cases and matching controls. If you instead want to carry along rows from a data frame, let ycase and ycontrol be non-overlapping integer subscripts of the donor data frame.

tol: for find.matches, a vector of tolerances with the same number of elements as the number of columns of y. For matchCases, a scalar tolerance.

scale: a vector of scaling constants with the same number of elements as the number of columns of y.

maxmatch: maximum number of matches to allow. For matchCases, the maximum number of controls to match with a case (default is 20). If more than maxmatch matching controls are available, a random sample without replacement of maxmatch controls is used (if which="random").

object: an object created by find.matches

digits: number of digits to use in printing distances

idcase, idcontrol: vectors the same length as xcase and xcontrol respectively, specifying the id of cases and controls. Defaults are integers specifying original element positions within each of cases and controls.

maxobs: maximum number of cases and all matching controls combined (maximum dimension of the data frame resulting from matchCases). Default is ten times the maximum of the number of cases and number of controls. maxobs is used to allocate space for the resulting data frame.

which: set to "closest" (the default) to match cases with up to maxmatch controls that most closely match on x. Set which="random" to use randomly chosen controls. In either case, only those controls within tol on x are allowed to be used.

...: unused
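As a rough illustration of how tol, maxmatch and which interact for matchCases, the following base R sketch selects controls for a single case. pick_controls is a hypothetical helper written for this note, not a function in the package, and it only mirrors the behaviour described in the argument list.

```r
# Hypothetical helper (not part of Hmisc): choose controls for ONE case value.
pick_controls <- function(xcase1, xcontrol, tol, maxmatch = 20,
                          which = c("closest", "random")) {
  which <- match.arg(which)

  # Only controls within tol of the case on x are eligible.
  d        <- abs(xcontrol - xcase1)
  eligible <- base::which(d <= tol)
  n_keep   <- min(maxmatch, length(eligible))

  if (which == "closest") {
    # Up to maxmatch eligible controls, nearest first.
    eligible[order(d[eligible])][seq_len(n_keep)]
  } else if (length(eligible) > maxmatch) {
    # Random sample without replacement of maxmatch eligible controls.
    eligible[sample.int(length(eligible), maxmatch)]
  } else {
    eligible
  }
}

xcontrol <- c(10, 12, 15, 11, 30)
closest  <- pick_controls(11, xcontrol, tol = 2, maxmatch = 2)

set.seed(1)
random   <- pick_controls(11, xcontrol, tol = 2, maxmatch = 2, which = "random")
```

Here controls 1, 2 and 4 are within the tolerance; "closest" keeps the two nearest, while "random" draws two of the three at random.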

Value

find.matches returns a list of class find.matches with elements matches and distance. Both elements are matrices with the number of rows equal to the number of rows in x, and with k columns, where k is the maximum number of matches (<= maxmatch) that occurred. The elements of matches are row identifiers of y that match, with zeros if fewer than maxmatch matches are found (blanks if y had row names).

matchCases returns a data frame with variables idcase (id of case currently being matched), type (factor variable with levels "case" and "control"), id (id of case if case row, or id of matching case), and y.
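A short usage sketch may help when reading the returned object. It assumes the Hmisc package is installed and uses small made-up matrices; the comments restate what the text above says about the result rather than showing verified output.

```r
library(Hmisc)  # assumed to be installed; provides find.matches

# Toy data: 5 candidate rows in y, 2 rows in x to be matched.
y <- rbind(c(.1, .2), c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09, .21), c(.29, .39))

w <- find.matches(x, y, tol = c(.05, .05), maxmatch = 5)

class(w)       # an object of class "find.matches"
w$matches      # one row per row of x: row identifiers of matching rows of y
w$distance     # same shape: distance for each stored match
```

Because y has no row names here, the entries of w$matches are row numbers of y.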

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University

References

Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455-463.

Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: trade-offs. J Clin Epidemiology 56:230-237.

Note: These papers were not used for the functions here but probably should have been.