nmatch: Optimal nonbipartite matching in randomized experiments and...
In designmatch: Matched Samples that are Balanced and Representative by Design

nmatch

R Documentation

Optimal nonbipartite matching in randomized experiments and observational studies

Description

Function for optimal nonbipartite matching in randomized experiments and observational studies that directly balances the observed covariates. nmatch allows the user to enforce different forms of covariate balance in the matched samples, such as moment balance (e.g., of means, variances, and correlations), distributional balance (e.g., fine balance, near-fine balance, strength-k balancing), and exact matching. Among other applications, nmatch can be used in the design of randomized experiments for matching before randomization (Greevy et al. 2004, Zou and Zubizarreta 2016), and in observational studies for matching with doses and strengthening an instrumental variable (Baiocchi et al. 2010, Lu et al. 2011).

Usage

	nmatch(dist_mat, subset_weight = NULL, total_pairs = NULL, mom = NULL,
	       exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL,
	       near = NULL, far = NULL, solver = NULL)

Arguments

`dist_mat`	distance matrix: a matrix of positive distances between units.
`subset_weight`	subset matching weight: a scalar that regulates the trade-off between the total sum of distances between matched pairs and the total number of matched pairs. The larger `subset_weight`, the more importance will be given to the total number of matched pairs relative to the total sum of distances between matched pairs. See Rosenbaum (2012) and Zubizarreta et al. (2013) for a discussion of this parameter. If `subset_weight = NULL`, then `nmatch` will match all the available units, provided a feasible solution exists.
`total_pairs`	total number of matched pairs: a scalar specifying the number of matched pairs to be obtained. If `total_pairs = NULL` then no specific number of matched pairs is required before matching.
`mom`	moment balance parameters: a list with three arguments, `mom = list(covs = mom_covs, tols = mom_tols, targets = mom_targets)`. `mom_covs` is a matrix where each column is a covariate whose mean is to be balanced. `mom_tols` is a vector of tolerances for the maximum difference in means for the covariates in `mom_covs`. `mom_targets` is a vector of target moments (e.g., means) of a distribution to be approximated by matching. `mom_targets` is optional, but if `mom_covs` is specified then `mom_tols` needs to be specified too. If `mom_targets` is `NULL`, then `nmatch` will match treated and control units so that covariates in `mom_covs` differ at most by `mom_tols`. If `mom_targets` is specified, then `nmatch` will match treated and control units so that each matched group differs at most by `mom_tols` units from the respective moments in `mom_targets`. As a result, the matched groups will differ by at most `mom_tols * 2` from each other. Under certain assumptions, `mom_targets` can be used for constructing a representative matched sample. The lengths of `mom_tols` and `mom_targets` have to be equal to the number of columns of `mom_covs`. Note that the columns of `mom_covs` can be transformations of the original covariates to balance higher order single-dimensional moments like variances and skewness, and multidimensional moments such as correlations (Zubizarreta 2012).
`exact`	Exact matching parameters: a list with one argument, `exact = list(covs = exact_covs)`, where `exact_covs` is a matrix where each column is a nominal covariate for exact matching.
`near_exact`	Near-exact matching parameters: a list with two arguments, `near_exact = list(covs = near_exact_covs, devs = near_exact_devs)`. `near_exact_covs` are the near-exact matching covariates; specifically, a matrix where each column is a nominal covariate for near-exact matching. `near_exact_devs` are the maximum deviations from near-exact matching: a vector of scalars defining the maximum deviation allowed from exact matching for the covariates defined in `near_exact_covs`. Note that the length of `near_exact_devs` has to be equal to the number of columns of `near_exact_covs`. For detailed expositions of near-exact matching in the context of bipartite matching, see section 9.2 of Rosenbaum (2010) and Zubizarreta et al. (2011).
`fine`	Fine balance parameters: a list with one argument, `fine = list(covs = fine_covs)`, where `fine_covs` is a matrix where each column is a nominal covariate for fine balance. Fine balance enforces exact distributional balance on nominal covariates, but without constraining treated and control units to be matched within each category of each nominal covariate as in exact matching. See chapter 10 of Rosenbaum (2010) for details.
`near_fine`	Near-fine balance parameters: a list with two arguments, `near_fine = list(covs = near_fine_covs, devs = near_fine_devs)`. `near_fine_covs` is a matrix where each column is a nominal covariate for near-fine matching. `near_fine_devs` is a vector of scalars defining the maximum deviation allowed from fine balance for the covariates in `near_fine_covs`. Note that the length of `near_fine_devs` has to be equal to the number of columns of `near_fine_covs`. See Yang et al. (2012) for a description of near-fine balance.
`near`	Near matching parameters: a list with three arguments, `near = list(covs = near_covs, pairs = near_pairs, groups = near_groups)`. `near_covs` is a matrix where each column is a variable for near matching. `near_pairs` is a vector determining the maximum distance between individual matched pairs for each variable in `near_covs`. `near_groups` is a vector defining the maximum average distance (in aggregate) between matched groups for each covariate in `near_covs`. If `near_covs` is specified, then either `near_pairs`, `near_groups`, or both must be specified as well, and the length of `near_pairs` and/or `near_groups` has to be equal to the number of columns of `near_covs`.
`far`	Far matching parameters: a list with three arguments, `far = list(covs = far_covs, pairs = far_pairs, groups = far_groups)`. `far_covs` is a matrix where each column is a variable (a covariate or an instrumental variable) for far matching. `far_pairs` is a vector determining the minimum distance between units in a matched pair for each variable in `far_covs`, and `far_groups` is a vector defining the minimum average (aggregate) distance between matched groups for each variable in `far_covs`. If `far_covs` is specified, then either `far_pairs`, `far_groups`, or both must be specified, and the length of `far_pairs` and/or `far_groups` has to be equal to the number of columns of `far_covs`. See Zubizarreta et al. (2013) for strengthening an instrumental variable with integer programming.
`solver`	Optimization solver parameters: a list with five objects, `solver = list(name = name, t_max = t_max, approximate = 1, round_cplex = 0, trace_cplex = 0)`. `name` is a string that determines the optimization solver to be used. The options are: `cplex`, `glpk`, `gurobi`, `highs`, and `symphony`. The default solver is `highs` with `approximate = 1`, so that by default an approximate solution is found (see `approximate` below). For an exact solution, we strongly recommend using `cplex` or `gurobi` as they are much faster than the other solvers, but they do require a license (free for academics, but not for people outside universities). Between `cplex` and `gurobi`, note that the installation of the `gurobi` interface for R is much simpler. `t_max` is a scalar with the maximum time limit for finding the matches. This option is specific to `cplex` and `gurobi`. If the optimal matches are not found within this time limit, a partial, suboptimal solution is given. `approximate` is a scalar that determines the method of solution. If `approximate = 1` (the default), an approximate solution is found via a relaxation of the original integer program. This method of solution is faster than `approximate = 0`, but some balancing constraints may be violated to some extent. `round_cplex` is a binary parameter specific to `cplex`. `round_cplex = 1` ensures that the solution found is integral by rounding and that all the constraints are exactly satisfied; `round_cplex = 0` (the default) means there is no rounding, which may return slightly infeasible integer solutions. `trace` is a binary parameter specific to `cplex` and `gurobi`. `trace = 1` turns the optimizer output on. The default is `trace = 0`.
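
As an illustration of the argument structure described above, the following is a minimal sketch of how the `mom` and `near` lists might be assembled in base R. The data are synthetic and every object name (`age_sq`, `education_sq`, the tolerance values) is an assumption made for illustration, not part of the package. Squaring centered covariates is one common way to target second moments; it is shown here only as an example of the "transformations of the original covariates" mentioned for `mom_covs`.

```r
## Synthetic covariates (illustrative only, not package data)
set.seed(1)
n = 20
age = round(runif(n, 20, 60))
education = sample(8:16, n, replace = TRUE)

## Moment balance: means of the covariates plus squared, centered
## versions of them, so that second moments are also constrained
mom_covs = cbind(age, education,
                 age_sq = (age - mean(age))^2,
                 education_sq = (education - mean(education))^2)

## One tolerance per column: 0.1 standard deviations of each column
mom_tols = apply(mom_covs, 2, sd) * 0.1

## Optional targets: here, the full-sample means of each column
mom_targets = colMeans(mom_covs)

mom = list(covs = mom_covs, tols = mom_tols, targets = mom_targets)

## Near matching on age: units in a pair at most 5 years apart,
## matched groups at most 1 year apart on average (assumed values)
near_covs = cbind(age)
near = list(covs = near_covs, pairs = 5, groups = 1)

## The documented length requirements can be checked before matching
stopifnot(length(mom$tols) == ncol(mom$covs),
          length(mom$targets) == ncol(mom$covs),
          length(near$pairs) == ncol(near$covs),
          length(near$groups) == ncol(near$covs))
```

These lists would then be passed as the `mom` and `near` arguments of `nmatch`, together with a distance matrix whose rows and columns follow the same unit order as the covariate matrices.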

Value

A list containing the optimal solution, with the following objects:

`obj_total`	value of the objective function at the optimum;
`obj_dist_mat`	value of the total sum of distances term of the objective function at the optimum;
`id_1`	indexes of the matched units in group 1 at the optimum;
`id_2`	indexes of the matched units in group 2 at the optimum;
`group_id`	matched pairs at the optimum;
`time`	time elapsed to find the optimal solution.
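
The returned indexes can be combined with the distance matrix to inspect the matched pairs. The sketch below uses a hand-made stand-in for the output list rather than real `nmatch` output, and it assumes that the k-th entries of `id_1` and `id_2` belong to the same pair; that alignment is an assumption of this illustration and should be checked against `group_id` in real output.

```r
## Toy distance matrix for 6 units (illustrative only)
set.seed(2)
x = rnorm(6)
dist_mat = as.matrix(dist(x))

## Hypothetical stand-in for the list returned by nmatch (not real output)
out = list(id_1 = c(1, 3, 5), id_2 = c(2, 4, 6))

## Assumption: the k-th entries of id_1 and id_2 form the k-th pair.
## Matrix indexing with a two-column matrix picks one distance per pair.
pairs = data.frame(pair = seq_along(out$id_1),
                   unit_1 = out$id_1,
                   unit_2 = out$id_2,
                   dist = dist_mat[cbind(out$id_1, out$id_2)])
pairs

## Total distance across the matched pairs; under the alignment
## assumption this is the quantity one would compare with obj_dist_mat
sum(pairs$dist)
```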

Author(s)

Jose R. Zubizarreta <zubizarreta@hcp.med.harvard.edu>, Cinar Kilcioglu <ckilcioglu16@gsb.columbia.edu>.

References

Baiocchi, M., Small, D., Lorch, S. and Rosenbaum, P. R. (2010), "Building a Stronger Instrument in an Observational Study of Perinatal Care for Premature Infants," Journal of the American Statistical Association, 105, 1285-1296.

Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), "Optimal Multivariate Matching Before Randomization," Biostatistics, 5, 263-275.

Lu, B., Greevy, R., Xu, X., and Beck, C. (2011), "Optimal Nonbipartite Matching and its Statistical Applications," The American Statistician, 65, 21-30.

Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.

Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observational Studies," Journal of Computational and Graphical Statistics, 21, 57-71.

Yang, F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.

Zou, J., and Zubizarreta, J. R. (2016), "Covariate Balanced Restricted Randomization: Optimal Designs, Exact Tests, and Asymptotic Results," working paper.

Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.

Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.

Examples

    

## Uncomment the following example
## Load and attach data
#data(lalonde)
#attach(lalonde)

################################# 
## Example: optimal subset matching
################################# 

## Optimal subset matching pursues two competing goals at 
## the same time: to minimize the total of distances while 
## matching as many observations as possible.  The trade-off 
## between these two is regulated by the parameter subset_weight 
## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion).
## Here the balance requirement is mean balance for age and 
## education.  We require 50 pairs to be matched.
## The solver used is HiGHS with the approximate option.

## Matrix of covariates
#X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75)

## Distance matrix
#dist_mat_covs = round(dist(X_mat, diag = TRUE, upper = TRUE), 1)
#dist_mat = as.matrix(dist_mat_covs)

## Subset matching weight
#subset_weight = 1

## Total pairs to be matched
#total_pairs = 50

## Moment balance: constrain differences in means to be at most .1 standard deviations apart
#mom_covs = cbind(age, education)
#mom_tols = apply(mom_covs, 2, sd)*.1
#mom = list(covs = mom_covs, tols = mom_tols)

## Solver options
#t_max = 60*5
#solver = "highs"
#approximate = 1
#solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, 
#trace_cplex = 0)

## Match                  
#out = nmatch(dist_mat = dist_mat, subset_weight = subset_weight, total_pairs = total_pairs, 
#mom = mom, solver = solver)              
              
## Indices of the matched units in groups 1 and 2
#id_1 = out$id_1  
#id_2 = out$id_2	

## Assess mean balance
#a = apply(mom_covs[id_1, ], 2, mean)
#b = apply(mom_covs[id_2, ], 2, mean)
#tab = round(cbind(a, b, a-b, mom_tols), 2)
#colnames(tab) = c("Mean 1", "Mean 2", "Diffs", "Tols")
#tab

## Assess fine balance on nominal covariates (fine balance was not
## imposed above, and the solution is approximate)
#fine_covs = cbind(black, hispanic, married, nodegree)
#for (i in 1:ncol(fine_covs)) {
#	print(finetab(fine_covs[, i], id_1, id_2))
#}

designmatch documentation built on Aug. 29, 2023, 5:11 p.m.