# bmatch: Optimal bipartite matching in observational studies In designmatch: Matched Samples that are Balanced and Representative by Design

## Description

Function for optimal bipartite matching in observational studies that directly balances the observed covariates. `bmatch` allows the user to enforce different forms of covariate balance in the matched samples such as moment balance (e.g., of means, variances and correlations), distributional balance (e.g., fine balance, near-fine balance, strength-k balancing) and exact matching. In observational studies where an instrumental variable is available, `bmatch` can be used to handle weak instruments and strengthen them by means of the `far` options (see Yang et al. 2014 for an example). `bmatch` can also be used in discontinuity designs by matching units in a neighborhood of the discontinuity (see Keele et al. 2015 for details). In any of these settings, `bmatch` either minimizes the total sum of covariate distances between matched units, maximizes the total number of matched units, or optimizes a combination of the two, subject to matching, covariate balancing, and representativeness constraints (see the examples below).

## Usage

 ```1 2 3 4``` ``` bmatch(t_ind, dist_mat = NULL, subset_weight = NULL, n_controls = 1, total_groups = NULL, mom = NULL, ks = NULL, exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL, near = NULL, far = NULL, solver = NULL) ```

## Arguments

 `t_ind` treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). Please note that the data needs to be sorted in decreasing order according to this treatment indicator. `dist_mat` distance matrix: a matrix of positive distances between treated units (rows) and controls (columns). If `dist_mat = NULL` and `subset_weight = 1`, then bmatch will solve the cardinality matching problem in Zubizarreta et al. (2014). `subset_weight` subset matching weight: a scalar that regulates the trade-off between the total sum of distances between matched pairs and the total number of matched pairs. The larger `subset_weight`, the more importance will be given to the the total number of matched pairs relative to the total sum of distances between matched pairs. See Rosenbaum (2012) and Zubizarreta et al. (2013) for a discussion of this parameter. If `subset_weight = NULL`, then `bmatch` will match all the treated observations, provided it exists a feasible solution. If `subset_weight = 1` and `dist_mat = NULL`, then `bmatch` will solve the cardinality matching problem in Zubizarreta et al. (2014). `n_controls` number of controls: a scalar defining the number of controls to be matched with a fixed rate to each treated unit. The default is `n_controls = 1`, i.e. pair matching. `total_groups` total number of matched pairs: a scalar specifying the number of matched pairs to be obtained. If `total_groups = NULL`, then no specific number of matched pairs is required before matching. `mom` moment balance parameters: a list with three arguments, `mom = list(covs = mom_covs, tols = mom_tols, targets = mom_targets)`. `mom_covs` is a matrix where each column is a covariate whose mean is to be balanced. `mom_tols` is a vector of tolerances for the maximum difference in means for the covariates in `mom_covs`. `mom_targets` is a vector of target moments (e.g., means) of a distribution to be approximated by matched sampling. `mom_targets` is optional, but if `mom_covs` is specified then `mom_tols` needs to be specified too. If `mom_targets` is `NULL`, then `bmatch` will match treated and control units so that covariates in `mom_covs` differ at most by `mom_tols`. If `mom_targets` is specified, then `bmatch` will match treated and control units so that each matched group differs at most by `mom_tols` units from the respective moments in `mom_targets`. As a result, the matched groups will differ at most `mom_tols * 2` from each other. Under certain assumptions, `mom_targets` can be used for constructing a representative matched sample. The lengths of `mom_tols` and `mom_target` have to be equal to the number of columns of `mom_covs`. Note that the columns of `mom_covs` can be transformations of the original covariates to balance higher order single-dimensional moments like variances and skewness, and multidimensional moments such as correlations (Zubizarreta 2012). `ks` Kolmogorov-Smirnov balance parameters: a list with three objects, `ks = list(covs = ks_covs, n_grid = ks_n_grid, tols = ks_tols)`. `ks_covs` is a matrix where each column is a covariate whose difference in a coarsened version of the Kolmogorov-Smirnov statistic between treated units and matched controls is to be constrained. `ks_n_grid` is scalar defining the number of equispaced grid points for calculating the Kolmogorov-Smirnov statistic between the distributions of the treated units and matched controls for the covariates in `ks_covs`. The defaul is `ks_n_grid = 10` for the deciles of the empirical cumulative distribution functions of the treated units for the covariate in question. `ks_tols` is a vector of tolerances for the maximum differences in means for the covariates defined in `ks_covs`. Note that the length of `ks_tols` has to be equal to the number of columns of `ks_covs`. `exact` Exact matching parameters: a list with one argument, `exact = list(covs = exact_covs)`, where `exact_covs` is a matrix where each column is a nominal covariate for exact matching. `near_exact` Near-exact matching parameters: a list with two objects, `near_exact = list(covs = near_exact_covs, devs = near_exact_devs)`. `near_exact_covs` are the near-exact matching covariates; specifically, a matrix where each column is a nominal covariate for near-exact matching. `near_exact_devs` are the maximum deviations from near-exact matching: a vector of scalars defining the maximum deviation allowed from exact matching for the covariates defined in `near_exact_covs`. Note that the length of `near_exact_devs` has to be equal to the number of columns of `near_exact_covs`. For detailed expositions of near-exact matching, see section 9.2 of Rosenbaum (2010) and Zubizarreta et al. (2011). `fine` Fine balance parameters: a list with one argument, `fine = list(covs = fine_covs)`, where `fine_covs` is a matrix where each column is a nominal covariate for fine balance. Fine balance enforces exact distributional balance on nominal covariates, but without constraining treated and control units to be matched within each category of each nominal covariate as in exact matching. See chapter 10 of Rosenbaum (2010) for details. `near_fine` Near-fine balance parameters: a list with two objects, `near_fine = list(covs = near_fine_covs, devs = near_fine_devs)`. `near_fine_covs` is a matrix where each column is a nominal covariate for near-fine matching. `near_fine_devs` is a vector of scalars defining the maximum deviation allowed from fine balance for the covariates in `near_fine_covs`. Note that the length of `near_fine_devs` has to be equal to the number of columns of `near_fine_covs`. See Yang et al. (2012) for a description of near-fine balance. `near` Near matching parameters: a list with three objects, `near = list(covs = near_covs, pairs = near_pairs, groups = near_groups)`. `near_covs` is a matrix where each column is a variable for near matching. `near_pairs` is a vector determining the maximum distance between individual matched pairs for each variable in `near_covs`. `near_groups` is a vector determining the maximum average distance between matched groups (in aggregate) for each covariate in `near_covs`. If `near_covs` is specified, then either `near_pairs`, `near_covs` or both must be specified as well, and the length of `near_pairs` and/or `near_groups` has to be equal to the number of columns of `near_covs`. The `near` options can be useful for matching within a window of a discontinuity (see Keele et al. 2015 for details). `far` Far matching parameters: a list with three objects, `far = list(covs = far_covs, pairs = far_pairs, groups = far_groups)`. `far_covs` is a matrix where each column is a variable (a covariate or an instrumental variable) for far matching. `far_pairs` is a vector determining the minimum distance between units in a matched pair for each variable in `far_covs`, and `far_groups` is a vector defining the minimum average (aggregate) distance between matched groups for each variable in `far_covs`. If `far_covs` is specified, then either `far_pairs`, `far_covs`, or both, must be specified, and the length of `far_pairs` and/or `far_groups` has to be equal to the number of columns of `far_covs`. See Yang et al. (2014) for strengthening an instrumental variable via far matching in a bipartite setting. `solver` Optimization solver parameters: a list with four objects, `solver = list(name = name, t_max = t_max, approximate = 1, round_cplex = 0,` ` trace_cplex = 0)`. `solver` is a string that determines the optimization solver to be used. The options are: `cplex`, `glpk`, `gurobi` and `symphony`. The default solver is `glpk` with `approximate = 1`, so that by default an approximate solution is found (see `approximate` below). For an exact solution, we strongly recommend using `cplex` or `gurobi` as they are much faster than the other solvers, but they do require a license (free for academics, but not for people outside universities). Between `cplex` and `gurobi`, note that installing the R interface for `gurobi` is much simpler. `t_max` is a scalar with the maximum time limit for finding the matches. This option is specific to `cplex` and `gurobi`. If the optimal matches are not found within this time limit, a partial, suboptimal solution is given. `approximate` is a scalar that determines the method of solution. If `approximate = 1` (the default), an approximate solution is found via a relaxation of the original integer program. This method of solution is faster than `approximate = 0`, but some balancing constraints may be violated to some extent. This option works only with `n_controls = 1`, i.e. pair matching. `round_cplex` is binary specific to `cplex`. `round_cplex = 1` ensures that the solution found is integral by rounding and all the constraints are exactly statisfied; `round_cplex = 0` (the default) encodes there is no rounding which may return slightly infeasible integer solutions. `trace` is a binary specific to `cplex` and `gurobi`. `trace = 1` turns the optimizer output on. The default is `trace = 0`.

## Value

A list containing the optimal solution, with the following objects:

 `obj_total` value of the objective function at the optimum; `obj_dist_mat` value of the total sum of distances term of the objective function at the optimum; `t_id` indexes of the matched controls at the optimum; `c_id` indexes of the matched controls at the optimum; `group_id` matched pairs or groups at the optimum; `time` time elapsed to find the optimal solution.

## Author(s)

Jose R. Zubizarreta <zubizarreta@hcp.med.harvard.edu>, Cinar Kilcioglu <ckilcioglu16@gsb.columbia.edu>.

## References

Keele, L., Titiunik, R., and Zubizarreta, J. R., (2015), "Enhancing a Geographic Regression Discontinuity Design Through Matching to Estimate the Effect of Ballot Initiatives on Voter Turnout," Journal of the Royal Statistical Society: Series A, 178, 223-239.

Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.

Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observa- tional studies," Journal of Computational and Graphical Statistics, 21, 57-71.

Yang, D., Small, D., Silber, J. H., and Rosenbaum, P. R. (2012), "Optimal Matching With Minimal Deviation From Fine Balance in a Study of Obesity and Surgical Outcomes," Biometrics, 68, 628-36.

Yang. F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.

Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.

Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.

Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), "Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit High Schools in Chile," Annals of Applied Statistics, 8, 204-231.

sensitivitymv, sensitivitymw.

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191``` ``` # Load, sort, and attach data data(lalonde) lalonde = lalonde[order(lalonde\$treatment, decreasing = TRUE), ] attach(lalonde) ################################# # Example 1: cardinality matching ################################# # Cardinality matching finds the largest matched sample of pairs that meets balance # requirements. Here the balance requirements are mean balance, fine balance and # exact matching for different covariates. The solver used is glpk with the # approximate option. # Treatment indicator; note that the data needs to be sorted in decreasing order # according to this treatment indicator t_ind = treatment t_ind # Distance matrix dist_mat = NULL # Subset matching weight subset_weight = 1 # Moment balance: constrain differences in means to be at most .05 standard deviations apart mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) mom_tols = round(absstddif(mom_covs, t_ind, .05), 2) mom = list(covs = mom_covs, tols = mom_tols) # Fine balance fine_covs = cbind(black, hispanic, married, nodegree) fine = list(covs = fine_covs) # Exact matching exact_covs = cbind(black) exact = list(covs = exact_covs) # Solver options t_max = 60*5 solver = "glpk" approximate = 1 solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) # Match out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, mom = mom, fine = fine, exact = exact, solver = solver) # Indices of the treated units and matched controls t_id = out\$t_id c_id = out\$c_id # Time out\$time/60 # Matched group identifier (who is matched to whom) out\$group_id # Assess mean balance meantab(mom_covs, t_ind, t_id, c_id) # Assess fine balance (note here we are getting an approximate solution) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id, c_id)) } # Assess exact matching balance table(exact_covs[t_id]==exact_covs[c_id]) ## Uncomment the following examples ################################## ## Example 2: minimum distance matching ################################## ## The goal here is to minimize the total of distances between matched pairs. In ## this example there are no covariate balance requirements. Again, the solver ## used is glpk with the approximate option ## Treatment indicator #t_ind = treatment ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat = distmat(t_ind, X_mat) ## Subset matching weight #subset_weight = NULL ## Total pairs to be matched #total_groups = sum(t_ind) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace_cplex = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, total_groups = total_groups, #solver = solver) ## Indices of the treated units and matched controls #t_id = out\$t_id #c_id = out\$c_id ## Total of distances between matched pairs #out\$obj_total ## Assess mean balance #meantab(X_mat, t_ind, t_id, c_id) ################################## ## Example 3: optimal subset matching ################################## ## Optimal subset matching pursues two competing goals at ## the same time: to minimize the total sum of covariate distances ## while matching as many observations as possible. The trade-off ## between these two goals is regulated by the parameter subset_weight ## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion). ## Here the balance requirements are mean balance, near-fine balance ## and near-exact matching for different covariates. ## Again, the solver used is glpk with the approximate option. ## Treatment indicator #t_ind = treatment ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat = distmat(t_ind, X_mat) ## Subset matching weight #subset_weight = median(dist_mat) ## Moment balance: constrain differences in means to be at most .05 standard deviations apart #mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) #mom_tols = round(absstddif(mom_covs, t_ind, .05), 2) #mom = list(covs = mom_covs, tols = mom_tols) ## Near-fine balance #near_fine_covs = cbind(married, nodegree) #near_fine_devs = rep(5, 2) #near_fine = list(covs = near_fine_covs, devs = near_fine_devs) ## Near-exact matching #near_exact_covs = cbind(black, hispanic) #near_exact_devs = rep(5, 2) #near_exact = list(covs = near_exact_covs, devs = near_exact_devs) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace_cplex = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, #mom = mom, near_fine = near_fine, near_exact = near_exact, solver = solver) ## Indices of the treated units and matched controls #t_id = out\$t_id #c_id = out\$c_id ## Time #out\$time/60 ## Matched group identifier (who is matched to whom) #out\$group_id ## Assess mean balance (note here we are getting an approximate solution) #meantab(X_mat, t_ind, t_id, c_id) ## Assess fine balance #for (i in 1:ncol(near_fine_covs)) { # print(finetab(near_fine_covs[, i], t_id, c_id)) #} ## Assess exact matching balance #for (i in 1:ncol(near_exact_covs)) { # print(table(near_exact_covs[t_id, i]==near_exact_covs[c_id, i])) #} ```

### Example output

```Loading required package: lattice
Using the GLPK callable library version 4.65
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Building the matching problem...
GLPK optimizer is open...
Finding the optimal matches...
Optimal matches found
elapsed
0.05715
[1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
[19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
[37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
[55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
[73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
[91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116   1   2   3   4   5   6   7   8   9  10
[127]  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28
[145]  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46
[163]  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64
[181]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
[199]  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
[217] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
Mis Min      Max  Mean T  Mean C Std Dif P-val
age         0  16    55.00   26.51   26.37    0.02  0.91
education   0   1    18.00   10.41   10.36    0.02  0.90
black       0   0     1.00    0.75    0.75    0.00  1.00
hispanic    0   0     1.00    0.09    0.09    0.00  1.00
married     0   0     1.00    0.28    0.28    0.00  1.00
nodegree    0   0     1.00    0.62    0.62    0.00  1.00
re74        0   0 35040.07 3057.10 3294.02   -0.04  0.74
re75        0   0 25142.24 1943.60 2039.24   -0.03  0.83
Units
Cat  T  C
0 29 29
1 87 87
Units
Cat   T   C
0 105 105
1  11  11
Units
Cat  T  C
0 84 84
1 32 32
Units
Cat  T  C
0 44 44
1 72 72

TRUE
116
```

designmatch documentation built on May 1, 2019, 7:12 p.m.