In markmfredrickson/optmatch: Functions for Optimal Matching

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(optmatch)

The goal of this document is provide some basic performance testing of the optmatch package. In most matching workflows, a user first creates distances, produces one or more matches, assesses balance on a given match, and, provided balance meets necessary requirements, proceeds to analysis. Of these tasks, optmach is responsible for distance creation and creating the match. These two tasks will be considered separately. The remainder of this document lays out the performance testing strategy and then implements it for the distance creation and matching routines in optmatch.

Performance Testing Strategy

R provides built in execution profiling through the Rprof function (and a similarly named command line option). When invoked, this function regularly interrupts normal processing and writes out the current call stack to a file. This information can be used to attribute the portion of run time attributable different functions.

# library(profr)
library(profvis)
# perfcex <- 0.5 # makes text more readable in plots
# minlabel <- 0.05 # add labels for smaller segments of time

For interpretation of this data, we rely on the profr package, which provides some graphical summaries to help make raw profile data more manageable.

Since we suspect that creation of large data sets may account for a sizeable portion of the runtime of large problems, we also profile memory usage with simulated data described in its own section.

Simulated Data

set.seed(20130125) # so that the random values we generate will be consistent
# N <- 1000
N <- 1000
X <- data.frame(X1 = rnorm(N), 
                X2 = rnorm(N, mean = runif(N, -5, 5)), 
                X3 = as.factor(sample(letters[1:5], N, replace = T)))

mm <- model.matrix(I(rep(1, N)) ~  X1 + X2 + X1:X3, data = X)
coefs <- runif(dim(mm)[2], -2, 2)
logits <- as.vector(coefs %*% t(mm)) 
DATA <- data.frame(Z = rbinom(N, size = 1, prob = plogis(logits)), X)
model <- glm(Z ~ X1 + X2 + X1:X3, data = DATA)
predicted <- predict(model)

Before proceeding to the actual profiling, we begin by creating some simulated data.^[See the file setup.R for the supporting code.] We use r N individuals, with r sum(DATA[, "Z"]) receiving the treatment condition. Treatment assignment is based on a simple linear model of 3 covariates (two Normal variables and one categorical variable with 5 levels).

For most of the problems, we create a model of treatment assignment and then pull out the linear predictors from that model. Figure 1 shows the distributions of the predicted probabilities for the treated and control groups (henceforth the propensity score).

library(lattice)
densityplot(~ predicted, groups = DATA$Z, xlab = "Linear predictors")

Distance Creation

We begin by benchmarking dense distance creation.^[See the file distance.R for implementation details.] The match_on.numeric method has the least complicated interface and applies the least pre-processing to the input. Figure 2 shows the profile data for using match_on.numeric with the propensity score.

result.sparse.caliper <- match_on(x=predicted, z=DATA$Z, caliper=1)
result.sparse.within <- match_on(x=predicted, z=DATA$Z,
    within=exactMatch(Z ~ I(X3 == "a" | X3 == "b"), data = DATA))

profvis(result.dense <- match_on(x=predicted, z=DATA$Z))

There are two ways to create sparse problems. In the first, we use the caliper argument (for the numeric and glm methods of match_on). This argument looks over all the treated and control values and computes which treated and control units should be compared, and then computes the exact distances between them. Applying a caliper width of 1 to the simulated data, leads to a sparse matrix with r round(100 * length(result.sparse.caliper) / (N*N), 1)% finite entries. Figure 3 shows the profiling data for this process.

The alternative to the caliper argument is the within argument. This method is more general, applying to all match_on methods, but can sometimes require generating a dense matrix first (though not always --- the exactMatch function doesn't require a dense matrix). To use the within argument, we use exactMatch create two subproblems based on the categorical covariate. r round(100 * length(result.sparse.within) / (N*N), 1)% finite entries. Figure 4 shows the profiling results when using the within argument.

profvis(result.sparse.caliper <- match_on(x=predicted, z=DATA$Z, caliper=1))

# benchmark.sparse.caliper <- profr(result.sparse.caliper <- match_on(x =
# predicted, z = DATA$Z, caliper = 1))

# par(cex = perfcex)
# plot(benchmark.sparse.caliper, minlabel = minlabel)

profvis(result.sparse.within <- match_on(x=predicted, z=DATA$Z,
    within=exactMatch(Z ~ I(X3 == "a" | X3 == "b"), data = DATA)))

# benchmark.sparse.within <- profr(result.sparse.within <- match_on(x =
# predicted, z = DATA$Z, within = exactMatch(Z ~ I(X3 == "a" | X3 == "b"), data = DATA)))

# par(cex = perfcex)
# plot(benchmark.sparse.within, minlabel = minlabel)

match_on Methods

These additional methods add some pre-processing to the distance creation. These methods can work on sparse problems, but to keep things simple, these examples just create dense matrices. The glm method is a relatively small wrapper around the numeric method used in the previous examples. Figure 5 shows the profiling data for glm method, which we would expect to look very similar to Figure 2, the dense matrix problem from the previous section.

profvis(result.glm <- match_on(x = glm(Z ~ X1 + X2 + X3, data=DATA)))

# benchmark.glm <- profr(result.glm <- match_on(x =
#   glm(Z ~ X1 + X2 + X3, data = DATA)))

# par(cex = perfcex)
# plot(benchmark.glm, minlabel = minlabel)

Figure 6 shows the profiling data for using the formula method. This method, by default, creates a squared Mahalanobis distance between treated and control pairs (Euclidean distances scaled by the variance-covariance matrix). Figure 6 shows the profiling data for using the formula method. This plot is expected to be very different than the previous as all the previous examples were based a simple absolute difference of a 1-D vector. In this task, however, we have to compute a variance-covariance matrix and then produce a series of multiplications to compute the squared distances. There may be opportunities to improve both components of the distance creation. Additionally, they may not scale in the same way, with one dominating for small problems and the other for large. This plot does not provide any information the scaling nature of the function.

profvis(result.formula <- match_on(x = Z ~ X1 + X2 + X3, data=DATA))

# benchmark.formula <- profr(result.formula <- match_on(x =
#  Z ~ X1 + X2 + X3, data = DATA))

# par(cex = perfcex)
# plot(benchmark.formula, minlabel = minlabel)

Matching

Now that distance creation code has been benchmarked, we now consider the matching process itself. We reuse the earlier distance objects. To keep things simple, most of these matches will be computed using fullmatch at default settings. The process of matching, as currently implemented in optmatch can be broken down into the following steps:

\begin{enumerate} \item The specification for the match is check for basic sanity. \item If there are clear subproblems (such as those created using exactMatch), the problem is broken into multiple parts, and each of the next steps is run for each component. \item From the matrix representation of the distances (which is potentially sparse), a new matrix is formed with a row for each valid comparison. The data in each row is the treated unit, the control unit, and the distance (an adjacency matrix). \item This representation is passed to the Fortran solver, along with the fixed elements of the matching process. \item The results of the solver and turned into an optmatch object, a factor where the names corresponds to the names of the units and the levels to the matched groups. This object also stores additional information about the match as attributes. (If the match was split into multiple components, it is combined here.) \end{enumerate}

# increase N so that routines take enought time to profile properly
N <- 3000
X <- data.frame(X1 = rnorm(N), 
                X2 = rnorm(N, mean = runif(N, -5, 5)), 
                X3 = as.factor(sample(letters[1:5], N, replace = T)))

mm <- model.matrix(I(rep(1, N)) ~  X1 + X2 + X1:X3, data = X)
coefs <- runif(dim(mm)[2], -2, 2)
logits <- as.vector(coefs %*% t(mm)) 
DATA <- data.frame(Z = rbinom(N, size = 1, prob = plogis(logits)), X)
model <- glm(Z ~ X1 + X2 + X1:X3, data = DATA)
predicted <- predict(model)

We should expect to see these phases in each of the next figures. Figure 7 shows the matching process applied to the dense distance matrix from the previous section. Figure 8 shows a profiling information for the caliper argument based sparse distance matrix. Figure 9 shows the profiling data for the stratified, sparse problem (which can be split up into separate calls to the solver)

result.dense <- match_on(x = predicted, z = DATA$Z)
profvis(result.matching.dense <- fullmatch(result.dense))

# benchmark.matching.dense <- profr(result.matching.dense <- fullmatch(result.dense))
# par(cex = perfcex)
# plot(benchmark.matching.dense, minlabel = minlabel)

profvis(result.matching.sparse <- fullmatch(result.sparse.caliper))

# benchmark.matching.sparse <- profr(result.matching.sparse <- fullmatch(result.sparse.caliper))
# par(cex = perfcex)
# plot(benchmark.matching.sparse, minlabel = minlabel)

profvis(result.matching.sparse.strat <- fullmatch(result.sparse.within))

# benchmark.matching.sparse.strat <- profr(result.matching.sparse.strat <- fullmatch(result.sparse.within))
# par(cex = perfcex)
# plot(benchmark.matching.sparse.strat, minlabel = minlabel)

The previous tests used fullmatch with the default arguments. To test out the use of the various constraint arguments, we call pairmatch on the dense distance problem. In addition to fixing the minimum and maximum number of controls per matched set to 1, the pairmatch function inspects the distance object to set appropriate values for omit.fraction, the argument that allows a portion of the control group to be discarded. Figure 10 shows these profiling data.

profvis(result.matching.pairmatch <- pairmatch(result.dense))

# benchmark.matching.pairmatch <- profr(result.matching.pairmatch <- pairmatch(result.dense))
# par(cex = perfcex)
# plot(benchmark.matching.pairmatch, minlabel = minlabel)

mdist and match_on

In version 0.7 of optmatch and earlier, the primary method of creating distances was to use the mdist function. In version 0.8, match_on was added as a more comprehensive tool, specifically one that allowed for arbitrary sparseness of distances matrices.

# increase N so that routines take enought time to profile properly
N <- 10000
X <- data.frame(X1 = rnorm(N),
                X2 = rnorm(N, mean = runif(N, -5, 5)),
                X3 = as.factor(sample(letters[1:5], N, replace = T)))

mm <- model.matrix(I(rep(1, N)) ~  X1 + X2 + X1:X3, data = X)
coefs <- runif(dim(mm)[2], -2, 2)
logits <- as.vector(coefs %*% t(mm))
DATA <- data.frame(Z = rbinom(N, size = 1, prob = plogis(logits)), X)
model <- glm(Z ~ X1 + X2 + X1:X3, data = DATA)

profvis(result.mdist <- mdist(model, structure.fmla = ~ X3), interval = 0.001)

Appendix {-}

Scaling

Ks <- c(10, 20, 50, 100, 200, 300, 350, 400, 425, 450, 475, 500)
propensity <- rnorm(2 * max(Ks))
names(propensity) <- 1:(2 * max(Ks))

times <- lapply(Ks, function(k) {
  data <- propensity[1:(2 * k)]
  z <- rep(c(0,1), k)
  mdt <- system.time(mdist(z ~ data), gcFirst = TRUE) 
  mot <- system.time(match_on(x = data, z = z), gcFirst = TRUE)

  # for both runs, return the sum of the user and system time
  return(c(mdt[1] + mdt[2], mot[1] + mot[2]))
})

times <- do.call(rbind, times)
colnames(times) <- c("mdist", "match_on")
times <- data.frame(k = Ks, times)

This section considers how the two functions scale up as the problem size gets bigger. We create problems sets of size N = 2K with K treatment and K control units. Figure 12 shows the run time of the two functions as K is increased. The y axis is logged showing that both methods grow roughly quadratically, as we would expect from the nature of the algorithms. Of course, the absolute run time of match_on is orders of magnitude worse.

# plot(match_on ~ k, data = times, type = 'l', col = "red", log = "y")
# lines(mdist ~ k, data = times, col = "blue", log = "y")
plot(match_on ~ k, data = times, type = 'l', col = "red", log = "y")
lines(mdist ~ k, data = times, col = "blue")