knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(optmatch)
The goal of this document is provide some basic performance testing of the optmatch package. In most matching workflows, a user first creates distances, produces one or more matches, assesses balance on a given match, and, provided balance meets necessary requirements, proceeds to analysis. Of these tasks, optmach is responsible for distance creation and creating the match. These two tasks will be considered separately. The remainder of this document lays out the performance testing strategy and then implements it for the distance creation and matching routines in optmatch.
R provides built in execution profiling through the Rprof function (and a similarly named command line option). When invoked, this function regularly interrupts normal processing and writes out the current call stack to a file. This information can be used to attribute the portion of run time attributable different functions.
# library(profr) library(profvis) # perfcex <- 0.5 # makes text more readable in plots # minlabel <- 0.05 # add labels for smaller segments of time
For interpretation of this data, we rely on the profr package, which provides some graphical summaries to help make raw profile data more manageable.
Since we suspect that creation of large data sets may account for a sizeable portion of the runtime of large problems, we also profile memory usage with simulated data described in its own section.
set.seed(20130125) # so that the random values we generate will be consistent # N <- 1000 N <- 1000 X <- data.frame(X1 = rnorm(N), X2 = rnorm(N, mean = runif(N, -5, 5)), X3 = as.factor(sample(letters[1:5], N, replace = T))) mm <- model.matrix(I(rep(1, N)) ~ X1 + X2 + X1:X3, data = X) coefs <- runif(dim(mm)[2], -2, 2) logits <- as.vector(coefs %*% t(mm)) DATA <- data.frame(Z = rbinom(N, size = 1, prob = plogis(logits)), X) model <- glm(Z ~ X1 + X2 + X1:X3, data = DATA) predicted <- predict(model)
Before proceeding to the actual profiling, we begin by creating some simulated
data.^[See the file setup.R for the supporting code.]
We use r N
individuals, with r sum(DATA[, "Z"])
receiving the
treatment condition.
Treatment assignment is based on a
simple linear model of 3 covariates (two Normal variables and one categorical
variable with 5 levels).
For most of the problems, we create a model of treatment assignment and then pull out the linear predictors from that model. Figure 1 shows the distributions of the predicted probabilities for the treated and control groups (henceforth the propensity score).
library(lattice) densityplot(~ predicted, groups = DATA$Z, xlab = "Linear predictors")
We begin by benchmarking dense distance creation.^[See the file distance.R for implementation details.] The match_on.numeric method has the least complicated interface and applies the least pre-processing to the input. Figure 2 shows the profile data for using match_on.numeric with the propensity score.
result.sparse.caliper <- match_on(x=predicted, z=DATA$Z, caliper=1) result.sparse.within <- match_on(x=predicted, z=DATA$Z, within=exactMatch(Z ~ I(X3 == "a" | X3 == "b"), data = DATA))
profvis(result.dense <- match_on(x=predicted, z=DATA$Z))
There are two ways to create sparse problems. In the first, we use the
caliper argument (for the numeric and glm methods
of match_on). This argument looks over all the treated and control
values and computes which treated and control units should be compared, and
then computes the exact distances between them. Applying a caliper width of 1
to the simulated data, leads to a sparse matrix with
r round(100 * length(result.sparse.caliper) / (N*N), 1)
% finite
entries. Figure 3 shows the profiling data for this
process.
The alternative to the caliper argument is the within
argument.
This method is more general, applying to all match_on methods, but can sometimes
require generating a dense matrix first (though not always --- the
exactMatch function doesn't require a dense matrix). To use the
within argument, we use exactMatch create two subproblems
based on the categorical covariate.
r round(100 * length(result.sparse.within) / (N*N), 1)
% finite
entries. Figure 4 shows the profiling results when using
the within argument.
profvis(result.sparse.caliper <- match_on(x=predicted, z=DATA$Z, caliper=1)) # benchmark.sparse.caliper <- profr(result.sparse.caliper <- match_on(x = # predicted, z = DATA$Z, caliper = 1)) # par(cex = perfcex) # plot(benchmark.sparse.caliper, minlabel = minlabel)
profvis(result.sparse.within <- match_on(x=predicted, z=DATA$Z, within=exactMatch(Z ~ I(X3 == "a" | X3 == "b"), data = DATA))) # benchmark.sparse.within <- profr(result.sparse.within <- match_on(x = # predicted, z = DATA$Z, within = exactMatch(Z ~ I(X3 == "a" | X3 == "b"), data = DATA))) # par(cex = perfcex) # plot(benchmark.sparse.within, minlabel = minlabel)
These additional methods add some pre-processing to the distance creation. These methods can work on sparse problems, but to keep things simple, these examples just create dense matrices. The glm method is a relatively small wrapper around the numeric method used in the previous examples. Figure 5 shows the profiling data for glm method, which we would expect to look very similar to Figure 2, the dense matrix problem from the previous section.
profvis(result.glm <- match_on(x = glm(Z ~ X1 + X2 + X3, data=DATA))) # benchmark.glm <- profr(result.glm <- match_on(x = # glm(Z ~ X1 + X2 + X3, data = DATA))) # par(cex = perfcex) # plot(benchmark.glm, minlabel = minlabel)
Figure 6 shows the profiling data for using the formula method. This method, by default, creates a squared Mahalanobis distance between treated and control pairs (Euclidean distances scaled by the variance-covariance matrix). Figure 6 shows the profiling data for using the formula method. This plot is expected to be very different than the previous as all the previous examples were based a simple absolute difference of a 1-D vector. In this task, however, we have to compute a variance-covariance matrix and then produce a series of multiplications to compute the squared distances. There may be opportunities to improve both components of the distance creation. Additionally, they may not scale in the same way, with one dominating for small problems and the other for large. This plot does not provide any information the scaling nature of the function.
profvis(result.formula <- match_on(x = Z ~ X1 + X2 + X3, data=DATA)) # benchmark.formula <- profr(result.formula <- match_on(x = # Z ~ X1 + X2 + X3, data = DATA)) # par(cex = perfcex) # plot(benchmark.formula, minlabel = minlabel)
Now that distance creation code has been benchmarked, we now consider the matching process itself. We reuse the earlier distance objects. To keep things simple, most of these matches will be computed using fullmatch at default settings. The process of matching, as currently implemented in optmatch can be broken down into the following steps:
\begin{enumerate} \item The specification for the match is check for basic sanity. \item If there are clear subproblems (such as those created using exactMatch), the problem is broken into multiple parts, and each of the next steps is run for each component. \item From the matrix representation of the distances (which is potentially sparse), a new matrix is formed with a row for each valid comparison. The data in each row is the treated unit, the control unit, and the distance (an adjacency matrix). \item This representation is passed to the Fortran solver, along with the fixed elements of the matching process. \item The results of the solver and turned into an optmatch object, a factor where the names corresponds to the names of the units and the levels to the matched groups. This object also stores additional information about the match as attributes. (If the match was split into multiple components, it is combined here.) \end{enumerate}
# increase N so that routines take enought time to profile properly N <- 3000 X <- data.frame(X1 = rnorm(N), X2 = rnorm(N, mean = runif(N, -5, 5)), X3 = as.factor(sample(letters[1:5], N, replace = T))) mm <- model.matrix(I(rep(1, N)) ~ X1 + X2 + X1:X3, data = X) coefs <- runif(dim(mm)[2], -2, 2) logits <- as.vector(coefs %*% t(mm)) DATA <- data.frame(Z = rbinom(N, size = 1, prob = plogis(logits)), X) model <- glm(Z ~ X1 + X2 + X1:X3, data = DATA) predicted <- predict(model)
We should expect to see these phases in each of the next figures. Figure 7 shows the matching process applied to the dense distance matrix from the previous section. Figure 8 shows a profiling information for the caliper argument based sparse distance matrix. Figure 9 shows the profiling data for the stratified, sparse problem (which can be split up into separate calls to the solver)
result.dense <- match_on(x = predicted, z = DATA$Z) profvis(result.matching.dense <- fullmatch(result.dense)) # benchmark.matching.dense <- profr(result.matching.dense <- fullmatch(result.dense)) # par(cex = perfcex) # plot(benchmark.matching.dense, minlabel = minlabel)
profvis(result.matching.sparse <- fullmatch(result.sparse.caliper)) # benchmark.matching.sparse <- profr(result.matching.sparse <- fullmatch(result.sparse.caliper)) # par(cex = perfcex) # plot(benchmark.matching.sparse, minlabel = minlabel)
profvis(result.matching.sparse.strat <- fullmatch(result.sparse.within)) # benchmark.matching.sparse.strat <- profr(result.matching.sparse.strat <- fullmatch(result.sparse.within)) # par(cex = perfcex) # plot(benchmark.matching.sparse.strat, minlabel = minlabel)
The previous tests used fullmatch with the default arguments. To test out the use of the various constraint arguments, we call pairmatch on the dense distance problem. In addition to fixing the minimum and maximum number of controls per matched set to 1, the pairmatch function inspects the distance object to set appropriate values for omit.fraction, the argument that allows a portion of the control group to be discarded. Figure 10 shows these profiling data.
profvis(result.matching.pairmatch <- pairmatch(result.dense)) # benchmark.matching.pairmatch <- profr(result.matching.pairmatch <- pairmatch(result.dense)) # par(cex = perfcex) # plot(benchmark.matching.pairmatch, minlabel = minlabel)
In version 0.7 of optmatch and earlier, the primary method of creating distances was to use the mdist function. In version 0.8, match_on was added as a more comprehensive tool, specifically one that allowed for arbitrary sparseness of distances matrices.
# increase N so that routines take enought time to profile properly N <- 10000 X <- data.frame(X1 = rnorm(N), X2 = rnorm(N, mean = runif(N, -5, 5)), X3 = as.factor(sample(letters[1:5], N, replace = T))) mm <- model.matrix(I(rep(1, N)) ~ X1 + X2 + X1:X3, data = X) coefs <- runif(dim(mm)[2], -2, 2) logits <- as.vector(coefs %*% t(mm)) DATA <- data.frame(Z = rbinom(N, size = 1, prob = plogis(logits)), X) model <- glm(Z ~ X1 + X2 + X1:X3, data = DATA)
profvis(result.mdist <- mdist(model, structure.fmla = ~ X3), interval = 0.001)
Ks <- c(10, 20, 50, 100, 200, 300, 350, 400, 425, 450, 475, 500) propensity <- rnorm(2 * max(Ks)) names(propensity) <- 1:(2 * max(Ks)) times <- lapply(Ks, function(k) { data <- propensity[1:(2 * k)] z <- rep(c(0,1), k) mdt <- system.time(mdist(z ~ data), gcFirst = TRUE) mot <- system.time(match_on(x = data, z = z), gcFirst = TRUE) # for both runs, return the sum of the user and system time return(c(mdt[1] + mdt[2], mot[1] + mot[2])) }) times <- do.call(rbind, times) colnames(times) <- c("mdist", "match_on") times <- data.frame(k = Ks, times)
This section considers how the two functions scale up as the problem size gets
bigger. We create problems sets of size N = 2K
with K
treatment and K
control units. Figure 12 shows the run time of the two functions as K
is
increased. The y
axis is logged showing that both methods grow roughly
quadratically, as we would expect from the nature of the algorithms. Of
course, the absolute run time of match_on is orders of magnitude
worse.
# plot(match_on ~ k, data = times, type = 'l', col = "red", log = "y") # lines(mdist ~ k, data = times, col = "blue", log = "y") plot(match_on ~ k, data = times, type = 'l', col = "red", log = "y") lines(mdist ~ k, data = times, col = "blue")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.