knitr::opts_chunk$set(warning = TRUE, message = TRUE, fig.align = "center", fig.height = 5, fig.width = 6)
Summary: In a block-randomized controlled trial, the individuals in a study are divided based on important characteristics (for example age group, sex, or smoking status). By stratifying the data in this way prior to randomization, the researchers can reduce the heterogeneity of their treatment and control groups, thereby reducing the variance of their estimate of the effect of their treatment. The stratamatch
package is designed to extend this approach to the observational setting by providing functions to allow users to easily divide an observational data set into strata of observations with similar baseline characteristics. The treated and control individuals within each stratum of the data can then by matched (for example, based on propensity score), in order to recapitulate the block-randomized trial from observational data. In order to select an effective stratification scheme for the data, stratamatch
applies a pilot design approach to estimate a quantity called the prognostic score, and divides the data into blocks of individuals with similar scores. The potential benefits of such an approach are twofold. First, stratifying the data effectively allows for more computationally efficient matching of large data sets. Second, early studies of the prognostic score show that using the prognostic score to inform the matching process may help reduce the variance in the estimated treatment effect and increase the robustness of an observational study to bias from unmeasured confounding factors.
This document contains the following sections
After briefly introducing the goals of stratamatch and the approaches it implements on a statistical level, this document will walk through an example of how stratamatch might be used on a toy data set. We will show how to stratify a data set, how to diagnose the quality of the stratification, and how to match a data set within strata.
In a fully-randomized controlled experiment, treatment assignments are made independently of each individual's baseline covariates, allowing for unbiased estimation of the treatment effect. In observational studies, researchers may attempt to account for differences in their subject's baseline covariates by matching treated and control subjects with similar characteristics. In particular, the popular approach of propenisty score matching pairs individuals who had similar estimated probabilities of recieving the treatment based on their baseline characteristics. By balancing the treated and control groups in this way, the propensity score matching approach attempts to coerce the data set into a form that resembles a randomized controlled trial. Importantly, however, propensity score matching can only address bias to due measured baseline covariates. Imbalances in unmeasured baseline covariates may still bias the effect estimate after propensity score matching. For this reason, researchers often carry out a sensitivity analysis to address concerns about unmeasured confounding.
However, the fully-randomized experiment is not the only experimental study design. In a block-randomized controlled experiment, subjects are stratified based on important covariates (e.g. sex, age group, smoking status) before randomization into treatment groups. This helps reduce the heterogeneity between treatment and control groups. In the experimental context, reducing the heterogeneity between compared groups helps to reduce the variance of the effect estimator. In observational settings however, this reduction in heterogeneity has the added benefit of reducing the sensitivity of the study results to unobserved confounding (see @rosenbaum2005heterogeneity, for a great discussion on this point). Moreover, if the sample size of an observational study is very large, stratifying the sample and matching separately within the strata in this way may be much faster than matching the entire sample at once. Stratamatch helps to facilitate this process by supplying tools stratify an observational dataset and match within each stratum, thereby emulating an observational version of the block-randomized controlled trial.
Once we have decided to stratify our data set, the question becomes: how should strata be determined? One option is to select the covariates on which to stratify by hand based on expert knowledge. Another option is to use knowledge from previous experiments to select the best substrata. In the experimental setting, this may be done with a pilot study. Using this approach, researchers set aside some of their resources before running the main experiment for the purpose of running a smaller, 'pilot' experiment. By examining the outcomes of the pilot study, they can gather information which they can use to inform the design of the main experiment. Importantly, after the pilot study is run, the individuals in the pilot experiment are not reused in the main experiment, so the observations of the outcomes from this earlier analysis are not allowed to bias the results of the main study.
@aikens2020pilot extend the idea of the pilot study to the observational setting. Using an observational pilot design, the researchers may set aside a 'pilot set' of their data. Outcome information on this pilot set can be observed, and the information gained can be used to inform the study design. Subsequently, in order to preserve the separation of the study design from the study analysis, the observations from the pilot set are omitted from the main analysis.
Stratamatch uses a pilot design to estimate a quantity called the prognostic score (@hansen2008prognostic), defined here as an individual's expected outcome under the control assignment, based on their baseline covariates. Balancing observational data sets based on the prognostic score can the reduce heterogeneity between matched individuals, decreasing variance and diminishing the sensitivity of the study results to unobserved confounding (See @aikens2020pilot, @antonelli2018doubly, and @leacy2014joint). Moreover, since the prognostic score is often continuous, strata can be easily determined using prognostic score quantiles to select evenly sized bins for the data. This circumvents common problems with stratification based on expert knowledge, since that process often generates strata which are too large, too small, or too poorly balanced between treatment and control observations.
The stratamatch function, auto_stratify
, carries out the prognostic score stratification pilot design described above. Although there are many additional options available when running this function, the most basic procedure does the following:
Partition the data set into a pilot data set and an analysis data set
Fit a model for the prognostic score from the observations in the pilot set
Estimate prognostic scores for the analysis set using the prognostic model
Stratify the analysis set based on prognostic score quantiles.
Once the data set has been stratified, stratamatch
suggests several diagnostic plots and tables that can be used to assess the size and balance of the strata produced. If the strata are satisfactory, the treatment and control individuals can then be matched. At this point, the reader may choose to match the data set within each stratum using the strata_match
function, or they may select and implement their own matching scheme.
We demonstrate the functionality of stratamatch with a toy example. The stratamatch
package contains its own function, make_sample_data
, for just this purpose.
library(stratamatch) set.seed(125) # make sample data set of 5000 observations dat <- make_sample_data(n = 5000) # print the first few rows of the sample_data head(dat)
Let's assume that the data in dat
are an observational data set, and we would like to estimate the effect of some binary treatment on a binary outcome. Suppose the column treat
in this data gives the treatment assigment (where a 0
means untreated and a 1
means treated), and the column outcome
gives information on who had the outcome of interest and who didn't (similarly, let's call 0
the negative outcome and 1
the positive outcome). However, since this is an observational data set, we didn't get to choose who got the treatment and who didn't. Instead, individuals self-selected into treatment or control groups, perhaps in some way which is influenced by their background characteristics. Additionally, an individual's probability of having the positive outcome may be influenced by the treatment or by other baseline factors, but we don't know which. In addition to treatment assignments and outcomes, we have measured five baseline covariates for each individual: X1
, X2
, B1
, B2
, and C1
. X1
and X2
are continuous, B1
and B2
are binary, and C1
is a categorical variable which takes on possible values "a"
, "b"
, and "c"
.
We should also note that dat
is a fairly large data set (5000 observations), but only about 1/5 of the individuals in the data set recieved the treatment.
Suppose we wanted to stratify the data manually based on covariates that our experts have selected. Importantly, we can only stratify the data based on categorical or binary variables. This means that we cannot use X1
or X2
directly (this would cause manual_stratify
to throw an error). Below, I stratify the data set based on C1
, B1
, and B2
.
# manually stratify dat based on categorial and binary variables m.strat <- manual_stratify(data = dat, treat ~ B1 + B2 + C1) # try printing the result m.strat summary(m.strat)
Above, we see that manual_stratify
returns a manual strata
object. The printed summary above shows that manual_stratify
has divided our dataset according to our instructions and produced 12 strata.
We can extract important information from m.strat
with $
.
The most important piece of m.strat
is the analysis set. This is the stratified data which we can later match and use for effect estimation.
# show the first few rows of the stratified data set head(m.strat$analysis_set)
Since we didn't use a pilot design for this example, our analysis set just has all of the same data in dat
that we put in to manual_stratify
. The only difference is that now there is an additional column, stratum
, which says which stratum each individual has been sorted into.
In contrast to manual_stratify
, the auto_stratify
function automatically creates the stratified data set based on estimated prognostic scores. In this process, we specify what model we want to use for the prognostic score (the prognosis
argument) and what percent of the control reserve we want to use to do the fitting (the pilot_fraction
) argument. We also need to tell auto_stratify
which column of our data set designates the treatment assignment (treat
). The final argument, size,
specifies about how large we would like our strata to be. I pick size = 400
here so that we get roughly the same number of strata as when we manually stratified our data set.
In practice, pilot_fraction
and size
have defaults, so we don't need to specify them every time. Instead of pilot_fraction
, we could specify a pilot_size
to get a pilot set of approximately that number of observations. Run help(auto_stratify)
for more information.
a.strat <- auto_stratify(dat, treat = "treat", prognosis = outcome ~ X1 + X2, pilot_fraction = 0.1, size = 400) # print and summarize the result from running auto_stratify a.strat summary(a.strat)
Our call to auto_stratify
has created auto_strata
object. An auto_strata
object is much like a manual_strata
object, except that it contains somewhat more information, since the stratification process was done differently. Importantly, auto_strata
has partitioned 10% of the individuals in the control group to be in the pilot set. These individuals were used to build the prognostic score, and were extracted from the analysis set. In order to prevent overfitting, the observations in the pilot set should not be included in later analyses, and should not be reused when making the final effect estimate.
We can access the pilot set and the analysis set using $
with a.strat
. The command a.strat$analysis_set
gives the analysis set and a.strat$pilot_set
gives the pilot set. We can also view the prognostic score estimates for the individuals in the analysis set with a.strat$prognostic_scores
.
Suppose you're interested in partitioning your data into a pilot set and an analysis set, but you'd like to do other things with it after that - for example perhaps you'd like to fit a more complicated model for prognostic score than a logistic regression. The split_pilot_set
will split your pilot and analysis set for you based on specified preferences. For more information, run help(split_pilot_set)
.
Above, we showed the default method used by auto_stratify
. By providing other arguments, it is possible to construct the strata using various other methods. For example, one could fit the prognostic scores for the analysis set separately and provide them as an argument to auto_stratify
. Alternatively, the researcher could specify their own pilot set that they would like to use to build the prognostic model. Run help(auto_stratify)
to see all of the possible inputs to this function.
Both manual_stratify
and auto_stratify
objects have a strata table, which shows the rules defining each stratum. Just like when we extracted the analysis set, we can extract the strata_table
with $
. Below, I show the strata table for m.strat
, our manually stratified data. Each row describes the definitions used to make one stratum. For example, below we see that stratum 1 contains all individuals who have B1 = 0
and B2 = 0
, and C1 = "a"
:
# get strata table for manually stratified data set m.strat$strata_table
The strata table for an automatically stratified data set is somewhat different:
# show strata table for the automatically stratified data a.strat$strata_table
The strata table for a.strat
shows what prognostic score quantiles were used to divide the strata. For example, individuals in stratum 1 have the lowest prognostic scores. This means that, if the individuals in stratum 1 were in the control group, we would predict them to have a very low probability of having the positive outcome, based solely on their values of X1
and X2
. The a vector of the cutoff values used to divide the strata can be extracted from a.strat
with extract_cut_points
.
Another important diagnostic for both manually and automatically stratified data is the issue_table
.
# issue table for manually stratified data m.strat$issue_table
This table shows each of the strata that we obtained through manual stratification and some of the potential issues we may face when trying to match within these strata. In particular, this table highlights some strata which are too small and/or have many more treated individuals than controls. The Total
column shows the total number of observations in each stratum. Some of the strata here are quite a bit larger than others. In practice, data sets of about 4000 start to take a long time to match, so that's about the maximum size we would like for our strata. In data sets of fewer than 100 individuals, it can be hard to find good matches between our treated and control individuals.
One benefit of automatic stratification is that it tends to create strata of roughly equal size, so that no strata are extremely small or extremely large. We can see this by comparing the Total
column in the issue tables of the manually and automatically stratified data sets:
# issue table for automatically stratified data a.strat$issue_table
Plots are an important way that we can check how our statification went. There are two types of plots that we can make automatically with either a manual_strata
object or an auto_strata
object. The first is a size-ratio plot. This plots each stratum in the analysis set based on its size and the percentage of control observations. If a stratum is too small or is very imbalanced in its treat:control ratio, finding quality matches may be difficult (shaded yellow areas). If a stratum is too large, matching may be very computationally taxing (shaded red areas). We can see here that a few strata from our manual stratification are quite small, and some have nearly entirely control individuals.
# size-ratio plot for manually stratified data plot(m.strat) # plot(m.strat, type = "SR") will give the same output # plot(m.strat, label = TRUE) will allow the user to click points to label them
We can compare this with a size-ratio plot from automatic stratification:
# size-ratio plot for automatically stratified data plot(a.strat)
This plot clearly shows that the strata generated by auto_stratify are all about the same size. However, some of the strata still have a treat:control ratio much less than 1:1. Some of this is unavoidable because our data set naturally had a treat:control ratio of about 1:5. However, we should try to avoid having very extreme treat:control ratios in any of our strata.
The second plot we can make to diagnose issues both manually and automatically stratified data is an overlap histogram. Below, I've made an overlap histogram for all strata, and an overlap histogram for stratum 3. This shows the propensity score distributions of the treated and control populations. In order to make this histogram, we must tell the plot
function what the propensity scores are for our data set by specifying the propensity
argument. We have three options: propensity
can be a propensity score formula (which will then be fit on the analysis set), a glm
modeling propensity score, or a vector of propensity scores. Here, I just pass in a formula.
# propensity score histograms for all strata from manually stratified data plot(m.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2) # propensity score histograms for all strata from manually stratified data plot(m.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 3)
A similar command works for automatically stratified data. :
# propensity score histograms for stratum 3 from automatically stratified data plot(a.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 3)
In addition to the two plot types above, there are some plots that can be used to visualize auto_strata
objects only: "residual"
and "AC"
plots.
Setting type = "AC"
in the plot command produces a Assignment-Control plot. This shows each individual in terms of their estimated propensity score and their estimated prognostic score (for more details, see @aikens2020pilot). This allows us to check how well the treated and control samples overlap not only in their probability for treatment but in their expected outcome under the control assignment. Just as with the overlap histograms, we need to provide information on the propensity score. Below, I use the same propensity formula as in the histogram above. The first plot shows all of the strata, with lines indicating the barriers between strata, while the second plot looks specifically at stratum 3 by specifying the stratum
argument. Apparent striations or groupings in assignment control plots like the ones below are common; they appear when some binary or categorical covariate has a strong association with either the outcome or the treatment assignment.
# make a Assignment-Control plot plot(a.strat, type = "AC", propensity = treat ~ X1 + X2 + B1 + B2) # make a Assignment-Control plot plot(a.strat, type = "AC", propensity = treat ~ X1 + X2 + B1 + B2, stratum = 3)
Finally, if we fit a prognostic model, then we can run diagnostics on that prognostic model with type = "residual"
. This is essentially just a wrapper for the glm
plotting method. It produces all the familiar diagnostic plots that we can obtain from calling plot
on a fitted model from glm
.
# diagnostic plots for prognostic model plot(a.strat, type = "residual") # plot(a.strat$prognostic_model) will do the same thing - see below
The plot
command with type = "residual"
gives us a one-line command that shows us standard diagnostic plots for the prognostic score model that we fit on the pilot set. If we want to run extra diagnostics on our prognostic model, we can do that too. Our a.strat
stores a copy of the prognostic score model, which we can extract in the usual way with $
. By extracting the prognostic model (prognostic_model
), we can run any of the usual diagnostics for our fitted model.
# extract prognostic model from a.strat progmod <- a.strat$prognostic_model # as an example, summarize model coefficients summary(progmod)
Based on the coefficients from the prognostic model, we can tell that our prognostic score estimates are primarily determined by each individual's X2
baseline value.
Now that we have stratified the data, we can match the individuals within each stratum. Note that all of the examples that follow additionally require the R package optmatch
(@hansen2006optmatch) to be installed.
The most flexible strategy - although occasionally more complex - is for the researcher to do this step by hand, using the strata designations in the analysis set returned by either one of the stratification methods. If the researcher plans to do something very specific or nuanced in the matching process, this may be the best approach. For example, users proficient with the matching package optmatch
(@hansen2006optmatch) will note that adding + strata(stratum)
to the matching formula supplied to optmatch will request that optmatch perform the matching within each stratum in the analysis set (see the optmatch documentation for further details). Another approach is to divide the analysis_set
into separate data frames and match on those individually, perhaps distributing over several computing clusters or using a different matching scheme in each stratum (for example different $k$ in $1:k$ matching)
If the goal is something more simple, e.g. a 1-to-1 or 1-to-$k$ optimal propensity score matching within strata, the strata_match
function can do that automatically. Below, I match two control individuals to each treatment individual within strata in my analysis set provided by a.strat. I use the propensity score formula treat ~ X1 + X2 + B1 + B2
.
if (!requireNamespace("optmatch", quietly = TRUE)) knitr::opts_chunk$set(eval = FALSE)
# match the automatically stratified data mymatch <- strata_match(a.strat, treat ~ X1 + X2 + B1 + B2, k = 1) # summarize matching results summary(mymatch)
This function in turn calls optmatch
to do the matching, stratified by the strata assignments. It returns an optmatch matching. In essence, each individual is given an identification code for its match. <NA>
indicates that the individual was not matched, and an identifier such as 3.15
indicates that this individual was in stratum 3 and it was placed in match 15.
# add match information as a column in the data set matched_data <- a.strat$analysis_set matched_data$match <- as.character(mymatch) head(matched_data)
Effect estimation can then be done using the matched analysis set via researcher's method of choice.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.