auto_stratify: Auto Stratify

View source: R/auto_stratify.R

auto_stratify R Documentation

Auto Stratify

Description

Automatically creates strata for matching based on a prognostic score formula or a vector of prognostic scores already estimated by the user. Creates an auto_strata object, which can be passed to strata_match for stratified matching or unpacked by the user to be matched by some other means.

Usage

auto_stratify(
  data,
  treat,
  prognosis,
  outcome = NULL,
  size = 2500,
  pilot_fraction = 0.1,
  pilot_size = NULL,
  pilot_sample = NULL,
  group_by_covariates = NULL
)


Arguments

data - data.frame with observations as rows, features as columns


treat - string giving the name of column designating treatment assignment


prognosis - information on how to build prognostic scores. Three different input types are allowed:

  1. vector of prognostic scores for all individuals in the data set. Should be in the same order as the rows of data.

  2. a formula for fitting a prognostic model

  3. an already-fit prognostic score model


outcome - string giving the name of column with outcome information. Required if prognosis is a vector of prognostic scores or an already-fit model. Otherwise it will be inferred from the prognostic formula.


size - numeric, desired size of strata (default = 2500)


pilot_fraction - numeric between 0 and 1 giving the proportion of controls to be allotted for building the prognostic score (default = 0.1)


pilot_size - alternative to pilot_fraction. Approximate number of observations to be used in pilot set. Note that the actual pilot set size returned may not be exactly pilot_size if group_by_covariates is specified because balancing by covariates may result in deviations from desired size. If pilot_size is specified, pilot_fraction is ignored.


pilot_sample - a data.frame of held aside samples for building prognostic score model. If pilot_sample is specified, pilot_size and pilot_fraction are both ignored.


group_by_covariates - character vector giving the names of covariates to be grouped by (optional). If specified, the pilot set will be sampled in a stratified manner, so that the composition of the pilot set reflects the composition of the whole data set in terms of these covariates. The specified covariates must be categorical.

Details

Stratifying by prognostic score quantiles can be more effective than manually stratifying a data set because the prognostic score is continuous, so the strata produced tend to be of equal size and to contain individuals with similar prognosis.
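
As a rough illustration of quantile-based stratification, the base R sketch below bins a vector of simulated prognostic scores into strata of about `size` observations each. It is not the package's implementation: the simulated scores, the variable names, and the rule `ceiling(n / size)` for choosing the number of strata are all assumptions made for this example.

```r
# Illustrative sketch only (base R, not the stratamatch implementation).
set.seed(1)
scores <- runif(100)   # simulated prognostic scores (assumed data)
size <- 25             # desired stratum size

# Assumed rule for this example: number of strata = ceiling(n / size)
n_strata <- ceiling(length(scores) / size)

# Quantile cut points; with continuous scores the bins are (nearly) equal in size
breaks <- quantile(scores, probs = seq(0, 1, length.out = n_strata + 1))
stratum <- cut(scores, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Number of observations per stratum
table(stratum)
```

Because the scores are continuous, ties at the cut points are unlikely, which is why the resulting bins come out close to the requested size.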

Automatic stratification requires information on how the prognostic scores should be derived. This is primarily determined by the specification of the prognosis argument. Three main forms of input for prognosis are allowed:

  1. A vector of prognostic scores. This vector should have the same length and order as the rows in the data set. If this method is used, the outcome argument must also be specified; this is simply a string giving the name of the column which contains outcome information.

  2. A formula for prognosis (e.g. outcome ~ X1 + X2). If this method is used, auto_stratify will automatically split the data set into a pilot_set and an analysis_set. The pilot set will be used to fit a logistic regression model for outcome in the absence of treatment, and this model will be used to estimate prognostic scores on the analysis set. The analysis set will then be stratified based on the estimated prognostic scores. In this case the outcome argument need not be specified since it can be inferred from the input formula.

  3. A model for prognosis (e.g. a glm object). If this method is used, the outcome argument must also be specified.
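
The formula-based workflow in item 2 can be sketched by hand in base R. The example below only mirrors the steps described above (hold out some controls, fit a logistic regression on them, score the remaining data); the simulated data set, the column name `prog_score`, and the sampling details are assumptions for illustration and may differ from what auto_stratify does internally.

```r
# Illustrative sketch only (base R); data and variable names are assumptions.
set.seed(42)
n <- 1000
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
dat$treat <- rbinom(n, 1, plogis(dat$X1))     # simulated treatment assignment
dat$outcome <- rbinom(n, 1, plogis(dat$X2))   # simulated binary outcome

# Hold out a fraction of the controls as the pilot set
pilot_fraction <- 0.1
control_rows <- which(dat$treat == 0)
pilot_rows <- sample(control_rows,
  size = round(pilot_fraction * length(control_rows))
)
pilot_set <- dat[pilot_rows, ]
analysis_set <- dat[-pilot_rows, ]

# Fit the prognostic model on the pilot set only (controls, so no treatment effect)
prognostic_model <- glm(outcome ~ X2, data = pilot_set, family = "binomial")

# Estimate prognostic scores for every observation in the analysis set
analysis_set$prog_score <- predict(prognostic_model,
  newdata = analysis_set,
  type = "response"
)
```

Keeping the pilot observations out of `analysis_set` is the point of the split: the scores used for stratification are never estimated on the same rows that were used to fit the model.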

Value

Returns an auto_strata object. This contains:

  • outcome - a string giving the name of the column where outcome information is stored

  • treat - a string giving the name of the column encoding treatment assignment

  • analysis_set - the data set with strata assignments

  • call - the call to auto_stratify used to generate this object

  • issue_table - a table of each stratum and potential issues of size and treat:control balance. In small or imbalanced strata, it may be difficult or infeasible to find high-quality matches, while very large strata may be computationally intensive to match.

  • strata_table - a table of each stratum and the prognostic score quantile bin to which it corresponds

  • prognostic_scores - a vector of prognostic scores.

  • prognostic_model - a model for prognosis fit on a pilot data set. Will be NULL if a vector of prognostic scores was provided as the prognosis argument to auto_stratify rather than a model or formula.

  • pilot_set - the set of controls used to fit the prognostic model. These are excluded from subsequent analysis so that the prognostic score is not overfit to the data used to estimate the treatment effect. Will be NULL if a pre-fit model or a vector of prognostic scores was provided as the prognosis argument to auto_stratify rather than a formula.
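
To give a sense of the kind of per-stratum summary that issue_table describes (stratum size and treat:control balance), here is a minimal base R sketch on simulated data. The column names, the simulated strata, and the 20% threshold are hypothetical choices for this example; they are not taken from the package.

```r
# Illustrative sketch only (base R); column names and threshold are assumptions.
set.seed(7)
analysis_set <- data.frame(
  stratum = rep(1:3, each = 40),
  treat = rbinom(120, 1, 0.3)
)

# Count treated and control observations in each stratum
counts <- table(
  stratum = analysis_set$stratum,
  treat = factor(analysis_set$treat, levels = c(0, 1))
)

stratum_summary <- data.frame(
  stratum = as.integer(rownames(counts)),
  treated = as.vector(counts[, "1"]),
  control = as.vector(counts[, "0"]),
  total = as.vector(rowSums(counts))
)
stratum_summary$control_proportion <-
  stratum_summary$control / stratum_summary$total

# Hypothetical rule: flag strata where fewer than 20% of observations are controls
stratum_summary$few_controls <- stratum_summary$control_proportion < 0.2

stratum_summary
```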

Troubleshooting

This section suggests fixes for common errors that appear while fitting the prognostic score or using it to estimate prognostic scores on the analysis set.

  • Encountered an error while fitting the prognostic model... numeric probabilities 0 or 1 produced. This error means that the prognostic model can perfectly separate positive from negative outcomes. Estimating a treatment effect in this case is unwise since an individual's baseline characteristics perfectly determine their outcome, regardless of whether they receive the treatment. This error may also appear on rare occasions when your pilot set is very small (number of observations approximately <= number of covariates in the prognostic model), so that perfect separation happens by chance.

  • Encountered an error while estimating prognostic scores ... factor X has new levels ... This may indicate that some value(s) of one or more categorical variables appear in the analysis set which were not seen in the pilot set. This means that when we try to obtain prognostic scores for our analysis set, we run into some new value that our prognostic model was not prepared to handle. There are a few ways to troubleshoot this problem:

    • Rejection sampling. Run auto_stratify again with the same arguments until this error does not occur (i.e. until some observations with the missing value are randomly selected into the pilot set).

    • Eliminate this covariate from the prognostic formula.

    • Remove observations with the rare covariate value from the entire data set. Consider carefully how this exclusion might affect your results.
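
Before choosing among the options above, it can help to see which categorical values are causing the "new levels" error. The helper below is a hypothetical base R sketch (`find_unseen_levels` is not a stratamatch function); it lists, for each named column, the values present in the analysis set but absent from the pilot set. The toy data frames and column names are assumptions for illustration.

```r
# Hypothetical helper for illustration; not part of stratamatch.
find_unseen_levels <- function(pilot_set, analysis_set, columns) {
  unseen <- lapply(columns, function(col) {
    setdiff(
      unique(as.character(analysis_set[[col]])),
      unique(as.character(pilot_set[[col]]))
    )
  })
  names(unseen) <- columns
  # Keep only the columns that have at least one unseen value
  unseen[lengths(unseen) > 0]
}

# Toy data: the value "c" of C1 never appears in the pilot set
pilot_set <- data.frame(C1 = c("a", "b", "a"), B2 = c("x", "y", "x"))
analysis_set <- data.frame(C1 = c("a", "b", "c"), B2 = c("x", "y", "y"))

unseen <- find_unseen_levels(pilot_set, analysis_set, c("C1", "B2"))
unseen
```

An empty result would suggest that the pilot set covers every categorical value in the analysis set for the columns checked.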

Other errors or warnings can occur if the pilot set is too small and the prognostic formula is too complicated. Always make sure that the number of observations in the pilot set is large enough that you can confidently fit a prognostic model with the number of covariates you want.
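
The perfect-separation situation described earlier in this section is easy to reproduce with a very small pilot set. In base R, glm reports it as a warning; the sketch below records that warning on a deliberately separable toy data set. Whether stratamatch surfaces exactly this warning is an assumption here, and the data and variable names are made up for the example.

```r
# Illustrative sketch only (base R): a tiny, perfectly separated pilot set.
pilot_set <- data.frame(X1 = 1:10, outcome = as.integer(1:10 > 5))

warnings_seen <- character(0)
prognostic_model <- withCallingHandlers(
  glm(outcome ~ X1, data = pilot_set, family = "binomial"),
  warning = function(w) {
    # Record the warning text, then let the fit continue
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# TRUE when glm warned about fitted probabilities of 0 or 1
separation_detected <- any(
  grepl("fitted probabilities numerically 0 or 1", warnings_seen)
)
separation_detected
```

A check like this on a candidate pilot set can indicate whether the pilot set is too small, or the formula too rich, before running the full stratification.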

See Also

manual_stratify, new_auto_strata

Examples

# make sample data set
dat <- make_sample_data(n = 75)

# construct a pilot set, build a prognostic score for `outcome` based on X2
# and stratify the data set based on the scores into sets of about 25
# observations
a.strat_formula <- auto_stratify(dat, "treat", outcome ~ X2, size = 25)

# stratify the data set based on a model for prognosis
pilot_data <- make_sample_data(n = 30)
prognostic_model <- glm(outcome ~ X2, pilot_data, family = "binomial")
a.strat_model <- auto_stratify(dat, "treat", prognostic_model,
  outcome = "outcome", size = 25
)

# stratify the data set based on a vector of prognostic scores
prognostic_scores <- predict(prognostic_model,
  newdata = dat,
  type = "response"
)
a.strat_scores <- auto_stratify(dat, "treat", prognostic_scores,
  outcome = "outcome", size = 25
)

# diagnostic plots
plot(a.strat_formula, type = "AC", propensity = treat ~ X1, stratum = 1)
plot(a.strat_formula, type = "hist", propensity = treat ~ X1, stratum = 1)
plot(a.strat_formula, type = "residual")

stratamatch documentation built on March 31, 2022, 9:07 a.m.