View source: R/auto_stratify.R
auto_stratify | R Documentation |
Automatically creates strata for matching based on a prognostic score formula
or a vector of prognostic scores already estimated by the user. Creates an
auto_strata object, which can be passed to strata_match
for stratified matching or unpacked by the user to be matched by some other
means.
auto_stratify(
  data,
  treat,
  prognosis,
  outcome = NULL,
  size = 2500,
  pilot_fraction = 0.1,
  pilot_size = NULL,
  pilot_sample = NULL,
  group_by_covariates = NULL
)
data |
|
treat |
string giving the name of column designating treatment assignment |
prognosis |
information on how to build prognostic scores. Three different input types are allowed:
|
outcome |
string giving the name of column with outcome information. Required if prognosis is a vector of prognostic scores or a fitted model. Otherwise it will be inferred from the prognosis formula. |
size |
numeric, desired size of strata (default = 2500) |
pilot_fraction |
numeric between 0 and 1 giving the proportion of controls to be allotted for building the prognostic score (default = 0.1) |
pilot_size |
alternative to pilot_fraction. Approximate number of
observations to be used in pilot set. Note that the actual pilot set size
returned may not be exactly |
pilot_sample |
a data.frame of held aside samples for building
prognostic score model. If |
group_by_covariates |
character vector giving the names of covariates to be grouped by (optional). If specified, the pilot set will be sampled in a stratified manner, so that the composition of the pilot set reflects the composition of the whole data set in terms of these covariates. The specified covariates must be categorical. |
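The stratified pilot sampling described for group_by_covariates can be illustrated with a small base R sketch. This is not the package's internal code: the data frame `controls`, the column `C1`, and the per-group rounding rule are assumptions made for the example.

```r
# Hypothetical control group with one categorical covariate, C1.
set.seed(3)
controls <- data.frame(
  id = 1:200,
  C1 = rep(c("a", "b", "c"), times = c(100, 60, 40))
)
pilot_fraction <- 0.1

# Sample the same fraction within each level of C1, so the pilot set
# mirrors the composition of the full set of controls.
groups <- split(controls$id, controls$C1)
pilot_ids <- unlist(lapply(groups, function(ids) {
  ids[sample.int(length(ids), size = round(pilot_fraction * length(ids)))]
}))

pilot_set <- controls[controls$id %in% pilot_ids, ]
table(pilot_set$C1)
```

Because each group is rounded separately, the total pilot size can differ slightly from a simple fraction of all controls.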
Stratifying by prognostic score quantiles can be more effective than manually stratifying a data set. Because the prognostic score is continuous, the strata produced tend to be of equal size, and observations within a stratum tend to have similar prognosis.
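As a rough illustration of quantile-based stratification, the base R sketch below bins a continuous score into strata of about the requested size. The simulated `scores` and the rule for choosing the number of strata are assumptions for the example, not a description of how auto_stratify is implemented.

```r
# Hypothetical data: 100 continuous prognostic scores.
set.seed(1)
scores <- runif(100)
size <- 25                                   # desired stratum size
n_strata <- ceiling(length(scores) / size)   # assumed rule: 4 strata here

# Cut points at equally spaced quantiles of the score.
breaks <- quantile(scores, probs = seq(0, 1, length.out = n_strata + 1))

# Assign each observation to the quantile bin containing its score.
stratum <- cut(scores, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(stratum)
```

With a categorical stratification variable, by contrast, stratum sizes are dictated by how often each category occurs.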
Automatic stratification requires information on how the prognostic scores
should be derived. This is primarily determined by the specification of the
prognosis argument. Three main forms of input for prognosis are allowed:
A vector of prognostic scores. This vector should have the same length and
order as the rows in the data set. If this method is used, the outcome
argument must also be specified; this is simply a string giving the name of
the column which contains outcome information.
A formula for prognosis (e.g. outcome ~ X1 + X2). If this method is used,
auto_stratify will automatically split the data set into a pilot_set and an
analysis_set. The pilot set will be used to fit a logistic regression model
for outcome in the absence of treatment, and this model will be used to
estimate prognostic scores on the analysis set. The analysis set will then be
stratified based on the estimated prognostic scores. In this case the outcome
argument need not be specified since it can be inferred from the input
formula.
A model for prognosis (e.g. a glm object). If this method is used, the
outcome argument must also be specified.
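The formula workflow described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's own code: the simulated data frame `dat`, its columns, and the simple random sampling of controls are all hypothetical.

```r
# Hypothetical data: binary treatment and outcome, two covariates.
set.seed(2)
n <- 1000
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
dat$treat <- rbinom(n, 1, plogis(dat$X1 - 1))
dat$outcome <- rbinom(n, 1, plogis(dat$X2))

# Set aside a fraction of the controls as the pilot set.
pilot_fraction <- 0.1
control_rows <- which(dat$treat == 0)
pilot_rows <- sample(control_rows,
                     size = round(pilot_fraction * length(control_rows)))
pilot_set <- dat[pilot_rows, ]
analysis_set <- dat[-pilot_rows, ]

# Fit a logistic regression for the outcome on the pilot controls only.
prognostic_model <- glm(outcome ~ X2, data = pilot_set, family = "binomial")

# Estimate prognostic scores on the analysis set.
analysis_set$prog_score <- predict(prognostic_model,
                                   newdata = analysis_set,
                                   type = "response")
```

The scores in `analysis_set$prog_score` could then be binned by quantile to form strata.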
Returns an auto_strata
object. This contains:
outcome - a string giving the name of the column where outcome
information is stored
treat - a string giving the name of the column encoding treatment
assignment
analysis_set - the data set with strata assignments
call - the call to auto_stratify used to generate this object
issue_table - a table of each stratum and potential issues of
size and treat:control balance. In small or imbalanced strata, it may be
difficult or infeasible to find high-quality matches, while very large
strata may be computationally intensive to match.
strata_table - a table of each stratum and the prognostic
score quantile bin to which it corresponds
prognostic_scores - a vector of prognostic scores.
prognostic_model - a model for prognosis fit on a pilot data
set. Will be NULL if a vector of prognostic scores was provided as
the prognosis argument to auto_stratify rather than a model or formula.
pilot_set - the set of controls used to fit the prognostic
model. These are excluded from subsequent analysis so that the prognostic
score is not overfit to the data used to estimate the treatment effect.
Will be NULL if a pre-fit model or a vector of prognostic scores was
provided as the prognosis argument to auto_stratify rather than a formula.
This section suggests fixes for common errors that appear while fitting the prognostic model or using it to estimate prognostic scores on the analysis set.
Encountered an error while fitting the prognostic model...
numeric probabilities 0 or 1 produced. This error means that the
prognostic model can perfectly separate positive from negative outcomes.
Estimating a treatment effect in this case is unwise since an individual's
baseline characteristics perfectly determine their outcome, regardless of
whether they receive the treatment. This error may also appear on rare
occasions when your pilot set is very small (number of observations
approximately <= number of covariates in the prognostic model), so that
perfect separation happens by chance.
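To see what perfect separation looks like outside the package, the base R sketch below fits a logistic regression to a tiny, perfectly separated data set and records the warnings that glm raises. The data frame `pilot` is hypothetical, and the message matched here is base R's own warning text, which may differ from the wording auto_stratify reports.

```r
# Hypothetical pilot set in which x perfectly predicts y.
pilot <- data.frame(x = c(-10:-1, 1:10))
pilot$y <- as.integer(pilot$x > 0)

# Collect warnings raised during fitting instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = pilot, family = "binomial"),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Base R flags separation with this warning message.
separated <- any(grepl("fitted probabilities numerically 0 or 1", msgs))
separated
```

A check like this on a candidate pilot set can indicate whether the covariates, rather than a small sample, are responsible for the separation.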
Encountered an error while estimating prognostic scores ...
factor X has new levels ... This may indicate that some value(s) of one
or more categorical variables appear in the analysis set which were not
seen in the pilot set. This means that when we try to obtain prognostic
scores for our analysis set, we run into some new value that our prognostic
model was not prepared to handle. There are a few options we have to
troubleshoot this problem:
Rejection sampling. Run auto_stratify again with the
same arguments until this error does not occur (i.e. until some
observations with the missing value are randomly selected into the pilot
set).
Eliminate this covariate from the prognostic formula.
Remove observations with the rare covariate value from the entire data set. Consider carefully how this exclusion might affect your results.
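Before choosing among these options, it can help to identify which values are responsible. The base R sketch below compares the levels of a categorical covariate across two data frames. The names `pilot_set`, `analysis_set`, and `C1` are hypothetical stand-ins for your own data.

```r
# Hypothetical pilot and analysis sets sharing a categorical covariate C1.
pilot_set <- data.frame(C1 = c("a", "b", "a", "b"))
analysis_set <- data.frame(C1 = c("a", "b", "c", "a"))

# Values present in the analysis set but never seen in the pilot set.
unseen <- setdiff(unique(analysis_set$C1), unique(pilot_set$C1))
unseen

# Rows of the analysis set that carry an unseen value.
affected <- analysis_set[analysis_set$C1 %in% unseen, , drop = FALSE]
nrow(affected)
```

If only a handful of rows are affected, removing them may be reasonable. If many are, dropping the covariate or resampling the pilot set is likely the safer choice.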
Other errors or warnings can occur if the pilot set is too small and the prognostic formula is too complicated. Always make sure that the number of observations in the pilot set is large enough that you can confidently fit a prognostic model with the number of covariates you want.
manual_stratify, new_auto_strata
# make sample data set
set.seed(111)
dat <- make_sample_data(n = 75)

# construct a pilot set, build a prognostic score for `outcome` based on X2
# and stratify the data set based on the scores into sets of about 25
# observations
a.strat_formula <- auto_stratify(dat, "treat", outcome ~ X2, size = 25)

# stratify the data set based on a model for prognosis
pilot_data <- make_sample_data(n = 30)
prognostic_model <- glm(outcome ~ X2, pilot_data, family = "binomial")
a.strat_model <- auto_stratify(dat, "treat", prognostic_model,
  outcome = "outcome", size = 25
)

# stratify the data set based on a vector of prognostic scores
prognostic_scores <- predict(prognostic_model,
  newdata = dat, type = "response"
)
a.strat_scores <- auto_stratify(dat, "treat", prognostic_scores,
  outcome = "outcome", size = 25
)

# diagnostic plots
plot(a.strat_formula)
plot(a.strat_formula, type = "AC", propensity = treat ~ X1, stratum = 1)
plot(a.strat_formula, type = "hist", propensity = treat ~ X1, stratum = 1)
plot(a.strat_formula, type = "residual")