R/auto_stratify.R
In stratamatch: Stratification and Matching for Large Observational Data Sets

Documented in auto_stratify build_autostrata check_base_inputs_auto_stratify check_outcome check_prognostic_formula check_scores estimate_scores fit_prognostic_model is_binary make_autostrata_table

#----------------------------------------------------------
### CONTAINS:
# auto_stratify its helper functions, except split_pilot_set
#----------------------------------------------------------
#----------------------------------------------------------
### AUTO STRATIFY
#----------------------------------------------------------

#' Auto Stratify
#'
#' Automatically creates strata for matching based on a prognostic score formula
#' or a vector of prognostic scores already estimated by the user. Creates a
#' \code{auto_strata} object, which can be passed to \code{\link{strata_match}}
#' for stratified matching or unpacked by the user to be matched by some other
#' means.
#'
#' Stratifying by prognostic score quantiles can be more effective than manually
#' stratifying a data set because the prognostic score is continuous, thus the
#' strata produced tend to be of equal size with similar prognosis.
#'
#' Automatic stratification requires information on how the prognostic scores
#' should be derived.  This is primarily determined by the specifciation of the
#' \code{prognosis} argument.  Three main forms of input for \code{prognosis}
#' are allowed: \enumerate{ \item A vector of prognostic scores. This vector
#' should be the same length and order of the rows in the data set.  If this
#' method is used, the \code{outcome} argument must also be specified; this is
#' simply a string giving the name of the column which contains outcome
#' information. \item A formula for prognosis (e.g. \code{outcome ~ X1 + X2}).
#' If this method is used, \code{auto_stratify} will automatically split the
#' data set into a \code{pilot_set} and an \code{analysis_set}.  The pilot set
#' will be used to fit a logistic regression model for outcome in the absence of
#' treatment, and this model will be used to estimate prognostic scores on the
#' analysis set.  The analysis set will then be stratified based on the
#' estimated prognostic scores.  In this case the \code{outcome} argument need
#' not be specified since it can be inferred from the input formula. \item A
#' model for prognosis (e.g. a \code{glm} object).  If this method is used, the
#' \code{outcome} argument must also be specified}
#'
#' @section Troubleshooting:
#'
#'   This section suggests fixes for common errors that appear while fitting the
#'   prognostic score or using it to estimate prognostic scores on the analysis
#'   set.
#'
#'   \itemize{
#'
#'   \item \code{Encountered an error while fitting the prognostic model...
#'   numeric probabilities 0 or 1 produced}. This error means that the
#'   prognostic model can perfectly separate positive from negative outcomes.
#'   Estimating a treatment effect in this case is unwise since an individual's
#'   baseline characteristics perfectly determine their outcome, regardless of
#'   whether they recieve the treatment.  This error may also appear on rare
#'   occaisions when your pilot set is very small (number of observations
#'   approximately <= number of covariates in the prognostic model), so that
#'   perfect separation happens by chance.
#'
#'   \item \code{Encountered an error while estimating prognostic scores ...
#'   factor X has new levels ... } This may indicate that some value(s) of one
#'   or more categorical variables appear in the analysis set which were not
#'   seen in the pilot set. This means that when we try to obtain prognostic
#'   scores for our analysis set, we run into some new value that our prognostic
#'   model was not prepared to handle.  There are a few options we have to
#'   troubleshoot this problem: \itemize{
#'
#'   \item \strong{Rejection sampling.}  Run \code{auto_stratify} again with the
#'   same arguments until this error does not occur (i.e. until some
#'   observations with the missing value are randomly selected into the pilot
#'   set)
#'
#'   \item \strong{Eliminate this covariate from the prognostic formula.}
#'
#'   \item \strong{Remove observations with the rare covariate value from the
#'   entire data set.} Consider carefully how this exclusion might affect your
#'   results.
#'
#'   } }
#'
#'   Other errors or warnings can occur if the pilot set is too small and the
#'   prognostic formula is too complicated.  Always make sure that the number of
#'   observations in the pilot set is large enough that you can confidently fit
#'   a prognostic model with the number of covariates you want.
#'
#' @param data \code{data.frame} with observations as rows, features as columns
#' @param treat string giving the name of column designating treatment
#'   assignment
#' @param outcome string giving the name of column with outcome information.
#'   Required if prognostic_scores is specified.  Otherwise it will be inferred
#'   from prog_formula
#' @param prognosis information on how to build prognostic scores.  Three
#'   different input types are allowed: \enumerate{ \item vector of prognostic
#'   scores for all individuals in the data set. Should be in the same order as
#'   the rows of \code{data}. \item a \code{formula} for fitting a prognostic
#'   model \item an already-fit prognostic score model}
#' @param size numeric, desired size of strata (default = 2500)
#' @param pilot_fraction numeric between 0 and 1 giving the proportion of
#'   controls to be allotted for building the prognostic score (default = 0.1)
#' @param pilot_size alternative to pilot_fraction. Approximate number of
#'   observations to be used in pilot set. Note that the actual pilot set size
#'   returned may not be exactly \code{pilot_size} if \code{group_by_covariates}
#'   is specified because balancing by covariates may result in deviations from
#'   desired size. If \code{pilot_size} is specified, \code{pilot_fraction} is
#'   ignored.
#' @param pilot_sample a data.frame of held aside samples for building
#'   prognostic score model. If \code{pilot_sample} is specified,
#'   \code{pilot_size} and \code{pilot_fraction} are both ignored.
#' @param group_by_covariates character vector giving the names of covariates to
#'   be grouped by (optional). If specified, the pilot set will be sampled in a
#'   stratified manner, so that the composition of the pilot set reflects the
#'   composition of the whole data set in terms of these covariates.  The
#'   specified covariates must be categorical.
#' @return Returns an \code{auto_strata} object.  This contains: \itemize{
#'
#'   \item \code{outcome} - a string giving the name of the column where outcome
#'   information is stored
#'
#'   \item \code{treat} - a string giving the name of the column encoding
#'   treatment assignment
#'
#'   \item \code{analysis_set} - the data set with strata assignments
#'
#'   \item \code{call} - the call to \code{auto_stratify} used to generate this
#'   object
#'
#'   \item \code{issue_table} - a table of each stratum and potential issues of
#'   size and treat:control balance. In small or imbalanced strata, it may be
#'   difficult or infeasible to find high-quality matches, while very large
#'   strata may be computationally intensive to match.
#'
#'   \item \code{strata_table} - a table of each stratum and the prognostic
#'   score quantile bin to which it corresponds
#'
#'   \item \code{prognostic_scores} - a vector of prognostic scores.
#'
#'   \item \code{prognostic_model} - a model for prognosis fit on a pilot data
#'   set.  Will be \code{NULL} if a vector of prognostic scores was provided as
#'   the \code{prognosis} argument to \code{auto_stratify} rather than a model
#'   or formula.
#'
#'   \item \code{pilot_set} - the set of controls used to fit the prognostic
#'   model. These are excluded from subsequent analysis so that the prognostic
#'   score is not overfit to the data used to estimate the treatment effect.
#'   Will be \code{NULL} if a pre-fit model or a vector of prognostic scores was
#'   provided as the \code{prognosis} argument to \code{auto_stratify} rather
#'   than formula.
#'
#'   }
#' @seealso \code{\link{manual_stratify}}, \code{\link{new_auto_strata}}
#' @export
#' @examples
#' # make sample data set
#' set.seed(111)
#' dat <- make_sample_data(n = 75)
#'
#' # construct a pilot set, build a prognostic score for `outcome` based on X2
#' # and stratify the data set based on the scores into sets of about 25
#' # observations
#' a.strat_formula <- auto_stratify(dat, "treat", outcome ~ X2, size = 25)
#'
#' # stratify the data set based on a model for prognosis
#' pilot_data <- make_sample_data(n = 30)
#' prognostic_model <- glm(outcome ~ X2, pilot_data, family = "binomial")
#' a.strat_model <- auto_stratify(dat, "treat", prognostic_model,
#'   outcome = "outcome", size = 25
#' )
#'
#' # stratify the data set based on a vector of prognostic scores
#' prognostic_scores <- predict(prognostic_model,
#'   newdata = dat,
#'   type = "response"
#' )
#' a.strat_scores <- auto_stratify(dat, "treat", prognostic_scores,
#'   outcome = "outcome", size = 25
#' )
#'
#' # diagnostic plots
#' plot(a.strat_formula)
#' plot(a.strat_formula, type = "AC", propensity = treat ~ X1, stratum = 1)
#' plot(a.strat_formula, type = "hist", propensity = treat ~ X1, stratum = 1)
#' plot(a.strat_formula, type = "residual")
auto_stratify <- function(data, treat, prognosis,
                          outcome = NULL, size = 2500,
                          pilot_fraction = 0.1, pilot_size = NULL,
                          pilot_sample = NULL, group_by_covariates = NULL) {
  check_base_inputs_auto_stratify(data, treat, outcome)

  # if input data is grouped, all sorts of strange things happen
  data <- data %>% dplyr::ungroup()

  build <- build_autostrata(
    data, treat, prognosis, outcome, pilot_fraction,
    pilot_size, pilot_sample, group_by_covariates
  )

  analysis_set <- build$analysis_set
  prognostic_scores <- build$prognostic_scores
  pilot_set <- build$pilot_set
  prognostic_model <- build$prognostic_model
  outcome <- build$outcome

  # Create strata from prognostic score quantiles
  n_bins <- ceiling(dim(analysis_set)[1] / size)

  withCallingHandlers({
      qcut <- Hmisc::cut2(prognostic_scores, g = n_bins)
    },
    error = function(c) {
      stop("Error defining quantile bins: (Sometimes this happens when \"size\" << analysis set size.)")}
  )

  # make strata table
  strata_table <- make_autostrata_table(qcut)
  analysis_set$stratum <- as.integer(qcut)

  # package and return result
  result <- new_auto_strata(
    analysis_set = analysis_set,
    treat = treat,
    call = match.call(),
    issue_table = make_issue_table(analysis_set, treat),
    strata_table = strata_table,
    outcome = outcome,
    prognostic_scores = prognostic_scores,
    prognostic_model = prognostic_model,
    pilot_set = pilot_set
  )

  return(result)
}

#----------------------------------------------------------
### Helpers
#----------------------------------------------------------


#' Build Autostrata object
#'
#' Not meant to be called externally. Given the arguments to auto_stratify,
#' build the prognostic scores and return the analysis set, the prognostic
#' scores, the pilot set, the prognostic model, and the outcome string.  The
#' primary function of this code is to determine the type of \code{prognosis}
#' and handle it appropriately.
#'
#' @inheritParams auto_stratify
#' @seealso \code{\link{auto_stratify}}
#'
#' @return a list of: analysis set, prognostic scores, pilot set, prognostic
#'   model, and outcome string
#' @keywords internal
build_autostrata <- function(data, treat, prognosis, outcome, pilot_fraction,
                             pilot_size, pilot_sample, group_by_covariates) {

  # if prognosis is a vector of prognostic scores
  if (is.numeric(prognosis)) {
    check_scores(prognosis, data, outcome)
    prognostic_scores <- prognosis
    analysis_set <- data
    pilot_set <- NULL
    prognostic_model <- NULL
  }

  # if prognosis is a formula
  else if (inherits(prognosis, "formula")) {
    check_prognostic_formula(prognosis, data, outcome, treat)

    if (!is.null(pilot_sample)) {
      # check pilot sample
      if (!is.data.frame(pilot_sample)) {
        stop("pilot_sample must be a data.frame")
      }
      message("Using user-specified set for prognostic score modeling.")
      pilot_set <- pilot_sample
      analysis_set <- data
    } else {
      split <- split_pilot_set(
        data, treat, pilot_fraction, pilot_size,
        group_by_covariates
      )
      pilot_set <- split$pilot_set
      analysis_set <- split$analysis_set
    }
    outcome <- all.vars(prognosis)[1]

    prognostic_model <- fit_prognostic_model(pilot_set, prognosis, outcome)
    prognostic_scores <- estimate_scores(prognostic_model, analysis_set)
  }

  # if prognosis is a model, or it is unrecognized
  else {
    # try to predict.  If successful, prognosis was a model.
    # otherwise, throw an error: prognosis type not recognized
    withCallingHandlers(
      {
        prognostic_scores <- predict(prognosis,
          newdata = data,
          type = "response"
        )
      },
      error = function(c) {
        stop("prognosis type not recognized")
      }
    )
    if (is.null(outcome)) {
      stop("If specifying a prognostic score model, outcome must be specified")
    }
    analysis_set <- data
    pilot_set <- NULL
    prognostic_model <- prognosis
  }

  return(list(
    analysis_set = analysis_set, prognostic_scores = prognostic_scores,
    pilot_set = pilot_set, prognostic_model = prognostic_model,
    outcome = outcome
  ))
}


#' Fit Prognostic Model
#'
#' Given a pilot set and a prognostic formula, return the fitted formula.  If
#' the outcome is binary, fit a logistic regression.  Otherwise, fit a linear
#' model.
#'
#' @param dat data.frame on which model should be fit
#' @param prognostic_formula formula for prognostic model
#' @param outcome string giving name of column of data where outcomes are
#'   recorded
#'
#' @return a glm or lm object fit from \code{prognostic_formula} on \code{data}
#' @keywords internal
fit_prognostic_model <- function(dat, prognostic_formula, outcome) {

  # if outcome is binary or logical, run logistic regression
  if (is_binary(na.omit(dat[[outcome]]))) {
    message(paste(
      "Fitting prognostic model via logistic regression:",
      Reduce(paste, deparse(prognostic_formula))
    ))
    withCallingHandlers(
      {
        prognostic_model <- glm(prognostic_formula,
          data = dat,
          family = "binomial"
        )
      },
      error = function(e) {
        message("Error while fitting the prognostic model.")
        message("For troubleshooting help, run help(\"auto_stratify\")")
        stop(e)
      },
      warning = function(w) {
        message("Warning while fitting the prognostic model.")
        message("For troubleshooting help, run help(\"auto_stratify\")")
        warning(w)
      }
    )
  }

  # if outcome is numeric, run linear regression
  else if (is.numeric(na.omit(dat[[outcome]]))) {
    if (length(unique(na.omit(dat[[outcome]]))) == 2) {
      warning("outcome column has only two values. Is it binary?")
      message("Convert to 01 or logical format to run logistic regression instead")
    }
    message(paste(
      "Fitting prognostic model via linear regression:",
      Reduce(paste, deparse(prognostic_formula))
    ))
    withCallingHandlers(
      {
        prognostic_model <- lm(prognostic_formula, data = dat)
      },
      error = function(e) {
        message("Error while fitting the prognostic model.")
        message("For troubleshooting help, run help(\"auto_stratify\")")
        stop(e)
      },
      warning = function(w) {
        message("Warning while fitting the prognostic model.")
        message("For troubleshooting help, run help(\"auto_stratify\")")
        warning(w)
      }
    )
  }
  else {
    stop("Outcome was not a recognized type. Must be binary, logical, or numeric")
  }

  return(prognostic_model)
}

#' Estimate Prognostic Scores
#'
#' Tries to make prognostic scores.  If successfull, returns them, otherwise
#' throws an error message.  Common failure mode is that the prognostic score is
#' built on some categorical variable that takes on some values in the analysis
#' set that are never seen in the pilot set.  Outputs are on the response scale,
#' (rather than the linear predictor), so the score is the expected value of the
#' outcome under the control assignement based on the observed covariates.
#'
#' @param prognostic_model Model of prognosis
#' @param analysis_set data set on which prognostic scores should be estimated
#'
#' @return vector of prognostic scores
#' @keywords internal
estimate_scores <- function(prognostic_model, analysis_set) {
  withCallingHandlers(
    {
      scores <- predict(prognostic_model, analysis_set, type = "response")
    },
    error = function(e) {
      message("Error while estimating prognostic scores from the prognostic model.")
      message("For troubleshooting help, run help(\"auto_stratify\")")
      stop(e)
    },
    warning = function(w) {
      message("Warning while estimating prognostic scores from the prognostic model.")
      message("For troubleshooting help, run help(\"auto_stratify\")")
      warning(w)
    }
  )

  return(scores)
}

#' Make strata table
#'
#' @param qcut the prognostic score quantile cuts
#'
#' @return data.frame of strata definitions
#' @keywords internal
make_autostrata_table <- function(qcut) {
  data.frame(qcut) %>%
    dplyr::mutate(stratum = as.integer(qcut), quantile_bin = as.character(qcut)) %>%
    dplyr::group_by(.data$quantile_bin) %>%
    dplyr::summarise(
      size = dplyr::n(), stratum = dplyr::first(.data$stratum),
      .groups = "drop_last"
    ) %>%
    dplyr::arrange(.data$stratum) %>%
    dplyr::select(.data$stratum, .data$quantile_bin, .data$size)
}

#----------------------------------------------------------
### Input Checkers
#----------------------------------------------------------

#' Check inputs from auto_stratify
#'
#' Not meant to be called externally.  Throws errors if basic auto_stratify
#' inputs are incorrect.
#'
#' @inheritParams auto_stratify
#'
#' @return nothing; produces errors and warnings if anything is wrong
#' @keywords internal
check_base_inputs_auto_stratify <- function(data, treat, outcome) {
  # general checks
  if (!is.data.frame(data)) stop("data must be a data.frame")
  if (!(is.character(treat) & length(treat) == 1)) {
    stop("treat must be a single string")
  }
  if (!is.element(treat, colnames(data))) {
    stop("treat must be the name of a column in data")
  }
  if (!is_binary(data[[treat]])) {
    stop("treatment column must be binary or logical")
  }
  if (!is.null(outcome)) {
    check_outcome(outcome, data, treat)
  }
}

#' Check Scores
#'
#' Checks that prognostic scores are the same length as data
#'
#' @inheritParams auto_stratify
#' @param prognostic_scores, a numeric vector
#'
#' @return nothing
#' @keywords internal
check_scores <- function(prognostic_scores, data, outcome) {
  if (length(prognostic_scores) != dim(data)[1]) {
    stop("prognostic scores must be the same length as the data")
  }
  if (is.null(outcome)) {
    stop("If specifying prognostic scores, outcome must be specified")
  }
}

#' Check Outcome
#'
#' Checks that outcome is a string which is a column in the data
#'
#' @inheritParams auto_stratify
#'
#' @return nothing
#' @keywords internal
check_outcome <- function(outcome, data, treat) {
  if (!(is.character(outcome) & length(treat) == 1)) {
    stop("outcome must be a single string")
  }
  if (!is.element(outcome, colnames(data))) {
    stop("outcome must be a column in data")
  }
}

#' Check Prognostic Formula
#'
#' @inheritParams auto_stratify
#' @param prog_formula a formula for prognostic score
#'
#' @return nothing
#' @keywords internal
check_prognostic_formula <- function(prog_formula, data, outcome, treat) {
  if (treat == all.vars(prog_formula)[1]) {
    stop("prognostic formula must model outcome, not treatment")
  }

  if (is.element(treat, all.vars(prog_formula))) {
    RHS <- as.character(prog_formula)[3]
    allowed <- paste("-", treat)

    # check that every time treat appears on RHS, it appears as "- treat"
    if (lengths(regmatches(RHS, gregexpr(allowed, RHS))) != lengths(regmatches(RHS, gregexpr(treat, RHS)))) {
      stop("prognostic formula must model the outcome in the absence of treatment; the treatment assignment may not be a predictor for the prognostic score model")
    }
  }
  if (!is.null(outcome)) {
    if (outcome != all.vars(prog_formula)[1]) {
      stop("prognostic formula must model outcome")
    }
  }
}


#' Check if a vector is binary
#'
#' return TRUE if the input is logical or if it contains only 0's and 1's
#'
#' @param col a column from a data frame
#'
#' @return logical
#' @keywords internal
is_binary <- function(col) {
  if (is.logical(na.omit(col))) {
    return(TRUE)
  }
  if (all(is.element(na.omit(col), 0:1))) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
Any scripts or data that you put into this service are public.
stratamatch documentation built on March 31, 2022, 9:07 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
stratamatch
Stratification and Matching for Large Observational Data Sets

R/auto_stratify.R
In stratamatch: Stratification and Matching for Large Observational Data Sets

Defines functions is_binary check_prognostic_formula check_outcome check_scores check_base_inputs_auto_stratify make_autostrata_table estimate_scores fit_prognostic_model build_autostrata auto_stratify

Documented in auto_stratify build_autostrata check_base_inputs_auto_stratify check_outcome check_prognostic_formula check_scores estimate_scores fit_prognostic_model is_binary make_autostrata_table

Try the stratamatch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

stratamatch Stratification and Matching for Large Observational Data Sets

R/auto_stratify.R In stratamatch: Stratification and Matching for Large Observational Data Sets

Defines functions is_binary check_prognostic_formula check_outcome check_scores check_base_inputs_auto_stratify make_autostrata_table estimate_scores fit_prognostic_model build_autostrata auto_stratify

Documented in auto_stratify build_autostrata check_base_inputs_auto_stratify check_outcome check_prognostic_formula check_scores estimate_scores fit_prognostic_model is_binary make_autostrata_table

Try the stratamatch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

stratamatch
Stratification and Matching for Large Observational Data Sets

R/auto_stratify.R
In stratamatch: Stratification and Matching for Large Observational Data Sets