R/sim.survdata.R

Defines functions sim.survdata

Documented in sim.survdata

#' Simulating duration data for the Cox proportional hazards model
#'
#' \code{sim.survdata()} randomly generates data frames containing a user-specified number
#' of observations, time points, and covariates.  It generates durations, a variable indicating
#' whether each observation is right-censored, and "true" marginal effects.
#' It can accept user-specified coefficients, covariates, and baseline hazard functions, and it can
#' output data with time-varying covariates or using time-varying coefficients.
#'
#' @param N Number of observations in each generated data frame. Ignored if \code{X} is not \code{NULL}
#' @param T The latest time point during which an observation may fail. Failures can occur
#' as early as 1 and as late as T
#' @param type If "none" (the default) data are generated with no time-varying covariates or coefficients.
#' If "tvc", data are generated with time-varying covariates, and if "tvbeta" data are generated with time-varying
#' coefficients (see details)
#' @param hazard.fun A user-specified R function with one argument, representing time, that outputs the baseline hazard function.
#' If \code{NULL}, a baseline hazard function is generated using the flexible-hazard method as described in Harden and
#' Kropko (2018) (see details)
#' @param num.data.frames The number of data frames to be generated
#' @param fixed.hazard If \code{TRUE}, the same hazard function is used to generate each data frame. If \code{FALSE} (the default),
#' different drawn hazard functions are used to generate each data frame.  Ignored if \code{hazard.fun} is not \code{NULL} or if
#' \code{num.data.frames} is 1
#' @param knots The number of points to draw while using the flexible-hazard method to generate hazard functions (default is 8).
#' Ignored if \code{hazard.fun} is not \code{NULL}
#' @param spline If \code{TRUE} (the default), a spline is employed to smooth the generated cumulative baseline hazard, and if \code{FALSE}
#' the cumulative baseline hazard is specified as a step function with steps at the knots. Ignored if \code{hazard.fun} is not \code{NULL}
#' @param X A user-specified data frame containing the covariates that condition duration. If \code{NULL}, covariates are generated from
#' normal distributions with means given by the \code{mu} argument and standard deviations given by the \code{sd} argument
#' @param beta Either a user-specified vector containing the coefficients that for the linear part of the duration model, or
#' a user specified matrix with rows equal to \code{T} for pre-specified time-varying coefficients.
#' If \code{NULL}, coefficients are generated from normal distributions with means of 0 and standard deviations of 0.1
#' @param xvars The number of covariates to generate. Ignored if \code{X} is not \code{NULL}
#' @param mu If scalar, all covariates are generated to have means equal to this scalar. If a vector, it specifies the mean of each covariate separately,
#' and it must be equal in length to \code{xvars}. Ignored if \code{X} is not \code{NULL}
#' @param sd If scalar, all covariates are generated to have standard deviations equal to this scalar. If a vector, it specifies the standard deviation
#' of each covariate separately, and it must be equal in length to \code{xvars}. Ignored if \code{X} is not \code{NULL}
#' @param covariate Specification of the column number of the covariate in the \code{X} matrix for which to generate a simulated marginal effect (default is 1).
#' The marginal effect is the difference in expected duration when the covariate is fixed at a high value and the expected duration when the covariate is fixed
#' at a low value
#' @param low The low value of the covariate for which to calculate a marginal effect
#' @param high The high value of the covariate for which to calculate a marginal effect
#' @param compare The statistic to employ when examining the two new vectors of expected durations (see details).  The default is \code{median}
#' @param censor The proportion of observations to designate as being right-censored
#' @param censor.cond Whether to make right-censoring conditional on the covariates (default is \code{FALSE}, but see details)
#' @details The \code{\link[coxed]{sim.survdata}} function generates simulated duration data. It can accept a user-supplied
#' hazard function, or else it uses the flexible-hazard method described in Harden and Kropko (2018) to generate
#' a hazard that does not necessarily conform to any parametric hazard function. It can generate data with time-varying
#' covariates or coefficients. For time-varying covariates \code{type="tvc"} it employs the permutational algorithm by Sylvestre and Abrahamowicz (2008).
#' For time-varying coefficients with \code{type="tvbeta"}, the first beta coefficient that is either supplied by the user or generated by
#' the function is multiplied by the natural log of the failure time under consideration.
#'
#' If \code{fixed.hazard=TRUE}, one baseline hazard is generated and the same function is used to generate all of the simulated
#' datasets. If \code{fixed.hazard=FALSE} (the default), a new hazard function is generated with each simulation iteration.
#'
#' The flexible-hazard method employed when \code{hazard.fun} is \code{NULL} generates a unique baseline hazard by fitting a curve to
#' randomly-drawn points. This produces a wide variety
#' of shapes for the baseline hazard, including those that are unimodal, multimodal, monotonically increasing or decreasing, and many other
#' shapes. The method then generates a density function based on each baseline hazard and draws durations from it in a way that circumvents
#' the need to calculate the inverse cumulative baseline hazard. Because the shape of the baseline hazard can vary considerably, this approach
#' matches the Cox model’s inherent flexibility and better corresponds to the assumed data generating process (DGP) of the Cox model. Moreover,
#' repeating this process over many iterations in a simulation produces simulated samples of data that better reflect the considerable
#' heterogeneity in data used by applied researchers. This increases the generalizability of the simulation results. See Harden and Kropko (2018)
#' for more detail.
#'
#' When generating a marginal effect, first the user specifies a covariate by typing its column number in the \code{X} matrix into the \code{covariate}
#' argument, then specifies the high and low values at which to fix this covariate.  The function calculates the differences in expected duration for each
#' observation when fixing the covariate to the high and low values.  If \code{compare} is \code{median}, the function reports the median of these differences,
#' and if \code{compare} is \code{mean}, the function reports the median of these differences, but any function may be employed that takes a vector as input and
#' outputs a scalar.
#'
#' If \code{censor.cond} is \code{FALSE} then a proportion of the observations specified by \code{censor} is randomly and uniformly selected to be right-censored.
#' If \code{censor.cond} is \code{TRUE} then censoring depends on the covariates as follows: new coefficients are drawn from normal distributions with mean 0 and
#' standard deviation of 0.1, and these new coefficients are used to create a new linear predictor using the \code{X} matrix.  The observations with the largest
#' (100 x \code{censor}) percent of the linear predictors are designated as right-censored.
#' @return Returns an object of class "\code{simSurvdata}" which is a list of length \code{num.data.frames} for each iteration of data simulation.
#' Each element of this list is itself a list with the following components:
#' \tabular{ll}{
#' \code{data} \tab The simulated data frame, including the simulated durations, the censoring variable, and covariates\cr
#' \code{xdata} \tab The simulated data frame, containing only covariates \cr
#' \code{baseline} \tab A data frame containing every potential failure time and the baseline failure PDF,
#' baseline failure CDF, baseline survivor function, and baseline hazard function at each time point. \cr
#' \code{xb} \tab The linear predictor for each observation \cr
#' \code{exp.xb} \tab The exponentiated linear predictor for each observation \cr
#' \code{betas} \tab The coefficients, varying over time if \code{type} is "tvbeta" \cr
#' \code{ind.survive} \tab An (\code{N} x \code{T}) matrix containing the individual survivor function at
#' time t for the individual represented by row n   \cr
#' \code{marg.effect} \tab The simulated marginal change in expected duration comparing the high and low values of
#' the variable specified with \code{covariate} \cr
#' \code{marg.effect.data} \tab The \code{X} matrix and vector of durations for the low and high conditions \cr
#' }
#' @references Harden, J. J. and Kropko, J. (2018). Simulating Duration Data for the Cox Model.
#' \emph{Political Science Research and Methods} \url{https://doi.org/10.1017/psrm.2018.19}
#'
#' Sylvestre M.-P., Abrahamowicz M. (2008) Comparison of algorithms to generate event times conditional on time-dependent covariates. \emph{Statistics in Medicine} \strong{27(14)}:2618–34.
#' @author Jonathan Kropko <jkropko@@virginia.edu> and Jeffrey J. Harden <jharden2@@nd.edu>
#' @export
#' @examples
#' simdata <- sim.survdata(N=1000, T=100, num.data.frames=2)
#' require(survival)
#' data <- simdata[[1]]$data
#' model <- coxph(Surv(y, failed) ~ X1 + X2 + X3, data=data)
#' model$coefficients ## model-estimated coefficients
#' simdata[[1]]$betas ## "true" coefficients
#'
#' ## User-specified baseline hazard
#' my.hazard <- function(t){ #lognormal with mean of 50, sd of 10
#' dnorm((log(t) - log(50))/log(10)) /
#'      (log(10)*t*(1 - pnorm((log(t) - log(50))/log(10))))
#' }
#' simdata <- sim.survdata(N=1000, T=100, hazard.fun = my.hazard)
#'
#' ## A simulated data set with time-varying covariates
#' \dontrun{simdata <- sim.survdata(N=1000, T=100, type="tvc", xvars=5, num.data.frames=1)
#' summary(simdata$data)
#' model <- coxph(Surv(start, end, failed) ~ X1 + X2 + X3 + X4 + X5, data=simdata$data)
#' model$coefficients ## model-estimated coefficients
#' simdata$betas ## "true" coefficients
#' }
#'
#' ## A simulated data set with time-varying coefficients
#' simdata <- sim.survdata(N=1000, T=100, type="tvbeta", num.data.frames = 1)
#' simdata$betas
sim.survdata <- function(N=1000, T=100, type="none", hazard.fun = NULL, num.data.frames = 1,
                         fixed.hazard = FALSE, knots = 8, spline = TRUE,
                         X=NULL, beta=NULL, xvars=3, mu=0, sd=.5,
                         covariate=1, low=0, high=1, compare=median,
                         censor = .1, censor.cond = FALSE){
     if(!is.null(X)){
          N <- nrow(X)
          xvars <- ncol(X)
     }
     ifelse(is.null(hazard.fun),
            baseline <- baseline.build(T=T, knots=knots, spline=spline),
            baseline <- user.baseline(hazard.fun, T))

     result <- lapply(1:num.data.frames, FUN=function(i){

          if(!fixed.hazard & is.null(hazard.fun)) baseline <- baseline.build(T=T, knots=knots, spline=spline)

          xb <- generate.lm(baseline, X=X, beta=beta, N=N, xvars=xvars, mu=mu, sd=sd, censor=censor, type=type)
          data <- xb$data
          if(xb$tvc) xdata <- dplyr::select(data, -id, -failed, -start, -end)
          if(!xb$tvc) xdata <- dplyr::select(data, -y)
          me <- make.margeffect(baseline, xb, covariate, low, high, compare=compare)
          if(type=="none" | type=="tvbeta") ifelse(censor.cond,
                                  data$failed <- !censor.x(xdata, censor=censor),
                                  data$failed <- !(runif(N) < censor))
          if(!is.null(hazard.fun)){
               data$failed[data$y==T] <- TRUE
               r <- sum(data$y==T)
               warning(paste(r, c("additional observations right-censored because the user-supplied hazard function
                                  is nonzero at the latest timepoint. To avoid these extra censored observations, increase T")))
          }
          if(!is.null(beta)){
               exp.low <- baseline$failure.CDF[1]*N
               exp.hi <- baseline$survivor[T-1]*N
               obs.low <- sum(data$y==1)
               obs.hi <- sum(data$y==T)
               if((obs.hi + obs.low) > 1.2*(exp.hi + exp.low)){
                    warning(paste(c(obs.hi + obs.low, "observations have drawn durations
                                    at the minimum or maximum possible value. Generating coefficients
                                    and other quantities of interest are unlikely to be returned
                                    by models due to truncation. Consider making user-supplied coefficients
                                    smaller, making T bigger, or decreasing the variance of the X variables."),
                                  collapse = " "))
               }
          }
          return(list(data = data,
                      xdata = xdata,
                      baseline=baseline,
                      xb = xb$XB,
                      exp.xb = xb$exp.XB,
                      betas = xb$beta,
                      ind.survive = xb$survmat,
                      marg.effect = me$marg.effect,
                      marg.effect.data = list(low = me$data.low,
                                              high = me$data.high)))
     })
     if(num.data.frames == 1){
          result <- result[[1]]
          class(result) <- "simSurvdata"
     } else{
          class(result) <- c("simSurvdata", "simSurvdataList")
     }
     return(result)
}

Try the coxed package in your browser

Any scripts or data that you put into this service are public.

coxed documentation built on Aug. 2, 2020, 9:07 a.m.