R/datasim.R
In icensmis: Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes

Documented in datasim

#' Simulate data including multiple outcomes from error-prone diagnostic tests 
#' or self-reports
#' 
#' @description This function simulates a data of N subjects with misclassified 
#'   outcomes, assuming each subject receives a sequence of pre-scheduled tests 
#'   for disease status ascertainment. Each test is subject to error, 
#'   characterized by sensitivity and specificity. An exponential distribution 
#'   is assumed for the time to event of interest. Three kinds of covariate 
#'   settings can be generated: one sample setting, two group setting, and 
#'   continuous covariates setting with each covariate sampled from i.i.d. N(0, 
#'   1). Two missing mechanisms can be assumed, namely MCAR and NTFP. The MCAR 
#'   setting assumes that each test is subject to a constant, independent 
#'   probability of missingness. The NTFP mechanism includes two types of 
#'   missingness - (1) incorporates a constant, independent, probability of 
#'   missing for each test prior to the first positive test result; and (2) all 
#'   test results after first positive are missing. The simulated data is in 
#'   longitudinal form with one row per test time.
#'   
#'   Covariate values, by default, are assumed to be constant. However, this 
#'   function can simulate a special case of time varying covariates. Under time
#'   varying covariates setting, each subject is assumed to have a change time 
#'   point, which is sampled from the visit times. We assume that each subject 
#'   has two sets of covariate values. Before his change time point, the 
#'   covariate values take from the first set, and second set after change time 
#'   point. Thus, each subject's distribution of survival time is two-piece 
#'   exponential distribution with different hazard rates.
#'   
#' @export
#' 
#' @param N total number of subjects to be simulated
#' @param blambda baseline hazard rate
#' @param testtimes a vector of pre-scheduled test times
#' @param sensitivity the sensitivity of test
#' @param specificity the specificity of test
#' @param betas a vector of regression coefficients of the same length as the 
#'   covariate vector. If betas = NULL then the simulated dataset corresponds to
#'   the one sample setting. If betas != NULL and twogroup != NULL then the 
#'   simulated dataset corresponds to the two group setting, and the first value
#'   of betas is used as the coefficient for the treatment group indicator. If 
#'   betas != NULL and twogroup = NULL, then the covariates are ~ i.i.d. N(0, 
#'   1), and the number of covariates is determined by the length of betas.
#' @param twogroup corresponds to the proportion of subjects allocated to the 
#'   baseline (reference) group in the two-group setting. For the two-group 
#'   setting, this variable should be between 0 and 1. For the one sample and 
#'   multiple (>= 2) covariate setting, this variable should be set to NULL. 
#'   That is, when betas !=NULL, set twogroup to equal the proportion of the 
#'   subjects in the baseline group to obtain a simulated dataset corresponding 
#'   to the two-group setting. Else, set twogroup=NULL to obtain either the one 
#'   sample setting (betas=NULL) or continuous covariates (betas !=NULL).
#' @param pmiss a value or a vector (must have same length as testtimes) of the 
#'   probabilities of each test being randomly missing at each test time. If 
#'   pmiss is a single value, then each test is assumed to have an identical 
#'   probability of missingness.
#' @param pcensor a value or a vector (must have same length as testtimes) of
#' the interval probabilities of censoring time at each interval, 
#' assuming censoring process
#' is independent on other missing mechanisms. If it is the single value, then
#' we assume same interval probabilities as the value. The sum of pcensor (or 
#' pcensor * length(testtimes) if it is single value) must be <= 1. For example,
#' if pcensor = c(0.1, 0.2), then it means the the probabilities of censoring time
#' in first and second intervals are 0.1, 0.2, and the probability of not being
#' censored is 0.7.
#' @param design missing mechanism: "MCAR" or "NTFP"
#' @param negpred baseline negative predictive value, i.e. the probability of being 
#'   truely disease free for those who were tested (reported) as disease free at
#'   baseline. If baseline screening test is perfect, then negpred = 1.
#' @param time.varying indicator whether fitting a time varying covariate model 
#'   or not
#'   
#' @return simulated longitudinal form data frame
#'   
#' @details To simulate the one sample setting data, set betas to be NULL. To 
#'   simulate the two group setting data, set twogroup to equal the proportion 
#'   of the subjects in the baseline group and set betas to equal the 
#'   coefficient corresponding to the treatment group indicator(i.e. beta equals
#'   the log hazard ratio of the two groups). To simulate data with continuous 
#'   i.i.d. N(0, 1) covariates, set twogroup to be NULL and set betas to equal 
#'   the vector of coefficients of the covariates.
#'   
#' @examples
#' ## One sample setting  
#' simdata1 <- datasim(N = 1000, blambda = 0.05, testtimes = 1:8, sensitivity = 0.7,
#'   specificity = 0.98, betas = NULL, twogroup = NULL, pmiss = 0.3, design = "MCAR")
#' 
#' ## Two group setting, and the two groups have same sample sizes
#' simdata2 <- datasim(N = 1000, blambda = 0.05, testtimes = 1:8, sensitivity = 0.7,
#'   specificity = 0.98, betas = 0.7, twogroup = 0.5, pmiss = 0.3, design = "MCAR")
#'   
#' ## Three covariates with coefficients 0.5, 0.8, and 1.0
#' simdata3 <- datasim(N = 1000, blambda = 0.05, testtimes = 1:8, sensitivity = 0.7,
#'   specificity = 0.98, betas = c(0.5, 0.8, 1.0), twogroup = NULL, pmiss = 0.3,
#'   design = "MCAR", negpred = 1)
#' 
#' ## NTFP missing mechanism
#' simdata4 <- datasim(N = 1000, blambda = 0.05, testtimes = 1:8, sensitivity = 0.7,
#'   specificity = 0.98, betas = c(0.5, 0.8, 1.0), twogroup = NULL, pmiss = 0.3,
#'   design = "NTFP", negpred = 1)	 
#' 
#' ## Baseline misclassification
#' simdata5 <- datasim(N = 2000, blambda = 0.05, testtimes = 1:8, sensitivity = 0.7,
#'   specificity = 0.98, betas = c(0.5, 0.8, 1.0), twogroup = NULL, pmiss = 0.3, 
#'   design = "MCAR", negpred = 0.97)  
#'   
#' ## Time varying covariates
#' simdata6 <- datasim(N = 1000, blambda = 0.05, testtimes = 1:8, sensitivity = 0.7,
#'   specificity = 0.98, betas = c(0.5, 0.8, 1.0), twogroup = NULL, pmiss = 0.3,
#'   design = "MCAR", negpred = 1, time.varying = TRUE)  
#'   
#' @useDynLib icensmis
#' @importFrom Rcpp evalCpp
#' @importFrom stats model.matrix optim pnorm qnorm rbinom rexp rnorm runif sd
#' 
datasim <- function(N, blambda, testtimes, sensitivity, specificity, 
                     betas = NULL, twogroup = NULL, pmiss = 0, pcensor = 0, design = "MCAR",
                     negpred = 1, time.varying = F) {
  
  #############################################################################
  # Input check
  ############################################################################# 
  ## type check
  stopifnot(is.numeric(N), is.numeric(N), is.numeric(testtimes),
            is.numeric(sensitivity), is.numeric(specificity), is.numeric(pmiss),
            is.numeric(negpred), is.logical(time.varying), is.character(design))
  ## range check
  stopifnot(length(N) == 1, N >= 1, is.finite(N), length(sensitivity) == 1, 
            sensitivity >= 0, sensitivity <= 1, length(specificity) == 1,
            specificity >= 0, specificity <= 1, length(negpred) == 1, negpred >= 0,
            negpred <= 1, all(testtimes > 0), all(is.finite(testtimes)),
            length(blambda) == 1, blambda > 0, is.finite(blambda),
            length(pmiss) == 1, pmiss < 1, pmiss >= 0, 
            design %in% c("MCAR", "NTFP"))
  if (!is.null(betas)) stopifnot(is.numeric(betas), is.finite(betas))
  if (!is.null(twogroup)) stopifnot(is.numeric(twogroup), length(twogroup) == 1, 
                                twogroup > 0, twogroup < 1)
  
  # check pcensor
  if (!(length(pcensor) %in% c(1, length(testtimes)))) 
    stop("length of pcensor should be 1 or same as survivals")
  if (any(pcensor < 0)) stop("pcensor can not be negative values")
  if (length(pcensor) == 1) {
    if ((pcensor * length(testtimes)) > 1) 
      stop("Total censoring probabilities exceeding 1, see pcensor definition")
  } else {
    if (sum(pcensor) > 1)
      stop("Total censoring probabilities exceeding 1, see pcensor definition")
  }
  
  
  #############################################################################
  # Simulate covariates and event time
  #############################################################################
  testtimes <- sort(unique(testtimes))
  nbeta <- length(betas)
  ntest <- length(testtimes)
  ID <- rep(1:N, each = ntest)
  time <- rep(testtimes, times = N)
  if (!time.varying) {
    if (is.null(betas)) {
      covm <- NULL
      ET <- rexp(N, blambda)
    } else {
      if (!is.null(twogroup)) {
        nbeta <- 1
        covm <- matrix(rbinom(N, 1, 1 - twogroup), ncol = 1)
        colnames(covm) <- "group"
        lambda <- blambda*exp(c(covm*betas[1]))
      } else {
        covm <- matrix(rnorm(nbeta * N), ncol = nbeta)
        colnames(covm) <- paste0("cov", 1:nbeta)
        lambda <- blambda*exp(c(covm%*%betas))
      }
      ET <- rexp(N, lambda)     
      covm <- covm[ID, , drop = F]
    }     
  } else {
    if (is.null(betas)) stop("missing betas for time varying model")
    if (!is.null(twogroup)) 
      warning("two group setting is not supported for time varying model,
              twogroup = NULL assumed")
    chgtime <- sample(testtimes, N, replace = T)
    covm1 <- matrix(rnorm(nbeta*N), ncol = nbeta)
    covm2 <- matrix(rnorm(nbeta*N), ncol = nbeta)
    lambda1 <- blambda*exp(c(covm1%*%betas))
    lambda2 <- blambda*exp(c(covm2%*%betas))
    samplet <- function(ct, lam1, lam2) {
      x <- runif(1)
      cumhaz <- -log(1-x)
      ifelse(cumhaz < lam1*ct, cumhaz/lam1, (cumhaz - lam1*ct)/lam2 + ct)  
    }
    ET <- sapply(1:N, function(i) samplet(chgtime[i], lambda1[i], lambda2[i]))
    chgtime <- rep(chgtime, each = ntest)
    p1 <- time < chgtime
    p2 <- time >= chgtime
    covm <- p1*covm1[ID, , drop = F] + p2*covm2[ID, , drop = F]
    colnames(covm) <- colnames(covm1) <- paste0("cov", 1:nbeta)
  }
  ## baseline misclassification
  ET[rbinom(N, 1, 1 - negpred) == 1] <- 0
  ET <- ET[ID]
  
  #############################################################################
  # Simulate test result and apply missing
  #############################################################################  
  occur <- time > ET
  probs   <- ifelse(occur, sensitivity, 1 - specificity)
  result  <- rbinom(length(occur), 1, probs)
  if (is.null(covm)) {
    data <- data.frame(ID, testtime = time, result = result)
  } else {
    data <- data.frame(ID, covm, testtime = time, result = result)
  }
  notmiss <- rbinom(nrow(data), 1, 1 - pmiss)
  data <- data[notmiss == 1, ]
  
  #############################################################################
  # Apply NTFP
  ############################################################################# 
  if (design == "NTFP") {
    afterpos <- function(x) {
      npos <- cumsum(x == 1)
      (npos == 0) | (npos == 1 & x == 1)    
    }
    keep <- unlist(tapply(data$result, data$ID, afterpos))
    data <- data[keep, ]
  }
  
  #############################################################################
  # Apply censoring
  ############################################################################# 
  if (any(pcensor > 0)) {
    if (length(pcensor) == 1) pcensor <- rep(pcensor, ntest)
    pcensor <- c(pcensor, 1 - sum(pcensor))
    ctimes <- c(testtimes, Inf)
    data1 <- split(data, data$ID)
    data1 <- lapply(data1, function(x) {
      x$censortime <- sample(ctimes, 1, prob = pcensor)
      x
    })
    data1 <- unsplit(data1, data$ID)
    #data1 <- subset(data1, testtime < censortime)
    data1 <- data1[data1$testtime < data1$censortime, ]
    drop <- which(names(data1) == "censortime")
    data <- data1[, -drop]    
  }

  #############################################################################
  # Add baseline information for time-varying model
  ############################################################################# 
  if (time.varying) {
    base <- data.frame(ID = 1:N, covm1, testtime = 0, result = 0)
    base <- base[base$ID %in% unique(data$ID), ]
    data <- rbind(data, base)
    data <- data[order(data$ID, data$testtime), ]
  }
  
  row.names(data) <- NULL
  data
}

Any scripts or data that you put into this service are public.

icensmis documentation built on Sept. 5, 2021, 5:49 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

icensmis
Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes

R/datasim.R
In icensmis: Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes

Defines functions datasim

Documented in datasim

Try the icensmis package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

icensmis Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes

R/datasim.R In icensmis: Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes

Defines functions datasim

Documented in datasim

Try the icensmis package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

icensmis
Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes

R/datasim.R
In icensmis: Study Design and Data Analysis in the Presence of Error-Prone Diagnostic Tests and Self-Reported Outcomes