simData: Simulate data from a multivariate joint model

View source: R/simData.R

simDataR Documentation

Simulate data from a multivariate joint model

Description

Simulate multivariate longitudinal and survival data from a joint model specification, with potential mixture of response families. Implementation is similar to existing packages (e.g. joineR, joineRML).

Usage

simData(
  n = 250,
  ntms = 10,
  fup = 5,
  family = list("gaussian", "gaussian"),
  sigma = list(0.16, 0.16),
  beta = rbind(c(1, 0.1, 0.33, -0.5), c(1, 0.1, 0.33, -0.5)),
  D = NULL,
  gamma = c(0.5, -0.5),
  zeta = c(0.05, -0.3),
  theta = c(-4, 0.2),
  cens.rate = exp(-3.5),
  regular.times = TRUE,
  dof = Inf,
  random.formulas = NULL,
  disp.formulas = NULL,
  return.ranefs = FALSE
)

Arguments

n

the number of subjects

ntms

the number of time points

fup

the maximum follow-up time, such that t = [0, ..., fup] with length ntms. In instances where subject i doesn't fail before fup, their censoring time is set as fup + 0.1.

family

a K-list of families, see details.

sigma

a K-list of dispersion parameters corresponding to the order of family, and matching disp.formulas specification; see details.

beta

a K \times 4 matrix specifying fixed effects for each K parameter, in the order (Intercept), time, continuous, binary.

D

a positive-definite matrix specifying the variance-covariance matrix for the random effects. If not supplied an identity matrix is assumed.

gamma

a K-vector specifying the association parameters for each longitudinal outcome.

zeta

a vector of length 2 specifying the coefficients for the baseline covariates in the survival sub-model, in the order of continuous and binary.

theta

parameters to control the failure rate, see baseline hazard.

cens.rate

parameter for rexp to generate censoring times for each subject.

regular.times

logical, if regular.times = TRUE (the default), then every subject will have the same follow-up times defined by seq(0, fup, length.out = ntms). If regular.times = FALSE then follow-up times are set as random draws from a uniform distribution with maximum fup.

dof

integer, specifies the degrees of freedom of the multivariate t-distribution used to generate the random effects. If specified, this t-distribution is used. If left at the default dof=Inf then the random effects are drawn from a multivariate normal distribution.

random.formulas

allows user to specify if an intercept-and-slope (~ time) or intercept-only (~1) random effects structure should be used on a response-by-response basis. Defaults to an intercept-and-slope for all responses.

disp.formulas

allows user to specify the dispersion model simulated. Intended use is to allow swapping between intercept only (the default) and a time-varying one (~ time). Note that this should be a K-list of formula objects, so if only one dispersion model is wanted, then an intercept-only should be specified for remaining sub-models. The corresponding item in list of sigma parameters should be of appropriate size. Defaults to an intercept-only model.

return.ranefs

a logical determining whether the true random effects should be returned. This is largely for internal/simulation use. Default return.ranefs = FALSE.

Details

simData simulates data from a multivariate joint model with a mixture of families for each k=1,\dots,K response. The specification of family changes requisite dispersion parameter sigma, if applicable. The family list can (currently) contain:

"gaussian"

Simulated with identity link, corresponding item in sigma will be the variance.

"poisson"

Simulated with log link, corresponding dispersion in sigma can be anything, as it doesn't impact simulation.

"binomial"

Simulated with logit link, corresponding dispersion in sigma can be anything, as it doesn't impact simulation.

"negbin"

Simulated with a log link, corresponding item in sigma will be the overdispersion defined on the log scale. Simulated variance is \mu+\mu^2/\varphi.

"genpois"

Simulated with a log link, corresponding item in sigma will be the dispersion. Values < 0 correspond to under-dispersion, and values > 0 over- dispersion. See rgenpois for more information. Simulated variance is (1+\varphi)^2\mu.

"Gamma"

Simulated with a log link, corresponding item in sigma will be the shape parameter, defined on the log-scale.

Therefore, for families "negbin", "Gamma", "genpois", the user can define the dispersion model desired in disp.formulas, which creates a data matrix W. For the "negbin" and "Gamma" cases, we define \varphi_i=\exp\{W_i\sigma_i\} (i.e. the exponent of the linear predictor of the dispersion model); and for "genpois" the identity of the linear is used.

Value

A list of two data.frames: One with the full longitudinal data, and another with only survival data. If return.ranefs=TRUE, a matrix of the true b values is also returned. By default (i.e. no arguments provided), a bivariate Gaussian set of joint data is returned.

Baseline hazard

When simulating the survival time, the baseline hazard is a Gompertz distribution controlled by theta=c(x,y):

\lambda_0(t) = \exp{x + yt}

where y is the shape parameter, and the scale parameter is \exp{x}.

Author(s)

James Murray (j.murray7@ncl.ac.uk).

References

Austin PC. Generating survival times to simulate Cox proportional hazards models with time-varying covariates. Stat Med. 2012; 31(29): 3946-3958.

See Also

joint

Examples

# 1) A set of univariate data ------------------------------------------
beta <- c(2.0, 0.33, -0.25, 0.15)
# Note that by default arguments are bivariate, so need to specify the univariate case
univ.data <- simData(beta = beta,    
                     gamma = 0.15, sigma = list(0.2), family = list("gaussian"), 
                     D = diag(c(0.25, 0.05)))
                     
# 2) Univariate data, with failure rate controlled ---------------------
# In reality, many facets contribute to the simulated failure rate, in 
# this example, we'll just atler the baseline hazard via 'theta'.
univ.data.highfail <- simData(beta = beta,
                              gamma = 0.15, sigma = list(0.0), family = list("poisson"),
                              D = diag(c(0.40, 0.08)), theta = c(-2, 0.1))

# 3) Trivariate (K = 3) mixture of families with dispersion parameters -
beta <- do.call(rbind, replicate(3, c(2, -0.1, 0.1, -0.2), simplify = FALSE))
gamma <- c(0.3, -0.3, 0.3)
D <- diag(c(0.25, 0.09, 0.25, 0.05, 0.25, 0.09))
family <- list('gaussian', 'genpois', 'negbin')
sigma <- list(.16, 1.5, log(1.5))
triv.data <- simData(ntms=15, family = family, sigma = sigma, beta = beta, D = D, 
                     gamma = gamma, theta = c(-3, 0.2), zeta = c(0,-.2))

# 4) K = 4 mixture of families with/out dispersion ---------------------
beta <- do.call(rbind, replicate(4, c(2, -0.1, 0.1, -0.2), simplify = FALSE))
gamma <- c(-0.75, 0.3, -0.6, 0.5)
D <- diag(c(0.25, 0.09, 0.25, 0.05, 0.25, 0.09, 0.16, 0.02))
family <- list('gaussian', 'poisson', 'binomial', 'gaussian')
sigma <- list(.16, 0, 0, .05) # 0 can be anything here, as it is ignored internally.
mix.data <- simData(ntms=15, family = family, sigma = sigma, beta = beta, D = D, gamma = gamma,
                    theta = c(-3, 0.2), zeta = c(0,-.2))
                    
# 5) Bivariate joint model with two dispersion models. -----------------
disp.formulas <- list(~time, ~time)          # Two time-varying dispersion models
sigma <- list(c(0.00, -0.10), c(0.10, 0.15)) # specified in form of intercept, slope
D <- diag(c(.25, 0.04, 0.50, 0.10))
disp.data <- simData(family = list("genpois", "negbin"), sigma = sigma, D = D,
                     beta = rbind(c(0, 0.05, -0.15, 0.00), 1 + c(0, 0.25, 0.15, -0.20)),
                     gamma = c(1.5, 1.5),
                     disp.formulas = disp.formulas, fup = 5)            

# 6) Trivariate joint model with mixture of random effects models ------
# It can be hard to e.g. fit a binomial model on an intercept and slope, since e.g.
# glmmTMB might struggle to accurately fit it (singular fits, etc.). To that end, could
# swap the corresponding random effects specification to be an intercept-only.
family <- list("gaussian", "binomial", "gaussian")
# A list of formulae, even though we want to change the second sub-model's specification
# we need to specify the rest of the items, too (same as disp.formulas, sigma).
random.formulas <- list(~time, ~1, ~time)
beta <- rbind(c(2, -0.2, 0.5, -0.25), c(0, 0.5, 1, -1), c(-2, 0.2, -0.5, 0.25))
# NOTE that the specification of RE matrix will need to match.
D <- diag(c(0.25, 0.09, 1, 0.33, 0.05))
# Simulate data, and return REs as a sanity check...
mix.REspec.data <- simData(beta = beta, D = D, family = family,
                           gamma = c(-0.5, 1, 0.5), sigma = list(0.15, 0, 0.15),
                           random.formulas = random.formulas, return.ranefs = TRUE)

gmvjoint documentation built on Oct. 6, 2024, 1:07 a.m.