bagged: Bootstrap estimates for interaction terms in exploratory...

View source: R/bagged.R

baggedR Documentation

Bootstrap estimates for interaction terms in exploratory subgroup analyses

Description

Perform exploratory subgroup analysis using bootstrap bias adjustment as described in Rosenkranz (2016). The function fits a GLM or a Cox model in the data and then performs bootstrap samples to correct for bias.

Usage

bagged(resp, trt, subgr, covars = NULL, data, 
       fitfunc = c("lm", "glm", "glm.nb", "survreg", "coxph", "rlm"),
       event, exposure, 
       level = 0.1,
       B = 2000, mc.cores = 1, stratified = TRUE, 
       select.by = c("BIC", "AIC"), quietly = FALSE, ...)

Arguments

resp

Character giving the name of the response variable. The variable can be either defined in the global environment or in the data-set data specified below. For interactive use it is also possible to use unquoted names (i.e. bagged(resp,...) instead of bagged("resp",...)), avoid this for non-interactive use of the function.

trt

Character giving the name of the treatment variable. The variable can be either defined in the global environment or in the data-set data specified below. Note that the treatment variable itself needs to be defined as a numeric variable, with control coded as 0, and treatment coded as 1. For interactive use it is also possible to use unquoted names (as for the resp argument. see above).

subgr

Character vector giving the variable names in data to use as subgroup identifiers. Note that the subgroup variables in data need to be numeric 0-1 variables.

covars

Formula, specifying additional covariates to be included in the models (need to be available in data).

data

Data frame containing the variables referenced in resp, trt, subgr and covars (and possibly event and exposure).

fitfunc

Model fitting functions. Currrently one of 'lm', 'glm', 'glm.nb', 'survreg', 'coxph' or 'rlm'.

event

Character giving the name of the event variable. Has to be specified when using fit functions 'survreg' and 'coxph'. The variable can be either defined in the global environment or in the data-set data.

exposure

Character giving the name of the exposure variable, needed for negative binomial regression, when using fit functions 'glm.nb'. This is typically the time each patient is exposed to the drug. The fitted model uses the call glm.nb(.~.+offset(log(exposure))). The variable needs to be defined either in the global environment or in the data-set data.

level

Significance level for confidence intervals will be calculated for treatment effect estimates.

B

A numeric input. The number of bootstrap samples to perform.

mc.cores

A numeric input. This argument is passed to the mclapply function to perform computations in parallel. If mc.cores = 1, then lapply is used.

stratified

Should the bootstrap resampling be done stratifying by treatment group? (default: TRUE).

select.by

Should the model selection be done using BIC or AIC? (default: BIC).

quietly

A logical. By default (quietly = FALSE), bagged prints messages when a subgroup is not selected in any bootstrap sample or when the variance for the bootstrap estimate of one or more subgroups could not be calculated. If TRUE, these messages are not printed.

...

Other arguments passed to the model fitting function.

Details

In the generalized linear model case, P generalized linear models are fitted such that

M_p: h(μ_{pi}) = α_p + β_p z_i + (γ_p + δ_p z_i) s_{pi} + ∑_{k = 1}^{K} τ_k x_{ik}

where h is the link function, μ_{pi} = E_p[Y_i] is the expectation of the response Y_i under model M_p and x_{ik} are additional covariates we control for. For survival data, a proportional hazards model can be used:

M_p: λ_{pi}(t)= λ_{p0}(t) \exp≤ft\{β_p z_i + (γ_p + δ_p z_i)s_{pi} + ∑_{k = 1}^{K} τ_k x_{ik} \right\}

The focus of estimation is the difference in the treatment effect between a subgroup and its complement, the treatment by subgroup interaction δ_p.

Consider now B bootstrap samples from the original data. Let (Y_{b1}^*, ..., Y_{bN}^*) be a bootstrap sample from the original data. Let (z_{b1}^*, ..., z_{bN}^*), (s_{b1}^*, ..., s_{bN}^*), and (x_{b1k}^*, ..., x_{bNk}^*) be the corresponding treatment indicators, group indicators, and covariates in the bootstrap samples, respectively. For each p=1, ..., P and b=1, ..., B we fit the model:

h(E_p[Y_{bi}^*]) = α^{*}_{bp} + β^{*}_{bp} z_{bi}^* + (γ^{*}_{bp} + δ^{*}_{bp} z_{bi}^*) s^{*}_{bpi} + ∑_{k = 1}^{K} τ_k x^*_{ik}

An estimator of δ_{p} given that subgroup S_p provided the best fit can be calculated as

\overline{δ}^{*}_p = \frac{∑_{b=1}^B u_{bp} \hat{δ}^{*}_{bp} }{∑_{b=1}^B u_{bp}}

where \hat{δ}^{*}_{bp} is the usual estimator of δ_{bp} and u_{bp} = 1 if the subgroup p provides the best fit for bootstrap sample b and 0 otherwise.

An bias-reduced estimator of δ_{p} can be obtained as:

\check{δ}^{*}_p = 2 \hat{δ}_p - \overline{δ}^{*}_p

A bias-reduced estimator with decreased variability is obtained by replacing the maximum likelihood estimator by the bagging estimator \hat{δ}^{*}_{bp}:

\hat{δ}^{*}_p = \frac{1}{B}∑_{b=1}^B \hat{δ}^{*}_{bp}

so that the bias reduced estimator is

\tilde{δ}^{*}_p = 2 \hat{δ}^*_p - \overline{δ}^{*}_p

Value

An object of class subtee. A list containing a dataframe (model_fits) with the estimates using the original data, and a dataframe (bagged_results) with the bootstrap estimates with their percent of selection. The latter contains the following columns: 'percent_selected': the relative proportion for selection of each subgroup, 'bagg': the (uncorrected) bagged estimate \hat{δ}^{*}_p 'boot_red': the bias reduced bootstrap estimate \check{δ}^{*}_p 'bagg_red': the bias reduced bootstrap estimate with decreased variability by bagging \tilde{δ}^{*}_p and the respective standard deviations of the estimates.

References

Ballarini, N. Thomas, M., Rosenkranz, K. and Bornkamp, B. (2021) "subtee: An R Package for Subgroup Treatment Effect Estimation in Clinical Trials" Journal of Statistical Software, 99, 14, 1-17, doi: 10.18637/jss.v099.i14

Rosenkranz, G.(2016) "Exploratory subgroup analysis in clinical trials by model selection", Biometrical Journal, 58, 1007-1259. doi: 10.1002/bimj.201500147

See Also

glm, coxph

Examples

## Not run: 
## toy example calls using the simulated datnorm data-set without
## treatment and subgroup effect, see ?datnorm for details
data(datnorm)
head(datnorm)

## first need to create candidate subgroups (if not already defined in data-set)
## here generate candidate subgroups manually (need to be numeric 0-1 variables)
groups <- data.frame(labvalL.5=as.numeric(datnorm$labvalue < 0.5),
                     regUS=as.numeric(datnorm$region == "US"),
                     hgtL175=as.numeric(datnorm$height < 175))
fitdat <- cbind(datnorm, groups) # bind subgroup variables to main data
## subgroups of interest
subgr <- c("labvalL.5", "regUS", "hgtL175")
res <- bagged(resp = "y", trt = "treat", subgr = subgr, data = fitdat, 
                       covars = ~ x1 + x2, fitfunc = "lm")
res


## generate candidate subgroups using the subbuild function
## semi-automatically i.e. some groups specified directly (height and
## smoker), for region and labvalue subbuild generates subgroups (see
## ?subbuild)
cand.groups <- subbuild(datnorm, height < 175, smoker == 1, region, labvalue)
head(cand.groups)
fitdat <- cbind(datnorm, cand.groups) 
subgr <- colnames(cand.groups)
res <- bagged(resp = "y", trt = "treat", subgr = subgr, data = fitdat, 
                       covars = ~ x1 + x2, fitfunc = "lm")
res

## toy example call for binary data on simulated datbin data-set
data(datbin)
cand.groups <- subbuild(datbin, height < 175, smoker == 1, region, labvalue)
fitdat <- cbind(datbin, cand.groups) 
subgr <- colnames(cand.groups)
res <- bagged(resp = "y", trt = "treat", subgr = subgr, data = fitdat, 
                       covars = ~ x1 + x2, fitfunc = "glm", 
                       family = binomial(link = "logit"))
## scale of the treatment effect estimate: difference on log-odds scale
res

## toy example call for parametric and semi-parametric survival data on
## datsurv data-set
data(datsurv)
cand.groups <- subbuild(datsurv, height < 175, smoker == 1, region, labvalue)
fitdat <- cbind(datsurv, cand.groups)
subgr <- colnames(cand.groups)
res.survreg <- bagged(resp = "y", trt = "treat", subgr = subgr, data = fitdat,
                               covars = ~ x1 + x2,
                               fitfunc = "survreg", event = "event", dist = "exponential")
## parametric survival model (here exponential distribution)
## scale of treatment effect estimate: log scale (see ?survreg for details)
res.survreg

# Decreased B for a reduction in computational time
res.cox <- bagged(resp = "y", trt = "treat", subgr = subgr, data = fitdat,
                  fitfunc = "coxph", event = "event", B = 20) # B=2000 should be used
## scale of treatment effect estimate: difference in log-hazard rate
res.cox
## toy example call overdispersed count data on datcount data-set
data(datcount)
cand.groups <- subbuild(datcount, height < 175, smoker == 1, region, labvalue)
fitdat <- cbind(datcount, cand.groups)
subgr <- colnames(cand.groups)
# Decreased B for a reduction in computational time
res <- bagged(resp = "y", trt = "treat", subgr = subgr, data = fitdat,
              fitfunc = "glm.nb", exposure = "exposure", B = 20) # B=2000 should be used
## scale of treatment effect estimate: difference on log scale
res
## End(Not run)

subtee documentation built on March 22, 2022, 5:07 p.m.