knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette gives an overview of the main functions of the package, using the simulated data that ships with it.
We load the stackBagg package:
library(stackBagg)
The simulated data can be loaded via:
data("exampleData")
exampleData is a list with three elements. The first element contains the training data set and the second element contains the test data set. The train and test data sets consist of the variables id, the binary outcome E, the event times ttilde, the event-type indicator delta at ttilde (censored observations are denoted by the value 0), trueT, which denotes the binary outcome under no censoring, and lastly 20 covariates:
train <- exampleData[[1]][[1]]
head(train)
test <- exampleData[[2]][[1]]
head(test)
The third element of exampleData is the true AUC for the simulated data on the cumulative incidence of the main event at time 26.5, computed analytically under the Weibull distribution using the scale and shape parameters of both event times.
auc_true <- exampleData[[3]][[1]]
auc_true
The train data set consists of 800 individuals, of whom 167 are censored and 180 experience the event of interest by time 26.5:
n_train <- dim(train)[1]
n_train
summary(train$E)
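The number of censored observations can also be read directly off delta, since censored observations are coded as 0:

sum(train$delta == 0)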
Furthermore, 51 subjects experience the competing event within the time of interest, 26.5; these individuals are part of the control group:
library(dplyr)
train %>% subset(E == 0 & ttilde < 26.5) %>% count(delta)
The test data set consists of 200 individuals, of whom 45 are censored and 39 experience the event of interest:
n_test <- dim(test)[1]
n_test
summary(test$E)
and there are 24 subjects who experience the competing event within the time of interest, 26.5:
test %>% subset(E == 0 & ttilde < 26.5) %>% count(delta)
Next we apply stackBagg to estimate the risk of experiencing the main event using the 20 covariates in the data set. In other words, using stackBagg we are going to estimate $P(T<26.5,\delta=1|X)$ with a library of machine learning algorithms. Before applying stackBagg, we need to make sure that the data set is in the appropriate format: the event time is in the first column and the event-type indicator is in the second column.
train <- train[, -(1:2)]
head(train, 2)
test <- test[, -(1:2)]
head(test, 2)
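A quick sanity check that the rearranged data now start with the event time and the event-type indicator, as stackBagg expects:

# Dropping id and E leaves ttilde first and delta second
stopifnot(identical(names(train)[1:2], c("ttilde", "delta")))
stopifnot(identical(names(test)[1:2], c("ttilde", "delta")))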
Another argument of the function stackBagg::stackBagg is the vector of covariate names, as they appear in the data set, that we want to include in the model. As we said above, we are going to use all of them:
xnam <- names(train)[-(1:3)] xnam
We also have to specify the library of algorithms that we want to use to predict the event of interest and to form the stack. We can see all the algorithms that could potentially be included in the analysis through stackBagg::algorithms(). Let's use all of them and denote the result ens.library.
ens.library <- stackBagg::algorithms()
ens.library
Another argument of the function stackBagg::stackBagg is a list of tuning parameters for each machine learning procedure. If this argument is missing, the function defaults to the same values used for the simulations in the paper; we use those defaults for now. Additionally, we will use 5 folds, and we are going to show the results computing the weights under a Cox proportional hazards model (CoxPH) and under boosting Cox regression (Cox-Boost).
First, we model the weights under CoxPH, train the different models, and obtain their predictions on the test data set.
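For intuition, IPCW weights at the horizon 26.5 typically take the standard form below, with the censoring survival function $\hat G$ estimated here from a Cox model. This is a sketch of the usual definition; the package's exact implementation may differ:

$$
\hat w_i \;=\; \frac{\mathbb{1}(\tilde T_i \le 26.5,\ \delta_i \neq 0)}{\hat G(\tilde T_i \mid X_i)} \;+\; \frac{\mathbb{1}(\tilde T_i > 26.5)}{\hat G(26.5 \mid X_i)}
$$

Subjects whose status at 26.5 is observed are up-weighted by the inverse probability of remaining uncensored, while subjects censored before 26.5 receive weight zero.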
pred <- stackBagg::stackBagg(train.data = train, test.data = test, xnam = xnam,
                             tao = 26.5, weighting = "CoxPH", folds = 5,
                             ens.library = ens.library)
We now look at several outputs of pred.
The machine learning algorithms in the library:
pred$library
Let's first take a look at the IPCW bagging predictions of the individual algorithms and the stacked IPCW bagging on the test data set:
head(pred$prediction_ensBagg,5)
The assessment of predictive performance using the IPCW AUC is:
pred$auc_ipcwBagg
The optimal coefficients used to get the stacked IPCW Bagging:
pred$optimal_coefficients
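Conceptually, the stacked prediction is the weighted combination of the single-algorithm predictions with these coefficients. A minimal sanity-check sketch, assuming the columns of pred$prediction_ensBagg (excluding "Stack") line up with the coefficient order:

# Recombine the single-algorithm predictions with the optimal coefficients
single <- pred$prediction_ensBagg[, colnames(pred$prediction_ensBagg) != "Stack"]
manual_stack <- as.vector(as.matrix(single) %*% pred$optimal_coefficients)
head(cbind(manual = manual_stack, stack = pred$prediction_ensBagg[, "Stack"]))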
Note that the algorithms with the best predictive performance carry more weight in the stack. We check whether convergence was reached in the optimization problem of finding the optimal coefficients (0 denotes convergence, 1 otherwise) and the penalization term used:
pred$convergence
pred$penalization_term
We can check the tuning parameters used to train the algorithms:
pred$tuneparams
The GAM is trained with two degrees of freedom, 3 and 4. The LASSO parameter refers to the lambda penalization term. The parameters in the random forest refer to the number of trees (num_tree = 500) and the number of variables randomly sampled as candidates at each split (mtry = 4). The k chosen for the k-NN is 25. The SVM parameters are the cost, gamma, and kernel (radial = 1 and linear = 2); since the kernel is linear, gamma is NA, as the linear kernel does not use it. The number of neurons in the neural network is set to 1. The last values are the BART parameters: the number of trees; k, which determines the prior probability that the average of the outcome falls into (-3, 3); and q, the quantile of the prior on the error variance.
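If we wanted different values, one natural approach is to start from the list actually used and modify it before passing it back. The sketch below is hypothetical: the argument name tuneparams and the element path my_tuneparams$knn$k are assumptions inferred from the pred$tuneparams output, not the documented interface; check ?stackBagg::stackBagg for the exact argument structure.

# Hedged sketch: reuse the tuning parameters from the previous fit and tweak one
my_tuneparams <- pred$tuneparams   # values actually used above
my_tuneparams$knn$k <- 50          # hypothetical: a larger k-NN neighborhood
pred.knn50 <- stackBagg::stackBagg(train.data = train, test.data = test, xnam = xnam,
                                   tao = 26.5, weighting = "CoxPH", folds = 5,
                                   ens.library = ens.library, tuneparams = my_tuneparams)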
Let's compare the ROC curves of the best and worst single algorithms and the stack.
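A quick way to identify them from the AUC output (a sketch, assuming pred$auc_ipcwBagg is a named numeric vector covering the single algorithms and the stack):

aucs <- pred$auc_ipcwBagg
single_aucs <- aucs[setdiff(names(aucs), "Stack")]
names(which.max(single_aucs))   # best single algorithm (GAM.4 in this run)
names(which.min(single_aucs))   # worst single algorithm (k-NN in this run)

The ROC curve of the stack is: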
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred$prediction_ensBagg[, "Stack"],
                    wts = pred$wts_test, tao = 26.5, method = "ipcw")
The GAM.4 ROC curve is
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred$prediction_ensBagg[, "ens.gam.4"],
                    wts = pred$wts_test, tao = 26.5, method = "ipcw")
The k-NN ROC curve is
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred$prediction_ensBagg[, "ens.knn"],
                    wts = pred$wts_test, tao = 26.5, method = "ipcw")
Now let's take a look at the predictions of the algorithms that natively support observation weights:
head(pred$prediction_native_weights,5)
and their performance is:
pred$auc_native_weights
Moreover, let's see the predictions of three survival-based methods: a cause-specific Cox proportional hazards regression model, Cox-Boost, and random survival forests for competing risks.
head(pred$prediction_survival,5)
pred$auc_survival
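Earlier we said we would also compute the weights under boosting Cox regression. The same analysis can be repeated by changing the weighting argument; a minimal sketch, assuming the function accepts the string "Cox-Boost" for this option (only "CoxPH" is confirmed above):

pred.cb <- stackBagg::stackBagg(train.data = train, test.data = test, xnam = xnam,
                                tao = 26.5, weighting = "Cox-Boost", folds = 5,
                                ens.library = ens.library)
pred.cb$auc_ipcwBagg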
Lastly, we can see the performance of the algorithms if we were to discard the censored observations:
pred.discard <- stackBagg::prediction_discard(train.data = train, test.data = test,
                                              xnam = xnam, tao = 26.5,
                                              ens.library = ens.library)
head(pred.discard$prediction_discard)
pred.discard$auc_discard
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred.discard$prediction_discard[, "ens.gam.4"],
                    tao = 26.5, method = "discard")