knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette gives an overview of the main functions of the package, using the simulated data that ships with it.
We load the stackBagg package:
library(stackBagg)
The simulated data can be loaded via:
data("exampleData")
exampleData is a list with three elements. The first element contains the training data set and the second element contains the test data set. The train and test data sets consist of the variables id, the binary outcome E, the event times ttilde, the event-type indicator delta at ttilde (censored observations are denoted by the value 0), trueT, which denotes the binary outcome under no censoring, and lastly 20 covariates:
train <- exampleData[[1]][[1]]
head(train)
test <- exampleData[[2]][[1]]
head(test)
The third element of exampleData is the true AUC for the simulated data on the cumulative incidence of the main event at time 26.5, computed analytically under the Weibull distribution using the scale and shape parameters of both event times.
auc_true <- exampleData[[3]][[1]]
auc_true
The train data set consists of 800 individuals, of whom 167 are censored and 180 experience the event of interest by time 26.5:
n_train <- dim(train)[1]
n_train
summary(train$E)
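The number of censored observations can also be read directly off delta, since censored observations are coded as 0:

sum(train$delta == 0)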
Furthermore, 51 subjects experience the competing event within the time of interest, 26.5; these individuals are part of the control group:
library(dplyr)
train %>% subset(E == 0 & ttilde < 26.5) %>% count(delta)
The test data set consists of 200 individuals, of whom 45 are censored and 39 experience the event of interest:
n_test <- dim(test)[1]
n_test
summary(test$E)
and there are 24 subjects who experience the competing event within the time of interest, 26.5:
test %>% subset(E == 0 & ttilde < 26.5) %>% count(delta)
Next we apply stackBagg to estimate the risk of experiencing the main event using the 20 covariates in the data set. In other words, using stackBagg we are going to estimate $P(T<26.5,\delta=1|X)$ with a library of machine learning algorithms. Before applying stackBagg, we need to make sure that the data set is in the appropriate format: the event time is in the first column and the event-type indicator is in the second column.
train <- train[, -(1:2)]
head(train, 2)
test <- test[, -(1:2)]
head(test, 2)
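A quick sanity check that the rearranged data now start with the event time and the event-type indicator, as stackBagg expects:

# Dropping id and E leaves ttilde first and delta second
stopifnot(identical(names(train)[1:2], c("ttilde", "delta")))
stopifnot(identical(names(test)[1:2], c("ttilde", "delta")))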
Another argument of the function stackBagg::stackBagg is the vector of covariate names, as they appear in the data set, that we want to include in the model. As we said above, we are going to use all of them:
xnam <- names(train)[-(1:3)] xnam
We also have to specify the library of algorithms that we want to use to predict the event of interest and to form the stack. We can see all the algorithms that could potentially be included in the analysis through stackBagg::algorithms(). Let's use all of them and denote the result ens.library.
ens.library <- stackBagg::algorithms()
ens.library
Another argument of the function stackBagg::stackBagg is a list of tuning parameters for each machine learning procedure. If this argument is missing, the function defaults to the same values used for the simulations in the paper; we use those defaults for now. Additionally, we will use 5 folds, and we are going to show the results computing the weights under a Cox proportional hazards model (CoxPH) and under boosting Cox regression (Cox-Boost).
First, we model the weights under CoxPH, train the different models, and obtain their predictions on the test data set.
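For intuition, IPCW weights at the horizon 26.5 typically take the standard form below, with the censoring survival function $\hat G$ estimated here from a Cox model. This is a sketch of the usual definition; the package's exact implementation may differ:

$$
\hat w_i \;=\; \frac{\mathbb{1}(\tilde T_i \le 26.5,\ \delta_i \neq 0)}{\hat G(\tilde T_i \mid X_i)} \;+\; \frac{\mathbb{1}(\tilde T_i > 26.5)}{\hat G(26.5 \mid X_i)}
$$

Subjects whose status at 26.5 is observed are up-weighted by the inverse probability of remaining uncensored, while subjects censored before 26.5 receive weight zero.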
pred <- stackBagg::stackBagg(train.data = train, test.data = test, xnam = xnam,
                             tao = 26.5, weighting = "CoxPH", folds = 5,
                             ens.library = ens.library)
We now look at several outputs of pred.
The machine learning algorithms in the library:
pred$library
Let's first take a look at the IPCW bagging predictions of the individual algorithms and the stacked IPCW bagging on the test data set:
head(pred$prediction_ensBagg,5)
The assessment of predictive performance using the IPCW AUC is:
pred$auc_ipcwBagg
The optimal coefficients used to get the stacked IPCW Bagging:
pred$optimal_coefficients
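Conceptually, the stacked prediction is the weighted combination of the single-algorithm predictions with these coefficients. A minimal sanity-check sketch, assuming the columns of pred$prediction_ensBagg (excluding "Stack") line up with the coefficient order:

# Recombine the single-algorithm predictions with the optimal coefficients
single <- pred$prediction_ensBagg[, colnames(pred$prediction_ensBagg) != "Stack"]
manual_stack <- as.vector(as.matrix(single) %*% pred$optimal_coefficients)
head(cbind(manual = manual_stack, stack = pred$prediction_ensBagg[, "Stack"]))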
Note that the algorithms with the best predictive performance carry more weight in the stack. We check whether convergence was reached in the optimization problem of finding the optimal coefficients (0 denotes convergence, 1 otherwise) and the penalization term used:
pred$convergence
pred$penalization_term
We can check the tuning parameters used to train the algorithms:
pred$tuneparams
The GAM is trained with two degrees of freedom, 3 and 4. The LASSO parameter refers to the lambda penalization term. The parameters in the random forest refer to the number of trees (num_tree = 500) and the number of variables randomly sampled as candidates at each split (mtry = 4). The k chosen for the k-NN is 25. The SVM parameters are the cost, gamma, and kernel (radial = 1 and linear = 2); since the kernel is linear, gamma is NA, as the linear kernel does not use it. The number of neurons in the neural network is set to 1. The last values are the BART parameters: the number of trees; k, which determines the prior probability that the average of the outcome falls into (-3, 3); and q, the quantile of the prior on the error variance.
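If we wanted different values, one natural approach is to start from the list actually used and modify it before passing it back. The sketch below is hypothetical: the argument name tuneparams and the element path my_tuneparams$knn$k are assumptions inferred from the pred$tuneparams output, not the documented interface; check ?stackBagg::stackBagg for the exact argument structure.

# Hedged sketch: reuse the tuning parameters from the previous fit and tweak one
my_tuneparams <- pred$tuneparams   # values actually used above
my_tuneparams$knn$k <- 50          # hypothetical: a larger k-NN neighborhood
pred.knn50 <- stackBagg::stackBagg(train.data = train, test.data = test, xnam = xnam,
                                   tao = 26.5, weighting = "CoxPH", folds = 5,
                                   ens.library = ens.library, tuneparams = my_tuneparams)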
Let's compare the ROC curves of the best and worst single algorithms and the stack.
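A quick way to identify them from the AUC output (a sketch, assuming pred$auc_ipcwBagg is a named numeric vector covering the single algorithms and the stack):

aucs <- pred$auc_ipcwBagg
single_aucs <- aucs[setdiff(names(aucs), "Stack")]
names(which.max(single_aucs))   # best single algorithm (GAM.4 in this run)
names(which.min(single_aucs))   # worst single algorithm (k-NN in this run)

The ROC curve of the stack is: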
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred$prediction_ensBagg[, "Stack"],
                    wts = pred$wts_test, tao = 26.5, method = "ipcw")
The GAM.4 ROC curve is
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred$prediction_ensBagg[, "ens.gam.4"],
                    wts = pred$wts_test, tao = 26.5, method = "ipcw")
The k-NN ROC curve is
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred$prediction_ensBagg[, "ens.knn"],
                    wts = pred$wts_test, tao = 26.5, method = "ipcw")
Now let's take a look at the predictions of the algorithms that natively support observation weights:
head(pred$prediction_native_weights,5)
and their performance is:
pred$auc_native_weights
Moreover, let's see the predictions of three survival-based methods: a cause-specific Cox proportional hazards regression model, Cox-Boost, and random survival forests for competing risks.
head(pred$prediction_survival,5)
pred$auc_survival
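Earlier we said we would also compute the weights under boosting Cox regression. The same analysis can be repeated by changing the weighting argument; a minimal sketch, assuming the function accepts the string "Cox-Boost" for this option (only "CoxPH" is confirmed above):

pred.cb <- stackBagg::stackBagg(train.data = train, test.data = test, xnam = xnam,
                                tao = 26.5, weighting = "Cox-Boost", folds = 5,
                                ens.library = ens.library)
pred.cb$auc_ipcwBagg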
Lastly, we can see the performance of the algorithms if we were to discard the censored observations:
pred.discard <- stackBagg::prediction_discard(train.data = train, test.data = test,
                                              xnam = xnam, tao = 26.5,
                                              ens.library = ens.library)
head(pred.discard$prediction_discard)
pred.discard$auc_discard
stackBagg::plot_roc(time = test$ttilde, delta = test$delta,
                    marker = pred.discard$prediction_discard[, "ens.gam.4"],
                    tao = 26.5, method = "discard")