```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
## pbc2 dataset

The `pbc2` dataset [@murtaugh_primary_1994] is loaded with the package DynForest to illustrate its abilities. The `pbc2` data come from a clinical trial conducted by the Mayo Clinic between 1974 and 1984 to treat primary biliary cholangitis (PBC), a chronic liver disease. 312 patients were enrolled in the clinical trial to evaluate the effectiveness of D-penicillamine compared to a placebo to treat PBC, and were followed until the end of the trial, leading to a total of 1945 observations. During the follow-up, several continuous clinical markers were collected over time: the level of serum bilirubin (`serBilir`), the level of serum cholesterol (`serChol`), the level of albumin (`albumin`), the level of alkaline phosphatase (`alkaline`), the level of aspartate aminotransferase (`SGOT`), the platelet count (`platelets`) and the prothrombin time (`prothrombin`). Four non-continuous time-dependent predictors were also collected: the presence of ascites (`ascites`), the presence of hepatomegaly (`hepatomegaly`), the presence of blood vessel malformations in the skin (`spiders`) and the edema level (`edema`). These time-dependent predictors were recorded according to the `time` variable. In addition to these time-dependent predictors, a few predictors were collected at enrollment: sex (`sex`), age (`age`) and the drug treatment (`drug`). During the follow-up, 140 patients died before transplantation, 29 patients were transplanted and 143 patients were censored alive (`event`). The time of first event (censored alive or any event) was considered as the event time (`years`).
```r
library("DynForest")
data(pbc2)
head(pbc2)
```
For the illustration, we select the patients still at risk at 4 years and we recode the `event` variable with `event` = 1 for subjects who died between 4 and 10 years, whereas subjects with transplantation were recoded `event` = 0, as were the subjects still alive. We split the subjects into two datasets: (i) one dataset to train the random forest, using $2/3$ of the patients; (ii) one dataset to predict on the other $1/3$ of the patients.
```r
pbc2 <- pbc2[which(pbc2$years > 4 & pbc2$time <= 4), ]
pbc2$event <- ifelse(pbc2$event == 2, 1, 0)
pbc2$event[which(pbc2$years > 10)] <- 0
set.seed(1234)
id <- unique(pbc2$id)
id_sample <- sample(id, length(id) * 2/3)
id_row <- which(pbc2$id %in% id_sample)
pbc2_train <- pbc2[id_row, ]
pbc2_pred <- pbc2[-id_row, ]
```
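As a quick sanity check (a sketch using base R only), we can verify that the landmark selection behaved as intended and inspect the size of each subset:

```r
# All retained measurements occur before the 4-year landmark,
# and all retained subjects are still at risk at 4 years
stopifnot(all(pbc2$time <= 4), all(pbc2$years > 4))

# Number of subjects in the training and prediction sets
length(unique(pbc2_train$id))
length(unique(pbc2_pred$id))
```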
We use the same strategy as in the survival context (section 4) to build the random forest, with the same predictors and the same association for time-dependent predictors.
```r
timeData_train <- pbc2_train[, c("id", "time",
                                 "serBilir", "SGOT", "albumin", "alkaline")]
timeVarModel <- list(serBilir = list(fixed = serBilir ~ time,
                                     random = ~ time),
                     SGOT = list(fixed = SGOT ~ time + I(time^2),
                                 random = ~ time + I(time^2)),
                     albumin = list(fixed = albumin ~ time,
                                    random = ~ time),
                     alkaline = list(fixed = alkaline ~ time,
                                     random = ~ time))
fixedData_train <- unique(pbc2_train[, c("id", "age", "drug", "sex")])
```
With a categorical outcome, the definition of the output object is slightly different. We should specify `type` = "factor" to define the outcome as categorical, and the dataframe in `Y` should contain only 2 columns: the identifier variable `id` and the outcome `event`.
```r
Y <- list(type = "factor",
          Y = unique(pbc2_train[, c("id", "event")]))
```
We then execute the `dynforest()` function to build the random forest with hyperparameters `mtry` = 7 and `nodesize` = 2 as follows:
```r
res_dyn <- dynforest(timeData = timeData_train,
                     fixedData = fixedData_train,
                     timeVar = "time", idVar = "id",
                     timeVarModel = timeVarModel,
                     mtry = 7, nodesize = 2, Y = Y,
                     ncores = 7, seed = 1234)
```
With a categorical outcome, the OOB prediction error is evaluated using a misclassification criterion. This criterion can be computed with the `compute_ooberror()` function, and the results of the random forest can be displayed using `summary()`:
```r
res_dyn_OOB <- compute_ooberror(dynforest_obj = res_dyn)
```
```r
summary(res_dyn_OOB)
#> dynforest executed for categorical outcome
#>   Splitting rule: Minimize weighted within-group Shannon entropy
#>   Out-of-bag error type: Missclassification
#>   Leaf statistic: Majority vote
#> ----------------
#> Input
#>   Number of subjects: 150
#>   Longitudinal: 4 predictor(s)
#>   Numeric: 1 predictor(s)
#>   Factor: 2 predictor(s)
#> ----------------
#> Tuning parameters
#>   mtry: 7
#>   nodesize: 2
#>   ntree: 200
#> ----------------
#> ----------------
#> dynforest summary
#>   Average depth per tree: 5.89
#>   Average number of leaves per tree: 16.81
#>   Average number of subjects per leaf: 5.72
#> ----------------
#> Out-of-bag error based on Missclassification
#>   Out-of-bag error: 0.2333
#> ----------------
#> Computation time
#>   Number of cores used: 7
#>   Time difference of 2.87888 mins
#> ----------------
```
In this illustration, we built the random forest using 150 subjects, because we only kept the subjects still at risk at the landmark time of 4 years and split the dataset into $2/3$ for training and $1/3$ for testing. We have on average 5.7 subjects per leaf, and the average depth level per tree is 5.9. This random forest predicted the wrong outcome for 23\% of the subjects. The random forest performance can be optimized by choosing the `mtry` and `nodesize` hyperparameters that minimize the OOB misclassification.
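Such a tuning step is not shown here, but it can be sketched as a simple grid search that refits the forest for a few candidate hyperparameter pairs and keeps the pair with the lowest OOB misclassification. The candidate grid below is an arbitrary illustrative choice, and the `oob.err` element used to average the per-subject errors is an assumption about the object returned by `compute_ooberror()` (check `str()` on the returned object):

```r
# Illustrative grid search over mtry and nodesize (candidate values
# are arbitrary choices for this sketch; adapt them to your data).
grid <- expand.grid(mtry = c(3, 5, 7), nodesize = c(2, 5, 10))
oob_err <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- dynforest(timeData = timeData_train, fixedData = fixedData_train,
                   timeVar = "time", idVar = "id",
                   timeVarModel = timeVarModel, Y = Y,
                   mtry = grid$mtry[i], nodesize = grid$nodesize[i],
                   ncores = 7, seed = 1234)
  # Assumption: per-subject OOB errors are stored in the oob.err
  # element of the object returned by compute_ooberror()
  mean(compute_ooberror(dynforest_obj = fit)$oob.err, na.rm = TRUE)
})
grid[which.min(oob_err), ]  # hyperparameters with the lowest OOB error
```

Note that each refit trains a full forest, so this loop can be computationally demanding.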
We can predict the probability of death between 4 and 10 years for subjects still at risk at the landmark time of 4 years. In classification mode, the predictions are performed using the majority vote. The prediction over the trees is thus a category of the outcome, along with the proportion of the trees that lead to this category. Predictions are computed using the `predict()` function; a dataframe can then easily be built from the returned object to get the prediction and probability of the outcome for each subject:
```r
timeData_pred <- pbc2_pred[, c("id", "time",
                               "serBilir", "SGOT", "albumin", "alkaline")]
fixedData_pred <- unique(pbc2_pred[, c("id", "age", "drug", "sex")])
pred_dyn <- predict(object = res_dyn,
                    timeData = timeData_pred,
                    fixedData = fixedData_pred,
                    idVar = "id", timeVar = "time",
                    t0 = 4)
```
```r
head(data.frame(pred = pred_dyn$pred_indiv,
                proba = pred_dyn$pred_indiv_proba))
#>     pred proba
#> 101    0 0.945
#> 104    0 0.790
#> 106    1 0.600
#> 108    0 0.945
#> 112    1 0.575
#> 114    0 0.650
```
As shown in this example, predictions are made with varying confidence, from 57.5\% for subject 112 to 94.5\% for subject 101. For instance, we predict no event for subject 101 with a probability of 94.5\%, and an event for subject 106 with a probability of 60.0\%.
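Since `pred_indiv_proba` is the proportion of trees voting for the *predicted* category, the proportion of trees voting for the event category (`event` = 1) can be derived for every subject with a small transformation (a sketch reusing the objects above):

```r
# Proportion of trees voting for event = 1: equal to the reported
# proba when the event is predicted, and its complement otherwise.
proba_event <- ifelse(pred_dyn$pred_indiv == 1,
                      pred_dyn$pred_indiv_proba,
                      1 - pred_dyn$pred_indiv_proba)
head(proba_event)
```

For subject 101 above, this gives 1 - 0.945 = 0.055, while subject 106 keeps its 0.600.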
The most predictive variables can be identified using `compute_vimp()` and displayed with the `plot()` function as follows:
```r
res_dyn_VIMP <- compute_vimp(dynforest_obj = res_dyn_OOB, seed = 123)
plot(res_dyn_VIMP, PCT = TRUE)
```
```r
knitr::include_graphics("Figures/DynForestR_classi_VIMP.png")
```
Again, we found that the most predictive variable is `serBilir`. When permuting `serBilir`, the OOB prediction error increased by 15\%.
The minimal depth is computed using the `compute_vardepth()` function and is displayed at the predictor and feature levels using the `plot()` function. The results are displayed in Figure 2 using the random forest with the maximal `mtry` hyperparameter value (i.e., `mtry` = 7) for a better understanding.
```r
depth_dyn <- compute_vardepth(dynforest_obj = res_dyn_OOB)
p1 <- plot(depth_dyn, plot_level = "predictor")
p2 <- plot(depth_dyn, plot_level = "feature")
```
```r
plot_grid(p1, p2, labels = c("A", "B"))
```
```r
knitr::include_graphics("Figures/DynForestR_classi_mindepth.png")
```
We observe that `serBilir` and `albumin` have the lowest minimal depth and are used to split the subjects in almost all the trees (199 and 196 out of 200 trees, respectively) (Figure 2A). Figure 2B provides further results. In particular, this graph shows that the random intercepts (indicated by bi0) of `serBilir` and `albumin` are the earliest predictors used to split the subjects and are present in 193 and 190 out of 200 trees, respectively.