knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction to sword

What is sword?

sword is an experimental package which provides tools to generalise the use of the Random Forest algorithm and the Generalised Additive Model (GAM) to the modelling of right censored data. Such models can be built very easily thanks to sword, and their performances can be compared with those of benchmark models.

Technically, sword is a wrapper for statistical algorithms provided in the R packages randomForestSRC (random forest : rfsrc), rpart, survival and mgcv (gam).

As sword is experimental, it is designed so that somebody interested in the method can test it and get results very quickly on new data. However, sword is not numerically optimised and cannot handle massive amounts of data.

We point out that sword has more of a research focus, and one should be careful before making professional use of it.

Survival regression setting

To describe the general problem we study, let $T$ be a time variable. In some applications where the time of interest is a long duration, $T$ may not be fully observed: $T$ may not have occurred before the end of the study, or the follow-up of an individual may have stopped prematurely, before $T$ occurred. Such situations arise in the study of lifetimes, long term care insurance, predictive maintenance or contract churn. In the case of lifetimes for instance, part of the population is still alive at the time the study is made and the lifetimes of the living individuals are unknown: we know that their lifetimes ($T$) are larger than their current ages, but we do not know what their lifetimes will be (we only have partial information).

In this context, we define $C$ as the time at which the observation ends, for any reason (end of follow-up, end of the study, ...) other than the occurrence of $T$, and we say that $T$ is right censored by $C$. Instead of observing $T$, we observe $Y = \text{min}(T,C)$ and $\Delta = 1_{T \le C}$.

We denote by $X$ a vector of characteristics of an observation (e.g. for the study of lifetimes: sex, education, ...). Our observations then consist of $n$ independent and identically distributed replications of the vector $(Y, \Delta, X)$. We write $(Y_i, \Delta_i, X_i)_{i=1,..,n}$ for the random variables and $(y_i, \delta_i, x_i)_{i=1,..,n}$ for the corresponding observations.

In some situations, one may be interested in predicting $\phi(T)$ (for a given function $\phi$) rather than predicting $T$. We give an example of this situation in the article [??], where $T$ corresponds to the termination time of an insurance contract and $\phi(T)$ denotes the amount of commission received by the broker which sold the contract. In the following, we study another example where $\phi(T) = 1_{T > 365}$ and we focus on predicting the probability that the waiting time before a patient benefits from a liver transplantation exceeds one year.

The package sword addresses the problem of predicting $\phi(T)$, providing functions to build models that aim to estimate $E[\phi(T)|X]$.

Weighted method for the regression of a right censored variable

Inverse Probability of Censoring Weighting (IPCW)

The main function of sword is sw_reg: it allows one to fit random forest and GAM (Generalised Additive Model) models to a right censored variable, using Inverse Probability of Censoring Weighting (IPCW). The idea behind the IPCW method is that the uncensored observations of $T$, $(y_i)_{i \,:\, \delta_i = 1}$, are biased downwards since they are realisations of the variable $T \,|\, T \le C$. But it is possible to compensate for the bias induced by the censoring thanks to a proper weighting.

If we assume $(H)$ that $T$ and $C$ are independent conditionally on $X$, then we can show the following equality:

$$E\left[ W \cdot \psi(Y,X) \right] = E \left[\psi(T,X)\right] ,~~ \text{with} ~~ W = \frac{\Delta}{S_C(Y |X)}$$

with $S_C(t ~| X = x)$ denoting the survival function of $C$ given that $X = x$, and $\psi$ a given function.

This equality shows that it is possible to estimate the distribution of $(T,X)$ from the available censored observations $(y_i, \delta_i, x_i)_{i=1,..,n}$. To do so, each observation $(y_i, x_i)$ is weighted as follows:

$$ w_i = \frac{\delta_i}{S_C(y_i \,|\, X = x_i)} $$

Remark: we have $S_C(t_i \,|\,X = x_i) = P(C > t_i \,|\, X = x_i) \overset{(H)}{=} P(C > T \,|\, X = x_i, T = t_i)$. The quantity $S_C(t_i \,|\,X = x_i)$ thus corresponds to a probability of not being censored, hence the name IPCW.
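To illustrate this identity, here is a minimal simulation sketch (not taken from the package) in which $T$ and $C$ are independent exponential variables and the survival function of $C$ is known: the weighted mean of $\phi(Y)$ computed from the censored sample matches the mean of $\phi(T)$ computed from the uncensored one.

# Minimal simulation sketch of the IPCW identity (illustration only, not part of sword)
set.seed(1)
n = 1e5
T_true = rexp(n, rate = 1)          # variable of interest
C = rexp(n, rate = 0.5)             # censoring time, independent of T
y = pmin(T_true, C)                 # observed time
delta = as.numeric(T_true <= C)     # censoring indicator
phi = function(t) 1 * (t > 1)
S_C = function(t) exp(-0.5 * t)     # true survival function of C
w = delta / S_C(y)                  # IPC weights
mean(w * phi(y))                    # IPCW estimate, uses only (y, delta)
mean(phi(T_true))                   # target quantity E[phi(T)], uses the uncensored T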

By attributing weights $w_i$ to the observations, sw_reg generalises classical regression algorithms to the right censored case.

Practical algorithms

In practice, the function $S_C(\cdot \,|\,X = x)$ is unknown and so are the weights $w_i$: we must estimate $S_C(\cdot \,|\,X = x)$ in order to use the IPCW principle. There are 3 methods implemented in sw_reg to estimate $S_C(\cdot \,|\,X = x)$ (type_w argument):

- "KM": a Kaplan-Meier estimator of the censoring distribution, fitted on the whole sample (it does not depend on $x$);
- "Cox": a Cox proportional hazards model fitted to the censoring times, using the covariates $X$;
- "RSF": a Random Survival Forest fitted to the censoring times, using the covariates $X$.

Once we have built an estimator $\hat{S}_C(\cdot |X = x)$ of $S_C(\cdot |X = x)$, we can compute the IPC weights $\hat{w}_i = \delta_i ~/~ \hat{S}_C(y_i |X = x_i)$.
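As an illustration of the "KM" option, such weights can be computed by hand by fitting a Kaplan-Meier estimator to the censoring times (i.e. with the censoring indicator reversed) and evaluating it at each $y_i$. The following sketch is independent of the internal implementation of sw_reg and assumes a hypothetical data.frame dat with columns y (observed time) and delta (1 = uncensored):

library(survival)
# Sketch: Kaplan-Meier IPC weights computed by hand (hypothetical data.frame `dat`)
km_cens = survfit(Surv(y, 1 - delta) ~ 1, data = dat)    # KM estimator of S_C
S_C_hat = stepfun(km_cens$time, c(1, km_cens$surv))      # step function t -> estimated S_C(t)
w_hat = dat$delta / pmax(S_C_hat(dat$y), 1e-6)           # IPC weights, guarded against division by ~0
w_hat = w_hat / sum(w_hat)                               # normalise so that the weights sum to 1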

Then the method consists in using the regression algorithm to predict the targets $\phi(y_i)$, with explanatory variables $x_i$, and with each observation $i$ getting a weight $\hat{w}_i$ in the model training.

Depending on the algorithm, there may be different ways to take the weights into account. For instance, if we consider a linear Gaussian model (with parameter $\beta$) as the regression algorithm, the weights $\hat{w}_i$ are introduced in the quadratic error of the model and $\hat{\beta}$ is estimated with:

$$ \hat{\beta} = \underset{\beta \in \mathbb{R}^d}{\text{argmin}} \sum_{i=1}^n \hat{w}_i \cdot (\phi(y_i) - \beta \cdot x_i)^2 $$
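In R, such a weighted least squares fit can be obtained with the weights argument of lm. Here is a small sketch (not part of sword), reusing the hypothetical data.frame dat and weights w_hat from the previous sketch, with hypothetical covariates x1 and x2:

# Sketch: weighted linear regression with IPC weights (hypothetical columns x1 and x2 in `dat`)
dat$phi_y = 1 * (dat$y > 365)                           # target phi(y_i), here phi(t) = 1_{t > 365}
fit = lm(phi_y ~ x1 + x2, data = dat, weights = w_hat)  # lm minimises the weighted quadratic error
coef(fit)                                               # weighted estimate of beta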

Example data: Liver transplant waiting time

We now illustrate through an example how the package sword can be used to model right censored data. We use the transplant dataset, which records, from 1990 to 1999, information about patients who were on a waiting list for a liver transplant.

Load and preprocess the data

library(sword)
library(ggplot2)
library(reshape2)
data("transplant")
head(transplant)

In this dataset, the variable futime corresponds to the $Y$ variable defined in the previous paragraph (time from entry into the list to exit from the list for any reason) and event corresponds to $\Delta$. A small difference is that in the transplant dataset there are 4 possible outcomes for the event taking place at the end of the futime period (censored, death, ltx: liver transplant, or withdraw), whereas in our setting there are only two possible outcomes: occurrence of $T$ or censoring (end of the observation: $C$). For the purpose of the example, we reduce the problem to one with a binary outcome by considering the event delta: did the person receive a transplant? yes: 1 (ltx modality) or no: 0 (censored, death, withdraw modalities). As a result, we study the distribution of the variable "waiting time from the entry in the list until transplantation".

transplant$delta = 1 * (transplant$event == "ltx") # binary variable: 1 = transplant received (uncensored), 0 = censored

Four covariates (vector $X$) are provided: age, sex, abo (blood group) and year (year of entry into the waiting list).

# compute the number of missing values for each column
apply(transplant, MARGIN = 2, FUN = function(x){sum(is.na(x))})

# keep only rows with no missing value
transplant_bis = transplant[complete.cases(transplant),]

Statistics about the data (Survival function of the outcome)

# plot the survival curve of the waiting time until transplant
KM = survfit(formula = Surv(time = futime, event = delta) ~ 1,
             data = transplant_bis)
plot(KM, 
     main = "Survival Curve of the waiting time before liver transplant", 
     ylab = "survival prob.", xlab = "time (days)")

Predict the probability that a transplantation occurs after more than one year of waiting

In this application we take $\phi(t) = 1_{t > 365}$ so that our goal is to estimate $P(T > 365~|~X)$.

We split the dataset into train and test sets so that we can validate our model :

nrow(transplant_bis) # number of observations

set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
train = transplant_bis[train_lines,] # train set
test = transplant_bis[-train_lines,] # test set

In this special case, the value of $\phi(T)$ is known as soon as $T$ reaches 366 days: we can set $T' = \text{min}(T, 366)$ and we have $\phi(T') = \phi(T)$. We can therefore replace $\phi(T)$ by $\phi(T')$; the advantage is that, since $T' \le T$, $T'$ is less censored than $T$ and we get more uncensored observations, resulting in a more accurate estimation. This is the purpose of the max_time argument of sw_reg, which replaces $T$ by $T' = \text{min}(T, \text{max\_time})$.
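A rough sketch of this thresholding (the actual handling is performed internally by sw_reg through max_time) shows why the censoring rate decreases: any observation with $y_i \ge 366$ becomes an uncensored observation of $T'$.

# Sketch of the effect of `max_time` (illustration only; sw_reg does this internally)
max_time = 366
y_prime = pmin(train$futime, max_time)                          # observed values of T' = min(T, 366)
delta_prime = ifelse(train$futime >= max_time, 1, train$delta)  # observations beyond 366 days are uncensored for T'
mean(delta_prime == 0)                                          # censoring rate after thresholding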

Weighted Regression Survival

Train a weighted random forest model

By default, the type_reg argument of sw_reg takes the value "RF", which means that the regression algorithm used by the function is the random forest. The other possible value is "gam". The following command then trains a weighted random forest model on the transplant data:

res1 = sw_reg(y_var = "futime", 
              delta_var = "delta",
              x_vars = c("age", "sex", "abo", "year"),
              train = train,
              test = test,
              type_w = "KM",
              phi = function(x){(x > 365) * 1},
              max_time = 366,
              types_w_ev = c("KM", "Cox", "RSF", "unif"))

The type_w argument specifies the type of weights to use in the model training, among "KM", "Cox" and "RSF" (see part ??, default = "KM"). The other possible value is "unif", which corresponds to every uncensored observation getting the same weight (and censored observations getting weight 0): this value is not recommended since it generally leads to highly biased results, but it can be tested.

Due to the max_time argument, the rate of censoring in the training data is different from the rate of censoring in the initial data :

print(sum(train$delta == 0) / nrow(train)) # rate of censoring on the initial data
print(head(res1$cens_rate)) # rate of censoring after thresholding with `max_time`

The training weights, which are in the above example computed with the "KM" method, are returned by the function :

print(res1$w_mod_train[1:30])

The parameter max_w_mod may also be specified in the model training (default value = 20). The purpose of this parameter is to limit the ratio $\max(\hat{w}_i) / \min(\hat{w}_i)$ (the minimum being taken over the positive weights) to the value max_w_mod. When needed, the weights $\hat{w}_i$ that exceed max_w_mod $\cdot \min(\hat{w}_i)$ are set to the value max_w_mod $\cdot \min(\hat{w}_i)$. Afterwards, the weights $(\hat{w}_i)$ are normalised so that they sum to $1$.
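A small sketch of this truncation rule (the actual computation is internal to sw_reg):

# Sketch of the capping performed through `max_w_mod` (illustration only)
cap_weights = function(w, max_ratio = 20) {
  w_min = min(w[w > 0])                   # smallest positive weight (censored observations have weight 0)
  w_capped = pmin(w, max_ratio * w_min)   # cap the weights exceeding max_ratio * w_min
  w_capped / sum(w_capped)                # normalise so that the weights sum to 1
}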

print(res1$max_w_mod) # value passed to the sw_reg function (default value = 20)
print(max(res1$w_mod_train) / min(res1$w_mod_train[res1$w_mod_train > 0])) # maximum ratio among the train weights
print(res1$n_w_mod_modif_train) # number of weights modified due to `max_w_mod`

In our example, as the rate of censoring is low and the values of $T$ are thresholded at $366$, the weights do not take large values and the parameter max_w_mod has no effect. But in some cases where the rate of censoring is high and/or max_time isn't specified, the weights $\hat{w}_i$ can take very large values. When this happens, it should be corrected, because the importance of a single observation in the final model must be limited: otherwise this results in very high variance in the final model.

Predictions of the model

The predictions given by the model on the train and test sets are available in the returned object:

print(head(res1$pred_train))
print(head(res1$pred_test))

It is often necessary to compute the predictions of the model on new data, and the function predict_sw_reg allows one to do so:

pred_test = predict_sw_reg(obj = res1, newdata = test)
print(pred_test$pred[1:30])

Assess the quality of fit of the model

The types_w_ev argument has the same set of possible values as type_w, but multiple choices are possible. This argument gives the names of the weights that should be considered for the calculation of the mean squared error (MSE) estimator. Indeed, just as for the model training, the right censoring should be compensated for in the test sample. We then use the same IPCW technique as for the training, and the MSE is evaluated with:

$$MSE = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} \hat{w}_i \cdot (\phi(y_i) - pred_i)^2$$

where $pred_i$ is the prediction made by the model.

Goodness of fit statistics on the test set are accessible through the perf_test element of the returned object:

print(res1$perf_test$criteria_weighted) # test mse and R2

We observe that we get one MSE for each type of weights specified in the types_w_ev argument.

We also get the $R2$ criteria which are normalisations of the MSE. Let :

$$ \hat{Mean} = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} \hat{w}_i \cdot \phi(y_i)$$

and

$$ \hat{Var} = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} \hat{w}_i \cdot (\phi(y_i) - \hat{Mean})^2 $$

Then $R2 = 1 - MSE ~/~ \hat{Var}$.
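As an illustration, these quantities can be recomputed by hand from the returned object, using for instance the "KM" evaluation weights. This is only a sketch, consistent with the combination code shown at the end of this vignette (res1$test$phi contains the values $\phi(y_i)$ on the test set, and the evaluation weights are assumed to be normalised):

# Sketch: recompute the weighted MSE and R2 by hand with the "KM" evaluation weights
w_ev = res1$mat_w_test[, "KM"]
phi_y = res1$test$phi                             # values of phi(y_i) on the test set
mse = sum(w_ev * (phi_y - res1$pred_test)^2)      # weighted MSE
m_hat = sum(w_ev * phi_y)                         # weighted mean of phi(y_i)
var_hat = sum(w_ev * (phi_y - m_hat)^2)           # weighted variance of phi(y_i)
1 - mse / var_hat                                 # R2, to compare with res1$perf_test$criteria_weighted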

The classical concordance index (or C-index) is provided :

print(res1$perf_test$concordance)

These performance criteria are also computed on the train set :

print(res1$perf_train)

The weights used to evaluate the model are accessible through :

# weights used for evaluation
print(head(res1$mat_w_train))
print(head(res1$mat_w_test))

As for the training weights, the ratio between the evaluation weights may be limited to a given value (common to all types of weights). This is specified by the argument max_w_ev.

print(res1$max_w_ev) # value passed to the sw_reg function (default value = 1000)
print(res1$n_w_ev_modif_test) # number of test weights modified because of the threshold
print(res1$n_w_ev_modif_train) # number of train weights modified because of the threshold

The outputs sum_w_train and sum_w_test return the sum of the weights $\hat{w}_i$ before any thresholding is applied to the weights. This value is normally close to 1. A value far from 1 may indicate a bad fit of the model which estimates $S_C(\cdot \,|\, X)$.

print(res1$sum_w_train) # sum of the train weights before thresholding
print(res1$sum_w_test) # sum of the test weights before thresholding

Other mode for the weighted random forest

The argument mode_sw_RF selects the internal implementation of the weighted random forest. With mode_sw_RF = 2, an alternative implementation is used and additional predictions based on local Kaplan-Meier estimates are also returned (perf_test_KMloc):

res2 = sw_reg(y_var = "futime",
              delta_var = "delta",
              x_vars = c("age", "sex", "abo", "year"),
              train = train,
              test = test,
              type_w = "KM",
              phi = function(x){(x > 365) * 1},
              max_time = 366,
              types_w_ev = c("KM", "Cox", "RSF", "unif"),
              mode_sw_RF = 2)
print(res2$perf_test)
print(res2$perf_test_KMloc)

Weighted GAM model

The sw_reg function may also be used to fit a Generalised Additive Model (GAM) on survival data, by specifying type_reg = "gam" in the function call. Here, we train a GLM with a binomial distribution and a logit link on the data:

res2 = sw_reg(y_var = "futime", 
              delta_var = "delta",
              x_vars = c("age", "sex", "abo", "year"),
              train = train,
              test = test,
              type_reg = "gam",
              type_w = "Cox",
              phi = function(x){(x > 365) * 1},
              max_time = 366,
              types_w_ev = c("KM", "Cox", "RSF", "unif"),
              family = binomial(link = "logit"))

IPC weights are computed using the Cox model (type_w = "Cox"). The gam object is returned by the function:

summary(res2$sw_gam_obj)

The fit appears to be poor because the p-values of the coefficients are high. The following scores confirm that the fit of the GLM (binomial) model is not good.

print(res2$pred_test[1:20])
print(res2$perf_test)

Modify the arguments of the internal regression algorithm used

Both the random forest algorithm and the GAM model can benefit from the adjustment of different parameters. For the random forest, the most common parameters are :

Those parameters are available as arguments of the sw_reg function. But many other parameters can be passed to the randomForestSRC::rfsrc function, or to the rpart::rpart function. In fact, any parameter of the wrapped function can be specified through a call to sw_reg, thanks to the ... argument. For instance, one can add proximity = TRUE when calling sw_reg (type_reg = "RF", mode_sw_RF = 1) and the argument will be passed to the function randomForestSRC::rfsrc:

res11 = sw_reg(y_var = "futime", 
               delta_var = "delta",
               x_vars = c("age", "sex", "abo", "year"),
               train = train,
               test = test,
               type_w = "KM",
               phi = function(x){(x > 365) * 1},
               max_time = 366,
               types_w_ev = c("KM", "Cox", "RSF", "unif"),
               proximity = T)
print(res11$sw_RF_obj$proximity[1:5,1:5]) # matrix of the proximities between the first 5 obs. of the train set
print(dim(res11$sw_RF_obj$proximity)) # dimension of the proximity matrix

For the mgcv::gam function, no additional parameter appears in the list of arguments of sw_reg: every parameter should be passed through "...". We did so in a previous section to specify family = binomial(link = "logit") in the gam model.

Comparison with benchmark models

The two benchmark models we use are derived from two classical algorithms for survival data: the Cox model and the random survival forest algorithm. In both cases, the survival model is used to estimate the conditional survival functions $S_T(\cdot \,|\, X = x)$: let $\hat{S}_T(\cdot \,|\, X = x)$ denote the estimates. Then, $E[\phi(T)|X=x]$ is estimated with the formula

$$ \hat{\phi}(x) = - \int_{s=0}^{\text{max\_time}} \phi(s) \, d\hat{S}_T(s \,|\, X = x) $$
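In practice, $\hat{S}_T(\cdot \,|\, X = x)$ is evaluated on a grid of time points and the integral reduces to a discrete sum over the jumps of the estimated survival curve. Here is a sketch of such a computation, with hypothetical inputs: a matrix surv of survival probabilities (one row per observation) evaluated at the grid time_points. This is not necessarily the exact internal computation of the package.

# Sketch: approximate phi_hat = -integral of phi dS_T from a discretised survival curve
phi = function(t) 1 * (t > 365)
predict_phi = function(surv, time_points, max_time = 366) {
  S_prev = cbind(1, surv[, -ncol(surv), drop = FALSE])   # S(t_{k-1}), with S(0) = 1
  jumps = S_prev - surv                                  # probability mass on each interval (t_{k-1}, t_k]
  # mass falling on the grid, plus the mass remaining at max_time (T is thresholded by max_time)
  as.vector(jumps %*% phi(time_points) + surv[, ncol(surv)] * phi(max_time))
}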

Random Survival Forest

res2 = rsf_reg(y_var = "futime",
               delta_var = "delta",
               x_vars = c("age", "sex", "abo", "year"),
               train = transplant_bis[train_lines,],
               test = transplant_bis[-train_lines,],
               phi = function(x){(x > 365) * 1},
               max_time = 366,
               types_w_ev = c("KM", "Cox", "RSF", "unif"))

For the first 10 observations of the test set, we draw the predicted survival curves :

data_surv_test = cbind(melt(t(res2$surv_test[1:10,])), time = res2$time_points)
ggplot(data = data_surv_test, 
       aes(x = time, y = value, group = factor(Var2), color = factor(Var2))) +
  geom_line() +
  theme(legend.position = "bottom") +
  ggtitle("Prédiction des courbes de survie des 10 premiers individus test (RSF)")

Observe that we don't estimate the survival curves beyond max_time, since these estimates would be useless.

These survival curves correspond to the following predictions in terms of $\phi$ ($\phi(t) = 1_{t > 365}$):

res2$pred_test[1:10]

We have the same quality of fit criteria as for the function sw_reg :

print(res2$perf_test)

Cox model

res3 = cox_reg(y_var = "futime",
               delta_var = "delta",
               x_vars = c("age", "sex", "abo", "year"),
               train = transplant_bis[train_lines,],
               test = transplant_bis[-train_lines,],
               phi = function(x){(x > 365) * 1},
               max_time = 366,
               types_w_ev = c("KM", "Cox", "RSF", "unif")
               )
data_surv_test2 = cbind(melt(t(res3$surv_test[1:10,])), time = res3$time_points)
ggplot(data = data_surv_test2, 
       aes(x = time, y = value, group = factor(Var2), color = factor(Var2))) +
  geom_line() +
  theme(legend.position = "bottom") +
  ggtitle("Prédiction des courbes de survie des 10 premiers individus test (Cox)")
print(res3$perf_test)

Comparison of the results

In this example, where the goal is to predict the probability that the waiting time until liver transplant exceeds 1 year, we have seen that the weighted random forest approach achieves the best accuracy among the different models, followed closely by the Random Survival Forest benchmark.

The advantage of the weighted random forest compared to the other models is that it focuses on optimising the criterion we want to minimise, namely the quadratic error measured on the test set. On the other hand, the information processed by sw_reg is different from the information processed by a benchmark (rsf_reg for example). Indeed, sw_reg processes the information $(\phi(y_i), x_i, \hat{w}_i)_{i=1,..,n}$ whereas rsf_reg processes the data $(y_i, \delta_i, x_i)_{i=1,..,n}$. Since $\phi$ is an indicator function, sw_reg doesn't see the difference between the uncensored observations $(y_i = 370, \delta_i = 1, x_i = ..)$ and $(y_i = 900, \delta_i = 1, x_i = ..)$, whereas rsf_reg does make the difference.

We have compared sw_reg and the benchmarks on many datasets, and though there is no general rule that would identify the best model, there are some facts that we can highlight from the experiments. First, sw_reg appears to perform well when the censoring rate is low (less than 30%), and not very well when the rate of censoring increases. We think this can be explained by the fact that sw_reg is really a generalisation of the classical random forest algorithm (for regression) to the right censored case. Hence, as the censoring rate gets low, we get closer to the case where there is no censoring, and sw_reg behaves similarly to the random forest in the uncensored case. Second, the $\phi$ function that is considered seems to have a big influence on the results. Finally, the weighted approach of the function sw_reg seems to work better with conditional weights computed with Cox or RSF than with Kaplan-Meier weights. But the latter observation does not appear in the experiments with the transplant data, because the censoring rate of this dataset is too low (so the difference between the KM weights and the conditional weights is very small).

We have also observed that the predictions obtained with sw_reg and the predictions made with the benchmarks are only weakly correlated, and that combining them usually leads to better accuracy:

# here we compute the R2 of the model which averages the weighted RF and RSF predictions; apparently not great (too much variance in the results anyway)
mean = sum( res1$mat_w_test[,"KM"] * res1$test$phi)
R2 = 1 - sum( res1$mat_w_test[,"KM"] * ((res1$pred_test + res2$pred_test)/2 - res1$test$phi )^2) / 
  sum( res1$mat_w_test[,"KM"] * (res1$test$phi - mean )^2)

print(mean)
print(R2)

Conclusion

In this document, we have presented the package sword, which aims to provide an implementation of the weighted regression algorithm for survival data that we have studied in the article [??]. To illustrate the functionalities of the package, we have explored the transplant data and applied the methods provided in sword to the problem of estimating the probability that a patient remains on the waiting list for a liver transplant one year after entering the list.


