Description
sw_reg is the core function of the package. It implements the method studied in [Gerber et al. (2018)] to adapt regression algorithms to a right censored target variable. Given a right censored variable T, a function phi and covariates X, sw_reg aims to estimate E[phi(T)|X]. The method is based on the IPCW (Inverse Probability of Censoring Weighting) principle for right censored variables, which is used to compensate for the censoring. Though the method may generalise to many regression algorithms, sw_reg only implements random forest and GAM solutions. Technically, sw_reg is a wrapper for rfsrc and rpart (random forest) and gam (GAM). Different methods are available to assess the quality of fit of sw_reg.
The notations we use are :
C : Censoring variable
Y = min(T, C)
δ = 1_{T ≤ C} (delta)
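
For illustration, here is a minimal toy simulation of these quantities; it is not part of the package, and the distributions are arbitrary. The objects it creates (X, Y, delta) are reused in the sketches further below.

# toy simulation of right censored data, to illustrate the notation
set.seed(1)
n <- 500
X <- runif(n)                        # a covariate
T_var <- rexp(n, rate = 1 / (1 + X)) # variable of interest, E[T | X] = 1 + X
C_var <- rexp(n, rate = 0.3)         # censoring variable, independent of T
Y <- pmin(T_var, C_var)              # observed duration : Y = min(T, C)
delta <- as.numeric(T_var <= C_var)  # censoring indicator : delta = 1_{T <= C}
mean(1 - delta)                      # empirical rate of censoring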
Usage

sw_reg(y_var, delta_var, x_vars, train, test = NULL, type_reg = "RF",
       type_w = "KM", phi = function(x) { x }, phi.args = list(),
       max_time = NULL, sw_reg_obj = T, cens_mod_obj = T,
       ev_methods = c("concordance", "weighted"), bandwidths = NULL,
       types_w_ev = "KM", max_w_mod = NULL, max_w_ev = 1000, mat_w = NULL,
       y_no_cens_var = NULL, mode_sw_RF = 1, ntree = 100, minleaf = 5,
       maxdepth = 6, mtry = NULL, ...)
Arguments

y_var
A character string which gives the name of the Y variable.

delta_var
A character string which gives the name of the δ variable (this variable should be binary: 1 = non censored, 0 = censored).

x_vars
A vector of character strings which gives the names of the explanatory variables.

train
A data.frame of training observations. It must contain the columns named in y_var, delta_var and x_vars.

test
A data.frame of testing observations (default = NULL).

type_reg
A character string giving the regression algorithm to use in the model: "RF" (random forest) or "gam" (default = "RF").

type_w
A character string giving the type of IPC weights used to train the regression model: "KM", "Cox", "RSF" or "unif" (default = "KM").

phi
A function to be applied to the target variable (default = the identity function).

phi.args
A list of additional parameters for the function phi (default = list()).

max_time
A real number giving a threshold for the observed times: Y is replaced by Y' = min(Y, max_time) (default = NULL).

sw_reg_obj
A boolean which indicates if the regression model fitted to the training data should be returned (default = TRUE).

cens_mod_obj
A boolean which indicates if the model fitted to the censoring variable to compute the weights ("KM", "Cox" or "RSF") used for training should be returned (default = TRUE).

ev_methods
A vector of character strings which gives the methods that should be used for the evaluation of the model (default = c("concordance", "weighted"); "group" is also available).

bandwidths
A vector of real numbers giving the bandwidths to use for the model evaluation if ev_methods contains "group" (default = NULL).

types_w_ev
A vector of character strings which gives the types of weights to be used for IPCW (Inverse Probability of Censoring Weighting) in the model evaluation, among "KM", "Cox", "RSF" and "unif" (default = "KM").

max_w_mod
A real number which gives the maximum admissible ratio for the IPC weights used in model training (default = NULL, in which case an internal default is used). See Details - Evaluation criteria for more information.

max_w_ev
A real number which gives the maximum admissible ratio for the IPC weights used in model evaluation (default = 1000). See Details - Evaluation criteria for more information.

mat_w
A matrix to provide handmade IPC weights for the model evaluation (default = NULL). See Details - Evaluation criteria for more information.

y_no_cens_var
A character string which gives the name of the non censored version of the target variable, when it is available (default = NULL).

mode_sw_RF
An integer (1 or 2) which gives the mode of the random forest (default = 1). See Details - Random Forest modes.

ntree
A positive integer which gives the number of trees to grow in the forest (default = 100).

minleaf
A positive integer indicating the minimum number of observations that must be present in a (terminal) leaf (default = 5).

maxdepth
A positive integer indicating the maximum number of layers in individual trees (default = 6).

mtry
A positive integer indicating the number of variables randomly drawn at each node for the split selection (default = NULL, in which case an internal default is used).

...
Additional parameters that may be passed to the regression algorithm used (see Details - Additional parameters).
Details

Evaluation criteria

The quality of fit may be assessed through three different criteria. There are two main criteria, "weighted" and "concordance", and one criterion that is experimental, "group".

"weighted" : the weighted criterion is described in [Gerb. et al.]. This criterion aims to estimate the quadratic error of the model in the context of right censoring. It has the form ∑_i W_i (y_i - \hat{y}_i)^2, where (y_i)_i are the censored targets of the model, (W_i)_i are the IPC weights, and (\hat{y}_i)_i are the predictions made.

The types_w_ev argument allows the use of four kinds of IPC weights: "KM", "Cox", "RSF" and "unif". The first three types of weights correspond to different ways to estimate the survival function of the censoring. On the other hand, "unif" corresponds to W_i = 1 for all i.

Since the IPC weights may take too large values in some situations, max_w_ev allows thresholding of the weights (W_i)_i so that the ratio between the largest and the smallest weight does not exceed max_w_ev. If W_max denotes the resulting maximum admissible weight, the weights actually used are min(W_i, W_max) / ∑_i min(W_i, W_max). A sketch of this computation is given below.

You can also manually provide weights to be used for IPCW with the argument mat_w (those weights will also be thresholded w.r.t. max_w_ev). mat_w should be a matrix satisfying nrow(mat_w) = nrow(train) + nrow(test); any number of columns may be provided, and each column of the matrix then corresponds to a type of weights. Column names of the matrix are taken as names for the different types of weights. If there are no column names, the default names "w1", "w2", ... are used.
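
As an illustration, here is a sketch of the "weighted" criterion computed by hand with Kaplan-Meier IPC weights, on the toy data (X, Y, delta) simulated in the Description. The exact conventions of the package (e.g. the use of left limits of the censoring survival curve) may differ.

library(survival)
# Kaplan-Meier estimate of the survival function S_C of the censoring
km_cens <- survfit(Surv(Y, 1 - delta) ~ 1)
S_C <- stepfun(km_cens$time, c(1, km_cens$surv))
W <- delta / pmax(S_C(Y), 1e-8)                # raw IPC weights : delta_i / S_C(Y_i)
# threshold so that max(W) / min(W) (over non zero weights) <= max_w_ev
max_w_ev <- 1000
W_max <- max_w_ev * min(W[W > 0])
W_norm <- pmin(W, W_max) / sum(pmin(W, W_max)) # thresholded, normalised weights
# weighted quadratic error, with phi = identity and the oracle prediction E[T | X]
y_hat <- 1 + X
sum(W_norm * (Y - y_hat)^2)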
"concordance"
: The concordance is a classical measure of performance when modelling
censored variables. It indicates if the order of the predicted values of the model is similar to
the order of the observed values. The concordance generalizes the Kendall tau to the censored case.
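
For intuition, the concordance of regression predictions on censored data can be estimated with the concordance function of the survival package (a sketch on the toy data above, with y_hat from the previous sketch; this is an illustration, not necessarily the estimator used internally):

library(survival)
# proportion of comparable pairs (i, j) for which the order of the predictions
# agrees with the order of the observed survival times
concordance(Surv(Y, delta) ~ y_hat)$concordance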
"group"
: This is an experimental criteria that we didn't mention in [Gerb. et al.]. Here,
the idea to take the censoring into account is to measure errors given groups of observations and
not single observations. First, the test sample is ordered w.r.t. the predicted values of the
model. Second, respecting this order the test sample is splited into groups of size v_bandwidht[1]
, or
v_bandwidht[2]
, etc ... (each bandwidth in v_bandwidht
corresponds to a different
score "group"
).
Then, inside a group, an estimator of the survival function of T may be obtained by
Kaplan Meier, and we can deduce an estimator of phi
(T) by integration. This estimator
of phi
(T) may be
viewed as an "empirical" value of phi
(T) inside the group.
On the other other hand, each observation of a group is associated to a prediction of phi
(T).
The prediction for the group may be defined as the mean of the predictions of the observations
inside the group.
The resulting group scores correspond to the classical quality of fit criteria (e.g. quadratic error, Kendall tau, Gini index) but taken on groups and not on single observations.
The "group"
criteria is interesting in the context of big database, in which sufficient
number of groups are available. This criteria has a high variance if applied to small test sample.
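
The sketch below follows this textual description on the toy data above (a single bandwidth of 50 observations per group, phi = identity); it is illustrative and does not reproduce the package internals.

library(survival)
b <- 50
ord <- order(y_hat)                # 1) order the sample by predicted value
grp <- ceiling(seq_along(ord) / b) # 2) split it into consecutive groups of size b
emp <- pred_grp <- numeric(max(grp))
for (g in seq_len(max(grp))) {
  idx <- ord[grp == g]
  km <- survfit(Surv(Y[idx], delta[idx]) ~ 1) # within group Kaplan-Meier of T
  s <- c(1, km$surv); t <- c(0, km$time)
  emp[g] <- sum(head(s, -1) * diff(t)) # "empirical" E[T] : integral of the KM curve
  pred_grp[g] <- mean(y_hat[idx])      # group prediction : mean of the predictions
}
mean((emp - pred_grp)^2)               # e.g. quadratic error taken on groups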
Random Forest modes

The modes correspond to different ways to take the weights into account in the random forest:

mode_sw_RF = 1 : the weights are computed a single time (on the whole train sample) before the growing of the forest, and are passed to the forest as probabilities of sampling single observations for the bootstrap of the random forest. This mode corresponds to wRF1 in [Gerb. et al.]; it internally calls the rfsrc function.

mode_sw_RF = 2 : the weights are computed ntree times; for a given tree, a bootstrap sample is drawn uniformly with replacement and the weights are then evaluated on the bootstrap sample. The tree growing procedure then uses the weighted error as splitting criterion. Two types of predictions are made in this mode: the first prediction is output as pred_train/test and uses the same weights as those used for training to compute predictions in the terminal leaves (wRF2 in [Gerb. et al.]). The second prediction is output as pred_train/test_KMloc and makes the terminal leaf estimations by using Kaplan-Meier to estimate the within-leaf survival function of T (wRF3 in [Gerb. et al.]). This mode internally calls the rpart function.
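
As a rough illustration of mode 1, IPC weights can be supplied to randomForestSRC as bootstrap sampling probabilities through the case.wt argument of rfsrc. The sketch below uses the toy data and the weights W_norm computed above; it mimics the mechanism described here but is not a copy of the package code.

library(randomForestSRC)
dat <- data.frame(X = X, Y = Y)
# IPC weights as probabilities of sampling observations for the bootstrap;
# censored observations have weight 0 and are thus never drawn
rf <- rfsrc(Y ~ X, data = dat, ntree = 100, case.wt = W_norm)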
Additional parameters

sw_reg allows passing additional parameters to the underlying regression algorithm. Depending on type_reg and mode_sw_RF, the wrapped function is as follows:

type_reg = "RF" and mode_sw_RF = 1 : rfsrc

type_reg = "RF" and mode_sw_RF = 2 : rpart

type_reg = "gam" : gam

For instance, in the first case, one may pass to sw_reg a parameter that is then passed to rfsrc (e.g. proximity = TRUE), as in the sketch below.
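
Concretely, such a call could look as follows (a sketch using the transplant_bis data prepared in the Examples section below):

res <- sw_reg(y_var = "futime",
              delta_var = "delta",
              x_vars = setdiff(colnames(transplant_bis),
                               c("futime", "delta", "event")),
              train = transplant_bis,
              proximity = TRUE) # forwarded to rfsrc through "..."
res$sw_RF_obj$proximity        # proximity matrix computed by rfsrc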
Value

A list with the following elements:

pred_train
The vector of the predicted values for the observations of train.

pred_test
The vector of the predicted values for the observations of test (requires test != NULL).

perf_train
The list with the values for the evaluation criteria computed on the train set.

perf_test
The list with the values for the evaluation criteria computed on the test set (requires test != NULL).

w_mod_train
The vector of the weights used to train the model, after applying the max_w_mod threshold.

n_w_mod_modif_train
The vector giving the number of train weights modified due to max_w_mod.

mat_w_train
The matrix which contains the values of the weights used for the "weighted" evaluation criteria, for the train observations.

mat_w_test
The matrix which contains the values of the weights used for the "weighted" evaluation criteria, for the test observations.

sum_w_train
The sum of the gross weights for the train data, before applying the max_w_ev threshold.

sum_w_test
The sum of the gross weights for the test data, before applying the max_w_ev threshold.

n_w_ev_modif_train
The vector giving the number of train weights modified due to max_w_ev.

n_w_ev_modif_test
The vector giving the number of test weights modified due to max_w_ev.

sw_RF_obj
The object returned by rfsrc (requires type_reg = "RF", mode_sw_RF = 1 and sw_reg_obj = TRUE).

sw_gam_obj
The object returned by gam (requires type_reg = "gam" and sw_reg_obj = TRUE).

sw_rpartRF_obj
The object returned by rpart (requires type_reg = "RF", mode_sw_RF = 2 and sw_reg_obj = TRUE).

max_time
The real number giving the threshold used by the model.

cens_rate
The real number giving the rate of censoring of T', computed on the concatenation of train and test.

pred_train_KMloc
The vector of the predicted values for the train observations, with terminal leaf estimation by Kaplan-Meier (requires mode_sw_RF = 2).

pred_test_KMloc
The vector of the predicted values for the test observations, with terminal leaf estimation by Kaplan-Meier (requires mode_sw_RF = 2).

perf_train_KMloc
The list with the values for the evaluation criteria computed on the train set (requires mode_sw_RF = 2).

perf_test_KMloc
The list with the values for the evaluation criteria computed on the test set (requires mode_sw_RF = 2 and test != NULL).

surv_train_KMloc
The matrix which contains the estimated values of the survival curves at time_points, for the train observations (requires mode_sw_RF = 2).

surv_test_KMloc
The matrix which contains the estimated values of the survival curves at time_points, for the test observations (requires mode_sw_RF = 2).

time_points
The vector of the time points where the survival curves are evaluated (requires mode_sw_RF = 2).

train
The data.frame of the train data provided as arguments, plus the column Y' = min(Y, max_time).

test
The data.frame of the test data provided as arguments, plus the column Y' = min(Y, max_time).

type_reg, type_w, phi, phi.args, x_vars, max_w_mod, max_w_ev, mode_sw_RF
See the corresponding arguments.
References

Gerber, G., Le Faou, Y., Lopez, O., & Trupin, M. (2018). The impact of churn on prospect value in health insurance, evaluation using a random forest under random censoring. https://hal.archives-ouvertes.fr/hal-01807623/
See Also

rfsrc, rpart, gam, predict_sw_reg
Examples

# ------------------------------------------------
# Load "transplant" data
# ------------------------------------------------
data("transplant", package = "survival")
transplant$delta = 1 * (transplant$event == "ltx") # create the binary variable
# which indicates non censoring (1) / censoring (0)
# keep only rows with no missing value
apply(transplant, MARGIN = 2, FUN = function(x){sum(is.na(x))})
transplant_bis = transplant[stats::complete.cases(transplant),]
# plot the survival curve of transplant data
KM_transplant = survival::survfit(formula = survival::Surv(time = futime, event = delta) ~ 1,
                                  data = transplant_bis)
plot(KM_transplant)
# ------------------------------------------------
# Basic call to train a model
# ------------------------------------------------
res1 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis
)
# parameters set by default
res1$type_w
res1$type_reg
res1$max_w_mod
res1$max_w_ev
res1$mode_sw_RF # 1 corresponds to wRF1 in [Gerb. et al.]
# train errors
res1$perf_train
# ------------------------------------------------
# Training with estimation of test error
# ------------------------------------------------
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res2 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res2$max_time) # the default max_time has changed since the train set
# is different
# there is a huge overfitting in terms of quadratic errors
print(res2$perf_train)
print(res2$perf_test)
# default parameters for the random forest are
res2$sw_RF_obj$ntree
res2$sw_RF_obj$mtry
res2$sw_RF_obj$nodesize
res2$sw_RF_obj$nodedepth # means there is no depth limit
# -----------------------------------------------------
# Modify the max_time argument & look for
# the best model under this setting
# -----------------------------------------------------
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res30 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
type_w = "KM", # default value
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res30$perf_test)
# are the other types of weights giving better results ?
## Cox weights
res31 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
type_w = "Cox",
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res31$perf_test) # slight improvement compared with the KM weights
## RSF weights
res32 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
type_w = "RSF",
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res32$perf_test)
## unif weights
res33 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
type_w = "unif",
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res33$perf_test)
# In terms of quadratic error, the best weights are the Cox weights
# remark: in this example there is not a big difference between the "unif" weights
# and the other weights because there is little censoring in the data:
res33$cens_rate
# -----------------------------------------------------------------
# Try wRF2 and wRF3 (both are obtained with
# mode_sw_RF = 2)
# -----------------------------------------------------------------
res40 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
type_w = "Cox",
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"),
mode_sw_RF = 2)
print(res40$perf_test) # wRF2 : not as good as wRF1
print(res40$perf_test_KMloc) # wRF3 : worse than wRF2
# -------------------------------------------------------
# Try a GAM model
# -------------------------------------------------------
## GLM with Cox weights
res5 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
type_w = "Cox",
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"),
type_reg = "gam")
print(res5$perf_test) # not as good as random forest
# ------------------------------------------------------------
# Analyse the weights used for "weighted" criteria
# ------------------------------------------------------------
print(res31$cens_rate) # rate of censoring taking into account max_time
print(head(res31$mat_w_test))
## ratio max(weights)/min(weights)
print(apply(X = res31$mat_w_test,
MARGIN = 2,
FUN = function(x){max(x[x != 0])/min(x[x != 0])}))
# ratios are low because the censoring rate is low
# in this case, it is not meaningful to modify the
# max_w_ev argument since the maximum ratios
# between weights are around 2 and the test data has 197 rows.
# But in other situations it may be pertinent
# ------------------------------------------------
# Use a custom phi function
# ------------------------------------------------
g = function(x,a) abs(x-a)
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res6 = sw_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
phi = g,
phi.args = list(a = 200),
type_w = "Cox",
max_time = 600, # we set max_time to 600
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res6$perf_test) # test performances with the custom phi