sw_reg: Fit classical regression model (GAM, RF) on right censored...

Description Usage Arguments Details Value References See Also Examples

View source: R/sw_reg.R

Description

sw_reg is the core function of the package. It implements the method we study in [Gerber et al. (2018)] to adapt regression algorithms to right censored target variable. Given a right censored variable T, a function phi and covariates X, sw_reg aims to estimate E[phi(T)|X]. The methods is based on the IPCW (Inverse probability of Censoring Weighting) principle for right censored variables which is used to compensate for the censoring. Though the method may generalise to many regression algorithms, sw_reg only implements random forest and GAM solutions. Technicaly, sw_reg is a wrapper for rfsrc and rpart(random forest) and gam (GAM). Different methods are available to assess the quality of fit of rsf_reg.

The notations we use are :

Usage

1
2
3
4
5
6
7
sw_reg(y_var, delta_var, x_vars, train, test = NULL, type_reg = "RF",
  type_w = "KM", phi = function(x) {     x }, phi.args = list(),
  max_time = NULL, sw_reg_obj = T, cens_mod_obj = T,
  ev_methods = c("concordance", "weighted"), bandwidths = NULL,
  types_w_ev = "KM", max_w_mod = NULL, max_w_ev = 1000, mat_w = NULL,
  y_no_cens_var = NULL, mode_sw_RF = 1, ntree = 100, minleaf = 5,
  maxdepth = 6, mtry = NULL, ...)

Arguments

y_var

A character string which gives the name of the Y variable

delta_var

A character string which gives the name of the δ variable (this variable should be binary : 1 = non censored, 0 = censored)

x_vars

A vector of character strings which gives the names of the explanatory variables

train

A data.frame of training observations. It must contain the column names y_var, delta_var and x_vars

test

A data.frame of testing observations (default = NULL)

type_reg

A character string giving the regression algorithm to use in the model (default = "RF"). Other possible value is "gam".

type_w

A character string giving the type of IPC weights used to train the regression model (default = "KM"). Other possible values are "Cox", "RSF" and "unif". Set to colnames(mat_w)[1] if mat_w is provided.

phi

A function to be applied to y_var

phi.args

A list of additional parameters for the function phi (default = NULL). See Examples for a use case

max_time

A real number giving a threshold for y_var (default = NULL). If NULL, then max_time is set to the maximum non censored observation of y_var among the training set. We note T' = min(T, max_time)

sw_reg_obj

A boolean which indicates if the random forest model fitted to the training data should be returned (default = TRUE)

cens_mod_obj

A boolean which indicates if the model fitted to the censoring variable to compute the weights ("KM", "Cox" or "RSF") used for training should be returned (default = TRUE)

ev_methods

A vector of character strings which gives the methods that should be used for the evaluation of the model (default = c("concordance","weighted")). Possible choices are "concordance", "weighted" and "group". Multiple choices are possible. See Details - Evaluation criteria for more information

bandwidths

A vector of real numbers for the bandwidths to use for the model evaluation if "group" is used as an ev_methods (default = NULL : set to 50 if "group" in ev_methods). Only used if "group" is used as ev_methods. See Details - Evaluation criteria for more information

types_w_ev

A vector of character strings which gives the types of weights to be used for IPCW (Inverse Probability of Censoring Weighting) in the model evaluation (default = c("KM") (Kaplan Meier)). Possible choices are "KM", "Cox", "RSF" and "unif". Set to colnames(mat_w) if mat_w is provided. See Details - Evaluation criteria for more information

max_w_mod

A real number which gives the maximum admissible ratio for the IPC weights (default = NULL : then set to floor(sqrt(nrow(train))/2)) used in model fitting. See Details - Evaluation criteria for more information

max_w_ev

A real number which gives the maximum admissible ratio for the IPC weights (default = 1000) used in model evaluation. See Details - Evaluation criteria for more information

mat_w

A matrix to provide handmade IPC weights for the model evaluation (default = NULL). mat_w should satisfied nrow(mat_w) = nrow(train) + nrow(test) and a column should correspond to a type of weights (multiple columns are possible). Column names of mat_w may be used to specify names for the provided weights (by default names will be "w1", "w2", ...)

y_no_cens_var

A character string which gives the name of the non censored y_var (default = NULL). To be used only in the context of simulated data where full about is available

mode_sw_RF

An integer (1 or 2) which specifiy the type of weighted random forest to grow : 1 = wRF1, 2 = wRF2 or wRF3 (default = 1). See Details - Random Forest modes for more information. Only used if type_reg = "RF"

ntree

A positive integer which gives the number of trees to grow in the forest (default = 100).

minleaf

A positive integer indicating the minimum number of observations that must be present in a (terminal) leaf (default = 5).

maxdepth

A positive integer indicating the maximum number of layers in individual trees (default = 6).

mtry

A positive integer indicating the number of random variables to draw at each node for the split selection (default = NULL). If NULL, mtry is set to floor(sqrt(length(x_vars))) by default.

Warning (Exception to he latter statement) : if mode_sw_RF = 2, mtry is set to length(x_vars) and can not be modified

...

Additional parameter that may be pass to the regression algorithm used (see Details - Additional parameters)

Details

Value

A list with the following elements :

pred_train

The vector of the predicted values for phi(T') (with T' = min(T, max_time)) for the observations of the train set. See Details - Random Forest modes for more information

pred_test

The vector of the predicted values for phi(T') for the observations of the test set (require test != NULL). See Details - Random Forest modes for more information

perf_train

The list with the values for the evaluation criteria computed on the train set

perf_test

The list with the values for the evaluation criteria computed on the test set (require test != NULL))

w_mod_train

The vector of the weights used to train the model, after applying max_w_mod and normalising

n_w_mod_modif_train

The vector giving the number of train weights modified due to max_w_mod

mat_w_train

The matrix which contains the values of the weights used for the "weighted" criteria, for the observations of the train set

mat_w_test

The matrix which contains the values of the weights used for the "weighted" criteria, for the observations of the test set

sum_w_train

The sum of the gross weights for the train data, before applying max_w_ev and normalising

sum_w_test

The sum of the gross weights for the test data, before applying max_w_ev and normalising

n_w_ev_modif_train

The vector giving the number of train weights modified due to max_w_ev

n_w_ev_modif_test

The vector giving the number of test weights modified due to max_w_ev

sw_RF_obj

The obj returned by rfsrc (when type_reg = "RF" and mode_sw_RF = 1)

sw_gam_obj

The obj returned by gam (when type_reg = "gam")

sw_rpartRF_obj

The obj returned by rfsrc (when type_reg = "RF" and mode_sw_RF = 2)

max_time

The real number giving the threshold used by the model

cens_rate

The real number giving the rate of censoring of T', computed on the concatenation of train and test

pred_train_KMloc

The vector of the predicted values for phi(T') for the observations of the train set (require mode_sw_RF = 2). See Details - Random Forest modes for more information

pred_test_KMloc

The vector of the predicted values for phi(T') for the observations of the test set (require mode_sw_RF = 2). See Details - Random Forest modes for more information

perf_train_KMloc

The list with the values for the evaluation criteria computed on the train set (require mode_sw_RF = 2). See Details - Random Forest modes for more information

perf_test_KMloc

The list with the values for the evaluation criteria computed on the test set (require mode_sw_RF = 2). See Details - Random Forest modes for more information

surv_train_KMloc

The matrix which contains the estimated values of the survival curves at time_points (within leaf Kapaln Meier estimator), for the observations of the train set (require mode_sw_RF = 2)

surv_test_KMloc

The matrix which contains the estimated values of the survival curves at time_points (within leaf Kapaln Meier estimator), for the observations of the test set (require mode_sw_RF = 2)

time_points

The vector of the time points where the survival curves are evaluated (require mode_sw_RF = 2)

train

The data.frame of the train data provided as arguments, plus columns : Y' = min(Y, max_time ), δ' = 1_{T' ≤ C} and phi(T')

test

The data.frame of the test data provided as arguments, plus columns : Y' = min(Y, max_time ), δ' = 1_{T' ≤ C} and phi(T')

type_reg

See Argument

type_w

See Argument

phi

See Argument

phi.args

See Argument

x_vars

See Argument

max_w_mod

See Argument

max_w_ev

See Argument

mode_sw_RF

See Argument

References

Gerber, G., Le Faou, Y., Lopez, O., & Trupin, M. (2018). The impact of churn on prospect value in health insurance, evaluation using a random forest under random censoring. https://hal.archives-ouvertes.fr/hal-01807623/

See Also

rfsrc, rpart, gam, predict_sw_reg

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
# ------------------------------------------------
#   Load "transplant" data
# ------------------------------------------------
data("transplant", package = "survival")
transplant$delta = 1 * (transplant$event == "ltx") # create binary var
# which indicate censoring/non censoring

# keep only rows with no missing value
apply(transplant, MARGIN = 2, FUN = function(x){sum(is.na(x))})
transplant_bis = transplant[stats::complete.cases(transplant),]

# plot the survival curve of transplant data
KM_transplant = survfit(formula = survival::Surv(time = futime, event = delta) ~ 1,
                        data = transplant_bis)
plot(KM_transplant)

# ------------------------------------------------
#   Basic call to train a model
# ------------------------------------------------
res1 = sw_reg(y_var = "futime",
                                    delta_var = "delta",
                                    x_vars = setdiff(colnames(transplant_bis),
                                                     c("futime", "delta", "event")),
                                    train = transplant_bis
)
# parameters set by default
res1$type_w
res1$type_reg
res1$max_w_mod
res1$max_w_ev
res1$mode_sw_RF # 1 corresponds to wRF1 in [Gerb. et al.]


# train errors
res1$perf_train

# ------------------------------------------------
#   Training with estimation of test error
# ------------------------------------------------
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res2 = sw_reg(y_var = "futime",
                                    delta_var = "delta",
                                    x_vars = setdiff(colnames(transplant_bis),
                                                     c("futime", "delta", "event")),
                                    train = transplant_bis[train_lines,],
                                    test = transplant_bis[-train_lines,],
                                    types_w_ev = c("KM", "Cox", "RSF", "unif"))

print(res2$max_time) # default \code{max_time} has changed since train set
# is different

# there is a uge overfitting in terms of quadratic errors
print(res2$perf_train)
print(res2$perf_test)

# default parameters for the random forest are
res2$sw_RF_obj$ntree
res2$sw_RF_obj$mtry
res2$sw_RF_obj$nodesize
res2$sw_RF_obj$nodedepth # means there is no depth limit


# -----------------------------------------------------
#   Modify the \code{max_time} argument & look for
#       the best model under this setting
# -----------------------------------------------------

set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res30 = sw_reg(y_var = "futime",
                                    delta_var = "delta",
                                    x_vars = setdiff(colnames(transplant_bis),
                                                     c("futime", "delta", "event")),
                                    train = transplant_bis[train_lines,],
                                    test = transplant_bis[-train_lines,],
                                    type_w = "KM", # default value
                                    max_time = 600, # we set \code{max_time} to 600
                                    types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res30$perf_test)

# are the other types of weights giving better results ?
## Cox weights
res31 = sw_reg(y_var = "futime",
                                    delta_var = "delta",
                                    x_vars = setdiff(colnames(transplant_bis),
                                                     c("futime", "delta", "event")),
                                    train = transplant_bis[train_lines,],
                                    test = transplant_bis[-train_lines,],
                                    type_w = "Cox",
                                    max_time = 600, # we set \code{max_time} to 600
                                    types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res31$perf_test) # slight improvment compared with weights KM

## RSF weights
res32 = sw_reg(y_var = "futime",
                                     delta_var = "delta",
                                     x_vars = setdiff(colnames(transplant_bis),
                                                      c("futime", "delta", "event")),
                                     train = transplant_bis[train_lines,],
                                     test = transplant_bis[-train_lines,],
                                     type_w = "RSF",
                                     max_time = 600, # we set \code{max_time} to 600
                                     types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res32$perf_test)

## unif weights
res33 = sw_reg(y_var = "futime",
                                     delta_var = "delta",
                                     x_vars = setdiff(colnames(transplant_bis),
                                                      c("futime", "delta", "event")),
                                     train = transplant_bis[train_lines,],
                                     test = transplant_bis[-train_lines,],
                                     type_w = "unif",
                                     max_time = 600, # we set \code{max_time} to 600
                                     types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res33$perf_test)

# In terms of quadratic, the best weights are the Cox weights
# remark : in this example there is not a big difference between the "unif" weights
# and the other weights because there is little censoring in the data :
res33$cens_rate

# -----------------------------------------------------------------
#     Try wRF2  and wRF3 (both are obtained with
#               \code{mode_sw_RF = 2})
# -----------------------------------------------------------------

res40 = sw_reg(y_var = "futime",
                                     delta_var = "delta",
                                     x_vars = setdiff(colnames(transplant_bis),
                                                      c("futime", "delta", "event")),
                                     train = transplant_bis[train_lines,],
                                     test = transplant_bis[-train_lines,],
                                     type_w = "Cox",
                                     max_time = 600, # we set \code{max_time} to 600
                                     types_w_ev = c("KM", "Cox", "RSF", "unif"),
                                     mode_sw_RF = 2)
print(res40$perf_test) # wRF2 : not as good as wRF1
print(res40$perf_test_KMloc) # wRF3 : worse than wRF2


# -------------------------------------------------------
#               Try a GAM model
# -------------------------------------------------------

## GLM with Cox weights
res5 = sw_reg(y_var = "futime",
                                     delta_var = "delta",
                                     x_vars = setdiff(colnames(transplant_bis),
                                                      c("futime", "delta", "event")),
                                     train = transplant_bis[train_lines,],
                                     test = transplant_bis[-train_lines,],
                                     type_w = "Cox",
                                     max_time = 600, # we set \code{max_time} to 600
                                     types_w_ev = c("KM", "Cox", "RSF", "unif"),
                                    type_reg = "gam")
print(res5$perf_test) # not as good as random forest


# ------------------------------------------------------------
#     Analyse the weights used for "weighted" criteria
# ------------------------------------------------------------

print(res31$cens_rate) # rate of censoring taking into account \code{max_time}
print(head(res31$mat_w_test))
## ratio max(weights)/min(weights)
print(apply(X = res31$mat_w_test,
            MARGIN = 2,
            FUN = function(x){max(x[x != 0])/min(x[x != 0])}))
# ratios are low because the censoring rate is low

# in this case, it is not meaningful to to modify the
# \code{max_w_ev} argument since the maximum ratios
# between weights are around 2 and the test data has 197 rows.
# But in other situation it may be pertinent


# ------------------------------------------------
#   Use custom \code{phi} function
# ------------------------------------------------
g = function(x,a) abs(x-a)
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res6 = sw_reg(y_var = "futime",
                                    delta_var = "delta",
                                    x_vars = setdiff(colnames(transplant_bis),
                                                     c("futime", "delta", "event")),
                                    train = transplant_bis[train_lines,],
                                    test = transplant_bis[-train_lines,],
                                    phi = g,
                                    phi.args = list(a = 200),
                                    type_w = "Cox",
                                    max_time = 600, # we set \code{max_time} to 600
                                    types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res6$perf_test) # slight improvment compared with weights KM

YohannLeFaou/sword documentation built on May 28, 2019, 3:21 p.m.