Description
cox_reg is a benchmark model used in [Gerber et al. (2018)]. To model the variable phi(T), where T is a right censored time and phi is a given function, we first fit a Cox model to the data to estimate the survival function of T given the covariates. Then, we deduce an estimator of phi(T) by integrating the function phi with respect to the estimated survival function. Different methods are available to assess the quality of fit of cox_reg. cox_reg is a wrapper for the coxph function.
The notations we use are:

C : the censoring variable
Y = min(T, C) : the observed (possibly censored) time
δ = 1_{T ≤ C} : the censoring indicator (delta)
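The final step of the description, deducing an estimator of phi(T) by integrating phi against the estimated survival function, can be sketched as follows. This is a toy illustration with hypothetical survival values, not the package implementation:

```r
# Given an estimated survival curve S evaluated at time points
# t_1 < ... < t_K (with S(t_K) = 0 here for simplicity), estimate
# phi(T) by integrating phi against the implied distribution:
#   E[phi(T)] ~ sum_k phi(t_k) * (S(t_{k-1}) - S(t_k)),  S(t_0) = 1
integrate_phi <- function(phi, time_points, surv, ...) {
  # probability mass put on each time point by the survival curve
  jumps <- c(1, surv[-length(surv)]) - surv
  sum(sapply(time_points, phi, ...) * jumps)
}

# toy survival curve: S(1) = 0.7, S(2) = 0.4, S(3) = 0
integrate_phi(identity, time_points = c(1, 2, 3), surv = c(0.7, 0.4, 0))
```

With phi = identity this returns the implied mean of T; the `...` argument mirrors the role of phi.args for functions with extra parameters.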
Arguments
y_var
A character string which gives the name of the Y variable.

delta_var
A character string which gives the name of the δ variable (this variable should be binary: 1 = non censored, 0 = censored).

x_vars
A vector of character strings which gives the names of the explanatory variables.

train
A data.frame of training observations. It must contain the columns named in y_var, delta_var and x_vars.

test
A data.frame of testing observations (default = ...).

phi
A function to be applied to T.

phi.args
A list of additional arguments for the function phi.

max_time
A real number giving a threshold for Y (the model then works with Y' = min(Y, max_time)).

cox_obj
A boolean which indicates if the Cox model fitted to the training data should be returned (default = ...).

ev_methods
A vector of character strings which gives the methods that should be used for the evaluation of the model (default = ...).

bandwidths
A vector of real numbers giving the bandwidths to use for the model evaluation if "group" is among ev_methods.

types_w_ev
A vector of character strings which gives the types of weights to be used for IPCW (Inverse Probability of Censoring Weighting) in the model evaluation (default = ...).

max_w_ev
A real number which gives the maximum admissible ratio for the IPC weights (default = 1000). See Details - Evaluation criteria for more information.

mat_w
A matrix providing handmade IPC weights for the model evaluation (default = ...).

y_no_cens_var
A character string which gives the name of the non censored version of the Y variable, when available.

...
Additional parameters that may be passed to the coxph function.
Details

Evaluation criteria

The quality of fit may be assessed through three different criteria. There are two main criteria, "weighted" and "concordance", and one experimental criterion, "group".
"weighted" : the weighted criterion is described in [Gerb. et al.]. It aims to estimate the quadratic error of the model in the context of right censoring. It has the form ∑_i W_i (y_i - \hat{y}_i)^2, where (y_i)_i are the censored targets of the model, (W_i)_i are the IPC weights, and (\hat{y}_i)_i are the predictions made.

The types_w_ev argument allows the use of four kinds of IPC weights: "KM", "Cox", "RSF" and "unif". The first three types of weights correspond to different ways to estimate the survival function of the censoring. On the other hand, "unif" corresponds to W_i = 1 for all i.
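As a minimal sketch (with made-up numbers, not package output), the weighted criterion can be computed by hand from targets, predictions and normalized IPC weights:

```r
# Weighted quadratic error  sum_i W_i (y_i - y_hat_i)^2,
# assuming the weights W are already normalized (sum(W) == 1).
weighted_quad_error <- function(y, y_hat, W) {
  sum(W * (y - y_hat)^2)
}

y     <- c(3, 5, 2, 7)       # hypothetical targets
y_hat <- c(2.5, 5.5, 2, 6)   # hypothetical model predictions
W     <- rep(1 / 4, 4)       # "unif" weights (W_i = 1 for all i), normalized

weighted_quad_error(y, y_hat, W)
```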
Since the IPC weights may take too large values in some situations, max_w_ev allows to threshold the weights (W_i)_i so that the ratio between the largest and the smallest weights doesn't exceed max_w_ev. If W_max denotes the maximum admissible weight, the considered weights are then

min(W_i, W_max) / ( ∑_i min(W_i, W_max) )
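The thresholding step above can be sketched as follows (a toy illustration with hypothetical weights, not the package code):

```r
# Cap each weight so that max(W)/min(W) <= max_ratio,
# then renormalize so the weights sum to 1.
threshold_weights <- function(W, max_ratio) {
  W_max <- min(W) * max_ratio   # largest admissible weight
  W_cap <- pmin(W, W_max)
  W_cap / sum(W_cap)
}

W <- c(1, 2, 50, 4)             # raw IPC weights; 50 is an outlier
round(threshold_weights(W, max_ratio = 10), 3)
```

Here the outlying weight 50 is capped at 10 before normalization, so no single observation dominates the evaluation.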
You can also manually provide the weights to be used for IPCW with the argument mat_w (those weights will also be thresholded w.r.t. max_w_ev). mat_w should be a matrix satisfying nrow(mat_w) = nrow(train) + nrow(test); any number of columns may be provided, and each column of the matrix then corresponds to a type of weights. The column names of the matrix are taken as names for the different types of weights. If there are no column names, the default names are "w1", "w2", ...
"concordance" : the concordance is a classical measure of performance when modelling censored variables. It indicates whether the order of the predicted values of the model is similar to the order of the observed values. The concordance generalizes the Kendall tau to the censored case.
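A simplified sketch of such a concordance index (in the spirit of Harrell's C, ignoring ties in predictions; this is an illustration, not the package implementation):

```r
# A pair (i, j) is comparable when the smaller observed time is not
# censored; it is concordant when the predictions are ordered like
# the observed times (smaller time -> smaller prediction).
concordance_sketch <- function(y, delta, pred) {
  conc <- comp <- 0
  n <- length(y)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (y[i] != y[j]) {
      k <- if (y[i] < y[j]) i else j   # index of the smaller time
      if (delta[k] == 1) {             # comparable pair
        comp <- comp + 1
        l <- setdiff(c(i, j), k)       # index of the larger time
        if (pred[k] < pred[l]) conc <- conc + 1
      }
    }
  }
  conc / comp
}
```

For perfectly ordered predictions the index is 1; for perfectly reversed predictions it is 0.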
"group" : this is an experimental criterion that we didn't mention in [Gerb. et al.]. Here, the idea for taking the censoring into account is to measure the errors on groups of observations rather than on single observations. First, the test sample is ordered w.r.t. the predicted values of the model. Second, respecting this order, the test sample is split into groups of size bandwidths[1], or bandwidths[2], etc. (each bandwidth in bandwidths corresponds to a different "group" score).

Then, inside a group, an estimator of the survival function of T may be obtained by Kaplan-Meier, and we can deduce an estimator of phi(T) by integration. This estimator of phi(T) may be viewed as an "empirical" value of phi(T) inside the group.

On the other hand, each observation of a group is associated with a prediction of phi(T). The prediction for the group may be defined as the mean of the predictions of the observations inside the group.

The resulting group scores correspond to the classical quality of fit criteria (e.g. quadratic error, Kendall tau, Gini index) but computed on groups instead of single observations.

The "group" criterion is interesting in the context of large databases, in which a sufficient number of groups is available. This criterion has a high variance if applied to a small test sample.
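The group construction can be sketched as follows. For illustration the group-level "empirical" value is taken as the plain group mean of the targets, whereas the text above uses a Kaplan-Meier estimate integrated against phi; all values are hypothetical:

```r
# Order the sample by predicted values, split it into consecutive
# groups of size v, and compare the mean prediction of each group
# with a group-level empirical value of the target.
group_scores <- function(y, pred, v) {
  ord <- order(pred)                        # order by prediction
  g <- ceiling(seq_along(ord) / v)          # group labels of size v
  data.frame(
    group_pred = as.numeric(tapply(pred[ord], g, mean)),
    group_emp  = as.numeric(tapply(y[ord], g, mean))
  )
}

group_scores(y = c(1, 2, 3, 4), pred = c(4, 3, 2, 1), v = 2)
```

Classical criteria (quadratic error, Kendall tau, ...) are then computed on the columns group_pred and group_emp instead of individual observations.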
Value

A list with the following elements:
pred_train
The vector of the predicted values for the train observations.

pred_test
The vector of the predicted values for the test observations.

perf_train
The list with the values for the evaluation criteria computed on the train set.

perf_test
The list with the values for the evaluation criteria computed on the test set (requires the test argument).

surv_train
The matrix which contains the estimated values of the survival curves at time_points, for the train observations.

surv_test
The matrix which contains the estimated values of the survival curves at time_points, for the test observations.

time_points
The vector of the time points where the survival curves are evaluated.

mat_w_train
The matrix which contains the values of the weights used for the model evaluation, for the train observations.

mat_w_test
The matrix which contains the values of the weights used for the model evaluation, for the test observations.

n_w_ev_modif_train
The vector giving the number of train weights modified due to the max_w_ev threshold.

n_w_ev_modif_test
The vector giving the number of test weights modified due to the max_w_ev threshold.

cox_obj
The object returned by the coxph function.

max_time
The real number giving the threshold used by the model.

cens_rate
The real number giving the rate of censoring of T', computed on the concatenation of train and test.

train
The data.frame of the train data provided as argument, plus additional columns, including Y' = min(Y, max_time).

test
The data.frame of the test data provided as argument, plus additional columns, including Y' = min(Y, max_time).

phi
See the phi argument.

phi.args
See the phi.args argument.

x_vars
See the x_vars argument.

max_w_ev
See the max_w_ev argument.
References

Gerber, G., Le Faou, Y., Lopez, O., & Trupin, M. (2018). The impact of churn on prospect value in health insurance, evaluation using a random forest under random censoring. https://hal.archives-ouvertes.fr/hal-01807623/
Examples

# ------------------------------------------------
# Load "transplant" data
# ------------------------------------------------
data("transplant", package = "survival")
transplant$delta = 1 * (transplant$event == "ltx") # create the binary variable delta
# which indicates censoring (0) / non censoring (1)
# count the missing values per column
apply(transplant, MARGIN = 2, FUN = function(x){sum(is.na(x))})
# keep only rows with no missing value
transplant_bis = transplant[stats::complete.cases(transplant),]
# plot the survival curve of transplant data
KM_transplant = survival::survfit(formula = survival::Surv(time = futime, event = delta) ~ 1,
                                  data = transplant_bis)
plot(KM_transplant)
# ------------------------------------------------
# Basic call to train a model
# ------------------------------------------------
res1 = cox_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),
c("futime", "delta", "event")),
train = transplant_bis,
types_w_ev = c("KM", "Cox", "RSF", "unif"))
matplot(y = t(res1$surv_train[1:30,]), x = res1$time_points, type = "l")
print(res1$perf_train)
print(res1$max_time) # by default max_time is set to 2055, which is too large
# to get good predictions (training R2 with different weights are negative!)
# ------------------------------------------------
# Training with estimation of test error
# ------------------------------------------------
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res2 = cox_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res2$max_time) # default max_time has changed since the train set
# is different
# train error is now positive but test error is still negative
print(res2$perf_train)
print(res2$perf_test)
# visualise the predictions
print(res2$pred_test[1:30])
matplot(y = t(res2$surv_test[1:30,]), x = res2$time_points, type = "l")
# ------------------------------------------------
# Modify the max_time argument
# ------------------------------------------------
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res3 = cox_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
max_time = 600,
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res3$perf_train)
print(res3$perf_test) # test error is much better
print(res3$pred_test[1:30])
matplot(y = t(res3$surv_test[1:30,]), x = res3$time_points, type = "l")
# analyse the weights used for "weighted" criteria
print(res3$cens_rate) # rate of censoring, taking max_time into account
print(head(res3$mat_w_test))
## ratio max(weights)/min(weights)
print(apply(X = res3$mat_w_test,
MARGIN = 2,
FUN = function(x){max(x[x != 0])/min(x[x != 0])}))
# ratios are low because the censoring rate is low
# in this case, it is not meaningful to modify the
# max_w_ev argument since the maximum ratios
# between weights are around 2 and the test data has 197 rows.
# But in other situations it may be pertinent
# ------------------------------------------------
# Use a custom phi function
# ------------------------------------------------
g = function(x,a) abs(x-a)
set.seed(17)
train_lines = sample(1:nrow(transplant_bis), 600)
res4 = cox_reg(y_var = "futime",
delta_var = "delta",
x_vars = setdiff(colnames(transplant_bis),c("futime", "delta", "event")),
train = transplant_bis[train_lines,],
test = transplant_bis[-train_lines,],
phi = g,
phi.args = list(a = 200), # set value for "a"
max_time = 600,
types_w_ev = c("KM", "Cox", "RSF", "unif"))
print(res4$perf_test)
print(res4$pred_test[1:30])