README.md

regsurv

The goal of regsurv is to aid in parametric survival model development in settings where regularization is required. It provides flexible log hazard and log cumulative hazard models in which the baseline (cumulative) hazard is modeled by means of restricted cubic splines. The main functions for data preparation, model fitting, cross-validation, and prediction are illustrated in this document. First, however, let’s install the package.

Installation

You can install the development version of regsurv from GitHub with:

# install.packages("devtools")
devtools::install_github("jeroenhoogland/regsurv")

Example

Documentation is available for all functions and can be accessed using ? followed by the function name. For instance, ?regsurv will provide the documentation for the regsurv() function.

A basic analysis example follows below:

library(regsurv) # load the package

head(simdata) # tte data with 9 covariates
#>   id  eventtime status         x1          x2          x3         x4
#> 1  1  0.9613914      1 -0.6264538  0.13622189  0.07730312 -1.4546914
#> 2  2  5.2742237      1  0.1836433  0.40716760 -0.29686864 -0.8456543
#> 3  3 10.0000000      0 -0.8356286 -0.06965481 -1.18324224 -1.2504797
#> 4  4  2.6667278      1  1.5952808 -0.24766434  0.01129269  0.6672881
#> 5  5  0.2201874      1  0.3295078  0.69555081  0.99160104 -1.2907697
#> 6  6  2.3345753      1 -0.8204684  1.14622836  1.59396745 -2.0350035
#>            x5          x6         x7          x8          x9
#> 1  1.13496509  0.06526642  0.8500435  0.25313781 -0.88614959
#> 2  1.11193185  1.03532642 -0.9253130  0.01149799 -1.92225490
#> 3 -0.87077763  2.26021579  0.8935812 -1.97578396  1.61970074
#> 4  0.21073159  1.31469628 -0.9410097 -0.42632672  0.51926990
#> 5  0.06939565 -0.87002335  0.5389521  0.61230256 -0.05584993
#> 6 -1.66264885 -0.50313004 -0.1819744  1.23081411  0.69641761
simdata <- simdata[1:100, ] # 100 cases

# prepare the data for model fitting
prep <- survprep(tte=simdata$eventtime,
                 delta=simdata$status,
                 X=as.matrix(simdata[ ,grep("x", names(simdata))]),
                 model.scale="loghazard",
                 time.scale="logtime",
                 spline.type="rcs",
                 ntimebasis=4,
                 qpoints=9)

Note that the covariate data argument X requires matrix format. The ntimebasis argument specifies the number of basis columns for the restricted cubic spline, and qpoints specifies the number of quadrature points for the Gauss-Legendre numerical integration required to evaluate the log-likelihood.
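As a brief aside, the sketch below illustrates what Gauss-Legendre quadrature does in this context: it approximates a cumulative hazard, the integral of the hazard up to time t, by a weighted sum of hazard evaluations. The Weibull hazard and the statmod package are used for illustration only and are not part of regsurv.

library(statmod) # provides gauss.quad(); illustration only, not a regsurv dependency

# Weibull hazard, chosen because its cumulative hazard has a closed form to compare against
h <- function(u, shape = 1.5, scale = 2) (shape / scale) * (u / scale)^(shape - 1)

t <- 3
gq <- gauss.quad(9, kind = "legendre")  # 9 nodes and weights on [-1, 1], cf. qpoints = 9
u <- 0.5 * t * (gq$nodes + 1)           # map the nodes to [0, t]
H.approx <- 0.5 * t * sum(gq$weights * h(u))
H.exact <- (t / 2)^1.5                  # closed-form Weibull cumulative hazard (t / scale)^shape
c(H.approx, H.exact)                    # the two should agree closely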

Among other things, survprep() creates the model matrix (scaled by default), which can be viewed using

head(prep$mm$d) # columns of the model matrix (scaled by default)
#>   int      basis1    basis2      basis3     basis4         x1          x2
#> 1   1 -0.03937364 0.2234748 0.000000000 0.00000000 -0.6264538  0.13622189
#> 2   1  1.66283150 2.2361864 0.102974713 0.01459886  0.1836433  0.40716760
#> 3   1  2.30258509 3.7686726 0.341972242 0.09622914 -0.8356286 -0.06965481
#> 4   1  0.98085219 1.0828761 0.006637081 0.00000000  1.5952808 -0.24766434
#> 5   1 -1.51327611 0.0000000 0.000000000 0.00000000  0.3295078  0.69555081
#> 6   1  0.84783000 0.9186898 0.002361127 0.00000000 -0.8204684  1.14622836
#>            x3         x4          x5          x6         x7          x8
#> 1  0.07730312 -1.4546914  1.13496509  0.06526642  0.8500435  0.25313781
#> 2 -0.29686864 -0.8456543  1.11193185  1.03532642 -0.9253130  0.01149799
#> 3 -1.18324224 -1.2504797 -0.87077763  2.26021579  0.8935812 -1.97578396
#> 4  0.01129269  0.6672881  0.21073159  1.31469628 -0.9410097 -0.42632672
#> 5  0.99160104 -1.2907697  0.06939565 -0.87002335  0.5389521  0.61230256
#> 6  1.59396745 -2.0350035 -1.66264885 -0.50313004 -0.1819744  1.23081411
#>            x9
#> 1 -0.88614959
#> 2 -1.92225490
#> 3  1.61970074
#> 4  0.51926990
#> 5 -0.05584993
#> 6  0.69641761
prep$which.param # helper to see which parameters relate to 
#> [[1]]
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> [1]  6  7  8  9 10 11 12 13 14
#> 
#> [[3]]
#> NULL
# baseline (cumulative) hazard [[1]], main effects [[2]], and 
# time-varying effects [[3]] (corresponding to prep$mm$d)
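The which.param list can also be used programmatically, for instance to count the parameters per component; this comes in handy when building the penalty specification below.

sapply(prep$which.param, length) # parameters per component: baseline, main effects, time-varying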

Model fitting is performed using regsurv(). First, however, the desired penalty needs to be specified. This is done by indicating which parameters should (1) and should not (0) be penalized via the penpars argument, and by choosing the type of penalty via the l1l2 argument (0 for ridge, 1 for lasso, values in between for elastic net).

# For instance, penalize the main effects with a lasso penalty:
penpars <- c(rep(0, length(prep$which.param[[1]])), 
             rep(1, length(prep$which.param[[2]])))
l1l2 <- rep(1, 14)

# fit model over the default lambda grid
mod <- regsurv(prep, penpars, l1l2)
plot(mod) # non-baseline parameters, on the original scale

plot(mod, incl.baseline = TRUE, scaled.betas = TRUE) # include baseline parameters (dotted), 

# plotting coefficients for the scaled data.
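For comparison, a ridge penalty on the same parameters could be specified by following the same pattern with l1l2 set to 0 (the object names below are illustrative):

# ridge penalty on the main effects; values strictly between 0 and 1 would give elastic net
l1l2.ridge <- rep(0, 14)
mod.ridge <- regsurv(prep, penpars, l1l2.ridge)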

Subsequently, 5-fold cross-validation can be performed using cv.regsurv()

set.seed(123)
cv <- cv.regsurv(mod, prep, nfolds = 5, plot=TRUE)

The dotted line shows cv$lambda.min, the lambda value minimizing the deviance.
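The selected penalty value and its position in the lambda grid are both stored in the cross-validation object and can be inspected directly:

cv$lambda.min       # lambda value minimizing the cross-validated deviance
cv$lambda.min.index # its index in the lambda grid (used for prediction below)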

Coefficients at lambda.min can be derived using

# ?coef.regsurv
coef(mod, s = cv$lambda.min)
#>          int       basis1       basis2       basis3       basis4           x1 
#>  -1.65886451   1.58897010  -1.37397564  10.61867922 -11.86023252   0.48510186 
#>           x2           x3           x4           x5           x6           x7 
#>   0.63555800   0.30636952  -0.33847814   0.00000000   0.00000000  -0.07699789 
#>           x8           x9 
#>   0.00000000  -0.03385316
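Since the lasso sets some coefficients exactly to zero, the retained main effects can be listed with a small convenience sketch (base R, not a regsurv function); it assumes that coef() returns the parameters in the column order of prep$mm$d, as indexed by prep$which.param.

b <- coef(mod, s = cv$lambda.min)
b.main <- b[prep$which.param[[2]]] # main effects only (see which.param above)
b.main[b.main != 0]                # covariates retained by the lasso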

Predictions can be derived using predict()

# ?predict.regsurv
pred <- predict(mod, prep, 
        lambda.index = cv$lambda.min.index, 
        type = "surv")
plot(pred ~ simdata$eventtime, ylab="Shat", xlab="time to event")


