# Regularized Cox Regression In glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models

# the code in this chunk enables us to truncate the print output for each
# chunk using the out.lines option
# save the built-in output hook
hook_output <- knitr::knit_hooks$get("output") # set a new output hook to truncate text output knitr::knit_hooks$set(output = function(x, options) {
if (!is.null(n <- options$out.lines)) { x <- xfun::split_lines(x) if (length(x) > n) { # truncate the output x <- c(head(x, n), "....\n") } x <- paste(x, collapse = "\n") } hook_output(x, options) }) ## Introduction This vignette describes how one can use the glmnet package to fit regularized Cox models. The Cox proportional hazards model is commonly used for the study of the relationship beteween predictor variables and survival time. In the usual survival analysis framework, we have data of the form$(y_1, x_1, \delta_1), \ldots, (y_n, x_n, \delta_n)$where$y_i$, the observed time, is a time of failure if$\delta_i$is 1 or a right-censored time if$\delta_i$is 0. We also let$t_1 < t_2 < \ldots < t_m$be the increasing list of unique failure times, and let$j(i)$denote the index of the observation failing at time$t_i$. The Cox model assumes a semi-parametric form for the hazard $$h_i(t) = h_0(t) e^{x_i^T \beta},$$ where$h_i(t)$is the hazard for patient$i$at time$t$,$h_0(t)$is a shared baseline hazard, and$\beta$is a fixed, length$p$vector. In the classic setting$n \geq p$, inference is made via the partial likelihood $$L(\beta) = \prod_{i=1}^m \frac{e^{x_{j(i)}^T \beta}}{\sum_{j \in R_i} e^{x_j^T \beta}},$$ where$R_i$is the set of indices$j$with$y_j \geq t_i$(those at risk at time$t_i$). Note there is no intercept in the Cox model as it is built into the baseline hazard, and like it, would cancel in the partial likelihood. In glmnet, we penalize the negative log of the partial likelihood with an elastic net penalty. (Credits: The original "coxnet" algorithm for right-censored data was developed by Noah Simon, Jerome Friedman, Trevor Hastie and Rob Tibshirani: see @coxnet for details. The other features for Cox models, introduced in v4.1, were developed by Kenneth Tay, Trevor Hastie, Balasubramanian Narasimhan and Rob Tibshirani.) ## Basic usage for right-censored data We use a pre-generated set of sample data and response. x must be an$n\times p$matrix of covariate values --- each row corresponds to a patient and each column a covariate. y is an$n \times 2$matrix, with a column "time" of failure/censoring times, and "status" a 0/1 indicator, with 1 meaning the time is a failure time, and 0 a censoring time. The Surv function in the survival package creates such a response matrix, and it is recommended that the user uses the output of a call to Surv for the response to glmnet. (For backward compatibility, glmnet can accept a two-column matrix with column names "time" and "status" for right-censored data.) library(glmnet) library(survival) data(CoxExample) x <- CoxExample$x
cvfit$lambda.1se Second, the option grouped = TRUE (default) obtains the CV partial likelihood for the Kth fold by subtraction, i.e. by subtracting the log partial likelihood evaluated on the full dataset from that evaluated on the$(K-1)/K$dataset. This makes more efficient use of risk sets. With grouped = FALSE the log partial likelihood is computed only on the$K$th fold, which is only reasonable if each fold has a large number of observations. ### Handling of ties glmnet handles ties in survival time with the Breslow approximation. This is different from survival package's coxph function, whose default tie-handling method is the Efron approximation. # create x matrix set.seed(1) nobs <- 100; nvars <- 15 x <- matrix(rnorm(nobs * nvars), nrow = nobs) # create response ty <- rep(rexp(nobs / 5), each = 5) tcens <- rbinom(n = nobs, prob = 0.3, size = 1) y <- Surv(ty, tcens) # coefficients from these two models will not line up because # of different tie handling methods glmnet_fit <- glmnet(x, y, family = "cox", lambda = 0) coxph_fit <- coxph(y ~ x) plot(coef(glmnet_fit), coef(coxph_fit)) abline(0, 1) glmnet is not able to perform the Efron approximation at the moment. survival's coxph can perform the Breslow approximation by specifying ties = "breslow": # coefficients from these two models will line up glmnet_fit <- glmnet(x, y, family = "cox", lambda = 0) coxph_fit <- coxph(y ~ x, ties = "breslow") plot(coef(glmnet_fit), coef(coxph_fit)) abline(0, 1) ## Cox models for start-stop data Since version 4.1 glmnet can fit models where the response is a (start, stop] time interval. As explained in @Therneau2000, the ability to work with start-stop responses opens the door to fitting regularized Cox models with • time-dependent covariates, • time-dependent strata, • left truncation, • multiple time scales, • multiple events per subject, • independent increment, marginal, and conditional models for correlated data, and • various forms of case-cohort models. The code below shows how to create a response of this type (using survival package's Surv function) and how to fit such a model with glmnet. # create x matrix set.seed(2) nobs <- 100; nvars <- 15 xvec <- rnorm(nobs * nvars) xvec[sample.int(nobs * nvars, size = 0.4 * nobs * nvars)] <- 0 x <- matrix(xvec, nrow = nobs) # non-sparse x x_sparse <- Matrix::Matrix(xvec, nrow = nobs, sparse = TRUE) # sparse x # create start-stop data response beta <- rnorm(5) fx <- x_sparse[, 1:5] %*% beta / 3 ty <- rexp(nobs, drop(exp(fx))) tcens <- rbinom(n = nobs, prob = 0.3, size = 1) starty <- runif(nobs) yss <- Surv(starty, starty + ty, tcens) # fit regularized Cox model with start-stop data fit <- glmnet(x, yss, family = "cox") (Note that the call above would have worked as well if x was replaced by x_sparse.) cv.glmnet works with start-stop data too: cv.fit <- cv.glmnet(x, yss, family = "cox", nfolds = 5) plot(cv.fit) As a sanity check, the code below shows that fitting start-stop responses using glmnet with lambda = 0 matches up with coxph's result: glmnet_fit <- glmnet(x, yss, family = "cox", lambda = 0) coxph_fit <- coxph(yss ~ x) plot(coef(glmnet_fit), coef(coxph_fit)) abline(0, 1) ## Stratified Cox models One extension of the Cox regression model is to allow for strata that divide the observations into disjoint groups. Each group has its own baseline hazard function, but the groups share the same coefficient vector for the covariates provided by the design matrix x. glmnet can fit stratified Cox models with the elastic net penalty. With coxph one can specify strata in the model formula. Since glmnet does not use a model formula, we achieve this by adding a strata attribute to the Surv response object. We achieve this via the function stratifySurv: strata <- rep(1:5, length.out = nobs) y2 <- stratifySurv(y, strata) str(y2[1:6]) stratifySurv returns an object of class stratifySurv. We can then pass this stratifySurv object as the response to a glmnet call. glmnet will fit a stratified Cox model if it detects that the response has class stratifySurv. fit <- glmnet(x, y2, family = "cox") This stratifySurv object can also be passed to cv.glmnet to fit stratified Cox models with cross-validation: cv.fit <- cv.glmnet(x, y2, family = "cox", nfolds = 5) plot(cv.fit) Note that simply giving the response a "strata" attribute is not enough! The response needs to be of class stratifySurv in order for subsetting to work correctly. To protect against this, an error will be thrown if the response has a "strata" attribute but is not of class stratifySurv. Add strata via the stratifySurv function. y3 <- y attr(y3, "strata") <- strata str(y3[1:6]) # note that the strata attribute is no longer there fit <- glmnet(x, y3, family = "cox") ## Plotting survival curves Fitting a regularized Cox model using glmnet with family = "cox" returns an object of class "coxnet". Class "coxnet" objects have a survfit method which allows the user to visualize the survival curves from the model. In addition to the "coxnet" object, the user must pass the x and y objects used to fit the model (for computation of the baseline hazard), as well as the lambda value for which the survival curve is desired: fit <- glmnet(x, y, family = "cox") survival::survfit(fit, s = 0.05, x = x, y = y) We are unable to provide standard errors for these survival curves, so we do not present the confidence bounds for them. To plot the survival curve, pass the result of the survfit call to plot: plot(survival::survfit(fit, s = 0.05, x = x, y = y)) As noted in the documentation for survival::survfit.coxph, without new data, a curve is produced for a single "pseudo" subject with covariate values equal to the means of the data set, and this resulting curve(s) almost never make sense. We can get survival curves for individual observations by passing a newx argument: survival::survfit(fit, s = 0.05, x = x, y = y, newx = x[1:3, ]) plot(survival::survfit(fit, s = 0.05, x = x, y = y, newx = x[1:3, ])) If the original model was fit with strata, then the strata option needs to be specified as well. If newx is being passed for such a model, the strata for these new observations need to be passed via newstrata. y2 <- stratifySurv(y, rep(1:2, length.out = nobs)) fit <- glmnet(x, y2, family = "cox") survival::survfit(fit, s = 0.01, x = x, y = y2) # survival curve plot for first two individuals in dataset plot(survival::survfit(fit, s = 0.01, x = x, y = y2, newx = x[1:2, ], newstrata = strata[1:2])) To be consistent with other methods in glmnet, if the s parameter is not specified, survival curves are returned for the entire lambda sequence. The survival curves are returned as a list, one element for each lambda value. sf <- survival::survfit(fit, x = x, y = y2) length(sf) length(fit$lambda)

The survfit method is available for cv.glmnet objects as well. By default, the s value chosen is the "lambda.1se" value stored in the CV object. The s value can also be set to the "lambda.min" value stored in the CV object.

cv.fit <- cv.glmnet(x, y2, family = "cox", nfolds = 5)
survival::survfit(cv.fit, x = x, y = y2)
survival::survfit(cv.fit, s = "lambda.min", x = x, y = y2)

## Try the glmnet package in your browser

Any scripts or data that you put into this service are public.

glmnet documentation built on Nov. 3, 2021, 1:07 a.m.