knitr::opts_chunk$set(echo = TRUE) knitr::opts_knit$set(root.dir = "C:/RFactory/rstanhaz")
Survival regression focuses clinical interest because it may explain in a simple way the effect of a set of measured covariates on the survival/clinical outcome. Proportional hazard allows measuring the prognostic index of individuals depending on their covariates across follow-up while estimating the parameters that contribute to a better or worst prognosis. The hazard at time $t$ for an individual $i$ given her set of covariates $x$ is available by the product of hazard ratios and baseline (cumulative) hazard. Then the survival may be obtained by its relation with the hazard function given by the following equality: $\mathrm{S}(t)=\mathrm{exp} \left ( -\int \mathrm{h}(t) \mathrm{d}t \right) = \mathrm{exp}\left ( -\int h_{0} ยท \mathrm{exp(\beta x)}\right ) $ This article will show how to fit and validate proportional hazard regression models and introduce the use of I-splines as a Bayesian alternative to model survival data. Different types of covariate models will be fitted and compared, such as clinical, genomic, and clinico-genomic models. Besides, we will conduct an assessment of integrative clustering ability to predict survival outcome.
When to use regression models in survival analysis?
When we aim to understand in terms of covariates a time-varying outcome such as Death/Alive, Recurred/Disease Free.
Examples of covariates are steady-state exposure, tumour size or high-throughput genomic measurements.
Types of questions:
Is the covariate significant or has a credible effect on the clinical/survival outcome?
Does the treatment change significantly or in a credible way the survival outcome?
Models to describe:
Is the survival data well-fitted by the set of covariates?
Models to predict:
How well can a survival model predict the outcome of a new patient? Is the prognostic model credible?
This document is a starting point for evaluating I-splines models included in the rstanhaz package. It is not yet clear what libraries will be imported by the package. For now, the following libraries are recommended to be called. Note that we will be using the glmnet package to fit elastic net and rstan to fit Bayesian models with HMC and the NUTS sampling algorithms. Note that some models may take long run-time (~ 2 hours in a PC with 6 cores), using multi-core option is recommended to speed up calculations.
#Load libraries suppressMessages(library(dplyr) ) #For glmnet suppressMessages(library(glmnet)) suppressMessages( library(glmnetUtils)) # For stan suppressMessages( library(rstan) ) suppressMessages( library(rstanarm) ) #set options for rstan rstan_options(auto_write = TRUE) options(mc.cores = min(4, parallel::detectCores())) # For survival analysis library(survival) library(survcomp) # For MC-crossvalidations suppressMessages( library(rsample) ) suppressMessages( library(tidyposterior) ) #for this package suppressMessages( devtools::document() ) # source("~/rstanhaz/R") # if(!require("devtools")) install.packages("devtools") # if(!require("rstanhaz")) devtools::install_github("csetraynor/rstanhaz")
To exemplify this vignette we will use the Lung Squamous carcinoma TCGA dataset.
data(lusc) ?lusc
Observe the dataset, which has been pre-processed to select the clinical variable of interest. Note that a previous step of this project was to conduct Bayesian integrative clustering using the package iClusterPlus. To do so, genomic variables were imputed using K-nearest neighbours (K=10). After conducting the clustering all variables were sorted by absolute posterior probability of importance and then a soft cut was implemented to keep the first quantile of variables (~ 1200 genomic variables). Besides, continuous clinical variables were imputed through multiple imputations by chained equations.
glimpse(lusc)
Load all "omics" data from TCGA:
# load sigfeatures sigfeatures <- readRDS("data-raw/lusc_sigfeatures.RDS") sigfeatures_head <- readRDS("data-raw/lusc_sigfeatures_head.RDS") ## Load genomic datasets gene = readRDS("data-raw/gene.RDS") seg.cn = readRDS("data-raw/seg.cn.RDS") meth450 = readRDS("data-raw/meth450.RDS") sample = readRDS("data-raw/sample.RDS")
Although some data pre-process has been done beforehand, which may be reviewed in the iClusterPlus package branch: test-surv. More data pre-process is needed to conduct survival modelling, which can be skipped for the occasional reader.
## Get common sample id sample_id <- list(rownames(gene$matrix), rownames(seg.cn), sample$sample_id, rownames(meth450$matrix)) sample_id <- Reduce(intersect, sample_id) ## Filter gene$matrix <- filter_sample_id(s = sample_id, m = gene$matrix) meth450$matrix <- filter_sample_id(s = sample_id, m = meth450$matrix) seg.cn <- filter_sample_id(s = sample_id, m = seg.cn) sample <- sample[sample$sample_id %in% sample_id, ] sample <- sample[order(sample$sample_id), ] require(assertthat) assert_that(all(rownames(meth450$matrix) == rownames(seg.cn))) assert_that(all(rownames(gene$matrix) == rownames(seg.cn))) ## Select sigfeatures gene$matrix <- gene$matrix[, sigfeatures[["gene expression"]]] meth450$matrix <- meth450$matrix[, sigfeatures[["methylation"]]] seg.cn <- seg.cn[, sigfeatures[["copy number"]]] ## Add patient id gene$matrix <- dplyr::as_data_frame(gene$matrix, rownames = "sample_id") meth450$matrix <- dplyr::as_data_frame(meth450$matrix, rownames = "sample_id") seg.cn <- dplyr::as_data_frame(seg.cn, rownames = "sample_id") ## join genomic head vars genomic <- dplyr::left_join(gene$matrix, meth450$matrix, by = "sample_id") %>% dplyr::left_join(., seg.cn, by = "sample_id") rm(gene) rm(meth450) rm(seg.cn) # ## remove near zero vars nzvars <- caret::nearZeroVar(genomic) if (length(nzvars) > 0) { genomic <- genomic[, -nzvars] } genomic_vars <- colnames(genomic)[-match("sample_id", colnames(genomic))] ## Join genomic and clinical data lusc_omic <- dplyr::left_join(lusc, sample %>% dplyr::select(patient_id, sample_id), by = "patient_id") %>% dplyr::left_join(., genomic, by = "sample_id") %>% dplyr::select(-sample_id) ## Select most sigfeatures head sigfeatures_head <- unlist(sigfeatures_head)[unlist(sigfeatures_head) %in% colnames(lusc_omic)]
Standardise continuous variables:
### Standardise covariates ---------------------- ## preprocess omic preProcValues <- caret::preProcess(lusc_omic[, c("age", genomic_vars)], method = c("center", "scale")) Transformed <- as.data.frame(predict(preProcValues, lusc_omic[c("age", genomic_vars)])) lusc_omic <- cbind(lusc_omic[, -match(c("age", genomic_vars), colnames(lusc_omic))], Transformed) %>% dplyr::mutate(time = os_months, status = os_status)
We are going to fit a simple I-splines model. Let's review the model in detail:
stan_file <- read_stan_file("src/stan_files/surv.stan") print_stan_file(stan_file, section = 'model')
The model calculates the linear predictor, which has a multiplicative effect on the baseline hazard. Now we are going to fit our first model using one clinical variable: pathological staging(stage), which has three levels low, medium and high. We will use the function generate_stan_data to create a data list that will be passed to Stan.
splines.data <- generate_stan_data(formula = Surv(time, status) ~ stage, data = lusc_omic, basehaz = "fpm", prior_aux = exponential(), prior = normal()) splines.fit <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, iter = 2000, refresh = 0)
Now we are ready to plot the model and compare it side-by-side with the Cox's model. To visualise the models we have to extract the baseline hazard and the regression coefficient, also known as hazard ratios, for each model:
# Fit Cox models cox.fit <- coxph(Surv(time, status) ~ stage, data = lusc_omic) # Extract baseline hazard cox.bh <- survival::basehaz(cox.fit, centered = FALSE) get_splines.bh <- function(fit, data) { splines.basis <- data$basehaz$spline_basis splines.basehaz_coefs <- rstan::extract(fit, "basehaz_coefs")$basehaz_coefs splines.basehaz_coefs <- apply(X = splines.basehaz_coefs, MARGIN = 2, FUN = mean) basehaz.post <- sapply(seq_along(1:nrow(splines.basis)), function(n) { splines.basehaz_coefs %*% splines.basis[n, ] }) out <- data.frame(hazard = c(0, basehaz.post), time = c(1e-05, data$standata$t_end[data$standata$d])) unique(out[order(out$time), ]) } splines.bh <- get_splines.bh(fit = splines.fit, data = splines.data) # Extract regression coefficients cox.beta <- c(0, coef(cox.fit)) splines.beta <- rstan::extract(splines.fit, "beta")$beta splines.beta <- c(0, apply(splines.beta, 2, mean)) names(cox.beta) <- names(splines.beta) <- c("high", "low", "medium")
Using the theory of proportional hazard modelling we are able to predict the survival function for each stage in R and overlay the predictions of each model to the Kaplan-Meier maximum-likelihood estimate of survival:
#Function to link (cumulative) baseline hazard to survival linkto_surv <- function(bh, hr){ exp( - bh)^(exp( hr ) ) } # calculate predicted survival for Cox and I-splines models cox.surv <- lapply(seq_along(cox.beta), function(b){ data.frame( surv = linkto_surv( bh = cox.bh$hazard, hr = cox.beta[b] ) , time = cox.bh$time, stage = names(cox.beta[b]) ) }) cox.surv <- do.call(rbind.data.frame, cox.surv) splines.surv <- lapply(seq_along(splines.beta), function(b){ data.frame( surv = linkto_surv( bh = splines.bh$hazard, hr = splines.beta[b] ), time = splines.bh$time, stage = names(splines.beta[b]) ) }) splines.surv <- do.call(rbind.data.frame, splines.surv) # Observed Maximum likelihood estimate mle.surv <- survfit(Surv(time, status) ~ stage , data = lusc_omic ) obs.mortality <- data.frame(time = mle.surv$time, surv = mle.surv$surv, strata = summary(mle.surv, censored=T)$strata) zeros <- data.frame(time=0, surv=1, strata=unique(obs.mortality$strata)) obs.mortality <- rbind(obs.mortality, zeros) strata <- str_strata(obs.mortality) obs.mortality$strata <- NULL obs.mortality <- cbind( obs.mortality, strata) # Design plots cox.plot <- ggplot2::ggplot(data = obs.mortality, aes(x = time, y = surv, group = stage)) + geom_step(aes(colour = stage, linetype = "K-M estimate")) + geom_line(data = cox.surv, aes(x = time, y = surv, colour = stage, linetype = "Prediction Cox")) + labs(x = "Time(months)", y = "Survival", title = "Cox's model") + theme_bw() splines.plot <- ggplot2::ggplot(data = obs.mortality, aes(x = time, y = surv, group = stage)) + geom_step(aes(colour = stage, linetype = "K-M estimate")) + geom_line(data = splines.surv, aes(x = time, y = surv, colour = stage, linetype = "Prediction I-Splines")) + labs(x = "Time(months)", y = "Survival", title = "I-Splines model") + theme_bw() gridExtra::grid.arrange(cox.plot, splines.plot, ncol = 2)
Visualising the predictions may help to understand the model. However, it has not yet been established a way to plot continuous variables and/or more than one variable which is something that can be included in the future work.
However, more interesting than fitting the simple model above may be to estimate how a manifold of variables influences the outcome when taken into account together.
For example, we can fit a clinical model. Moreover, by using multivariate survival models we can take into account confounding factors for overall survival, such as age. Let's specify the clinical variables:
clinical_vars <- c("stage", "smoke", "age")
Fitting the clinical Cox's models.
## Cox model surv_formula_clinical <- as.formula(paste0("Surv(time, status) ~ ", paste(clinical_vars, collapse = "+"))) cox.fit <- coxph(surv_formula_clinical, data = lusc_omic) summary(cox.fit)
A measure we will be interested in showing in the summary object: the Concordance index. Which with for this models is 0.632. Secondly, we will fit the Bayesian proportional hazard modelling the baseline hazard function with splines.
## splines model gen_inits <- function() { list(beta = rnorm(n = 5, mean = 0, sd = 1)) } splines.data <- generate_stan_data(formula = surv_formula_clinical, data = lusc_omic, basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), init = gen_inits, prior = normal()) splines.fit <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, iter = 2000, seed = 1328025050, init = gen_inits, refresh = 0)
To examine these models we will need a model matrix to compute predictions.
## Create model matrix -- x_clinical <- model.matrix(surv_formula_clinical, data = lusc_omic)[, -1]
Firstly, we will check whether the posterior probabilities of the log-hazard ratios computed with I-splines are similar to the parameter estimates calculated with the Cox's model:
# extract posterior draws beta_draws <- rstan::extract(splines.fit, "beta")$beta beta_draws_mean <- apply(beta_draws, 2, mean) splines.tab <- t(apply(beta_draws, 2, quantile, probs = c(0.025, 0.975, 0.5))) rownames(splines.tab) <- colnames(x_clinical) splines.tab
We can compare this estimates to the Cox model; the parameter estimates can be extracted using the following function:
par.table <- function(fit) { ## needs glm object d1 <- summary(fit)$coefficients[, c("coef", "Pr(>|z|)")] dc <- confint(fit) cbind(dc, d1) } par.table(cox.fit)
We can observe that many of the parameters evaluated agree between the two models. However, for the smokeX category, the two estimates are substantially different 0.15 for the Bayesian I-splines model and 0.52 for the Cox's model. To understand what is going on, let's focus on the smoking variable, which in this data has three categories: currently smoking, reformed and not available (X). While the model is secure about currently smoking and reformed categories effect, the data might not be very informative about the not available category. In this case, the Cox's model has given large confidence intervals and a p-value that indicates that the parameter estimate is not very reliable. On the other hand, the Bayesian model has shrunk the posterior towards zero leveraging the normal prior centred at zero.
We may compare how these two models perform using various metrics such as the C-index, survival ROC and the survival Brier score. To do so, we will need to calculate the risk score or prognostic index for each individual:
# calculate risk score splines_rp <- as.matrix(x_clinical) %*% beta_draws_mean cox_rp = as.matrix(x_clinical) %*% as.vector(unlist(coef(cox.fit)))
Let us shortly review the metrics we will be using to assess the goodness of fit. The concordance index [@harrell_regression_2014, ch. 12] is a score that measures the rightly ranked pairs of predicted and observed event times. The survival ROC [@heagerty_time-dependent_2000] is an extension of binary ROC curves to include the time dependence of survival outcomes. The survival Brier Score [@mogensen_evaluating_2012] is the squared difference between the model-based predicted survival probabilities and the true disease status.
We will wrap in the following function the call to evaluate the survival Brier score, time-dependent ROC and concordance index using.
# prepare calculate metrics get_metrics <- function(train, test = NULL, cox_rp, splines_rp, cox_rp_train = NULL, splines_rp_train = NULL) { if (is.null(test)) { test <- train cox_rp_train <- cox_rp splines_rp_train <- splines_rp } ## Prepare to calculate metrics ws <- rep(1, length(test$time)) ws[test$time > max(test$time[test$status])] <- 0.01 obs.test.time <- sort(unique(test$time[test$status])) # cindex Cindex.cox <- survcomp::concordance.index(x = cox_rp, surv.time = test$time, surv.event = test$status, method = "conservative", na.rm = TRUE, weights = ws)$c.index Cindex.splines <- survcomp::concordance.index(x = splines_rp, surv.time = test$time, surv.event = test$status, method = "conservative", na.rm = TRUE, weights = ws)$c.index # brier score bs.cox <- survcomp::sbrier.score2proba(data.tr = data.frame(time = train$time, event = train$status, score = cox_rp_train), data.ts = data.frame(time = test$time, event = test$status, score = cox_rp), method = c("prodlim")) bs.splines <- survcomp::sbrier.score2proba(data.tr = data.frame(time = train$time, event = train$status, score = splines_rp_train), data.ts = data.frame(time = test$time, event = test$status, score = splines_rp), method = c("prodlim")) # AUC ROC AUC.cox <- survcomp::tdrocc(x = as.vector(cox_rp), surv.time = test$time, surv.event = test$status, time = max(obs.test.time)) AUC.splines <- survcomp::tdrocc(x = as.vector(splines_rp), surv.time = test$time, surv.event = test$status, time = max(obs.test.time)) list(Cindex.cox = Cindex.cox, Cindex.splines = Cindex.splines, bs.cox = bs.cox, bs.splines = bs.splines, AUC.cox = AUC.cox, AUC.splines = AUC.splines) } # Get metrics metrics_list <- get_metrics(train = lusc_omic, splines_rp = splines_rp, cox_rp = cox_rp) models <- c("Cox clinical", "I-Splines clinical") cindex <- c(metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]]) bs <- c(metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated) auc <- c(metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC) metrics <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc) metrics
Slightly better for the splines model, but very similar indeed. Let's visualise now the survival Brier score and ROC curves.
# get a beautiful palette to plot get_colors <- function(n) { colors <- c("#89C5DA", "#DA5724", "#74D944", "#CE50CA", "#3F4921", "#C0717C", "#CBD588", "#5F7FC7", "#673770", "#D3D93E", "#38333E", "#508578", "#D7C1B1", "#689030", "#AD6F3B", "#CD9BCD", "#D14285", "#6DDE88", "#652926", "#7FDCC0", "#C84248", "#8569D5", "#5E738F", "#D1A33D", "#8A7C64", "#599861") colors[1:n] } # prepare plot fun plot_bs <- function(brier_score, models) { data.plot <- lapply(seq_along(brier_score), function(d) { data.frame(BS = brier_score[[d]]$bsc, time = brier_score[[d]]$time, model = models[[d]]) }) data.caption <- sapply(seq_along(brier_score), function(d) { paste0(models[[d]], " = ", round(brier_score[[d]]$bsc.integrated, 3)) }) data.caption <- paste0("iBrier:\n ", paste(data.caption, collapse = "\n")) my_theme <- theme_bw() + theme(axis.text.x = element_text(size = 10), axis.text.y = element_text(size = 10), plot.title = element_text(size = 15, face = "bold", color = "darkgreen")) # The palette with black: cbbPalette <- get_colors(n = length(models)) ggplot2::ggplot(data = do.call(rbind.data.frame, data.plot), aes(x = time, y = BS)) + geom_line(aes(color = model), size = 0.55) + labs(title = "Prediction error curve\n", x = "Time [months]", y = "survival Brier score", color = "Model\n", caption = data.caption) + scale_color_manual(labels = models, values = cbbPalette) + my_theme } # plot pec pec <- list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]]) plot_bs(brier_score = pec, models = models)
We can make another plot for the ROC curve:
#prepare function plot_roc <- function(roc, models) { data.plot <- lapply(seq_along(roc), function(d) { data.frame(fpr = sort(1 - roc[[d]]$spec), tpr = sort(roc[[d]]$sens), model = models[[d]]) }) data.caption <- sapply(seq_along(roc), function(d) { paste0(models[[d]], " = ", round(roc[[d]]$AUC, 3)) }) data.caption <- paste0("AUC:\n ", paste(data.caption, collapse = "\n")) my_theme <- theme_bw() + theme(axis.text.x = element_text(size = 10), axis.text.y = element_text(size = 10), plot.title = element_text(size = 15, face = "bold", color = "darkgreen")) cbbPalette <- get_colors(n = length(models)) ggplot2::ggplot(data = do.call(rbind.data.frame, data.plot), aes(x = fpr, y = tpr)) + geom_line(aes(color = model), size = 0.55) + labs(title = "survival ROC curve\n", x = "False postive rate (1-specificity)", y = "True positive rate (sensitivity)", color = "Model\n", caption = data.caption) + scale_color_manual(labels = models, values = cbbPalette) + my_theme + geom_abline(slope = 1) + coord_fixed() } # design plot roc <- list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]) plot_roc(roc = roc, models = models)
Now we will investigate the hypothesis of including genomic variables from different data sources such as proteomics, transcriptomics and genomics. We have a curated big dataset. The hypothesis is that the variables that are relevant for integrative clustering as assess by their posterior predictive importance might also have a strong effect on the survival outcome. We will be now moving on to the high-dimensional setting where the number of features (genomic variables), is usually considerately larger than the number of individuals observed. Therefore, some form of model tunning needs to be considered, i.e. over-fitting and high variance have to be tackled. Moreover, it is likely that these covariates are correlated, therefore we will be using tunned models for the high-dimensional setting that can handle correlated covariates. The Cox' model will be tunned via the Elastic Net [see @zou_regularization_2005] and the I-splines model via the horseshoe prior [see @carvalho_horseshoe_2010].
# give dummy names to vars to avoid issues when parsing the model formula x <- lusc_omic[, genomic_vars] colnames(x) <- paste0("var", 1:length(genomic_vars)) surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), collapse = "+"))) y <- lusc_omic[, c("time", "status")] in_data <- cbind.data.frame(y, x) # expected zeros nz = ceiling(0.01 * length(genomic_vars)) z = length(genomic_vars) - nz # register for parallelisation if available set.seed(82405) foldid <- caret::createFolds(in_data$status, k = 10, list = FALSE) ## do crossvalidation to find optimal alpha and lambda date() elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", grouped = TRUE, foldid = foldid, parallel = FALSE) date() plot(elasticnet_stuff)
The plot of lambda vs deviance for different values of alpha shows that for smaller alpha values the fit is considerably better. Therefore, this implies that the selected model is close to ridge regression.
The horseshoe prior belongs to the family of shrinkage priors, which cancels the parameters linked to noise while making allowances for the parameters linked to signal. The horseshoe has very tall modes and thick-tailed tails due to the half-Cauchy prior placed to the local hyperparameters.
## generate intits gen_inits <- function() { list(beta = rnorm(n = length(genomic_vars), mean = 0, sd = 1)) } # generate data splines.data <- generate_stan_data(surv_formula, data = in_data, basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), prior = hs(global_scale = nz/z)) date() options(mc.cores = min(4, parallel::detectCores())) # fit model splines.hs.genomic <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, iter = 2000, seed = 13250, init = gen_inits, control = list(adapt_delta = 0.99), refresh = 0) date() rstan::traceplot(splines.hs.genomic, "lp__")
Divergent transitions may appear when sampling the horseshoe model. But as long as there is only a handful of divergent transition the chain should be fine. Reviewing the traceplot of the log-posterior indicates that it's well-mixed.
We can extract the maximum (penalised) likelihood estimates and posterior draws respectively to calculate the prognostic index.
# coefficients cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm[i$nzero > nz])) elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]] lam <- match(min(cvm), elasticnet$cvm) # posterior draws beta_draws <- rstan::extract(splines.hs.genomic, "beta")$beta beta_draws_mean <- apply(beta_draws, 2, mean) #mean because the distribution is not normal # calculate risk score enet_rp <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam]) splines_rp <- as.matrix(x) %*% beta_draws_mean
Calculate the metrics:
metrics_list <- get_metrics(train = in_data, splines_rp = splines_rp, cox_rp = enet_rp) # add to list of models models <- c(models, "Elastic net Cox genomic", "Horseshoe I-S genomic") cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]]) bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated) auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC) # update record metrics <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc) metrics
Similarly, update the pec and roc plots.
pec <- c(pec, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]])) plot_bs(brier_score = pec, models = models)
roc <- c(roc, list( metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]) ) plot_roc(roc, models)
In the era of high-throughput genomic data, doctors are still guided by classical variables, such as age or pathological staging to guide clinical practice. It is not yet clear whether or not we should still consider classical variables when modelling survival data. Therefore, we are going to consider a two-sided model where we include a "prior" belief that classical variables are going to be relevant and high-throughout derived genomic variable should be shrunk.
Let's consider the Elastic net, the parameter penalty factor may be use to indicate thatthe clinical covariates should not be shrunk.
# Clinico-genomic model give dummy names to vars to avoid issues # when creating model formula x_genomic <- lusc_omic[, genomic_vars] colnames(x_genomic) <- paste0("var", 1:length(genomic_vars)) # create clinico-genomic model matrix x <- cbind(x_clinical, x_genomic) surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), collapse = "+"))) y <- lusc_omic[, c("time", "status")] in_data <- cbind.data.frame(y, x) ## Expected number of non-zeros nz = 0.01 * length(genomic_vars) z = length(genomic_vars) - nz # register for parallelisation if available set.seed(1347294) foldid <- caret::createFolds(lusc_omic$status, k = 10, list = FALSE) pf = rep(1, ncol(x)) pf[match(colnames(x_clinical), colnames(x))] <- 0 #clinico-genomic assumption # do crossvalidation to find optimal alpha and lambda date() elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", grouped = TRUE, foldid = foldid, parallel = FALSE, penalty.factor = pf) date() plot(elasticnet_stuff)
Again the plot of deviance for each value of alpha shows that smalled alpha has lower deviance, therefore, a ridge regression-like model is selected. In the Bayesian approach, two separate set of covariates are specified separately in the model by placing different priors. The clinical covariates receive a normal standardize prior while the genomic covariates a shrinkage prior via the horseshoe density. Let us review the model specification:
print_stan_file(stan_file_hs, section = 'model')
## splines model # prepare inits gen_inits <- function() { list(beta_1 = rnorm(n = ncol(x_clinical), mean = 0, sd = 1), beta_2 = rnorm(n = length(genomic_vars), mean = 0, sd = 1)) } # create separate model formulas surv_formula_1 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_clinical), collapse = "+"))) surv_formula_2 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x)[-match(colnames(x_clinical), colnames(x))], collapse = "+"))) # create data list cg_splines.data <- generate_stan_data_2(formula1 = surv_formula_1, formula2 = surv_formula_2, data = in_data, basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), prior_1 = normal(), prior_2 = hs(global_scale = nz/z)) date() stan_file_hs <- read_stan_file("src/stan_files/surv_hs.stan") splines.hs.clinico.genomic <- rstan::stan(stan_file_hs, data = cg_splines.data$standata, chains = 4, iter = 2000, seed = 13280, init = gen_inits, control = list(adapt_delta = 0.99), refresh = 0) date() rstan::traceplot(splines.hs.clinico.genomic, "lp__")
The chains are properly mixed. Although, a few divergent transitions might have shown we should not worry. Now we can obtain the prognostic index and calculate the metrics:
## select best coefficient cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm)) elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]] lam <- match(min(cvm), elasticnet$cvm) ## posterior draws beta_draws_cg <- cbind(rstan::extract(splines.hs.clinico.genomic, "beta_1")$beta_1, rstan::extract(splines.hs.clinico.genomic, "beta_2")$beta_2) beta_draws_cg_mean <- apply(beta_draws_cg, 2, mean) ## calculate risk score enet_rp <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam]) splines_rp <- as.matrix(x) %*% beta_draws_cg_mean # calcualte metrics metrics_list <- get_metrics(train = in_data, cox_rp = enet_rp, splines_rp = splines_rp) cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]]) bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated) auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.cox"]]$AUC) models <- c(models, "Elastic net Cox clinico-genomic", "Horseshoe I-S clinico-genomic") metrics <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc) metrics
And update the pec and roc plots:
pec <- c(pec, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]])) plot_bs(brier_score = pec, models = models)
roc <- c(roc, list( metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]) ) plot_roc(roc, models)
We can see that the models that include more covariates, i.e genomic models, might perform a bit better. However, how can we be sure that we are not over-fitting the model? Could we anticipate if this modelling will have external validity? We will approach the first solution to this question in the next section.
We have established that testing the model in the same data-set was the training may be overly simplistic and misleading because it can conceal over-fitting. To compare the models we may consider another approach such as the train/test split.
set.seed(237921) tmp <- caret::createDataPartition(lusc_omic$status, p = 0.8, list = FALSE) train <- lusc_omic[tmp, ] test <- lusc_omic[-tmp, ]
Let's now redo all the models fitting only in the training set and measuring the performance using the hold-out test set.
date() ## Cox model clinical_vars <- c("stage", "smoke", "age") surv_formula_clinical <- as.formula(paste0("Surv(time, status) ~ ", paste(clinical_vars, collapse = "+"))) ## Create model matrix -- x_clinical <- model.matrix(surv_formula_clinical, data = train)[, -1] #drop the intercept x_clinical_test <- model.matrix(surv_formula_clinical, data = test)[, -1] #drop the intercept cox.train <- coxph(surv_formula_clinical, data = train) ## splines model gen_inits <- function() { list(beta = rnorm(n = ncol(x_clinical), mean = 0, sd = 1)) } splines.data <- generate_stan_data(formula = surv_formula_clinical, data = train, basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), init = gen_inits, prior = normal()) splines.train <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, iter = 2000, seed = 1328025050, init = gen_inits, refresh = 0) # extract posterior draws beta_draws <- rstan::extract(splines.train, "beta")$beta beta_draws_mean <- apply(beta_draws, 2, mean) # calculate risk score predictions test and train splines_rp <- as.matrix(x_clinical_test) %*% beta_draws_mean cox_rp = as.matrix(x_clinical_test) %*% as.vector(unlist(coef(cox.train))) splines_rp_train <- as.matrix(x_clinical) %*% beta_draws_mean cox_rp_train = as.matrix(x_clinical) %*% as.vector(unlist(coef(cox.train))) # calculate metrics metrics_list <- get_metrics(train = train, test = test, cox_rp = cox_rp, splines_rp = splines_rp, cox_rp_train = cox_rp_train, splines_rp_train = splines_rp_train) cindex <- c(metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]]) bs <- c(metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated) auc <- c(metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC) models <- c("Cox-clinical", "Bayesian Splines-clinical") metrics_test <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc) pec_test <- list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]]) roc_test <- list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]) ## Genomic models x <- train[, genomic_vars] x_test <- test[, genomic_vars] colnames(x) <- colnames(x_test) <- paste0("var", 1:length(genomic_vars)) surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), collapse = "+"))) y <- train[, c("time", "status")] y_test <- test[, c("time", "status")] in_data <- cbind.data.frame(y, x) in_data_test <- cbind.data.frame(y_test, x_test) # expected number of non-zeros nz = ceiling(0.01 * length(genomic_vars)) z = length(genomic_vars) - nz # register for parallelisation if available set.seed(82405) foldid <- caret::createFolds(in_data$status, k = 10, list = FALSE) ## do crossvalidation to find optimal alpha and lambda elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", grouped = TRUE, foldid = foldid, parallel = FALSE) # I-splines model gen_inits <- function() { list(beta = rnorm(n = length(genomic_vars), mean = 0, sd = 1)) } # generate data splines.data <- generate_stan_data(surv_formula, data = in_data, basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), prior = hs(global_scale = nz/z)) # fit model splines.hs.genomic <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, iter = 2000, seed = 1328025050, init = gen_inits, control = list(adapt_delta = 0.99), refresh = 0) # calculate prognostic index cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm)) elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]] lam <- match(min(cvm), elasticnet$cvm) beta_draws <- rstan::extract(splines.hs.genomic, "beta")$beta beta_draws_mean <- apply(beta_draws, 2, mean) # calculate risk score enet_rp <- as.matrix(x_test) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam]) enet_rp_train <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam]) # calculate risk score: predictions test and train splines_rp <- as.matrix(x_test) %*% beta_draws_mean splines_rp_train <- as.matrix(x) %*% beta_draws_mean metrics_list <- get_metrics(train = in_data, test = in_data_test, splines_rp = splines_rp, cox_rp = enet_rp, splines_rp_train = splines_rp_train, cox_rp_train = enet_rp_train) #------------------- models <- c(models, "Elastic net genomic", "Horseshoe genomic") cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]]) bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated) auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC) metrics_test <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc) pec_test <- c(pec_test, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]])) roc_test <- c(roc_test, list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]])) # ## Clinico-genomic model give dummy names to vars to avoid issues # when creating model formula x_genomic <- train[, genomic_vars] x_genomic_test <- test[, genomic_vars] colnames(x_genomic) <- colnames(x_genomic_test) <- paste0("var", 1:length(genomic_vars)) ## prepare input data x <- cbind(x_clinical, x_genomic) x_test <- cbind(x_clinical_test, x_genomic_test) surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), collapse = "+"))) surv_formula_clinical <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_clinical), collapse = "+"))) y <- train[, c("time", "status")] y_test <- test[, c("time", "status")] in_data <- cbind.data.frame(y, x) ## Train model # register for parallelisation if available set.seed(1347294) foldid <- caret::createFolds(train$status, k = 10, list = FALSE) pf = rep(1, ncol(x)) pf[match(colnames(x_clinical), colnames(x))] <- 0 #clinico-genomic assumption # do crossvalidation to find optimal alpha and lambda elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", grouped = TRUE, foldid = foldid, parallel = FALSE, penalty.factor = pf) ## splines model gen_inits <- function() { list(beta_1 = rnorm(n = ncol(x_clinical), mean = 0, sd = 1), beta_2 = rnorm(n = ncol(x_genomic), mean = 0, sd = 1)) } surv_formula_1 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_clinical), collapse = "+"))) surv_formula_2 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_genomic), collapse = "+"))) cg_splines.data <- generate_stan_data_2(formula1 = surv_formula_1, formula2 = surv_formula_2, data = in_data, basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), prior_1 = normal(), prior_2 = hs(global_scale = nz/z)) splines.hs.train.clinico.genomic <- rstan::stan(stan_file_hs, data = cg_splines.data$standata, chains = 4, iter = 2000, seed = 1328025050, init = gen_inits, control = list(adapt_delta = 0.99), refresh = 0) ## cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm)) elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]] lam <- match(min(cvm), elasticnet$cvm) ## calculate risk score enet_rp_train <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam]) enet_rp <- as.matrix(x_test) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam]) ## Calculate risk scores beta_draws_cg <- cbind(rstan::extract(splines.hs.train.clinico.genomic, "beta_1")$beta_1, rstan::extract(splines.hs.train.clinico.genomic, "beta_2")$beta_2) beta_draws_cg_mean <- apply(beta_draws_cg, 2, mean) # calculate risk score: predictions test and train splines_rp <- as.matrix(x_test) %*% beta_draws_cg_mean splines_rp_train <- as.matrix(x) %*% beta_draws_cg_mean in_data_test <- cbind(y_test, x_test) metrics_list <- get_metrics(train = in_data, test = in_data_test, splines_rp = splines_rp, cox_rp = enet_rp, splines_rp_train = splines_rp_train, cox_rp_train = enet_rp_train) #------------------- models <- c(models, "Elastic net: clinico-genomic", "Horseshoe: clinico-genomic") cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]]) bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated) auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC) metrics_test <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc) pec_test <- c(pec_test, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]])) roc_test <- c(roc_test, list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]])) date()
Let's review the metrics:
metrics
And plot the survival ROC and PEC curves.
plot_bs(brier_score = pec_test, models = models)
plot_roc(roc = roc_test, models = models)
This vignette has tested the Bayesian semi-parametric I-splines model sampled with Hamiltonian Monte-Carlo and No-U-Turn-Sample. Besides, we conducted for first time a fair comparision between the I-Splines model and the Cox's model considering genomic and clinico-genomic models. Genomic and clinico-genomic models outperform clinical covariate models. However, clinical covariates should be considered when evaluating the effect of newly discovered genomic targets. A possible reasoning is that the elastic net and horseshoe prior are already tunned to attach more prognistic performance out-the-sample. Although, the train/test split is considered an standard practice to evaluate models [@bovelstad_predicting_2007] shown that the split may have an strong effect on the evalutaion. Therefore, we suggest follow-up this work conducting a Monte-Carlo Cross-Validation.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.