knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "C:/RFactory/rstanhaz")

Introduction

Survival regression focuses clinical interest because it may explain in a simple way the effect of a set of measured covariates on the survival/clinical outcome. Proportional hazard allows measuring the prognostic index of individuals depending on their covariates across follow-up while estimating the parameters that contribute to a better or worst prognosis. The hazard at time $t$ for an individual $i$ given her set of covariates $x$ is available by the product of hazard ratios and baseline (cumulative) hazard. Then the survival may be obtained by its relation with the hazard function given by the following equality: $\mathrm{S}(t)=\mathrm{exp} \left ( -\int \mathrm{h}(t) \mathrm{d}t \right) = \mathrm{exp}\left ( -\int h_{0} ยท \mathrm{exp(\beta x)}\right ) $ This article will show how to fit and validate proportional hazard regression models and introduce the use of I-splines as a Bayesian alternative to model survival data. Different types of covariate models will be fitted and compared, such as clinical, genomic, and clinico-genomic models. Besides, we will conduct an assessment of integrative clustering ability to predict survival outcome.

When to use regression models in survival analysis?

When we aim to understand in terms of covariates a time-varying outcome such as Death/Alive, Recurred/Disease Free.

Examples of covariates are steady-state exposure, tumour size or high-throughput genomic measurements.

Types of questions:

Is the covariate significant or has a credible effect on the clinical/survival outcome?

Does the treatment change significantly or in a credible way the survival outcome?

Models to describe:

Is the survival data well-fitted by the set of covariates?

Models to predict:

How well can a survival model predict the outcome of a new patient? Is the prognostic model credible?

Load libraries

This document is a starting point for evaluating I-splines models included in the rstanhaz package. It is not yet clear what libraries will be imported by the package. For now, the following libraries are recommended to be called. Note that we will be using the glmnet package to fit elastic net and rstan to fit Bayesian models with HMC and the NUTS sampling algorithms. Note that some models may take long run-time (~ 2 hours in a PC with 6 cores), using multi-core option is recommended to speed up calculations.

#Load libraries
suppressMessages(library(dplyr) )
#For glmnet
suppressMessages(library(glmnet))
suppressMessages( library(glmnetUtils))
# For stan
suppressMessages( library(rstan) )
suppressMessages( library(rstanarm) )
#set options for rstan
rstan_options(auto_write = TRUE)
options(mc.cores = min(4, parallel::detectCores()))
# For survival analysis
library(survival)
library(survcomp)
# For MC-crossvalidations
suppressMessages( library(rsample) )
suppressMessages( library(tidyposterior) )
#for this package
suppressMessages( devtools::document() )
# source("~/rstanhaz/R")
# if(!require("devtools")) install.packages("devtools")
# if(!require("rstanhaz")) devtools::install_github("csetraynor/rstanhaz")

To exemplify this vignette we will use the Lung Squamous carcinoma TCGA dataset.

data(lusc)
?lusc

Observe the dataset, which has been pre-processed to select the clinical variable of interest. Note that a previous step of this project was to conduct Bayesian integrative clustering using the package iClusterPlus. To do so, genomic variables were imputed using K-nearest neighbours (K=10). After conducting the clustering all variables were sorted by absolute posterior probability of importance and then a soft cut was implemented to keep the first quantile of variables (~ 1200 genomic variables). Besides, continuous clinical variables were imputed through multiple imputations by chained equations.

glimpse(lusc)

Load all "omics" data from TCGA:

# load sigfeatures
sigfeatures <- readRDS("data-raw/lusc_sigfeatures.RDS")
sigfeatures_head <- readRDS("data-raw/lusc_sigfeatures_head.RDS")

## Load genomic datasets
gene = readRDS("data-raw/gene.RDS")
seg.cn = readRDS("data-raw/seg.cn.RDS")
meth450 = readRDS("data-raw/meth450.RDS")
sample = readRDS("data-raw/sample.RDS")

Data preprocess

Although some data pre-process has been done beforehand, which may be reviewed in the iClusterPlus package branch: test-surv. More data pre-process is needed to conduct survival modelling, which can be skipped for the occasional reader.

## Get common sample id
sample_id <- list(rownames(gene$matrix), rownames(seg.cn), sample$sample_id, 
    rownames(meth450$matrix))
sample_id <- Reduce(intersect, sample_id)

## Filter
gene$matrix <- filter_sample_id(s = sample_id, m = gene$matrix)
meth450$matrix <- filter_sample_id(s = sample_id, m = meth450$matrix)
seg.cn <- filter_sample_id(s = sample_id, m = seg.cn)
sample <- sample[sample$sample_id %in% sample_id, ]
sample <- sample[order(sample$sample_id), ]

require(assertthat)
assert_that(all(rownames(meth450$matrix) == rownames(seg.cn)))
assert_that(all(rownames(gene$matrix) == rownames(seg.cn)))

## Select sigfeatures
gene$matrix <- gene$matrix[, sigfeatures[["gene expression"]]]
meth450$matrix <- meth450$matrix[, sigfeatures[["methylation"]]]
seg.cn <- seg.cn[, sigfeatures[["copy number"]]]

## Add patient id

gene$matrix <- dplyr::as_data_frame(gene$matrix, rownames = "sample_id")
meth450$matrix <- dplyr::as_data_frame(meth450$matrix, rownames = "sample_id")
seg.cn <- dplyr::as_data_frame(seg.cn, rownames = "sample_id")

## join genomic head vars
genomic <- dplyr::left_join(gene$matrix, meth450$matrix, by = "sample_id") %>% 
    dplyr::left_join(., seg.cn, by = "sample_id")
rm(gene)
rm(meth450)
rm(seg.cn)

# ## remove near zero vars
nzvars <- caret::nearZeroVar(genomic)
if (length(nzvars) > 0) {
    genomic <- genomic[, -nzvars]
}

genomic_vars <- colnames(genomic)[-match("sample_id", colnames(genomic))]

## Join genomic and clinical data
lusc_omic <- dplyr::left_join(lusc, sample %>% dplyr::select(patient_id, 
    sample_id), by = "patient_id") %>% dplyr::left_join(., genomic, by = "sample_id") %>% 
    dplyr::select(-sample_id)

## Select most sigfeatures head
sigfeatures_head <- unlist(sigfeatures_head)[unlist(sigfeatures_head) %in% 
    colnames(lusc_omic)]

Standardise continuous variables:

### Standardise covariates ----------------------

## preprocess omic
preProcValues <- caret::preProcess(lusc_omic[, c("age", genomic_vars)], 
    method = c("center", "scale"))
Transformed <- as.data.frame(predict(preProcValues, lusc_omic[c("age", 
    genomic_vars)]))

lusc_omic <- cbind(lusc_omic[, -match(c("age", genomic_vars), colnames(lusc_omic))], 
    Transformed) %>% dplyr::mutate(time = os_months, status = os_status)

Fitting a simple model and visualising

We are going to fit a simple I-splines model. Let's review the model in detail:

stan_file <- read_stan_file("src/stan_files/surv.stan")
print_stan_file(stan_file, section = 'model')

The model calculates the linear predictor, which has a multiplicative effect on the baseline hazard. Now we are going to fit our first model using one clinical variable: pathological staging(stage), which has three levels low, medium and high. We will use the function generate_stan_data to create a data list that will be passed to Stan.

splines.data <- generate_stan_data(formula = Surv(time, status) ~ stage, 
    data = lusc_omic, basehaz = "fpm", prior_aux = exponential(), prior = normal())
splines.fit <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, 
    iter = 2000, refresh = 0)

Now we are ready to plot the model and compare it side-by-side with the Cox's model. To visualise the models we have to extract the baseline hazard and the regression coefficient, also known as hazard ratios, for each model:

# Fit Cox models
cox.fit <- coxph(Surv(time, status) ~ stage, data = lusc_omic)
# Extract baseline hazard
cox.bh <- survival::basehaz(cox.fit, centered = FALSE)
get_splines.bh <- function(fit, data) {
    splines.basis <- data$basehaz$spline_basis
    splines.basehaz_coefs <- rstan::extract(fit, "basehaz_coefs")$basehaz_coefs
    splines.basehaz_coefs <- apply(X = splines.basehaz_coefs, MARGIN = 2, 
        FUN = mean)

    basehaz.post <- sapply(seq_along(1:nrow(splines.basis)), function(n) {
        splines.basehaz_coefs %*% splines.basis[n, ]
    })
    out <- data.frame(hazard = c(0, basehaz.post), time = c(1e-05, data$standata$t_end[data$standata$d]))
    unique(out[order(out$time), ])
}
splines.bh <- get_splines.bh(fit = splines.fit, data = splines.data)
# Extract regression coefficients
cox.beta <- c(0, coef(cox.fit))
splines.beta <- rstan::extract(splines.fit, "beta")$beta
splines.beta <- c(0, apply(splines.beta, 2, mean))
names(cox.beta) <- names(splines.beta) <- c("high", "low", "medium")

Using the theory of proportional hazard modelling we are able to predict the survival function for each stage in R and overlay the predictions of each model to the Kaplan-Meier maximum-likelihood estimate of survival:

#Function to link (cumulative) baseline hazard to survival
linkto_surv <- function(bh, hr){
  exp( - bh)^(exp( hr ) )
}
# calculate predicted survival for Cox and I-splines models
cox.surv <- lapply(seq_along(cox.beta), function(b){
  data.frame( surv =  linkto_surv( bh = cox.bh$hazard, hr = cox.beta[b] ) ,
              time = cox.bh$time,
              stage = names(cox.beta[b])
  )
})
cox.surv <- do.call(rbind.data.frame, cox.surv)
splines.surv <- lapply(seq_along(splines.beta), function(b){
  data.frame( surv = linkto_surv( bh = splines.bh$hazard, hr = splines.beta[b] ),
              time = splines.bh$time,
              stage = names(splines.beta[b])
  )
})
splines.surv <- do.call(rbind.data.frame, splines.surv)
# Observed Maximum likelihood estimate
mle.surv <- survfit(Surv(time, status) ~ stage , data = lusc_omic  )
obs.mortality <- data.frame(time = mle.surv$time,
                            surv = mle.surv$surv,
                            strata = summary(mle.surv, censored=T)$strata)
zeros <- data.frame(time=0, surv=1, strata=unique(obs.mortality$strata))
obs.mortality <- rbind(obs.mortality, zeros)

strata <- str_strata(obs.mortality)
obs.mortality$strata <- NULL
obs.mortality <- cbind( obs.mortality, strata)
# Design plots
cox.plot <- ggplot2::ggplot(data = obs.mortality, aes(x = time, y = surv, group = stage)) +
  geom_step(aes(colour = stage, linetype = "K-M estimate")) +
  geom_line(data = cox.surv, aes(x = time, y = surv, colour = stage, linetype = "Prediction Cox")) + labs(x = "Time(months)", y = "Survival", title = "Cox's model") + theme_bw() 
splines.plot <- ggplot2::ggplot(data = obs.mortality, aes(x = time, y = surv, group = stage)) +
  geom_step(aes(colour = stage, linetype = "K-M estimate")) +
  geom_line(data = splines.surv, aes(x = time, y = surv, colour = stage, linetype = "Prediction I-Splines")) + 
    labs(x = "Time(months)", y = "Survival", title = "I-Splines model") + theme_bw()
gridExtra::grid.arrange(cox.plot, splines.plot, ncol = 2)

Visualising the predictions may help to understand the model. However, it has not yet been established a way to plot continuous variables and/or more than one variable which is something that can be included in the future work.

Fit models

However, more interesting than fitting the simple model above may be to estimate how a manifold of variables influences the outcome when taken into account together.

Clinical models

For example, we can fit a clinical model. Moreover, by using multivariate survival models we can take into account confounding factors for overall survival, such as age. Let's specify the clinical variables:

clinical_vars <- c("stage", "smoke", "age")

Fitting the clinical Cox's models.

## Cox model
surv_formula_clinical <- as.formula(paste0("Surv(time, status) ~ ", paste(clinical_vars, 
    collapse = "+")))
cox.fit <- coxph(surv_formula_clinical, data = lusc_omic)
summary(cox.fit)

A measure we will be interested in showing in the summary object: the Concordance index. Which with for this models is 0.632. Secondly, we will fit the Bayesian proportional hazard modelling the baseline hazard function with splines.

## splines model
gen_inits <- function() {
    list(beta = rnorm(n = 5, mean = 0, sd = 1))
}

splines.data <- generate_stan_data(formula = surv_formula_clinical, data = lusc_omic, 
    basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), 
    init = gen_inits, prior = normal())
splines.fit <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, 
    iter = 2000, seed = 1328025050, init = gen_inits, refresh = 0)

To examine these models we will need a model matrix to compute predictions.

## Create model matrix --
x_clinical <- model.matrix(surv_formula_clinical, data = lusc_omic)[, -1]

Firstly, we will check whether the posterior probabilities of the log-hazard ratios computed with I-splines are similar to the parameter estimates calculated with the Cox's model:

# extract posterior draws
beta_draws <- rstan::extract(splines.fit, "beta")$beta
beta_draws_mean <- apply(beta_draws, 2, mean)
splines.tab <- t(apply(beta_draws, 2, quantile, probs = c(0.025, 0.975, 
    0.5)))
rownames(splines.tab) <- colnames(x_clinical)
splines.tab

We can compare this estimates to the Cox model; the parameter estimates can be extracted using the following function:

par.table <- function(fit) {
    ## needs glm object
    d1 <- summary(fit)$coefficients[, c("coef", "Pr(>|z|)")]
    dc <- confint(fit)
    cbind(dc, d1)
}
par.table(cox.fit)

We can observe that many of the parameters evaluated agree between the two models. However, for the smokeX category, the two estimates are substantially different 0.15 for the Bayesian I-splines model and 0.52 for the Cox's model. To understand what is going on, let's focus on the smoking variable, which in this data has three categories: currently smoking, reformed and not available (X). While the model is secure about currently smoking and reformed categories effect, the data might not be very informative about the not available category. In this case, the Cox's model has given large confidence intervals and a p-value that indicates that the parameter estimate is not very reliable. On the other hand, the Bayesian model has shrunk the posterior towards zero leveraging the normal prior centred at zero.

Model comparison

We may compare how these two models perform using various metrics such as the C-index, survival ROC and the survival Brier score. To do so, we will need to calculate the risk score or prognostic index for each individual:

# calculate risk score
splines_rp <- as.matrix(x_clinical) %*% beta_draws_mean
cox_rp = as.matrix(x_clinical) %*% as.vector(unlist(coef(cox.fit)))

Let us shortly review the metrics we will be using to assess the goodness of fit. The concordance index [@harrell_regression_2014, ch. 12] is a score that measures the rightly ranked pairs of predicted and observed event times. The survival ROC [@heagerty_time-dependent_2000] is an extension of binary ROC curves to include the time dependence of survival outcomes. The survival Brier Score [@mogensen_evaluating_2012] is the squared difference between the model-based predicted survival probabilities and the true disease status.

We will wrap in the following function the call to evaluate the survival Brier score, time-dependent ROC and concordance index using.

# prepare calculate metrics
get_metrics <- function(train, test = NULL, cox_rp, splines_rp, cox_rp_train = NULL, 
    splines_rp_train = NULL) {

    if (is.null(test)) {
        test <- train
        cox_rp_train <- cox_rp
        splines_rp_train <- splines_rp
    }

    ## Prepare to calculate metrics
    ws <- rep(1, length(test$time))
    ws[test$time > max(test$time[test$status])] <- 0.01

    obs.test.time <- sort(unique(test$time[test$status]))
    # cindex
    Cindex.cox <- survcomp::concordance.index(x = cox_rp, surv.time = test$time, 
        surv.event = test$status, method = "conservative", na.rm = TRUE, 
        weights = ws)$c.index
    Cindex.splines <- survcomp::concordance.index(x = splines_rp, surv.time = test$time, 
        surv.event = test$status, method = "conservative", na.rm = TRUE, 
        weights = ws)$c.index

    # brier score
    bs.cox <- survcomp::sbrier.score2proba(data.tr = data.frame(time = train$time, 
        event = train$status, score = cox_rp_train), data.ts = data.frame(time = test$time, 
        event = test$status, score = cox_rp), method = c("prodlim"))
    bs.splines <- survcomp::sbrier.score2proba(data.tr = data.frame(time = train$time, 
        event = train$status, score = splines_rp_train), data.ts = data.frame(time = test$time, 
        event = test$status, score = splines_rp), method = c("prodlim"))

    # AUC ROC
    AUC.cox <- survcomp::tdrocc(x = as.vector(cox_rp), surv.time = test$time, 
        surv.event = test$status, time = max(obs.test.time))
    AUC.splines <- survcomp::tdrocc(x = as.vector(splines_rp), surv.time = test$time, 
        surv.event = test$status, time = max(obs.test.time))
    list(Cindex.cox = Cindex.cox, Cindex.splines = Cindex.splines, bs.cox = bs.cox, 
        bs.splines = bs.splines, AUC.cox = AUC.cox, AUC.splines = AUC.splines)
}
# Get metrics
metrics_list <- get_metrics(train = lusc_omic, splines_rp = splines_rp, 
    cox_rp = cox_rp)
models <- c("Cox clinical", "I-Splines clinical")
cindex <- c(metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]])
bs <- c(metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated)
auc <- c(metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC)
metrics <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc)
metrics

Slightly better for the splines model, but very similar indeed. Let's visualise now the survival Brier score and ROC curves.

# get a beautiful palette to plot
get_colors <- function(n) {
    colors <- c("#89C5DA", "#DA5724", "#74D944", "#CE50CA", "#3F4921", 
        "#C0717C", "#CBD588", "#5F7FC7", "#673770", "#D3D93E", "#38333E", 
        "#508578", "#D7C1B1", "#689030", "#AD6F3B", "#CD9BCD", "#D14285", 
        "#6DDE88", "#652926", "#7FDCC0", "#C84248", "#8569D5", "#5E738F", 
        "#D1A33D", "#8A7C64", "#599861")
    colors[1:n]
}
# prepare plot fun
plot_bs <- function(brier_score, models) {


    data.plot <- lapply(seq_along(brier_score), function(d) {
        data.frame(BS = brier_score[[d]]$bsc, time = brier_score[[d]]$time, 
            model = models[[d]])
    })

    data.caption <- sapply(seq_along(brier_score), function(d) {
        paste0(models[[d]], " = ", round(brier_score[[d]]$bsc.integrated, 
            3))
    })
    data.caption <- paste0("iBrier:\n ", paste(data.caption, collapse = "\n"))
    my_theme <- theme_bw() + theme(axis.text.x = element_text(size = 10), 
        axis.text.y = element_text(size = 10), plot.title = element_text(size = 15, 
            face = "bold", color = "darkgreen"))
    # The palette with black:
    cbbPalette <- get_colors(n = length(models))
    ggplot2::ggplot(data = do.call(rbind.data.frame, data.plot), aes(x = time, 
        y = BS)) + geom_line(aes(color = model), size = 0.55) + labs(title = "Prediction error curve\n", 
        x = "Time [months]", y = "survival Brier score", color = "Model\n", 
        caption = data.caption) + scale_color_manual(labels = models, values = cbbPalette) + 
        my_theme
}
# plot pec
pec <- list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]])
plot_bs(brier_score = pec, models = models)

We can make another plot for the ROC curve:

#prepare function
plot_roc <- function(roc, models) {
    data.plot <- lapply(seq_along(roc), function(d) {
        data.frame(fpr = sort(1 - roc[[d]]$spec), tpr = sort(roc[[d]]$sens), 
            model = models[[d]])
    })

    data.caption <- sapply(seq_along(roc), function(d) {
        paste0(models[[d]], " = ", round(roc[[d]]$AUC, 3))
    })
    data.caption <- paste0("AUC:\n ", paste(data.caption, collapse = "\n"))
    my_theme <- theme_bw() + theme(axis.text.x = element_text(size = 10), 
        axis.text.y = element_text(size = 10), plot.title = element_text(size = 15, 
            face = "bold", color = "darkgreen"))
    cbbPalette <- get_colors(n = length(models))
    ggplot2::ggplot(data = do.call(rbind.data.frame, data.plot), aes(x = fpr, 
        y = tpr)) + geom_line(aes(color = model), size = 0.55) + labs(title = "survival ROC curve\n", 
        x = "False postive rate (1-specificity)", y = "True positive rate (sensitivity)", 
        color = "Model\n", caption = data.caption) + scale_color_manual(labels = models, 
        values = cbbPalette) + my_theme + geom_abline(slope = 1) + coord_fixed()
}
# design plot
roc <- list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]])
plot_roc(roc = roc, models = models)

Genomic models (high-dimensional)

Now we will investigate the hypothesis of including genomic variables from different data sources such as proteomics, transcriptomics and genomics. We have a curated big dataset. The hypothesis is that the variables that are relevant for integrative clustering as assess by their posterior predictive importance might also have a strong effect on the survival outcome. We will be now moving on to the high-dimensional setting where the number of features (genomic variables), is usually considerately larger than the number of individuals observed. Therefore, some form of model tunning needs to be considered, i.e. over-fitting and high variance have to be tackled. Moreover, it is likely that these covariates are correlated, therefore we will be using tunned models for the high-dimensional setting that can handle correlated covariates. The Cox' model will be tunned via the Elastic Net [see @zou_regularization_2005] and the I-splines model via the horseshoe prior [see @carvalho_horseshoe_2010].

# give dummy names to vars to avoid issues when parsing the model formula
x <- lusc_omic[, genomic_vars]
colnames(x) <- paste0("var", 1:length(genomic_vars))

surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), 
    collapse = "+")))

y <- lusc_omic[, c("time", "status")]
in_data <- cbind.data.frame(y, x)

# expected zeros
nz = ceiling(0.01 * length(genomic_vars))
z = length(genomic_vars) - nz

# register for parallelisation if available
set.seed(82405)
foldid <- caret::createFolds(in_data$status, k = 10, list = FALSE)

## do crossvalidation to find optimal alpha and lambda
date()
elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", 
    grouped = TRUE, foldid = foldid, parallel = FALSE)
date()
plot(elasticnet_stuff)

The plot of lambda vs deviance for different values of alpha shows that for smaller alpha values the fit is considerably better. Therefore, this implies that the selected model is close to ridge regression.

The horseshoe prior belongs to the family of shrinkage priors, which cancels the parameters linked to noise while making allowances for the parameters linked to signal. The horseshoe has very tall modes and thick-tailed tails due to the half-Cauchy prior placed to the local hyperparameters.

## generate intits
gen_inits <- function() {
    list(beta = rnorm(n = length(genomic_vars), mean = 0, sd = 1))
}
# generate data
splines.data <- generate_stan_data(surv_formula, data = in_data, basehaz = "fpm", 
    prior_aux = exponential(rate = 1, autoscale = TRUE), prior = hs(global_scale = nz/z))
date()
options(mc.cores = min(4, parallel::detectCores()))
# fit model
splines.hs.genomic <- rstan::stan(stan_file, data = splines.data$standata, 
    chains = 4, iter = 2000, seed = 13250, init = gen_inits, control = list(adapt_delta = 0.99), 
    refresh = 0)
date()
rstan::traceplot(splines.hs.genomic, "lp__")

Divergent transitions may appear when sampling the horseshoe model. But as long as there is only a handful of divergent transition the chain should be fine. Reviewing the traceplot of the log-posterior indicates that it's well-mixed.

We can extract the maximum (penalised) likelihood estimates and posterior draws respectively to calculate the prognostic index.

# coefficients
cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm[i$nzero > 
    nz]))
elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]]
lam <- match(min(cvm), elasticnet$cvm)
# posterior draws
beta_draws <- rstan::extract(splines.hs.genomic, "beta")$beta
beta_draws_mean <- apply(beta_draws, 2, mean)  #mean because the distribution is not normal
# calculate risk score
enet_rp <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam])
splines_rp <- as.matrix(x) %*% beta_draws_mean

Calculate the metrics:

metrics_list <- get_metrics(train = in_data, splines_rp = splines_rp, cox_rp = enet_rp)
# add to list of models
models <- c(models, "Elastic net Cox genomic", "Horseshoe I-S genomic")
cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]])
bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated)
auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC)
# update record 
metrics <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc)
metrics

Similarly, update the pec and roc plots.

pec <- c(pec, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]]))
plot_bs(brier_score = pec, models = models)
roc <- c(roc, list( metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]) )
plot_roc(roc, models)

Clinico-genomic models

In the era of high-throughput genomic data, doctors are still guided by classical variables, such as age or pathological staging to guide clinical practice. It is not yet clear whether or not we should still consider classical variables when modelling survival data. Therefore, we are going to consider a two-sided model where we include a "prior" belief that classical variables are going to be relevant and high-throughout derived genomic variable should be shrunk.

Let's consider the Elastic net, the parameter penalty factor may be use to indicate thatthe clinical covariates should not be shrunk.

# Clinico-genomic model give dummy names to vars to avoid issues
# when creating model formula
x_genomic <- lusc_omic[, genomic_vars]
colnames(x_genomic) <- paste0("var", 1:length(genomic_vars))
# create clinico-genomic model matrix
x <- cbind(x_clinical, x_genomic)
surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), 
    collapse = "+")))

y <- lusc_omic[, c("time", "status")]
in_data <- cbind.data.frame(y, x)
## Expected number of non-zeros
nz = 0.01 * length(genomic_vars)
z = length(genomic_vars) - nz
# register for parallelisation if available
set.seed(1347294)
foldid <- caret::createFolds(lusc_omic$status, k = 10, list = FALSE)
pf = rep(1, ncol(x))
pf[match(colnames(x_clinical), colnames(x))] <- 0  #clinico-genomic assumption
# do crossvalidation to find optimal alpha and lambda
date()
elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", 
    grouped = TRUE, foldid = foldid, parallel = FALSE, penalty.factor = pf)
date()
plot(elasticnet_stuff)

Again the plot of deviance for each value of alpha shows that smalled alpha has lower deviance, therefore, a ridge regression-like model is selected. In the Bayesian approach, two separate set of covariates are specified separately in the model by placing different priors. The clinical covariates receive a normal standardize prior while the genomic covariates a shrinkage prior via the horseshoe density. Let us review the model specification:

print_stan_file(stan_file_hs, section = 'model')
## splines model
# prepare inits
gen_inits <- function() {
    list(beta_1 = rnorm(n = ncol(x_clinical), mean = 0, sd = 1), beta_2 = rnorm(n = length(genomic_vars), 
        mean = 0, sd = 1))
}
# create separate model formulas
surv_formula_1 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_clinical), 
    collapse = "+")))
surv_formula_2 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x)[-match(colnames(x_clinical), 
    colnames(x))], collapse = "+")))
# create data list
cg_splines.data <- generate_stan_data_2(formula1 = surv_formula_1, formula2 = surv_formula_2, 
    data = in_data, basehaz = "fpm", prior_aux = exponential(rate = 1, 
        autoscale = TRUE), prior_1 = normal(), prior_2 = hs(global_scale = nz/z))
date()
stan_file_hs <- read_stan_file("src/stan_files/surv_hs.stan")
splines.hs.clinico.genomic <- rstan::stan(stan_file_hs, data = cg_splines.data$standata, 
    chains = 4, iter = 2000, seed = 13280, init = gen_inits, control = list(adapt_delta = 0.99), 
    refresh = 0)
date()
rstan::traceplot(splines.hs.clinico.genomic, "lp__")

The chains are properly mixed. Although, a few divergent transitions might have shown we should not worry. Now we can obtain the prognostic index and calculate the metrics:

## select best coefficient
cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm))
elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]]
lam <- match(min(cvm), elasticnet$cvm)
## posterior draws
beta_draws_cg <- cbind(rstan::extract(splines.hs.clinico.genomic, "beta_1")$beta_1, 
    rstan::extract(splines.hs.clinico.genomic, "beta_2")$beta_2)
beta_draws_cg_mean <- apply(beta_draws_cg, 2, mean)
## calculate risk score
enet_rp <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, lam])
splines_rp <- as.matrix(x) %*% beta_draws_cg_mean
# calcualte metrics
metrics_list <- get_metrics(train = in_data, cox_rp = enet_rp, splines_rp = splines_rp)
cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]])
bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated)
auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.cox"]]$AUC)
models <- c(models, "Elastic net Cox clinico-genomic", "Horseshoe I-S clinico-genomic")
metrics <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc)
metrics

And update the pec and roc plots:

pec <- c(pec, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]]))
plot_bs(brier_score = pec, models = models)
roc <- c(roc, list( metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]) )
plot_roc(roc, models)

We can see that the models that include more covariates, i.e genomic models, might perform a bit better. However, how can we be sure that we are not over-fitting the model? Could we anticipate if this modelling will have external validity? We will approach the first solution to this question in the next section.

Model validation

We have established that testing the model in the same data-set was the training may be overly simplistic and misleading because it can conceal over-fitting. To compare the models we may consider another approach such as the train/test split.

set.seed(237921)
tmp <- caret::createDataPartition(lusc_omic$status, p = 0.8, list = FALSE)
train <- lusc_omic[tmp, ]
test <- lusc_omic[-tmp, ]

Let's now redo all the models fitting only in the training set and measuring the performance using the hold-out test set.

date()
## Cox model
clinical_vars <- c("stage", "smoke", "age")
surv_formula_clinical <- as.formula(paste0("Surv(time, status) ~ ", paste(clinical_vars, 
    collapse = "+")))
## Create model matrix --
x_clinical <- model.matrix(surv_formula_clinical, data = train)[, -1]  #drop the intercept
x_clinical_test <- model.matrix(surv_formula_clinical, data = test)[, -1]  #drop the intercept
cox.train <- coxph(surv_formula_clinical, data = train)
## splines model
gen_inits <- function() {
    list(beta = rnorm(n = ncol(x_clinical), mean = 0, sd = 1))
}
splines.data <- generate_stan_data(formula = surv_formula_clinical, data = train, 
    basehaz = "fpm", prior_aux = exponential(rate = 1, autoscale = TRUE), 
    init = gen_inits, prior = normal())
splines.train <- rstan::stan(stan_file, data = splines.data$standata, chains = 4, 
    iter = 2000, seed = 1328025050, init = gen_inits, refresh = 0)
# extract posterior draws
beta_draws <- rstan::extract(splines.train, "beta")$beta
beta_draws_mean <- apply(beta_draws, 2, mean)
# calculate risk score predictions test and train
splines_rp <- as.matrix(x_clinical_test) %*% beta_draws_mean
cox_rp = as.matrix(x_clinical_test) %*% as.vector(unlist(coef(cox.train)))
splines_rp_train <- as.matrix(x_clinical) %*% beta_draws_mean
cox_rp_train = as.matrix(x_clinical) %*% as.vector(unlist(coef(cox.train)))
# calculate metrics
metrics_list <- get_metrics(train = train, test = test, cox_rp = cox_rp, 
    splines_rp = splines_rp, cox_rp_train = cox_rp_train, splines_rp_train = splines_rp_train)
cindex <- c(metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]])
bs <- c(metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated)
auc <- c(metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC)
models <- c("Cox-clinical", "Bayesian Splines-clinical")
metrics_test <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc)
pec_test <- list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]])
roc_test <- list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]])
## Genomic models
x <- train[, genomic_vars]
x_test <- test[, genomic_vars]
colnames(x) <- colnames(x_test) <- paste0("var", 1:length(genomic_vars))
surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), 
    collapse = "+")))

y <- train[, c("time", "status")]
y_test <- test[, c("time", "status")]
in_data <- cbind.data.frame(y, x)
in_data_test <- cbind.data.frame(y_test, x_test)
# expected number of non-zeros
nz = ceiling(0.01 * length(genomic_vars))
z = length(genomic_vars) - nz
# register for parallelisation if available
set.seed(82405)
foldid <- caret::createFolds(in_data$status, k = 10, list = FALSE)
## do crossvalidation to find optimal alpha and lambda
elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", 
    grouped = TRUE, foldid = foldid, parallel = FALSE)
# I-splines model
gen_inits <- function() {
    list(beta = rnorm(n = length(genomic_vars), mean = 0, sd = 1))
}
# generate data
splines.data <- generate_stan_data(surv_formula, data = in_data, basehaz = "fpm", 
    prior_aux = exponential(rate = 1, autoscale = TRUE), prior = hs(global_scale = nz/z))
# fit model
splines.hs.genomic <- rstan::stan(stan_file, data = splines.data$standata, 
    chains = 4, iter = 2000, seed = 1328025050, init = gen_inits, control = list(adapt_delta = 0.99), 
    refresh = 0)
# calculate prognostic index
cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm))
elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]]
lam <- match(min(cvm), elasticnet$cvm)

beta_draws <- rstan::extract(splines.hs.genomic, "beta")$beta
beta_draws_mean <- apply(beta_draws, 2, mean) 
# calculate risk score
enet_rp <- as.matrix(x_test) %*% as.vector(coef(elasticnet$glmnet.fit)[, 
    lam])
enet_rp_train <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, 
    lam])
# calculate risk score: predictions test and train
splines_rp <- as.matrix(x_test) %*% beta_draws_mean
splines_rp_train <- as.matrix(x) %*% beta_draws_mean
metrics_list <- get_metrics(train = in_data, test = in_data_test, splines_rp = splines_rp, 
    cox_rp = enet_rp, splines_rp_train = splines_rp_train, cox_rp_train = enet_rp_train)
#-------------------
models <- c(models, "Elastic net genomic", "Horseshoe genomic")
cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]])
bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated)
auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC)
metrics_test <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc)
pec_test <- c(pec_test, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]]))
roc_test <- c(roc_test, list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]))
# ## Clinico-genomic model give dummy names to vars to avoid issues
# when creating model formula
x_genomic <- train[, genomic_vars]
x_genomic_test <- test[, genomic_vars]
colnames(x_genomic) <- colnames(x_genomic_test) <- paste0("var", 1:length(genomic_vars))
## prepare input data
x <- cbind(x_clinical, x_genomic)
x_test <- cbind(x_clinical_test, x_genomic_test)
surv_formula <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x), 
    collapse = "+")))
surv_formula_clinical <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_clinical), 
    collapse = "+")))

y <- train[, c("time", "status")]
y_test <- test[, c("time", "status")]
in_data <- cbind.data.frame(y, x)
## Train model
# register for parallelisation if available
set.seed(1347294)
foldid <- caret::createFolds(train$status, k = 10, list = FALSE)
pf = rep(1, ncol(x))
pf[match(colnames(x_clinical), colnames(x))] <- 0  #clinico-genomic assumption
# do crossvalidation to find optimal alpha and lambda
elasticnet_stuff <- cva.glmnet(surv_formula, data = in_data, family = "cox", 
    grouped = TRUE, foldid = foldid, parallel = FALSE, penalty.factor = pf)
## splines model
gen_inits <- function() {
    list(beta_1 = rnorm(n = ncol(x_clinical), mean = 0, sd = 1), beta_2 = rnorm(n = ncol(x_genomic), 
        mean = 0, sd = 1))
}
surv_formula_1 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_clinical), 
    collapse = "+")))
surv_formula_2 <- as.formula(paste0("Surv(time, status) ~", paste(colnames(x_genomic), 
    collapse = "+")))
cg_splines.data <- generate_stan_data_2(formula1 = surv_formula_1, formula2 = surv_formula_2, 
    data = in_data, basehaz = "fpm", prior_aux = exponential(rate = 1, 
        autoscale = TRUE), prior_1 = normal(), prior_2 = hs(global_scale = nz/z))
splines.hs.train.clinico.genomic <- rstan::stan(stan_file_hs, data = cg_splines.data$standata, 
    chains = 4, iter = 2000, seed = 1328025050, init = gen_inits, control = list(adapt_delta = 0.99), 
    refresh = 0)
## 
cvm <- sapply(elasticnet_stuff$modlist, function(i) min(i$cvm))
elasticnet <- elasticnet_stuff$modlist[[match(min(cvm), cvm)]]
lam <- match(min(cvm), elasticnet$cvm)
## calculate risk score
enet_rp_train <- as.matrix(x) %*% as.vector(coef(elasticnet$glmnet.fit)[, 
    lam])
enet_rp <- as.matrix(x_test) %*% as.vector(coef(elasticnet$glmnet.fit)[, 
    lam])
## Calculate risk scores
beta_draws_cg <- cbind(rstan::extract(splines.hs.train.clinico.genomic, 
    "beta_1")$beta_1, rstan::extract(splines.hs.train.clinico.genomic, 
    "beta_2")$beta_2)
beta_draws_cg_mean <- apply(beta_draws_cg, 2, mean)
# calculate risk score: predictions test and train
splines_rp <- as.matrix(x_test) %*% beta_draws_cg_mean
splines_rp_train <- as.matrix(x) %*% beta_draws_cg_mean
in_data_test <- cbind(y_test, x_test)
metrics_list <- get_metrics(train = in_data, test = in_data_test, splines_rp = splines_rp, 
    cox_rp = enet_rp, splines_rp_train = splines_rp_train, cox_rp_train = enet_rp_train)
#-------------------
models <- c(models, "Elastic net: clinico-genomic", "Horseshoe: clinico-genomic")
cindex <- c(cindex, metrics_list[["Cindex.cox"]], metrics_list[["Cindex.splines"]])
bs <- c(bs, metrics_list[["bs.cox"]]$bsc.integrated, metrics_list[["bs.splines"]]$bsc.integrated)
auc <- c(auc, metrics_list[["AUC.cox"]]$AUC, metrics_list[["AUC.splines"]]$AUC)
metrics_test <- data.frame(models = models, cindex = cindex, bs = bs, auc = auc)
pec_test <- c(pec_test, list(metrics_list[["bs.cox"]], metrics_list[["bs.splines"]]))
roc_test <- c(roc_test, list(metrics_list[["AUC.cox"]], metrics_list[["AUC.splines"]]))
date()

Let's review the metrics:

metrics

And plot the survival ROC and PEC curves.

plot_bs(brier_score = pec_test, models = models)
plot_roc(roc = roc_test, models = models)

Conclusions

This vignette has tested the Bayesian semi-parametric I-splines model sampled with Hamiltonian Monte-Carlo and No-U-Turn-Sample. Besides, we conducted for first time a fair comparision between the I-Splines model and the Cox's model considering genomic and clinico-genomic models. Genomic and clinico-genomic models outperform clinical covariate models. However, clinical covariates should be considered when evaluating the effect of newly discovered genomic targets. A possible reasoning is that the elastic net and horseshoe prior are already tunned to attach more prognistic performance out-the-sample. Although, the train/test split is considered an standard practice to evaluate models [@bovelstad_predicting_2007] shown that the split may have an strong effect on the evalutaion. Therefore, we suggest follow-up this work conducting a Monte-Carlo Cross-Validation.

References



csetraynor/rstanhaz documentation built on May 9, 2019, 8:14 a.m.