srvALL: Continuous Biomarker Assessment by Exhaustive Survival Analysis

Rationale

Dominic Pearce 2018-04-30

library(survivALL)
library(survival)
library(survcomp)
library(Biobase)
library(magrittr)
library(ggplot2)
library(GGally)
library(ggthemes)
library(cowplot); theme_set(theme_grey())

Rationale

Survival analysis typically stratifies a patient cohort by some measure and compares the relative survival between groups. In practice, for a continuous measure, the decision of where to draw this division relies on previous knowledge or, more problematically, it may simply be a researchers best guess - for example separating around the median. Below are outlined 3 alternative survival analysis approaches --- median, hypothesis-driven and our own data-driven approach as part of the survivALL package.

We begin with an expressionSet nki, which details both gene expression as well as survival information (i.e. event and time-to-event data) for 337 invasive breast cancer patients of no specific subtype. Here the event we will be measuring is distant metastasis free survival or dmfs.

It is important to ensure that there are no NA values in our survival information. In our example this has already been accounted for and the data is included as part of the survivALL package.

data(nki_subset)

pData(nki_subset)[1:3, ]

Survival Information samplename age grade e.dmfs t.dmfs NKI_4 NKI_4 41 3 0 4747 NKI_6 NKI_6 49 2 0 4075 NKI_7 NKI_7 46 1 0 3703

We can see that our complete survival data includes the three critical components of our analysis, namely sample names, events (e.dmfs) and times-to-event (t.dmfs).

To determine a variables prognostic capacity we can apply the Kaplan-Meier estimator and plot. Here we will split our cohort using the expression of ERBB2 a gene whose increased expression is known to be associated with poor prognosis in a mixed population of invasive breast cancers.

We will stratify the cohort into high and low ERBB2 expression using median ERBB2 expression...

#ERBB2 expression vector
erbb2_xpr <- exprs(nki_subset)["NM_004448",] 
#convert to binary classifier
erbb2_med <- ifelse(erbb2_xpr >= median(erbb2_xpr), "high", "low")

...and produce our Kaplan-Meier plot and statistics

srv_obj <- survival::Surv(nki_subset$t.dmfs, nki_subset$e.dmfs)
median_fit <- survival::survfit(srv_obj ~ erbb2_med)

p_med <- GGally::ggsurv(median_fit, surv.col = c("#525252", "#bdbdbd")) +
    ylim(0.4, 1) +
    labs(title = "Figure S1", 
         subtitle = "Median approach", 
         x = "Years") +
    theme_pander() + 
    theme(axis.line = element_line(size = 0.1), 
          legend.position = 'bottom')

p_med <- ggdraw(add_sub(p_med, "p = 0.9", 
                    vpadding=grid::unit(0, "lines"), 
                    y = 17, 
                    x = 0.55, 
                    hjust = 0, 
                    size = 10))
p_med

broom::tidy(survival::coxph(srv_obj ~ erbb2_med)) %>% pandoc.table()

term estimate std.error statistic p.value conf.low conf.high erbb2_medlow -1.434e-05 0.1917 -7.482e-05 0.9999 -0.3758 0.3757

Surprisingly and despite ERBB2 known to be highly prognostic in invasive breast cancer, there is no association between ERBB2 expression and prognosis.

However, knowing that ERBB2 is overexpressed in ~25% of invasive breast cancer cases, we can modify our cohort stratification to reflect this.

erbb2_hypothesis <- ifelse(erbb2_xpr >= quantile(erbb2_xpr, probs = 0.75), "high", "low") 
hypothesis_fit <- survival::survfit(srv_obj ~ erbb2_hypothesis)

p_hyp <- GGally::ggsurv(hypothesis_fit, surv.col = c("#525252", "#bdbdbd")) +
    ylim(0.4, 1) +
    labs(title = "Figure S2", 
         subtitle = "Hypothesis-driven approach", 
         x = "Years") +
    theme_pander() + 
    theme(axis.line = element_line(size = 0.1), 
          legend.position = 'none')

p_hyp <- ggdraw(add_sub(p_hyp, "p = 0.043", 
                        vpadding=grid::unit(0, "lines"), 
                        y = 17, 
                        x = 0.55, 
                        hjust = 0, 
                        size = 10))
p_hyp

broom::tidy(survival::coxph(srv_obj ~ erbb2_hypothesis)) %>% pandoc.table()

Table continues below term estimate std.error statistic p.value conf.low erbb2_hypothesislow -0.3961 0.2086 -1.899 0.05757 -0.8049 conf.high 0.01273

And consequently demonstrate a significant association between prognosis and ERBB2 expression.

By comparing multiple patient stratifications we have successful revealed ERBB2's relationship with prognosis based on a reasoned estimate of where we expect optimal stratification to occur. It may still be, however, that our cohort's composition in terms of ERBB2 expression does not exactly mirror population levels - i.e. that the best point-of-separation is not found with a 75-25% split. This will become even more problematic for a novel biomarker for which population levels may be unknown. A solution to this unknown quality is, instead of estimating where we anticipate our point-of-separation to occur, we calculate survival statistics for all separations possible.

For our 319 patients and having calculated our median and hypothesis divisions, there remain an additional 316 points at which we could otherwise stratify our cohort, any of which may demonstrate a more significant association with prognosis. By additionally calculating survival statistics for each of these possible point-of-separation we guarantee that we reveal the true optimal stratification.

We can visually inspect the best point of separation by using the plotALL() function.

plotALL(measure = erbb2_xpr,
        srv = pData(nki_subset), 
        time = "t.dmfs", 
        event = "e.dmfs") +
    labs(title = "Figure S3", subtitle = "survivALL plot output") + 
    theme(plot.title = element_text(hjust = 0))

We can now see that the best point at which we could separate our cohort based on ERBB2 expression is closer to a 85-15% split, rather than 75-25%, indicating a sampling bias towards HER2 negative patients in the NKI cohort.

We can further investigate the most significant separation using survivALL(), which follows the same procedure as plotALL() but returns a dataframe of calculations rather than plotting.

Using this we can select the point of most significant separation...

srvall <- survivALL(measure = erbb2_xpr,
               srv = pData(nki_subset), 
               time = "t.dmfs", 
               event = "e.dmfs", 
               measure_name = "ERBB2") 

srvall[which.min(srvall$p),] %>% pandoc.table()

Table continues below samples event_time event measure HR p NKI_181 NKI_181 3951 FALSE 0.474 1.201 0.002695 p_adj log10_p bsp bsp_adj index name dsr clsf NKI_181 0.1682 -2.569 NA NA 284 ERBB2 NA 0

...and re-draw our Kaplan-Meier

erbb2_data <- ifelse(srvall$clsf == 0, "low", "high") 
srv_obj <- survival::Surv(as.numeric(srvall$event_time), srvall$event)
data_fit <- survival::survfit(srv_obj ~ erbb2_data)

p_data <- GGally::ggsurv(data_fit, surv.col = c("#525252", "#bdbdbd")) +
    ylim(0.4, 1) +
    labs(title = "Figure S4", subtitle = "Data-drive approach", x = "Years") +
    theme_pander() + 
    theme(axis.line = element_line(size = 0.1), 
          legend.position = 'none')

p_data <- ggdraw(add_sub(p_data, 
                         "p = 0.001", 
                         vpadding=grid::unit(0, "lines"), 
                         y = 17, 
                         x = 0.55, 
                         hjust = 0, 
                         size = 10))
p_data

broom::tidy(survival::coxph(srv_obj ~ erbb2_data)) %>% pandoc.table()

term estimate std.error statistic p.value conf.low conf.high erbb2_datalow -0.8324 0.253 -3.29 0.001002 -1.328 -0.3365

pearcedom/srvALL documentation built on May 24, 2019, 2:49 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

pearcedom/srvALL
Continuous Biomarker Assessment by Exhaustive Survival Analysis

vignettes/rationale.md
In pearcedom/srvALL: Continuous Biomarker Assessment by Exhaustive Survival Analysis

Rationale

Rationale

Libraries

Rationale

Median dichotomisation

Hypothesis-driven dichotomisation

Data-driven dichotomisation

R Package Documentation

Browse R Packages

We want your feedback!

pearcedom/srvALL Continuous Biomarker Assessment by Exhaustive Survival Analysis

vignettes/rationale.md In pearcedom/srvALL: Continuous Biomarker Assessment by Exhaustive Survival Analysis

Rationale

Rationale

Libraries

Rationale

Median dichotomisation

Hypothesis-driven dichotomisation

Data-driven dichotomisation

R Package Documentation

Browse R Packages

We want your feedback!

pearcedom/srvALL
Continuous Biomarker Assessment by Exhaustive Survival Analysis

vignettes/rationale.md
In pearcedom/srvALL: Continuous Biomarker Assessment by Exhaustive Survival Analysis