knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This R Markdown document is a step-by-step to reproducing the results from the original publication, as well as showing key gene expression results, handling pamr-object input/output, and generalizing the method to new datasets.

First, the process of discovering the BRS labeling via ConsensusClusterPlus is illustrated in the Cohort A pre-BCG samples. Then, the use of the main BRS-function from this BRSpred-package is illustrated by applying the trained pamr predictor on the discovered BRS-labels, and predicting for new samples.

Citation

If you use BRSpred, please use its corresponding citation at:

de Jong FC, Laajala TD, Hoedemaeker RF, Jordan KR, van der Made ACJ, Boevé ER, van der Schoot DKE, Nieuwkamer B, Janssen EAM, Mahmoudi T, Boormans JL, Theodorescu D, Costello JC, Zuiverloon TCM. Non-muscle-invasive bladder cancer molecular subtypes predict differential response to intravesical Bacillus Calmette-Guérin. Sci Transl Med. 2023 May 24;15(697):eabn4118. doi: 10.1126/scitranslmed.abn4118

BRS pipeline

Installation

# devtools::install_github("CostelloLab/BRSpred")

Setup

library(BRSpred)

Datasets

The package comes with two main dasets, with Cohort A split into pre- and post-BCG treatment samples:

data(CohortA_pre)
data(CohortA_post)
data(CohortB)

The clinical information for the patients from Erasmus MC are stored in:

data(erasmus_clinical)
table(erasmus_clinical$Cohort)

Discovery of BRS labels

Original discovery of the BRS subclass labels is conducted with the help of the ConsensusClusterPlus-package.

Select top 2000 genes based on their variance:

topVarGenes <- head(order(matrixStats::rowVars(as.matrix(CohortA_pre)), decreasing = TRUE), 2000)
TopVar_A  <- CohortA_pre[topVarGenes, ]
TopVar_A2 <- t(scale(t(TopVar_A)))

Use unsupervised consensus clustering on the training cohort (A) to discover the subgroups:

consensus_res <- ConsensusClusterPlus::ConsensusClusterPlus(TopVar_A2, maxK=5, reps=2500, pItem=.95, pFeature=.95, clusterAlg="pam", distance="pearson", seed=123)
icl <- ConsensusClusterPlus::calcICL(consensus_res)
icl[["clusterConsensus"]]

Append consensus classes to clinical information and reassign BRS-label numbers to be more intuitive:

consensus_df <- as.data.frame(consensus_res[[3]][["consensusClass"]])
Clinicaldata_A_pre <- erasmus_clinical[which(erasmus_clinical$Cohort == "A" & erasmus_clinical$Tumor == "Primary"),]
colnames(consensus_df)[1] <- paste("Erasmus.BRS")
Clinicaldata_A_pre$Erasmus.BRS <- consensus_df$Erasmus.BRS
Clinicaldata_A_pre$Erasmus.BRS <- gsub("^3", "BRS1", gsub("^2", "BRS3", gsub("^1", "BRS2", Clinicaldata_A_pre$Erasmus.BRS)))
Clinicaldata_A_pre$Erasmus.BRS <- factor(Clinicaldata_A_pre$Erasmus.BRS, levels=c("BRS1", "BRS2", "BRS3"))
table(Clinicaldata_A_pre$Erasmus.BRS)

Kaplan-Meier curves in the training data:

fit1 <- survival::survfit(survival::Surv(Time_to_prog_or_FUend, Progression) ~ Erasmus.BRS, data = Clinicaldata_A_pre)
survminer::ggsurvplot(fit1,   
    data = Clinicaldata_A_pre,   
    censor.shape="|",   
    censor.size = 3,  
    xlab = "Time in months",   
    title = "PFS in Cohort A",  
    font.title    = c(16, "bold"),
    font.y = "bold",  
    ylab = "Progression free proportion",  size = 1.2,   conf.int = FALSE,   xlim = c(0,120),
    pval = T, risk.table = TRUE, risk.table.col = "black",
    ggtheme = ggplot2::theme_classic(),
    risk.table.y.text.col = T, risk.table.y.text = FALSE,
    break.time.by = 24, palette = c("forestgreen", "dodgerblue3",  "firebrick")) 

ConsensusClusterPlus vs. pamr confusion matrix

The BRS labels provided by ConsensusClusterPlus are given as training labels for pamr; however, to avoid overfitting, the pamr will not perfectly predict these original labels.

The corresponding confusion matrix is as follows:

table(
  ConsensusClusterPlus_pred = Clinicaldata_A_pre$Erasmus.BRS,
  pamr_pred = BRSpred::BRS(newx = BRSpred::CohortA_pre, scale = "together")
)

Testing set (Cohort B)

Performing prediction for cohort B, with z-score applied independently to the training samples from cohort A and new data from cohort B:

CohortB_pred <- BRSpred::BRS(newx = BRSpred::CohortB, scale="independent")
table(CohortB_pred)

Comparing the predicts produced via BRSpred::BRS to the original predictions published in de Jong et al:

table(vignette_prediction = CohortB_pred, original_prediction = erasmus_clinical[colnames(BRSpred::CohortB),"Predicted.subtype"])

Kaplan-Meyer using the predictions for Cohort B:

Clinicaldata_B <- BRSpred::erasmus_clinical[which(BRSpred::erasmus_clinical$Cohort == "B"),]
Clinicaldata_B$CohortB_predict <- CohortB_pred
fit2 <- survival::survfit(survival::Surv(Time_to_prog_or_FUend, Progression) ~ CohortB_predict, data = Clinicaldata_B)
survminer::ggsurvplot(fit2,
    data = Clinicaldata_B,
    censor.shape="|", censor.size = 3,
    xlab = "Time in months", title = "PFS in Cohort B", ylab = "Progression free proportion",
    font.title = c(16, "bold"),
    font.y = "bold", size = 1.2, conf.int = FALSE, xlim = c(0,120),
    pval = T, risk.table = TRUE, risk.table.col = "black", 
    ggtheme = ggplot2::theme_classic(),
    risk.table.y.text.col = T, risk.table.y.text = FALSE,
    break.time.by = 24, palette = c("forestgreen", "dodgerblue3", "firebrick")) 

Post-BCG samples

We run the BRS-classifier for post-BCG samples while z-scaling the new data matrix together with the training data matrix (pre-BCG samples):

CohortA_post_pred <- BRSpred::BRS(newx = BRSpred::CohortA_post, scale="together")
table(CohortA_post_pred)

Comparing the predicts produced via BRSpred::BRS to the original predictions published in de Jong et al:

table(vignette_prediction = CohortA_post_pred, original_prediction = erasmus_clinical[colnames(BRSpred::CohortA_post),"Predicted.subtype"])

Heatmaps

We predefine some subsets of genes of interest and plot their expression in both the training and validation cohort:

genesets <- list(
    "EMT" = c("CDH3", "CLDN3", "CLDN4","CLDN7", "TWIST1", "SNAI1", "VIM", "ZEB1", "ZEB2"),
    "Basal" = c("CD44", "KRT5", "KRT6A", "KRT14","KRT16", "DDR2"),
    "Immune" = c("CD4","CD8A","CD19", "CD79A","CD80", "CD86", "CD274","CTLA4", "PDCD1","FOXP3","TGFB1","TNF","IDO1","NFKB1", "JAK3", "CSF1R"),
    "HLA" = c("HLA-A","HLA-B","HLA-C","HLA-DPA1","HLA-DPB1","HLA-DOA","HLA-DQA1", "HLA-DRB1"),
    "Luminal" = c("PPARG", "FOXA1", "GATA3", "ERBB2", "ERBB3","DDR1", "UPK1A", "UPK2","UPK3B", "UPK3A", "KRT18", "KRT19","KRT20", "CDH1"),
    "FGFR3" = "FGFR3",
    "MYC signaling" = c("MYC", "MYCL", "MYCN", "DUSP2", "EXOSC5", "PPAN", "MYBBP1A", "HK2", "NOC4L", "PUS1", "FARSA"),
    "Cell Cycle" = c("PLK1","E2F1","CDK1","CDK2","CDC7","CCNB1", "CCNB2","CCNE1","E2F3", "EZH2","CCNE2", "ORC5", "RB1","RFC3","STAG2", "BUB1"),
    "Autophagy" = c("MAP1LC3B","ATG3","ATG4C", "ATG5", "ATG7","ATG12","BECN1","MTOR","PIK3CA","PIK3C3"),
    "Intracellular transport" = c("RAB5A", "SEC22B", "VAMP4", "VAMP7", "SCAMP1", "ADAM10", "STX7", "STX12", "TOM1L1", "TMX1", "TSPAN8", "BRAF"),
    "Antigen processing" = c("GLMN","UBE2K","UBE2W","FBXO30","NEDD4L","PSMA1","UBA5","UBA6","FBXL5", "LTN1", "RNF182","RNF138")
)

Cohort A heatmap

Construct geneset-average expression within samples for genes that can be found from the data, based on the trained clustering classes (we order samples based on BRS-classes and inside each class based on progression status):

suborderA <- rownames(Clinicaldata_A_pre[order(Clinicaldata_A_pre$Erasmus.BRS, Clinicaldata_A_pre$Progression),])

exprsmatA <- do.call("rbind", lapply(genesets, FUN=function(genes){
    genes <- genes[which(genes %in% rownames(CohortA_pre))]
    t(as.matrix(colMeans(CohortA_pre[genes,suborderA,drop=FALSE], na.rm=TRUE)))
}))
rownames(exprsmatA) <- names(genesets)

These are then heatmap-plotted e.g. using pheatmap-package, showing clear trends connected to the BRS-classes:

breaksList = seq(-1, 1, by = 0.2)
metadata<-Clinicaldata_A_pre[,c("Progression","Erasmus.BRS")]
pcols = list( 
    Progression = c("No Progression"="limegreen", "Progression"="red"),
    Erasmus.BRS = c("BRS1"="forestgreen","BRS2"="dodgerblue3", "BRS3"="firebrick") 
)

pheatmap::pheatmap(exprsmatA, cluster_rows = FALSE, cluster_cols = FALSE, scale = "row",
         show_colnames = FALSE,
         show_rownames = TRUE,
         fontsize = 8,
         color = colorRampPalette((RColorBrewer::brewer.pal(n = 9, name = "YlOrRd")))(length(breaksList)),
         breaks = breaksList,
         clustering_distance_rows = "euclidean",
         color.scheme.symmetric = T,
         gaps_row = 1:10,
         gaps_col = which(!duplicated(metadata[suborderA,"Erasmus.BRS"]))[-1]-1,
         annotation_col = metadata,
         annotation_colors = pcols)        

Cohort B heatmap

We first construct geneset-average expression within samples, with ordering based on predicted BRS classes:

suborderB <- rownames(Clinicaldata_B[order(Clinicaldata_B$CohortB_predict, Clinicaldata_B$Progression, decreasing = FALSE), ])

exprsmatB <- do.call("rbind", lapply(genesets, FUN=function(genes){
    genes <- genes[which(genes %in% rownames(CohortB))]
    t(as.matrix(colMeans(CohortB[genes,suborderB,drop=FALSE], na.rm=TRUE)))
}))
rownames(exprsmatB) <- names(genesets)

These are then heatmap-plotted e.g. using pheatmap-package, showing clear trends connected to the BRS-classes:

metadata <- Clinicaldata_B[,c("Progression","CohortB_predict")]
pcols = list( 
    Progression = c("No Progression"="limegreen", "Progression"="red"),
    CohortB_predict = c("BRS1"="forestgreen","BRS2"="dodgerblue3", "BRS3"="firebrick") 
)

pheatmap::pheatmap(exprsmatB, cluster_rows = FALSE, cluster_cols = FALSE, scale = "row",
         show_colnames = FALSE,
         show_rownames = TRUE,
         fontsize = 8,
         color = colorRampPalette((RColorBrewer::brewer.pal(n = 9, name = "YlOrRd")))(length(breaksList)),
         breaks = breaksList,
         clustering_distance_rows = "euclidean",
         color.scheme.symmetric = T,
         gaps_row = 1:10,
         gaps_col = which(!duplicated(metadata[suborderB,"CohortB_predict"]))[-1]-1,
         annotation_col = metadata,
         annotation_colors = pcols)         

BRS for independent datasets

The main function for performing pre-trained BRS prediction for external datasets is the BRS-function. Type ?BRS in R to find out more about the syntax of this function, but essentially the only input required is an appropriate gene expression matrix with gene symbols as rows and samples as columns.

Scaling and quantile normalization

In order to make new predictions comparable, the BRS function offers a variety of parameters; of note, qnormalize will quantile normalize the input data to the pamr training data. Further, the scale parameter will allow different kinds of z-score scalings; value together will z-score scale both the new input data and existing training data together so that genes are zero-mean with unit variance, while value independent will scale both data matrices separately. Such normalization procedures ought to be investigated carefully, to make sure the provided pamr-pipeline is compatible with the input data.

Missing values and lack of gene overlap

If the BRS prediction wrapper function cannot find gene symbols used for the pamr clustering, it will impute median values for these genes from the original training gene expression matrix. The wrapper will throw a warning, but not an error; a low overlap in identified genes (e.g. lack of expression in some key genes) may therefore result in poor prediction performance, so these warning messages ought to be investigated accordingly for any given cohort.

Session info

utils::sessionInfo()


CostelloLab/BRSpred documentation built on Aug. 14, 2024, 9:17 a.m.