knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This R Markdown document is a step-by-step to reproducing the results from the original publication, as well as showing key gene expression results, handling pamr-object input/output, and generalizing the method to new datasets.
First, the process of discovering the BRS labeling via ConsensusClusterPlus
is illustrated in the Cohort A pre-BCG samples. Then, the use of the main BRS
-function from this BRSpred
-package is illustrated by applying the trained pamr
predictor on the discovered BRS-labels, and predicting for new samples.
If you use BRSpred
, please use its corresponding citation at:
de Jong FC, Laajala TD, Hoedemaeker RF, Jordan KR, van der Made ACJ, Boevé ER, van der Schoot DKE, Nieuwkamer B, Janssen EAM, Mahmoudi T, Boormans JL, Theodorescu D, Costello JC, Zuiverloon TCM. Non-muscle-invasive bladder cancer molecular subtypes predict differential response to intravesical Bacillus Calmette-Guérin. Sci Transl Med. 2023 May 24;15(697):eabn4118. doi: 10.1126/scitranslmed.abn4118
# devtools::install_github("CostelloLab/BRSpred")
library(BRSpred)
The package comes with two main dasets, with Cohort A split into pre- and post-BCG treatment samples:
data(CohortA_pre) data(CohortA_post) data(CohortB)
The clinical information for the patients from Erasmus MC are stored in:
data(erasmus_clinical) table(erasmus_clinical$Cohort)
Original discovery of the BRS subclass labels is conducted with the help of the ConsensusClusterPlus
-package.
Select top 2000 genes based on their variance:
topVarGenes <- head(order(matrixStats::rowVars(as.matrix(CohortA_pre)), decreasing = TRUE), 2000) TopVar_A <- CohortA_pre[topVarGenes, ] TopVar_A2 <- t(scale(t(TopVar_A)))
Use unsupervised consensus clustering on the training cohort (A) to discover the subgroups:
consensus_res <- ConsensusClusterPlus::ConsensusClusterPlus(TopVar_A2, maxK=5, reps=2500, pItem=.95, pFeature=.95, clusterAlg="pam", distance="pearson", seed=123) icl <- ConsensusClusterPlus::calcICL(consensus_res) icl[["clusterConsensus"]]
Append consensus classes to clinical information and reassign BRS-label numbers to be more intuitive:
consensus_df <- as.data.frame(consensus_res[[3]][["consensusClass"]]) Clinicaldata_A_pre <- erasmus_clinical[which(erasmus_clinical$Cohort == "A" & erasmus_clinical$Tumor == "Primary"),] colnames(consensus_df)[1] <- paste("Erasmus.BRS") Clinicaldata_A_pre$Erasmus.BRS <- consensus_df$Erasmus.BRS Clinicaldata_A_pre$Erasmus.BRS <- gsub("^3", "BRS1", gsub("^2", "BRS3", gsub("^1", "BRS2", Clinicaldata_A_pre$Erasmus.BRS))) Clinicaldata_A_pre$Erasmus.BRS <- factor(Clinicaldata_A_pre$Erasmus.BRS, levels=c("BRS1", "BRS2", "BRS3")) table(Clinicaldata_A_pre$Erasmus.BRS)
Kaplan-Meier curves in the training data:
fit1 <- survival::survfit(survival::Surv(Time_to_prog_or_FUend, Progression) ~ Erasmus.BRS, data = Clinicaldata_A_pre) survminer::ggsurvplot(fit1, data = Clinicaldata_A_pre, censor.shape="|", censor.size = 3, xlab = "Time in months", title = "PFS in Cohort A", font.title = c(16, "bold"), font.y = "bold", ylab = "Progression free proportion", size = 1.2, conf.int = FALSE, xlim = c(0,120), pval = T, risk.table = TRUE, risk.table.col = "black", ggtheme = ggplot2::theme_classic(), risk.table.y.text.col = T, risk.table.y.text = FALSE, break.time.by = 24, palette = c("forestgreen", "dodgerblue3", "firebrick"))
The BRS labels provided by ConsensusClusterPlus
are given as training labels for pamr
; however, to avoid overfitting, the pamr will not perfectly predict these original labels.
The corresponding confusion matrix is as follows:
table( ConsensusClusterPlus_pred = Clinicaldata_A_pre$Erasmus.BRS, pamr_pred = BRSpred::BRS(newx = BRSpred::CohortA_pre, scale = "together") )
Performing prediction for cohort B, with z-score applied independently to the training samples from cohort A and new data from cohort B:
CohortB_pred <- BRSpred::BRS(newx = BRSpred::CohortB, scale="independent") table(CohortB_pred)
Comparing the predicts produced via BRSpred::BRS
to the original predictions published in de Jong et al:
table(vignette_prediction = CohortB_pred, original_prediction = erasmus_clinical[colnames(BRSpred::CohortB),"Predicted.subtype"])
Kaplan-Meyer using the predictions for Cohort B:
Clinicaldata_B <- BRSpred::erasmus_clinical[which(BRSpred::erasmus_clinical$Cohort == "B"),] Clinicaldata_B$CohortB_predict <- CohortB_pred fit2 <- survival::survfit(survival::Surv(Time_to_prog_or_FUend, Progression) ~ CohortB_predict, data = Clinicaldata_B) survminer::ggsurvplot(fit2, data = Clinicaldata_B, censor.shape="|", censor.size = 3, xlab = "Time in months", title = "PFS in Cohort B", ylab = "Progression free proportion", font.title = c(16, "bold"), font.y = "bold", size = 1.2, conf.int = FALSE, xlim = c(0,120), pval = T, risk.table = TRUE, risk.table.col = "black", ggtheme = ggplot2::theme_classic(), risk.table.y.text.col = T, risk.table.y.text = FALSE, break.time.by = 24, palette = c("forestgreen", "dodgerblue3", "firebrick"))
We run the BRS-classifier for post-BCG samples while z-scaling the new data matrix together with the training data matrix (pre-BCG samples):
CohortA_post_pred <- BRSpred::BRS(newx = BRSpred::CohortA_post, scale="together") table(CohortA_post_pred)
Comparing the predicts produced via BRSpred::BRS
to the original predictions published in de Jong et al:
table(vignette_prediction = CohortA_post_pred, original_prediction = erasmus_clinical[colnames(BRSpred::CohortA_post),"Predicted.subtype"])
We predefine some subsets of genes of interest and plot their expression in both the training and validation cohort:
genesets <- list( "EMT" = c("CDH3", "CLDN3", "CLDN4","CLDN7", "TWIST1", "SNAI1", "VIM", "ZEB1", "ZEB2"), "Basal" = c("CD44", "KRT5", "KRT6A", "KRT14","KRT16", "DDR2"), "Immune" = c("CD4","CD8A","CD19", "CD79A","CD80", "CD86", "CD274","CTLA4", "PDCD1","FOXP3","TGFB1","TNF","IDO1","NFKB1", "JAK3", "CSF1R"), "HLA" = c("HLA-A","HLA-B","HLA-C","HLA-DPA1","HLA-DPB1","HLA-DOA","HLA-DQA1", "HLA-DRB1"), "Luminal" = c("PPARG", "FOXA1", "GATA3", "ERBB2", "ERBB3","DDR1", "UPK1A", "UPK2","UPK3B", "UPK3A", "KRT18", "KRT19","KRT20", "CDH1"), "FGFR3" = "FGFR3", "MYC signaling" = c("MYC", "MYCL", "MYCN", "DUSP2", "EXOSC5", "PPAN", "MYBBP1A", "HK2", "NOC4L", "PUS1", "FARSA"), "Cell Cycle" = c("PLK1","E2F1","CDK1","CDK2","CDC7","CCNB1", "CCNB2","CCNE1","E2F3", "EZH2","CCNE2", "ORC5", "RB1","RFC3","STAG2", "BUB1"), "Autophagy" = c("MAP1LC3B","ATG3","ATG4C", "ATG5", "ATG7","ATG12","BECN1","MTOR","PIK3CA","PIK3C3"), "Intracellular transport" = c("RAB5A", "SEC22B", "VAMP4", "VAMP7", "SCAMP1", "ADAM10", "STX7", "STX12", "TOM1L1", "TMX1", "TSPAN8", "BRAF"), "Antigen processing" = c("GLMN","UBE2K","UBE2W","FBXO30","NEDD4L","PSMA1","UBA5","UBA6","FBXL5", "LTN1", "RNF182","RNF138") )
Construct geneset-average expression within samples for genes that can be found from the data, based on the trained clustering classes (we order samples based on BRS-classes and inside each class based on progression status):
suborderA <- rownames(Clinicaldata_A_pre[order(Clinicaldata_A_pre$Erasmus.BRS, Clinicaldata_A_pre$Progression),]) exprsmatA <- do.call("rbind", lapply(genesets, FUN=function(genes){ genes <- genes[which(genes %in% rownames(CohortA_pre))] t(as.matrix(colMeans(CohortA_pre[genes,suborderA,drop=FALSE], na.rm=TRUE))) })) rownames(exprsmatA) <- names(genesets)
These are then heatmap-plotted e.g. using pheatmap
-package, showing clear trends connected to the BRS-classes:
breaksList = seq(-1, 1, by = 0.2) metadata<-Clinicaldata_A_pre[,c("Progression","Erasmus.BRS")] pcols = list( Progression = c("No Progression"="limegreen", "Progression"="red"), Erasmus.BRS = c("BRS1"="forestgreen","BRS2"="dodgerblue3", "BRS3"="firebrick") ) pheatmap::pheatmap(exprsmatA, cluster_rows = FALSE, cluster_cols = FALSE, scale = "row", show_colnames = FALSE, show_rownames = TRUE, fontsize = 8, color = colorRampPalette((RColorBrewer::brewer.pal(n = 9, name = "YlOrRd")))(length(breaksList)), breaks = breaksList, clustering_distance_rows = "euclidean", color.scheme.symmetric = T, gaps_row = 1:10, gaps_col = which(!duplicated(metadata[suborderA,"Erasmus.BRS"]))[-1]-1, annotation_col = metadata, annotation_colors = pcols)
We first construct geneset-average expression within samples, with ordering based on predicted BRS classes:
suborderB <- rownames(Clinicaldata_B[order(Clinicaldata_B$CohortB_predict, Clinicaldata_B$Progression, decreasing = FALSE), ]) exprsmatB <- do.call("rbind", lapply(genesets, FUN=function(genes){ genes <- genes[which(genes %in% rownames(CohortB))] t(as.matrix(colMeans(CohortB[genes,suborderB,drop=FALSE], na.rm=TRUE))) })) rownames(exprsmatB) <- names(genesets)
These are then heatmap-plotted e.g. using pheatmap
-package, showing clear trends connected to the BRS-classes:
metadata <- Clinicaldata_B[,c("Progression","CohortB_predict")] pcols = list( Progression = c("No Progression"="limegreen", "Progression"="red"), CohortB_predict = c("BRS1"="forestgreen","BRS2"="dodgerblue3", "BRS3"="firebrick") ) pheatmap::pheatmap(exprsmatB, cluster_rows = FALSE, cluster_cols = FALSE, scale = "row", show_colnames = FALSE, show_rownames = TRUE, fontsize = 8, color = colorRampPalette((RColorBrewer::brewer.pal(n = 9, name = "YlOrRd")))(length(breaksList)), breaks = breaksList, clustering_distance_rows = "euclidean", color.scheme.symmetric = T, gaps_row = 1:10, gaps_col = which(!duplicated(metadata[suborderB,"CohortB_predict"]))[-1]-1, annotation_col = metadata, annotation_colors = pcols)
The main function for performing pre-trained BRS prediction for external datasets is the BRS
-function. Type ?BRS
in R to find out more about the syntax of this function, but essentially the only input required is an appropriate gene expression matrix with gene symbols as rows and samples as columns.
In order to make new predictions comparable, the BRS
function offers a variety of parameters; of note, qnormalize
will quantile normalize the input data to the pamr training data. Further, the scale
parameter will allow different kinds of z-score scalings; value together
will z-score scale both the new input data and existing training data together so that genes are zero-mean with unit variance, while value independent
will scale both data matrices separately. Such normalization procedures ought to be investigated carefully, to make sure the provided pamr-pipeline is compatible with the input data.
If the BRS
prediction wrapper function cannot find gene symbols used for the pamr
clustering, it will impute median values for these genes from the original training gene expression matrix. The wrapper will throw a warning, but not an error; a low overlap in identified genes (e.g. lack of expression in some key genes) may therefore result in poor prediction performance, so these warning messages ought to be investigated accordingly for any given cohort.
utils::sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.