knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, warning = FALSE, message = FALSE, fig.align = "center", out.width = "70%", fig.asp = 1)
In this vignette, we perform colorectal cancer (CRC) subtyping analysis using GenomicSuperSignature and benchmark it against Figure 4A of the CRC paper (Ma et al.).
if (!"GenomicSuperSignaturePaper" %in% installed.packages()) devtools::install_github("shbrief/GenomicSuperSignaturePaper") suppressPackageStartupMessages({ library(Biobase) library(tidyverse) library(GenomicSuperSignature) library(GenomicSuperSignaturePaper) })
## If GenomicSuperSignaturePaper is built locally with RAVmodel in inst/extdata data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper") RAVmodel <- readRDS(file.path(data.dir, "RAVmodel_C2.rds"))
RAVmodel <- getModel("C2", load=TRUE)
RAVmodel
version(RAVmodel)
CRC paper used colorectal cancer datasets from curatedCRCData package. Eight of them are for training and ten are for validation.
load("data/eSets/setNames.RData") setNames ## Load validation samples for (set in setNames) { load(paste0("data/eSets_new/", set, '.RData')) }
CRC paper actually used both training and validation datasets (total of 18) for Figure 4A. So here, we reproduce the same plot using all 18 datasets.
Briefly, this plot summarizes the sample scores assigned to 3,567 tumor samples from 18 CRC studies. The samples in each of 18 datasets, assigned to either (i) one of the 4 previously proposed CMS subtypes by CRC Subtyping Consortium or (ii) not assigned to a CMS subtype (so 5 x 18 = 90 total groups), are represented by the mean (point) and standard deviation (error bar) of the sample scores.
val_only <- FALSE # PCSSs subtyping plot with 18 validation datasets source("R/Fig4A_PCSS.R")
saveRDS(pA, "outputs/scatter_PCSS.rds")
If we use only 10 validation datasets:
## CRC paper actually used both training and validation datasets (total of 18) ## for Figure 4. So here, we are using all of them. However, if you want to use ## only 10 validation datasets, run the below chunk. load("data/eSets/trainingSetNames.RData") validationSetNames <- setdiff(setNames, trainingSetNames) setNames <- validationSetNames
val_only <- TRUE # PCSSs subtyping plot with only 10 validation datasets source("R/Fig4A_PCSS.R")
To evaluate the performance of RAVs compared to PCSSs, we searched for colon cancer associated RAVs in three different ways.
We identified two RAVs, RAV1575 and RAV834, that have the highest Pearson correlation coefficients with PCSS1 and PCSS2, respectively (0.59 and 0.56). (detail)
We ran Kruskal-Wallis rank sum test between CMS subtypes and RAV-assigned scores. Two RAVs with the highest chi-square, RAV834 and RAV833, were selected for further testing. (detail)
We calculated validation scores for 18 colon cancer datasets used in CRC paper and collected top 10 validated RAVs from each dataset. We summarized the frequency of different RAVs validating each dataset without any additional filtering criteria and selected the top 2 most frequently validated RAVs, RAV188 and RAV832, which were captured 14 and 10 times, respectively. (detail)
In this section, we draw colon cancer subtype plots with these pairs of RAVs and look up MeSH terms, enriched pathways, and related studies associated with them.
For plotting with RAVs, you can decide whether to use all 18 datasets or only
10 validation datasets by setting val_only
variable before running
R/Fig4A_plotting.R
script. Here, we are using only 10 validation datasets.
val_only <- TRUE # FALSE for 18 datasets, TRUE for 10 validation datasets
We also provide the 'metadata + sample score' table for R/Fig4A_plotting.R
script.
df.results <- readRDS("data/SummaryForFig4.rds")
RAV1575 and RAV834 are identified as the RAVs most similar to PCSS1/2 in this vignette.
sampleScore1 <- 1575 sampleScore2 <- 834 source("R/Fig4A_plotting.R", print.eval = TRUE)
x <- source("R/Fig4A_plotting.R") saveRDS(x, "outputs/scatter_1575_834.rds")
MeSH terms associated with r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
drawWordcloud(RAVmodel, sampleScore1) drawWordcloud(RAVmodel, sampleScore2)
Enriched pathways for r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
annotateRAV(RAVmodel, sampleScore1) annotateRAV(RAVmodel, sampleScore2)
Studies associated with r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
findStudiesInCluster(RAVmodel, sampleScore1, studyTitle = TRUE) findStudiesInCluster(RAVmodel, sampleScore2, studyTitle = TRUE)
Before applying validate
function, any NA or Inf values in CRC datasets were
removed and the expression matrix was centered.
load("data/eSets/setNames.RData") # 18 CRC datasets validated_ind_all <- vector(mode = "list", length = length(setNames)) names(validated_ind_all) <- setNames for (set in setNames) { eSet <- get(set) exprs <- Biobase::exprs(eSet) %>% rmNaInf exprs <- apply(exprs, 1, function(x) x - mean(x)) %>% t val_all <- validate(exprs, RAVmodel) validated_ind_all[[set]] <- validatedSignatures(val_all, RAVmodel, num.out = 10, indexOnly = TRUE) }
We combined the top validated RAVs from all 18 studies and checked the frequency of them. Any RAV that appeared more often means they are consistently associated with the common feature of CRC datasets.
Top row is RAV number and the bottom row is the frequency of it making top validation. Because there were 18 datasets, the maximum frequency of showing up is 18. RAV188 and RAV832 are captured 14 and 10 times, respectively, as top 10 validated RAVs.
## Top 10 validated validated_ind_all %>% unlist %>% table %>% sort(., decreasing = TRUE) ## Top 5 validated lapply(validated_ind_all, function(x) x[1:5]) %>% unlist %>% table %>% sort(., decreasing = TRUE)
sampleScore1 <- 832 sampleScore2 <- 188 source("R/Fig4A_plotting.R", print.eval = TRUE)
x <- source("R/Fig4A_plotting.R") saveRDS(x, "outputs/scatter_832_188.rds")
sampleScore1 <- 832 sampleScore2 <- 833 source("R/Fig4A_plotting.R", print.eval = TRUE)
MeSH terms associated with r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
drawWordcloud(RAVmodel, sampleScore1) drawWordcloud(RAVmodel, sampleScore2)
Enriched pathways for r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
annotateRAV(RAVmodel, sampleScore1) annotateRAV(RAVmodel, sampleScore2)
Studies associated with r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
findStudiesInCluster(RAVmodel, sampleScore1, studyTitle = TRUE) findStudiesInCluster(RAVmodel, sampleScore2, studyTitle = TRUE)
We find the metadata-associated RAVs using Kruskal-Wallis Rank Sum Test. Further detail can be found here.
sampleScore1 <- 834 sampleScore2 <- 833 source("R/Fig4A_plotting.R", print.eval = TRUE)
x <- source("R/Fig4A_plotting.R") saveRDS(x, "outputs/scatter_834_833.rds") png("outputs/png/scatter_834_833.png", width = 400, height = 400) print(x)$visible dev.off()
MeSH terms associated with r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
drawWordcloud(RAVmodel, sampleScore1) drawWordcloud(RAVmodel, sampleScore2)
Enriched pathways for r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
annotateRAV(RAVmodel, sampleScore1) annotateRAV(RAVmodel, sampleScore2)
Studies associated with r paste0("RAV", sampleScore1, " (sampleScore1)")
and r paste0("RAV", sampleScore2, " (sampleScore2)")
findStudiesInCluster(RAVmodel, sampleScore1, studyTitle = TRUE) findStudiesInCluster(RAVmodel, sampleScore2, studyTitle = TRUE)
This table contains the top 10 validated RAVs for 18 CRC datasets.
CRC_validation_df <- data.frame(matrix(unlist(validated_ind_all), nrow = length(validated_ind_all), byrow = TRUE), stringsAsFactors=FALSE) rownames(CRC_validation_df) <- names(validated_ind_all) colnames(CRC_validation_df) <- paste0("validated_RAV_", 1:10) CRC_validation_df[1:4, 1:4]
write.table(CRC_validation_df, "outputs/CRC_top10_validated_ind.tsv")
sessionInfo()
## CMS by Candidate RAVs dat <- readRDS("data/SummaryForFig4.rds") score_ind <- grep("RAV", colnames(dat)) ## Subset with 'cms_label_SSP' dat <- dat %>% mutate(cms_label_SSP = cms_label_SSP %>% recode("unlabeled" = "not labeled")) dat <- dat %>% group_by(study, cms_label_SSP) %>% dplyr::summarise(mean_cl1575 = mean(RAV1575), mean_cl834 = mean(RAV834), mean_cl833 = mean(RAV833), sd_cl1575 = sd(RAV1575), sd_cl834 = sd(RAV834), sd_cl833 = sd(RAV833)) ## Plot cl_ind <- c(1575, 834, 833) mean_names <- paste0("mean_cl", cl_ind) for (i in seq_along(mean_names)) { ordered.dat <- dat[order(dat[,mean_names[i]]),] plot(dplyr::pull(ordered.dat, mean_names[i]), col = ordered.dat$cms_label_SSP, xlab = "", ylab = paste0("Score from RAV", cl_ind[i])) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.