knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, 
                      warning = FALSE, message = FALSE,
                      fig.align = "center", out.width = "70%", fig.asp = 1)

Setup

In this vignette, we perform colorectal cancer (CRC) subtyping analysis using GenomicSuperSignature and benchmark it against Figure 4A of the CRC paper (Ma et al.).

Load packages

if (!"GenomicSuperSignaturePaper" %in% installed.packages())
    devtools::install_github("shbrief/GenomicSuperSignaturePaper")

suppressPackageStartupMessages({
  library(Biobase)
  library(tidyverse)
  library(GenomicSuperSignature)
  library(GenomicSuperSignaturePaper)
})

RAVmodel

## If GenomicSuperSignaturePaper is built locally with RAVmodel in inst/extdata
data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper")
RAVmodel <- readRDS(file.path(data.dir, "RAVmodel_C2.rds"))
RAVmodel <- getModel("C2", load=TRUE)
RAVmodel

version(RAVmodel)

CRC validation datasets

CRC paper used colorectal cancer datasets from curatedCRCData package. Eight of them are for training and ten are for validation.

load("data/eSets/setNames.RData")
setNames

## Load validation samples
for (set in setNames) {
  load(paste0("data/eSets_new/", set, '.RData'))
}

CRC subtyping with PCSSs

CRC paper actually used both training and validation datasets (total of 18) for Figure 4A. So here, we reproduce the same plot using all 18 datasets.

Briefly, this plot summarizes the sample scores assigned to 3,567 tumor samples from 18 CRC studies. The samples in each of 18 datasets, assigned to either (i) one of the 4 previously proposed CMS subtypes by CRC Subtyping Consortium or (ii) not assigned to a CMS subtype (so 5 x 18 = 90 total groups), are represented by the mean (point) and standard deviation (error bar) of the sample scores.

val_only <- FALSE # PCSSs subtyping plot with 18 validation datasets
source("R/Fig4A_PCSS.R")
saveRDS(pA, "outputs/scatter_PCSS.rds")

If we use only 10 validation datasets:

## CRC paper actually used both training and validation datasets (total of 18) 
## for Figure 4. So here, we are using all of them. However, if you want to use 
## only 10 validation datasets, run the below chunk.

load("data/eSets/trainingSetNames.RData")
validationSetNames <- setdiff(setNames, trainingSetNames)
setNames <- validationSetNames
val_only <- TRUE # PCSSs subtyping plot with only 10 validation datasets
source("R/Fig4A_PCSS.R")

CRC subtyping with RAVs

To evaluate the performance of RAVs compared to PCSSs, we searched for colon cancer associated RAVs in three different ways.

  1. We identified two RAVs, RAV1575 and RAV834, that have the highest Pearson correlation coefficients with PCSS1 and PCSS2, respectively (0.59 and 0.56). (detail)

  2. We ran Kruskal-Wallis rank sum test between CMS subtypes and RAV-assigned scores. Two RAVs with the highest chi-square, RAV834 and RAV833, were selected for further testing. (detail)

  3. We calculated validation scores for 18 colon cancer datasets used in CRC paper and collected top 10 validated RAVs from each dataset. We summarized the frequency of different RAVs validating each dataset without any additional filtering criteria and selected the top 2 most frequently validated RAVs, RAV188 and RAV832, which were captured 14 and 10 times, respectively. (detail)

In this section, we draw colon cancer subtype plots with these pairs of RAVs and look up MeSH terms, enriched pathways, and related studies associated with them.

Most similar to PCSS1/PCSS2

For plotting with RAVs, you can decide whether to use all 18 datasets or only 10 validation datasets by setting val_only variable before running R/Fig4A_plotting.R script. Here, we are using only 10 validation datasets.

val_only <- TRUE   # FALSE for 18 datasets, TRUE for 10 validation datasets

We also provide the 'metadata + sample score' table for R/Fig4A_plotting.R script.

df.results <- readRDS("data/SummaryForFig4.rds")

RAV1575 and RAV834 are identified as the RAVs most similar to PCSS1/2 in this vignette.

sampleScore1 <- 1575
sampleScore2 <- 834
source("R/Fig4A_plotting.R", print.eval = TRUE)
x <- source("R/Fig4A_plotting.R")
saveRDS(x, "outputs/scatter_1575_834.rds")

Wordcloud

MeSH terms associated with r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

drawWordcloud(RAVmodel, sampleScore1)
drawWordcloud(RAVmodel, sampleScore2)

GSEA

Enriched pathways for r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

annotateRAV(RAVmodel, sampleScore1)
annotateRAV(RAVmodel, sampleScore2)

Studies

Studies associated with r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

findStudiesInCluster(RAVmodel, sampleScore1, studyTitle = TRUE)
findStudiesInCluster(RAVmodel, sampleScore2, studyTitle = TRUE)

Highest validation score

Validation

Before applying validate function, any NA or Inf values in CRC datasets were removed and the expression matrix was centered.

load("data/eSets/setNames.RData")   # 18 CRC datasets
validated_ind_all <- vector(mode = "list", length = length(setNames))
names(validated_ind_all) <- setNames

for (set in setNames) {
  eSet <- get(set)
  exprs <- Biobase::exprs(eSet) %>% rmNaInf
  exprs <- apply(exprs, 1, function(x) x - mean(x)) %>% t
  val_all <- validate(exprs, RAVmodel)
  validated_ind_all[[set]] <- validatedSignatures(val_all, RAVmodel,
                                                  num.out = 10, 
                                                  indexOnly = TRUE)
}

We combined the top validated RAVs from all 18 studies and checked the frequency of them. Any RAV that appeared more often means they are consistently associated with the common feature of CRC datasets.

Top row is RAV number and the bottom row is the frequency of it making top validation. Because there were 18 datasets, the maximum frequency of showing up is 18. RAV188 and RAV832 are captured 14 and 10 times, respectively, as top 10 validated RAVs.

## Top 10 validated
validated_ind_all %>% unlist %>% table %>% sort(., decreasing = TRUE)

## Top 5 validated
lapply(validated_ind_all, function(x) x[1:5]) %>%
  unlist %>% table %>% sort(., decreasing = TRUE)

Top validated RAVs

sampleScore1 <- 832
sampleScore2 <- 188
source("R/Fig4A_plotting.R", print.eval = TRUE)
x <- source("R/Fig4A_plotting.R")
saveRDS(x, "outputs/scatter_832_188.rds")
sampleScore1 <- 832
sampleScore2 <- 833
source("R/Fig4A_plotting.R", print.eval = TRUE)

Wordcloud

MeSH terms associated with r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

drawWordcloud(RAVmodel, sampleScore1)
drawWordcloud(RAVmodel, sampleScore2)

GSEA

Enriched pathways for r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

annotateRAV(RAVmodel, sampleScore1)
annotateRAV(RAVmodel, sampleScore2)

Studies

Studies associated with r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

findStudiesInCluster(RAVmodel, sampleScore1, studyTitle = TRUE)
findStudiesInCluster(RAVmodel, sampleScore2, studyTitle = TRUE)

Best explaining metadata

We find the metadata-associated RAVs using Kruskal-Wallis Rank Sum Test. Further detail can be found here.

sampleScore1 <- 834
sampleScore2 <- 833
source("R/Fig4A_plotting.R", print.eval = TRUE)
x <- source("R/Fig4A_plotting.R")
saveRDS(x, "outputs/scatter_834_833.rds")

png("outputs/png/scatter_834_833.png", width = 400, height = 400)
print(x)$visible
dev.off()

Wordcloud

MeSH terms associated with r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

drawWordcloud(RAVmodel, sampleScore1)
drawWordcloud(RAVmodel, sampleScore2)

GSEA

Enriched pathways for r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

annotateRAV(RAVmodel, sampleScore1)
annotateRAV(RAVmodel, sampleScore2)

Studies

Studies associated with r paste0("RAV", sampleScore1, " (sampleScore1)") and r paste0("RAV", sampleScore2, " (sampleScore2)")

findStudiesInCluster(RAVmodel, sampleScore1, studyTitle = TRUE)
findStudiesInCluster(RAVmodel, sampleScore2, studyTitle = TRUE)

Manuscript Table

Supplementary Table 5

This table contains the top 10 validated RAVs for 18 CRC datasets.

CRC_validation_df <- data.frame(matrix(unlist(validated_ind_all), 
                                       nrow = length(validated_ind_all), 
                                       byrow = TRUE), 
                                stringsAsFactors=FALSE)
rownames(CRC_validation_df) <- names(validated_ind_all)
colnames(CRC_validation_df) <- paste0("validated_RAV_", 1:10)

CRC_validation_df[1:4, 1:4]
write.table(CRC_validation_df, "outputs/CRC_top10_validated_ind.tsv")

Session Info

sessionInfo()

## CMS by Candidate RAVs
dat <- readRDS("data/SummaryForFig4.rds")
score_ind <- grep("RAV", colnames(dat))

## Subset with 'cms_label_SSP'
dat <- dat %>%
  mutate(cms_label_SSP = cms_label_SSP %>%
           recode("unlabeled" = "not labeled"))

dat <- dat %>%
  group_by(study, cms_label_SSP) %>%
  dplyr::summarise(mean_cl1575 = mean(RAV1575),
                   mean_cl834 = mean(RAV834),
                   mean_cl833 = mean(RAV833),
                   sd_cl1575 = sd(RAV1575),
                   sd_cl834 = sd(RAV834),
                   sd_cl833 = sd(RAV833))

## Plot
cl_ind <- c(1575, 834, 833)
mean_names <- paste0("mean_cl", cl_ind)

for (i in seq_along(mean_names)) {
  ordered.dat <- dat[order(dat[,mean_names[i]]),]
  plot(dplyr::pull(ordered.dat, mean_names[i]),
       col = ordered.dat$cms_label_SSP,
       xlab = "", ylab = paste0("Score from RAV", cl_ind[i]))
}


shbrief/GenomicSuperSignaturePaper documentation built on Aug. 2, 2022, 2:04 p.m.