knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, message = FALSE,
                      fig.width = 5, fig.height = 3.5)

Setup

Background

Using training and validation data of the original CRC study, we compared associations between different subtype models and RAVs with the same clinicopathological variables. Notably, these data were not part of RAV training and are microarray datasets whereas the RAVs were trained exclusively from RNA-seq data.

As described previously in CRC study, we used the likelihood-ratio test (LRT) to compare the different subtype models for association with clinicopathological variables. A p-value near 1 (-log10p-value near 0) means that no additional information is provided by a full model composed of two subtype definitions compared to a model with only one.

CMS-associated RAVs performed better than discrete CMS on all four phenotypes (plot) and also outperformed PCSSs except on tumor location (plot). Interestingly, PCSS-associated RAVs were still better than CMS but slightly worse than PCSSs (plot), while CMS-associated RAVs were better than both CMS and PCSSs (plot), indicating that RAVs contain more comprehensive information. This performance improvement became more significant using only the 10 original validation datasets, excluding 8 datasets used to train the PCSS model. In conclusion, RAVs trained from heterogeneous datasets, not specific to CRC, captured biologically-relevant signatures for CRC as well or superior to focused efforts using CRC-specific databases, suggesting that RAVs are for general-use and can be applied to describe other diseases as well.

Select training dataset

CRC paper actually used both training and validation datasets (total of 18) for Figure 4. If you want to use all 18 datasets, assign the variable num_dat = 18. If you want only 10 CRC datasets excluding 8 datasets used to build PCSSs, assign num_dat = 10.

Required inputs

# width : width of boxplot
# ymin or xmin : lower whisker = smallest observation greater than or equal to lower hinge - 1.5 *
# IQR
# lower or xlower : lower hinge, 25% quantile
# notchlower : lower edge of notch = median - 1.58 * IQR / sqrt(n)
# middle or xmiddle : median, 50% quantile
# notchupper : upper edge of notch = median + 1.58 * IQR / sqrt(n)
# upper or xupper : upper hinge, 75% quantile
# ymax or xmax : upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR

boxplot_summary <- as.data.frame(matrix(nrow = 8*9, ncol = 11))
colnames(boxplot_summary) <- c("figure", "panel", "group", "datasets_used", 
                               "minima", "whisker_min", "first_quartile", 
                               "median", "third_quartile", "whisker_max", 
                               "maxima")
boxplot_fig <- c("3B","3C","Sup5B","Sup5C","Sup5E","Sup5F","Sup6C","Sup6D","Sup6E")
boxplot_summary$figure <- rep(boxplot_fig, each = 8)
boxplot_summary$panel <- rep(rep(c(1,2,3,4), each = 2), 9)
boxplot_summary$group <- rep(c(1,2), 36)

## 4 panels and 2 groups
ref <- data.frame(panel = rep(1:4, each = 2), group = 1:2)

##### statsSummary function to extract boxplot stats
# plotRes : ggplot object (e.g. x$value from the above script)
# figName : name of the figure in the manuscript

statsSummary <- function(plotRes, figName) {
    res <- ggplot_build(plotRes)
    res_box <- res$data[[1]]  # Summary
    res_dat <- res$data[[2]]  # Data points
    ind <- which(boxplot_summary$figure == figName)

    ## the number of datasets used
    for (i in seq_len(nrow(ref))) {
        panel <- ref$panel[i]
        group <- ref$group[i]
        sub <- res_dat[which(res_dat$PANEL == panel & res_dat$group == group),]

        ## row to add sample size info
        datasets_used_ind <- which(boxplot_summary$figure == figName &
                                     boxplot_summary$panel == panel &
                                     boxplot_summary$group == group)

        boxplot_summary$datasets_used[datasets_used_ind] <- sum(!is.na(sub$y))
    }

    ## if there is no outlier, just assign median value for the function
    no_outlier_ind <- which(isEmpty(res_box$outliers)) 
    res_box$outliers[no_outlier_ind] <- res_box$middle[no_outlier_ind]

    boxplot_summary$minima[ind] <- sapply(res_box$ymin, pmin, as.numeric(unlist(res_box$outliers))) %>% diag
    boxplot_summary$whisker_min[ind] <- res_box$ymin # lower extreme
    boxplot_summary$first_quartile[ind] <- res_box$lower # 25% quartile (box min)
    boxplot_summary$median[ind] <- res_box$middle # 50% quartile
    boxplot_summary$third_quartile[ind] <- res_box$upper # 75% quartile (box max)
    boxplot_summary$whisker_max[ind] <- res_box$ymax
    boxplot_summary$maxima[ind] <- sapply(res_box$ymax, pmax, as.numeric(unlist(res_box$outliers))) %>% diag

    return(boxplot_summary)
}

CMS vs. PCSS1/2

First, we reproduce the previous result comparing the performance of PCSSs (continuous score) against CMS (discrete score).

18 CRC validation datasets

m2_name <- "PCSS"
m2_1 <- "PCSS1"
m2_2 <- "PCSS2"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_PCSS.rds"))

10 CRC validation datasets

m2_name <- "PCSS"
m2_1 <- "PCSS1"
m2_2 <- "PCSS2"
num_dat <- 10
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup6C")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_PCSS.rds"))

Figures in the paper

Here is the summary of boxplots in our paper.

| Figures | Models | # validation data | |------------|----------|-------------------| | Fig.3B | CMS vs. RAV834/833 | 18 | | Fig.3C | PCSS1/2 vs. RAV834/833 | 18 | | Sup.Fig.5B | CMS vs. RAV1575/834 | 18 | | Sup.Fig.5C | PCSS1/2 vs. RAV1575/834 | 18 | | Sup.Fig.5E | CMS vs. RAV832/188 | 18 | | Sup.Fig.5F | PCSS1/2 vs. RAV832/188 | 18 | | Fig.6C | CMS vs. PCSS1/2 | 10 | | Fig.6D | CMS vs. RAV834/833 | 10 | | Fig.6E | PCSS1/2 vs. RAV834/833 | 10 |

Compare to CMS

CMS vs. RAV1575/834

RAV1575/834 are the most similar RAVs to PCSS1/2, respectively, based on Pearson correlation.

m2_name <- "RAV"
m2_1 <- "RAV1575"
m2_2 <- "RAV834"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5B")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_1575_834.rds"))

CMS vs. RAV834/833

RAV834/833 have the highest r-squared score when we compared the sample scores against the metadata.

18 CRC validation datasets

m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "3B")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_834_833.rds"))

png("outputs/png/boxplot_CMS_vs_834_833.png", width = 600, height = 500)
print(x)$visible
dev.off()

10 CRC validation datasets

m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 10
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup6D")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_834_833.rds"))

png("outputs/png/boxplot_CMS_vs_834_833.png", width = 600, height = 500)
print(x)$visible
dev.off()

CMS vs. RAV832/188

RAV832/188 are the most frequently validated RAVs for 18 CRC datasets.

m2_name <- "RAV"
m2_1 <- "RAV832"
m2_2 <- "RAV188"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5E")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_832_188.rds"))

png("outputs/png/boxplot_CMS_vs_832_188.png", width = 600, height = 500)
print(x)$visible
dev.off()

Compare to PCSS

PCSS vs. RAV1575/834

RAV1575/834 are the most similar RAVs to PCSS1/2, respectively, based on Pearson correlation.

m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV1575"
m2_2 <- "RAV834"
num_dat <- 18

x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5C")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_1575_834.rds"))

PCSS vs. RAV834/833

RAV834/833 have the highest r-squared score when we compared the sample scores against the metadata.

18 CRC validation datasets

m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 18

x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "3C")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_834_833.rds"))

png("outputs/png/boxplot_PCSS_vs_834_833.png", width = 600, height = 500)
print(x)$visible
dev.off()

10 CRC validation datasets

m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 10

x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup6E")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_834_833.rds"))

PCSS vs. RAV832/188

RAV832/188 are the most frequently validated RAVs for 18 CRC datasets.

m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV832"
m2_2 <- "RAV188"
num_dat <- 18

x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5F")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_832_188.rds"))

PCSS vs. RAV834/3290

RAV3290 is associated with "stage" metadata of CRC datasets.

m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV3290"
num_dat <- 10

x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
saveRDS(x, "outputs/boxplot_PCSS_vs_834_3290.rds")

PCSS vs. RAV834/596

RAV596 is associated with "grade" metadata of CRC datasets.

m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV596"
num_dat <- 10

x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
saveRDS(x, "outputs/boxplot_PCSS_vs_834_596.rds")
##### Change Supplementary Figures number
boxplot_fig <- c("3B","3C","Sup8B","Sup8C","Sup8E","Sup8F","Sup9C","Sup9D","Sup9E")
boxplot_summary$figure <- rep(boxplot_fig, each = 8)

write.csv(boxplot_summary, "outputs/boxplot_summary.csv", row.names = FALSE)

Session Info

sessionInfo()



shbrief/GenomicSuperSignaturePaper documentation built on Aug. 2, 2022, 2:04 p.m.