knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, message = FALSE, fig.width = 5, fig.height = 3.5)
Using the training and validation data from the original CRC study, we compared how different subtype models and RAVs associate with the same clinicopathological variables. Notably, these data were not part of RAV training and are microarray datasets, whereas the RAVs were trained exclusively on RNA-seq data.
As described previously in the CRC study, we used the likelihood-ratio test (LRT) to compare the different subtype models for association with clinicopathological variables. A p-value near 1 (-log10 p-value near 0) means that a full model composed of two subtype definitions provides no additional information over a model containing only one of them.
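To make the nested-model comparison concrete, here is a minimal sketch of such an LRT on simulated data; the variable names (msi, cms, rav1, rav2) are hypothetical placeholders, and the actual implementation is in the Fig4C scripts used below.

## Minimal LRT sketch on simulated data (placeholder variables, not from the Fig4C scripts)
set.seed(1)
sim <- data.frame(
    msi  = rbinom(200, 1, 0.3),                                      # binary clinicopathological variable
    cms  = factor(sample(paste0("CMS", 1:4), 200, replace = TRUE)),  # discrete subtype
    rav1 = rnorm(200),                                               # continuous sample scores
    rav2 = rnorm(200)
)
reduced <- glm(msi ~ cms, data = sim, family = binomial)
full    <- glm(msi ~ cms + rav1 + rav2, data = sim, family = binomial)
anova(reduced, full, test = "Chisq")   # LRT; p near 1 means the RAVs add no information beyond CMS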
CMS-associated RAVs performed better than discrete CMS on all four phenotypes (plot) and also outperformed PCSSs except on tumor location (plot). Interestingly, PCSS-associated RAVs were still better than CMS but slightly worse than PCSSs (plot), while CMS-associated RAVs were better than both CMS and PCSSs (plot), indicating that RAVs contain more comprehensive information. This performance improvement became more pronounced when using only the 10 original validation datasets, excluding the 8 datasets used to train the PCSS model. In conclusion, RAVs trained on heterogeneous datasets not specific to CRC captured biologically relevant signatures for CRC as well as or better than focused efforts using CRC-specific databases, suggesting that RAVs are suitable for general use and can be applied to describe other diseases as well.
The CRC paper actually used both the training and validation datasets (18 in total) for Figure 4. If you want to use all 18 datasets, assign num_dat = 18. If you want only the 10 CRC datasets, excluding the 8 datasets used to build the PCSSs, assign num_dat = 10.
Environmental variables for Fig4C_CMSvs.R:

- m2_name : column name for the model to be compared to CMS
- m2_1 and m2_2 : names of the two model variables that form the group to be compared to CMS
- num_dat : the number of CRC datasets for testing (18 or 10)

Environmental variables for Fig4C_contScores.R:

- m1_name, m1_1, and m1_2 for one model
- m2_name, m2_1, and m2_2 for the other
- num_dat : the number of CRC datasets for testing (18 or 10)

The chunk below sets up a summary table for the boxplot statistics and defines statsSummary() to extract them from each figure.

# width : width of boxplot
# ymin or xmin : lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR
# lower or xlower : lower hinge, 25% quantile
# notchlower : lower edge of notch = median - 1.58 * IQR / sqrt(n)
# middle or xmiddle : median, 50% quantile
# notchupper : upper edge of notch = median + 1.58 * IQR / sqrt(n)
# upper or xupper : upper hinge, 75% quantile
# ymax or xmax : upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR

boxplot_summary <- as.data.frame(matrix(nrow = 8*9, ncol = 11))
colnames(boxplot_summary) <- c("figure", "panel", "group", "datasets_used",
                               "minima", "whisker_min", "first_quartile", "median",
                               "third_quartile", "whisker_max", "maxima")
boxplot_fig <- c("3B","3C","Sup5B","Sup5C","Sup5E","Sup5F","Sup6C","Sup6D","Sup6E")
boxplot_summary$figure <- rep(boxplot_fig, each = 8)
boxplot_summary$panel <- rep(rep(c(1,2,3,4), each = 2), 9)
boxplot_summary$group <- rep(c(1,2), 36)

## 4 panels and 2 groups
ref <- data.frame(panel = rep(1:4, each = 2), group = 1:2)

##### statsSummary function to extract boxplot stats
# plotRes : ggplot object (e.g. x$value from the above script)
# figName : name of the figure in the manuscript
statsSummary <- function(plotRes, figName) {
    res <- ggplot_build(plotRes)
    res_box <- res$data[[1]]  # Summary
    res_dat <- res$data[[2]]  # Data points
    ind <- which(boxplot_summary$figure == figName)

    ## the number of datasets used
    for (i in seq_len(nrow(ref))) {
        panel <- ref$panel[i]
        group <- ref$group[i]
        sub <- res_dat[which(res_dat$PANEL == panel & res_dat$group == group),]

        ## row to add sample size info
        datasets_used_ind <- which(boxplot_summary$figure == figName &
                                       boxplot_summary$panel == panel &
                                       boxplot_summary$group == group)
        boxplot_summary$datasets_used[datasets_used_ind] <- sum(!is.na(sub$y))
    }

    ## if there is no outlier, just assign median value for the function
    no_outlier_ind <- which(isEmpty(res_box$outliers))
    res_box$outliers[no_outlier_ind] <- res_box$middle[no_outlier_ind]

    boxplot_summary$minima[ind] <- sapply(res_box$ymin, pmin, as.numeric(unlist(res_box$outliers))) %>% diag
    boxplot_summary$whisker_min[ind] <- res_box$ymin      # lower extreme
    boxplot_summary$first_quartile[ind] <- res_box$lower  # 25% quartile (box min)
    boxplot_summary$median[ind] <- res_box$middle         # 50% quartile
    boxplot_summary$third_quartile[ind] <- res_box$upper  # 75% quartile (box max)
    boxplot_summary$whisker_max[ind] <- res_box$ymax
    boxplot_summary$maxima[ind] <- sapply(res_box$ymax, pmax, as.numeric(unlist(res_box$outliers))) %>% diag

    return(boxplot_summary)
}
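As a quick reference for what statsSummary() extracts, the following self-contained toy example (not part of the original analysis) shows the boxplot statistics that ggplot_build() computes for a geom_boxplot layer.

## Toy example: inspect the computed boxplot statistics exposed by ggplot_build()
library(ggplot2)
toy <- data.frame(grp = rep(c("A", "B"), each = 50), val = rnorm(100))
p <- ggplot(toy, aes(grp, val)) + geom_boxplot()
ggplot_build(p)$data[[1]][, c("ymin", "lower", "middle", "upper", "ymax")]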
First, we reproduce the previous result comparing the performance of PCSSs (continuous score) against CMS (discrete score).
m2_name <- "PCSS"
m2_1 <- "PCSS1"
m2_2 <- "PCSS2"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_PCSS.rds"))
m2_name <- "PCSS"
m2_1 <- "PCSS1"
m2_2 <- "PCSS2"
num_dat <- 10
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup6C")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_PCSS.rds"))
Here is a summary of the boxplots in our paper.

| Figures    | Models                  | # validation data |
|------------|-------------------------|-------------------|
| Fig.3B     | CMS vs. RAV834/833      | 18                |
| Fig.3C     | PCSS1/2 vs. RAV834/833  | 18                |
| Sup.Fig.5B | CMS vs. RAV1575/834     | 18                |
| Sup.Fig.5C | PCSS1/2 vs. RAV1575/834 | 18                |
| Sup.Fig.5E | CMS vs. RAV832/188      | 18                |
| Sup.Fig.5F | PCSS1/2 vs. RAV832/188  | 18                |
| Sup.Fig.6C | CMS vs. PCSS1/2         | 10                |
| Sup.Fig.6D | CMS vs. RAV834/833      | 10                |
| Sup.Fig.6E | PCSS1/2 vs. RAV834/833  | 10                |
RAV1575/834 are the most similar RAVs to PCSS1/2, respectively, based on Pearson correlation.
m2_name <- "RAV"
m2_1 <- "RAV1575"
m2_2 <- "RAV834"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5B")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_1575_834.rds"))
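For intuition, the similarity between RAVs and PCSSs is measured by the Pearson correlation of their gene-level loadings. Below is a hedged sketch of that idea on simulated matrices; the actual analysis uses the RAVmodel loadings and the published PCSS gene weights over their shared genes.

## Hedged sketch: pick the most similar RAV per PCSS by Pearson correlation of loadings
set.seed(1)
genes <- paste0("gene", 1:500)
rav_loadings  <- matrix(rnorm(500 * 5), ncol = 5,
                        dimnames = list(genes, paste0("RAV", 1:5)))   # placeholder loadings
pcss_loadings <- matrix(rnorm(500 * 2), ncol = 2,
                        dimnames = list(genes, c("PCSS1", "PCSS2")))  # placeholder weights
cc <- cor(rav_loadings, pcss_loadings, method = "pearson")
apply(cc, 2, function(r) names(which.max(abs(r))))   # best-matching RAV for each PCSS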
RAV834/833 have the highest r-squared values when their sample scores are compared against the metadata.
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "3B")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_834_833.rds"))

png("outputs/png/boxplot_CMS_vs_834_833.png", width = 600, height = 500)
print(x)$visible
dev.off()
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 10
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup6D")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_834_833.rds"))

png("outputs/png/boxplot_CMS_vs_834_833.png", width = 600, height = 500)
print(x)$visible
dev.off()
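The r-squared criterion mentioned above amounts to regressing each RAV's sample scores on a clinicopathological variable and ranking RAVs by the variance explained. Here is a hedged sketch of that ranking on simulated scores; the names and values are placeholders, not results.

## Hedged sketch: rank RAVs by r-squared against a metadata variable (simulated data)
set.seed(1)
scores <- matrix(rnorm(100 * 4), ncol = 4,
                 dimnames = list(NULL, paste0("RAV", c(833, 834, 832, 188))))  # placeholder scores
pheno  <- factor(sample(c("left", "right"), 100, replace = TRUE))              # e.g. tumor location
rsq <- apply(scores, 2, function(s) summary(lm(s ~ pheno))$r.squared)
sort(rsq, decreasing = TRUE)   # RAVs ordered by variance explained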
RAV832/188 are the most frequently validated RAVs across the 18 CRC datasets.
m2_name <- "RAV"
m2_1 <- "RAV832"
m2_2 <- "RAV188"
num_dat <- 18
x <- source("R/Fig4C_CMSvs.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5E")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_CMS_vs_832_188.rds"))

png("outputs/png/boxplot_CMS_vs_832_188.png", width = 600, height = 500)
print(x)$visible
dev.off()
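The "most frequently validated" criterion can be illustrated by tallying, across datasets, which RAVs pass validation. The sketch below uses a made-up placeholder list; in the actual analysis the counts come from validating the RAVmodel against each of the 18 CRC datasets.

## Hedged sketch: tally how often each RAV is validated across datasets (placeholder input)
validated_per_dataset <- list(
    dataset1 = c("RAV832", "RAV188", "RAV834"),
    dataset2 = c("RAV832", "RAV188"),
    dataset3 = c("RAV832", "RAV1575")
)
sort(table(unlist(validated_per_dataset)), decreasing = TRUE)   # validation frequency per RAV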
RAV1575/834 are the most similar RAVs to PCSS1/2, respectively, based on Pearson correlation.
m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV1575"
m2_2 <- "RAV834"
num_dat <- 18
x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5C")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_1575_834.rds"))
RAV834/833 have the highest r-squared values when their sample scores are compared against the metadata.
m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 18
x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "3C")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_834_833.rds"))

png("outputs/png/boxplot_PCSS_vs_834_833.png", width = 600, height = 500)
print(x)$visible
dev.off()
m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV833"
num_dat <- 10
x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup6E")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_834_833.rds"))
RAV832/188 are the most frequently validated RAVs across the 18 CRC datasets.
m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV832"
m2_2 <- "RAV188"
num_dat <- 18
x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
boxplot_summary <- statsSummary(x$value, "Sup5F")
dirName <- paste0(num_dat, "_CRC_datasets")
saveRDS(x, file.path("outputs", dirName, "boxplot_PCSS_vs_832_188.rds"))
RAV3290 is associated with "stage" metadata of CRC datasets.
m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV3290"
num_dat <- 10
x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
saveRDS(x, "outputs/boxplot_PCSS_vs_834_3290.rds")
RAV596 is associated with "grade" metadata of CRC datasets.
m1_name <- "PCSS"
m1_1 <- "PCSS1"
m1_2 <- "PCSS2"
m2_name <- "RAV"
m2_1 <- "RAV834"
m2_2 <- "RAV596"
num_dat <- 10
x <- source("R/Fig4C_contScores.R", print.eval = TRUE)
saveRDS(x, "outputs/boxplot_PCSS_vs_834_596.rds")
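The stage and grade associations noted above come down to testing a RAV's sample scores against an ordinal metadata variable. Here is a hedged sketch of such a test on simulated data; the real tests use the CRC datasets' sample scores and curated metadata, and the same logic applies to RAV596 and grade.

## Hedged sketch: association between a RAV score and an ordinal variable (simulated data)
set.seed(1)
stage     <- factor(sample(1:4, 150, replace = TRUE), ordered = TRUE)
rav_score <- rnorm(150) + 0.5 * as.numeric(stage)     # simulate a stage-associated score
kruskal.test(rav_score ~ stage)                       # non-parametric test across stage groups
summary(lm(rav_score ~ as.numeric(stage)))$r.squared  # variance explained by stage as a trend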
##### Change Supplementary Figures number
boxplot_fig <- c("3B","3C","Sup8B","Sup8C","Sup8E","Sup8F","Sup9C","Sup9D","Sup9E")
boxplot_summary$figure <- rep(boxplot_fig, each = 8)
write.csv(boxplot_summary, "outputs/boxplot_summary.csv", row.names = FALSE)
sessionInfo()