This markdown file outputs the results of the HIPC ImmuneSignatures Project using the original parameters and data coming directly from https://www.immunespace.org unless noted. This markdown uses heavily refactored code to improve efficiency. However, all original code developed for the production of the manuscript is included in the /OrigCode folder within the package for reference. This can be viewed at https://github.com/ehfhcrc/ImmSig2.
As seen in the figure below, the analysis first starts by extracting data from ImmuneSpace for four "Discovery" studies (SDY212, SDY63, SDY404, and SDY400) and processing that data based on two age cohorts (younger <= 35 years and older > 60 years). Meta-Analysis are run on these datasets to identify individual and sets of genes that predict immune system response to an influenza vaccine based on day zero transcriptional data. These signature genes and modules are then validated on one study per age cohort.
To save time, the gene expression matrices are pre-loaded as data within the package. The code to generate these expression matrices from ImmuneSpace / GEO can be found in the R/get_GE.R
script within the package.There is also one case where data from ImmuneSpace could not be used: SDY80 (hai).
The SDY80 hai data available on ImmuneSpace is not incorporated into the analysis because it does not include subjects without gene expression data. Although subjects without GE data are filtered out later, they must be included in the adjust_hai.R script in order to achieve the same results for the "endPoint" variable ("fc_res_maxd30"). The original raw HAI data came from Yuri Kotliarov at NIH and is included within the package.
Finally, the results for the "old" cohort using SDY67 data are slightly different ( <0.5% Relative Error ). This is due to rounding that occurs from a transformation from one kind of data structure to another.
In order to run this vignette locally, you would first need to install the following Bioconductor packages by hand: Biobase, qusage, and MetaIntegrator. Then install ImmSig2
via devtools::install_github("ehfhcrc/ImmSig2", build_vignettes=T)
and view the vignette with vignette("ImmSig_Pipeline_demo", package="ImmSig2")
.
# Eset prep library(ImmSig2) # to load data and helper functions library(Rlabkey) library(Biobase) # biocLite library(stringr) library(hash) library(dplyr) library(data.table) # library(preprocessCore) # biocLite, used in get_GE # library(RCurl) # used by get_GE and data preloaded so not using # Figure 2 - Age Distribution library(RColorBrewer) # Figure 3 - AdjMFC demo library(gridExtra) # Module / Gene Set Analysis library(qusage) # biocLite # Single Gene Analysis library(pROC) library(ggplot2) library(clinfun) library(MetaIntegrator) # biocLite # library(RMySQL) # depends on libmysqlclient / used in Single Gene Analysis once, data preloaded here # sudo apt-get install libmysqlclient-dev # Figure 4 - Sdy Forest Plots library(metaplot) # Multiple Figures library(cowplot) library(png) library(grid) options(width = 999) knitr::opts_chunk$set(echo=FALSE, # cache.path=paste(labkey.file.root, "cache/pubmed_report_cache/", sep="/"), # for IS Report cache=TRUE, warning=FALSE, message=FALSE, tidy=TRUE, fig.align = "center")
The expressionSet objects used in the Meta-Analysis functions need to have gene expression data and hemaglutenin inhibition (hai) data that have corresponding subject names to map one column of expression data to one row of hai information. This helper function ensures this is the case and allows for the selection of day 0 or baseline data only with the d0
argument.
prepEset <- function(em, adjHai, d0 = TRUE){ # get geneSyms now because prep work requires them to be absent new_fData <- data.frame(gene_symbol = em$geneSymbol, stringsAsFactors = F) rownames(new_fData) <- rownames(em) # rm GeneSyms and unwanted timepoints em <- if( d0 ){ em[ , grep("d0", colnames(em)) ] }else{ em[, !(colnames(em) == "geneSymbol")] } # determine which em subs have hai data and subset haiSubs <- sapply(colnames(em), FUN = function(name){ sub <- strsplit(name, split = "_", fixed = T)[[1]][1] return( sub %in% adjHai$subject ) }) em <- em[, haiSubs ] # if all timepoints, create new hai that has one row per timepoint if( !d0 ){ adjHai <- rbindlist(lapply(colnames(em), FUN = function(name){ tmp <- strsplit(name, split = "_", fixed = T)[[1]] row <- adjHai[ adjHai$subject == tmp[1], ] day <- gsub("neg", "-", tmp[2]) row$timepoint.day <- gsub("d", "", day) row$subject <- name return(row) })) } # Get Subjects from em and ensure hai / em contain same subs in same order emSubs <- if( d0 ){ str_match(colnames(em), "(((SUB)|(sub))_?[0-9]+)(\\.[0-9])?_(.*)")[, 2L] } else { colnames(em) } adjHai <- adjHai[ adjHai$subject %in% emSubs, ] adjHai <- adjHai[ order(match(adjHai$subject, emSubs)), ] rownames(adjHai) <- colnames(em) # Make eset eset <- ExpressionSet(assayData = as.matrix(em), featureData = AnnotatedDataFrame(new_fData)) pData(eset) <- droplevels(adjHai) return(eset) }
To extract gene expression and hai data from ImmuneSpace, an ImmuneSpace-Connection class object, con
is created and then data is pulled with a con$getDataset()
call. Here, the expression data is preloaded but has been pulled in the manner described above as seen in the R/get_GE.R
script. To see how the hai data is adjusted to define "responders" from "non-responders", you can look at the R/adjust_hai.R
script that is based on the work of Yuri Kotliarov from NIH.
studies <- c("SDY212","SDY63","SDY404","SDY400","SDY67","SDY80") eset_d0 <- list() sdy80all <- "" # need eset with all time points for some figures for(sdy in studies){ # See comments in get_GE.R for methodology of generating expression matrices. # These matrices are created using online, public sources (e.g. GEO or ImmuneSpace). em <- getGE(sdy) # Remove dup biosamples based on Yale HIPCMetaModuleAnalysis.R comments if(sdy == "SDY212"){ em <- em[ , -(grep("SUB134307.*", colnames(em))) ] } # Yale collaborators have not yet updated GEO database to reflect these changes if(sdy == "SDY404"){ em <- swapCols(em, "SUB120473_d0", "SUB120474_d0") # Incorrect Sex em <- swapCols(em, "SUB120460_d0", "SUB120485_d28") # Strong evidence } # Get raw HAI data from Immunespace and adjust according to Yuri Kotliarov code. # NOTE: SDY80 raw data in ImmuneSpace differs, therefore using preloaded. # labkey.url.path <- paste0("/Studies/", sdy, "/") # for IS site only con <- CreateConnection(sdy) rawHai <- if(sdy != "SDY80"){ con$getDataset("hai") }else{ SDY80_rawtiterdata_v2 } adjHai <- adjust_hai(sdy, rawHai) # Age needed for figure 2 and SDY80 filtering dotNum <- gsub("SDY", ".", sdy, fixed = T) demo <- con$getDataset("demographics") adjHai$Age <- sapply(adjHai$subject, FUN = function( sub ){ trg <- demo[ which(demo$participant_id == paste0(sub, dotNum))] return( trg$age_reported ) }) if(sdy == "SDY80"){ adjHai <- adjHai[ which(adjHai$Age < 36), ] } # Need study id for figure 2 adjHai$study <- sdy # Prep and push to holders eset_d0[[sdy]] <- prepEset(em, adjHai, d0 = T) if( sdy == "SDY80" ){ sdy80all <- prepEset(em, adjHai, d0 = F) } }
Before filtering the expressionSets into "young" (<= 35 years old) and "old" (> 60 years old) cohorts, we create the age distribution figure form the manuscript using the con$getDataset()
method described before to pull demographic information for all participants. The subjects are subset into those with day zero gene expression data (dark blue) and those without (light blue). A second table shows the number of subjects available for each study in ImmuneSpace. These numbers differ slightly from the manuscript where some filtering was done prior to the calculation of available subjects - e.g. duplicate biosamples removed in SDY212.
pd <- rbindlist(lapply(studies, function(sdy) { con <- CreateConnection(sdy) demo <- con$getDataset("demographics") ge <- con$getDataset("gene_expression_files") ge <- ge[ ge$study_time_collected == 0, ] demo$ge <- sapply(demo$participant_id, FUN = function(x){x %in% unique(ge$participant_id)}) filt <- data.frame(Age = demo$age_reported, GE = demo$ge, Study = sdy) })) pd.with <- pd[ pd$GE == TRUE, ] pd.without <- pd[ pd$GE == FALSE, ] agePlot <- ggplot() + geom_dotplot(data = pd.without, binaxis = "y", fill = "#e6ffff", binwidth = 1, stackdir = "center", aes(x = Study, y = Age)) + geom_dotplot(data = pd.with, binaxis = "y", fill = "blue", binwidth = 1, stackdir = "center", aes(x = Study, y = Age)) + ylab("Age (years)") + xlab("Study ID") + theme_bw() + geom_hline(yintercept = c(35,60), col="red", lty = 2) + theme(plot.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), axis.line = element_line(colour = "black")) colNms <- c("2008-2009", "2009-2010", "2010-2011", "2011-2012", "2012-2013") subDf <- matrix(data = "", nrow = 6, ncol = 5, dimnames = list(studies, colNms) ) subDf["SDY212", "2008-2009"] <- nrow( pd[pd$Study == "SDY212", ] ) subDf["SDY63", "2010-2011"] <- nrow( pd[pd$Study == "SDY63", ] ) subDf["SDY404", "2011-2012"] <- nrow( pd[pd$Study == "SDY404", ] ) subDf["SDY400", "2012-2013"] <- nrow( pd[pd$Study == "SDY400", ] ) subDf["SDY80", "2009-2010"] <- nrow( pd[pd$Study == "SDY80", ] ) subDf["SDY67", "2010-2011"] <- nrow( pd[pd$Study == "SDY67", ] ) subDf <- data.frame(subDf, stringsAsFactors = F) colnames(subDf) <- colNms tbl <- tableGrob(subDf) print(plot_grid(agePlot, tbl, ncol = 1, nrow = 2, rel_heights = c(5,3), labels = c("A","B")))
To provide a more detailed understanding how the hai data adjustment affects the distribution of responders and non-responders, we provide the example of SDY404. Note that the outlier in Figure 3A. with an extremely high d0 titer is removed from the data before decorrelation. The blue horizontal lines show the quantile thresholds for categorizing an individual as high responder (>70%) or low responder (<30%).
# Get raw data from ImmuneSpace and select only young subjects ------------------------------- con <- CreateConnection("SDY404") rawHai <- con$getDataset("hai") adjHai <- adjust_hai("SDY404", rawHai) pd <- adjHai[ adjHai$Age.class == "young", ] # bins were created manually bins <- c(0.25, 1, 5) # setup BEFORE Decorr plot ---------------------------------------------------------------------- cc <- cor.test(pd$young_fc_norm_max_ivt, pd$young_d0_norm_max, method = "spearman", exact = F) # Create df of group-by-bin statistics pd.bin <- pd %>% mutate(bin = cut(young_d0_norm_max, breaks = c(-Inf, bins, Inf), labels = 1:( length(bins) + 1) ) ) %>% group_by(bin) %>% summarise(n = n(), fc_median = median(young_fc_norm_max_ivt, na.rm=T), cor = ifelse( n > 2, cor(young_d0_norm_max, young_fc_norm_max_ivt, method="spearman", use="pairwise.complete.obs"), NA), cor.p = ifelse( n > 2, cor.test(young_d0_norm_max, young_fc_norm_max_ivt, method="spearman", exact = F)$p.value, NA)) %>% ungroup() pd.bin.show <- pd.bin[, colnames(pd.bin) %in% c("bin", "n", "cor.p")] colnames(pd.bin.show)[3] <- "p-value" pd.bin.show <- data.frame(pd.bin.show[1:3, ]) # Custom function developed by Yuri Kotliarov plot_size_bubbles <- function(x,y) { dd = data.frame(x,y) %>% group_by(x,y) %>% summarise(n=n()) %>% ungroup() ggplot(dd, aes(x,y)) + geom_point(aes(size=n)) } plot.subtitle <- sprintf("subjects = %d, Spearman rho = %.2f, p-value = %.2g", nrow(pd), cc$estimate, cc$p.value) beforePlot <- plot_size_bubbles(pd$young_d0_norm_max, pd$young_fc_norm_max_ivt) + scale_size_area(max_size = 8) + ggtitle(bquote(atop(.("Before Day 0 Decorrelation"), atop(italic(.(plot.subtitle)), "")))) + xlab("Day 0 Titer Score") + ylab("") + xlim(-0.5,7.5) + ylim(-2.75,2.75) + theme_bw() + theme(legend.key = element_blank()) + geom_vline(xintercept = bins, col='red', lty=1) + annotation_custom(tableGrob(format(pd.bin.show, digits = 2), rows = NULL), xmin = 5.5, xmax = 7.5, ymin = -2.75, ymax = -0.5) + geom_hline(yintercept = quantile(pd$young_fc_norm_max_ivt, c(0.3, 0.7), na.rm=T), col="blue", lty = 2) breaks <- seq(-1.25, 2.25, 0.5) tmp <- hist(pd$young_fc_norm_max_ivt, breaks = breaks) negDat <- data.frame(counts = -(tmp$counts), mids = tmp$mids) negDatEd <- rbind(negDat, data.frame(counts = c(0,0,0,0), mids = c(-2.5,-2.0,-1.5, 2.5))) beforeHist <- ggplot(negDatEd, aes(mids,counts)) + geom_col() + coord_flip() + xlab("MFC Titer Score") + ylim(-10,0) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) top <- plot_grid(beforeHist, beforePlot, ncol = 2, nrow = 1, align="h", rel_widths = c(0.25,1)) # setup AFTER Decorr Plot ------------------------------------------------------------------------- pd <- pd[ !(pd$subject == "SUB120471"), ] # remove outlier cc <- cor.test(pd$young_fc_res_max, pd$young_d0_norm_max, method = "spearman", exact = F) plot.subtitle <- sprintf("subjects = %d, Spearman rho = %.2f, p-value = %.2g", nrow(pd), cc$estimate, cc$p.value) afterPlot <- plot_size_bubbles(pd$young_d0_norm_max, pd$young_fc_res_max) + scale_size_area(max_size = 8) + ggtitle(bquote(atop(.("After Day 0 Decorrelation"), atop(italic(.(plot.subtitle)), "")))) + xlab("Day 0 Titer Score") + ylab("") + xlim(-0.5, 7.5) + ylim(-2.75, 2.75) + theme_bw() + theme(legend.key = element_blank()) + geom_hline(yintercept = quantile(pd$young_fc_res_max, c(0.3, 0.7), na.rm=T), col="blue", lty = 2) + geom_vline(xintercept = bins, col='red', lty = 1) breaks <- seq(-2.25, 1.75, 0.5) tmp <- hist(pd$young_fc_res_max, breaks = breaks) negDat <- data.frame(counts = -(tmp$counts), mids = tmp$mids) negDatEd <- rbind(negDat, data.frame(counts = c(0, 0, 0), mids = c(-2.5, 2.0, 2.5))) afterHist <- ggplot(negDatEd, aes(mids,counts)) + geom_col() + coord_flip() + xlab("adjMFC Titer Score") + ylim(-10,0) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) bottom <- plot_grid(afterHist, afterPlot, ncol = 2, nrow = 1, align="h", rel_widths = c(0.25,1))
print(plot_grid(top, bottom, ncol = 1, nrow = 2, labels = c("A", "B")))
The gene set modules come from the HIPC collaborators and are found within the package, but must be formatted for use within the meta-analyses.
# Prep GeneSetDB per original code. Can also use GSEAbase package. gene_set <- strsplit(geneSetDB, "\t") ## convert from vector of strings to a list names(gene_set) <- sapply(gene_set,"[", 1) ## move the names column as the names of the list links <- sapply(gene_set,"[", 2) gene_set <- lapply(gene_set, "[",-1:-2) ## remove name and description columns gene_set <- lapply(gene_set, function(x){ x[ which( x != "") ] }) ## remove empty strings
Before use in the meta-analyses pipelines, the list of expressionSet objects are adjusted for each cohort to select the correct gene expression and hai information by subsetting the subjects by age. SDY80 is treated differently in terms of the hai data and noted below.
adjust_eset <- function(sdy, eset_list, cohort){ eset <- eset_list[[sdy]] storageMode(assayData(eset)) <- "list" # Select correct PhenoData in terms of young or old fc_res_max_d30 and delete rest hai_var <- grep("^((fc)|(d0))_", varLabels(eset), value = TRUE) oppo <- ifelse( cohort == 'young', "old", "young") eset <- eset[, eset$Age.class %in% cohort] # set in adjust_hai # Per notes from F. Vallania @ Stanford and H. Meng @ Yale: # SDY80 uses the adjMFC or fc_res_max_d30 calculated from a combined # group of young and slightly older subjects because it improved the estimation if(sdy != "SDY80"){ phenoData(eset) <- phenoData(eset)[, setdiff(varLabels(eset), c(hai_var, paste0(oppo, "_", hai_var)))] varLabels(eset) <- gsub(paste0("^", cohort, "_"), '', varLabels(eset)) }else{ hai_var_y <- paste0("young_", hai_var) hai_var_o <- paste0("old_", hai_var) phenoData(eset) <- phenoData(eset)[ , !(colnames(phenoData(eset)) %in% c(hai_var_o, hai_var_y))] } # remove NAs from fc_res_max_d30 pheno and therefore exprs to streamline eset <- eset[ , !is.na(eset$fc_res_max_d30) ] storageMode(assayData(eset)) <- "lockedEnvironment" return(eset) } esetLsMkr <- function(eset_list, cohort){ valSdy <- ifelse(cohort == "young", "SDY80", "SDY67") corrEsetList <- eset_list[ c("SDY212","SDY63","SDY404","SDY400", valSdy) ] adjEsetList <- lapply(names(corrEsetList), adjust_eset, eset_list = corrEsetList, cohort = cohort) names(adjEsetList) <- names(corrEsetList) return(adjEsetList) } yngEsets <- esetLsMkr(eset_d0, cohort = "young") oldEsets <- esetLsMkr(eset_d0, cohort = "old")
The Gene Set or Module meta analysis based on work by the Yale HIPC collaborators is run first. The analysis is performed by the QuSage
package that is found on BioConductor.
yng_res <- meta_analysis(adj_eset_list = yngEsets, cohort = "young", gene_set = gene_set)
``` {r include=TRUE} DT::datatable(yng_res$dsc)
#### Validation Study Significant Pathways Data (Table 1 from Manuscript) ``` {r include=TRUE} DT::datatable(yng_res$val)
old_res <- meta_analysis(adj_eset_list = oldEsets, cohort = "old", gene_set = gene_set)
DT::datatable(old_res$dsc)
DT::datatable(old_res$val)
Since figure 7 is based on Yale's qusage analysis, it is presented here.
# Prep and Helper Methods---------------------------------------------------- discoverySDY = c('SDY212','SDY63','SDY404','SDY400') loc <- which(colnames(yng_res$combinedPDF$path.PDF) == "BCR signaling (M54)") # These are deconstructed versions of qusage:::plotDensityCurves() to enable use of ggplot # without plotRecord for creating figures. xCoordCalc <- function(QSarray, loc, sif){ seq(-1, 1 , length.out = QSarray$n.points) * QSarray$ranges[loc] * sif + QSarray$path.mean[loc] } yCoordCalc <- function(QSarray, loc, scaleFactor){ QSarray$path.PDF[,loc] / scaleFactor[loc] } makeCoordDf <- function(QSarray, meta, loc, scaleFactor){ if(meta == FALSE){ sif <- sqrt(QSarray$vif[loc]) if(is.na(sif)){ sif <- 1 } }else{ sif <- 1 } x <- xCoordCalc(QSarray, loc, sif) y <- yCoordCalc(QSarray, loc, scaleFactor) df <- data.frame(x = x, y = y) } # A. discovery plot---------------------------------------------------- discSF <- qusage:::pdfScaleFactor(yng_res$combinedPDF) # Get individual study coordinates coordDfs <- lapply(yng_res$combinedPDF$QSlist, makeCoordDf, FALSE, loc, discSF) # Get meta study coordinates coordDfs[[5]] <- makeCoordDf(yng_res$combinedPDF, TRUE, loc, discSF) names(coordDfs) <- c(discoverySDY, "Meta-Analysis") discPlot <- ggplot() + geom_line(data = coordDfs$SDY212, aes(x,y, color="SDY212")) + geom_line(data = coordDfs$SDY63, aes(x,y, color="SDY63")) + geom_line(data = coordDfs$SDY404, aes(x,y, color="SDY404")) + geom_line(data = coordDfs$SDY400, aes(x,y, color="SDY400")) + geom_line(data = coordDfs$`Meta-Analysis`, aes(x,y, color="Meta-Analysis")) + scale_color_manual("", values = c("SDY212"="#E41A1C", "SDY63"="#377EB8", "SDY404"="#4DAF4A", "SDY400"="#984EA3", "Meta-Analysis"="black")) + ggtitle("Discovery") + xlim(-1,1) + xlab("Gene Module Activity") + ylab("Density") + geom_vline(xintercept = 0) + theme(legend.text = element_text(c(discoverySDY,"Meta-Analysis"))) # B. validation plot------------------------------------------------------- valSF <- qusage:::pdfScaleFactor(yng_res$valPDF) valDF <- makeCoordDf(yng_res$valPDF, meta = FALSE, loc, scaleFactor = valSF) valPlot <- ggplot() + geom_line(data = valDF, aes(x,y, color="SDY80")) + scale_color_manual("", values = c("SDY80"="black")) + ggtitle("Validation") + xlim(-1,1) + xlab("Gene Module Activity") + ylab("") + geom_vline(xintercept = 0) # C. multiday plot---------------------------------------------------------- eset <- sdy80all[ , !is.na(sdy80all$fc_res_max_d30) ] # get Module Intensity for each subject by averaging for genes in module em <- data.frame(exprs(eset), stringsAsFactors = F) em$geneSymbols <- fData(eset)$gene_symbol em <- em[ em$geneSymbols %in% gene_set$`BCR signaling (M54)`, ] em <- em[ , !(colnames(em) == "geneSymbols") ] em.Means <- data.frame(subject = colnames(em), ModInt = colMeans(em, na.rm = T)) # Qualify response, ensure timepoints are ordered by setting class to numeric pd <- pData(eset) pd$timepoint.day <- as.numeric(pd$timepoint.day) pd$Response <- c("Low responder", "Moderate responder", "High responder")[ pd$fc_res_max_d30 + 1 ] # merge and order for ggplot tcData <- merge(pd, em.Means, by = "subject" ) tcData$Response<- factor(tcData$Response, c("Low responder", "Moderate responder", "High responder"), ordered = TRUE) tcPlot <- ggplot(tcData, aes(x = factor(timepoint.day), y = ModInt, fill = Response)) + geom_boxplot(outlier.size = 0) + geom_point(position = position_jitterdodge(dodge.width = 0.8), aes(color = Response, shape = Response), size = 3) + scale_fill_manual(values=c("white","white","white")) + scale_shape_manual(values=c(15,19,17)) + scale_color_manual(values=c("deepskyblue","magenta","red")) + ggtitle("Expression Differences by Responder Category over Time") + xlab("Time (Days post Vaccination)") + ylab("Gene Module Intensity") + ylim(8.8, 9.6) # Output --------------------------------------- top <- plot_grid(discPlot, valPlot, ncol = 2, nrow = 1, labels = c("A","B"), rel_widths = c(1.25,1)) bottom <- plot_grid(tcPlot, ncol = 1, nrow = 1, labels = c("C")) print(plot_grid(top, bottom, ncol = 1, nrow = 2, align = "v", rel_heights = c(1,1.25)))
The single gene meta analysis was created by the Stanford HIPC collaborators and uses their package MetaIntegrator
, which is also available on BioConductor.
# outputs a plot so need include=FALSE sgaResYng <- singleGeneAnalysis(adjEsetList = yngEsets, grpToCut = 1, cohort = "young")
up <- data.frame(GeneName = rownames(sgaResYng$disc$upReg), effectSize = sgaResYng$disc$upReg$effectSize, effectSizeStdError = sgaResYng$disc$upReg$effectSizeStandardError, fisherPvalUp = sgaResYng$disc$upReg$fisherPvalUp, fisherFDRUp = sgaResYng$disc$upReg$fisherFDRUp, stringsAsFactors = F) DT::datatable(up, fillContainer = T)
dn <- data.frame(GeneName = rownames(sgaResYng$disc$dnReg), effectSize = sgaResYng$disc$dnReg$effectSize, effectSizeStdError = sgaResYng$disc$dnReg$effectSizeStandardError, fisherPvalDown = sgaResYng$disc$dnReg$fisherPvalDown, fisherFDRDown = sgaResYng$disc$dnReg$fisherFDRDown, stringsAsFactors = F) DT::datatable(dn, fillContainer = T)
These Forest plots show significant genes from the discovery studies, both up and down regulated. Code developed by F. Vallenia @ Stanford.
customForestPlot <- function(geneName, metaObject, summaryTbl){ # Prep and subsetting studyData <- data.frame(means = metaObject$metaAnalysis$datasetEffectSizes[ geneName, ]) studyData$SEs <- metaObject$metaAnalysis$datasetEffectSizeStandardErrors[ geneName, ] studyData <- studyData[ order( rownames(studyData) ), ] studyMeans <- studyData$means names(studyMeans) <- rownames(studyData) pooledMean <- metaObject$metaAnalysis$pooledResults[geneName, "effectSize"] pooledSE <- metaObject$metaAnalysis$pooledResults[geneName, "effectSizeStandardError"] titleN <- paste(geneName, "\nES =", round(summaryTbl[ rn == geneName ]$effectSize, 2), " p =", formatC(summaryTbl[ rn == geneName ]$effectSizePval, 3)) # Plot generation rmeta::metaplot(mn = studyMeans, se = studyData$SEs, labels = rownames(studyData), xlab = "Standardized Mean Difference (log2 scale)", ylab = "", xlim = c(-2,3), colors = rmeta::meta.colors(box = "black", lines = "black", zero = "black", summary = "black", text = "black"), summn = pooledMean, sumse = pooledSE, sumnn = 1 / pooledSE^2, main = titleN) } # Plot as pdf and then load img summaryTbl <- as.data.table(sgaResYng$disc$metaObj$metaAnalysis$pooledResults, keep.rownames = T) allGenes <- c(sgaResYng$disc$posGenes, sgaResYng$disc$negGenes) png("fig4.png", height = 3*3, width = 5*2.5, units = "in", res = 150) par(mfrow = c(3,5)) dmp <- lapply(allGenes, customForestPlot, sgaResYng$disc$metaObj, summaryTbl) dmp <- suppressMessages(dev.off())
Fig.6A shows a violin plot of the response score for each category (low, moderate, high) with SDY80 with the geometric mean. Fig6B. is a ROC curve for the High vs Low and Moderate vs Low responders. Fig.6C shows the change in response score over time.
# Create Response by timepoint plot --------------------------------------------- eset <- sdy80all[ , !is.na(sdy80all$fc_res_max_d30) ] gem <- makeGEM(eset = eset, rmGrp = FALSE, grpToCut = NA, ens_table = ens_table) # select positive genes from Discovery group and compute validation gem$pheno$score <- arithMean_comb_validate(GEM = gem, genes_p = sgaResYng$disc$posGenes, genes_n = NULL)$score gem$pheno$Response <- c("Low responder", "Moderate responder", "High responder")[ gem$class + 1 ] tcData <- gem$pheno[ , colnames(gem$pheno) %in% c("Response", "timepoint.day", "score") ] tcData$timepoint.day <- as.numeric(tcData$timepoint.day) tcData$subject <- substr(gem$pheno$subject, start = 0, stop = 9) tcData$Response <- factor(tcData$Response, c("Low responder", "Moderate responder", "High responder"), ordered = TRUE) tcPlot <- ggplot(tcData, aes(x = factor(timepoint.day), y = score, fill = Response)) + geom_boxplot(outlier.size = 0) + geom_point(position = position_jitterdodge(dodge.width=0.8), aes(color = Response, shape = Response), size = 3) + scale_fill_manual(values=c("white","white","white")) + scale_shape_manual(values=c(15,19,17)) + scale_color_manual(values=c("deepskyblue","magenta","red")) + xlab("Time (Days)") + ylab("Response Score") + theme(legend.position = "bottom", legend.title = element_blank()) #compute day0 vs day1 paired t-test t0 <- tcData[ tcData$timepoint.day == 0, ] t1 <- tcData[ tcData$timepoint.day == 1, ] t1 <- t1[ t1$subject %in% t0$subject, ] t0 <- t0[ order(t0$subject), ] t1 <- t1[ order(t1$subject), ] t01p <- t.test(t0$score, t1$score, paired = T, alternative = 'less') t01p.pval <- t01p$p.value #compute comparison of low vs high responders at each time point timeTTest <- function(day, tcData){ low <- tcData[tcData$timepoint.day == day & tcData$Response == "Low responder", ] high <- tcData[tcData$timepoint.day == day & tcData$Response == "High responder", ] res <- t.test(low$score, high$score) return(res$p.value) } tp <- sort(unique(tcData$timepoint.day)) pvals <- round(sapply(tp, timeTTest, tcData), digits = 2) timeDf <- t(data.frame(Day = tp, `p-value` = pvals)) timeTg <- tableGrob(timeDf) top <- plot_grid(sgaResYng$val$ViolinPlot, sgaResYng$val$RocPlot, ncol = 2, nrow = 1, align = "h", labels = c("A","B")) bottom <- plot_grid(tcPlot, timeTg, ncol = 1, nrow = 2, labels = c("C",""), rel_heights = c(4,1)) print(plot_grid(top, bottom, ncol = 1, nrow = 2))
No Significant up or down-regulated genes were found using the adjusted expressionSet and therefore plots are not shown for SDY67 alone.
# outputs a plot so need include=FALSE sgaResOld <- singleGeneAnalysis(adjEsetList = oldEsets, grpToCut = 1, cohort = "old")
No Significant up or down-regulated genes were found using the adjusted expressionSet, therefore the significant genes from the young cohort with high vs low responders are used to generate figure 9.
# outputs a plot so need include=FALSE # Note: SDY400 is not included because there are less than two cases and/or controls sgaResLvm <- singleGeneAnalysis(adjEsetList = yngEsets, grpToCut = 2, cohort = "young")
# Plot as png and then load img summaryTbl <- as.data.table(sgaResLvm$disc$metaObj$metaAnalysis$pooledResults, keep.rownames = T) # NOTE: using the genes from the sgaResYng, no sig-genes for sgaResLvm allGenes <- c(sgaResYng$disc$posGenes, sgaResYng$disc$negGenes) png("fig9.png", height = 3*3, width = 5*2.5, units = "in", res = 150) par(mfrow = c(3,5)) dmp <- lapply(allGenes, customForestPlot, sgaResLvm$disc$metaObj, summaryTbl) dmp <- suppressMessages(dev.off())
Combining the results of the module intensity and single gene analysis, Figure 5 presents heat maps of both. Fig.5A differs from the manuscript due to the SDY404 subject swap - in this case, the pvalue for 'enriched in activated dendritic cells / monocytes (M64)' is slightly greater and therefore does not meet the threshold. Fig.5B is also missing two genes that were not present in SDY80 on ImmuneSpace: "APOB48R","TNFSF13". This difference is likely due to feature / annotation updates made in bioconductor between the generation of the manuscript and the creation of gene expression matrices on ImmuneSpace.
# A - Young and Old Modules--------------------------------------------------------------------- sigMods <- rownames(yng_res$val) prepPhmap <- function(resObj, sigMods, valSdyNm){ sdyLs <- c("SDY212","SDY63","SDY404","SDY400","valPDF") lsSubset <- resObj[ names(resObj) %in% sdyLs] modMeans <- rbind(sapply(names(lsSubset), FUN = function(sdy){ ls <- lsSubset[[sdy]]$path.mean ls[ names(ls) %in% sigMods ] })) colnames(modMeans)[[5]] <- valSdyNm return(modMeans) } yngMns <- prepPhmap(yng_res, sigMods, "SDY80") png("fig5AYng.png", width = 300, height = 300) pheatmap::pheatmap(yngMns, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(100), cluster_cols = FALSE, cluster_rows = FALSE, main = "Young", show_rownames = FALSE, legend = FALSE) dmp <- suppressMessages(dev.off()) yngPng <- png::readPNG("fig5AYng.png") yngGrob <- grid::rasterGrob(yngPng, interpolate=TRUE) yngPlot <- ggplot() + annotation_custom(yngGrob) oldMns <- prepPhmap(old_res, sigMods, "SDY67") png("fig5AOld.png", width = 500, height = 300) pheatmap::pheatmap(oldMns, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(100), cluster_cols = FALSE, cluster_rows = FALSE, main = "Older", show_rownames = TRUE, legend = TRUE) dmp <- suppressMessages(dev.off()) oldPng <- png::readPNG("fig5AOld.png") oldGrob <- grid::rasterGrob(oldPng, interpolate=TRUE) oldPlot <- ggplot() + annotation_custom(oldGrob) # B - Single Genes ----------------------------------------------------------------------------- # Pull list of genes found in BCR signaling (M54), Platelet Activation (M42), # and Inflammatory Response (M33). Update to remove genes not found in SDY80 eset. # Although these were found in the original? genes <- unique(unlist(gene_set[ grep("M42|M54|M33", names(gene_set))])) genes <- sort(genes[ !(genes %in% c("APOB48R","TNFSF13")) ]) # not found in SDY80 # fn to return average of gene expression within sdy and response adjEm <- function(eset, resp, genes){ pd <- pData(eset) em <- data.table(exprs(eset)) subs <- paste0(pd$subject[ pd$fc_res_max_d30 == resp ], "_d0") em <- em[ , ( colnames(em) %in% subs ), with=FALSE ] em <- em[, gs := fData(eset)$gene_symbol ] setkey(em, gs) em <- em[.(genes)] em <- em[ , lapply(.SD, mean), by="gs", .SDcols = grep("^SUB", colnames(em))] em <- em[ , AVG := rowMeans(.SD, na.rm=T), .SDcols = grep("^SUB", colnames(em)) ] em <- em[ order(gs) ] return(em$AVG) } # Find difference of high and low responders in terms of log2 gene expression for each sdy sdys <- c("SDY212","SDY63","SDY404","SDY400","SDY80") diffEm <- t(rbind(sapply(sdys, FUN = function(sdy){ eset <- yngEsets[[sdy]] high <- adjEm(eset, 2, genes) low <- adjEm(eset, 0, genes) diff <- (high - low) / low }))) colnames(diffEm) <- genes png("fig5B.png", height = 300, width = 750) pheatmap::pheatmap(diffEm, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(100), cluster_cols = TRUE, cluster_rows = FALSE) dmp <- suppressMessages(dev.off()) sgaPng <- png::readPNG("fig5B.png") sgaGrob <- grid::rasterGrob(sgaPng, interpolate=TRUE) sgaPlot <- ggplot() + annotation_custom(sgaGrob) top_row <- plot_grid(yngPlot, oldPlot, ncol = 2, nrow = 1, rel_widths = c(1.25,2)) print(plot_grid(top_row, sgaPlot, ncol = 1, nrow = 2, labels = c("A","B")))
Figure 8A compares p-values of each gene in older and younger cohorts. The dark squares are genes that were found to be significant in the young cohort, either up or down-regulated. Figure 8B does the same for pathway-activity of the gene set modules from the discovery studies. Each gene or pathway is coded into non-significant (grey circle), significant for the young cohort (black square), or significant for the old cohort (black triangle).
# Single Gene -------------------------------------------------------------------- # input sga <- as.data.table(merge(sgaResYng$disc$metaObj$metaAnalysis$pooledResults, sgaResOld$disc$metaObj$metaAnalysis$pooledResults, by = 0)) sigGenes <- c(sgaResYng$disc$posGenes, sgaResYng$disc$negGenes) # fig effCorrSGA <- ggplot(sga, aes(y = effectSize.y, x = effectSize.x)) + xlim(-1.5,1.5) + ylim(-1.5,1.5) + geom_vline(xintercept=0,size=0.1, colour="black") + geom_hline(yintercept=0,size=0.1, colour="black") + geom_point(color = 'gray65') + theme_bw() + theme(axis.title = element_text(size=16,face="bold"), legend.title = element_blank(), legend.key = element_rect(colour = "white"), legend.position = c(0.25, 0.9), legend.text = element_text(size = 15,face="bold"), legend.background = element_rect(colour = "lightgray")) + geom_point(data = sga[ sga$Row.names %in% sigGenes,], aes(x = effectSize.x,y = effectSize.y), color = 'gray20', size = 3, shape = 15) + xlab("Gene effect size (young)") + ylab("Gene effect size (older)") # Gene Module -------------------------------------------------------------------- # input gma <- as.data.table(merge(yng_res$dsc, old_res$dsc, by = 0)) # figure effCorrGMA<- ggplot(gma, aes(y = pathwayActivity.y, x = pathwayActivity.x)) + xlim(-0.3,0.3) + ylim(-0.3,0.3) + geom_vline(xintercept=0,size=0.1, colour="black") + geom_hline(yintercept=0,size=0.1, colour="black") + geom_point(color = 'gray65') + theme_bw() + theme(axis.title = element_text(size=16,face="bold"), legend.title = element_blank(), legend.key = element_rect(colour = "white"), legend.position = c(0.25, 0.9), legend.text = element_text(size = 15,face="bold"), legend.background = element_rect(colour = "lightgray")) + geom_point(data = gma[ gma$Row.names %in% rownames(yng_res$val), ], aes(x = pathwayActivity.x, y = pathwayActivity.y), color = 'gray20', size = 3, shape = 15) + geom_point(data = gma[ gma$Row.names %in% rownames(old_res$val), ], aes(x = pathwayActivity.x,y = pathwayActivity.y), color = 'gray20', size = 3, shape = 17) + xlab("Gene Module Activity (young)") + ylab("Gene Module Activity (older)") print(plot_grid(effCorrSGA, effCorrGMA, ncol = 2, nrow = 1, labels = c("A","B")))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.