In ehfhcrc/ImmSig2: Streamlined package to demonstrate HIPC ImmuneSigatures work using ImmuneSpace Data

PACKAGE INFORMATION

Created: May 2017

Study Authors: HIPC ImmuneSignature Collaborators

Package Developer: Evan Henrich (Gottardo Lab - FHCRC)

Contact: ehenrich@fredhutch.org

OVERVIEW

This markdown file outputs the results of the HIPC ImmuneSignatures Project using the original parameters and data coming directly from https://www.immunespace.org unless noted. This markdown uses heavily refactored code to improve efficiency. However, all original code developed for the production of the manuscript is included in the /OrigCode folder within the package for reference. This can be viewed at https://github.com/ehfhcrc/ImmSig2.

Figure 1 - Analysis Workflow

As seen in the figure below, the analysis first starts by extracting data from ImmuneSpace for four "Discovery" studies (SDY212, SDY63, SDY404, and SDY400) and processing that data based on two age cohorts (younger <= 35 years and older > 60 years). Meta-Analysis are run on these datasets to identify individual and sets of genes that predict immune system response to an influenza vaccine based on day zero transcriptional data. These signature genes and modules are then validated on one study per age cohort.

Data Notes:

To save time, the gene expression matrices are pre-loaded as data within the package. The code to generate these expression matrices from ImmuneSpace / GEO can be found in the R/get_GE.R script within the package.There is also one case where data from ImmuneSpace could not be used: SDY80 (hai).

The SDY80 hai data available on ImmuneSpace is not incorporated into the analysis because it does not include subjects without gene expression data. Although subjects without GE data are filtered out later, they must be included in the adjust_hai.R script in order to achieve the same results for the "endPoint" variable ("fc_res_maxd30"). The original raw HAI data came from Yuri Kotliarov at NIH and is included within the package.

Finally, the results for the "old" cohort using SDY67 data are slightly different ( <0.5% Relative Error ). This is due to rounding that occurs from a transformation from one kind of data structure to another.

Package Notes:

In order to run this vignette locally, you would first need to install the following Bioconductor packages by hand: Biobase, qusage, and MetaIntegrator. Then install ImmSig2 via devtools::install_github("ehfhcrc/ImmSig2", build_vignettes=T) and view the vignette with vignette("ImmSig_Pipeline_demo", package="ImmSig2").

# Eset prep
library(ImmSig2) # to load data and helper functions
library(Rlabkey)
library(Biobase) # biocLite
library(stringr)
library(hash)
library(dplyr)
library(data.table)
# library(preprocessCore) # biocLite, used in get_GE
# library(RCurl) # used by get_GE and data preloaded so not using

# Figure 2 - Age Distribution
library(RColorBrewer)

# Figure 3 - AdjMFC demo
library(gridExtra)

# Module / Gene Set Analysis
library(qusage) # biocLite

# Single Gene Analysis
library(pROC) 
library(ggplot2) 
library(clinfun) 
library(MetaIntegrator) # biocLite

# library(RMySQL) 
# depends on libmysqlclient / used in Single Gene Analysis once, data preloaded here
# sudo apt-get install libmysqlclient-dev

# Figure 4 - Sdy Forest Plots
library(metaplot)

# Multiple Figures
library(cowplot)
library(png)
library(grid)

options(width = 999)
knitr::opts_chunk$set(echo=FALSE, 
               # cache.path=paste(labkey.file.root, "cache/pubmed_report_cache/", sep="/"), # for IS Report
               cache=TRUE, 
               warning=FALSE, 
               message=FALSE, 
               tidy=TRUE,
               fig.align = "center")

STEP 1: DATA EXTRACTION AND CLEANING

Setup for expressionSet object creation

The expressionSet objects used in the Meta-Analysis functions need to have gene expression data and hemaglutenin inhibition (hai) data that have corresponding subject names to map one column of expression data to one row of hai information. This helper function ensures this is the case and allows for the selection of day 0 or baseline data only with the d0 argument.

prepEset <- function(em, adjHai, d0 = TRUE){

  # get geneSyms now because prep work requires them to be absent
  new_fData <- data.frame(gene_symbol = em$geneSymbol, stringsAsFactors = F)
  rownames(new_fData) <- rownames(em)

  # rm GeneSyms and unwanted timepoints
  em <- if( d0 ){ em[ , grep("d0", colnames(em)) ] }else{ em[, !(colnames(em) == "geneSymbol")] }

  # determine which em subs have hai data and subset
  haiSubs <- sapply(colnames(em), FUN = function(name){
    sub <- strsplit(name, split = "_", fixed = T)[[1]][1]
    return( sub %in% adjHai$subject )
  })

  em <- em[, haiSubs ]

  # if all timepoints, create new hai that has one row per timepoint
  if( !d0 ){
    adjHai <- rbindlist(lapply(colnames(em), FUN = function(name){
      tmp <- strsplit(name, split = "_", fixed = T)[[1]]
      row <- adjHai[ adjHai$subject == tmp[1], ]
      day <- gsub("neg", "-", tmp[2])
      row$timepoint.day <- gsub("d", "", day)
      row$subject <- name
      return(row)
    }))
  }

  # Get Subjects from em and ensure hai / em contain same subs in same order
  emSubs <- if( d0 ){
              str_match(colnames(em), "(((SUB)|(sub))_?[0-9]+)(\\.[0-9])?_(.*)")[, 2L]
            } else {
              colnames(em)
            }

  adjHai <- adjHai[ adjHai$subject %in% emSubs, ]
  adjHai <- adjHai[ order(match(adjHai$subject, emSubs)), ]
  rownames(adjHai) <- colnames(em)

  # Make eset
  eset <- ExpressionSet(assayData = as.matrix(em),
                      featureData = AnnotatedDataFrame(new_fData))
  pData(eset) <- droplevels(adjHai)

  return(eset)
}

expressionSet Construction

To extract gene expression and hai data from ImmuneSpace, an ImmuneSpace-Connection class object, con is created and then data is pulled with a con$getDataset() call. Here, the expression data is preloaded but has been pulled in the manner described above as seen in the R/get_GE.R script. To see how the hai data is adjusted to define "responders" from "non-responders", you can look at the R/adjust_hai.R script that is based on the work of Yuri Kotliarov from NIH.

studies <- c("SDY212","SDY63","SDY404","SDY400","SDY67","SDY80")
eset_d0 <- list()
sdy80all <- "" # need eset with all time points for some figures

for(sdy in studies){
  # See comments in get_GE.R for methodology of generating expression matrices. 
  # These matrices are created using online, public sources (e.g. GEO or ImmuneSpace).
  em <- getGE(sdy)

  # Remove dup biosamples based on Yale HIPCMetaModuleAnalysis.R comments
  if(sdy == "SDY212"){ em <- em[ , -(grep("SUB134307.*", colnames(em))) ] }

  # Yale collaborators have not yet updated GEO database to reflect these changes
  if(sdy == "SDY404"){
    em <- swapCols(em, "SUB120473_d0", "SUB120474_d0") # Incorrect Sex
    em <- swapCols(em, "SUB120460_d0", "SUB120485_d28") # Strong evidence
  }

  # Get raw HAI data from Immunespace and adjust according to Yuri Kotliarov code.
  # NOTE: SDY80 raw data in ImmuneSpace differs, therefore using preloaded.
  # labkey.url.path <- paste0("/Studies/", sdy, "/") # for IS site only
  con <- CreateConnection(sdy)
  rawHai <- if(sdy != "SDY80"){ con$getDataset("hai") }else{ SDY80_rawtiterdata_v2 }
  adjHai <- adjust_hai(sdy, rawHai)

  # Age needed for figure 2 and SDY80 filtering
  dotNum <- gsub("SDY", ".", sdy, fixed = T)
  demo <- con$getDataset("demographics")
  adjHai$Age <- sapply(adjHai$subject, FUN = function( sub ){
    trg <- demo[ which(demo$participant_id == paste0(sub, dotNum))]
    return( trg$age_reported )
  })

  if(sdy == "SDY80"){ adjHai <- adjHai[ which(adjHai$Age < 36), ] }

  # Need study id for figure 2
  adjHai$study <- sdy

  # Prep and push to holders
  eset_d0[[sdy]] <- prepEset(em, adjHai, d0 = T)
  if( sdy == "SDY80" ){ sdy80all <- prepEset(em, adjHai, d0 = F) } 
}

Figure 2 - Age Distribution by Study

Before filtering the expressionSets into "young" (<= 35 years old) and "old" (> 60 years old) cohorts, we create the age distribution figure form the manuscript using the con$getDataset() method described before to pull demographic information for all participants. The subjects are subset into those with day zero gene expression data (dark blue) and those without (light blue). A second table shows the number of subjects available for each study in ImmuneSpace. These numbers differ slightly from the manuscript where some filtering was done prior to the calculation of available subjects - e.g. duplicate biosamples removed in SDY212.

pd <- rbindlist(lapply(studies, function(sdy) {
                        con <- CreateConnection(sdy)
                        demo <- con$getDataset("demographics")
                        ge <- con$getDataset("gene_expression_files")
                        ge <- ge[ ge$study_time_collected == 0, ]
                        demo$ge <- sapply(demo$participant_id, 
                                          FUN = function(x){x %in% unique(ge$participant_id)})
                        filt <- data.frame(Age = demo$age_reported,
                                           GE = demo$ge,
                                           Study = sdy)
                       }))
pd.with <- pd[ pd$GE == TRUE, ]
pd.without <- pd[ pd$GE == FALSE, ]

agePlot <- ggplot()                      + 
            geom_dotplot(data = pd.without,
                         binaxis = "y", 
                         fill = "#e6ffff", 
                         binwidth = 1, 
                         stackdir = "center", 
                         aes(x = Study, y = Age))                 + 
            geom_dotplot(data = pd.with,
                         binaxis = "y", 
                         fill = "blue",
                         binwidth = 1, 
                         stackdir = "center", 
                         aes(x = Study, y = Age))                 + 
            ylab("Age (years)")                                   +
            xlab("Study ID")                                      +
            theme_bw()                                            +
            geom_hline(yintercept = c(35,60), col="red", lty = 2) +
            theme(plot.background = element_blank(),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.border = element_blank(),
                  axis.line = element_line(colour = "black"))

colNms <- c("2008-2009", "2009-2010", "2010-2011", "2011-2012", "2012-2013")
subDf <- matrix(data = "", nrow = 6, ncol = 5, dimnames = list(studies, colNms) )
subDf["SDY212", "2008-2009"] <- nrow( pd[pd$Study == "SDY212", ] )
subDf["SDY63", "2010-2011"] <- nrow( pd[pd$Study == "SDY63", ] )
subDf["SDY404", "2011-2012"] <- nrow( pd[pd$Study == "SDY404", ] )
subDf["SDY400", "2012-2013"] <- nrow( pd[pd$Study == "SDY400", ] )
subDf["SDY80", "2009-2010"] <- nrow( pd[pd$Study == "SDY80", ] )
subDf["SDY67", "2010-2011"] <- nrow( pd[pd$Study == "SDY67", ] )

subDf <- data.frame(subDf, stringsAsFactors = F)
colnames(subDf) <- colNms
tbl <- tableGrob(subDf)

print(plot_grid(agePlot,
          tbl,
          ncol = 1,
          nrow = 2,
          rel_heights = c(5,3),
          labels = c("A","B")))

Figure 3 - Adjusted Multiple Fold Change Score (adjMFC) for HAI data

To provide a more detailed understanding how the hai data adjustment affects the distribution of responders and non-responders, we provide the example of SDY404. Note that the outlier in Figure 3A. with an extremely high d0 titer is removed from the data before decorrelation. The blue horizontal lines show the quantile thresholds for categorizing an individual as high responder (>70%) or low responder (<30%).

# Get raw data from ImmuneSpace and select only young subjects -------------------------------
con <- CreateConnection("SDY404")
rawHai <- con$getDataset("hai")
adjHai <- adjust_hai("SDY404", rawHai)
pd <- adjHai[ adjHai$Age.class == "young", ]

# bins were created manually
bins <- c(0.25, 1, 5)

# setup BEFORE Decorr plot ----------------------------------------------------------------------
cc <- cor.test(pd$young_fc_norm_max_ivt, 
               pd$young_d0_norm_max, 
               method = "spearman", 
               exact = F)

# Create df of group-by-bin statistics
pd.bin <- pd %>%
    mutate(bin = cut(young_d0_norm_max, 
                     breaks = c(-Inf, bins, Inf), 
                     labels = 1:( length(bins) + 1) 
                     ) 
           ) %>%
    group_by(bin) %>%
    summarise(n = n(), 
              fc_median = median(young_fc_norm_max_ivt, na.rm=T),
              cor = ifelse( n > 2, 
                            cor(young_d0_norm_max, 
                                young_fc_norm_max_ivt, 
                                method="spearman", 
                                use="pairwise.complete.obs"), 
                            NA),
              cor.p = ifelse( n > 2, 
                              cor.test(young_d0_norm_max, 
                                       young_fc_norm_max_ivt, 
                                       method="spearman", 
                                       exact = F)$p.value, 
                              NA)) %>%
    ungroup()

pd.bin.show <- pd.bin[, colnames(pd.bin) %in% c("bin", "n", "cor.p")]
colnames(pd.bin.show)[3] <- "p-value"
pd.bin.show <- data.frame(pd.bin.show[1:3, ])

# Custom function developed by Yuri Kotliarov
plot_size_bubbles <- function(x,y) {
  dd = data.frame(x,y) %>%
    group_by(x,y) %>%
    summarise(n=n()) %>%
    ungroup()
  ggplot(dd, aes(x,y)) + geom_point(aes(size=n))
}

plot.subtitle <- sprintf("subjects = %d, Spearman rho = %.2f, p-value = %.2g",
                                      nrow(pd),
                                      cc$estimate,
                                      cc$p.value)
beforePlot <- plot_size_bubbles(pd$young_d0_norm_max, pd$young_fc_norm_max_ivt)        +
                      scale_size_area(max_size = 8)               + 
                      ggtitle(bquote(atop(.("Before Day 0 Decorrelation"),
                                          atop(italic(.(plot.subtitle)), 
                                               "")))) +
                      xlab("Day 0 Titer Score")                   +
                      ylab("") +
                      xlim(-0.5,7.5)                              +
                      ylim(-2.75,2.75)                              +
                      theme_bw()                                  + 
                      theme(legend.key = element_blank())         +
                      geom_vline(xintercept = bins, col='red', lty=1)  +
                      annotation_custom(tableGrob(format(pd.bin.show, digits = 2), 
                                                  rows = NULL), 
                                        xmin = 5.5, 
                                        xmax = 7.5, 
                                        ymin = -2.75, 
                                        ymax = -0.5) +
                      geom_hline(yintercept = quantile(pd$young_fc_norm_max_ivt, 
                                                       c(0.3, 0.7), 
                                                       na.rm=T),
                                 col="blue", lty = 2) 

breaks <- seq(-1.25, 2.25, 0.5)
tmp <- hist(pd$young_fc_norm_max_ivt, breaks = breaks)
negDat <- data.frame(counts = -(tmp$counts), mids = tmp$mids)
negDatEd <- rbind(negDat, data.frame(counts = c(0,0,0,0),
                                     mids = c(-2.5,-2.0,-1.5, 2.5)))
beforeHist <- ggplot(negDatEd, aes(mids,counts)) + 
              geom_col() +
              coord_flip() +
              xlab("MFC Titer Score") +
              ylim(-10,0) +
              theme(axis.text.y = element_blank(),
                    axis.ticks.y = element_blank())

top <- plot_grid(beforeHist, 
                 beforePlot, 
                 ncol = 2, 
                 nrow = 1,
                 align="h",
                 rel_widths = c(0.25,1))

# setup AFTER Decorr Plot -------------------------------------------------------------------------
pd <- pd[ !(pd$subject == "SUB120471"), ] # remove outlier

cc <- cor.test(pd$young_fc_res_max, pd$young_d0_norm_max, method = "spearman", exact = F)
plot.subtitle <- sprintf("subjects = %d, Spearman rho = %.2f, p-value = %.2g",
                                      nrow(pd),
                                      cc$estimate,
                                      cc$p.value)

afterPlot <- plot_size_bubbles(pd$young_d0_norm_max, pd$young_fc_res_max) +
                      scale_size_area(max_size = 8) + 
                      ggtitle(bquote(atop(.("After Day 0 Decorrelation"), 
                                          atop(italic(.(plot.subtitle)), 
                                               "")))) +
                      xlab("Day 0 Titer Score") +
                      ylab("") +
                      xlim(-0.5, 7.5) +
                      ylim(-2.75, 2.75) +
                      theme_bw() + 
                      theme(legend.key = element_blank()) +
                      geom_hline(yintercept = quantile(pd$young_fc_res_max, 
                                                       c(0.3, 0.7), 
                                                       na.rm=T),
                                 col="blue", lty = 2) +
                      geom_vline(xintercept = bins, 
                                 col='red', lty = 1)        


breaks <- seq(-2.25, 1.75, 0.5)
tmp <- hist(pd$young_fc_res_max, breaks = breaks)
negDat <- data.frame(counts = -(tmp$counts), mids = tmp$mids)
negDatEd <- rbind(negDat, data.frame(counts = c(0, 0, 0),
                                     mids = c(-2.5, 2.0, 2.5)))
afterHist <- ggplot(negDatEd, aes(mids,counts)) + 
              geom_col() +
              coord_flip()  +
              xlab("adjMFC Titer Score") +
              ylim(-10,0) +
              theme(axis.text.y = element_blank(),
                    axis.ticks.y = element_blank())

bottom <- plot_grid(afterHist, 
                    afterPlot, 
                    ncol = 2, 
                    nrow = 1,
                    align="h",
                    rel_widths = c(0.25,1))

print(plot_grid(top,
                bottom,
                ncol = 1,
                nrow = 2,
                labels = c("A", "B")))

GeneSet Module Preparation

The gene set modules come from the HIPC collaborators and are found within the package, but must be formatted for use within the meta-analyses.

# Prep GeneSetDB per original code.  Can also use GSEAbase package.
gene_set <- strsplit(geneSetDB, "\t")       ## convert from vector of strings to a list
names(gene_set) <- sapply(gene_set,"[", 1)  ## move the names column as the names of the list
links <- sapply(gene_set,"[", 2)
gene_set <- lapply(gene_set, "[",-1:-2)     ## remove name and description columns
gene_set <- lapply(gene_set, function(x){ x[ which( x != "") ] })      ## remove empty strings

Adjusting expressionSets by age cohort

Before use in the meta-analyses pipelines, the list of expressionSet objects are adjusted for each cohort to select the correct gene expression and hai information by subsetting the subjects by age. SDY80 is treated differently in terms of the hai data and noted below.

adjust_eset <- function(sdy, eset_list, cohort){
  eset <- eset_list[[sdy]]
  storageMode(assayData(eset)) <- "list" 

  # Select correct PhenoData in terms of young or old fc_res_max_d30 and delete rest
  hai_var <- grep("^((fc)|(d0))_", varLabels(eset), value = TRUE) 
  oppo <- ifelse( cohort == 'young', "old", "young")
  eset <- eset[, eset$Age.class %in% cohort] # set in adjust_hai

  # Per notes from F. Vallania @ Stanford and H. Meng @ Yale:
  # SDY80 uses the adjMFC or fc_res_max_d30 calculated from a combined 
  # group of young and slightly older subjects because it improved the estimation
  if(sdy != "SDY80"){
    phenoData(eset) <- phenoData(eset)[, setdiff(varLabels(eset),
                                               c(hai_var, paste0(oppo, "_", hai_var)))]
    varLabels(eset) <- gsub(paste0("^", cohort, "_"), '', varLabels(eset))
  }else{
    hai_var_y <- paste0("young_", hai_var)
    hai_var_o <- paste0("old_", hai_var)
    phenoData(eset) <- phenoData(eset)[ , !(colnames(phenoData(eset)) %in% c(hai_var_o, hai_var_y))]
  }

  # remove NAs from fc_res_max_d30 pheno and therefore exprs to streamline
  eset <- eset[ , !is.na(eset$fc_res_max_d30) ] 

  storageMode(assayData(eset)) <- "lockedEnvironment"
  return(eset)
}

esetLsMkr <- function(eset_list, cohort){
  valSdy <- ifelse(cohort == "young", "SDY80", "SDY67")
  corrEsetList <- eset_list[ c("SDY212","SDY63","SDY404","SDY400", valSdy) ]
  adjEsetList <- lapply(names(corrEsetList), adjust_eset, 
                        eset_list = corrEsetList, 
                        cohort = cohort)
  names(adjEsetList) <- names(corrEsetList)
  return(adjEsetList)
}

yngEsets <- esetLsMkr(eset_d0, cohort = "young")
oldEsets <- esetLsMkr(eset_d0, cohort = "old")

STEP 2: GENE SET META-ANALYSIS

The Gene Set or Module meta analysis based on work by the Yale HIPC collaborators is run first. The analysis is performed by the QuSage package that is found on BioConductor.

Signficant Pathways Selection Parameters:

FDR.cutoff: 0.506
pvalue.cutoff: 0.0105
endPoint: fc_res_max_d30
geneSetDB: BTM_for_GSEA_20131008.gmt

YOUNG COHORT

yng_res <- meta_analysis(adj_eset_list = yngEsets,
                         cohort = "young",
                         gene_set = gene_set)

Discovery Group Significant Pathways Data

``` {r include=TRUE} DT::datatable(yng_res$dsc)

#### Validation Study Significant Pathways Data  (Table 1 from Manuscript)
``` {r include=TRUE}
DT::datatable(yng_res$val)

OLD COHORT

old_res <- meta_analysis(adj_eset_list = oldEsets,
                         cohort = "old",
                         gene_set = gene_set)

Discovery Group Significant Pathways Data

DT::datatable(old_res$dsc)

Validation Study Significant Pathways Data

DT::datatable(old_res$val)

Figure 7 - BCR Signaling (M54) Gene Module Highlight

Since figure 7 is based on Yale's qusage analysis, it is presented here.

# Prep and Helper Methods----------------------------------------------------
discoverySDY = c('SDY212','SDY63','SDY404','SDY400')
loc <- which(colnames(yng_res$combinedPDF$path.PDF) == "BCR signaling (M54)")

# These are deconstructed versions of qusage:::plotDensityCurves() to enable use of ggplot
# without plotRecord for creating figures.
xCoordCalc <- function(QSarray, loc, sif){
  seq(-1, 1 , length.out = QSarray$n.points) * 
    QSarray$ranges[loc] * 
    sif + 
    QSarray$path.mean[loc]
}

yCoordCalc <- function(QSarray, loc, scaleFactor){
  QSarray$path.PDF[,loc] / scaleFactor[loc]
}

makeCoordDf <- function(QSarray, meta, loc, scaleFactor){
  if(meta == FALSE){
    sif <- sqrt(QSarray$vif[loc])
    if(is.na(sif)){ sif <- 1 }
  }else{
    sif <- 1
  }
  x <- xCoordCalc(QSarray, loc, sif)
  y <- yCoordCalc(QSarray, loc, scaleFactor)
  df <- data.frame(x = x, y = y)
}

# A. discovery plot----------------------------------------------------
discSF <- qusage:::pdfScaleFactor(yng_res$combinedPDF)

# Get individual study coordinates
coordDfs <- lapply(yng_res$combinedPDF$QSlist, makeCoordDf, FALSE, loc, discSF)

# Get meta study coordinates
coordDfs[[5]] <- makeCoordDf(yng_res$combinedPDF, TRUE, loc, discSF)  
names(coordDfs) <- c(discoverySDY, "Meta-Analysis")

discPlot <- ggplot() +
                geom_line(data = coordDfs$SDY212, aes(x,y, color="SDY212")) +
                geom_line(data = coordDfs$SDY63, aes(x,y, color="SDY63")) + 
                geom_line(data = coordDfs$SDY404, aes(x,y, color="SDY404")) + 
                geom_line(data = coordDfs$SDY400, aes(x,y, color="SDY400")) + 
                geom_line(data = coordDfs$`Meta-Analysis`, aes(x,y, color="Meta-Analysis")) +
                scale_color_manual("", values = c("SDY212"="#E41A1C", 
                                               "SDY63"="#377EB8", 
                                               "SDY404"="#4DAF4A",
                                               "SDY400"="#984EA3", 
                                               "Meta-Analysis"="black")) +
                ggtitle("Discovery") +
                xlim(-1,1) +
                xlab("Gene Module Activity") +
                ylab("Density") +
                geom_vline(xintercept = 0) +
                theme(legend.text = element_text(c(discoverySDY,"Meta-Analysis")))

# B. validation plot-------------------------------------------------------
valSF <- qusage:::pdfScaleFactor(yng_res$valPDF)
valDF <- makeCoordDf(yng_res$valPDF, 
                     meta = FALSE, 
                     loc, 
                     scaleFactor = valSF)

valPlot <- ggplot() +
                geom_line(data = valDF, aes(x,y, color="SDY80")) +
                scale_color_manual("", values = c("SDY80"="black")) +
                ggtitle("Validation") +
                xlim(-1,1) +
                xlab("Gene Module Activity") +
                ylab("") +
                geom_vline(xintercept = 0)

# C. multiday plot----------------------------------------------------------
eset <- sdy80all[ , !is.na(sdy80all$fc_res_max_d30) ]

# get Module Intensity for each subject by averaging for genes in module
em <- data.frame(exprs(eset), stringsAsFactors = F)
em$geneSymbols <- fData(eset)$gene_symbol
em <- em[ em$geneSymbols %in% gene_set$`BCR signaling (M54)`, ]
em <- em[ , !(colnames(em) == "geneSymbols") ]
em.Means <- data.frame(subject = colnames(em), ModInt = colMeans(em, na.rm = T))

# Qualify response, ensure timepoints are ordered by setting class to numeric
pd <- pData(eset)
pd$timepoint.day <- as.numeric(pd$timepoint.day)
pd$Response <- c("Low responder",
                "Moderate responder",
                "High responder")[ pd$fc_res_max_d30 + 1 ]

# merge and order for ggplot
tcData <- merge(pd, em.Means, by = "subject" )
tcData$Response<- factor(tcData$Response,
                    c("Low responder",
                      "Moderate responder",
                      "High responder"),
                    ordered = TRUE)

tcPlot <- ggplot(tcData,
                 aes(x    = factor(timepoint.day),
                     y    = ModInt,
                     fill = Response))                             +
                 geom_boxplot(outlier.size = 0)                               +
                 geom_point(position = position_jitterdodge(dodge.width = 0.8),
                           aes(color = Response, shape = Response), size = 3)         +
                 scale_fill_manual(values=c("white","white","white"))         +
                 scale_shape_manual(values=c(15,19,17))                       +
                 scale_color_manual(values=c("deepskyblue","magenta","red"))  +
                 ggtitle("Expression Differences by Responder Category over Time") +
                 xlab("Time (Days post Vaccination)")                         + 
                 ylab("Gene Module Intensity") +
                 ylim(8.8, 9.6)

# Output ---------------------------------------
top <- plot_grid(discPlot,
                     valPlot,
                     ncol = 2,
                     nrow = 1,
                     labels = c("A","B"),
                     rel_widths = c(1.25,1))
bottom <- plot_grid(tcPlot,
                     ncol = 1,
                     nrow = 1,
                     labels = c("C"))
print(plot_grid(top,
                bottom,
                ncol = 1,
                nrow = 2,
                align = "v",
                rel_heights = c(1,1.25)))

STEP 3: SINGLE GENE ANALYSIS

The single gene meta analysis was created by the Stanford HIPC collaborators and uses their package MetaIntegrator, which is also available on BioConductor.

# outputs a plot so need include=FALSE
sgaResYng <- singleGeneAnalysis(adjEsetList = yngEsets, 
                                grpToCut = 1,
                                cohort = "young")

YOUNG COHORT: HIGH VS LOW RESPONDERS

Discovery - Upregulated Genes

up <- data.frame(GeneName = rownames(sgaResYng$disc$upReg),
                 effectSize = sgaResYng$disc$upReg$effectSize,
                 effectSizeStdError = sgaResYng$disc$upReg$effectSizeStandardError,
                 fisherPvalUp = sgaResYng$disc$upReg$fisherPvalUp,
                 fisherFDRUp = sgaResYng$disc$upReg$fisherFDRUp,
                 stringsAsFactors = F)
DT::datatable(up, fillContainer = T)

Discovery - Downregulated Genes

dn <- data.frame(GeneName = rownames(sgaResYng$disc$dnReg),
                 effectSize = sgaResYng$disc$dnReg$effectSize,
                 effectSizeStdError = sgaResYng$disc$dnReg$effectSizeStandardError,
                 fisherPvalDown = sgaResYng$disc$dnReg$fisherPvalDown,
                 fisherFDRDown = sgaResYng$disc$dnReg$fisherFDRDown,
                 stringsAsFactors = F)
DT::datatable(dn, fillContainer = T)

Figure 4 - Forest Plots of Significant Genes in Discovery Studies

These Forest plots show significant genes from the discovery studies, both up and down regulated. Code developed by F. Vallenia @ Stanford.

customForestPlot <- function(geneName, metaObject, summaryTbl){

  # Prep and subsetting
  studyData <- data.frame(means = metaObject$metaAnalysis$datasetEffectSizes[ geneName, ])
  studyData$SEs <- metaObject$metaAnalysis$datasetEffectSizeStandardErrors[ geneName, ]
  studyData <- studyData[ order( rownames(studyData) ), ]
  studyMeans <- studyData$means
  names(studyMeans) <- rownames(studyData)
  pooledMean <- metaObject$metaAnalysis$pooledResults[geneName, "effectSize"]
  pooledSE <- metaObject$metaAnalysis$pooledResults[geneName, "effectSizeStandardError"]
  titleN <- paste(geneName,
                  "\nES =",
                  round(summaryTbl[ rn == geneName ]$effectSize, 2),
                  " p =",
                  formatC(summaryTbl[ rn == geneName ]$effectSizePval, 3))

  # Plot generation
  rmeta::metaplot(mn = studyMeans, 
           se = studyData$SEs, 
           labels = rownames(studyData), 
           xlab = "Standardized Mean Difference (log2 scale)", 
           ylab = "", 
           xlim = c(-2,3),
           colors = rmeta::meta.colors(box = "black", 
                                lines = "black", 
                                zero = "black", 
                                summary = "black", 
                                text = "black"), 
           summn = pooledMean, 
           sumse = pooledSE, 
           sumnn = 1 / pooledSE^2, 
           main = titleN)
}

# Plot as pdf and then load img
summaryTbl <- as.data.table(sgaResYng$disc$metaObj$metaAnalysis$pooledResults, 
                            keep.rownames = T)
allGenes <- c(sgaResYng$disc$posGenes, sgaResYng$disc$negGenes)

png("fig4.png",
    height = 3*3,
    width  = 5*2.5,
    units = "in",
    res = 150)
par(mfrow = c(3,5))
dmp <- lapply(allGenes, customForestPlot, sgaResYng$disc$metaObj, summaryTbl)
dmp <- suppressMessages(dev.off())

Figure 6 - Young Validation Study (SDY80) Immune Response

Fig.6A shows a violin plot of the response score for each category (low, moderate, high) with SDY80 with the geometric mean. Fig6B. is a ROC curve for the High vs Low and Moderate vs Low responders. Fig.6C shows the change in response score over time.

# Create Response by timepoint plot ---------------------------------------------
eset <- sdy80all[ , !is.na(sdy80all$fc_res_max_d30) ]

gem <- makeGEM(eset = eset,
               rmGrp = FALSE,
               grpToCut = NA,
               ens_table = ens_table)

# select positive genes from Discovery group and compute validation
gem$pheno$score <- arithMean_comb_validate(GEM = gem, 
                                           genes_p = sgaResYng$disc$posGenes,
                                           genes_n = NULL)$score

gem$pheno$Response <- c("Low responder",
                    "Moderate responder",
                    "High responder")[ gem$class + 1 ]

tcData <- gem$pheno[ , colnames(gem$pheno) %in% c("Response", "timepoint.day", "score") ]
tcData$timepoint.day <- as.numeric(tcData$timepoint.day)
tcData$subject <- substr(gem$pheno$subject, start = 0, stop = 9)
tcData$Response <- factor(tcData$Response,
                      c("Low responder",
                        "Moderate responder",
                        "High responder"),
                      ordered = TRUE)

tcPlot <- ggplot(tcData,
        aes(x    = factor(timepoint.day),
            y    = score,
            fill = Response))                                       +
        geom_boxplot(outlier.size = 0)                               +
        geom_point(position = position_jitterdodge(dodge.width=0.8),
                   aes(color = Response, shape = Response), size = 3)        +
        scale_fill_manual(values=c("white","white","white"))         +
        scale_shape_manual(values=c(15,19,17))                       +
        scale_color_manual(values=c("deepskyblue","magenta","red"))  +
        xlab("Time (Days)")                                          + 
        ylab("Response Score") +
        theme(legend.position = "bottom",
              legend.title = element_blank())

#compute day0 vs day1 paired t-test
t0 <- tcData[ tcData$timepoint.day == 0, ]
t1 <- tcData[ tcData$timepoint.day == 1, ]
t1 <- t1[ t1$subject %in% t0$subject, ]
t0 <- t0[ order(t0$subject), ]
t1 <- t1[ order(t1$subject), ]
t01p <- t.test(t0$score, 
               t1$score, 
               paired = T, 
               alternative = 'less')
t01p.pval <- t01p$p.value

#compute comparison of low vs high responders at each time point
timeTTest <- function(day, tcData){
  low <- tcData[tcData$timepoint.day == day & tcData$Response == "Low responder", ]
  high <- tcData[tcData$timepoint.day == day & tcData$Response == "High responder", ]
  res <- t.test(low$score, high$score)
  return(res$p.value)
}
tp <- sort(unique(tcData$timepoint.day))
pvals <- round(sapply(tp, timeTTest, tcData), digits = 2)
timeDf <- t(data.frame(Day = tp, `p-value` = pvals))
timeTg <- tableGrob(timeDf)


top <- plot_grid(sgaResYng$val$ViolinPlot,
                 sgaResYng$val$RocPlot,
                 ncol = 2,
                 nrow = 1,
                 align = "h",
                 labels = c("A","B"))

bottom <- plot_grid(tcPlot,
                    timeTg,
                    ncol = 1,
                    nrow = 2,
                    labels = c("C",""),
                    rel_heights = c(4,1))

print(plot_grid(top,
                bottom,
                ncol = 1,
                nrow = 2))

OLD COHORT: HIGH VS LOW RESPONDERS

No Significant up or down-regulated genes were found using the adjusted expressionSet and therefore plots are not shown for SDY67 alone.

# outputs a plot so need include=FALSE
sgaResOld <- singleGeneAnalysis(adjEsetList = oldEsets, 
                                grpToCut = 1,
                                cohort = "old")

YOUNG COHORT: MODERATE VS LOW RESPONDERS

No Significant up or down-regulated genes were found using the adjusted expressionSet, therefore the significant genes from the young cohort with high vs low responders are used to generate figure 9.

# outputs a plot so need include=FALSE
# Note: SDY400 is not included because there are less than two cases and/or controls
sgaResLvm <- singleGeneAnalysis(adjEsetList = yngEsets, 
                                grpToCut = 2,
                                cohort = "young")

Figure 9 - Forest Plots of Significant Genes for Low vs Moderate Responders

# Plot as png and then load img
summaryTbl <- as.data.table(sgaResLvm$disc$metaObj$metaAnalysis$pooledResults, 
                            keep.rownames = T)
# NOTE: using the genes from the sgaResYng, no sig-genes for sgaResLvm
allGenes <- c(sgaResYng$disc$posGenes, sgaResYng$disc$negGenes)

png("fig9.png",
    height = 3*3,
    width  = 5*2.5,
    units = "in",
    res = 150)
par(mfrow = c(3,5))
dmp <- lapply(allGenes, customForestPlot, sgaResLvm$disc$metaObj, summaryTbl)
dmp <- suppressMessages(dev.off())

STEP 4: CROSS-ANALYSES FIGURES

Figure 5 - Significant Genes / Modules Heatmaps

Combining the results of the module intensity and single gene analysis, Figure 5 presents heat maps of both. Fig.5A differs from the manuscript due to the SDY404 subject swap - in this case, the pvalue for 'enriched in activated dendritic cells / monocytes (M64)' is slightly greater and therefore does not meet the threshold. Fig.5B is also missing two genes that were not present in SDY80 on ImmuneSpace: "APOB48R","TNFSF13". This difference is likely due to feature / annotation updates made in bioconductor between the generation of the manuscript and the creation of gene expression matrices on ImmuneSpace.

# A - Young and Old Modules---------------------------------------------------------------------
sigMods <- rownames(yng_res$val)
prepPhmap <- function(resObj, sigMods, valSdyNm){
  sdyLs <- c("SDY212","SDY63","SDY404","SDY400","valPDF")
  lsSubset <- resObj[ names(resObj) %in% sdyLs]
  modMeans <- rbind(sapply(names(lsSubset), FUN = function(sdy){
    ls <- lsSubset[[sdy]]$path.mean
    ls[ names(ls) %in% sigMods ]
  }))
  colnames(modMeans)[[5]] <- valSdyNm
  return(modMeans)
}


yngMns <- prepPhmap(yng_res, sigMods, "SDY80")
png("fig5AYng.png", width = 300, height = 300)
pheatmap::pheatmap(yngMns,
                   color = colorRampPalette(rev(brewer.pal(n = 7, 
                                                            name = "RdBu")))(100),
                   cluster_cols = FALSE,
                   cluster_rows = FALSE,
                   main = "Young",
                   show_rownames = FALSE,
                   legend = FALSE)
dmp <- suppressMessages(dev.off())
yngPng <- png::readPNG("fig5AYng.png")
yngGrob <- grid::rasterGrob(yngPng, interpolate=TRUE) 
yngPlot <- ggplot() +
           annotation_custom(yngGrob)


oldMns <- prepPhmap(old_res, sigMods, "SDY67")
png("fig5AOld.png", width = 500, height = 300)
pheatmap::pheatmap(oldMns,
                   color = colorRampPalette(rev(brewer.pal(n = 7, 
                                                            name = "RdBu")))(100),
                   cluster_cols = FALSE,
                   cluster_rows = FALSE,
                   main = "Older",
                   show_rownames = TRUE,
                   legend = TRUE)
dmp <- suppressMessages(dev.off())
oldPng <- png::readPNG("fig5AOld.png")
oldGrob <- grid::rasterGrob(oldPng, interpolate=TRUE) 
oldPlot <- ggplot() +
           annotation_custom(oldGrob)


# B - Single Genes -----------------------------------------------------------------------------
# Pull list of genes found in BCR signaling (M54), Platelet Activation (M42), 
# and Inflammatory Response (M33). Update to remove genes not found in SDY80 eset.
# Although these were found in the original?
genes <- unique(unlist(gene_set[ grep("M42|M54|M33", names(gene_set))]))
genes <- sort(genes[ !(genes %in% c("APOB48R","TNFSF13")) ]) # not found in SDY80

# fn to return average of gene expression within sdy and response
adjEm <- function(eset, resp, genes){
  pd <- pData(eset)
  em <- data.table(exprs(eset))
  subs <- paste0(pd$subject[ pd$fc_res_max_d30 == resp ], "_d0")
  em <- em[ , ( colnames(em) %in% subs ), with=FALSE ]
  em <- em[, gs := fData(eset)$gene_symbol ]
  setkey(em, gs)
  em <- em[.(genes)]
  em <- em[ , lapply(.SD, mean), by="gs", .SDcols = grep("^SUB", colnames(em))]
  em <- em[ , AVG := rowMeans(.SD, na.rm=T), .SDcols = grep("^SUB", colnames(em)) ]
  em <- em[ order(gs) ]
  return(em$AVG)
}

# Find difference of high and low responders in terms of log2 gene expression for each sdy
sdys <- c("SDY212","SDY63","SDY404","SDY400","SDY80")
diffEm <- t(rbind(sapply(sdys, FUN = function(sdy){
  eset <- yngEsets[[sdy]]
  high <- adjEm(eset, 2, genes)
  low <- adjEm(eset, 0, genes)
  diff <- (high - low) / low
})))
colnames(diffEm) <- genes

png("fig5B.png", height = 300, width = 750)
pheatmap::pheatmap(diffEm,
                   color = colorRampPalette(rev(brewer.pal(n = 7, 
                                                            name = "RdBu")))(100),
                   cluster_cols = TRUE,
                   cluster_rows = FALSE)
dmp <- suppressMessages(dev.off())
sgaPng <- png::readPNG("fig5B.png")
sgaGrob <- grid::rasterGrob(sgaPng, interpolate=TRUE) 
sgaPlot <- ggplot() +
           annotation_custom(sgaGrob)

top_row <- plot_grid(yngPlot, 
                     oldPlot,
                     ncol = 2,
                     nrow = 1,
                     rel_widths = c(1.25,2))

print(plot_grid(top_row, 
                sgaPlot, 
                ncol = 1, 
                nrow = 2,
                labels = c("A","B")))

Figure 8 - Gene / Modules Effect Sizes

Figure 8A compares p-values of each gene in older and younger cohorts. The dark squares are genes that were found to be significant in the young cohort, either up or down-regulated. Figure 8B does the same for pathway-activity of the gene set modules from the discovery studies. Each gene or pathway is coded into non-significant (grey circle), significant for the young cohort (black square), or significant for the old cohort (black triangle).

# Single Gene --------------------------------------------------------------------
# input
sga <- as.data.table(merge(sgaResYng$disc$metaObj$metaAnalysis$pooledResults,
                           sgaResOld$disc$metaObj$metaAnalysis$pooledResults,
                           by = 0))
sigGenes <- c(sgaResYng$disc$posGenes, sgaResYng$disc$negGenes)

# fig
effCorrSGA <- ggplot(sga,
  aes(y = effectSize.y, x = effectSize.x)) +
  xlim(-1.5,1.5) +
  ylim(-1.5,1.5) +
  geom_vline(xintercept=0,size=0.1, colour="black")   +
  geom_hline(yintercept=0,size=0.1, colour="black")   +
  geom_point(color = 'gray65')                        +
  theme_bw() +
  theme(axis.title        = element_text(size=16,face="bold"),
        legend.title      = element_blank(),
        legend.key        = element_rect(colour = "white"),
        legend.position   = c(0.25, 0.9),
        legend.text       = element_text(size = 15,face="bold"),
        legend.background = element_rect(colour = "lightgray")) +
  geom_point(data = sga[ sga$Row.names %in% sigGenes,],
             aes(x = effectSize.x,y = effectSize.y),
             color = 'gray20',
             size  = 3,
             shape = 15) +
  xlab("Gene effect size (young)") + ylab("Gene effect size (older)")


# Gene Module --------------------------------------------------------------------
# input
gma <- as.data.table(merge(yng_res$dsc,
                           old_res$dsc,
                           by = 0))

# figure
effCorrGMA<- ggplot(gma,
  aes(y = pathwayActivity.y, x = pathwayActivity.x))  +
  xlim(-0.3,0.3) +
  ylim(-0.3,0.3) +
  geom_vline(xintercept=0,size=0.1, colour="black")   +
  geom_hline(yintercept=0,size=0.1, colour="black")   +
  geom_point(color = 'gray65')                        +
  theme_bw() +
  theme(axis.title        = element_text(size=16,face="bold"),
        legend.title      = element_blank(),
        legend.key        = element_rect(colour = "white"),
        legend.position   = c(0.25, 0.9),
        legend.text       = element_text(size = 15,face="bold"),
        legend.background = element_rect(colour = "lightgray")) +
  geom_point(data = gma[ gma$Row.names %in% rownames(yng_res$val), ],
             aes(x = pathwayActivity.x, y = pathwayActivity.y),
             color = 'gray20',
             size  = 3,
             shape = 15) +
  geom_point(data = gma[ gma$Row.names %in% rownames(old_res$val), ],
             aes(x = pathwayActivity.x,y = pathwayActivity.y),
             color = 'gray20',
             size  = 3,
             shape = 17) +
  xlab("Gene Module Activity (young)") + ylab("Gene Module Activity (older)")

print(plot_grid(effCorrSGA, 
                effCorrGMA, 
                ncol = 2, 
                nrow = 1,
                labels = c("A","B")))

ehfhcrc/ImmSig2 documentation built on May 16, 2019, 12:15 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ehfhcrc/ImmSig2 Streamlined package to demonstrate HIPC ImmuneSigatures work using ImmuneSpace Data

In ehfhcrc/ImmSig2: Streamlined package to demonstrate HIPC ImmuneSigatures work using ImmuneSpace Data

PACKAGE INFORMATION

Created: May 2017

Study Authors: HIPC ImmuneSignature Collaborators

Package Developer: Evan Henrich (Gottardo Lab - FHCRC)

Contact: ehenrich@fredhutch.org

OVERVIEW

Figure 1 - Analysis Workflow

Data Notes:

Package Notes:

STEP 1: DATA EXTRACTION AND CLEANING

Setup for expressionSet object creation

expressionSet Construction

Figure 2 - Age Distribution by Study

Figure 3 - Adjusted Multiple Fold Change Score (adjMFC) for HAI data

GeneSet Module Preparation

Adjusting expressionSets by age cohort

STEP 2: GENE SET META-ANALYSIS

Signficant Pathways Selection Parameters:

YOUNG COHORT

Discovery Group Significant Pathways Data

OLD COHORT

Discovery Group Significant Pathways Data

Validation Study Significant Pathways Data

Figure 7 - BCR Signaling (M54) Gene Module Highlight

STEP 3: SINGLE GENE ANALYSIS

YOUNG COHORT: HIGH VS LOW RESPONDERS

Discovery - Upregulated Genes

Discovery - Downregulated Genes

Figure 4 - Forest Plots of Significant Genes in Discovery Studies

Figure 6 - Young Validation Study (SDY80) Immune Response

OLD COHORT: HIGH VS LOW RESPONDERS

YOUNG COHORT: MODERATE VS LOW RESPONDERS

Figure 9 - Forest Plots of Significant Genes for Low vs Moderate Responders

STEP 4: CROSS-ANALYSES FIGURES

Figure 5 - Significant Genes / Modules Heatmaps

Figure 8 - Gene / Modules Effect Sizes

R Package Documentation

Browse R Packages

We want your feedback!

ehfhcrc/ImmSig2
Streamlined package to demonstrate HIPC ImmuneSigatures work using ImmuneSpace Data