In vdumeaux/mixtR: Matched Interaction across Tissues (MIxT) data analysis

The mixtR package (Matched Interactions across Tissues) contains the computational and statistical methods for identifying and exploring associations between sets of genes or molecular processes across tissues.

The mixtR analysis requires gene expression profiles of two matched tissues and is based on two core functions: - sig.ranksum a function that maps samples to a linear ordering based on expression of a set of genes of interest - stat.ranksum a function that derives estimates of significance for association between gene sets of interest across tissues.

The mixtR package also offers ad-hoc plotting functions (cohort_heatmap, cohort_scatterplot, cohort_boxplot) for visualization of the results.

If the user wants to build a web application to explore/disseminate the results, the outputs should be saved in the /data folder of the mixtApp R package. Instructions to build the web application are described in the MIxT web application vignette.

First load the mixtR package in your R session.

library(mixtR)

1. Input data format

In this vignette, we are taking the example of gene expression profiles from microdissected epithelium and stroma from non-inflammatory invasive breast tumors from boersma et al.

Those data are stored in the /data folder of the mixtApp R package and are accessed through RawGit that provides a static reference to the files.

files= c("dat.rda", "moduleColors.rda")
dataDir <- 'https://rawgit.com/vdumeaux/mixtApp/master/data/'
files.dir=paste0(dataDir,files)

for(i in 1:length(files)){
 load(url(files.dir[i]))
}

Expression and clinical data from matched tissues were formatted so that they are contained in a list named dat.

Each object of the list must include the following objects for each tissue: exprs an expression matrix with m genes in rows and n patients in columns. Rownames of exprs are the same gene identifiers as the one used to define gene sets. cohorts list of patients subgroups. If you have only one , the vector should still be present clinical (optional) a clinical data frame with n patients in rows and clincal variables in columns. Categorical variables should be character or factors. Continuous variables should be numeric. clinical.colors (optional) a clinical data frame with n patients in rows and clincal variables in columns coded by discrete or continuous color-coding. We recommend the colorspace R package that provides perceptually-based color palettes for coding categorical data (qualitative palettes) and numerical variables (sequential and diverging palettes).

str(dat)

moduleColors is a list of length 2: each object is a named vector of gene set appartenance for each tissue. In our example, each gene is part of a module grouping strongly co-expressed genes together as computed by WGCNA R package. Other clustering approaches could replace this step. In general any approach that moves the focus from a single gene to a “gene set” could suffice; eg it could be groups of genes involved in different pathways/molecular processes.

str(moduleColors)

IF you work with gene networks, you can import the top node and edges for each network that you can vizualize in the MIxT web application.

load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/net.rda"))
str(net)

2. Patient linear ordering and region of independence

sig.ranksum() allows you to map 𝑛 samples to a linear ordering based on expression of 𝑘 genes within a given gene set/module.

If you know the directionality of gene expression, the gene set of interest can be split between the up and dn genes. If you do not know the directionality of gene expression, all genes in the gene set are listed in the ns argument. In this case, genes will be partitioned into two groups around myeloids using correlation as the distance metric.

The patient ordering is then partition in three categories (high, mid, low) following a random sampling procedure (eg n = 10,000). By default, the region of independence (ROI) is determined by the 0.025 and 0.975 percentile point of the distribution of random patient ranks (middle.range = 0.95).Both n and middle.range can be set differently.

gene.sets <- lapply(moduleColors, function(x.moduleColors){
  split(names(x.moduleColors), factor(x.moduleColors))
})
str(gene.sets)

## compute ranksum for each gene sets in each tissue
bresat <- list()

bresat$epi <- lapply(gene.sets$epi, function (mod){
    sig.ranksum (x.dat = dat$epi, 
                 ns = which(rownames(dat$epi$exprs) %in% mod),
                 n=10000, mc.cores = 2)})

bresat$stroma <- lapply(gene.sets$stroma, function (mod){
    sig.ranksum (x.dat = dat$stroma, 
                 ns = which(rownames(dat$stroma$exprs) %in% mod),
                 n=10000, mc.cores = 2)})

save(bresat, file = "../../mixtApp/data/bresat.rda")

load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/bresat.rda"))

names(bresat$epi)
str(bresat$epi$blue)

You can plot heatmap of a given gene set expressed in a given tissue and reordered by the output of sig.ranksum().

cohort_heatmap(mixt.dat = dat, mixt.ranksum = bresat, tissue = "epi", 
               module = "darkgrey", cohort.name = "all", cl.height = 3)

3. Functional enrichment

The MSigDB database from the Broad Institute provides multiple collections of gene sets that can be used to functionally annotate the gene modules in each tissue. Enrichment for each gene signature is estimated for all genes in each module and separately for genes that are positively (red genes up) or negatively (blue genes dn) correlated with the patient ranksum in each cohort using the hypergeometric minimum-likelihood P-values,computed with the function ‘dhyper’ (equivalent to one-sided Fisher exact test). P-values can then adjusted for multiple testing choosing p.adjust.method (see ?stats::p.adjust to review methods available).

MSigDB cannot be accessed without a free registration, so the first things you'll need to do is download the .gmt files using the MSigDB page and save it in a folder named extdata for example.

read.gmt.msigdb() will import your .gmt files in your working environment as follow

msigdb.files <- c("c1.all", "h.all", "c2.cp", "c2.cgp", "c5.all", "c6.all",
"c7.all")

msigdb <- list()

msigdb <- lapply(msigdb.files, function(p) {
  input.file <- file.path("../../extdata", paste0(p, ".v5.2.symbols.gmt"))
  ret <-  read.gmt.msigdb(input.file)
  return(ret)
})
names(msigdb) <- msigdb.files

We choose to use the hallmark (h), poistional (c1) and curated (c2) gene set collections.

sig<-list()

sig$all.sig<-c(msigdb$h.all, msigdb$c1.all, msigdb$c2.cp, msigdb$c2.cgp)

sig$set<-c(rep("h", length(msigdb$h.all)),
           rep("c1", length(msigdb$c1.all)),
           rep("c2.cp", length(msigdb$c2.cp)),
           rep("c2.cgp", length(msigdb$c2.cgp)))

msigdb.enrichment <- list()
msigdb.enrichment$epi<- functional.enrichment(mixt.ranksum = bresat,
                                              tissue="epi",
                                              cohort.name = "ern",
                                              functional.groups = sig,
                                              p.adjust.method = "BH",
                                              mc.cores=80)

msigdb.enrichment$stroma<- functional.enrichment(mixt.ranksum = bresat,
                                                 tissue="stroma",
                                                 cohort.name = "ern",
                                                 functional.groups = sig,
                                                 p.adjust.method = "BH",
                                                 mc.cores=80)

save(msigdb.enrichment, file = "../../mixtApp/data/msigdb.enrichment.rda")

load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/msigdb.enrichment.rda"))

library(plyr)

tissue.names<-c("epi", "stroma")
set<-levels(factor(msigdb.enrichment$epi$turquoise$results$sig.set))

results<-list()
for (tissue in tissue.names){
results[[tissue]]<-lapply(msigdb.enrichment[[tissue]], function(x) x$results)
}

msig.results<-list()

msig.results<-sapply(tissue.names, function(tissue) {
  ret <- lapply(names(results[[tissue]]), function(mod) {
    res<-do.call(rbind, lapply(set, function(s){
      x<-results[[tissue]][[mod]][results[[tissue]][[mod]]$sig.set==s,]
      x<-cbind(data.frame(name=rownames(x)), x)
      top.5<-x[order(as.numeric(as.character(x$updn.pval)), decreasing=F),][1:5,]}))
    })
  names(ret)<-names(results[[tissue]])
  return(ret)
  }, simplify=F)

epi.msig<-ldply(msig.results$epi)
stroma.msig<-ldply(msig.results$stroma)
knitr::kable(epi.msig[1:6,])
knitr::kable(stroma.msig[1:6,])

write.table(epi.msig, file = "../output/epi.msig.txt", sep="\t", col.names = T, quote = FALSE)
write.table(stroma.msig, file = "../output/stroma.msig.txt", sep="\t", col.names = T, quote = FALSE)

We also can annotate gene sets/modules based on gene ontology (GO) enrichment, for example using the R topGO package.

library(topGO)
library(org.Hs.eg.db)
library(plyr)
bckg.genes <- rownames(dat[[1]]$exprs) 
xx <- annFUN.org("BP", mapping = "org.Hs.eg.db", ID = "symbol")
head(xx)
xx <- unique(unlist(xx))
xx<-xx[xx %in% bckg.genes]

goterms <- NULL

goterms <- lapply(tissue.names, function (tissue){

  ret <- lapply(names(gene.sets[[tissue]]), function(mod) {
  gene.list <- gene.sets[[tissue]][[mod]]
  gene.list.2 <- ifelse(xx %in% gene.list, 1, 0)
  gene.list.2 <- factor(gene.list.2)
  names(gene.list.2) <- xx

  GO.genes <- new("topGOdata",
                    ontology = "BP",
                    allGenes = gene.list.2,
                    nodeSize = 5,
                    annot = annFUN.org, 
                    mapping = "org.Hs.eg.db",
                    ID = "symbol")

  resultFisher <- runTest(GO.genes, algorithm = "classic", 
                            statistic = "fisher")

  weight01Fisher <- runTest(GO.genes, statistic = "fisher")

  allRes <- GenTable(GO.genes, classicFisher = resultFisher, 
                      weight01Fisher= weight01Fisher, 
                      orderBy = "weight01Fisher", 
                      topNodes=length(usedGO(GO.genes)), numChar = 1000) 
  allRes <- subset(allRes, classicFisher != "1.00000")

  gt <- genesInTerm(GO.genes, allRes$GO.ID)
  mod.genes <- gene.sets[[tissue]][[mod]]
  common<-lapply(gt, function(sig) intersect(mod.genes, sig))

  results <- list(GO.table = allRes,
              common = common)

  return(results)
})
  names(ret) <- names(gene.sets[[tissue]])
  return(ret)
})
names(goterms) <- tissue.names

goterms.table <- list()
for (tissue in c("epi", "stroma")){
  goterms.table[[tissue]] <- lapply(goterms[[tissue]], "[[", "GO.table")
  goterms.table[[tissue]] <- lapply(goterms.table[[tissue]], function(x){
    dplyr::slice(x, 1:20)})
}

epi.goterms<-ldply(goterms.table$epi)
stroma.goterms<-ldply(goterms.table$stroma)

write.table(epi.goterms, file = "epi.goterms.txt", sep="\t", col.names = T)
write.table(stroma.goterms, file = "stroma.goterms.txt", sep="\t", col.names = T)

save(goterms, file = "../../mixtApp/data/goterms.RData")

load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/goterms.rda"))

4. Association between gene sets and clinical variables

Using ranksums to capture module expression, we can ask how gene sets/modules in each tissue are differentially expressed according to patient’s clinicopathological variables. The type of the clinicopathological attribute (categorical or continuous) determines the underlying statistical test. Pearson correlation (Student asymptotic p-value) is used to test association between a given module and continuous patient attributes (eg. age). Analysis of Variance (ANOVA) is used to test association between a given module and categorical patient attributes (eg. ER status).

Significance of the associations is computated using permutations. To determine association between modules within a given subtype, we first stratify on subtypes and then rerun the MIxT procedure: (i) computation of ranksums in each module, (ii) calculation of the association test and (iii) permutations to estimate significance of associations.

epi.clinical.assoc <- lapply(names(dat$epi$cohorts), function(cohort.name) {
  clinical.ranksum(mixt.dat = dat, 
                   mixt.ranksum = bresat, 
                   tissue="epi", cohort = cohort.name,
                   nRuns = 10000, mc.cores = 80)
})
names(epi.clinical.assoc) <- names(dat$epi$cohorts)
knitr::kable(epi.clinical.assoc$ern[, 1:6])

stroma.clinical.assoc <- lapply(names(dat$stroma$cohorts), function(cohort.name) {
  clinical.ranksum(mixt.dat = dat, 
                   mixt.ranksum = bresat,
                   tissue="stroma", cohort=cohort.name,
                   nRuns = 10000, mc.cores = 80)
})
names(stroma.clinical.assoc) <- names(dat$stroma$cohorts)
knitr::kable(stroma.clinical.assoc$all[, 1:6])

## Adjust for mutliple testing
emp.p <- list()

emp.p <- NULL
for (cohort.name in names(dat$epi$cohorts)){
  emp.p[[cohort.name]] <- cbind(epi.clinical.assoc[[cohort.name]], 
                                stroma.clinical.assoc[[cohort.name]])
}

### Families of tests
group1 <- c("her2", "HER2S")
group2 <- c("er", "LUMS", "ERS")
singles <- rownames(emp.p$all)[!rownames(emp.p$all) %in% c(group1, group2) ]
gp <- list(group1, group2)

mod_clinical_fdr <- NULL
mod_clinical_fdr <- lapply(emp.p, function(p.dat){
  ret <- do.call(rbind, lapply(gp, function(gpx) {
      matrix(p.adjust(p.dat[rownames(p.dat) %in% gpx, ], method="BH"), 
             nrow=length(which(rownames(p.dat) %in% gpx)), 
             ncol = ncol(p.dat),
             dimnames = list(rownames(p.dat)[rownames(p.dat) %in% gpx], 
                             colnames(p.dat))
             )
    }))
  ret <- rbind(ret, t(apply(p.dat[rownames(p.dat) %in% singles, ], 1, 
                            p.adjust, method="BH")))
  return(ret)
})


mod_clinical_fdr <- lapply(mod_clinical_fdr, function(val){
  return(list(epi=val[, 1:23], stroma=val[,24:44]))
})

save(mod_clinical_fdr, file="../../mixtApp/data/mod_clinical_fdr.RData")

load(url(("https://rawgit.com/vdumeaux/mixtApp/master/data/mod_clinical_fdr.rda")))

str(mod_clinical_fdr)
head(mod_clinical_fdr$all)

5. Investigating gene overlap between gene sets across tissues

We first exclude genes that are not present in both tissue and check significance of overlap between modules across tissue using a Fisher exact test.

overlapFun <- function (labels1, labels2)
{
    labels1 <-  as.vector(labels1)
    labels2 <-  as.vector(labels2)

    levels1 <-  sort(unique(labels1))
    levels2 <-  sort(unique(labels2))

    n1 <-  length(levels1)
    n2 <-  length(levels2)

    countMat <-   matrix(0, n1, n2)
    pMat <-  matrix(0, n1, n2)

    for (m1 in 1:n1) for (m2 in 1:n2) {
        m1Members = (labels1 == levels1[m1])
        m2Members = (labels2 == levels2[m2])

        t <- table(m1Members, m2Members)
        # make sure all levels are included
        tab <- matrix(0, 2, 2)
        tab[match(rownames(t), c(FALSE, TRUE)), match(colnames(t), c(FALSE, TRUE))] <- t

        pMat[m1, m2] = fisher.test(tab, alternative = "greater")$p.value
        countMat[m1, m2] = sum(labels1 == levels1[m1] & labels2 ==
            levels2[m2])
    }
    dimnames(pMat) = list(levels1, levels2)
    dimnames(countMat) = list(levels1, levels2)
    pMat[is.na(pMat)] = 1
    list(countTable = countMat, pTable = pMat)
}

str(moduleColors)

modColors <- list()
modColors$epi <- moduleColors$epi[names(moduleColors$epi) %in% names(moduleColors$stroma)]
modColors$stroma <- moduleColors$stroma[names(moduleColors$stroma) %in% names(moduleColors$epi)]
str(modColors)

gene.set.overlap <-  overlapFun(modColors[[1]], modColors[[2]])
str(gene.set.overlap)

P-values are adjusted for multiple testing using the false discovery rate of (Benjamini and Hochberg, 1995).

gene.set.overlap$pTable <- matrix(
  p.adjust(gene.set.overlap$pTable, method = "BH"),
  nrow = nrow(gene.set.overlap$pTable),
  ncol = ncol(gene.set.overlap$pTable), 
  dimnames = list(rownames(gene.set.overlap$pTable),
                  colnames(gene.set.overlap$pTable))
  )

knitr::kable(gene.set.overlap$pTable[1:10, 1:8])

6. Matched Interactions Across Tissues

Interactions between gene sets/modules across tissues are identified using a random permutation approach based on the Pearson correlation between ranksums of gene expression in gene sets/modules across tissues. This is run independently within each patient cohort.

perm_cor_p <- NULL

perm_cor_p$epi2 <- stat.ranksum (mixt.ranksum = bresat, 
                                       tissue1="epi", tissue2="epi",
                                       nRuns = 10000, mc.cores = 80)
perm_cor_p$epi_stroma <- stat.ranksum (mixt.ranksum = bresat, 
                                       tissue1="epi", tissue2="stroma",
                                       nRuns = 10000, mc.cores = 80)
perm_cor_p$stroma2 <- stat.ranksum (mixt.ranksum = bresat, 
                                       tissue1="stroma", tissue2="stroma",
                                       nRuns = 10000, mc.cores = 80)

save(perm_cor_p, file="../../mixtApp/data/perm_cor_p.rda")

load(url(("https://rawgit.com/vdumeaux/mixtApp/master/data/perm_cor_p.rda")))

Gene sets from tissue 1 (epi) are in row and gene sets from tissue 2 (stroma) are in column.

str(perm_cor_p)
head(perm_cor_p$epi_stroma$all)

You can then examine these associations with different plots

cohort_scatterplot(mixt.ranksum = bresat, 
                   x.tissue = "epi", x.module = "darkgrey", 
                   y.tissue = "stroma", y.module = "grey60", 
                   cohort.name = "ern")

cohort_scatterplot(mixt.ranksum = bresat, 
                   x.tissue = "epi", x.module = "darkgrey", 
                   y.tissue = "epi", y.module = "grey60", 
                   cohort.name = "ern")

cohort_scatterplot(mixt.ranksum = bresat, 
                   x.tissue = "stroma", x.module = "darkgrey", 
                   y.tissue = "stroma", y.module = "grey60", 
                   cohort.name = "ern")

cohort_heatmap(mixt.dat = dat, mixt.ranksum = bresat, tissue = "stroma", 
               module = "grey60", cohort.name = "ern", cl.height = 3)

cohort_heatmap(mixt.dat = dat, mixt.ranksum = bresat, tissue = "epi", 
               module = "darkgrey", cohort.name = "ern", cl.height = 3, 
               orderByTissue = "stroma", orderByModule = "grey60")

cohort_boxplot(mixt.ranksum = bresat, tissue = "epi", module = "darkgrey", 
               cohort.name = "ern",
               orderByTissue = "stroma", orderByModule = "grey60")