In dmontaner/mdgsa: Multi Dimensional Gene Set Analysis.

BiocStyle::markdown()

library (knitr) ## this is needed when R CMD build
opts_chunk$set (message = FALSE)

Introduction

The mdgsa library implements the gene set analysis methodology developed in Montaner and Dopazo (2010). It presents a flexible framework for analyzing the enrichment of gene sets along a given ranking of genes. The novelty is that, not just one ranking index but two, may be analyzed and jointly explored in a multidimensional gene set analysis.

As classical GSEA, our approach allows for the functional profiling of isolated genomic characteristics; differential gene expression, copy number analyses, or variant to disease associations may be interpreted in terms of gene sets using the mdgsa package. But more interestingly, our multivariate approach may be used to find out gene set enrichments due to the combined effect of two of such genomic dimensions. We could for instance detect gene sets affected by the interaction of gene expression changes and copy number alterations.

Citation

Further description of the mdgsa methods may be found at:

Multidimensional gene set analysis of genomic data.
David Montaner and Joaquin Dopazo.
PLoS One. 2010 Apr 27;5(4):e10348. doi: 10.1371/journal.pone.0010348.

Functional Profiling of Gene Expression Data

In this tutorial we use the data in the Acute Lymphocytic Leukemia expression dataset package of Bioconductor. First we will use the limma library to compute a differential gene expression analysis. Then we will use the functions uvGsa and mdGsa in the mdgsa package to perform uni-dimensional and bi-dimensional gene set analyses respectively. The functional interpretation will be done in terms of KEGG Pathways. The annotation will be taken from the hgu95av2.db library in Bioconductor.

First we load the data and describe the design matrix of the experiment

library (ALL)
data (ALL)

des.mat <- model.matrix (~ 0 + mol.biol, data = ALL)
colnames (des.mat) <- c("ALL", "BCR", "E2A", "NEG", "NUP", "p15")
head (des.mat)

Then we can use limma to carry out some gene expression comparisons. We can for instance compare ALL samples to NEG control samples or explore gene differential expression between BCR and NEG.

library (limma)
cont.mat <- makeContrasts (ALL-NEG, BCR-NEG, levels = des.mat)
cont.mat

fit <- lmFit (ALL, design = des.mat)
fit <- contrasts.fit (fit, cont.mat)
fit <- eBayes (fit)

From this analysis we get test statistics and p-values for each of the two contrasts

fit$t[1:3,]
fit$p.value[1:3,]

These gene level information may be now interpreted in terms of gene sets. For this example we will carry out a gene set analysis using the functional blocks described in KEGG, but any other functional data base such as the Gene Ontology or even a customized one may be analyzed using mdgsa. We can get the KEGG annotation from the hgu95av2.db library as follows.

library (hgu95av2.db)
anmat <- toTable (hgu95av2PATH)
anmat[1:3,]

Univariate Gene Set Analysis

We can now carry out the functional interpretation of the contrast "BCR - NEG" for instance.

The data needed for the gene set analysis are the p-values and test statistics returned by limma at gene level.

fit$t[1:3, "BCR - NEG"]
fit$p.value[1:3, "BCR - NEG"]

We load the mdgsa library.

library (mdgsa)

The first step in the procedure is to combine p-values and test statistics in a single ranking value.

rindex <- pval2index (pval = fit$p.value[,"BCR - NEG"], sign = fit$t[,"BCR - NEG"])
rindex <- indexTransform (rindex)
rindex[1:3]

This ranking index keeps the sign of the test statistic; that is, the information of whether the gene is over or underexpressed.

plot (fit$t[,"BCR - NEG"], rindex)

but the evidence of the differential expression is taken directly from the p-value

plot (fit$p.value[,"BCR - NEG"], rindex)

Next we will need to format the annotation. The function annotMat2list in the mdgsa library converts the annotation matrix into a list.

anmat[1:3,]
annot <- annotMat2list (anmat)
length (annot)

Each element of the list contains the gene names of a gene set.

lapply (annot[1:3], head, n= 3)

It is also important to make sure that the gene universe described by the ranking index and the annotation are concordant. The function annotFilter encompasses the annotation list to the names in the ranking index; it is also used to exclude functional blocks too small to be considered gene sets, or too big to be specific of any biological process of interest.

annot <- annotFilter (annot, rindex)

Now everything is ready to carry out the univariate gene set analysis.

res.uv <- uvGsa (rindex, annot)

The output of the uvGsa function is a data frame which rows correspond to functional blocks analyzed.

res.uv[1:3,]

The uvGsa function fits a logistic regression model relating, the probability of genes belonging to the gene set, with the value of the ranking statistic.

Significant and positive log odds ratio (lor) indicate that the gene set is enriched in those genes with high values of the ranking statistic. In our example this means that genes up regulated in BCR compared to NEG are more likely to belong to the functional block. We could also say that the block of genes shows a significant degree of over expression BCR compared to NEG.

On the other hand, when a gene set has a negative log odds ratio we can say that the genes in the set are more likely to be associated to low values, negative in our case, of the ranking statistics. In our case this means that the gene set is down regulated in BCR compared to NEG.

The function uvPat helps you classifying the analyzed gene sets

res.uv[,"pat"] <- uvPat (res.uv, cutoff = 0.05)
table (res.uv[,"pat"])

Positive values (1) correspond to significant and positive log odds ratios. Negative values (-1) correspond to significant and negative log odds ratios. Zeros correspond to non enriched blocks.

As in this example we are analyzing KEGG pathways we can use the function getKEGGnames to find out the "name" of the pathways^[Similar function getGOnames is available to be used with Gene Ontology terms.].

res.uv[,"KEGG"] <- getKEGGnames (res.uv)

Finally, the function uvSignif may help us displaying just the enriched blocks.

res <- uvSignif (res.uv)
res[,c("pat", "KEGG")]

Multivariate Gene Set Analysis

But with the mdgsa library we can analyze not just one but two ranking statistics at a time.

In our example we were interested not in just one differential expression contrast but in two: ALL vs. NEG and BCR vs. NEG. We used limma to fit this two contrasts (see previous sections) and computed gene statistics and p-values for for each of them.

fit$t[1:3,]
fit$p.value[1:3,]

We can combine this two matrices in a single one containing a ranking statistic for each contrast, just as we did in the univariate example.

rindex <- pval2index (pval = fit$p.value, sign = fit$t)
rindex <- indexTransform (rindex)
rindex[1:3,]

Now we can explore the bi-dimensional distribution of this ranking indexes

plot (rindex, pch = ".")

and search for gene sets enrichment patterns in both dimensions

The same annotation list we filtered in the previous section may be used in this second example. Thus everything is ready to carry out our multidimensional analysis.

res.md <- mdGsa (rindex, annot)

As in the univariate analysis, the output of the mdGsa function is a data frame with a row per analyzed gene set.

res.md[1:3,]

The column of the row contain log odds ratios and p-values for each of the analyzed dimensions and also for their interaction effect. The function mdPat helps clarifying the bi-dimensional pattern of enrichment.

res.md[,"pat"] <- mdPat (res.md)
table (res.md[,"pat"])

And as before we can incorporate the KEGG names to our results.

res.md[,"KEGG"] <- getKEGGnames (res.md)
res.md[1:3,]

Thus we could for instance explore the KEGG classified as having a with a q3f pattern. The q3f classification means that the genes of this gene set are located in the third quadrant of the bivariate ranking index representation.

The plotMdGsa function help us understanding such pattern.

Q3 <- rownames (res.md)[res.md$pat == "q3f"]
Q3
plotMdGsa (rindex, block = annot[[Q3]], main = res.md[Q3, "KEGG"])

Red dots in the figure represent the genes within the gene set. They show, in both dimensions of our ranking, values significantly more negatives that the remaining genes in the study. This same pattern may be appreciated in the ellipses drawn in the figure. The blue one represents a confidence region for all the genes in the study. The red one shows the same confidence region but just for those genes within the gene set. We can see how the distribution of the genes in KEGG r Q3 is displaced towards the third quadrant of the plot. This indicates us that "r res.md[Q3, "KEGG"]" is a pathway jointly down regulated in both ALL and BCR when compared to the controls in the NEG group.

Similarly we can explore a bimodal (b13) KEGG. This pattern classification indicates us that the functional block has two sub-modules of genes with opposite patterns of expression. One subset of the KEGG is up-regulated in both conditions while the other subset is down-regulated also in both dimensions analyzed.

It may be worth pointing here that, this bimodal pattern will be missed by standard univariate gene set methods, and that it may be just detected in the multidimensional analysis.

BI <- rownames (res.md)[res.md$pat == "b13"]
plotMdGsa (rindex, block = annot[[BI]], main = res.md[BI, "KEGG"])

As third example we could display some KEGG enriched in one dimension but not in the other. The yh pattern indicates gene set over-representation just in the Y axis but not in the horizontal axis. In our case, KEGG pathways with at yh pattern are those over expressed in BRC compared to NEG but not enriched in the comparison between ALL and NEG.

Similarly an yl pattern indicates a down-regulation of the block in the BRC vs. NEG comparison but not in the ALL vs. NEG one.

The plot below displays one of such yl classified KEGGs.

rownames (res.md)[res.md$pat == "yl"]
YL <- "00280"
plotMdGsa (rindex, block = annot[[YL]], main = res.md[YL, "KEGG"])

All possible multidimensional enrichment patterns are listed in the Appendix.

Appendix

1.- Multidimensional Functional Classification {#appendix1}

All possible functional block classifications in the bi-dimensional gene set analysis are:

q1i: block displaced toward quadrant 1 (0 < X & 0 < Y) with interaction.
q2i: block displaced toward quadrant 2 (0 > X & 0 < Y) with interaction.
q3i: block displaced toward quadrant 3 (0 > X & 0 > Y) with interaction.
q4i: block displaced toward quadrant 4 (0 < X & 0 > Y) with interaction.
q1f: block displaced toward quadrant 1, no interaction.
q2f: block displaced toward quadrant 2, no interaction.
q3f: block displaced toward quadrant 3, no interaction.
q4f: block displaced toward quadrant 4, no interaction.
xh: block shifted to positive X values.
xl: block shifted to negative X values.
yh: block shifted to positive Y values.
yl: block shifted to negative Y values.
b13: bimodal block. Half of the genes displaced towards quadrant 1 and the other half towards quadrant 3.
b24: bimodal block. Half of the genes displaced towards quadrant 2 and the other half towards quadrant 4.
NS: non significant block.