In Accio/BioQC: Detect tissue heterogeneity in expression profiles with gene sets

library(knitr)
opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center")

Introduction

Besides being used as a QC-tool for gene expression data, BioQC can be used for general-purpose gene-set enrichment analysis. Its core function, wmwTest, can handle not only "normal", unsigned gene sets, but also signed genesets where two sets of genes represent signatures that are positively and negatively regulated respectively.

This vignette describes an example of such applications. First we load the BioQC library.

library(BioQC)

GMT read-in

We first open a toy GMT file containing signed genesets with readSignedGmt. The function reads the file into a SignedGenesets object.

gmtFile <- system.file("extdata/test.gmt", package="BioQC")
## print the file context
cat(readLines(gmtFile), sep="\n")
## read the file into SignedGenesets
genesets <- readSignedGmt(gmtFile)
print(genesets)

Note that though the GMT file contains 6 non-empty lines, only five signed genesets are returned by readSignedGmt because a pair of negative and negative geneset of GS_C is available.

Note that in its current form, no genes are reported for GS_A and GS_B, because the names of both sets do not match the patterns. User can decide how to handle such genesets that match patterns of neither positive nor negative sets (GS_A and GS_B in this case). By default, such genesets are ignored; however, user can decide to treat them as either positive or negative genesets. For the purpose of this tutorial, we treat these genesets as positive.

genesets <- readSignedGmt(gmtFile, nomatch="pos")
print(genesets)

Expression object construction

Next we construct an ExpressionSet object containing 100 genes, with some of the genes in the GMT file present.

set.seed(1887)
testN <- 100L
testSampleCount <- 3L
testGenes <- c("AKT1", "AKT2", "ERBB2", "ERBB3", "EGFR","TSC1", "TSC2", "GATA2", "GATA4", "GATA1", "GATA3")
testRows <- c(testGenes, paste("Gene", (length(testGenes)+1):testN, sep=""))
testMatrix <- matrix(rnorm(testN*testSampleCount, sd=0.1), nrow=testN, dimnames=list(testRows, NULL))
testMatrix[1:2,] <- testMatrix[1:2,]+10
testMatrix[6:7,] <- testMatrix[6:7,]-10
testMatrix[3:4,] <- testMatrix[3:4,]-5
testMatrix[5,] <- testMatrix[5,]+5
testEset <- new("ExpressionSet", exprs=testMatrix)

The expression of somes genes are delibrately changed so that some genesets are expected to be significantly enriched.

AKT1 and AKT2 are higher expressed than the background genes ('background' hereafter).
TSC1 and TSC2 are lower expressed than the background.
ERBB2 and ERBB3 are moderately lower expressed than the background.
EGFR is moderately higher expressed than the background.

Considering the geneset composition printed above, we can expect that GS_A shall be significantly up-regulated; GS_B shall not be significant, because its genes are missing; GS_C shall be significantly down-regulated, with positive targets down-regulated and negative targets up-regulated; GS_D shall not always be significant, because its genes are comparable to the background; * GS_E shall be significantly up-regulated, because its negative targets are down-regulated.

Match genes

With both SignedGenesets and ExpressionSet objects at hand, we can query the indices of genes of genesets in the ExpressionSet object.

testIndex <- matchGenes(genesets, testEset, col=NULL)
print(testIndex)

If the ExpressionSet object has GeneSymbols in its featureData other than the feature names, user can specify the parameter col in matchGenes to make the match working. matchGene also works with characters and matrix as input, please check out the help pages for such uses.

Perform the analysis

wmwResult.greater <- wmwTest(testEset, testIndex, valType="p.greater")
print(wmwResult.greater)

As expected, GS_A and GS_E are significant when the alternative hypothesis is greater.

wmwResult.less <- wmwTest(testEset, testIndex, valType="p.less")
print(wmwResult.less)

As expected, GS_C is significant when the alternative hypothesis is less.

wmwResult.two.sided <- wmwTest(testEset, testIndex, valType="p.two.sided")
print(wmwResult.two.sided)

As expected, GS_A, GS_C, and GS_E are significant when the alternative hypothesis is two.sided.

A simple way to integrate the results of both alternative hypotheses into one numerical value is the Q value, which is defined as absolute log10 transformation of the smaller p value of the two hypotheses, times the sign defined by greater=1 and less=-1.

wmwResult.Q <- wmwTest(testEset, testIndex, valType="Q")
print(wmwResult.Q)

As expected, genesets GS_A, GS_C, and GS_E have large absolute values, suggesting they are potentially enriched; while GS_A and GS_E are positively enriched, GS_C is negatively enriched.