Principal component gene set enrichment (PCGSE) algorithm
Description
Implementation of the PCGSE algorithm. Computes the statistical association between gene sets and the principal components of experimental data using a twostage competitive test. Supported genelevel test statistics include the PC loadings for each genomic variable, the Pearson correlation coefficients between each genomic variable and each PC and the Fishertransformed correlation coefficients. The input data is centered and scaled so that eigendecomposition is computed on the sample correlation matrix rather than the sample covariance matrix. Because the PC loadings for PCA on a correlation matrix are proportional to the Pearson correlation coefficients between each PC and each variable, all supported genelevel statistics provide a measure of correlation between genomic variables and PCs. Each gene set is quantified using either a standardized mean difference statistic or a standardized rank sum statistic. The statistical significance of each gene set test statistic is computed according to a competitive null hypothesis using either a parametric test, a correlationadjusted parametric test or a permutation test.
Usage
1 2 3 
Arguments
data 
Empirical data matrix, observationsbyvariables. Must be specified. Cannot contain missing values. 
prcomp.output 
Output of prcomp(data,center=T,scale=T). If not specified, it will be computed. 
pc.indexes 
Indices of the PCs for which enrichment should be computed. Defaults to 1. 
gene.sets 
Data structure that holds gene set membership information. Must be either a binary membership matrix or a list of gene set member indexes. For the member matrix, rows are gene sets, columns are genes, elements are binary membership values. For the membership index list, each element of the list represents a gene set and holds a vector of indexes of genes that are members. Must be a matrix if gene.set.test is set to "permutation". 
gene.statistic 
The genelevel statistic used to quantify the association between each genomic variable and each PC. Must be one of the following (default is "z"):

transformation 
Optional transformation to apply to the genelevel statistics. Must be one of the following (default is "none"):

gene.set.statistic 
The gene set statisic computed from the genelevel statistics. Must be one of the following (default is "mean.diff"):

gene.set.test 
The statistical test used to compute the significance of the gene set statistics under a competitive null hypothesis. The "parametric" test is in the "class 1" test category according to Barry et al., the "cor.adj.parametric" and "permutation" tests are in the "class 2" test category according to Barry et al. Must be one of the following (default is "cor.adj.parametric"):

nperm 
Number of permutations to perform. Only relevant if gene.set.test is set to "permutation". 
Value
List with the following elements:
p.values: Matrix with one row per gene set and one column for each tested PC. Elements are the twosided competitive enrichment pvalues. Multiple hypothesis correction is NOT applied to these pvalues.
statistics: Matrix with one row per gene set and one column for each tested PC. Elements are the gene set test statistics for each gene set.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71  library(MASS)
p=200 ## number of genomic variables
n=50 ## number of observations
f=20 ## number of gene sets
## Create annotation matrix with disjoint gene sets
gene.sets = matrix(0, nrow=f, ncol=p)
for (i in 1:f) {
gene.sets[i, ((i1)*p/f + 1):(i*p/f)] = 1
}
## Simulate MVN data with two population PCs whose loadings are
## associated with the first and second gene sets, respectively.
var1=2 ## variance of first population PC
var2=1 ## variance of second population PC
default.var=.1 ## background variance of population PCs
load = sqrt(.1) ## value of population loading vector for gene set 1 on PC 1 and set 2 on PC 2
## Creates a first PC with loadings for just the first 20 genes and a
## second PC with loadings for just the second 20 genes
loadings1 = c(rep(load,p/f), rep(0,pp/f))
loadings2 = c(rep(0,p/f), rep(load, p/f), rep(0, p2*p/f))
## Create the population covariance matrix
sigma = var1 * loadings1 %*% t(loadings1) + var2 * loadings2 %*% t(loadings2) +
diag(rep(default.var, p))
## Simulate MVN data
data = mvrnorm(n=n, mu=rep(0, p), Sigma=sigma)
## Perform PCA on the standardized data
prcomp.output = prcomp(data, center=TRUE, scale=TRUE)
## Execute PCGSE using Fishertransformed correlation coefficients as the genelevel statistics,
## the standardized mean difference as the gene set statistic and an unadjusted twosided,
## twosample ttest for the determination of statistical significance.
pcgse.results = pcgse(data=data,
prcomp.output=prcomp.output,
pc.indexes=1:2,
gene.sets=gene.sets,
gene.statistic="z",
transformation="none",
gene.set.statistic="mean.diff",
gene.set.test="parametric")
## Apply Bonferroni correction to pvalues
for (i in 1:2) {
pcgse.results$p.values[,i] = p.adjust(pcgse.results$p.values[,i], method="bonferroni")
}
## Display the pvalues for the first 5 gene sets for PCs 1 and 2
pcgse.results$p.values[1:5,]
## Execute PCGSE again but using a correlationadjusted ttest
pcgse.results = pcgse(data=data,
prcomp.output=prcomp.output,
pc.indexes=1:2,
gene.sets=gene.sets,
gene.statistic="z",
transformation="none",
gene.set.statistic="mean.diff",
gene.set.test="cor.adj.parametric")
## Apply Bonferroni correction to pvalues
for (i in 1:2) {
pcgse.results$p.values[,i] = p.adjust(pcgse.results$p.values[,i], method="bonferroni")
}
## Display the pvalues for the first 5 gene sets for PCs 1 and 2
pcgse.results$p.values[1:5,]
