GSNormalize: Aggregating and calculating expression statistics by Gene Set
In GSEAlm: Linear Model Toolset for Gene Set Enrichment Analysis

Description Usage Arguments Details Value Note Author(s) References See Also Examples

View source: R/lm4.R

Provides an interface for producing aggregate gene-set statistics, for gene-set-enrichment analysis (GSEA). The function is best suited for mean or rescaled-mean GSEA approaches, but is hopefully generic enough to enable other approaches as well.

GSNormalize(dataset, incidence, gseaFun = crossprod, fun1 = "/",
            fun2 = sqrt, removeShift=FALSE, removeStat=mean, ...) 
identity(x)
one(x)

`dataset`	a numeric matrix, typically of some gene-level statistics
`incidence`	0/1 incidence matrix indicating genes' membership in gene-sets
`gseaFun`	function name for the type of aggregation to take place, defaults to 'crossprod'. See 'Details'
`fun1`	function name for normalization, defaults to "/". See 'Details'
`fun2`	function name for scaling, defaults to 'sqrt'. See 'Details'
`removeShift`	logical: should normalization begin with a column-wise removal of the mean shift?
`removeStat`	(if above is TRUE) the column-wise statistic to be swept out of 'dataset'.
`...`	Additional arguments optionally passed on to 'gseaFun'.
`x`	any numerical value

In gene-set-enrichment analysis (GSEA), the core step is aggregating (or calculating) gene-set-level statistics from gene-set statistics. This utility achieves the feat. It is tailored specifically for rescaled-sums of the type suggested by Jiang and Gentleman (2007), but is designed as a generic template that should other GSEA approaches. In such cases, at this moment users should provide their own version of 'gseaFun'.

The default will generate sums of gene-level values divided by the square-root of the gene-set size (in other words, gene-set means multiplied by the square-root of gene-set size). The arithmetic works like this:

gene-set stat = gseaFun(t(incidence),dataset),...) 'fun1' fun2(gene-set size).

In case there is a known (or suspected) overall baseline shift (i.e., the mass of gene-level stats is not centered around zero) it may be scientifically more meaningful to look for gene-set deviating from this baseline rather than from zero. In this case, you can set 'removeShift=TRUE'.

Also provided are the 'identity' function (identity = function(x) x), so that leaving 'gseaFun' and 'fun1' at their default and setting 'fun2 = identity' will generate gene-set means – and the 'one' function to neutralize the effect of both 'fun1' and 'fun2' (see note below).

'GSNormalize' returns a matrix with the same number of rows as 'incidence' and the same number of columns as 'dataset' (if 'dataset' is a vector, the output will be a vector as well). The respective row and column names will carry through from 'dataset' and 'incidence' to the output.

'identity' simply returns x. 'one' returns the number 1.

If you want to create your own GSEA function for 'gseaFun', note that it should receive the transposed incidence matrix as its first argument, and the gene-level stats as its second argument. In other words, both should have genes as rows. also, you can easily neutralize the effect of 'fun1', 'fun2' by setting "fun2 = one".

Assaf Oron

Z. Jiang and R. Gentleman, "Extensions to Gene Set Enrichment Analysis",Bioinformatics (23),306-313, 2007.

gsealmPerm, which relies heavily on this function. The function applyByCategory from the Category package has similar functionality and is preferable when the applied function is complicated. GSNormalize is better optimized for matrix operations.

data(sample.ExpressionSet)
lm1 = lmPerGene(sample.ExpressionSet,~sex+type)

### Generating random pseudo-gene-sets
fauxGS=matrix(sample(c(0,1),size=50000,replace=TRUE,prob=c(.9,.1)),nrow=100)

### "tau-stats" for gene-SET-level type effect, adjusting for sex
fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS)

qqnorm(fauxEffects)
### diagonal line represents zero-shift null; note that it doesn't fit
abline(0,1,col=2)
### a better option may be to run a diagonal through the middle of the
### data (nonzero-shift null, i.e. type may have an effect but it is the
### same for all gene-sets); note that if any outlier shows, it is a purely random one!

abline(median(fauxEffects),1,col=4)

#### Now try with baseline-shift removal

fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS,removeShift=TRUE)

qqnorm(fauxEffects)
abline(0,1,col=2)