gsva: Gene Set Variation Analysis

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

Estimates GSVA enrichment scores.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
## S4 method for signature 'SummarizedExperiment,GeneSetCollection'
gsva(expr, gset.idx.list, annotation,
    method=c("gsva", "ssgsea", "zscore", "plage"),
    kcdf=c("Gaussian", "Poisson", "none"),
    abs.ranking=FALSE,
    min.sz=1,
    max.sz=Inf,
    parallel.sz=1L,
    mx.diff=TRUE,
    tau=switch(method, gsva=1, ssgsea=0.25, NA),
    ssgsea.norm=TRUE,
    verbose=TRUE,
    BPPARAM=SerialParam(progressbar=verbose))
## S4 method for signature 'SummarizedExperiment,list'
gsva(expr, gset.idx.list, annotation,
    method=c("gsva", "ssgsea", "zscore", "plage"),
    kcdf=c("Gaussian", "Poisson", "none"),
    abs.ranking=FALSE,
    min.sz=1,
    max.sz=Inf,
    parallel.sz=1L,
    mx.diff=TRUE,
    tau=switch(method, gsva=1, ssgsea=0.25, NA),
    ssgsea.norm=TRUE,
    verbose=TRUE,
    BPPARAM=SerialParam(progressbar=verbose))
## S4 method for signature 'ExpressionSet,list'
gsva(expr, gset.idx.list, annotation,
    method=c("gsva", "ssgsea", "zscore", "plage"),
    kcdf=c("Gaussian", "Poisson", "none"),
    abs.ranking=FALSE,
    min.sz=1,
    max.sz=Inf,
    parallel.sz=1L,
    mx.diff=TRUE,
    tau=switch(method, gsva=1, ssgsea=0.25, NA),
    ssgsea.norm=TRUE,
    verbose=TRUE,
    BPPARAM=SerialParam(progressbar=verbose))
## S4 method for signature 'ExpressionSet,GeneSetCollection'
gsva(expr, gset.idx.list, annotation,
    method=c("gsva", "ssgsea", "zscore", "plage"),
    kcdf=c("Gaussian", "Poisson", "none"),
    abs.ranking=FALSE,
    min.sz=1,
    max.sz=Inf,
    parallel.sz=1L,
    mx.diff=TRUE,
    tau=switch(method, gsva=1, ssgsea=0.25, NA),
    ssgsea.norm=TRUE,
    verbose=TRUE,
    BPPARAM=SerialParam(progressbar=verbose))
## S4 method for signature 'matrix,GeneSetCollection'
gsva(expr, gset.idx.list, annotation,
    method=c("gsva", "ssgsea", "zscore", "plage"),
    kcdf=c("Gaussian", "Poisson", "none"),
    abs.ranking=FALSE,
    min.sz=1,
    max.sz=Inf,
    parallel.sz=1L,
    mx.diff=TRUE,
    tau=switch(method, gsva=1, ssgsea=0.25, NA),
    ssgsea.norm=TRUE,
    verbose=TRUE,
    BPPARAM=SerialParam(progressbar=verbose))
## S4 method for signature 'matrix,list'
gsva(expr, gset.idx.list, annotation,
    method=c("gsva", "ssgsea", "zscore", "plage"),
    kcdf=c("Gaussian", "Poisson", "none"),
    abs.ranking=FALSE,
    min.sz=1,
    max.sz=Inf,
    parallel.sz=1L,
    mx.diff=TRUE,
    tau=switch(method, gsva=1, ssgsea=0.25, NA),
    ssgsea.norm=TRUE,
    verbose=TRUE,
    BPPARAM=SerialParam(progressbar=verbose))

Arguments

expr

Gene expression data which can be given either as a SummarizedExperiment or ExpressionSet object, or as a matrix of expression values where rows correspond to genes and columns correspond to samples.

gset.idx.list

Gene sets provided either as a list object or as a GeneSetCollection object.

annotation

In the case of calling gsva() on a SummarizedExperiment object, the annotation argument can be used to select the assay containing the molecular data we want as input to the gsva() function, otherwise the first assay is selected. In the case of calling gsva() with expression data in a matrix and gene sets as a GeneSetCollection object, the annotation argument can be used to supply the name of the Bioconductor package that contains annotations for the class of gene identifiers occurring in the row names of the expression data matrix. In the case of calling gsva() on a ExpressionSet object, the annotation argument is ignored. See details information below.

method

Method to employ in the estimation of gene-set enrichment scores per sample. By default this is set to gsva (Hänzelmann et al, 2013) and other options are ssgsea (Barbie et al, 2009), zscore (Lee et al, 2008) or plage (Tomfohr et al, 2005). The latter two standardize first expression profiles into z-scores over the samples and, in the case of zscore, it combines them together as their sum divided by the square-root of the size of the gene set, while in the case of plage they are used to calculate the singular value decomposition (SVD) over the genes in the gene set and use the coefficients of the first right-singular vector as pathway activity profile.

kcdf

Character string denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples when method="gsva". By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson".

abs.ranking

Flag used only when mx.diff=TRUE. When abs.ranking=FALSE (default) a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs.ranking=TRUE the original Kuiper statistic that sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as 'highly' activated.

min.sz

Minimum size of the resulting gene sets.

max.sz

Maximum size of the resulting gene sets.

parallel.sz

Number of threads of execution to use when doing the calculations in parallel. The argument BPPARAM allows one to set the parallel back-end and fine tune its configuration.

mx.diff

Offers two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic. mx.diff=FALSE: ES is calculated as the maximum distance of the random walk from 0. mx.diff=TRUE (default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations.

tau

Exponent defining the weight of the tail in the random walk performed by both the gsva (Hänzelmann et al., 2013) and the ssgsea (Barbie et al., 2009) methods. By default, this tau=1 when method="gsva" and tau=0.25 when method="ssgsea" just as specified by Barbie et al. (2009) where this parameter is called alpha.

ssgsea.norm

Logical, set to TRUE (default) with method="ssgsea" runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. When ssgsea.norm=FALSE this last normalization step is skipped.

verbose

Gives information about each calculation step. Default: FALSE.

BPPARAM

An object of class BiocParallelParam specifiying parameters related to the parallel execution of some of the tasks and calculations within this function.

Details

GSVA assesses the relative enrichment of gene sets across samples using a non-parametric approach. Conceptually, GSVA transforms a p-gene by n-sample gene expression matrix into a g-geneset by n-sample pathway enrichment matrix. This facilitates many forms of statistical analysis in the 'space' of pathways rather than genes, providing a higher level of interpretability.

By default, gsva() will try to match the identifiers in expr to the identifiers in gset.idx.list just as they are, unless the annotation argument is set.

The gsva() function first maps the identifiers in the gene sets in gset.idx.list to the identifiers in the input expression data expr. When the input gene sets in gset.idx.list is provided as a list object, gsva() will try to match the identifiers in expr directly to the identifiers in gset.idx.list just as they are. Because unmatching identifiers will be discarded in both, expr and gset.idx.list, gsva() may prompt an error if no identifiers can be matched as in the case of different types of identifiers (e.g., gene symbols vs Entrez identitifers).

However, then the input gene sets in gset.idx.list is provided as a GeneSetCollection object, gsva() will try to automatically convert those identifiers to the type of identifier in the input expression data expr. Such an automatic conversion, however, will only occur in three scenarios: 1. when expr is an ExpressionSet object with an appropriately set annotation slot; 2. when expr is a SummarizedExperiment object with an appropriately set annotation slot in the metadata of expr; 3. when expr is a matrix and the annotation argument of the gsva() function is set to the name of the annotation package that provides the relationships between the type of identifiers in expr and gset.idx.list.

The collection of gene sets resulting from the previous identifier matching, can be further filtered to require a minimun and/or maximum size by using the arguments min.sz and max.sz.

Value

A gene-set by sample matrix of GSVA enrichment scores.

Author(s)

J. Guinney and R. Castelo

References

Barbie, D.A. et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(5):108-112, 2009.

Hänzelmann, S., Castelo, R. and Guinney, J. GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics, 14:7, 2013.

Lee, E. et al. Inferring pathway activity toward precise disease classification. PLoS Comp Biol, 4(11):e1000217, 2008.

Tomfohr, J. et al. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics, 6:225, 2005.

See Also

filterGeneSets computeGeneSetsOverlap

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
library(limma)

p <- 10 ## number of genes
n <- 30 ## number of samples
nGrp1 <- 15 ## number of samples in group 1
nGrp2 <- n - nGrp1 ## number of samples in group 2

## consider three disjoint gene sets
geneSets <- list(set1=paste("g", 1:3, sep=""),
                 set2=paste("g", 4:6, sep=""),
                 set3=paste("g", 7:10, sep=""))

## sample data from a normal distribution with mean 0 and st.dev. 1
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
            dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))

## genes in set1 are expressed at higher levels in the last 'nGrp1+1' to 'n' samples
y[geneSets$set1, (nGrp1+1):n] <- y[geneSets$set1, (nGrp1+1):n] + 2

## build design matrix
design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2)))

## fit linear model
fit <- lmFit(y, design)

## estimate moderated t-statistics
fit <- eBayes(fit)

## genes in set1 are differentially expressed
topTable(fit, coef="sampleGroup2vs1")

## estimate GSVA enrichment scores for the three sets
gsva_es <- gsva(y, geneSets, mx.diff=1)

## fit the same linear model now to the GSVA enrichment scores
fit <- lmFit(gsva_es, design)

## estimate moderated t-statistics
fit <- eBayes(fit)

## set1 is differentially expressed
topTable(fit, coef="sampleGroup2vs1")

GSVA documentation built on Feb. 11, 2021, 2:02 a.m.