runGSA | R Documentation |
Performs gene set analysis (GSA) based on a given number of gene-level statistics and a gene set collection, using a variety of available methods, returning the gene set statistics and p-values of different directionality classes.
runGSA(
  geneLevelStats,
  directions = NULL,
  geneSetStat = "mean",
  signifMethod = "geneSampling",
  adjMethod = "fdr",
  gsc,
  gsSizeLim = c(1, Inf),
  permStats = NULL,
  permDirections = NULL,
  nPerm = 10000,
  gseaParam = 1,
  ncpus = 1,
  verbose = TRUE
)
geneLevelStats |
a vector or a one-column data.frame or matrix, containing the gene level statistics. Gene level statistics can be e.g. p-values, t-values or F-values. |
directions |
a vector or a one-column data.frame or matrix, containing fold-change like values for the related gene-level statistics. This is mainly used if statistics are p-values or F-values, but not required. The values should be positive or negative, but only the sign information will be used, so the actual value will not matter. |
geneSetStat |
the statistical GSA method to use. Can be one of
|
signifMethod |
the method for significance assessment of gene sets,
i.e. p-value calculation. Can be one of |
adjMethod |
the method for adjusting for multiple testing. Can be any
of the methods supported by |
gsc |
a gene set collection given as an object of class |
gsSizeLim |
a vector of length two, giving the minimum and maximum gene
set size (number of member genes) to be kept for the analysis. Defaults to
|
permStats |
a matrix with permuted gene-level statistics (columns)
for each gene (rows). This should be calculated by the user by randomizing
the sample labels in the original data and recalculating the gene-level
statistics for each comparison a large number of times, thus generating a
vector (a row in the matrix) of background statistics for each gene. This
argument is required, and only used, if
|
permDirections |
similar to |
nPerm |
the number of permutations to use for gene sampling, i.e. if
|
gseaParam |
the exponent parameter of the GSEA and FGSEA approach. This defaults to 1, as recommended by the GSEA authors. |
ncpus |
the number of CPUs to use. If larger than 1, the gene
permutation part will be run in parallel, which decreases runtime. Requires
the R package snowfall to be installed. Should be set so that
|
verbose |
a logical. Whether or not to display progress messages during the analysis. |
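The description of permStats above can be illustrated with a minimal base-R sketch. Everything in it is an assumption made for illustration: the toy expression matrix `expr`, the group `labels`, the helper `rowT` (a plain Welch t-statistic) and the number of shuffles are hypothetical and not part of piano. It only shows the shape the argument description asks for: genes in rows, one column per label permutation.

```r
# Hypothetical toy data: 50 genes x 10 samples, two groups of 5 samples each.
set.seed(1)
expr <- matrix(rnorm(50 * 10), nrow = 50,
               dimnames = list(paste0("g", 1:50), paste0("s", 1:10)))
labels <- rep(c("A", "B"), each = 5)

# Illustrative helper (not a piano function): Welch t-statistic per gene (row),
# comparing group B against group A.
rowT <- function(x, lab) {
  a <- x[, lab == "A", drop = FALSE]
  b <- x[, lab == "B", drop = FALSE]
  se <- sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  (rowMeans(b) - rowMeans(a)) / se
}

# Gene-level statistics for the real sample labels.
tObserved <- rowT(expr, labels)

# Shuffle the sample labels and recompute the statistics each time.
# Result: one row per gene, one column per permutation.
nShuffles <- 200
permStats <- sapply(seq_len(nShuffles), function(i) rowT(expr, sample(labels)))

dim(permStats)
```

A matrix built this way, with row names matching those of the observed statistics, is the kind of object the permStats argument describes.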
The rownames of geneLevelStats and directions should be identical and match the names of the members of the gene sets in gsc.
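The naming requirement can be checked before calling runGSA. The sketch below is a hedged illustration in base R: `pvals`, `dirs` and `geneSets` are hypothetical, and the plain named list is only a stand-in for a gene set collection, not an actual gsc object.

```r
# Hypothetical inputs: named p-values, named directions, and a plain named
# list standing in for the gene sets (not an actual gsc object).
pvals <- c(geneA = 0.001, geneB = 0.20, geneC = 0.04, geneD = 0.73)
dirs  <- c(geneA = 1.8,   geneB = -0.3, geneC = -2.1, geneD = 0.1)
geneSets <- list(set1 = c("geneA", "geneC", "geneX"),
                 set2 = c("geneB", "geneD"))

# The statistics and the directions should carry identical names.
sameNames <- identical(names(pvals), names(dirs))

# Gene set members that have no gene-level statistic.
unmatched <- setdiff(unique(unlist(geneSets)), names(pvals))

# Number of members per gene set that do have a statistic.
keptSizes <- sapply(geneSets, function(s) sum(s %in% names(pvals)))

sameNames
unmatched
keptSizes
```

Here `geneX` has no statistic, so `set1` would effectively contribute two genes rather than three.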
If geneSetStat is set to "fisher", "stouffer", "reporter" or "tailStrength", only p-values are allowed as geneLevelStats. If geneSetStat is set to "maxmean", "gsea", "fgsea" or "page", only t-like geneLevelStats are allowed (e.g. t-values, fold-changes).
For geneSetStat set to "fisher", "stouffer", "reporter", "wilcoxon" or "page", the gene set p-values can be calculated from a theoretical null distribution; in this case, set signifMethod="nullDist". For all methods signifMethod="geneSampling" or signifMethod="samplePermutation" can be used, except for "fgsea", where only signifMethod="geneSampling" is allowed. If signifMethod="geneSampling", gene sampling is used, meaning that the gene labels are randomized nPerm times and the gene set statistics are recalculated so that a background distribution for each original gene set is acquired. The gene set p-values are calculated based on this background distribution. Similarly, if signifMethod="samplePermutation", sample permutation is used. In this case the argument permStats (and optionally permDirections) has to be supplied.
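The gene sampling idea described above can be sketched in a few lines of base R for a single gene set with the mean as gene set statistic. This is an illustration under stated assumptions, not the piano implementation: the toy statistics, the gene set, and the way the p-value is derived from the background (including the +1 correction) are all choices made here.

```r
set.seed(42)
# Hypothetical t-like gene-level statistics for 200 genes.
stats <- setNames(rnorm(200), paste0("g", 1:200))

# A toy gene set of 15 genes, shifted upward so that it carries a signal.
setGenes <- paste0("g", 1:15)
stats[setGenes] <- stats[setGenes] + 2

# Observed gene set statistic (mean of the member genes).
observed <- mean(stats[setGenes])

# Randomizing the gene labels is equivalent to drawing a random set of the
# same size; repeat nPerm times to build the background distribution.
nPerm <- 1000
background <- replicate(nPerm, mean(sample(stats, length(setGenes))))

# One-sided p-value for up-regulation; the +1 avoids a p-value of exactly 0
# (an assumption of this sketch, piano may compute this differently).
pUp <- (sum(background >= observed) + 1) / (nPerm + 1)
pUp
```

The resolution of such a p-value is limited by the number of permutations, which is one reason to think carefully about the value of nPerm.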
The runGSA function returns p-values for each gene set. Depending on the choice of methods and gene statistics, up to three classes of p-values can be calculated, describing different aspects of regulation directionality. The three directionality classes are Distinct-directional, Mixed-directional and Non-directional.
The non-directional p-values (pNonDirectional) are calculated based on absolute values of the gene statistics (or p-values without sign information), meaning that gene sets containing a high proportion of significant genes, independent of direction, will turn up significant. That is, gene sets with a low pNonDirectional should be interpreted to be significantly affected by gene regulation, but there can be a mix of both up- and down-regulation involved.
The mixed-directional p-values (pMixedDirUp and pMixedDirDn) are calculated using the subset of the gene statistics that are up-regulated and down-regulated, respectively. This means that a gene set with a low pMixedDirUp will have a component of significantly up-regulated genes, regardless of the extent of down-regulated genes, and the reverse for pMixedDirDn. This also means that one can get gene sets that are both significantly affected by down-regulation and significantly affected by up-regulation at the same time. Note that sample permutation cannot be used to calculate pMixedDirUp and pMixedDirDn since the subset sizes will differ.
Finally, the distinct-directional p-values (pDistinctDirUp and pDistinctDirDn) are calculated from statistics with sign information (e.g. t-statistics). In this case, if a gene set contains both up- and down-regulated genes, they will cancel each other out. A gene set with a low pDistinctDirUp will be significantly affected by up-regulation, but not by a mix of up- and down-regulation (as in the case of the mixed-directional and non-directional p-values). In order to be able to calculate distinct-directional gene set p-values while using p-values as gene-level statistics, the gene-level p-values are transformed as follows: the up-regulated portion of the p-values is divided by 2 (scaled to the range 0 to 0.5) and the down-regulated portion of the p-values is set to 1-p/2 (scaled to the range 0.5 to 1). This means that a significantly down-regulated gene will get a p-value close to 1. These new p-values are used as input to the gene set analysis procedure to get pDistinctDirUp. Similarly, the opposite is done, so that the up-regulated portion is scaled to the range 0.5 to 1 and the down-regulated portion to the range 0 to 0.5, to get pDistinctDirDn.
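The p-value transformation for the distinct-directional class can be written out directly. The sketch below uses hypothetical p-values and directions and base R only; the variable names `pForUp` and `pForDn` are illustrative and do not correspond to piano internals.

```r
# Hypothetical gene-level p-values and fold-change-like directions.
p   <- c(g1 = 0.01, g2 = 0.01, g3 = 0.50, g4 = 0.20)
dir <- c(g1 = 2.0,  g2 = -2.0, g3 = 0.4,  g4 = -0.7)
up  <- dir > 0   # only the sign of the directions is used

# Input for pDistinctDirUp: up-regulated genes get p/2 (range 0 to 0.5),
# down-regulated genes get 1 - p/2 (range 0.5 to 1).
pForUp <- ifelse(up, p / 2, 1 - p / 2)

# Input for pDistinctDirDn: the opposite scaling.
pForDn <- ifelse(up, 1 - p / 2, p / 2)

# Mixed-directional classes instead use the two subsets separately.
pUpSubset <- p[up]
pDnSubset <- p[!up]

rbind(pForUp, pForDn)
```

Genes g1 and g2 have the same raw p-value but opposite signs, so they end up at opposite ends of the transformed scale, which is how up- and down-regulation can cancel out within a gene set.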
A list-like object of class GSAres containing the following elements:
geneStatType |
The interpreted type of gene-level statistics |
geneSetStat |
The method for gene set statistic calculation |
signifMethod |
The method for significance estimation |
adjMethod |
The method of adjustment for multiple testing |
info |
A list object with detailed info on the number of genes and gene sets |
gsSizeLim |
The selected gene set size limits |
gsStatName |
The name of the gene set statistic type |
nPerm |
The number of permutations |
gseaParam |
The GSEA parameter |
geneLevelStats |
The input gene-level statistics |
directions |
The input directions |
gsc |
The input gene set collection |
nGenesTot |
The total number of genes in each gene set |
nGenesUp |
The number of up-regulated genes in each gene set |
nGenesDn |
The number of down-regulated genes in each gene set |
statDistinctDir |
Gene set statistics of the distinct-directional class |
statDistinctDirUp |
Gene set statistics of the distinct-directional class |
statDistinctDirDn |
Gene set statistics of the distinct-directional class |
statNonDirectional |
Gene set statistics of the non-directional class |
statMixedDirUp |
Gene set statistics of the mixed-directional class |
statMixedDirDn |
Gene set statistics of the mixed-directional class |
pDistinctDirUp |
Gene set p-values of the distinct-directional class |
pDistinctDirDn |
Gene set p-values of the distinct-directional class |
pNonDirectional |
Gene set p-values of the non-directional class |
pMixedDirUp |
Gene set p-values of the mixed-directional class |
pMixedDirDn |
Gene set p-values of the mixed-directional class |
pAdjDistinctDirUp |
Adjusted gene set p-values of the distinct-directional class |
pAdjDistinctDirDn |
Adjusted gene set p-values of the distinct-directional class |
pAdjNonDirectional |
Adjusted gene set p-values of the non-directional class |
pAdjMixedDirUp |
Adjusted gene set p-values of the mixed-directional class |
pAdjMixedDirDn |
Adjusted gene set p-values of the mixed-directional class |
runtime |
The execution time in seconds |
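The relation between the raw and adjusted p-value elements listed above can be illustrated with base R's p.adjust. That adjMethod values such as the default "fdr" correspond to p.adjust methods is an assumption of this sketch, since the argument description is truncated here; the p-values themselves are hypothetical.

```r
# Hypothetical raw gene set p-values for one directionality class.
pRaw <- c(setA = 0.001, setB = 0.02, setC = 0.03, setD = 0.8)

# "fdr" is an alias for the Benjamini-Hochberg method in p.adjust.
pAdj <- p.adjust(pRaw, method = "fdr")
pAdj
```

Each directionality class would be adjusted separately, giving one adjusted vector per raw p-value vector.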
Leif Varemo piano.rpkg@gmail.com and Intawat Nookaew piano.rpkg@gmail.com
Fisher, R. Statistical methods for research workers. Oliver and Boyd, Edinburgh, (1932).
Stouffer, S., Suchman, E., Devinney, L., Star, S., and Williams Jr, R. The American soldier: adjustment during army life. Princeton University Press, Oxford, England, (1949).
Patil, K. and Nielsen, J. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proceedings of the National Academy of Sciences of the United States of America 102(8), 2685 (2005).
Oliveira, A., Patil, K., and Nielsen, J. Architecture of transcriptional regulatory circuits is knitted over the topology of bio-molecular interaction networks. BMC Systems Biology 2(1), 17 (2008).
Kim, S. and Volsky, D. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 6(1), 144 (2005).
Taylor, J. and Tibshirani, R. A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics 7(2), 167-181 (2006).
Mootha, V., Lindgren, C., Eriksson, K., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34(3), 267-273 (2003).
Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545-15550 (2005).
Efron, B. and Tibshirani, R. On testing the significance of sets of genes. The Annals of Applied Statistics 1, 107-129 (2007).
piano, loadGSC, GSAsummaryTable, geneSetSummary, networkPlot2, exploreGSAres, samr, limma, GSA, fgsea
# Load the piano package:
library(piano)
# Load example input data to GSA:
data("gsa_input")
# Load gene set collection:
gsc <- loadGSC(gsa_input$gsc)
# Run gene set analysis:
gsares <- runGSA(geneLevelStats=gsa_input$pvals, directions=gsa_input$directions,
                 gsc=gsc, nPerm=500)