runGSA: Gene set analysis

View source: R/runGSA.r

Description

Performs gene set analysis (GSA) based on a set of gene-level statistics and a gene set collection, using one of several available methods, and returns the gene set statistics and p-values for the different directionality classes.

Usage

runGSA(geneLevelStats, directions = NULL, geneSetStat = "mean",
  signifMethod = "geneSampling", adjMethod = "fdr", gsc,
  gsSizeLim = c(1, Inf), permStats = NULL, permDirections = NULL,
  nPerm = 10000, gseaParam = 1, ncpus = 1, verbose = TRUE)

Arguments

geneLevelStats

a vector or a one-column data.frame or matrix, containing the gene level statistics. Gene level statistics can be e.g. p-values, t-values or F-values.

directions

a vector or a one-column data.frame or matrix, containing fold-change like values for the related gene-level statistics. This is mainly used if statistics are p-values or F-values, but not required. The values should be positive or negative, but only the sign information will be used, so the actual value will not matter.

geneSetStat

the statistical GSA method to use. Can be one of "fisher", "stouffer", "reporter", "tailStrength", "wilcoxon", "mean", "median", "sum", "maxmean", "gsea", "fgsea" or "page". See below for details.

signifMethod

the method for significance assessment of gene sets, i.e. p-value calculation. Can be one of "geneSampling", "samplePermutation" or "nullDist".

adjMethod

the method for adjusting for multiple testing. Can be any of the methods supported by p.adjust, i.e. "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr" or "none". The exception is for geneSetStat="gsea", where only the options "fdr" and "none" can be used.

gsc

a gene set collection given as an object of class GSC as returned by the loadGSC function.

gsSizeLim

a vector of length two, giving the minimum and maximum gene set size (number of member genes) to be kept for the analysis. Defaults to c(1,Inf).

permStats

a matrix of permuted gene-level statistics, with one row per gene and one column per permutation. The user should compute it by randomizing the sample labels in the original data and recalculating the gene-level statistics, repeating this a large number of times so that each row of the matrix holds a vector of background statistics for one gene. This argument is required if signifMethod="samplePermutation" and is not used otherwise.

permDirections

similar to permStats, but containing fold-change like values for the corresponding permuted statistics. This is mainly used if the statistics are p-values or F-values, but is not required. The values should be positive or negative; only the sign is used, so the magnitude does not matter. This argument is optional and only used if signifMethod="samplePermutation". Note, however, that if directions is given then permDirections is also required, and vice versa.

nPerm

the number of permutations to use for gene sampling, i.e. if signifMethod="geneSampling". The original Reporter features algorithm (geneSetStat="reporter" and signifMethod="nullDist") also uses a permutation step which is controlled by nPerm.

gseaParam

the exponent parameter of the GSEA and FGSEA approach. This defaults to 1, as recommended by the GSEA authors.

ncpus

the number of CPUs to use. If larger than 1, the gene permutation part is run in parallel, which decreases runtime. Requires the R package snowfall to be installed. Should be set so that nPerm/ncpus is a positive integer. (Not used by FGSEA.)

verbose

a logical. Whether or not to display progress messages during the analysis.
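
The permStats argument described above has to be built by the user. The following is a minimal base R sketch of one way to assemble such a matrix. The toy expression matrix expr, the labels in group and the helper tStat are hypothetical and not part of piano; a real analysis would substitute its own data and gene-level test.

```r
# Toy data (hypothetical): 50 genes x 8 samples, two groups of four samples.
set.seed(1)
expr <- matrix(rnorm(50 * 8), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:8)))
group <- rep(c("ctrl", "treat"), each = 4)

# Hypothetical helper: Welch t-statistic per gene for a given labelling.
tStat <- function(x, labels) {
  apply(x, 1, function(g) {
    unname(t.test(g[labels == "treat"], g[labels == "ctrl"])$statistic)
  })
}

# Observed gene-level statistics (a vector named by gene).
geneLevelStats <- tStat(expr, group)

# Permute the sample labels and recompute the statistics each time.
# Each permutation fills one column; each row is one gene's background.
nSamplePerm <- 20
permStats <- sapply(seq_len(nSamplePerm),
                    function(i) tStat(expr, sample(group)))
rownames(permStats) <- rownames(expr)
```

The resulting objects would then be passed as runGSA(geneLevelStats, gsc = gsc, signifMethod = "samplePermutation", permStats = permStats), with gsc loaded separately through loadGSC. In practice far more than 20 permutations would be used.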

Details

The rownames of geneLevelStats and directions should be identical and match the names of the members of the gene sets in gsc. If geneSetStat is set to "fisher", "stouffer", "reporter" or "tailStrength" only p-values are allowed as geneLevelStats. If geneSetStat is set to "maxmean", "gsea", "fgsea" or "page" only t-like geneLevelStats are allowed (e.g. t-values, fold-changes).
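
As a rough illustration of the naming requirement above, the base R sketch below aligns two toy input vectors and looks for gene set members without a statistic. All object names and values are made up for the example, and the gene sets are held in a plain list rather than a GSC object.

```r
# Toy inputs (hypothetical): p-values and fold-change like directions.
pvals <- c(geneA = 0.001, geneB = 0.20, geneC = 0.03, geneD = 0.75)
directions <- c(geneB = 0.4, geneA = -1.2, geneD = 0.1, geneC = 2.3)

# Toy gene sets as a plain named list; a real analysis would use a GSC
# object from loadGSC instead.
geneSets <- list(set1 = c("geneA", "geneC", "geneX"),
                 set2 = c("geneB", "geneD"))

# Reorder directions so that its names are identical to those of pvals.
directions <- directions[names(pvals)]
stopifnot(identical(names(pvals), names(directions)))

# Gene set members that have no gene-level statistic.
missingGenes <- setdiff(unlist(geneSets, use.names = FALSE), names(pvals))
```

Here missingGenes contains "geneX", the only set member without a matching statistic.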

For geneSetStat set to "fisher", "stouffer", "reporter", "wilcoxon" or "page", the gene set p-values can be calculated from a theoretical null distribution; in this case, set signifMethod="nullDist". For all methods, signifMethod="geneSampling" or signifMethod="samplePermutation" can be used, except for "fgsea", where only signifMethod="geneSampling" is allowed. If signifMethod="geneSampling", gene sampling is used: the gene labels are randomized nPerm times and the gene set statistics are recalculated, giving a background distribution for each original gene set. The gene set p-values are calculated from this background distribution. Similarly, if signifMethod="samplePermutation", sample permutation is used. In this case the argument permStats (and optionally permDirections) has to be supplied.
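
The gene sampling idea can be sketched in a few lines of base R. This is an illustration of the principle for a single gene set and a mean statistic, not piano's internal implementation; the toy statistics, the gene set and the way the empirical p-value is formed are all assumptions made for the example.

```r
set.seed(42)
# Toy t-like statistics for 200 hypothetical genes.
stats <- setNames(rnorm(200), paste0("gene", 1:200))

# Shift 15 genes upwards and treat them as one hypothetical gene set.
geneSet <- paste0("gene", 1:15)
stats[geneSet] <- stats[geneSet] + 2

# Observed gene set statistic (the mean of the member genes).
obs <- mean(stats[geneSet])

# Background: randomize the gene labels nPerm times and recompute the
# statistic. For a single set this equals drawing random sets of equal size.
nPerm <- 1000
background <- replicate(nPerm, mean(sample(stats, length(geneSet))))

# One-sided empirical p-value for enrichment of high statistics.
pUp <- mean(background >= obs)
```

Because the background depends on the set size, a separate background distribution is needed for each distinct gene set size, which is why nPerm has a direct impact on runtime.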

The runGSA function returns p-values for each gene set. Depending on the choice of methods and gene statistics, up to three classes of p-values can be calculated, describing different aspects of regulation directionality. The three directionality classes are Distinct-directional, Mixed-directional and Non-directional.

The non-directional p-values (pNonDirectional) are calculated based on absolute values of the gene statistics (or p-values without sign information), meaning that gene sets containing a high proportion of significant genes, independent of direction, will turn up significant. That is, gene sets with a low pNonDirectional should be interpreted as significantly affected by gene regulation, but there can be a mix of both up- and down-regulation involved.

The mixed-directional p-values (pMixedDirUp and pMixedDirDn) are calculated using the subset of the gene statistics that are up-regulated and down-regulated, respectively. This means that a gene set with a low pMixedDirUp will have a component of significantly up-regulated genes, regardless of the extent of down-regulated genes, and the reverse for pMixedDirDn. This also means that a gene set can be significantly affected by down-regulation and by up-regulation at the same time. Note that sample permutation cannot be used to calculate pMixedDirUp and pMixedDirDn since the subset sizes will differ.

Finally, the distinct-directional p-values (pDistinctDirUp and pDistinctDirDn) are calculated from statistics with sign information (e.g. t-statistics). In this case, if a gene set contains both up- and down-regulated genes, they will cancel each other out. A gene set with a low pDistinctDirUp will be significantly affected by up-regulation, but not by a mix of up- and down-regulation (as in the case of the mixed-directional and non-directional p-values).
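
To make the three directionality classes concrete, the sketch below scores one toy gene set with a mean statistic under each class. It illustrates the idea only and assumes t-like gene statistics; the variable names are hypothetical and the code is not taken from piano.

```r
# Toy t-like statistics for the members of one hypothetical gene set.
tvals <- c(g1 = 3.1, g2 = 2.4, g3 = -2.8, g4 = -3.5, g5 = 0.2)

# Distinct-directional: signs are kept, so up and down cancel out.
scoreDistinct <- mean(tvals)

# Non-directional: absolute values, so direction is ignored.
scoreNonDir <- mean(abs(tvals))

# Mixed-directional: up- and down-regulated subsets are scored separately.
scoreMixedUp <- mean(tvals[tvals > 0])
scoreMixedDn <- mean(abs(tvals[tvals < 0]))
```

For this set the distinct-directional score is close to zero (-0.12) because the strong up- and down-regulated genes cancel, while the non-directional score (2.4) and both mixed-directional scores (1.9 up, 3.15 down) are clearly away from zero.
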
To calculate distinct-directional gene set p-values when p-values are used as gene-level statistics, the gene-level p-values are transformed as follows: the p-values of up-regulated genes are divided by 2 (so that they range from 0 to 0.5) and the p-values of down-regulated genes are set to 1-p/2 (so that they range from 1 down to 0.5). This means that a significantly down-regulated gene will get a p-value close to 1. These new p-values are used as input to the gene set analysis procedure to get pDistinctDirUp. To get pDistinctDirDn, the opposite is done: the up-regulated portion is scaled to range from 1 down to 0.5 and the down-regulated portion to range from 0 to 0.5.
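
The transformation just described can be written out directly in base R. The sketch below applies it to four toy genes; the input values and variable names are hypothetical.

```r
# Toy gene-level p-values and fold-change like directions (hypothetical).
p <- c(g1 = 0.01, g2 = 0.04, g3 = 0.50, g4 = 0.02)
direction <- c(g1 = 1.8, g2 = -2.1, g3 = 0.3, g4 = -0.9)
up <- direction > 0

# Input for pDistinctDirUp: up-regulated genes get p/2,
# down-regulated genes get 1 - p/2.
pForUp <- ifelse(up, p / 2, 1 - p / 2)

# Input for pDistinctDirDn: the mirror image.
pForDn <- ifelse(up, 1 - p / 2, p / 2)
```

The significantly down-regulated genes g2 and g4 end up at 0.98 and 0.99 in pForUp, so they count against up-regulation, and the two transformed values of each gene always sum to 1.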

Value

A list-like object of class GSAres containing the following elements:

geneStatType

The interpreted type of gene-level statistics

geneSetStat

The method for gene set statistic calculation

signifMethod

The method for significance estimation

adjMethod

The method of adjustment for multiple testing

info

A list object with detailed information on the number of genes and gene sets

gsSizeLim

The selected gene set size limits

gsStatName

The name of the gene set statistic type

nPerm

The number of permutations

gseaParam

The GSEA parameter

geneLevelStats

The input gene-level statistics

directions

The input directions

gsc

The input gene set collection

nGenesTot

The total number of genes in each gene set

nGenesUp

The number of up-regulated genes in each gene set

nGenesDn

The number of down-regulated genes in each gene set

statDistinctDir

Gene set statistics of the distinct-directional class

statDistinctDirUp

Gene set statistics of the distinct-directional class

statDistinctDirDn

Gene set statistics of the distinct-directional class

statNonDirectional

Gene set statistics of the non-directional class

statMixedDirUp

Gene set statistics of the mixed-directional class

statMixedDirDn

Gene set statistics of the mixed-directional class

pDistinctDirUp

Gene set p-values of the distinct-directional class

pDistinctDirDn

Gene set p-values of the distinct-directional class

pNonDirectional

Gene set p-values of the non-directional class

pMixedDirUp

Gene set p-values of the mixed-directional class

pMixedDirDn

Gene set p-values of the mixed-directional class

pAdjDistinctDirUp

Adjusted gene set p-values of the distinct-directional class

pAdjDistinctDirDn

Adjusted gene set p-values of the distinct-directional class

pAdjNonDirectional

Adjusted gene set p-values of the non-directional class

pAdjMixedDirUp

Adjusted gene set p-values of the mixed-directional class

pAdjMixedDirDn

Adjusted gene set p-values of the mixed-directional class

runtime

The execution time in seconds

Author(s)

Leif Varemo piano.rpkg@gmail.com and Intawat Nookaew piano.rpkg@gmail.com

References

Fisher, R. Statistical methods for research workers. Oliver and Boyd, Edinburgh, (1932).

Stouffer, S., Suchman, E., Devinney, L., Star, S., and Williams Jr, R. The American soldier: adjustment during army life. Princeton University Press, Oxford, England, (1949).

Patil, K. and Nielsen, J. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proceedings of the National Academy of Sciences of the United States of America 102(8), 2685 (2005).

Oliveira, A., Patil, K., and Nielsen, J. Architecture of transcriptional regulatory circuits is knitted over the topology of bio-molecular interaction networks. BMC Systems Biology 2(1), 17 (2008).

Kim, S. and Volsky, D. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 6(1), 144 (2005).

Taylor, J. and Tibshirani, R. A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics 7(2), 167-181 (2006).

Mootha, V., Lindgren, C., Eriksson, K., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. Pgc-1-alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature genetics 34(3), 267-273 (2003).

Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545-15550 (2005).

Efron, B. and Tibshirani, R. On testing the significance of sets of genes. The Annals of Applied Statistics 1, 107-129 (2007).

See Also

piano, loadGSC, GSAsummaryTable, geneSetSummary, networkPlot2, exploreGSAres, HTSanalyzeR, PGSEA, samr, limma, GSA, fgsea

Examples

   # Load the piano package and the example input data to GSA:
   library(piano)
   data("gsa_input")
   
   # Load gene set collection:
   gsc <- loadGSC(gsa_input$gsc)
   
   # Run gene set analysis:
   gsares <- runGSA(geneLevelStats = gsa_input$pvals, directions = gsa_input$directions,
                    gsc = gsc, nPerm = 500)
   

piano documentation built on Nov. 8, 2020, 6:27 p.m.