runGSA: Gene set analysis
In piano: Platform for integrative analysis of omics data


Performs gene set analysis (GSA) based on a given number of gene-level statistics and a gene set collection, using a variety of available methods, returning the gene set statistics and p-values of different directionality classes.

runGSA(geneLevelStats, directions = NULL, geneSetStat = "mean",
  signifMethod = "geneSampling", adjMethod = "fdr", gsc,
  gsSizeLim = c(1, Inf), permStats = NULL, permDirections = NULL,
  nPerm = 10000, gseaParam = 1, ncpus = 1, verbose = TRUE)

`geneLevelStats`	a vector or a one-column data.frame or matrix, containing the gene level statistics. Gene level statistics can be e.g. p-values, t-values or F-values.
`directions`	a vector or a one-column data.frame or matrix, containing fold-change like values for the related gene-level statistics. This is mainly used if statistics are p-values or F-values, but not required. The values should be positive or negative, but only the sign information will be used, so the actual value will not matter.
`geneSetStat`	the statistical GSA method to use. Can be one of `"fisher"`, `"stouffer"`, `"reporter"`, `"tailStrength"`, `"wilcoxon"`, `"mean"`, `"median"`, `"sum"`, `"maxmean"`, `"gsea"`, `"fgsea"` or `"page"`. See below for details.
`signifMethod`	the method for significance assessment of gene sets, i.e. p-value calculation. Can be one of `"geneSampling"`, `"samplePermutation"` or `"nullDist"`.
`adjMethod`	the method for adjusting for multiple testing. Can be any of the methods supported by `p.adjust`, i.e. `"holm"`, `"hochberg"`, `"hommel"`, `"bonferroni"`, `"BH"`, `"BY"`, `"fdr"` or `"none"`. The exception is for `geneSetStat="gsea"`, where only the options `"fdr"` and `"none"` can be used.
`gsc`	a gene set collection given as an object of class `GSC` as returned by the `loadGSC` function.
`gsSizeLim`	a vector of length two, giving the minimum and maximum gene set size (number of member genes) to be kept for the analysis. Defaults to `c(1,Inf)`.
`permStats`	a matrix with permuted gene-level statistics (columns) for each gene (rows). This should be calculated by the user by randomizing the sample labels in the original data and recalculating the gene-level statistics for each comparison a large number of times, thus generating a vector (a row in the matrix) of background statistics for each gene. This argument is required if, and only used if, `signifMethod="samplePermutation"`.
`permDirections`	similar to `permStats`, but should instead contain fold-change like values for the related permuted statistics. This is mainly used if the statistics are p-values or F-values, but not required. The values should be positive or negative, but only the sign information will be used, so the actual value will not matter. This argument is optional and only used if `signifMethod="samplePermutation"`. Note, however, that if `directions` is given, then `permDirections` is also required, and vice versa.
`nPerm`	the number of permutations to use for gene sampling, i.e. if `signifMethod="geneSampling"`. The original Reporter features algorithm (`geneSetStat="reporter"` and `signifMethod="nullDist"`) also uses a permutation step which is controlled by `nPerm`.
`gseaParam`	the exponent parameter of the GSEA and FGSEA approach. This defaults to 1, as recommended by the GSEA authors.
`ncpus`	the number of CPUs to use. If larger than 1, the gene permutation part is run in parallel, which decreases runtime. Requires the R package snowfall to be installed. Should be set so that `nPerm/ncpus` is a positive integer. (Not used by FGSEA.)
`verbose`	a logical. Whether or not to display progress messages during the analysis.
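
As a rough illustration of the layout expected for `permStats`, the sketch below shuffles the sample labels of a simulated expression matrix and recomputes a per-gene t-statistic for each shuffle. The toy data, the group labels, the helper `tStat` and the variable `nSamplePerm` are all made up for this example and are not part of piano; only the genes-in-rows, permutations-in-columns layout comes from the argument description above.

```r
set.seed(1)

# Hypothetical toy data: 50 genes (rows) x 8 samples (columns)
expr <- matrix(rnorm(50 * 8), nrow = 50,
               dimnames = list(paste0("g", 1:50), paste0("s", 1:8)))
labels <- rep(c("ctrl", "treat"), each = 4)

# Hypothetical helper: Welch t-statistic per gene for a given labelling
tStat <- function(x, lab) {
  apply(x, 1, function(v) {
    unname(t.test(v[lab == "treat"], v[lab == "ctrl"])$statistic)
  })
}

# Observed gene-level statistics
geneLevelStats <- tStat(expr, labels)

# One column per permutation of the sample labels, one row per gene
nSamplePerm <- 20
permStats <- sapply(seq_len(nSamplePerm),
                    function(i) tStat(expr, sample(labels)))
rownames(permStats) <- rownames(expr)

dim(permStats)
```

In a real analysis the number of label permutations would be much larger than 20, and the statistic would be whichever gene-level statistic is passed as `geneLevelStats`.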

The rownames of geneLevelStats and directions should be identical and match the names of the members of the gene sets in gsc. If geneSetStat is set to "fisher", "stouffer", "reporter" or "tailStrength" only p-values are allowed as geneLevelStats. If geneSetStat is set to "maxmean", "gsea", "fgsea" or "page" only t-like geneLevelStats are allowed (e.g. t-values, fold-changes).
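
These requirements can be checked before calling runGSA. The following is a minimal base-R sketch, assuming the statistics and directions are supplied as named vectors; the gene names, the values and the plain-list `geneSets` object are hypothetical stand-ins (a real analysis would use a `GSC` object from `loadGSC`).

```r
# Hypothetical inputs: gene-level p-values and fold-change signs as named vectors
pvals <- c(geneA = 0.001, geneB = 0.20, geneC = 0.04, geneD = 0.75)
directions <- c(geneA = 1, geneB = -1, geneC = -1, geneD = 1)

# Hypothetical gene sets, written as a plain list purely for this check
geneSets <- list(set1 = c("geneA", "geneC"), set2 = c("geneB", "geneX"))

# Names of statistics and directions should be identical
namesMatch <- identical(names(pvals), names(directions))

# Gene set members that have no gene-level statistic
missingGenes <- setdiff(unlist(geneSets, use.names = FALSE), names(pvals))

# p-value based methods ("fisher", "stouffer", "reporter", "tailStrength")
# expect gene-level statistics in the range [0, 1]
looksLikePvals <- all(pvals >= 0 & pvals <= 1)

list(namesMatch = namesMatch, missingGenes = missingGenes,
     looksLikePvals = looksLikePvals)
```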

For geneSetStat set to "fisher", "stouffer", "reporter", "wilcoxon" or "page", the gene set p-values can be calculated from a theoretical null distribution; in this case, set signifMethod="nullDist". For all methods, signifMethod="geneSampling" or signifMethod="samplePermutation" can be used, except for "fgsea", where only signifMethod="geneSampling" is allowed. If signifMethod="geneSampling", gene sampling is used, meaning that the gene labels are randomized nPerm times and the gene set statistics are recalculated, so that a background distribution for each original gene set is acquired. The gene set p-values are calculated based on this background distribution. Similarly, if signifMethod="samplePermutation", sample permutation is used. In this case the argument permStats (and optionally permDirections) has to be supplied.
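
The idea behind gene sampling can be sketched in a few lines of base R. This is only an illustration of the principle under simplifying assumptions (a single gene set, the mean as gene set statistic, a one-sided empirical p-value); it is not piano's internal implementation, and the simulated statistics and gene names are hypothetical. Randomizing the gene labels is treated here as drawing a random gene set of the same size.

```r
set.seed(42)

# Hypothetical gene-level t-like statistics for 200 genes
stats <- rnorm(200)
names(stats) <- paste0("g", 1:200)

# Hypothetical gene set of 10 genes whose statistics are shifted upwards
setGenes <- paste0("g", 1:10)
stats[setGenes] <- stats[setGenes] + 2

# Observed gene set statistic (the mean, as with geneSetStat = "mean")
observed <- mean(stats[setGenes])

# Background distribution: redraw a gene set of the same size nPerm times
nPerm <- 1000
background <- replicate(nPerm, mean(sample(stats, length(setGenes))))

# Empirical one-sided p-value for up-regulation
pUp <- mean(background >= observed)
pUp
```

A larger nPerm gives a finer resolution of the empirical p-value, at the cost of runtime.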

The runGSA function returns p-values for each gene set. Depending on the choice of methods and gene statistics, up to three classes of p-values can be calculated, describing different aspects of regulation directionality. The three directionality classes are Distinct-directional, Mixed-directional and Non-directional. The non-directional p-values (pNonDirectional) are calculated based on absolute values of the gene statistics (or p-values without sign information), meaning that gene sets containing a high proportion of significant genes, independent of direction, will turn up significant. That is, gene sets with a low pNonDirectional should be interpreted to be significantly affected by gene regulation, but there can be a mix of both up- and down-regulation involved. The mixed-directional p-values (pMixedDirUp and pMixedDirDn) are calculated using the subset of the gene statistics that are up-regulated and down-regulated, respectively. This means that a gene set with a low pMixedDirUp will have a component of significantly up-regulated genes, regardless of the extent of down-regulated genes, and the reverse for pMixedDirDn. This also means that one can get gene sets that are both significantly affected by down-regulation and significantly affected by up-regulation at the same time. Note that sample permutation cannot be used to calculate pMixedDirUp and pMixedDirDn since the subset sizes will differ. Finally, the distinct-directional p-values (pDistinctDirUp and pDistinctDirDn) are calculated from statistics with sign information (e.g. t-statistics). In this case, if a gene set contains both up- and down-regulated genes, they will cancel each other out. A gene set with a low pDistinctDirUp will be significantly affected by up-regulation, but not a mix of up- and down-regulation (as in the case of the mixed-directional and non-directional p-values).
In order to be able to calculate distinct-directional gene set p-values while using p-values as gene-level statistics, the gene-level p-values are transformed as follows: The up-regulated portion of the p-values are divided by 2 (scaled to range between 0-0.5) and the down-regulated portion of p-values are set to 1-p/2 (scaled to range between 1-0.5). This means that a significantly down-regulated gene will get a p-value close to 1. These new p-values are used as input to the gene-set analysis procedure to get pDistinctDirUp. Similarly, the opposite is done, so that the up-regulated portion is scaled between 1-0.5 and the down-regulated between 0-0.5 to get the pDistinctDirDn.
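
The transformation described above can be written out directly. The sketch below uses hypothetical gene names, p-values and signs, and only covers strictly positive or negative directions; how a direction of exactly zero is treated is not specified here and is left out.

```r
# Hypothetical gene-level p-values and fold-change signs
p <- c(g1 = 0.01, g2 = 0.01, g3 = 0.50, g4 = 0.80)
direction <- c(g1 = 1, g2 = -1, g3 = 1, g4 = -1)

up <- direction > 0

# Input for pDistinctDirUp: up-regulated -> p/2, down-regulated -> 1 - p/2
pForUp <- ifelse(up, p / 2, 1 - p / 2)

# Input for pDistinctDirDn: the mirror image
pForDn <- ifelse(up, 1 - p / 2, p / 2)

rbind(pForUp, pForDn)
```

Genes g1 and g2 have the same p-value but opposite signs, so after the transformation they end up at opposite ends of the 0 to 1 range, which is what lets up- and down-regulated genes cancel out in the distinct-directional class.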

A list-like object of class GSAres containing the following elements:

`geneStatType`	The interpreted type of gene-level statistics
`geneSetStat`	The method for gene set statistic calculation
`signifMethod`	The method for significance estimation
`adjMethod`	The method of adjustment for multiple testing
`info`	A list object with detailed info on the number of genes and gene sets
`gsSizeLim`	The selected gene set size limits
`gsStatName`	The name of the gene set statistic type
`nPerm`	The number of permutations
`gseaParam`	The GSEA parameter
`geneLevelStats`	The input gene-level statistics
`directions`	The input directions
`gsc`	The input gene set collection
`nGenesTot`	The total number of genes in each gene set
`nGenesUp`	The number of up-regulated genes in each gene set
`nGenesDn`	The number of down-regulated genes in each gene set
`statDistinctDir`	Gene set statistics of the distinct-directional class
`statDistinctDirUp`	Gene set statistics of the distinct-directional class
`statDistinctDirDn`	Gene set statistics of the distinct-directional class
`statNonDirectional`	Gene set statistics of the non-directional class
`statMixedDirUp`	Gene set statistics of the mixed-directional class
`statMixedDirDn`	Gene set statistics of the mixed-directional class
`pDistinctDirUp`	Gene set p-values of the distinct-directional class
`pDistinctDirDn`	Gene set p-values of the distinct-directional class
`pNonDirectional`	Gene set p-values of the non-directional class
`pMixedDirUp`	Gene set p-values of the mixed-directional class
`pMixedDirDn`	Gene set p-values of the mixed-directional class
`pAdjDistinctDirUp`	Adjusted gene set p-values of the distinct-directional class
`pAdjDistinctDirDn`	Adjusted gene set p-values of the distinct-directional class
`pAdjNonDirectional`	Adjusted gene set p-values of the non-directional class
`pAdjMixedDirUp`	Adjusted gene set p-values of the mixed-directional class
`pAdjMixedDirDn`	Adjusted gene set p-values of the mixed-directional class
`runtime`	The execution time in seconds

Leif Varemo piano.rpkg@gmail.com and Intawat Nookaew piano.rpkg@gmail.com

Fisher, R. Statistical methods for research workers. Oliver and Boyd, Edinburgh, (1932).

Stouffer, S., Suchman, E., Devinney, L., Star, S., and Williams Jr, R. The American soldier: adjustment during army life. Princeton University Press, Oxford, England, (1949).

Patil, K. and Nielsen, J. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proceedings of the National Academy of Sciences of the United States of America 102(8), 2685 (2005).

Oliveira, A., Patil, K., and Nielsen, J. Architecture of transcriptional regulatory circuits is knitted over the topology of bio-molecular interaction networks. BMC Systems Biology 2(1), 17 (2008).

Kim, S. and Volsky, D. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 6(1), 144 (2005).

Taylor, J. and Tibshirani, R. A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics 7(2), 167-181 (2006).

Mootha, V., Lindgren, C., Eriksson, K., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. Pgc-1-alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature genetics 34(3), 267-273 (2003).

Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545-15550 (2005).

Efron, B. and Tibshirani, R. On testing the significance of sets of genes. The Annals of Applied Statistics 1, 107-129 (2007).

piano, loadGSC, GSAsummaryTable, geneSetSummary, networkPlot2, exploreGSAres, HTSanalyzeR, PGSEA, samr, limma, GSA, fgsea

   # Load the piano package and the example input data for GSA:
   library(piano)
   data("gsa_input")

   # Load gene set collection:
   gsc <- loadGSC(gsa_input$gsc)

   # Run gene set analysis:
   gsares <- runGSA(geneLevelStats=gsa_input$pvals, directions=gsa_input$directions,
                    gsc=gsc, nPerm=500)