ssea.analyze: Marker set enrichment analysis (MSEA)
In Mergeomics: Integrative network analysis of omics data

Description Usage Arguments Details Value Author(s) References See Also Examples

ssea.analyze finds the enrichment of the pathways or co-expression modules by a marker set (e.g. associated risk variants -loci- of a relevant disease). Association study by mapping markers (e.g. SNPs) to genes (e.g. via expression QTLs). Enrichment P-values obtained by the MSEA denote the degree of enrichment of significantly disease-associated (high ranking) markers (e.g. eSNPs) within these pathways when compared to the null distribution of expected uniform distribution of all ranks of the markers. MSEA is performed with either gene-level or marker-level permutations based on Gaussian distribution.

1	ssea.analyze(job, trim_start, trim_end)

`job`	the data list including fields: seed for random number generator `(job$seed)` (to obtain the same results from permutation when the same input set is given), random permutation level `(job$permtype)` (either gene- or marker-level), maximum number of permutations `(job$nperm)` for the permutation test, and the database that uses indexed identities for modules, genes, and markers (e.g. loci) `(job$database)`.
`trim_start`	percentile taken from the beginning for trimming away a defined proportion of genes with significant trait association to avoid signal inflation of null background in gene permutation. Default value is 0.002.
`trim_end`	percentile taken from the ending point for trimming away a defined proportion of genes with significant trait association to avoid signal inflation of null background in gene permutation. Default value is 0.998.

ssea.analyze associates the gene sets (pathways or co-expression modules) with relevant disease (e.g. Coronary Artery Disease) association data by mapping markers (e.g. SNPs) to genes (e.g. via expression QTLs). It performs the MSEA by using observed and estimated enrichment scores. First, the observed enrichment scores of the pathways by markers (e.g. loci) are calculated. Then, a Gaussian distribution based simulation is performed, by using the statistics of the observed scores (mean, std.dev., etc.), to obtain the estimated enrichment scores, enrichment frequencies, and other statistics e.g. p-values for the pathways. ssea.analyze trims away a defined proportion of genes with significant trait association to avoid signal inflation of null background in gene permutation by using trim_start and trim_end.

job

the updated data frame including results: indexed module identity, enrichment P-values, raw frequencies (raw frequency of a gene set defines the number of the estimated enrichment scores that are larger than this gene set's enrichment score under the null distribution based on Gaussian function)

Ville-Petteri Makinen

Shu L, Zhao Y, Kurt Z, Byars SG, Tukiainen T, Kettunen J, Orozco LD, Pellegrini M, Lusis AJ, Ripatti S, Zhang B, Inouye M, Makinen V-P, Yang X. Mergeomics: multidimensional data integration to identify pathogenic perturbations to biological systems. BMC genomics. 2016;17(1):874.

ssea.control, ssea.finish, ssea.prepare, ssea.start, ssea2kda

job.msea <- list()
job.msea$label <- "hdlc"
job.msea$folder <- "Results"
job.msea$genfile <- system.file("extdata", 
"genes.hdlc_040kb_ld70.human_eliminated.txt", package="Mergeomics")
job.msea$marfile <- system.file("extdata", 
"marker.hdlc_040kb_ld70.human_eliminated.txt", package="Mergeomics")
job.msea$modfile <- system.file("extdata", 
"modules.mousecoexpr.liver.human.txt", package="Mergeomics")
job.msea$inffile <- system.file("extdata", 
"coexpr.info.txt", package="Mergeomics")
job.msea$nperm <- 100 ## default value is 20000

## ssea.start() process takes long time while merging the genes sharing high
## amounts of markers (e.g. loci). it is performed with full module list in
## the vignettes. Here, we used a very subset of the module list (1st 10 mods
## from the original module file) and we collected the corresponding genes
## and markers belonging to these modules:
moddata <- tool.read(job.msea$modfile)
gendata <- tool.read(job.msea$genfile)
mardata <- tool.read(job.msea$marfile)
mod.names <- unique(moddata$MODULE)[1:min(length(unique(moddata$MODULE)),
10)]
moddata <- moddata[which(!is.na(match(moddata$MODULE, mod.names))),]
gendata <- gendata[which(!is.na(match(gendata$GENE, 
unique(moddata$GENE)))),]
mardata <- mardata[which(!is.na(match(mardata$MARKER, 
unique(gendata$MARKER)))),]

## save this to a temporary file and set its path as new job.msea$modfile:
tool.save(moddata, "subsetof.coexpr.modules.txt")
tool.save(gendata, "subsetof.genfile.txt")
tool.save(mardata, "subsetof.marfile.txt")
job.msea$modfile <- "subsetof.coexpr.modules.txt"
job.msea$genfile <- "subsetof.genfile.txt"
job.msea$marfile <- "subsetof.marfile.txt"
## run ssea.start() and prepare for this small set: (due to the huge runtime)
job.msea <- ssea.start(job.msea)
job.msea <- ssea.prepare(job.msea)
job.msea <- ssea.control(job.msea)
job.msea <- ssea.analyze(job.msea)

## Remove the temporary files used for the test:
file.remove("subsetof.coexpr.modules.txt")
file.remove("subsetof.genfile.txt")
file.remove("subsetof.marfile.txt")

Writing to file... 
Saved 1744 rows in 'subsetof.coexpr.modules.txt'.
[1] "subsetof.coexpr.modules.txt"

Writing to file... 
Saved 7376 rows in 'subsetof.genfile.txt'.
[1] "subsetof.genfile.txt"

Writing to file... 
Saved 7070 rows in 'subsetof.marfile.txt'.
[1] "subsetof.marfile.txt"

MSEA Version:01.04.2016

Parameters:
  Permutation type: gene
  Permutations: 100
  Random seed: 1
  Minimum gene count: 10
  Maximum gene count: 500
  Maximum overlap between genes: 0.33

Importing modules...
    MODULE             DESCR          
 Length:20          Length:20         
 Class :character   Class :character  
 Mode  :character   Mode  :character  
    MODULE              GENE          
 Length:1744        Length:1744       
 Class :character   Class :character  
 Mode  :character   Mode  :character  

Importing marker values...
    MARKER              VALUE         
 Length:7070        Min.   :  0.8094  
 Class :character   1st Qu.:  0.9531  
 Mode  :character   Median :  1.1685  
                    Mean   :  1.8574  
                    3rd Qu.:  1.5491  
                    Max.   :323.0100  

Importing mapping data...
     GENE              MARKER         
 Length:7376        Length:7376       
 Class :character   Class :character  
 Mode  :character   Mode  :character  

Merging genes containing shared markers...

16110 comparisons

7260 comparisons

7021 comparisons
Job: 0.7925262 Mb

Preparing data structures...
Job: 1.21257 Mb

Adding positive controls...
Job: 1.298004 Mb

Estimating enrichment...
100/100 cycles

Normalizing scores...
[1] TRUE
[1] TRUE
[1] TRUE