README.md

EnrichKit

Authors: Lihe Liu and Francisco Peñagaricano Maintainer: Lihe Liu (lihe.liu@wisc.edu)

The goal of EnrichKit is to perform an over-representation analysis of biological pathways (gene sets) given two gene lists (Significant Genes and Total Genes) using Fisher’s exact test (test of proportions based on the hypergeometric distribution). Significant genes could be derived from differentially expressed genes, genes flagged by significant SNPs from whole-genome scans, genes in non-preserved co-expression modules, etc..

Six pathway/annotation databases are currently integrated in the current release:

Note that the current release only supports Bos Taurus, other organisms might be included in the future.

Latest update 10-08-2020.

Installation

EnrichKit is currently unavailable on CRAN.

Users should use the development version from GitHub with:

install.packages("devtools")
devtools::install_github("liulihe954/EnrichKit") # Depends on R (>= 3.5.0)

Example

Suppose we have identified 2 DEGs from a total of 5 genes in each of two lactations.

library(EnrichKit)
# input format
Sig_lac1 =   c("ENSBTAG00000012594","ENSBTAG00000004139")
Sig_lac2 =   c("ENSBTAG00000009188","ENSBTAG00000001258")
Tot_lac1 = c("ENSBTAG00000012594","ENSBTAG00000004139","ENSBTAG00000018278","ENSBTAG00000021997","ENSBTAG00000008482")
Tot_lac2 = c("ENSBTAG00000009188","ENSBTAG00000001258","ENSBTAG00000021819","ENSBTAG00000019404","ENSBTAG00000015212")

# convert and orgnize
GeneInfo = convertNformatID(GeneSetNames=c("lactation1","lactation2"),
                            SigGene_list = list(Sig_lac1,Sig_lac2),
                            TotalGene_list = list(Tot_lac1,Tot_lac2),
                            IDtype = "ens") # Need to choose from c('ens','entrez','symbol')

# Resulting an integreted gene identifier object
GeneInfo
#> $lactation1
#>                 Gene ENTREZID SYMBOL SYMBOL_Suggested Sig
#> 1 ENSBTAG00000012594   615431  MRPS6            MRPS6   1
#> 2 ENSBTAG00000004139   531350  BACH1            BACH1   1
#> 3 ENSBTAG00000018278   281640 ATP5PO           ATP5PO   0
#> 4 ENSBTAG00000021997   510879  ITSN1            ITSN1   0
#> 5 ENSBTAG00000008482   516462    SON              SON   0
#> 
#> $lactation2
#>                 Gene ENTREZID  SYMBOL SYMBOL_Suggested Sig
#> 1 ENSBTAG00000009188   281183    GART             GART   1
#> 2 ENSBTAG00000001258   615627 TMEM50B          TMEM50B   1
#> 3 ENSBTAG00000021819   282257  IFNAR1           IFNAR1   0
#> 4 ENSBTAG00000019404   767864  IL10RB           IL10RB   0
#> 5 ENSBTAG00000015212   282258  IFNAR2           IFNAR2   0

Simply providing significant and total genes as lists, the function convertNformatID() automatically matches and organizes genes across different identifiers, namely, Ensembl Gene ID, EntrezID, Gene Symbol and HGNC suggested symbol**. Also, an additional column indicating significance status will be added (1 stands for significant and 0 for insignificant).

The R object resulted from the last step (e.g. GeneInfo) could be fed into the subsequent step.

There are six databases build-in beforehand, users can simply indicate which database they want to use by providing a parameter - Database = “xxx” in the function arguments:

# Enrichment of each database might take a few mintues to finish.
HyperGEnrich(GeneSet = GeneInfo,
             Database = 'kegg', #c("go","kegg","interpro","mesh","msig","reactome")
             minOverlap = 4, # minimum overlap of pathway genes and total genes
             pvalue_thres = 0.05, # pvalue of fisher's exact test
             adj_pvalue_thres = 0.1, # adjusted pvalues based on multiple testing correction
             padj_method = "fdr", # c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
             NewDB = F)

This function does not return any object in the R environment, however, all the results are packed into an .RData object and saved in the current working directory.

There are two elements in the resulting .RData object:

# Here is a demo of results format
data(SampleResults)
class(results) # it's a list
#> [1] "list"
length(results) # number of elements equals to numbers of (significant) gene list provided
#> [1] 3
dim(results[[1]]) # e.g. here are 144 significant pathways/terms and each has 9 attributes/statistics
#> [1] 144   9
names(results[[1]]) # Specific attributes/statistics
#> [1] "Term"               "totalG"             "sigG"              
#> [4] "pvalue"             "ExternalLoss_total" "ExternalLoss_sig"  
#> [7] "findG"              "hitsPerc"           "adj.pvalue"

A total of nine columns are documented in the outputs:

About Databases

Although databases will be updated on a regular basis (tentatively every 6 months), users are free to request an update or update/download databases using the built-in database updating functions.

There are a total of six datasets that can be updated.

# Note that these functions are potentially time-comsuming.
# New databases (in .RData format) will be stored in current working directory
GO_DB_Update()
KEGG_DB_Update()
Interpro_DB_Update()
MeSH_DB_Update()
Msig_DB_Update()
Reactome_DB_Update()

Please note, when you use new databases, please make sure:

HyperGEnrich(GeneSet = GeneInfo,
             Database = 'kegg', #c("go","kegg","interpro","mesh","msig","reactome")
             minOverlap = 4, # minimum overlap of pathway genes and total genes
             pvalue_thres = 0.05, # pvalue of fisher's exact test
             adj_pvalue_thres = 0.1, # adjusted pvalues based on multiple testing correction
             padj_method = "fdr", #c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
             NewDB = T) ### Set to T


liulihe954/EnrichKit documentation built on Oct. 10, 2020, 11:49 p.m.