knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "80%",
  warning=FALSE,
  message = F
)

EnrichKit

Authors: Lihe Liu and Francisco Peñagaricano
Maintainer: Lihe Liu (lihe.liu@wisc.edu)

The goal of EnrichKit is to perform an over-representation analysis of biological pathways (gene sets) given two gene lists (Significant Genes and Total Genes) using Fisher's exact test (test of proportions based on the hypergeometric distribution). Significant genes could be derived from differentially expressed genes, genes flagged by significant SNPs from whole-genome scans, genes in non-preserved co-expression modules, etc..

knitr::include_graphics("man/figures/README-Enrich_Illustration.png")

Six pathway/annotation databases are currently integrated in the current release:

Note that the current release only supports Bos Taurus, other organisms might be included in the future.

Latest update 10-08-2020.

Installation

EnrichKit is currently unavailable on CRAN.

Users should use the development version from GitHub with:

install.packages("devtools")
devtools::install_github("liulihe954/EnrichKit") # Depends on R (>= 3.5.0)

Example

Suppose we have identified 2 DEGs from a total of 5 genes in each of two lactations.

library(EnrichKit)
# input format
Sig_lac1 =   c("ENSBTAG00000012594","ENSBTAG00000004139")
Sig_lac2 =   c("ENSBTAG00000009188","ENSBTAG00000001258")
Tot_lac1 = c("ENSBTAG00000012594","ENSBTAG00000004139","ENSBTAG00000018278","ENSBTAG00000021997","ENSBTAG00000008482")
Tot_lac2 = c("ENSBTAG00000009188","ENSBTAG00000001258","ENSBTAG00000021819","ENSBTAG00000019404","ENSBTAG00000015212")

# convert and orgnize
GeneInfo = convertNformatID(GeneSetNames=c("lactation1","lactation2"),
                            SigGene_list = list(Sig_lac1,Sig_lac2),
                            TotalGene_list = list(Tot_lac1,Tot_lac2),
                            IDtype = "ens") # Need to choose from c('ens','entrez','symbol')

# Resulting an integreted gene identifier object
GeneInfo

Simply providing significant and total genes as lists, the function convertNformatID() automatically matches and organizes genes across different identifiers, namely, Ensembl Gene ID, EntrezID, Gene Symbol and HGNC suggested symbol**. Also, an additional column indicating significance status will be added (1 stands for significant and 0 for insignificant).

The R object resulted from the last step (e.g. GeneInfo) could be fed into the subsequent step.

There are six databases build-in beforehand, users can simply indicate which database they want to use by providing a parameter - Database = "xxx" in the function arguments:

# Enrichment of each database might take a few mintues to finish.
HyperGEnrich(GeneSet = GeneInfo,
             Database = 'kegg', #c("go","kegg","interpro","mesh","msig","reactome")
             minOverlap = 4, # minimum overlap of pathway genes and total genes
             pvalue_thres = 0.05, # pvalue of fisher's exact test
             adj_pvalue_thres = 0.1, # adjusted pvalues based on multiple testing correction
             padj_method = "fdr", # c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
             NewDB = F)

This function does not return any object in the R environment, however, all the results are packed into an .RData object and saved in the current working directory.

There are two elements in the resulting .RData object:

# Here is a demo of results format
data(SampleResults)
class(results) # it's a list
length(results) # number of elements equals to numbers of (significant) gene list provided
dim(results[[1]]) # e.g. here are 144 significant pathways/terms and each has 9 attributes/statistics
names(results[[1]]) # Specific attributes/statistics

A total of nine columns are documented in the outputs:

About Databases

Although databases will be updated on a regular basis (tentatively every 6 months), users are free to request an update or update/download databases using the built-in database updating functions.

There are a total of six datasets that can be updated.

# Note that these functions are potentially time-comsuming.
# New databases (in .RData format) will be stored in current working directory
GO_DB_Update()
KEGG_DB_Update()
Interpro_DB_Update()
MeSH_DB_Update()
Msig_DB_Update()
Reactome_DB_Update()

Please note, when you use new databases, please make sure:

HyperGEnrich(GeneSet = GeneInfo,
             Database = 'kegg', #c("go","kegg","interpro","mesh","msig","reactome")
             minOverlap = 4, # minimum overlap of pathway genes and total genes
             pvalue_thres = 0.05, # pvalue of fisher's exact test
             adj_pvalue_thres = 0.1, # adjusted pvalues based on multiple testing correction
             padj_method = "fdr", #c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
             NewDB = T) ### Set to T


liulihe954/EnrichKit documentation built on Oct. 10, 2020, 11:49 p.m.