diffEnrich by example

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval.after = 'fig.cap'
)
options(knitr.table.format = "html") 

diffEnrich

Introduction

The goal of diffEnrich is to compare functional enrichment between two experimentally-derived groups of genes or proteins. Given a list of NCBI gene symbols, diffEnrich will perform differential enrichment analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) REST API. This package provides a number of functions that are intended to be used in a pipeline (See Figure 1). Briefly, the user provides a KEGG formatted species id for either human, mouse or rat, and the package will download and store species specific ENTREZ gene IDs and map them to their respective KEGG pathways by accessing the KEGG REST API. The KEGG API is used to guarantee the most up-to-date pathway data from KEGG. Next, the user will identify significantly enriched pathways in two different gene sets, and finally, the user will identify pathways that are differentially enriched between the two gene sets. In addition to the analysis pipeline, this package also provides a plotting function.

The KEGG REST API

KEGG is a database resource for understanding high-level functions of a biological system, such as a cell, an organism and an ecosystem, from genomic and molecular-level information https://www.kegg.jp/kegg/kegg1a.html. KEGG is an integrated database resource consisting of eighteen databases that are clustered into 4 main categories: 1) systems information (e.g. hierarchies and maps), 2) genomic information (e.g. genes and proteins), 3) chemical information (e.g. biochemical reactions), and 4) health information (e.g. human disease and drugs) https://www.kegg.jp/kegg/kegg1a.html.

In 2012 KEGG released its first application programming interface (API), and has been adding features and functionality ever since. There are benefits to using an API. First, API's, like KEGG's, allow users to perform customized analyses with the most up-to-date versions of the data contained in the database. In addition, accessing the KEGG API is very easy using statistical programming tools like R or Python and integrating data retrieval into user's code makes the program reproducible. To further enforce reproducibilty diffEnrich adds a date and KEGG release tag to all data files it generates from accessing the API. For update histories and release notes for the KEGG REST API please go here.

suppressMessages(library(diagram))

## open plot 
par(mar = c(0.5, 0.5, 0.5, 0.5))
openplotmat()
## define number of boxes at each level
elpos <- coordinates (c(3, 1, 1, 2, 2, 1, 1, 1))
## generate matrix of from-to arrow coordinates
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(1, 6,
                          2, 4,
                          3, 7,
                          4, 5,
                          5, 6,
                          5, 7,
                          6, 8,
                          7, 9,
                          8, 10, 
                          9, 10,
                          10, 11,
                          11, 12))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)

## Build flow chart
for (i in 1:nr){
  arrpos[i, ] <- straightarrow(to = elpos[fromto[i, 2], ],
                                from = elpos[fromto[i, 1], ],
                                lwd = 1, arr.pos = 0.6, arr.length = 0.2)
  textellipse(elpos[1,], 0.10, 0.04, lab = "geneList1", box.col = "purple",
              shadow.col = "purple4", shadow.size = 0.005, cex = 0.6)
  textrect(elpos[2,], 0.10, 0.03, lab = "get_kegg()", box.col = "steelblue1",
              shadow.col = "darkblue", shadow.size = 0.005, cex = 0.6)
  textellipse(elpos[3,], 0.10, 0.04, lab = "geneList2", box.col = "purple",
              shadow.col = "purple4", shadow.size = 0.005, cex = 0.6)
  textround(elpos[4,], 0.13, 0.03,lab = "KEGG REST API", box.col = "orange1",
            shadow.col = "orange4", shadow.size = 0.005, cex = 0.5)
  textellipse(elpos[5,], 0.18, 0.06, lab = c("cleaned gene-to-pathway", "map"), box.col = "firebrick1",
              shadow.col = "firebrick4", shadow.size = 0.005, cex = 0.5)
  textrect(elpos[6,], 0.14, 0.03,lab = "pathEnrich(List 1)", box.col = "steelblue1",
            shadow.col = "darkblue", shadow.size = 0.005, cex = 0.5)
  textrect(elpos[7,], 0.14, 0.03, lab = "pathEnrich(List 2)", box.col = "steelblue1",
           shadow.col = "darkblue", shadow.size = 0.005, cex = 0.5)
  textellipse(elpos[8,], 0.12, 0.04, lab = c("pathEnrich","result/object"), box.col = "firebrick1",
              shadow.col = "firebrick4", shadow.size = 0.005, cex = 0.5)
  textellipse(elpos[9,], 0.12, 0.04, lab = c("pathEnrich","result/object"),box.col = "firebrick1",
              shadow.col = "firebrick4", shadow.size = 0.005, cex = 0.5)
  textrect(elpos[10,], 0.10, 0.03, lab = "diffEnrich()", box.col = "steelblue1",
           shadow.col = "darkblue", shadow.size = 0.005, cex = 0.5)
  textellipse(elpos[11,], 0.12, 0.06, lab = c("diffEnrich","result/object"), box.col = "firebrick1",
              shadow.col = "firebrick4", shadow.size = 0.005, cex = 0.5)
  textrect(elpos[12,], 0.17, 0.03, lab = "plotFoldEnrichment()", box.col = "steelblue1",
           shadow.col = "darkblue", shadow.size = 0.005, cex = 0.5)
  #text(0.15,0.05, "Figure 1. diffEnrich Analysis pipeline", cex = 0.15)
  text(0.05,0.87, "Step 1", cex = 0.45, font = 2)
  text(0.05,0.57, "Step 2", cex = 0.45, font = 2)
  text(0.05,0.35, "Step 3", cex = 0.45, font = 2)
  text(0.05,0.10, "Step 4", cex = 0.45, font = 2)
}
cap <- c("Figure 1. diffEnrich Analysis pipeline. Functions within the diffEnrich package are represented by blue rectangles. The data that must be provided by the user is represented by the purple ovals. Data objects generated by a function in diffEnrich are represented by red ovals. The external call of the get_kegg function to the KEGG REST API is represented in yellow.")

Motivating experimental design for differential enrichment

Often high throughput omics studies include a functional enrichment analysis to glean biological insight from a list of candidate genes, proteins, metabolites, etc. Functional enrichment examines whether the number of genes in the list associated with a biological function or particular pathway is more than would be expected by chance. As an example, enrichment of a particular pathway among a list of genes that are differentially expressed after an experimental manipulation may indicate that the pathway has been altered by that manipulation. This analysis is rather straight forward and many solutions have been offered (e.g., [Haung et al., 2009]https://pubmed.ncbi.nlm.nih.gov/19033363/); Kuleshov et al., 2016; Liao et al., 2019; Subramanian et al., 2005). A wide variety of databases have also been used to define these pathways (e.g., Kanehisa and Goto, 2000) and ontologies (e.g., Ashburn et al., 2000).

One key component of a statistically rigorous functional enrichment analysis is the definition of a background data set that can be used to estimate the number of candidate genes that are ‘expected’ to be associated with the pathway by chance, e.g., if 5% of genes in the background data set are associated with a pathway then 5% of candidate gene are expected to be associated with the pathway by chance. For many study designs, the background data set is relatively simple to define (e.g., RNA-Seq analyses where the background data set includes genes expressed above background).

However, for some newer omics technologies, the background data set is hard to define. For example, LC-MS analysis can be used to identify carbonylated proteins ( Peterson et al., 2018; Shearn et al., 2019; Shearn et al., 2018). With this study design, carbonylated proteins are isolated using a BH-derivation and then LC-MS is used to identify peptides in this isolated sample. The most appropriate background data set would be proteins present in that tissue, but this would require a separate analytical analysis. Furthermore, most functional enrichment analyses involve a single gene list. However, in protein modification studies, the typical experimental design compares the presence or absence of particular modified proteins between multiple groups.

When there are two or more gene lists to compare and the background gene list is not clearly defined, as is often the case in protein modification experiments, we propose a differential enrichment analysis. In this analysis, we compare the proportion of genes/proteins from one gene list associated with a particular pathway to the proportion of genes/proteins from a second gene list that are associated with that pathway. To easily execute this analysis, we have designed an R package that uses the KEGG REST API to obtain the most recent version of the KEGG PATHWAY ( Kanehisa and Goto, 2000) database to initially identify functional enrichment within a gene list using the entire KEGG transcriptome as the background data set and then to identify differentially enriched pathways between two gene lists. This R package includes a function to generate a "differential enrichment" graphic.

Installation

You can install the released version of diffEnrich from CRAN with:

install.packages("diffEnrich") 

Example

Step 1: Collect and store pathways from KEGG API

First we will use the get_kegg function to access the KEGG REST API and download the data sets required to perform our downstream analysis. This function takes two arguments. The first, 'species' is required. Currently, diffEnrich supports three species, and the argument is a character string using the KEGG code https://www.pnas.org/doi/10.1073/pnas.0806162105: Homo sapiens (human), use 'hsa'; Mus musculus (mouse), use 'mmu'; and Rattus norvegicus (rat), use 'rno'. The second, 'path' is also passed as a character string, and is the path of the directory in which the user would like to write the data sets downloaded from the KEGG REST API. If the user does not provide a path, the data sets will be automatically written to the current working directory using the here::here() functionality. These data sets will be tab delimited files with a name describing the data, and for reproducibility, the date they were generated and the version of KEGG when the API was accessed. In addition to these flat files, get_kegg will also create a named list in R with the three relevant KEGG data sets. The names of this list will describe the data set. For a detailed description of list elements use ?get_kegg.

suppressMessages(library(diffEnrich))
## run get_kegg() using rat
kegg_rno <- get_kegg('rno')

Here are examples of the output files:

ncbi_to_kegg2019-10-17Release_92.0+_10-17_Oct_19.txt
kegg_to_pathway2019-10-17Release_92.0+_10-17_Oct_19.txt
pathway_to_species2019-10-17Release_92.0+_10-17_Oct_19.txt

Note: Because it is assumed that the user might want to use the data sets generated by get_kegg, it is careful not to overwrite data sets with exact names. get_kegg checks the path provided for data sets generated 'same-day/same-version', and if it finds even one of the three, it will not re-write any of the data sets. It will still however, let the user know it is not writing out new data sets and still generate the named list object. Users can generate 'same-day/same-version' data sets in different directories if they so choose.

## run get_kegg() using rat
kegg_rno <- get_kegg('rno')
## run get_kegg() using rat
kegg_rno <- get_kegg(read = TRUE, path = "/path/to/files", date = "2019-10-17", release = "92")

Here is an example of the output:

Reading in the following files:
ncbi_to_kegg2019-10-17Release_92.0+_10-17_Oct_19.txt
kegg_to_pathway2019-10-17Release_92.0+_10-17_Oct_19.txt
pathway_to_species2019-10-17Release_92.0+_10-17_Oct_19.txt
File location: ~/Desktop
# Define column names
kegg_df_dat <- c("ncbi_to_kegg",
                 "kegg_to_pathway",
                 "pathway_to_species")

kegg_df_dscr <- c(paste0("ncbi gene ID",
                 " <-- mapped to --> ",
                 "KEGG gene ID"),
                 paste0("KEGG gene ID",
                 " <-- mapped to --> ",
                 "KEGG pathway ID"),
                 paste0("KEGG pathway ID",
                 " <-- mapped to --> ",
                 "KEGG pathway species description"))
# Make dataframe
kegg_df <- data.frame("get_kegg list object" = kegg_df_dat, 
                      "Object description" = kegg_df_dscr)
if (require(kableExtra)){
knitr::kable(kegg_df,
             caption = "Table 1. Description of data sets generated from accessing KEGG REST API", row.names = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
}

Step 2: Perform enrichment analysis of individual gene sets

In this step we will use the pathEnrich function to identify KEGG pathways that are enriched (i.e. over-represented) based on a gene list of interest. User gene lists must be character vectors and be formatted as ENTREZ gene IDs. The clusterProfiler package offers a nice function (bitr) that maps gene symbols and Ensembl IDs to ENTREZ gene IDs, and an example can be seen in their vignette.

## View sample gene lists from package data
head(geneLists$list1)
head(geneLists$list2)

This function may not always use the complete list of genes provided by the user. Specifically, it will only use the genes from the list provided that are also in the most current species list pulled from the KEGG REST API using get_kegg, or from the older KEGG data loaded by the user from a previous get_kegg call. Users can also decide which KEGG pathways should be tested based on how many genes from their gene list are contained in the KEGG pathway. Users can set this parameter by changing the 'N' argument. The default is N = 2. The pathEnrich function should be run once for the genes of interest in list 1 and once for the genes of interest in list2. Each pathEnrich call generates a data frame summarizing the results of an enrichment analysis in which a Fisher's Exact test is used to identify which KEGG pathways are enriched for the user's list of genes compared to all genes annotated to a KEGG pathway. Users can choose a multiple correction option from those supported by stats::p.adjust. The default is the False Discovery Rate ( Benjamini and Hochberg, 1995), and the default threshold to reach significance is 0.05.

# run pathEnrich using kegg_rno
## List 1
list1_pe <- pathEnrich(gk_obj = kegg, gene_list = geneLists$list1, cutoff = 0.05, N = 2)
## list2
list2_pe <- pathEnrich(gk_obj = kegg, gene_list = geneLists$list2, cutoff = 0.05, N = 2) 
if (require(kableExtra)){
knitr::kable(head(list1_pe$enrich_table),
             caption = "Table 2. First 6 rows of list1_pe data frame generated using pathEnrich", row.names = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
}

pathEnrich generates a list object that contains a data frame with 9 columns described below as well as other metadata from the function call. Details also provided in pathEnrich documentation. S3 generic functions for print and summary are provided. The print function prints the results table as a tibble, and the summary function returns the number of pathways that reached statistical significance as well as their descriptions, the number of genes used from the KEGG data base, the KEGG species, and the method used for multiple testing correction.

summary(list1_pe)
# get column names
names <- colnames(list1_pe$enrich_table)
# Define column names
description <- c("KEGG Pathway Identifier",
                 "Description of KEGG Pathway (provided by KEGG)",
                 "Number of Genes in KEGG Pathway",
                 "Number of Genes from gene list in KEGG Pathway",
                 "Number of Genes in KEGG Database",
                 "Number of Genes from gene list in KEGG Database",
                 "Expected number of genes from list to be in KEGG pathway by chance",
                 "P-value for enrichment within the KEGG pathway for list genes",
                 "Multiple testing adjusted enrichment p-values (default = False Discovery Rate (Benjamini and Hochberg, 1995))",
                 "Ratio of number of genes observed from the gene list annotated to the KEGG pathway to the number of genes expected from the gene list to be annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list/expected)")
# Make dataframe
df1 <- data.frame(names, description)
colnames(df1) <- c("Column Names", "Column Description") 
if (require(kableExtra)){
knitr::kable(df1,
             caption = "Table 3. Description of columns is dataframe generated by pathEnrich", row.names = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
}

Step 3: Identify differentially enriched KEGG pathways

The diffEnrich function will merge two results from the pathEnrich calls generated above. Specifically, the data frame 'list1_pe' and the data frame 'list2_pe' will be merged by the following columns: "KEGG_PATHWAY_ID", "KEGG_PATHWAY_description", "KEGG_PATHWAY_cnt", "KEGG_DATABASE_cnt". This merged data set will then be used to perform differential enrichment using the same method and p-value calculation as described above. Users do have the option of choosing a method for multiple testing adjustment. Users can choose from those supported by stats::p.adjust. The default is the False Discovery Rate ( Benjamini and Hochberg, 1995). KEGG pathways that do not contain any genes from either gene list (i.e., list1_pe$enrich_table$KEGG_PATHWAY_in_list for 'rno04530' = 0 AND list2_pe$enrich_table$KEGG_PATHWAY_in_list for 'rno04530' = 0) will be removed as these cannot be tested. If this is the case a warning will be printed that tells the user how many pathways were removed. This can be avoided by setting the 'N' parameter to a value > 0 in the pathEnrich calls.

## Perform differential enrichment
diff_enrich <- diffEnrich(list1_pe = list1_pe, list2_pe = list2_pe, method = 'none', cutoff = 0.05)
if (require(kableExtra)){
knitr::kable(head(diff_enrich$de_table),
             caption = "Table 4. First 6 rows from data frame generated by diffEnrich", row.names = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
}
## get vector of column names
de_col_names <- names(diff_enrich$de_table)
de_col_descr <- c("KEGG Pathway Identifier",
                  "Description of KEGG Pathway (provided by KEGG)",
                  "Number of Genes in KEGG Pathway",
                  "Number of Genes in KEGG Database",
                  "Number of Genes from gene list 1 in KEGG Pathway",
                  "Number of Genes from gene list 1 in KEGG Database",
                  "Expected number of genes from list 1 to be in KEGG pathway by chance",
                  "P-value for enrichment of list 1 genes related to KEGG pathway",
                  "Multiple testing adjusted enrichment p-values from gene list 1 (default = False Discovery Rate (Benjamini and Hochberg, 1995))",
                  "Ratio of number of genes observed from gene list 1 annotated to the KEGG pathway to the number of genes expected from gene list 1 annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list1/expected_list1)",
                  "Number of Genes from gene list 2 in KEGG Pathway",
                  "Number of Genes from gene list 2 in KEGG Database",
                  "Expected number of genes from list 2 to be in KEGG pathway by chance",
                  "P-value for enrichment of list 2 genes related to KEGG pathway",
                  "Multiple testing adjusted enrichment p-values from gene list 2 (default = False Discovery Rate (Benjamini and Hochberg, 1995))",
                  "Ratio of number of genes observed from gene list 2 annotated to the KEGG pathway to the number of genes expected from gene list 2 annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list2/expected_list2)",
                  "Odds of a gene from list 2 being from this KEGG pathway / Odds of a gene from list 1 being from this KEGG pathway",
                  "P-value for differential enrichment of this KEGG pathway between list 1 and list 2",
                  "Multiple testing adjusted differential enrichment p-values (default = False Discovery Rate (Benjamini and Hochberg, 1995))")
# Make dataframe
de_descr_df <- data.frame(de_col_names, de_col_descr)
colnames(de_descr_df) <- c("Column Names", "Column Description") 
knitr::kable(de_descr_df,
             caption = "Table 5. Description of columns in dataframe generated by diffEnrich", row.names = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)

The result of the diffEnrich call is a list object that contains a data frame with the estimated odds ratio generated by the Fisher's Exact test and the associated p-value as well as other metadata from the function call. S3 generic functions for print and summary are provided. The print function prints the results table as a tibble, and the summary function returns the number of pathways that reached statistical significance as well as their descriptions, the number of genes used from the KEGG database, the KEGG species, the number of pathways that were shared (i.e. had at least 1 gene from each gene list present in the pathway based on what the user chose for N in pathEnrich) between the gene lists and the method used for multiple testing correction.

summary(diff_enrich)

Figures

Step 4: Plot fold enrichment

plotFoldEnrichment generates a grouped bar plot using ggplot2 and the ggnewscale package. There are 3 arguments: 1) de_res is the dataframe generated from the diffEnrich function, 2) pval is the threshold for the adjusted p-value associated with differential enrichment that will filter which KEGG pathways to plot, and 3) after filtering based on pval N tells the function how many pathways to plot. It is important to make a note that the significance of the fold change is associated with the number of genes in the gene list. Notice that in this example the pathways in gene list 2 have smaller fold changes (shorter bars) than those in list 1, but that many of them are more significant (darker blue). This is because there are more genes in gene list 2 compared to gene list 1.

cap_plot <- c("Figure 2. Example of a differential enrichment graphic. KEGG pathways are plotted on the y-axis and fold
enrichment is plotted on the x-axis. Each KEGG pathway has a bar depicting
its fold enrichment in list 1 (red) and its fold enrichment in list 2 (blue).
The transparency of the bars correspond to the unadjusted p-value for the
pathway's enrichment in the given list. The p-value presented as text to the
right of each pair of bars is the adjusted p-value (user defined: default is FDR) associated with the
differential enrichment of the pathway between the two lists, and the pathways
are ordered from top to bottom by this p-value (i.e. smallest p-value on top
of plot, and largest p-value on bottom of plot). The dotted line represents a fold enrichment of 1. Finally, the number of genes used
for analysis from each gene list (recall that this number may not be the same as the number of
genes in the user's original list) are reported below their respective p-values
in the legend.")
## Plot fold enrichment
plotFoldEnrichment(de_res = diff_enrich, pval = 0.05, N = 5)

References

Ashburner et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25-29.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B,.57, 289–300

Huang DW. et al. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1):1-13.

Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30.

Kuleshov MV. et al. (2016) Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44(1): 90 – 97.

Liao, Y. et al. (2019) Gene set analysis toolkit with revamped UIs and APIs, Nucleic Acids Res. 47(1):199 – 205.

Petersen DR. et al. (2018) Elevated Nrf-2 responses are insufficient to mitigate protein carbonylation in hepatospecific PTEN deletion mice. PLoS One. 13(5):e0198139.

Shearn CT. et al. (2019) Cholestatic liver disease results increased production of reactive aldehydes and an atypical periportal hepatic antioxidant response. Free Radic Biol Med;143:101-114. [Epub ahead of print] PubMed PMID: 31377417.

Shearn CT. et al. (2018) Knockout of the Gsta4 Gene in Male Mice Leads to an Altered Pattern of Hepatic Protein Carbonylation and Enhanced Inflammation Following Chronic Consumption of an Ethanol Diet. Alcohol Clin Exp Res. 42(7):1192-1205.

Subramanian T. et al. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550.

Yu G. et al. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 16(5), 284-287.



Try the diffEnrich package in your browser

Any scripts or data that you put into this service are public.

diffEnrich documentation built on June 28, 2022, 1:08 a.m.