Differential Expression Enrichment Tool (DEET)

Install and load DEET

DEET relies on the following packages. Since they are all available from CRAN, they should be downloaded and installed automatically by devtools::install_github or utils::install.packages. The required dependencies are listed below.

Installation

  1. Github (Development Version)
devtools::install_github("wilsonlabgroup/DEET")
  2. CRAN (Stable Release)
# IN DEVELOPMENT

Downloading files

All processed DEGs, metadata, and enriched pathways in formats compatible with this package as well as other methods such as gene set enrichment analysis are stored here: https://www.wilsonlab.org/public/DEET_data/

No functions within DEET automatically load data for the user, so the data must either be downloaded directly from the ftp or retrieved with the downloader function.

The DEET_data_download function, with possible inputs "ALL", "metadata", "enrich", and "feature_extract" automatically downloads the data required to run DEET_enrich and/or DEET_feature_extract.

We recommend using:

downloaded <- DEET_data_download("ALL")
metadata <- downloaded$metadata
DEET_feature_extract_input <- downloaded$DEET_feature_extract
DEET_enrich_input <- downloaded$DEET_enrich

Here, DEET_enrich_input replaces DEET_example_data for DEET_enrich(), and DEET_feature_extract_input replaces DEET_feature_extract_example_matrix for DEET_feature_extract(). Lastly, metadata is not directly used in any of the functions, but it summarizes all of the pairwise comparisons using the columns described under "metadata" below.

Once downloaded, save these data so that DEET can be used offline.
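
One way to keep a local copy is to serialise the downloaded list with base R's saveRDS() and read it back later with readRDS(). The sketch below uses a small stand-in list in place of the real output of DEET_data_download("ALL") so that it runs without a network connection. The column name inside the stand-in and the file name are placeholders, not part of DEET.

```r
# Stand-in for `downloaded <- DEET_data_download("ALL")`. The real object is
# much larger; only the three element names are taken from the text above.
downloaded <- list(
  metadata = data.frame(comparison = c("cmp1", "cmp2")),  # hypothetical column
  DEET_feature_extract = matrix(0, nrow = 2, ncol = 2),
  DEET_enrich = list()
)

# Placeholder path: in practice use a permanent directory, not tempdir().
cache_file <- file.path(tempdir(), "DEET_downloaded.rds")
saveRDS(downloaded, cache_file)

# In a later (offline) session, reload the list and unpack it as before.
downloaded <- readRDS(cache_file)
metadata <- downloaded$metadata
DEET_feature_extract_input <- downloaded$DEET_feature_extract
DEET_enrich_input <- downloaded$DEET_enrich
```

Saving the whole list in one file keeps the three objects in sync with each other.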

Structure of required datatypes

metadata

A comparison-by-explanatory-variable data frame providing important details to contextualize each study. For every pairwise comparison, it gives the study name, source (SRA, TCGA, GTEx, and SRA-manual), description from the DRA compendium, the number of samples (total, up-condition, and down-condition), samples (total, up-condition, down-condition), tissue (including tumour from TCGA), number of DEs (total, up-condition, down-condition), age (mean +- sd), sex, top 15 DEGs - up, top 15 DEGs - down, top 5 enriched pathways, and top 5 enriched TFs. PMIDs are also available for studies selected from SRA. Lastly, each pairwise comparison was given an overall category based on those decided in Crow et al., 2019.

DEET_enrich_input

This is the core of the DEET dataset. Here, you can find all of the significant DE genes computed within DEET (padj < 0.05); DEGs, pathways, and TFs sorted into *.gmt files compatible with traditional pathway enrichment tools (e.g., GSEA, gprofiler, etc.); the respective metadata; and the pathway enrichment and TF enrichment files used to generate the internal pathway enrichments of DEET_enrich. A more specific breakdown of these objects is below:

DEET_feature_extract_input

A gene-by-comparison matrix populated by the log2(Fold-change) of genes that are significantly DE in the comparison (padj < 0.05). This matrix is the input to the mat variable in DEET_feature_extract.
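
To make the shape of this object concrete, here is a minimal sketch that builds a toy gene-by-comparison matrix from a long-format table of DE results. All values are invented, and storing 0 for genes that are not significantly DE is an assumption made for illustration; the text above does not state what those entries hold in the real matrix.

```r
# Toy long-format DE results (all values invented for illustration).
de_long <- data.frame(
  gene       = c("TP53", "MYC", "TP53", "EGFR"),
  comparison = c("cmp1", "cmp1", "cmp2", "cmp2"),
  log2FC     = c(1.5, -2.0, 0.3, 2.2),
  padj       = c(0.01, 0.20, 0.03, 0.001)
)

genes <- sort(unique(as.character(de_long$gene)))
comparisons <- sort(unique(as.character(de_long$comparison)))

# Assumption: entries that are not significantly DE are stored as 0.
mat <- matrix(0, nrow = length(genes), ncol = length(comparisons),
              dimnames = list(genes, comparisons))

# Fill in log2(Fold-change) only where padj < 0.05, indexing by name.
sig <- de_long[de_long$padj < 0.05, ]
mat[cbind(as.character(sig$gene), as.character(sig$comparison))] <- sig$log2FC
mat
```

Rows are genes and columns are comparisons, so each column can be matched to one row of the metadata.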

Implementation summary

The primary function of the DEET R package is to allow users to query their own list of DEGs against the consistently computed DEGs within DEET by using the function DEET_enrich(). The optimal input into DEET_enrich() is a data frame of genes (human gene symbols) with an associated p-value and coefficient (e.g., Fold-change), in conjunction with a list of genes designating the statistical background.

DEET_enrich() first identifies enriched biological pathways and TF targets using the *.gmt files used for all of the DEET comparisons (i.e., “Human_GO_AllPathways_with_GO_iea_June_01_2021_symbol.gmt” for pathways and “Human_TranscriptionFactors_MSigdb_June_01_2021_symbol.gmt” for TFs), allowing us to compare not only overlapping genes between the user-inputted genes and the DEGs in DEET but also overlapping pathways and TFs. All gene-set enrichment within DEET_enrich() uses ActivePathways with all detected genes as the background, Brown’s p-value fusion method, a false-discovery rate for p-value correction, and a cutoff of 0.05.

Then, DEET_enrich() enriches the user’s inputted genes, pathways, and TF targets against the DEGs, pathways, and TF targets stored within DEET. Enrichment of the user’s inputted gene lists against the DE comparisons within DEET is also completed with ActivePathways, with a minimum gene set filter of 15 and a maximum of 10000. Then, DEET_enrich() computes the Spearman’s and Pearson’s correlations between the coefficients of the user’s inputted DEGs and the log2(Fold-change) of the overlapping DEGs within enriched pairwise comparisons. P-values of these correlations are corrected with an FDR adjustment.

Together, DEET_enrich() returns significantly enriched studies based on overlapping DEGs, pathways, and TFs. Similarly, DEET_enrich() returns the traditional pathway and TF motif enrichment of the inputted gene list. All enrichment outputs are in the format of the output of ActivePathways (study, FDR-adjusted p-value, input length, DE comparison length, overlapping genes). DEET_enrich() also returns a data frame of the Spearman’s and Pearson’s correlations (with associated FDR-adjusted p-values) between the inputted DE list and the DEGs found in DEET, as well as the intersecting DEGs within those studies.

Optionally, DEET_enrich() may be used with a generic gene list (i.e., without p-values or coefficients). If the inputted gene list is ordered, then the p-value is artificially generated as equation 1 and the coefficient is artificially generated as equation 2. We assume an inputted list in decreasing order of significance, so the FDR and coef in equations 1 and 2 are reversed. DEET_enrich() then runs normally, but the Pearson’s correlation between the inputted gene list and the DEGs within DEET is excluded. If the inputted gene list is unordered, then all of the p-values are set to 0.049 and both the Spearman’s and Pearson’s correlations between the user’s inputted genes and the DEGs within DEET are excluded. If users do not provide a background set of genes, then we assume the background set is all genes detected within DEET.
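
The correlation step described above can be sketched in base R. This is an illustration under stated assumptions, not the code inside DEET_enrich(): the toy inputs, the object names, and the helper correlate_one() are hypothetical, and only cor(), cor.test(), and p.adjust() are standard R functions.

```r
# Toy user input: gene symbols with a coefficient (values invented).
user_de <- data.frame(
  gene_symbol = c("A", "B", "C", "D", "E"),
  coef        = c(2.1, 1.4, -0.7, -1.9, 0.5)
)

# Toy stored comparisons: named log2(Fold-change) vectors of their DEGs.
stored <- list(
  cmp1 = c(A = 1.8, B = 1.1, C = -0.9, D = -2.2, F = 0.4),
  cmp2 = c(A = -1.0, B = 0.2, C = 1.5, E = -0.8, G = 2.0)
)

# Hypothetical helper: correlate the user's coefficients with one comparison,
# using only the genes present in both.
correlate_one <- function(log2fc) {
  shared <- intersect(user_de$gene_symbol, names(log2fc))
  x <- user_de$coef[match(shared, user_de$gene_symbol)]
  y <- unname(log2fc[shared])
  data.frame(
    n_shared   = length(shared),
    pearson    = cor(x, y, method = "pearson"),
    pearson_p  = cor.test(x, y, method = "pearson")$p.value,
    spearman   = cor(x, y, method = "spearman"),
    spearman_p = cor.test(x, y, method = "spearman")$p.value
  )
}

cors <- do.call(rbind, lapply(stored, correlate_one))

# FDR adjustment across all comparisons tested.
cors$pearson_fdr  <- p.adjust(cors$pearson_p, method = "fdr")
cors$spearman_fdr <- p.adjust(cors$spearman_p, method = "fdr")
cors
```

Restricting each test to the shared genes is what makes the two coefficient vectors comparable, and adjusting after all comparisons are collected keeps the FDR correction on a common set of tests.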

For a sorted list of genes without a p-value or coefficient (note that this happens internally; you do not have to do it yourself):

DEG_list <- c("a", "b", "c", "d") # list of genes the user inputs

DEG_processed <- data.frame(gene_symbol = DEG_list)
n <- nrow(DEG_processed)

# Artificial adjusted p-values: start at 0.049 and multiply by 0.95 per gene,
# then reverse so the first (most significant) gene gets the smallest value.
padj <- rev(0.049 * 0.95^(seq_len(n) - 1))

# Artificial coefficients: 1, 1.1, 1.2, ..., reversed so the first gene
# gets the largest value.
log2fc <- rev(seq(1, 1 + 0.1 * (n - 1), by = 0.1))

DEG_processed$padj <- padj
DEG_processed$coef <- log2fc

The DEET R package also contains plotting functions to summarize the most significant studies based on each enrichment test and correlation within DEET_enrich(). The proccess_and_plot_DEET_enrich() function generates barplots of the most enriched studies, based on gene set enrichment (ActivePathways) of overlapping DEGs, pathways, and TF targets. The DEET_plot_correlation() function generates scatterplots of the most enriched studies based on Spearman's correlation analysis. All plots are generated using ggplot2, and the functions return the ggplot2 objects, allowing researchers to further modify and/or save the plots.

Lastly, the DEET R package contains a function called DEET_feature_extract(), which allows researchers to identify genes that are associated with metadata. If the response variable is continuous (e.g., number of DEGs in study, Fold-change of TP53, etc.), then features are extracted by calculating the coefficients from a Gaussian family elastic net regression using the glmnet R package, as well as the Spearman’s correlation between every gene and the response variable. If the response variable is categorical (e.g., Source, Category, etc.), then features are extracted by calculating the coefficients from a multinomial family elastic net regression, as well as an ANOVA between each category within the response variable. Lastly, if the response variable is ordinal with two levels (e.g., enriches for TNFa pathway, Cancer study yes/no, etc.), then features are extracted using a binomial family elastic net regression, as well as a Wilcoxon’s test between the two categories within the response variable.
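
The per-gene tests that accompany the elastic net can be illustrated with base R alone. The sketch below is not DEET_feature_extract() itself: the data are simulated, the object names are hypothetical, and the glmnet elastic net step is left out so that the example has no package dependencies.

```r
set.seed(1)

# Toy gene-by-comparison matrix of log2(Fold-change) values (simulated).
mat <- matrix(rnorm(3 * 12), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"), paste0("cmp", 1:12)))

# Continuous response: Spearman's correlation of each gene with the response.
response_cont <- seq_len(12) + rnorm(12)
spearman_rho <- apply(mat, 1, function(g) {
  cor(g, response_cont, method = "spearman")
})

# Two-level response: Wilcoxon test of each gene between the two groups.
response_bin <- rep(c("yes", "no"), each = 6)
wilcox_p <- apply(mat, 1, function(g) {
  wilcox.test(g ~ response_bin)$p.value
})

# Multi-level response: one-way ANOVA of each gene across the categories.
response_cat <- rep(c("SRA", "TCGA", "GTEx"), times = 4)
anova_p <- apply(mat, 1, function(g) {
  summary(aov(g ~ response_cat))[[1]][["Pr(>F)"]][1]
})

data.frame(spearman_rho, wilcox_p, anova_p)
```

Each test is run once per row of the matrix, so the result has one value per gene and can be ranked or filtered alongside the regression coefficients.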

Breaking down each core function within DEET

DEET_enrich: querying your own list against the DEGs stored within DEET.

Inputs

Examples

Running DEET with a data frame
data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
Running DEET with an ordered gene list
data("example_DEET_enrich_input")
data("DEET_example_data")

geneList <- example_DEET_enrich_input$gene_symbol
DEET_out <- DEET_enrich(geneList, DEET_dataset = DEET_example_data, ordered = TRUE)
Running DEET with an unordered gene list
data("example_DEET_enrich_input")
data("DEET_example_data")

geneList <- example_DEET_enrich_input$gene_symbol
DEET_out <- DEET_enrich(geneList, DEET_dataset = DEET_example_data, ordered = FALSE)
Differences in output between the inputted gene list types

The outputs of these three comparisons will be comparable. However, the correlation output needs care when the input gene set is just a list of genes.

When the gene set is ordered, a Spearman's correlation is interpretable, as it is simply the rank order of genes. A Pearson's correlation is not interpretable, however, because we do not know the relative difference in coefficient size of your inputted genes.

If the gene list is unordered, correlation analysis is entirely uninterpretable and is not run. You are given this message: "Input gene list is considered UNORDERED: Correlation analysis will not be run and pathway enrichment will be unordered."

Since it is not run, the output is "No variance in coefs. Cannot proceed with correlation."

Outputs

Named list where each element contains 6 objects. Each object will contain the results (enrichment or correlation) and corresponding metadata.

DEET_feature_extract

Inputs

Example

data(DEET_feature_extract_example_matrix)
data(DEET_feature_extract_example_response)
single1 <- DEET_feature_extract(DEET_feature_extract_example_matrix,
DEET_feature_extract_example_response,"categorical")

Outputs

DEET feature extract outputs a list of three objects.

Plotting the outputs of DEET_enrich.

Barplots of enrichment

The proccess_and_plot_DEET_enrich() function is a wrapper that generates barplots and a dot plot of enrichment for all of the individual outputs of DEET_enrich() (not the correlations). The outputs are ggplot2 objects, allowing users to further modify the plots or print them however they like.

Inputs

The remaining variables are graphical parameters that are passed into the DEET_enrichment_plot() function.

Examples

data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
plotting_example <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
horizontal = TRUE, topn=4)

Another example, where AP_DEET_DE_output is not significant, to show that the plotting function still works.

data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
DEET_out$AP_DEET_DE_output <- "No enrichment to be plotted"
plotting_example <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
horizontal = TRUE, topn=4)

Outputs

There are up to four outputs, assuming everything is significant. Each output is a list or a ggplot object.

DE_example <- DEET_out$AP_DEET_DE_output$results

# Changes for DEET_example_plot
DE_example$term.name <- DEET_out$AP_DEET_DE_output$metadata$DEET.Name
DE_example$domain <- "DE"
DE_example$overlap.size <- lengths(DE_example$overlap)
DE_example$p.value <- DE_example$adjusted.p.val

DE_example_plot <- DEET_enrichment_plot(list(DE_example = DE_example), "DE_example")

As shown above, from here you can also just use DEET_enrichment_plot directly to have some more control over these plots.

Scatterplots of correlations for DEET enrichment

This function also takes the direct output from DEET_enrich and generates scatterplots of the correlations of studies whose log2FCs are significantly correlated with the input DE list.

Input

correlation_input - The DE_correlations object that is the output of the DEET_enrich function. It only works if there was at least one study that was significantly correlated.

Examples

data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
correlation_input <- DEET_out$DE_correlations
correlation_plots <- DEET_plot_correlation(correlation_input)

Outputs

Using DEET gene lists with other studies and enriching two input lists simultaneously.

As mentioned previously, the gene sets within DEET are easily transferable to other gene set enrichment tools.

Saving DEET gene set for GSEA, gprofiler, etc.

One option is to download the *.gmt files directly from our ftp. For example, https://www.wilsonlab.org/public/DEET_data/DEET_DE.gmt is directly compatible with these tools.

The other option would be to save the downloaded DEET gmt as a gmt file. This is completed using the ActivePathways R package. Instead of using the example data as shown below, please use the full dataset.

Instead of saving to a temporary directory as in this vignette, save the file wherever you want to keep it.

DEET_gmt <- DEET_example_data$DEET_gmt_DE
message(paste0("DEET_gmt is an object of class gmt?: ",ActivePathways::is.GMT(DEET_gmt) ))

ActivePathways::write.GMT(DEET_gmt, file = paste0(tempdir(),"/DEET_DEs.gmt"))
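
For readers who want to see what ends up in such a file, the sketch below writes a toy gene set collection by hand with base R. It assumes the common GMT layout (one tab-delimited line per gene set: identifier, description, then the genes) and a list-of-terms structure with id, name, and genes elements. The toy sets and file name are invented, and this is an illustration of the format, not a replacement for ActivePathways::write.GMT.

```r
# Toy gene sets, each with an id, a name, and genes (assumed structure;
# real sets should come from the full DEET dataset).
toy_gmt <- list(
  cmp1 = list(id = "cmp1", name = "Toy comparison 1",
              genes = c("TP53", "MYC")),
  cmp2 = list(id = "cmp2", name = "Toy comparison 2",
              genes = c("EGFR", "KRAS", "MYC"))
)

# One tab-delimited line per gene set: id, description, gene1, gene2, ...
gmt_lines <- vapply(
  toy_gmt,
  function(term) paste(c(term$id, term$name, term$genes), collapse = "\t"),
  character(1)
)

gmt_file <- file.path(tempdir(), "toy_DEs.gmt")
writeLines(gmt_lines, gmt_file)

# Read the file back and recover the genes of each set.
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
genes_by_set <- setNames(
  lapply(fields, function(f) f[-(1:2)]),
  vapply(fields, `[`, character(1), 1)
)
genes_by_set
```

Because the format is plain text, the resulting file can be inspected in any editor before it is passed to another enrichment tool.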

Enriching two gene sets simultaneously.

If you have two dependent gene lists to input into DEET, you can use ActivePathways directly to find the combined DEET-comparison enrichment of the two gene sets.

set.seed(1234) # p-values are sampled below to make the toy example

# For the second example list, keep the same genes but shuffle the p-values

example_DEET_enrich_input$padj2 <- sample(example_DEET_enrich_input$padj, length(example_DEET_enrich_input$padj), replace = FALSE)

# Make a gene-by-input-list matrix of the adjusted p-values from your multiple gene sets

AP_matrix <- as.matrix(example_DEET_enrich_input[,c("padj", "padj2")])

# Run activepathways on the combined matrix.

# Get gmt file, again from the whole list:

DEET_gmt <- DEET_example_data$DEET_gmt_DE

head(AP_matrix)

AP_example_out <- ActivePathways::ActivePathways(scores=AP_matrix, gmt=DEET_gmt, geneset.filter = c(5,10000),correction.method = "fdr")
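
When the two gene lists do not contain exactly the same genes, the p-value matrix has to be assembled before it is passed on. The sketch below shows one way to do this in base R. It assumes, as we understand the ActivePathways documentation, that the scores matrix should have gene symbols as row names and no missing values; treating a gene absent from one list as p = 1 is a choice made for illustration. The two toy tables are invented.

```r
# Two toy DE tables with partly different genes (values invented).
de1 <- data.frame(gene_symbol = c("TP53", "MYC", "EGFR"),
                  padj = c(0.001, 0.04, 0.20))
de2 <- data.frame(gene_symbol = c("MYC", "EGFR", "KRAS"),
                  padj = c(0.03, 0.002, 0.01))

# Keep every gene seen in either list; the padj columns become padj1, padj2.
merged <- merge(de1, de2, by = "gene_symbol", all = TRUE,
                suffixes = c("1", "2"))

# Gene-by-list matrix with gene symbols as row names.
scores <- as.matrix(merged[, c("padj1", "padj2")])
rownames(scores) <- as.character(merged$gene_symbol)

# Assumption: a gene missing from one list carries no evidence there (p = 1).
scores[is.na(scores)] <- 1
scores
```

A matrix built this way can take the place of AP_matrix in the call above.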

Outputs

The outputs of using ActivePathways are the same as those of DEET_enrich(), but with a couple of extra columns.

evidence - Whether the DEET comparison is enriched because of one gene list, both gene lists, or an integrated version of these gene lists.

Genes_colname - The genes that contributed to enrichment from each inputted gene list.



DEET documentation built on June 26, 2024, 5:08 p.m.