Don't Panic.
There is a lot of stuff in this document, just go through it step by step and try not not deal with everything at once. Here is what you should do.
We need to attach EntrezID's to each gene to conduct a GO-term or GSEA analysis. However, for some genes this might fail, because no data base so far contains all known genes and transcripts. Please check the section "Regulated genes that do not map to a EntrezID" in the Data Overview tab if there are genes that are important to you and that have no EntrezID assigned.
Does the Volcano plot like you would expect?
The GO-term overrepresentation analysis uses all significant genes of your data set. Check the figures and see if you find something interesting. If there are too many GO-terms or they are too redundant, please check GO-Term Simplified.
A GSE uses all detected genes and their fold change, rather than only the significant ones. This you should only do if there is absolutely nothing of interest in your GO-term analysis. You can also use this if you have a very subtle treatment and you expect very small genetic changes.
⬇️ Download Data
You can use the download-buttons on each table to download the data in a format of your liking.
🕵🏽♀️ Data Dictionary
Here is a description of each column of the GO-term analysis result.
| Column | Description |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ID
| Gene Ontology ID |
| Description
| Gene Ontology description |
| GeneRatio
| clusterProfiler (number of significant genes associated with GO-term)/(number of significant genes in whole dataset) |
| TotalCount
| number of significant genes in whole dataset (extracted from GeneRatio
) |
| BGRatio
| clusterProfiler background ratio. (number of genes of the GO-term)/(total number of genes in dataset) |
| GOTermGeneCount
| number of genes of the GO-term (extracted from BGRatio
) |
| BackGroundCount
| total number of genes in dataset (extracted from BGRatio
) |
| p.adjust
| clusterProfiler Benjamini-Hochberg adjusted p-value, adjustment for false discovery rate (FDR) |
| geneID
| clusterProfiler Significantly regulated genes that are part of this GO-term |
| Count
| number of significant genes associated with GO-term (extracted from GeneRatio
). |
| PercentSignificant
| Percentage of differentially expressed genes of the GO-term |
Settings
library(magrittr) library(enrichplot) library(rmyknife) library(mygo) library(org.Mm.eg.db) library(AnnotationDbi) library(stats4) library(BiocGenerics) library(parallel)
# Retreive parameter settings rmd_params <- params # Print settings as table rmyknife::print_params(rmd_params, omit = "dat")
# Summary for parameter file message("Data table head") rmd_params$dat %>% head() %>% knitr::kable() # Check for mandatory columns if (!("q_value" %in% colnames(rmd_params$dat))) { stop("Column `q_value` not found in data frame") } else if (!("fc" %in% colnames(rmd_params$dat))) { stop("Column `fc` not found in data frame") } # Check if there are significant genes if (rmd_params$dat %>% dplyr::filter(q_value <= rmd_params$significance_cutoff) %>% nrow() == 0) { paste0("No significant entries found with cutoff ", rmd_params$significance_cutoff) %>% stop() } if (rmd_params$save_plots_as_pdf) { # Create path for plots file.path(rmd_params$output_path, "plots") %>% dir.create() # https://stackoverflow.com/questions/27992239/knitr-include-figures-in-report-and-output-figures-to-separate-files#comment71370705_27996644 knitr::opts_chunk$set( dev = c("png", "pdf"), fig.path = paste0(file.path(rmd_params$output_path, "plots"), .Platform$file.sep) ) }
# Filter significant genes if (!("EntrezID" %in% colnames(rmd_params$dat))) { warning("EntrezID not available in data frame. Attaching it manually") dat <- rmd_params$dat %>% rmyknife::ensembl_to_entrez( ensembl_id_name = "ensembl_gene_id", keep_only_rows_with_entrez = TRUE, drop_duplicates = TRUE ) %>% rmyknife::attach_gene_symbol_from_entrez() # Get regulated genes that don't have a EntrezID # https://github.sf.mpg.de/bruening-lab/2018-01-Weiyi-Differential-Gene-Expression-Analysis/blob/master/Analysis_DESeq.Rmd rmd_params$dat %>% dplyr::filter(q_value <= rmd_params$significance_cutoff) %>% dplyr::anti_join(dat %>% dplyr::filter(q_value <= rmd_params$significance_cutoff), by = "ensembl_gene_id") %>% # Handle the case of all genes mapping to Entrez ID's (function(not_mapping) { if (not_mapping %>% nrow() > 0) { # Attach gene name from Biomart entry message("Regulated genes that do not map to a EntrezID:") not_mapping %>% rmyknife::attach_biomart(attributes = "external_gene_name", verbose = FALSE) %>% knitr::kable() %>% print() } else { message("Good news, all regulated genes map to EntrezID's") } }) message("Check for genes with duplicated entries. We will consider the ones with the lowest p-value") dat %<>% add_is_duplicated("EntrezID") dat %>% dplyr::filter(EntrezID_is_duplicated == T) %>% knitr::kable() %>% print() dat %<>% # For duplicated Entrez IDs only keep the ones with the lowest p-value dplyr::group_by(EntrezID) %>% dplyr::summarize(q_value = min(q_value)) %>% # Join back with the original data set in which we will also remove entries that have the same EntrwzID, q-value and FC # Because it can happen that a EntrezID maps to different Ensembl-IDs dplyr::inner_join(dat %>% dplyr::distinct(EntrezID, q_value, fc, .keep_all = TRUE)) # Finally check if we still have problems regarding multiple entries .temp_multiple_entries <- dat %>% add_is_duplicated("EntrezID") %>% dplyr::filter(EntrezID_is_duplicated == T) if (nrow(.temp_multiple_entries) > 0) { warning("Duplicates per EntrezID, check input data") .temp_multiple_entries %>% knitr::kable() } # Remove obsolete rows dat %<>% dplyr::select(-EntrezID_is_duplicated) } # Get names vector from EntrezIDs # dat %<>% attach_gene_symbol_from_entrez() # Named double vector of fold changes named after the gene symbol fc_symbol <- dat %>% .$fc %>% as.double() %>% `names<-`(dat$Symbol)
# Expect a data set containing q_values and Entrez/Ensembl IDs dat_goterms <- dat %>% get_go_all_ontologies( significance_cutoff = rmd_params$significance_cutoff, use_background = rmd_params$use_background ) # Get simplified ontologies if required if (rmd_params$simplify_ontologies) { dat_goterms_simplified <- simplify_ontologies( ontologies = dat_goterms, cutoff = rmd_params$simplify_cutoff ) } if (rmd_params$do_gse) { dat_gse_go <- dat %>% get_gse_all_ontologies(p_cutoff = rmd_params$significance_cutoff) dat_gse_kegg <- dat %>% get_kegg() } else { dat_gse_go <- list() dat_gse_go$Biological_Process <- tibble::tibble() dat_gse_go$Molecular_Function <- tibble::tibble() dat_gse_go$Cellular_Components <- tibble::tibble() dat_gse_kegg <- list() dat_gse_kegg$kegg <- tibble::tibble() }
# Create dummy variable for for simplified gene ontologies if we did not calculate it if (!rmd_params$simplify_ontologies) { dat_goterms_simplified <- data.frame() } # Store clusterProfiler objects to disk if (rmd_params$store_r_objects) { dat_goterms %>% saveRDS(file = file.path(rmd_params$output_path, "dat_goterms.rds")) if (rmd_params$do_gse) { dat_gse_go %>% saveRDS(file = file.path(rmd_params$output_path, "dat_gse_go.rds")) dat_gse_kegg %>% saveRDS(file = file.path(rmd_params$output_path, "dat_gse_kegg.rds")) } if (rmd_params$simplify_ontologies) { dat_goterms_simplified %>% saveRDS(file = file.path(rmd_params$output_path, "dat_goterms_simplified.rds")) } } # Export result if (rmd_params$save_excel) { export_path <- paste0(rmd_params$output_path, "/go_terms.xlsx") export_go_terms_to_excel( go_ontologies = dat_goterms, go_ontologies_simple = dat_goterms_simplified, gse_ontologies = dat_gse_go, kegg_ontologies = dat_gse_kegg, path = export_path ) }
Dataset summary
r nrow(dat)
genes.r dplyr::filter(dat, q_value <= 0.05) %>% nrow()
genes are significantly regulated.r dplyr::filter(dat, q_value <= 0.05 & fc > 0) %>% nrow()
genes are significantly up-regulated.r dplyr::filter(dat, q_value <= 0.05 & fc < 0) %>% nrow()
genes are significantly down-regulated.dat %>% # Draw Volcano plot ggplot2::ggplot( ggplot2::aes( x = fc, y = -log10(q_value), color = (q_value <= 0.05) %>% ifelse(., "significant", "not significant") ) ) + ggplot2::geom_point( alpha = 0.3, size = 0.5 ) + ggplot2::scale_color_manual(values = c("grey", "blue")) + ggplot2::xlab(expression(log[2](fc))) + ggplot2::ylab(expression(-log[10](adjusted ~ p ~ value))) + ggplot2::labs(colour = "Significance") + ggplot2::theme_minimal() + ggrepel::geom_text_repel( data = . %>% dplyr::arrange(q_value) %>% head(10), mapping = ggplot2::aes(label = Symbol), size = 3 ) + ggplot2::geom_hline( yintercept = -log10(0.05), linetype = "dotted" )
Gene-Ontology (GO)-term analysis is also called a GO-over represenation test.
We use the over-representation test implemented in clusterProfiler and described here
The over-representation test is carried out internally with the DOSE package and is described here
For details on the hypergeometric distribution, refer to GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes
Here is the list of all regulated GO-terms sorted by significance.
dat_goterms$Biological_Process %>% print_goterm_as_datatable()
The Percentage Overlap Plot plots the top 25 GO-terms ordered by
The adjusted p-value is mapped to color and the number of significant genes to circle size.
dat_goterms$Biological_Process %>% mygo::attach_goterm_genecount() %>% mygo::overlap_percentage_plot(order_by = "significance")
The Enrich Map plot shows overlapping genes between GO-terms as grey edged between each node. Each node represents a regulated GO-term. Only the top 50 regulated GO-terms are plotted here.
dat_goterms$Biological_Process %>% mygo::emap_plot("Enrich Map Plot Plot", 50)
The following scatter plot displays all regulated GO-terms w.r.t total GO-term gene count (GO-term size) on the x-axis and percentage of significant genes (how many of the GO-term genes are significantly regulated) on the y-axis. The top 15 regulated GO-terms are labeled.
dat_goterms$Biological_Process %>% mygo::attach_goterm_genecount() %>% mygo::overlap_scatterplot()
dat_goterms$Molecular_Function %>% print_goterm_as_datatable()
dat_goterms$Molecular_Function %>% mygo::attach_goterm_genecount() %>% mygo::overlap_percentage_plot(order_by = "significance")
dat_goterms$Molecular_Function %>% mygo::emap_plot("Enrich Map Plot Plot", 50)
dat_goterms$Molecular_Function %>% mygo::attach_goterm_genecount() %>% mygo::overlap_scatterplot()
dat_goterms$Cellular_Components %>% print_goterm_as_datatable()
dat_goterms$Cellular_Components %>% mygo::attach_goterm_genecount() %>% mygo::overlap_percentage_plot(order_by = "significance")
dat_goterms$Cellular_Components %>% mygo::emap_plot("Enrich Map Plot Plot", 50)
dat_goterms$Cellular_Components %>% mygo::attach_goterm_genecount() %>% mygo::overlap_scatterplot()
If the number of GO-terms is overwhelming, we can simplify them by removing redundant terms. For details, refer to this blog post.
if (!rmd_params$simplify_ontologies) { message("Simplified ontologies deactivated.") }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Biological_Process %>% print_goterm_as_datatable() }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Biological_Process %>% mygo::attach_goterm_genecount() %>% mygo::overlap_percentage_plot(order_by = "significance") }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Biological_Process %>% mygo::emap_plot("Enrich Map Plot Plot", 50) }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Biological_Process %>% mygo::attach_goterm_genecount() %>% mygo::overlap_scatterplot() }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Molecular_Function %>% print_goterm_as_datatable() }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Molecular_Function %>% mygo::attach_goterm_genecount() %>% mygo::overlap_percentage_plot(order_by = "significance") }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Molecular_Function %>% mygo::emap_plot("Enrich Map Plot Plot", 50) }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Molecular_Function %>% mygo::attach_goterm_genecount() %>% mygo::overlap_scatterplot() }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Cellular_Components %>% print_goterm_as_datatable() }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Cellular_Components %>% mygo::attach_goterm_genecount() %>% mygo::overlap_percentage_plot(order_by = "significance") }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Cellular_Components %>% mygo::emap_plot("Enrich Map Plot Plot", 50) }
if (rmd_params$simplify_ontologies) { dat_goterms_simplified$Cellular_Components %>% mygo::attach_goterm_genecount() %>% mygo::overlap_scatterplot() }
if (rmd_params$do_gse) { dat_gse_go %>% plot_all_ontologies(fc_symbol) } else { print("GSE analysis disabled") }
Here we apply the GSE analysis to KEGG pathways.
if (rmd_params$do_gse) { dat_gse_kegg %>% plot_all_ontologies(fc_symbol) } else { print("GSE analysis disabled") }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.