Don't Panic.

There is a lot of stuff in this document, just go through it step by step and try not not deal with everything at once. Here is what you should do.

  1. 🦅 Check the Data Overview

We need to attach EntrezID's to each gene to conduct a GO-term or GSEA analysis. However, for some genes this might fail, because no data base so far contains all known genes and transcripts. Please check the section "Regulated genes that do not map to a EntrezID" in the Data Overview tab if there are genes that are important to you and that have no EntrezID assigned.

  1. 🌋 Check the Volcano plot.

Does the Volcano plot like you would expect?

  1. 🗺 Check GO-terms (GO-Term All)

The GO-term overrepresentation analysis uses all significant genes of your data set. Check the figures and see if you find something interesting. If there are too many GO-terms or they are too redundant, please check GO-Term Simplified.

  1. 🗺 Optional: Check GSE GO Analysis

A GSE uses all detected genes and their fold change, rather than only the significant ones. This you should only do if there is absolutely nothing of interest in your GO-term analysis. You can also use this if you have a very subtle treatment and you expect very small genetic changes.

⬇️ Download Data

You can use the download-buttons on each table to download the data in a format of your liking.

🕵🏽‍♀️ Data Dictionary

Here is a description of each column of the GO-term analysis result.

| Column | Description | |----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------| | ID | Gene Ontology ID | | Description | Gene Ontology description | | GeneRatio | clusterProfiler (number of significant genes associated with GO-term)/(number of significant genes in whole dataset) | | TotalCount | number of significant genes in whole dataset (extracted from GeneRatio) | | BGRatio | clusterProfiler background ratio. (number of genes of the GO-term)/(total number of genes in dataset) | | GOTermGeneCount | number of genes of the GO-term (extracted from BGRatio) | | BackGroundCount | total number of genes in dataset (extracted from BGRatio) | | p.adjust | clusterProfiler Benjamini-Hochberg adjusted p-value, adjustment for false discovery rate (FDR) | | geneID | clusterProfiler Significantly regulated genes that are part of this GO-term | | Count | number of significant genes associated with GO-term (extracted from GeneRatio). | | PercentSignificant | Percentage of differentially expressed genes of the GO-term |

{.tabset}

Data Overview

Settings

library(magrittr)
library(enrichplot)
library(rmyknife)
library(mygo)
library(org.Mm.eg.db)
library(AnnotationDbi)
library(stats4)
library(BiocGenerics)
library(parallel)
# Retreive parameter settings

rmd_params <- params

# Print settings as table
rmyknife::print_params(rmd_params, omit = "dat")
# Summary for parameter file
message("Data table head")
rmd_params$dat %>%
  head() %>%
  knitr::kable()

# Check for mandatory columns
if (!("q_value" %in% colnames(rmd_params$dat))) {
  stop("Column `q_value` not found in data frame")
} else if (!("fc" %in% colnames(rmd_params$dat))) {
  stop("Column `fc` not found in data frame")
}

# Check if there are significant genes
if (rmd_params$dat %>% dplyr::filter(q_value <= rmd_params$significance_cutoff) %>% nrow() == 0) {
  paste0("No significant entries found with cutoff ", rmd_params$significance_cutoff) %>%
    stop()
}

if (rmd_params$save_plots_as_pdf) {
  # Create path for plots
  file.path(rmd_params$output_path, "plots") %>%
    dir.create()
  # https://stackoverflow.com/questions/27992239/knitr-include-figures-in-report-and-output-figures-to-separate-files#comment71370705_27996644
  knitr::opts_chunk$set(
    dev = c("png", "pdf"),
    fig.path = paste0(file.path(rmd_params$output_path, "plots"), .Platform$file.sep)
  )
}
# Filter significant genes
if (!("EntrezID" %in% colnames(rmd_params$dat))) {
  warning("EntrezID not available in data frame. Attaching it manually")
  dat <- rmd_params$dat %>%
    rmyknife::ensembl_to_entrez(
      ensembl_id_name = "ensembl_gene_id",
      keep_only_rows_with_entrez = TRUE,
      drop_duplicates = TRUE
    ) %>%
    rmyknife::attach_gene_symbol_from_entrez()

  # Get regulated genes that don't have a EntrezID
  # https://github.sf.mpg.de/bruening-lab/2018-01-Weiyi-Differential-Gene-Expression-Analysis/blob/master/Analysis_DESeq.Rmd
  rmd_params$dat %>%
    dplyr::filter(q_value <= rmd_params$significance_cutoff) %>%
    dplyr::anti_join(dat %>% dplyr::filter(q_value <= rmd_params$significance_cutoff), by = "ensembl_gene_id") %>%
    # Handle the case of all genes mapping to Entrez ID's
    (function(not_mapping) {
      if (not_mapping %>% nrow() > 0) {
        # Attach gene name from Biomart entry
        message("Regulated genes that do not map to a EntrezID:")
        not_mapping %>%
          rmyknife::attach_biomart(attributes = "external_gene_name", verbose = FALSE) %>%
          knitr::kable() %>%
          print()
      } else {
        message("Good news, all regulated genes map to EntrezID's")
      }
    })

  message("Check for genes with duplicated entries. We will consider the ones with the lowest p-value")

  dat %<>% add_is_duplicated("EntrezID")
  dat %>%
    dplyr::filter(EntrezID_is_duplicated == T) %>%
    knitr::kable() %>%
    print()

  dat %<>%
    # For duplicated Entrez IDs only keep the ones with the lowest p-value
    dplyr::group_by(EntrezID) %>%
    dplyr::summarize(q_value = min(q_value)) %>%
    # Join back with the original data set in which we will also remove entries that have the same EntrwzID, q-value and FC
    # Because it can happen that a EntrezID maps to different Ensembl-IDs
    dplyr::inner_join(dat %>% dplyr::distinct(EntrezID, q_value, fc, .keep_all = TRUE))
  # Finally check if we still have problems regarding multiple entries
  .temp_multiple_entries <- dat %>%
    add_is_duplicated("EntrezID") %>%
    dplyr::filter(EntrezID_is_duplicated == T)
  if (nrow(.temp_multiple_entries) > 0) {
    warning("Duplicates per EntrezID, check input data")
    .temp_multiple_entries %>% knitr::kable()
  }
  # Remove obsolete rows
  dat %<>% dplyr::select(-EntrezID_is_duplicated)
}
# Get names vector from EntrezIDs
# dat %<>% attach_gene_symbol_from_entrez()
# Named double vector of fold changes named after the gene symbol
fc_symbol <- dat %>%
  .$fc %>%
  as.double() %>%
  `names<-`(dat$Symbol)
# Expect a data set containing q_values and Entrez/Ensembl IDs
dat_goterms <- dat %>% get_go_all_ontologies(
  significance_cutoff = rmd_params$significance_cutoff,
  use_background = rmd_params$use_background
)
# Get simplified ontologies if required
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified <- simplify_ontologies(
    ontologies = dat_goterms,
    cutoff = rmd_params$simplify_cutoff
  )
}
if (rmd_params$do_gse) {
  dat_gse_go <- dat %>% get_gse_all_ontologies(p_cutoff = rmd_params$significance_cutoff)
  dat_gse_kegg <- dat %>% get_kegg()
} else {
  dat_gse_go <- list()
  dat_gse_go$Biological_Process <- tibble::tibble()
  dat_gse_go$Molecular_Function <- tibble::tibble()
  dat_gse_go$Cellular_Components <- tibble::tibble()
  dat_gse_kegg <- list()
  dat_gse_kegg$kegg <- tibble::tibble()
}
# Create dummy variable for for simplified gene ontologies if we did not calculate it
if (!rmd_params$simplify_ontologies) {
  dat_goterms_simplified <- data.frame()
}

# Store clusterProfiler objects to disk
if (rmd_params$store_r_objects) {
  dat_goterms %>%
    saveRDS(file = file.path(rmd_params$output_path, "dat_goterms.rds"))

  if (rmd_params$do_gse) {
    dat_gse_go %>%
      saveRDS(file = file.path(rmd_params$output_path, "dat_gse_go.rds"))
    dat_gse_kegg %>%
      saveRDS(file = file.path(rmd_params$output_path, "dat_gse_kegg.rds"))
  }
  if (rmd_params$simplify_ontologies) {
    dat_goterms_simplified %>%
      saveRDS(file = file.path(rmd_params$output_path, "dat_goterms_simplified.rds"))
  }
}

# Export result
if (rmd_params$save_excel) {
  export_path <- paste0(rmd_params$output_path, "/go_terms.xlsx")
  export_go_terms_to_excel(
    go_ontologies = dat_goterms,
    go_ontologies_simple = dat_goterms_simplified,
    gse_ontologies = dat_gse_go,
    kegg_ontologies = dat_gse_kegg,
    path = export_path
  )
}

Volcano Plot

Dataset summary

dat %>%
  # Draw Volcano plot
  ggplot2::ggplot(
    ggplot2::aes(
      x = fc,
      y = -log10(q_value),
      color = (q_value <= 0.05) %>% ifelse(., "significant", "not significant")
    )
  ) +
  ggplot2::geom_point(
    alpha = 0.3,
    size = 0.5
  ) +
  ggplot2::scale_color_manual(values = c("grey", "blue")) +
  ggplot2::xlab(expression(log[2](fc))) +
  ggplot2::ylab(expression(-log[10](adjusted ~ p ~ value))) +
  ggplot2::labs(colour = "Significance") +
  ggplot2::theme_minimal() +
  ggrepel::geom_text_repel(
    data = . %>%
      dplyr::arrange(q_value) %>%
      head(10),
    mapping = ggplot2::aes(label = Symbol),
    size = 3
  ) +
  ggplot2::geom_hline(
    yintercept = -log10(0.05),
    linetype = "dotted"
  )

GO-Term All

Gene-Ontology (GO)-term analysis is also called a GO-over represenation test.

We use the over-representation test implemented in clusterProfiler and described here

The over-representation test is carried out internally with the DOSE package and is described here

For details on the hypergeometric distribution, refer to GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes

Biological Process

Here is the list of all regulated GO-terms sorted by significance.

dat_goterms$Biological_Process %>%
  print_goterm_as_datatable()

The Percentage Overlap Plot plots the top 25 GO-terms ordered by

  1. significance and
  2. overlap percentage in each GO-term

The adjusted p-value is mapped to color and the number of significant genes to circle size.

dat_goterms$Biological_Process %>%
  mygo::attach_goterm_genecount() %>%
  mygo::overlap_percentage_plot(order_by = "significance")

The Enrich Map plot shows overlapping genes between GO-terms as grey edged between each node. Each node represents a regulated GO-term. Only the top 50 regulated GO-terms are plotted here.

dat_goterms$Biological_Process %>%
  mygo::emap_plot("Enrich Map Plot Plot", 50)

The following scatter plot displays all regulated GO-terms w.r.t total GO-term gene count (GO-term size) on the x-axis and percentage of significant genes (how many of the GO-term genes are significantly regulated) on the y-axis. The top 15 regulated GO-terms are labeled.

dat_goterms$Biological_Process %>%
  mygo::attach_goterm_genecount() %>%
  mygo::overlap_scatterplot()

Molecular Function

dat_goterms$Molecular_Function %>%
  print_goterm_as_datatable()
dat_goterms$Molecular_Function %>%
  mygo::attach_goterm_genecount() %>%
  mygo::overlap_percentage_plot(order_by = "significance")
dat_goterms$Molecular_Function %>%
  mygo::emap_plot("Enrich Map Plot Plot", 50)
dat_goterms$Molecular_Function %>%
  mygo::attach_goterm_genecount() %>%
  mygo::overlap_scatterplot()

Cellular Components

dat_goterms$Cellular_Components %>%
  print_goterm_as_datatable()
dat_goterms$Cellular_Components %>%
  mygo::attach_goterm_genecount() %>%
  mygo::overlap_percentage_plot(order_by = "significance")
dat_goterms$Cellular_Components %>%
  mygo::emap_plot("Enrich Map Plot Plot", 50)
dat_goterms$Cellular_Components %>%
  mygo::attach_goterm_genecount() %>%
  mygo::overlap_scatterplot()

GO-Term Simplified

If the number of GO-terms is overwhelming, we can simplify them by removing redundant terms. For details, refer to this blog post.

if (!rmd_params$simplify_ontologies) {
  message("Simplified ontologies deactivated.")
}

Biological Process

if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Biological_Process %>%
    print_goterm_as_datatable()
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Biological_Process %>%
    mygo::attach_goterm_genecount() %>%
    mygo::overlap_percentage_plot(order_by = "significance")
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Biological_Process %>%
    mygo::emap_plot("Enrich Map Plot Plot", 50)
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Biological_Process %>%
    mygo::attach_goterm_genecount() %>%
    mygo::overlap_scatterplot()
}

Molecular Function

if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Molecular_Function %>%
    print_goterm_as_datatable()
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Molecular_Function %>%
    mygo::attach_goterm_genecount() %>%
    mygo::overlap_percentage_plot(order_by = "significance")
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Molecular_Function %>%
    mygo::emap_plot("Enrich Map Plot Plot", 50)
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Molecular_Function %>%
    mygo::attach_goterm_genecount() %>%
    mygo::overlap_scatterplot()
}

Cellular Components

if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Cellular_Components %>%
    print_goterm_as_datatable()
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Cellular_Components %>%
    mygo::attach_goterm_genecount() %>%
    mygo::overlap_percentage_plot(order_by = "significance")
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Cellular_Components %>%
    mygo::emap_plot("Enrich Map Plot Plot", 50)
}
if (rmd_params$simplify_ontologies) {
  dat_goterms_simplified$Cellular_Components %>%
    mygo::attach_goterm_genecount() %>%
    mygo::overlap_scatterplot()
}

GSE GO Analysis

if (rmd_params$do_gse) {
  dat_gse_go %>% plot_all_ontologies(fc_symbol)
} else {
  print("GSE analysis disabled")
}

GSE KEGG Pathways

Here we apply the GSE analysis to KEGG pathways.

if (rmd_params$do_gse) {
  dat_gse_kegg %>% plot_all_ontologies(fc_symbol)
} else {
  print("GSE analysis disabled")
}


paulklemm/mygo documentation built on July 27, 2023, 11:36 a.m.