calculate_go_enrichment: Perform gene ontology enrichment analysis
In protti: Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools

View source: R/calculate_go_enrichment.R

calculate_go_enrichment

R Documentation

Perform gene ontology enrichment analysis

Description

Analyses enrichment of gene ontology terms associated with proteins in the fraction of significant proteins compared to all detected proteins. A two-sided Fisher's exact test is performed to test significance of enrichment or depletion. GO annotations can be provided to this function either through UniProt go_annotations_uniprot, through a table obtained with fetch_go in the go_data argument or GO annotations are fetched automatically by the function by providing ontology_type and organism_id.

Usage

calculate_go_enrichment(
  data,
  protein_id,
  is_significant,
  group = NULL,
  y_axis_free = TRUE,
  facet_n_col = 2,
  go_annotations_uniprot = NULL,
  ontology_type,
  organism_id = NULL,
  go_data = NULL,
  plot = TRUE,
  plot_style = "barplot",
  plot_title = "Gene ontology enrichment of significant proteins",
  barplot_fill_colour = c("#56B4E9", "#E76145"),
  heatmap_fill_colour = protti::mako_colours,
  heatmap_fill_colour_rev = TRUE,
  label = TRUE,
  enrichment_type = "all",
  replace_long_name = TRUE,
  label_move_frac = 0.2,
  min_n_detected_proteins_in_process = 1,
  plot_cutoff = "adj_pval top10"
)

Arguments

`data`	a data frame that contains at least the input variables.
`protein_id`	a character column in the `data` data frame that contains the protein accession numbers.
`is_significant`	a logical column in the `data` data frame that indicates if the corresponding protein has a significantly changing peptide. The input data frame may contain peptide level information with significance information. The function is able to extract protein level information from this.
`group`	optional, character column in the `data` data frame that contains information by which the analysis should be grouped. The analysis will be performed separately for each of the groups. This is most likely a column that labels separate comparisons of different conditions. In protti the `assign_missingness()` function creates such a column automatically.
`y_axis_free`	a logical value that specifies if the y-axis of the plot should be "free" for each facet if a grouping variable is provided. Default is `TRUE`. If `FALSE` is selected it is easier to compare GO categories directly with each other.
`facet_n_col`	a numeric value that specifies the number of columns the faceted plot should have if a column name is provided to group. The default is 2.
`go_annotations_uniprot`	recommended, a character column in the `data` data frame that contains gene ontology annotations obtained from UniProt using `fetch_uniprot`. These annotations are already separated into the desired ontology type so the argument `ontology_type` is not required.
`ontology_type`	optional, character value specifying the type of ontology that should be used. Possible values are molecular function (MF), biological process (BP), cellular component (CC). This argument is not required if GO annotations are provided from UniProt in `go_annotations_uniprot`. It is required if annotations are provided through `go_data` or automatically fetched.
`organism_id`	optional, character value specifying an NCBI taxonomy identifier of an organism (TaxId). Possible inputs include only: "9606" (Human), "559292" (Yeast) and "83333" (E. coli). Is only necessary if GO data is not provided either by `go_annotations_uniprot` or in `go_data`.
`go_data`	Optional, a data frame that can be obtained with `fetch_go()`. If you provide data not obtained with `fetch_go()` make sure column names for protein ID (`db_id`) and GO ID (`go_id`) are the same as for data obtained with `fetch_go()`.
`plot`	a logical argument indicating whether the result should be plotted or returned as a table.
`plot_style`	a character argument that specifies the plot style. Can be either "barplot" (default) or "heatmap". The "heatmap" plot is especially useful for the comparison of multiple groups. We recommend, however, that you use it only with `enrichment_type = "enriched"` or `⁠enrichment_type = "deenriched⁠`, because otherwise it is not possible to distinguish between enrichment and deenrichment in the plot.
`plot_title`	a character value that specifies the title of the plot. The default is "Gene ontology enrichment of significant proteins".
`barplot_fill_colour`	a vector that contains two colours that should be used as the fill colours for deenriched and enriched GO terms, respectively. If `enrichment_type = "enriched"` or `⁠"deenriched⁠`, please still provide two values in the vector, the colour not used for the plot can be `NA` in this case. E.g. `c(NA, "red")` for `enrichment_type = "enriched"`.
`heatmap_fill_colour`	a vector that contains colours that should be used to create the gradient in the heatmap plot. Default is `mako_colours`.
`heatmap_fill_colour_rev`	a logical value that specifies if the provided colours in `heatmap_fill_colour` should be reversed in order. Default is `TRUE`.
`label`	a logical argument indicating whether labels should be added to the plot. Default is TRUE.
`enrichment_type`	a character argument that is either "all", "enriched" or "deenriched". This determines if the enrichment analysis should be performed in order to check for both enrichemnt and deenrichemnt or only one of the two. This affects the statistics performed and therefore also the displayed plot.
`replace_long_name`	a logical argument that specifies if GO term names above 50 characters should be replaced by the GO ID instead for the plot. This ensures that the plotting area doesn't become too small due to the long name. The default is `TRUE`.
`label_move_frac`	a numeric argument between 0 and 1 that specifies which labels should be moved outside of the bar. The default is 0.2, which means that the labels of all bars that have a size of 20% or less of the largest bar are moved to the right of the bar. This prevents labels from overlapping with the bar boundaries.
`min_n_detected_proteins_in_process`	is a numeric argument that specifies the minimum number of detected proteins required for a GO term to be displayed in the plot. The default is 1, meaning no filtering of the plotted data is performed. This argument does not affect any computations or the returned data if `plot = FALSE`. This argument is useful in order to remove terms that were only detected in for example 1 protein. Even though these terms are sometimes significant, they are not really relevant.
`plot_cutoff`	a character value indicating if the plot should contain the top n (e.g. top10) most significant proteins (p-value or adjusted p-value), or if a significance cutoff should be used to determine the number of GO terms in the plot. This information should be provided with the type first followed by the threshold separated by a space. Example are `plot_cutoff = "adj_pval top10"`, `plot_cutoff = "pval 0.05"` or `plot_cutoff = "adj_pval 0.01"`. The threshold can be chosen freely. The default value is `"adj_pval top10"`.

Value

A bar plot or heatmap (depending on plot_style). By default the bar plot displays negative log10 adjusted p-values for the top 10 enriched or deenriched gene ontology terms. Alternatively, plot cutoffs can be chosen individually with the plot_cutoff argument. Bars are colored according to the direction of the enrichment (enriched or deenriched). If a heatmap is returned, terms are organised on the y-axis, while the colour of each tile represents the negative log10 adjusted p-value (default). If a group column is provided the x-axis contains all groups. If plot = FALSE, a data frame is returned. P-values are adjusted with Benjamini-Hochberg.

Examples


# Load libraries
library(dplyr)
library(stringr)

# Create example data
# Contains artificial de-enrichment for ribosomes.
uniprot_go_data <- fetch_uniprot_proteome(
  organism_id = 83333,
  columns = c(
    "accession",
    "go_f"
  )
)

if (!is(uniprot_go_data, "character")) {
  data <- uniprot_go_data %>%
    mutate(significant = c(
      rep(TRUE, 1000),
      rep(FALSE, n() - 1000)
    )) %>%
    mutate(significant = ifelse(
      str_detect(
        go_f,
        pattern = "ribosome"
      ),
      FALSE,
      significant
    )) %>%
    mutate(group = c(
      rep("A", 500),
      rep("B", 500),
      rep("A", (n() - 1000) / 2),
      rep("B", round((n() - 1000) / 2))
    ))

  # Plot gene ontology enrichment
  calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    plot = TRUE,
    plot_cutoff = "pval 0.01"
  )

  # Plot gene ontology enrichment with group
  calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    group = group,
    facet_n_col = 1,
    plot = TRUE,
    plot_cutoff = "pval 0.01"
  )

  # Plot gene ontology enrichment with group in a heatmap plot
  calculate_go_enrichment(
    data,
    protein_id = accession,
    group = group,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    min_n_detected_proteins_in_process = 15,
    plot = TRUE,
    label = TRUE,
    plot_style = "heatmap",
    enrichment_type = "enriched",
    plot_cutoff = "pval 0.01"
  )

  # Calculate gene ontology enrichment
  go_enrichment <- calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    plot = FALSE,
  )

  head(go_enrichment, n = 10)
}

protti documentation built on Oct. 22, 2024, 1:06 a.m.