go_reduce: Reduce redundancy of human GO terms
In RHReynolds/rutils: Common utility functions

Description

This function reduces GO term redundancy in two steps. First, it creates a semantic similarity matrix using GOSemSim::mgoSim(). This matrix is then passed to rrvgo::reduceSimMatrix(), which reduces a set of GO terms based on their semantic similarity and scores. If no scores are supplied, a default score based on set size is assigned.

Usage

go_reduce(
  pathway_df,
  orgdb = "org.Hs.eg.db",
  threshold = 0.7,
  scores = NULL,
  measure = "Wang"
)

Arguments

`pathway_df`	a `data.frame` or tibble object with the following columns: `go_type`: the sub-ontology the GO term belongs to; must be one of `c("BP", "CC", "MF")`. `go_id`: the Gene Ontology identifier (e.g. `GO:0016209`).
`orgdb`	`character()` vector, indicating name of the org.* Bioconductor package to be used
`threshold`	`numeric()` vector. Similarity threshold (0-1) for `rrvgo::reduceSimMatrix()`. Default is 0.7. Some guidance: for large term groupings, use `threshold = 0.9`; for medium term groupings, use `threshold = 0.7`; for small term groupings, use `threshold = 0.5`; for tiny term groupings, use `threshold = 0.4`.
`scores`	named vector, with scores (weights) assigned to each term. Higher is better. Can be `NULL` (the default), in which case a default score based on set size is assigned, thus favoring larger sets. Note: if you use p-values as scores, consider transforming them with `-log10(p)` so that smaller p-values receive higher scores.
`measure`	`character()` vector, indicating the method used to calculate semantic similarity. Must be one of the methods supported by GOSemSim: `c("Resnik", "Lin", "Rel", "Jiang", "Wang")`. Default is "Wang".
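
As a minimal sketch of how the inputs described above might be prepared, the snippet below builds a small `pathway_df` and derives a named `scores` vector from p-values. The `p_value` column is hypothetical (it is not required by `go_reduce()`), the GO IDs are placeholders, and it is an assumption here that the names of `scores` should be the GO IDs.

```r
# Hypothetical input: three GO terms with made-up p-values.
# Only `go_type` and `go_id` are required by go_reduce();
# `p_value` is an illustrative extra column.
pathway_df <- data.frame(
    go_type = c("MF", "BP", "CC"),
    go_id = c("GO:0016209", "GO:0008150", "GO:0005575"),
    p_value = c(0.001, 0.01, 0.1),
    stringsAsFactors = FALSE
)

# Transform p-values so that more significant terms get higher scores,
# then name each score by its GO ID (assumed naming convention).
scores <- stats::setNames(-log10(pathway_df$p_value), pathway_df$go_id)
```

The resulting `scores` vector could then be passed as the `scores` argument in place of `NULL`.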

Details

By default, semantic similarity is calculated using the "Wang" method, a graph-based strategy that computes semantic similarity from the topology of the GO graph. GOSemSim::mgoSim() also supports other measures (primarily information-content measures), but "Wang" is the default in GOSemSim and is therefore the default here. If you wish to use a different measure, set the measure argument and refer to the GOSemSim documentation.

rrvgo::reduceSimMatrix() creates a distance matrix, defined as (1-simMatrix). The terms are then hierarchically clustered using complete linkage (an agglomerative, or "bottom-up" clustering approach), and the tree is cut at the desired threshold. The term with the highest "score" is used to represent each group.
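
The clustering step described above can be illustrated with base R alone. This is a sketch of the general technique on a toy matrix, not the rrvgo implementation: the term names, similarity values, and scores are made up, and the output column names merely imitate those listed under Value.

```r
# Toy similarity matrix for four hypothetical terms (values are invented).
terms <- c("A", "B", "C", "D")
sim_matrix <- matrix(
    c(1.0, 0.9, 0.2, 0.1,
      0.9, 1.0, 0.3, 0.2,
      0.2, 0.3, 1.0, 0.8,
      0.1, 0.2, 0.8, 1.0),
    nrow = 4, byrow = TRUE,
    dimnames = list(terms, terms)
)
scores <- c(A = 10, B = 25, C = 5, D = 3) # higher is better
threshold <- 0.7

# Convert similarity to distance, cluster with complete linkage,
# and cut the tree at the chosen threshold.
tree <- stats::hclust(stats::as.dist(1 - sim_matrix), method = "complete")
cluster <- stats::cutree(tree, h = threshold)

# Within each cluster, the highest-scoring term becomes the representative.
parent_by_cluster <- vapply(
    split(names(cluster), cluster),
    function(members) members[which.max(scores[members])],
    character(1)
)
parent_id <- unname(parent_by_cluster[as.character(cluster)])

result <- data.frame(
    go_id = names(cluster),
    parent_id = parent_id,
    # Similarity between each term and its representative
    parent_sim_score = sim_matrix[cbind(names(cluster), parent_id)],
    stringsAsFactors = FALSE
)
```

With these toy values, A and B fall into one cluster represented by B, and C and D into another represented by C.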

Value

A tibble of pathway results in which each pathway has been assigned a "reduced" parent term. New columns:

parent_id: the GO ID of the parent term
parent_term: a description of the parent GO term
parent_sim_score: the similarity score between the child GO term and its parent term

References

Yu et al. (2010). GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics (Oxford, England), 26(7), 976–978. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
Yu (2021). Biomedical Knowledge Mining using GOSemSim and clusterProfiler. https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
Sayols S (2020). rrvgo: a Bioconductor package to reduce and visualize Gene Ontology terms. https://ssayols.github.io/rrvgo

Examples

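# Locate the test data shipped with the package
file_path <-
    system.file(
        "testdata",
        "go_test_data.txt",
        package = "rutils",
        mustWork = TRUE
    )

# Read the tab-delimited pathway results
pathway_df <-
    readr::read_delim(file_path,
        delim = "\t"
    )

# Reduce redundant GO terms, using default (set size-based) scores
go_reduce(
    pathway_df = pathway_df,
    orgdb = "org.Hs.eg.db",
    threshold = 0.9,
    scores = NULL,
    measure = "Wang"
)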

RHReynolds/rutils documentation built on March 26, 2022, 8:17 a.m.