getGenesets: Definition of gene sets according to different sources

View source: R/getGenesets.R

getGenesets    R Documentation

Description

Functionality for retrieving gene sets for an organism under investigation from databases such as GO and KEGG. Parsing and writing a list of gene sets from/to a flat text file in GMT format is also supported.

The GMT (Gene Matrix Transposed) file format is a tab-delimited text format for describing gene sets. Each row represents one gene set and consists of a name, a description, and the genes in the gene set. See references.
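To make the row layout concrete, the following is a minimal base-R sketch of writing and re-reading a GMT file. It is not how getGenesets or writeGMT are implemented; the gene set names, descriptions, and gene IDs are made up for illustration.

```r
# Toy gene sets: a named list of character vectors (all IDs are made up).
gs <- list(
  set_A = c("1001", "1002", "1003"),
  set_B = c("2001", "2002")
)
# Hypothetical descriptions, one per gene set.
descr <- c(set_A = "first toy set", set_B = "second toy set")

# Write one tab-delimited row per gene set: name, description, genes.
gmt.file <- tempfile(fileext = ".gmt")
rows <- vapply(
  names(gs),
  function(n) paste(c(n, descr[[n]], gs[[n]]), collapse = "\t"),
  character(1)
)
writeLines(rows, gmt.file)

# Parse the file back: split each row on tabs, use field 1 as the set name,
# skip field 2 (the description), and keep the remaining fields as genes.
fields <- strsplit(readLines(gmt.file), "\t", fixed = TRUE)
parsed <- lapply(fields, function(f) f[-(1:2)])
names(parsed) <- vapply(fields, function(f) f[1], character(1))

print(parsed)
```

Because the description is dropped on parsing, the round trip recovers only the gene set names and their members, which matches the list-of-character-vectors structure described for the gs argument.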

Usage

getGenesets(
  org,
  db = c("go", "kegg", "msigdb", "enrichr"),
  gene.id.type = "ENTREZID",
  cache = TRUE,
  return.type = c("list", "GeneSetCollection"),
  ...
)

showAvailableSpecies(db = c("go", "kegg", "msigdb", "enrichr"), cache = TRUE)

showAvailableCollections(
  org,
  db = c("go", "kegg", "msigdb", "enrichr"),
  cache = TRUE
)

writeGMT(gs, gmt.file)

Arguments

org

An organism in (KEGG) three-letter code, e.g. ‘hsa’ for ‘Homo sapiens’. Alternatively, this can also be a text file storing gene sets in GMT format. See details.

db

Database from which gene sets should be retrieved. Currently, either 'go' (default), 'kegg', 'msigdb', or 'enrichr'.

gene.id.type

Character. Gene ID type of the returned gene sets. Defaults to "ENTREZID". See idTypes for available gene ID types.

cache

Logical. Should a locally cached version be used if available? Defaults to TRUE.

return.type

Character. Determines whether gene sets are returned as a simple list of gene sets (each being a character vector of gene IDs), or as an object of class GeneSetCollection.

...

Additional arguments for individual gene set databases. For db = "go":

  • onto: Character. Specifies one of the three GO ontologies: 'BP' (biological process), 'MF' (molecular function), 'CC' (cellular component). Defaults to 'BP'.

  • evid: Character. Specifies one or more GO evidence code(s) such as IEP (inferred from expression pattern) or TAS (traceable author statement). Defaults to NULL which includes all annotations, i.e. does not filter by evidence codes. See references for a list of available evidence codes.

  • hierarchical: Logical. Incorporate hierarchical relationships between GO terms ('is_a' and 'has_a') when collecting genes annotated to a GO term? If set to TRUE, this will return all genes annotated to a GO term *or to one of its child terms* in the GO ontology. Defaults to FALSE, which will then only collect genes directly annotated to a GO term.

  • mode: Character. Determines how the gene sets are retrieved. This can be either 'GO.db' or 'biomart'. The 'GO.db' mode creates the gene sets from BioC annotation packages, which is fast but does not necessarily reflect the most up-to-date mapping. In addition, this option is only available for the model organisms currently supported in BioC. The 'biomart' mode downloads the mapping from BioMart, which can be time-consuming but allows selecting from a larger range of organisms and contains the latest mappings. Defaults to 'GO.db'.
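To illustrate what the hierarchical option for GO means, here is a toy base-R sketch contrasting direct annotations with annotations propagated from child terms. It only demonstrates the idea and is not the package's implementation; the term IDs, gene IDs, and the descendants helper are hypothetical.

```r
# Toy ontology (all identifiers are made up): each term lists its direct children.
children <- list(
  T1 = c("T2", "T3"),
  T2 = "T4",
  T3 = character(0),
  T4 = character(0)
)
# Genes annotated directly to each term.
direct <- list(
  T1 = "g1",
  T2 = c("g2", "g3"),
  T3 = "g3",
  T4 = "g4"
)

# Hypothetical helper: a term together with all of its descendants (depth-first).
descendants <- function(term) {
  kids <- children[[term]]
  unique(c(term, unlist(lapply(kids, descendants))))
}

# Analogue of hierarchical = FALSE: direct annotations only.
flat.gs <- direct

# Analogue of hierarchical = TRUE: union of genes over the term and its descendants.
hier.gs <- lapply(
  setNames(names(direct), names(direct)),
  function(term) {
    sort(unique(unlist(direct[descendants(term)], use.names = FALSE)))
  }
)

print(flat.gs$T1)
print(hier.gs$T1)
```

In this toy example the root term T1 has a single directly annotated gene, but gains the genes of T2, T3, and T4 once child terms are taken into account, so hierarchical gene sets are never smaller than their direct counterparts.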

For db = "msigdb":

  • cat: Character. MSigDB collection category: 'H' (hallmark), 'C1' (genomic position), 'C2' (curated databases), 'C3' (binding site motifs), 'C4' (computational cancer), 'C5' (Gene Ontology), 'C6' (oncogenic), 'C7' (immunologic), 'C8' (cell type). See references.

  • subcat: Character. MSigDB collection subcategory. Depends on the chosen MSigDB collection category. For example, 'MIR' to obtain microRNA targets from the 'C3' collection. See references.

For db = "enrichr":

  • lib: Character. Enrichr gene set library. For example, 'Genes_Associated_with_NIH_Grants' to obtain gene sets based on associations with NIH grants. See references.

gs

A list of gene sets (character vectors of gene IDs).

gmt.file

Gene set file in GMT format. See details.

Value

For getGenesets: a list of gene sets (character vectors of gene IDs) or, if return.type = "GeneSetCollection", an object of class GeneSetCollection. For writeGMT: none, writes to file.

For showAvailableSpecies and showAvailableCollections: a DataFrame, displaying supported species and available gene set collections for a gene set database of choice.

Author(s)

Ludwig Geistlinger

References

GO: http://geneontology.org/

GO evidence codes: http://geneontology.org/docs/guide-go-evidence-codes/

KEGG Organism code: http://www.genome.jp/kegg/catalog/org_list.html

MSigDB: http://software.broadinstitute.org/gsea/msigdb/collections.jsp

Enrichr: https://maayanlab.cloud/Enrichr/#stats

GMT file format: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

See Also

The GO.db package for the GO-to-gene mapping used in 'GO.db' mode, and the biomaRt package for general queries to BioMart.

keggList and keggLink for accessing the KEGG REST server.

msigdbr::msigdbr for obtaining gene sets from the MSigDB.

Examples


    # (1) Typical usage for gene set enrichment analysis with GO:
    # Biological process terms based on BioC annotation (for human)
    go.gs <- getGenesets(org = "hsa", db = "go")
    
    # equivalent to:
    # go.gs <- getGenesets(org = "hsa", db = "go", onto = "BP", mode = "GO.db")
    
    # Alternatively:
    # downloading from BioMart 
    # this may take a few minutes ...
    go.gs <- getGenesets(org = "hsa", db = "go", mode = "biomart")

    # list supported species for obtaining gene sets from GO 
    showAvailableSpecies(db = "go")
    
    # (2) Defining gene sets according to KEGG  
    kegg.gs <- getGenesets(org = "hsa", db = "kegg")
    
    # list supported species for obtaining gene sets from KEGG 
    showAvailableSpecies(db = "kegg")

    # (3) Obtaining Hallmark (H) gene sets from MSigDB
    hall.gs <- getGenesets(org = "hsa", db = "msigdb", cat = "H")

    # list supported species for obtaining gene sets from MSigDB
    showAvailableSpecies(db = "msigdb")

    # list available gene set collections in the MSigDB
    showAvailableCollections(db = "msigdb") 

    # (4) Obtaining gene sets from Enrichr
    tfppi.gs <- getGenesets(org = "hsa", db = "enrichr", 
                            lib = "Transcription_Factor_PPIs")

    # list supported species for obtaining gene sets from Enrichr
    showAvailableSpecies(db = "enrichr")

    # list available Enrichr gene set libraries
    showAvailableCollections(org = "hsa", db = "enrichr")        
    
    # (5) Parsing gene sets from a GMT file
    gmt.file <- system.file("extdata/hsa_kegg_gs.gmt",
                            package = "EnrichmentBrowser")
    gs <- getGenesets(gmt.file)
    
    # (6) Writing gene sets to a GMT file
    writeGMT(gs, gmt.file)


lgeistlinger/EnrichmentBrowser documentation built on May 9, 2024, 7:22 p.m.