PrepareDB: Prepare the gene annotation databases

View source: R/SCP-analysis.R

PrepareDBR Documentation

Prepare the gene annotation databases

Description

This function prepares the gene annotation databases for a given species and set of annotation sources. It retrieves the necessary information from various annotation packages or external resources and organizes it into a list. The list contains the annotation data for each specified annotation source.

Usage

PrepareDB(
  species = c("Homo_sapiens", "Mus_musculus"),
  db = c("GO", "GO_BP", "GO_CC", "GO_MF", "KEGG", "WikiPathway", "Reactome", "CORUM",
    "MP", "DO", "HPO", "PFAM", "CSPA", "Surfaceome", "SPRomeDB", "VerSeDa", "TFLink",
    "hTFtarget", "TRRUST", "JASPAR", "ENCODE", "MSigDB", "CellTalk", "CellChat",
    "Chromosome", "GeneType", "Enzyme", "TF"),
  db_IDtypes = c("symbol", "entrez_id", "ensembl_id"),
  db_version = "latest",
  db_update = FALSE,
  convert_species = TRUE,
  Ensembl_version = 103,
  mirror = NULL,
  biomart = NULL,
  max_tries = 5,
  custom_TERM2GENE = NULL,
  custom_TERM2NAME = NULL,
  custom_species = NULL,
  custom_IDtype = NULL,
  custom_version = NULL
)

Arguments

species

A character vector specifying the species for which the gene annotation databases should be prepared. Default is c("Homo_sapiens", "Mus_musculus").

db

A character vector specifying the annotation sources to be included in the gene annotation databases. Default is c("GO", "GO_BP", "GO_CC", "GO_MF", "KEGG", "WikiPathway", "Reactome", "CORUM", "MP", "DO", "HPO", "PFAM", "CSPA", "Surfaceome", "SPRomeDB", "VerSeDa", "TFLink", "hTFtarget", "TRRUST", "JASPAR", "ENCODE", "MSigDB", "CellTalk", "CellChat", "Chromosome", "GeneType", "Enzyme", "TF").

db_IDtypes

A character vector specifying the desired ID types to be used for gene identifiers in the gene annotation databases. Default is c("symbol", "entrez_id", "ensembl_id").

db_version

A character vector specifying the version of the gene annotation databases to be retrieved. Default is "latest".

db_update

A logical value indicating whether the gene annotation databases should be forcefully updated. If set to FALSE, the function will attempt to load the cached databases instead. Default is FALSE.

convert_species

A logical value indicating whether to use a species-converted database when the annotation is missing for the specified species. The default value is TRUE.

Ensembl_version

Ensembl database version. If NULL, use the current release version.

mirror

Specify an Ensembl mirror to connect to. The valid options here are 'www', 'uswest', 'useast', 'asia'.

biomart

The name of the BioMart database that you want to connect to. Possible options include "ensembl", "protists_mart", "fungi_mart", and "plants_mart".

max_tries

The maximum number of attempts to connect with the BioMart service.

custom_TERM2GENE

A data frame containing a custom TERM2GENE mapping for the specified species and annotation source. Default is NULL.

custom_TERM2NAME

A data frame containing a custom TERM2NAME mapping for the specified species and annotation source. Default is NULL.

custom_species

A character vector specifying the species name to be used in a custom database. Default is NULL.

custom_IDtype

A character vector specifying the ID type to be used in a custom database. Default is NULL.

custom_version

A character vector specifying the version to be used in a custom database. Default is NULL.

Details

The 'PrepareDB' function prepares gene annotation databases for a given species and set of annotation sources. It retrieves the necessary information from various annotation packages or external resources and organizes it into a list. The function also supports creating custom databases based on user-provided gene sets.

Value

A list containing the prepared gene annotation databases:

  • TERM2GENE: mapping of gene identifiers to terms

  • TERM2NAME: mapping of terms to their names

  • semData: semantic similarity data for gene sets (only for Gene Ontology terms)

See Also

ListDB

Examples

if (interactive()) {
  db_list <- PrepareDB(species = "Homo_sapiens", db = "GO_BP")
  ListDB(species = "Homo_sapiens", db = "GO_BP")
  head(db_list[["Homo_sapiens"]][["GO_BP"]][["TERM2GENE"]])

  # Based on homologous gene conversion, prepare a gene annotation database that originally does not exist in the species.
  db_list <- PrepareDB(species = "Homo_sapiens", db = "MP")
  ListDB(species = "Homo_sapiens", db = "MP")
  head(db_list[["Homo_sapiens"]][["MP"]][["TERM2GENE"]])

  # Prepare databases for other species
  db_list <- PrepareDB(species = "Macaca_fascicularis", db = "GO_BP")
  ListDB(species = "Macaca_fascicularis", db = "GO_BP")
  head(db_list[["Macaca_fascicularis"]][["GO_BP"]][["TERM2GENE"]])

  db_list <- PrepareDB(species = "Saccharomyces_cerevisiae", db = "GO_BP")
  ListDB(species = "Saccharomyces_cerevisiae", db = "GO_BP")
  head(db_list[["Saccharomyces_cerevisiae"]][["GO_BP"]][["TERM2GENE"]])

  # Prepare databases for Arabidopsis (plant)
  db_list <- PrepareDB(
    species = "Arabidopsis_thaliana",
    db = c(
      "GO_BP", "GO_CC", "GO_MF", "KEGG", "WikiPathway",
      "ENZYME", "Chromosome"
    ),
    biomart = "plants_mart"
  )
  head(db_list[["Arabidopsis_thaliana"]][["KEGG"]][["TERM2GENE"]])

  # You can also build a custom database based on the gene sets you have
  ccgenes <- CC_GenePrefetch("Homo_sapiens")
  custom_TERM2GENE <- rbind(
    data.frame(term = "S_genes", gene = ccgenes[["cc_S_genes"]]),
    data.frame(term = "G2M_genes", gene = ccgenes[["cc_G2M_genes"]])
  )
  str(custom_TERM2GENE)

  # Set convert_species = TRUE to build a custom database for both species, with the name "CellCycle"
  db_list <- PrepareDB(
    species = c("Homo_sapiens", "Mus_musculus"), db = "CellCycle", convert_species = TRUE,
    custom_TERM2GENE = custom_TERM2GENE, custom_species = "Homo_sapiens", custom_IDtype = "symbol", custom_version = "Seurat_v4"
  )
  ListDB(db = "CellCycle")

  db_list <- PrepareDB(species = "Mus_musculus", db = "CellCycle")
  head(db_list[["Mus_musculus"]][["CellCycle"]][["TERM2GENE"]])
}

zh542370159/SCP documentation built on Nov. 22, 2023, 2:34 a.m.