getAnnotation: Annotation downloader

View source: R/sitadela.R

Description

For Ensembl-based annotations, this function connects to EBI's BioMart service using the package biomaRt and downloads annotation elements (gene coordinates, exon coordinates, gene identifiers, biotypes etc.) for each of the supported organisms. For UCSC/RefSeq annotations, it connects to the respective UCSC SQL databases if the package RMySQL is present; otherwise it downloads flat files and builds a temporary SQLite database on which to run the necessary queries. Gene and transcript versions can be attached (when available) using the tv argument. This is very useful when transcript versioning is required, as in several precision medicine applications.
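The RMySQL fallback described above can be pictured with a small sketch. This is not the package's actual code: chooseUcscBackend is a hypothetical helper that only reports which of the two routes would apply on the current machine.

```r
# Hypothetical helper (not part of sitadela): report which route the
# description above implies for UCSC/RefSeq annotations
chooseUcscBackend <- function() {
    # requireNamespace() returns TRUE only if RMySQL can be loaded
    if (requireNamespace("RMySQL", quietly = TRUE)) "mysql" else "sqlite"
}

backend <- chooseUcscBackend()
message("UCSC/RefSeq queries would use: ", backend)
```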

Usage

    getAnnotation(org, type, refdb = "ensembl", ver = NULL,
        tv = FALSE, rc = NULL)

Arguments

org

the organism for which to download annotation (one of the supported ones, see Details).

type

the transcriptional unit annotation level to load. It can be one of "gene" (default), "transcript", "utr", "transexon", "transutr", "exon". See Details for further explanation of each option.

refdb

the online source to use to fetch annotation. It can be "ensembl" (default), "ucsc", "refseq" or "ncbi". In the latter three cases, an SQL connection is opened with the UCSC public databases.

ver

the version of the annotation to use.

tv

whether to attach the gene/transcript version to the gene/transcript name. Defaults to FALSE.

rc

Fraction of cores to use. Same as the rc in addAnnotation.

Details

Regarding org, it can be "hg18", "hg19" or "hg38" for human genomes, "mm9" or "mm10" for mouse genomes, "rn5" or "rn6" for rat genomes, "dm3" or "dm6" for Drosophila genomes, "danrer7", "danrer10" or "danrer11" for zebrafish genomes, "pantro4" or "pantro5" for chimpanzee genomes, "susscr3" or "susscr11" for pig genomes, "tair10" for the Arabidopsis thaliana genome and "equcab2" or "equcab3" for Equus caballus genomes. Finally, it can be a user-defined name ("USER_NAMED_ORG") for a custom organism that the user has imported to the annotation database from a GTF/GFF file, for example org="mm10_p1".
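The built-in organism identifiers listed above can be collected in a lookup table so that a typo is caught before any download starts. This is only a sketch: supportedOrgs and isBuiltInOrg are hypothetical names, not objects exported by the package, and user-imported organisms such as "mm10_p1" are deliberately not covered by the check.

```r
# Built-in identifiers copied from the Details text above
supportedOrgs <- list(
    human = c("hg18", "hg19", "hg38"),
    mouse = c("mm9", "mm10"),
    rat = c("rn5", "rn6"),
    drosophila = c("dm3", "dm6"),
    zebrafish = c("danrer7", "danrer10", "danrer11"),
    chimpanzee = c("pantro4", "pantro5"),
    pig = c("susscr3", "susscr11"),
    arabidopsis = "tair10",
    horse = c("equcab2", "equcab3")
)

# Hypothetical helper: TRUE if org is one of the built-in identifiers
isBuiltInOrg <- function(org) {
    org %in% unlist(supportedOrgs, use.names = FALSE)
}

isBuiltInOrg("mm10")      # TRUE
isBuiltInOrg("mm10_p1")   # FALSE: a user-named organism, not built in
```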

Regarding type, it defines the level of transcriptional unit (gene, transcript, 3' UTR, exon) coordinates to be loaded or fetched if not present. The following types are supported:

  • "gene": canonical gene coordinates are retrieved from the chosen database.

  • "transcript": all transcript coordinates are retrieved from the chosen database.

  • "utr": all 3' UTR coordinates are retrieved from the chosen database, grouped per gene.

  • "transutr": all 3' UTR coordinates are retrieved from the chosen database, grouped per transcript.

  • "transexon": all exon coordinates are retrieved from the chosen database, grouped per transcript.

  • "exon": all exon coordinates are retrieved from the chosen database.
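Because the accepted values of type form a closed set, they can be checked up front with base R's match.arg. The sketch below assumes nothing about how the package validates its input internally; checkType is a hypothetical helper.

```r
# Annotation levels in the order given in the Arguments section;
# the first one, "gene", is the documented default
annotationTypes <- c("gene", "transcript", "utr", "transexon",
    "transutr", "exon")

# Hypothetical helper (not part of sitadela): validate type before
# starting a potentially long download
checkType <- function(type = annotationTypes) {
    match.arg(type)
}

checkType("transutr")   # returns "transutr"
checkType()             # returns "gene"
# checkType("intron") stops with an error raised by match.arg
```

Note that match.arg also accepts unambiguous prefixes, so a stricter check would compare against annotationTypes with %in% instead.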

Value

A data frame with the canonical genes, transcripts, exons or 3' UTRs of the requested organism. When type="gene", the data frame has the following columns: chromosome, start, end, gene_id, gc_content, strand, gene_name, biotype. When type="exon" or type="transexon", the data frame has the following columns: chromosome, start, end, exon_id, gene_id, strand, gene_name, biotype. When type="utr" or type="transutr", the data frame has the following columns: chromosome, start, end, transcript_id, gene_id, strand, gene_name, biotype. The same columns are returned when type="transcript". The gene_id, transcript_id and exon_id correspond to Ensembl, UCSC or RefSeq gene, transcript and exon accessions respectively. The gene_name corresponds to HUGO nomenclature gene names.
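To show how the gene-level columns might be used downstream, the sketch below builds a toy data frame with the gene column layout described above and applies a few base R operations. All values are made up for illustration and do not come from any annotation source; the width calculation assumes 1-based, inclusive coordinates, which this page does not state.

```r
# Toy stand-in for a gene-level result; every value here is invented
genes <- data.frame(
    chromosome = c("chr1", "chr1", "chr2"),
    start = c(100L, 5000L, 300L),
    end = c(900L, 7500L, 1200L),
    gene_id = c("G1", "G2", "G3"),
    gc_content = c(41.2, 55.0, 48.7),
    strand = c("+", "-", "+"),
    gene_name = c("Abc1", "Def2", "Ghi3"),
    biotype = c("protein_coding", "lincRNA", "protein_coding"),
    stringsAsFactors = FALSE
)

# Gene length, assuming 1-based inclusive coordinates (an assumption)
genes$width <- genes$end - genes$start + 1L

# Keep one biotype and count genes per chromosome
coding <- genes[genes$biotype == "protein_coding", ]
perChromosome <- table(genes$chromosome)
```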

Note

The data frame that is returned contains only "canonical" chromosomes for each organism. It does not contain haplotypes or non-anchored sequences and does not contain mitochondrial chromosomes.

Author(s)

Panagiotis Moulos

Examples

library(sitadela)
mm10Genes <- getAnnotation("mm10","gene")
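A few further calls can be sketched from the documented signature alone. The argument combinations below are illustrative and have not been checked against a live database; whether versions are available for a given organism and source is not guaranteed. The calls are guarded so that nothing is downloaded unless the session is interactive and sitadela is installed.

```r
# Argument sets built only from the arguments documented above
exampleCalls <- list(
    refseqExons = list(org = "hg19", type = "exon", refdb = "refseq"),
    versionedTranscripts = list(org = "hg38", type = "transcript",
        tv = TRUE),
    halfCores = list(org = "mm10", type = "utr", rc = 0.5)
)

# Each call may query a remote service or a local annotation database,
# so run them only interactively and only if sitadela is available
if (interactive() && requireNamespace("sitadela", quietly = TRUE)) {
    annotations <- lapply(exampleCalls, function(args) {
        do.call(sitadela::getAnnotation, args)
    })
    str(annotations, max.level = 1)
}
```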

pmoulos/sitadela documentation built on March 19, 2024, 2:02 a.m.