addAnnotation: Build a local genomic regions annotation database


This function is the main annotation database creator of sitadela. It creates a local SQLite database for various organisms and categories of genomic regions. Annotations are retrieved in simple, tab-delimited or GRanges formats.


    addAnnotation(organisms, sources, db = getDbPath(),
        versioned = FALSE, forceDownload = TRUE, retries = 5,
        rc = NULL, stopIfNotBS = FALSE)



organisms: a list of organisms and versions for which to download and build annotations. See also Details.


sources: a character vector of public sources from which to download and build annotations. It can be one or more of "ensembl", "ucsc", "refseq" or "ncbi". See also Details.


db: a valid path (accessible at least by the current user) where the annotation database will be set up. It defaults to file.path(system.file(package = "sitadela"), "annotation.sqlite"), that is, the installation path of the sitadela package.


versioned: create an annotation database with versioned genes and transcripts, when possible.


forceDownload: by default, addAnnotation will not download an existing annotation again (FALSE). Set to TRUE if you wish to update the annotation database for a particular version.


retries: how many times the annotation worker should try to reconnect to internet resources in case of a connection problem or failure.


rc: fraction (0-1) of cores to use in a multicore system. It defaults to NULL (no parallelization). It is used only when building certain annotation types.


stopIfNotBS: stop or warn (default) if certain BSgenome packages are not present. See also Details.


Regarding the organisms argument, it is a list with a specific format which instructs addAnnotation on which organisms and versions to download from the respective sources. Such a list may have the format:

    organisms = list(hg19 = 75, mm9 = 67, mm10 = 96:97)

This is explained as follows:

  • A database comprising the human genome version hg19 and the mouse genome versions mm9 and mm10 will be constructed.

  • If "ensembl" is in sources, version 75 is downloaded for hg19, version 67 for mm9, and versions 96 and 97 for mm10.

  • If "ucsc" or "refseq" is in sources, the latest versions are downloaded and marked by the download date. Because UCSC and RefSeq versions are not accessible in the same way as Ensembl versions, this procedure cannot always be replicated.

organisms can also be a character vector of organism names/versions (e.g. organisms = c("mm10","hg19")), in which case the latest versions are downloaded when the source is Ensembl.
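To make the two accepted forms of organisms concrete, the following minimal sketch expands a specification into one row per organism/version pair. The helper expandOrganisms is hypothetical and not part of sitadela; it only illustrates how the list format described above is read, with NA standing in for "latest version" in the character vector form.

```r
# Hypothetical helper (not part of sitadela): expand an 'organisms'
# specification into one row per organism/version pair.
expandOrganisms <- function(organisms) {
    if (is.character(organisms)) {
        # Character vector form: no versions are given, so the latest
        # version would be used; it is represented here as NA.
        return(data.frame(organism = organisms, version = NA_integer_,
            stringsAsFactors = FALSE))
    }
    # List form: each name is an organism, each value a vector of versions
    data.frame(
        organism = rep(names(organisms), lengths(organisms)),
        version = as.integer(unlist(organisms, use.names = FALSE)),
        stringsAsFactors = FALSE
    )
}

spec <- expandOrganisms(list(hg19 = 75, mm9 = 67, mm10 = 96:97))
print(spec)

latest <- expandOrganisms(c("mm10", "hg19"))
print(latest)
```

The list form yields four rows here (one for hg19, one for mm9 and two for mm10), which matches the versions listed in the explanation above.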

The supported organisms are: for human genomes "hg18", "hg19" or "hg38"; for mouse genomes "mm9" or "mm10"; for rat genomes "rn5" or "rn6"; for Drosophila genomes "dm3" or "dm6"; for zebrafish genomes "danrer7", "danrer10" or "danrer11"; for chimpanzee genomes "pantro4" or "pantro5"; for pig genomes "susscr3" or "susscr11"; for the Arabidopsis thaliana genome "tair10"; and for Equus caballus genomes "equcab2" or "equcab3". Finally, it can be "USER_NAMED_ORG", a custom organism which has been imported to the annotation database by the user using a GTF/GFF file, for example org="mm10_p1".

Regarding sources, "ucsc" corresponds to UCSC Genome Browser annotated transcripts, "refseq" corresponds to UCSC RefSeq maintained transcripts while "ncbi" corresponds to NCBI RefSeq annotated and maintained transcripts. UCSC, RefSeq and NCBI annotations are constructed by querying the UCSC Genome Browser database.

Regarding stopIfNotBS, when sources includes "ucsc", "refseq" or "ncbi", the GC content of a gene is not available as a database attribute as it is with Ensembl, and has to be calculated if it is to be included in the respective annotation. For this reason, sitadela uses 'BSgenome' packages. If stopIfNotBS=FALSE (default), annotation building continues and GC content is NA for organisms whose 'BSgenome' package is missing. If stopIfNotBS=TRUE, building stops until all the required packages for the selected organisms become available (installed by the user).
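Before building from "ucsc", "refseq" or "ncbi", it can be useful to check which 'BSgenome' packages are already installed. The sketch below is only an illustration under stated assumptions: the helper missingBSgenome and the organism-to-package table are hypothetical and not taken from sitadela, and the table covers just two organisms using their usual Bioconductor package names.

```r
# Illustrative mapping (an assumption, not sitadela's internal table)
bsPackages <- c(
    hg19 = "BSgenome.Hsapiens.UCSC.hg19",
    mm10 = "BSgenome.Mmusculus.UCSC.mm10"
)

# Hypothetical helper: return the mapped BSgenome packages that are not
# installed. system.file() returns "" for a package that is not installed,
# so nothing is loaded by this check.
missingBSgenome <- function(orgs, mapping = bsPackages) {
    pkgs <- mapping[intersect(orgs, names(mapping))]
    installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
        logical(1))
    pkgs[!installed]
}

notInstalled <- missingBSgenome(c("hg19", "mm10"))
if (length(notInstalled) > 0) {
    message("Missing BSgenome packages: ",
        paste(notInstalled, collapse = ", "))
}
```

If the result is non-empty, the packages can be installed first, or the build can proceed with stopIfNotBS=FALSE and NA GC content for the affected organisms.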


The function does not return anything. Only the SQLite database is created or updated.
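Since nothing is returned, one way to confirm that a build produced something is to open the SQLite file and list its tables. The following is a minimal sketch, assuming the RSQLite package (and DBI, which it depends on) is installed; the helper listDbTables is hypothetical, and no particular table names are assumed.

```r
# Hypothetical helper: list the tables of an SQLite file, or return an
# empty character vector if the file or the RSQLite package is missing.
listDbTables <- function(path) {
    if (!file.exists(path) || !requireNamespace("RSQLite", quietly = TRUE)) {
        return(character(0))
    }
    con <- DBI::dbConnect(RSQLite::SQLite(), path)
    # Always close the connection, even if listing fails
    on.exit(DBI::dbDisconnect(con))
    DBI::dbListTables(con)
}

# Path used for illustration only; point it at the db given to addAnnotation
myDb <- file.path(tempdir(), "testann.sqlite")
print(listDbTables(myDb))
```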


Panagiotis Moulos


# Build a test database with one genome
myDb <- file.path(tempdir(),"testann.sqlite")

organisms <- list(mm10=100)
sources <- "ensembl"

# If the example is not running in a multicore system, rc is ignored
# (rc = 0.5 is an illustrative value)
addAnnotation(organisms, sources, db = myDb, rc = 0.5)

# A more complete case, don't run as example
# Since we are using Ensembl, we can also ask for a version
#organisms <- list(
#    mm9=67,
#    mm10=96:97,
#    hg19=75,
#    hg38=96:97
#)
#sources <- c("ensembl", "refseq")

## Build on the default location (depending on package location, it may
## require root/sudo)
#addAnnotation(organisms, sources)

## Build on an alternative location
#myDb <- file.path(path.expand("~"),"my_ann.sqlite")
#addAnnotation(organisms, sources, db = myDb)

