knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
Interpretation of genetic variation data is a crucial step to understand the relationship between gene sequence changes and biological function. There are several annotation tools, such as ANNOVAR, VEP, vcfanno, have been developed. These tools make gene variation data annotation more convenient and faster than before. However, because different annotation tools have their own methods of use and design architecture, this increases the difficulty for bioinformatics beginner to utilize these tools. In addition, many of existing database resources and annotation scripts have not been well integrated and shared.
So, it is worth us to develop an integrated annotation system that not only includes the integration of different annotation tools but also integrate the relevant database resources. Here, we present an integrated annotation R package 'anor' to do this. It provides a series R functions to integrate external annotation tools and annotation databases.
To install anor, you need to install R interpreter (Supported Linux, MAC, and Windows). This package has been uploaded on The Comprehensive R Archive Network (CRAN, https://cran.r-project.org). You can use the command to install anor package easily:
# setRepositories ind 1 is CRAN, 2 is Bioconductor setRepositories(ind=1:2) install.packages('anor')
If you want to use the latest development version, you need to use devtools install_github
function.
# Install the cutting edge development version from GitHub: # install.packages("devtools") devtools::install_github("JhuangLab/anor", ref = "develop")
anor can also be installed using the source code archive (R CMD INSTALL
). In this situation, you need to manually handle dependencies on many packages.
Tips: When the RMySQL or RSQLite package can not directly installed by R, conda is an optional solution: conda install -c r r-rmysql r-rsqlite
. Or you need root permissions to install the corresponding system dependency.
To reduce the procedure of download database and other material, anor provides single function download.database
to download various annotation databases for ANNOVAR, AnnotationDbi, vcfanno, vep, oncotator and giggle. Moreover, you can share the database configuration for others to download your database in protected (license code required) or unprotected mode.
# Library needed R package library(BioInstaller) library(anor) library(data.table) # Show all anor supported database x <- download.database(show.all.names = TRUE)
suppressWarnings( DT::datatable(matrix(x, ncol=3), caption = sprintf("Annotation database supported by anor (n=%s)", length(x))) )
# Show all supported version of database (e.g. db_annovar_avsnp) download.database(download.name = "db_annovar_avsnp", show.all.version = TRUE) # Show all supprted buildver of specific version database download.database(download.name = "db_annovar_avsnp", version = "avsnp147", show.all.buildvers = TRUE) # To reduce the download time, we use the local demo configuratin file to download demo file demo.cfg <- system.file("extdata", "demo/demo.cfg", package = "anor") download.database("download_demo", show.all.versions = T, download.cfg = demo.cfg) download.database("download_demo", "demo", buildver = "GRCh37", database.dir = sprintf("%s/databases/", tempdir()), download.cfg = demo.cfg) # Protected mode # download.database("your_db_name", license = "license_code") # If you want to download other resource in BioInstaller, # you can use function `install.bioinfo` x <- install.bioinfo(show.all.names = TRUE)
suppressWarnings( DT::datatable(matrix(x, ncol=3), caption = sprintf("Items supported by BioInstaller (n=%s)", length(x))) )
anno.name
is the key of shortcut annotation, you can get the annotation name through function get.annotation.names()
, or you can search the vignette to get the annotation name
Besides, download.name
is also required for you to download the related database resource connected with anno.name
. Now, we provides the function get.download.name
to get it.
# Get all supprted anno.name in anor x <- get.annotation.names()
suppressWarnings( DT::datatable(matrix(x, ncol=3), caption = sprintf("Annotation database supported by anor (n=%s)", length(x))) )
# Get annotation name needed download.name and # you can use download.database to download database using the download.name. download.name <- get.download.name('avsnp147')
# Database configuration file database.cfg <- system.file('extdata', 'config/databases.toml', package = "anor") # Get anno.name needed input cols get.annotation.needcols('avsnp147')
To facilitate the SQLite format database annotation. anor also provides a simple function sqlite.build
to build your sqlite format database: input a text file and output a sqlite format database.
It is helpful for the annotation database with very large size file. Because the SQLite or other SQL format database with the indexes can singlificantly reduce the search and analysis time without any other bio-annotation tools, and just need the powerful SQL database client.
# build sqlite database for(i in c("hg19_ALL.sites.2015_08", "hg19_avsnp147")) { database <- system.file("extdata", sprintf("demo/%s.txt", i), package = "anor") sqlite.db <- sprintf("%s/%s.sqlite", tempdir(), i) file.copy(database, sprintf("%s/%s.txt", tempdir(), i)) sqlite.build(database, sqlite.connect.params = list(dbname = sqlite.db, table.name = sprintf("%s", i))) }
Function annotation
is the main interface to access various annotation system including ANNOVAR, vcfanno, vep, and R annotation system, such as AnnotationDbi and other R format anntotation script.
For example, if the anno.name
contains perl_annovar
, it will use the external ANNOVAR
to finish the annotation step and read the output file (VCF by vcfR) as the data.table
object.
# use the defined rule to annotate 1000 Genome Project frequency database.dir <- tempdir() chr <- c("chr1", "chr2", "chr1") start <- c("10177", "10177", "10020") end <- c("10177", "10177", "10020") ref <- c("-", "A", "A") alt <- c("C", "AC", "-") dat <- data.table(chr = chr, start = start, end = end, ref = ref, alt = alt) x <- annotation(dat = dat, anno.name = "1000g2015aug_all", database.dir = database.dir, db.type = "txt") x x <- annotation(dat = dat, anno.name = "1000g2015aug_all", database.dir = database.dir, db.type = "sqlite") x
Excepting the predetermined shortcut anno.name
, anor also provides the austomizable functions annotation.cols.match
and annotation.region.match
for the full match and region match respectively.
# Do annotation using full match function (default to use chr, start to select data # and use chr, start, end, ref, and alt to match data) # Use `?annotation.cols.match` to see more detail about `annotation.cols.match` chr <- c("chr1", "chr2", "chr1") start <- c("10020", "10020", "10020") end <- c("10020", "10020", "10020") ref <- c("A", "A", "A") alt <- c("-", "-", "-") dat <- data.table(chr = chr, start = start, end = end, ref = ref, alt = alt) x <- annotation.cols.match(dat, "avsnp147", database.dir = database.dir, return_col_names = "avSNP147", db.type = "sqlite") x # Region match mode bed.file <- system.file("extdata", "demo/example.bed", package = "anor") chr <- c("chr10", "chr1") start <- c("100188904", "100185955") end <- c("100188904", "100185955") dat <- data.table(chr = chr, start = start, end = end) # format.cols.plus.chr will add "chr" in chr colum # if your input chr colum not contain string 'chr' # format.db.region.tb will process the region matched data x <- annotation.region.match(dat = dat, database.dir = tempdir(), dbname_fixed = bed.file, table_name_fixed = "bed", db.type = "txt", format_dat_fun = "format.cols.plus.chr", format_db_tb_fun = "format.db.region.tb") x
Other various small annotation functions/script also be provided, and the number of the annotation function will continue to increase.
# Demo of `convert dbsnp rs number to genomic location` and `alias of human gene symbol`. # rs2pos: convert snp rs number to genomic location snp.id <- c("rs775809821", "rs768019142") x <- annotation(dat = data.table(rs = rep(snp.id, 3)), database.dir = database.dir, anno.name = "rs2pos147", buildver = "hg19", verbose = FALSE, db.type = "txt") # Annotate gene from BioConductor org.hs.eg.db gene <- c("TP53", "NSD2") annotation(dat = gene, anno.name = "bioc_gene2alias")
If you want to use the ANNOVAR of perl version, you need to download the ANNOVAR source code using R package BioInstaller, and also need to prepare the avinput
format refer the tutorial:
chr|start|end|ref|alt --|--|--|--|-- chr1|100000|100000|A|T chr1|100000|100001|AA|- chr1|100000|100000|-|T chr1|100000|100000|-|CAC
library(BioInstaller) install.bioinfo('annovar', annovar.dir)
To reduce the test time, we set debug
to TRUE
, and it will returend the command to run ANNOVAR.
# Annotate avinput format R data using ANNOVAR # set debug to TRUE will not to run command chr = "chr1" start = "123" end = "123" ref = "A" alt = "C" dat <- data.table(chr, start, end, ref, alt) x <- annotation(dat, "perl_annovar_refGene", annovar.dir = "/opt/bin/annovar", database.dir = "{{annovar.dir}}/humandb", debug = TRUE)
VCF format files can be processed by ANNOVAR, VEP and vcfanno. If you don't want to read the output files in R, you can set the parameter debug
to TRUE
, and paste/run the returend command to shell client.
# Annotate VCF file using ANNOVAR # set debug to TRUE will not to run command x <- annotation(anno.name = "perl_annovar_ensGene", input.file = "/tmp/test.vcf", annovar.dir = "/opt/bin/annovar/", database.dir = "{{annovar.dir}}/humandb", out = tempfile(), vcfinput = TRUE, debug = TRUE) # Annotation VCF file use VEP vep(debug = TRUE) x <- annotation(anno.name = "vep_all", input.file = "/tmp/test.vcf", out = tempfile(), debug = TRUE) # Annotation VCF file use vcfanno vcfanno(debug = TRUE) x <- annotation(anno.name = "vcfanno_demo", input.file = system.file("extdata", "demo/vcfanno_demo/query.vcf.gz", package = "anor"), out = "test.vcf", vcfanno = "/path/vcfanno", debug = TRUE)
Detail about supported annotation database can be found in the another vignette.
Here is the output of sessionInfo()
on the system on which this document was compiled:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.