In JhuangLab/annovarR: Integrated Framework to Annotate Genetic Variants

knitr::opts_chunk$set(comment = "#>", collapse = TRUE)

Introduction

Interpretation of genetic variation data is a crucial step to understand the relationship between gene sequence changes and biological function. There are several annotation tools, such as ANNOVAR, VEP, vcfanno, have been developed. These tools make gene variation data annotation more convenient and faster than before. However, because different annotation tools have their own methods of use and design architecture, this increases the difficulty for bioinformatics beginner to utilize these tools. In addition, many of existing database resources and annotation scripts have not been well integrated and shared.

So, it is worth us to develop an integrated annotation system that not only includes the integration of different annotation tools but also integrate the relevant database resources. Here, we present an integrated annotation R package 'anor' to do this. It provides a series R functions to integrate external annotation tools and annotation databases.

Installation

To install anor, you need to install R interpreter (Supported Linux, MAC, and Windows). This package has been uploaded on The Comprehensive R Archive Network (CRAN, https://cran.r-project.org). You can use the command to install anor package easily:

# setRepositories ind 1 is CRAN, 2 is Bioconductor
setRepositories(ind=1:2)
install.packages('anor')

If you want to use the latest development version, you need to use devtools install_github function.

# Install the cutting edge development version from GitHub:
# install.packages("devtools")
devtools::install_github("JhuangLab/anor", ref = "develop")

anor can also be installed using the source code archive (R CMD INSTALL). In this situation, you need to manually handle dependencies on many packages.

Tips: When the RMySQL or RSQLite package can not directly installed by R, conda is an optional solution: conda install -c r r-rmysql r-rsqlite. Or you need root permissions to install the corresponding system dependency.

Download database resource

To reduce the procedure of download database and other material, anor provides single function download.database to download various annotation databases for ANNOVAR, AnnotationDbi, vcfanno, vep, oncotator and giggle. Moreover, you can share the database configuration for others to download your database in protected (license code required) or unprotected mode.

# Library needed R package
library(BioInstaller)
library(anor)
library(data.table)

# Show all anor supported database
x <- download.database(show.all.names = TRUE)

suppressWarnings(
  DT::datatable(matrix(x, ncol=3), caption = sprintf("Annotation database supported by anor (n=%s)", length(x)))
)

# Show all supported version of database (e.g. db_annovar_avsnp)
download.database(download.name = "db_annovar_avsnp", show.all.version = TRUE)

# Show all supprted buildver of specific version database
download.database(download.name = "db_annovar_avsnp", version = "avsnp147", show.all.buildvers = TRUE)

# To reduce the download time, we use the local demo configuratin file to download demo file
demo.cfg <- system.file("extdata", "demo/demo.cfg", package = "anor")
download.database("download_demo", show.all.versions = T, download.cfg = demo.cfg)
download.database("download_demo", "demo", buildver = "GRCh37", database.dir = sprintf("%s/databases/", 
  tempdir()), download.cfg = demo.cfg)

# Protected mode
# download.database("your_db_name", license = "license_code")

# If you want to download other resource in BioInstaller,
# you can use function `install.bioinfo`
x <- install.bioinfo(show.all.names = TRUE)

suppressWarnings(
  DT::datatable(matrix(x, ncol=3), caption = sprintf("Items supported by BioInstaller (n=%s)", length(x)))
)

Basic annotation

anno.name is the key of shortcut annotation, you can get the annotation name through function get.annotation.names(), or you can search the vignette to get the annotation name

Besides, download.name is also required for you to download the related database resource connected with anno.name. Now, we provides the function get.download.name to get it.

# Get all supprted anno.name in anor
x <- get.annotation.names()

suppressWarnings(
  DT::datatable(matrix(x, ncol=3), caption = sprintf("Annotation database supported by anor (n=%s)", length(x)))
)

# Get annotation name needed download.name and 
# you can use download.database to download database using the download.name.
download.name <- get.download.name('avsnp147')

# Database configuration file
database.cfg <- system.file('extdata', 'config/databases.toml', package = "anor")

# Get anno.name needed input cols
get.annotation.needcols('avsnp147')

To facilitate the SQLite format database annotation. anor also provides a simple function sqlite.build to build your sqlite format database: input a text file and output a sqlite format database.

It is helpful for the annotation database with very large size file. Because the SQLite or other SQL format database with the indexes can singlificantly reduce the search and analysis time without any other bio-annotation tools, and just need the powerful SQL database client.

# build sqlite database
for(i in c("hg19_ALL.sites.2015_08", "hg19_avsnp147")) {
  database <- system.file("extdata", sprintf("demo/%s.txt", i), package = "anor")
  sqlite.db <- sprintf("%s/%s.sqlite", tempdir(), i)
  file.copy(database, sprintf("%s/%s.txt", tempdir(), i))
  sqlite.build(database, sqlite.connect.params = list(dbname = sqlite.db, table.name = sprintf("%s", 
    i)))
}

Function annotation is the main interface to access various annotation system including ANNOVAR, vcfanno, vep, and R annotation system, such as AnnotationDbi and other R format anntotation script.

For example, if the anno.name contains perl_annovar, it will use the external ANNOVAR to finish the annotation step and read the output file (VCF by vcfR) as the data.table object.

# use the defined rule to annotate 1000 Genome Project frequency
database.dir <- tempdir()
chr <- c("chr1", "chr2", "chr1")
start <- c("10177", "10177", "10020")
end <- c("10177", "10177", "10020")
ref <- c("-", "A", "A")
alt <- c("C", "AC", "-")
dat <- data.table(chr = chr, start = start, end = end, ref = ref, alt = alt)
x <- annotation(dat = dat, anno.name = "1000g2015aug_all", database.dir = database.dir, db.type = "txt")
x
x <- annotation(dat = dat, anno.name = "1000g2015aug_all", database.dir = database.dir, db.type = "sqlite")
x

Advanced annotation

Excepting the predetermined shortcut anno.name, anor also provides the austomizable functions annotation.cols.match and annotation.region.match for the full match and region match respectively.

# Do annotation using full match function (default to use chr, start to select data 
# and use chr, start, end, ref, and alt to match data)
# Use `?annotation.cols.match` to see more detail about `annotation.cols.match`
chr <- c("chr1", "chr2", "chr1")
start <- c("10020", "10020", "10020")
end <- c("10020", "10020", "10020")
ref <- c("A", "A", "A")
alt <- c("-", "-", "-")
dat <- data.table(chr = chr, start = start, end = end, ref = ref, alt = alt)
x <- annotation.cols.match(dat, "avsnp147", database.dir = database.dir, 
  return_col_names = "avSNP147", db.type = "sqlite")
x

# Region match mode
bed.file <- system.file("extdata", "demo/example.bed", package = "anor")
chr <- c("chr10", "chr1")
start <- c("100188904", "100185955")
end <- c("100188904", "100185955")
dat <- data.table(chr = chr, start = start, end = end)

# format.cols.plus.chr will add "chr" in chr colum 
# if your input chr colum not contain string 'chr'
# format.db.region.tb will process the region matched data
x <- annotation.region.match(dat = dat, database.dir = tempdir(), dbname_fixed = bed.file, 
  table_name_fixed = "bed", db.type = "txt", format_dat_fun = "format.cols.plus.chr", 
  format_db_tb_fun = "format.db.region.tb")
x

Other various small annotation functions/script also be provided, and the number of the annotation function will continue to increase.

# Demo of `convert dbsnp rs number to genomic location` and `alias of human gene symbol`.
# rs2pos: convert snp rs number to genomic location
snp.id <- c("rs775809821", "rs768019142")
x <- annotation(dat = data.table(rs = rep(snp.id, 3)), database.dir = database.dir, anno.name = "rs2pos147", 
    buildver = "hg19", verbose = FALSE, db.type = "txt")

# Annotate gene from BioConductor org.hs.eg.db
gene <- c("TP53", "NSD2")
annotation(dat = gene, anno.name = "bioc_gene2alias")

External annotation system

If you want to use the ANNOVAR of perl version, you need to download the ANNOVAR source code using R package BioInstaller, and also need to prepare the avinput format refer the tutorial:

chr|start|end|ref|alt --|--|--|--|-- chr1|100000|100000|A|T chr1|100000|100001|AA|- chr1|100000|100000|-|T chr1|100000|100000|-|CAC

library(BioInstaller)
install.bioinfo('annovar', annovar.dir)

To reduce the test time, we set debug to TRUE, and it will returend the command to run ANNOVAR.

# Annotate avinput format R data using ANNOVAR
# set debug to TRUE will not to run command
chr = "chr1"
start = "123"
end = "123"
ref = "A"
alt = "C"
dat <- data.table(chr, start, end, ref, alt)
x <- annotation(dat, "perl_annovar_refGene", annovar.dir = "/opt/bin/annovar", 
             database.dir = "{{annovar.dir}}/humandb", debug = TRUE)

VCF format files can be processed by ANNOVAR, VEP and vcfanno. If you don't want to read the output files in R, you can set the parameter debug to TRUE, and paste/run the returend command to shell client.

# Annotate VCF file using ANNOVAR
# set debug to TRUE will not to run command
x <- annotation(anno.name = "perl_annovar_ensGene", input.file = "/tmp/test.vcf",
             annovar.dir = "/opt/bin/annovar/", database.dir = "{{annovar.dir}}/humandb", 
             out = tempfile(), vcfinput = TRUE, debug = TRUE)

# Annotation VCF file use VEP
vep(debug = TRUE)
x <- annotation(anno.name = "vep_all", input.file = "/tmp/test.vcf",
             out = tempfile(), debug = TRUE)

# Annotation VCF file use vcfanno
vcfanno(debug = TRUE)
x <- annotation(anno.name = "vcfanno_demo", input.file = system.file("extdata", "demo/vcfanno_demo/query.vcf.gz", 
                   package = "anor"), out = "test.vcf", vcfanno = "/path/vcfanno", debug = TRUE)