knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60))
Variant calling format (VCF) is the final file format that results from various steps of transforming sequencing data into high quality annotated genetic variants. Variants in a VCF file are prone to errors, so, different quality measurements are taken to validate the information in the file. If these measurements are not met, a further step of hard-filtering can be done to remove low quality calls [@de2017gatk]. Genome Analysis Toolkit (GTAK) is an established, well-documented variant calling pipeline published by the broad institute [@mckenna2010genome]. It provides various quality annotations to detect errors in the different steps of variant calling. This package visualizes these annotations and prints out GATK guidelines on hard-filtering based on the chosen parameter.
This tool aims to guide the user on filtering GATK VCF files by visualizing the quality distribution of variants and providing GTAK filtering threshold values and tips.
The package functions on GATK annotated VCF files. A GATK user needs to convert the VCF into table format using the following bash command:
```{bash, eval=FALSE, tidy = TRUE, size="tiny"} gatk VariantsToTable -V $file\ -F CHROM -F POS -F REF -F ALT \ -F FILTER -F QUAL -F QD -F FS \ -F SOR -F MQ -F MQRankSum -F ReadPosRankSum\ -O /home/alihh/freq/stat/ksu/$file.table
# How to use Install the latest version of vcfqc package by copying the following commands into R console: ```r library(devtools) devtools::install_github("Hatoonli/vcfqc", build_vignettes = TRUE)
Load the package
library(vcfqc)
Then choose a quality parameter from the list below and run it on your VCF file. QD FS SOR MQ MQRankSum ReadPosRankSum
You can read more about each parameter by inquiring about it with a question mark, for example:
?QC
You can preview the performance of the package on the example file provided in /extdata
, to get the file pathway run the following command:
system.file("extdata", "HG001_GRCh38.table", package = "vcfqc")
and to learn more about the file, read the accompanying documentation by running the following command:
?extdata/HG001_GRCh38.table
Before choosing the name vcfqc, I checked for its availability using the package available
library() valid_package_name("vcfqc") available_on_cran("vcfqc") available_on_bioc("vcfqc") available_on_github("vcfqc")
Using hexSticker, the sticker was created and saved in /inst/figures/sticker.png
.
library(vcfqc) library(ggplot2) library(hexSticker) p <- QD(system.file("extdata", "HG001_GRCh38.table", package = "vcfqc")) p <- p + theme_void() + theme_transparent() s <- sticker(p, package="vcfqc", p_size=20, s_x=1, s_y=.75, s_width=1, s_height=1, h_fill="#498ef5", h_color="#71a6f5", filename="inst/figures/sticker.png") plot(s)
library(vcfqc) library(ggplot2) library(hexSticker) p <- QD(system.file("extdata", "HG001_GRCh38.table", package = "vcfqc")) p <- p + theme_void() + theme_transparent() s <- sticker(p, package="vcfqc", p_size=20, s_x=1, s_y=.75, s_width=1, s_height=1, h_fill="#498ef5", h_color="#71a6f5", filename="inst/figures/sticker.png") plot(s)
plot(s)
It's then turned into a logo by using the function use_logo("inst/figures/sticker.png")
the generated text is copied and added next to the title in the README.Rmd
file.
This package was built using devtools [@dev]. By creating the package, all required directories and files are built in the package directory.
usethis::create_package("vcfqc")
The package description is added in DESCRIPTION file. The license used is an open source licence by MIT, added using the function use_mit_license()
.
All vcfqc functions are saved in the directory \R
and generated using the devtools command use_r()
. Information about each function was taken from GATK hard-filtering guidelines [@de2017gatk] and added at the header of the function file (e.g. QD.R
).
#' Plot Quality by Depth (QD) #' @description #' This is the .... #' @param f VCF input file #' @param n Threshold value (optional), default is the recommended GATK value. #' @returns #' A plot ... #' @export #' @example #' QD(system.file("extdata", "HG001_GRCh38.table", package = "vcfqc"))
From this header, documentation for the function is saved in \man
and the imported or exported objects are detailed in NAMESPACE
.
ggplot is used for plotting the distribution of quality annotation using a simple line:
ggplot2::ggplot(vcf, ggplot2::aes(QD)) + ggplot2::geom_density()
The user gets a message for the numbers of variants not annotated (NA) and a table output that summarizes the data. Lastly, a text is printed with information about the expected plot from GTAK guidelines.
Additional files needed for the package are added under the directory /inst
, when the package is installed all child directories of /inst
will be installed with it. Databases in this directories are documented in /R
. Files that are not needed for the package are added to .Rbuildignore
so they're not copied to the user system when the package is installed. A vignette was also created to demonstarte the purpose of the package and give background information. Using use_vignette()
created a vignette directory with the R markdown file. Finally, a website was created feo the package using pkgdown.
Future updates of the package can include more annotation options and a flexible annotation field where the user can enter costum values based on the variant calling pipeline they're using.
\newpage
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.