stopifnot(
    requireNamespace("pander")
)
library(knitr)
opts_chunk$set( # Set these first, to make them default
    message = FALSE,
    warning=FALSE,
    error=FALSE
)
optChunkDefault <- opts_chunk$get()

Introduction {#Introduction}

The VCF Tool Box (r Biocpkg("TVTB")) offers S4 classes and methods to filter, summarise and visualise genetic variation data stored in VCF files pre-processed by the Ensembl Variant Effect Predictor (VEP) [@RN1]. An RStudio/Shiny web-application, the Shiny Variant Explorer (tSVE), provides a convenient interface to demonstrate those functionalities integrated in a programming-free environment.

Currently, major functionalities in the r Biocpkg("TVTB") package include:

A class to store recurrent parameters of genetic analyses

Genotype counts and allele frequencies

Classes of VCF filter rules

Installation {#Installation}

Instructions to install the VCF Tool Box are available here.

Once installed, the package can be loaded and attached as follows:

library(TVTB)

Recurrent settings: TVTBparam {#TVTBparam}

Most functionalities in r Biocpkg("TVTB") require recurrent information such as:

\bioccomment{ INFO key suffixes should be considered carefully to avoid conflicts with INFO keys already present in the VCF object. }

To reduce the burden of repetition during programming, and to facilitate analyses using consistent sets of parameters, r Biocpkg("TVTB") implements the TVTBparam class. The TVTBparam class offer a container for parameters recurrently used across the package. A TVTBparam object may be initialised as follows:

tparam <- TVTBparam(Genotypes(
    ref = "0|0",
    het = c("0|1", "1|0", "0|2", "2|0", "1|2", "2|1"),
    alt = c("1|1", "2|2")),
    ranges = GenomicRanges::GRangesList(
        SLC24A5 = GenomicRanges::GRanges(
            seqnames = "15",
            IRanges::IRanges(
                start = 48413170, end = 48434757)
            )
        )
    )

TVTBparam objects have a convenient summary view and accessor methods:

tparam

In this example:

Default values are provided for all slots except genotypes, as these may vary more frequently from one data set to another (e.g. phased, unphased, imputed).

Data import {#Import}

Genetic variants {#ImportVariants}

Functionalities in r Biocpkg("TVTB") support CollapsedVCF and ExpandedVCF objects (both extending the virtual class VCF) of the r Biocpkg("VariantAnnotation") package.

Typically, CollapsedVCF objects are produced by the r Biocpkg("VariantAnnotation") readVcf method after parsing a VCF file, and ExpandedVCF objects result of the r Biocpkg("VariantAnnotation") expand method applied to a CollapsedVCF object.

Any information that users deem relevant for the analysis may be imported from VCF files and stored in VCF objects passed to r Biocpkg("TVTB") methods. However, to enable the key functionalities of the package, the slots of a VCF object should include at least the following information:

List of genomic ranges {#ImportGRangesList}

In the near future, r Biocpkg("TVTB") functionalities are expected to produce summary statistics and plots faceted by meta-features, each potentially composed of multiple genomic ranges.

For instance, burden tests may be performed on a set of transcripts, considering only variants in their respective sets of exons. The r Biocpkg("GenomicRanges") GRangesList class is an ideal container in example, as each GRanges in the GRangesList would represent a transcript, and each element in the GRanges would represent an exon.

Furthermore, TVTBparam objects may be supplied as the param argument of the r Biocpkg("VariantAnnotation") readVcf. In this case, the TVTBparam object is used to import only variants overlapping the relevant genomic regions. Moreover, the readVcf method also ensured that the vep slot of the TVTBparam object is present in the header of the VCF file.

svp <- as(tparam, "ScanVcfParam")
svp

Phenotypes {#ImportPhenotypes}

Although VCF objects may be constructed without attached phenotype data, phenotype information is critical to interpret and compare genetic variants between groups of samples (e.g. burden of damaging variants in different phenotype levels).

VCF objects accept phenotype information (as r Biocpkg("S4Vectors") DataFrame) in the colData slot. This practice has the key advantage of keeping phenotype and genetic information synchronised through operation such as subsetting and re-ordering, limiting workspace entropy and confusion.

Example {#ImportExample}

An ExpandedVCF object that contains the minimal data necessary for the rest of the vignette can be created as follows:

Step 1: Import phenotypes

phenoFile <- system.file(
    "extdata", "integrated_samples.txt", package = "TVTB")
phenotypes <- S4Vectors::DataFrame(
    read.table(file = phenoFile, header = TRUE, row.names = 1))

Step 2: Define the VCF file to parse

vcfFile <- system.file(
    "extdata", "chr15.phase3_integrated.vcf.gz", package = "TVTB")
tabixVcf <- Rsamtools::TabixFile(file = vcfFile)

Step 3: Define VCF import parameters

VariantAnnotation::vcfInfo(svp(tparam)) <- vep(tparam)
VariantAnnotation::vcfGeno(svp(tparam)) <- "GT"
svp(tparam)

Of particular interest in the above chunk of code:

Step 4: Import and pre-process variants

# Import variants as a CollapsedVCF object
vcf <- VariantAnnotation::readVcf(
    tabixVcf, param = tparam, colData = phenotypes)
# Expand into a ExpandedVCF object (bi-allelic records)
vcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)

Of particular interest in the above chunk of code, the readVcf method is given:

The result is an ExpandedVCF object that includes variants in the targeted genomic range(s) and samples:

vcf

Adding allele frequencies {#Frequencies}

Although interesting figures and summary tables may be obtained as soon as the first ExpandedVCF object is created (see section Summarising Ensembl VEP predictions), those methods may benefit from information added to additional INFO keys after data import, either manually by the user, or through various methods implemented in the r Biocpkg("TVTB") package.

Adding overall frequencies {#FrequenciesOverall}

For instance, the method addOverallFrequencies uses the reference homozoygote (REF), heterozygote (HET), and homozygote alternate (ALT) genotypes defined in the TVTBparam object stored in the VCF metadata to obtain the count of each genotype in an ExpandedVCF object. Immediately thereafter, the method uses those counts to calculate alternate allele frequency (AAF) and minor allele frequency (MAF). Finally, the method stores the five calculated values (REF, HET, ALT, AAF, and MAF) in INFO keys defined by suffixes also declared in the TVTBparam object.

initialInfo <- colnames(info(vcf))
vcf <- addOverallFrequencies(vcf = vcf)
setdiff(colnames(info(vcf)), initialInfo)

Notably, the addOverallFrequencies method is synonym to the addFrequencies method missing the argument phenos:

vcf <- addFrequencies(vcf = vcf, force = TRUE)

\bioccomment{ The optional argument force = TRUE is used here to overwrite INFO keys created in the previous chunk of code. }

Adding frequencies within phenotype level(s) {#FrequenciesPhenoLevel}

Similarly, the method addPhenoLevelFrequencies obtains the count of each genotype in samples associated with given level(s) of given phenotype(s), and stores the calculated values in INFO keys defined as <pheno>_<level>_<suffix>, with suffixes defined in the TVTBparam object stored in the VCF metadata.

initialInfo <- colnames(info(vcf))
vcf <- addPhenoLevelFrequencies(
    vcf = vcf, pheno = "super_pop", level = "AFR")
setdiff(colnames(info(vcf)), initialInfo)

Notably, the addPhenoLevelFrequencies method is synonym to the addFrequencies method called with the argument phenos given as a list where names are phenotypes, and values are character vectors of levels to process within each phenotype:

initialInfo <- colnames(info(vcf))
vcf <- addFrequencies(
    vcf,
    list(super_pop = c("EUR", "SAS", "EAS", "AMR"))
)
setdiff(colnames(info(vcf)), initialInfo)

In addition, the addFrequencies method can be given a character vector of phenotypes as the phenos argument, in which case frequencies are calculated for all levels of the given phenotypes:

vcf <- addFrequencies(vcf, "pop")
head(grep("^pop_[[:alpha:]]+_REF", colnames(info(vcf)), value = TRUE))

Filtering variants {#Filter}

Although VCF objects are straightforward to subset using either indices and row names (as they inherit from the r Biocpkg("SummarizedExperiment") RangedSummarizedExperiment class), users may wish to identify variants that pass combinations of criteria based on information in their fixed slot, info slot, and Ensembl VEP predictions, a non-trivial task due to those pieces of information being stored in different slots of the VCF object, and the 1:N relationship between variants and EnsemblVEP predictions.

Definition of VCF filter rules {#FilterDefine}

To facilitate the definition of VCF filter rules, and their application to VCF objects, r Biocpkg("TVTB") extends the r Biocpkg("S4Vectors") FilterRules class in four new classes of filter rules:

motivations <- readRDS(
    system.file("misc", "motivations_rules.rds", package = "TVTB")
)
pander::pandoc.table(
    motivations,
    paste(
        "Motivation for each of the new classes extending `FilterRules`,
        to define VCF filter rules."
    ),
    style = "multiline",
    split.cells = c(15, 45),
    split.tables = c(Inf)
)

Note that FilterRules objects themselves are applicable to VCF objects, with two important difference from the above specialised classes:

S4Vectors::FilterRules(list(
    mixed = function(x){
        VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" &
            VariantAnnotation::info(x)[,"MAF"] >= 0.05
    }
))

Instances of those classes may be initialised as follows:

VcfFixedRules

fixedR <- VcfFixedRules(list(
    pass = expression(FILTER == "PASS"),
    qual = expression(QUAL > 20)
))
fixedR

VcfInfoRules

infoR <- VcfInfoRules(
    exprs = list(
        rare = expression(MAF < 0.01 & MAF > 0),
        common = expression(MAF > 0.05),
        mac_ge3 = expression(HET + 2*ALT >= 3)),
    active = c(TRUE, TRUE, FALSE)
)
infoR

The above code chunk illustrates useful features of FilterRules:

VcfVepRules

vepR <- VcfVepRules(exprs = list(
    missense = expression(Consequence %in% c("missense_variant")),
    CADD_gt15 = expression(CADD_PHRED > 15)
    ))
vepR

VcfFilterRules

VcfFilterRules combine VCF filter rules of different types in a single object.

vcfRules <- VcfFilterRules(fixedR, infoR, vepR)
vcfRules

This vignette offers only a brief peek into the utility and flexibility of VCF filter rules. More (complex) examples are given in a separate vignette, including filter rules using functions and pattern matching. The documentation of the r Biocpkg("S4Vectors") package---where the parent class FilterRules is defined---can also be a source of inspiration.

Control of VCF filter rules {#FilterControl}

As the above classes of VCF filter rules inherit from the r Biocpkg("S4Vectors") FilterRules class, they also benefit from its accessors and methods. For instance, VCF filter rules can easily be toggled between active and inactive states:

active(vcfRules)["CADD_gt15"] <- FALSE
active(vcfRules)

A separate vignette describes in greater detail the use of classes that contain VCF filter rules.

Evaluation of VCF filter rules {#FilterEval}

Once defined, the above filter rules can be applied to ExpandedVCF objects, in the same way as FilterRules are evaluated in a given environment (see the r Biocpkg("S4Vectors") documentation):

summary(eval(expr = infoR, envir = vcf))
summary(eval(expr = vcfRules, envir = vcf))
summary(evalSeparately(expr = vcfRules, envir = vcf))

\bioccomment{ Note how the inactive rules (i.e. CADD, pass) returns TRUE for all variants, irrespective of the filter result. }

Visualising data {#Visualise}

Visualise INFO data by phenotype level on a genomic axis

Let us show the alternate allele frequency (AAF) of common variants, estimated in each super-population, in the context of the transcripts ovelapping the region of interest.

In the MAF track:

plotInfo(
        subsetByFilter(vcf, vcfRules["common"]), "AAF",
        range(GenomicRanges::granges(vcf)),
        EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75,
        "super_pop",
        zero.rm = FALSE
    )

Alternatively, the minor allele frequency (MAF) of missense variants (as estimated from the entire data set) may be visualised in the same manner. However, due to the nature of those variants, the zero.rm argument may be set to TRUE to hide all data points showing a MAF of 0; thereby variants actually detected in each super-population are emphasised even at low frequencies.

plotInfo(
        subsetByFilter(vcf, vcfRules["missense"]), "MAF",
        range(GenomicRanges::granges(vcf)),
        EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75,
        "super_pop",
        zero.rm = TRUE
    )

Pairwise comparison of INFO data between phenotype levels

Using the r CRANpkg("GGally") ggpairs method, let us make a matrix of plots for common variants, showing:

[^1]: Pearson, by default

pairsInfo(subsetByFilter(vcf, vcfRules["common"]), "AAF", "super_pop")

Note that the ellipsis ... allows a high degree of customisation, as it passes additional arguments to the underlying ggpairs method.

A taste of future...

This section presents upcoming features.

Summarising Ensembl VEP predictions {#Summarise}

As soon as genetic and phenotypic information are imported into an ExpandedVCF object, or after the object was extended with additional information, the scientific value of the data may be revealed by a variety of summary statistics and graphical representations. This section will soon present several ideas being implemented in r Biocpkg("TVTB"), for instance:

Acknowledgement

Dr. Stefan Gräf and Mr. Matthias Haimel for advice on the VCF file format and the Ensembl VEP script. Prof. Martin Wilkins for his trust and support. Dr. Michael Lawrence for his helpful code review and suggestions.

Last but not least, the amazing collaborative effort of the rep("many",n) Bioconductor developers whose hard work appears through the dependencies of this package.

Session info {#SessionInfo}

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References {#References}



kevinrue/TVTB documentation built on July 9, 2024, 11:42 p.m.