VCF filter rules

Motivation {#Motivation}

Background

VCF objects of the r Biocpkg("VariantAnnotation") package contain a plethora of information imported from specific fields of source VCF files and stored in dedicated slots (e.g. fixed, info, geno), as well as optional Ensembl VEP predictions [@RN1] stored under a given key of their INFO slot.

This information may be used to identify and filter variants of interest for further analysis. However, the size of genetic data sets and the variety of filter rules---and their combinatorial explosion---create considerable challenges in terms of workspace memory and entropy (i.e. size and number of objects in the workspace, respectively).

The FilterRules class implemented in the r Biocpkg("S4Vectors") package provides a powerful tool to create flexible and lightweight filter rules defined in the form of expression and function objects that can be evaluated within given environments. The r Biocpkg("TVTB") package extends this FilterRules class into novel classes of VCF filter rules, applicable to information stored in the distinct slots of VCF objects (i.e. CollapsedVCF and ExpandedVCF classes), as described below:

motivations <- readRDS(
    system.file("misc", "motivations_rules.rds", package = "TVTB")
)
pander::pandoc.table(
    motivations,
    paste(
        "Motivation for each of the new classes extending `FilterRules`,
        to define VCF filter rules."
    ),
    style = "multiline",
    split.cells = c(15, 45),
    split.tables = c(Inf)
)

Table: Motivation for each of the new classes extending FilterRules to define VCF filter rules.

Note that FilterRules objects themselves are applicable to VCF objects, with two important difference from the above specialised classes:

fr <- S4Vectors::FilterRules(list(
    mixed = function(x){
        VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" &
            VariantAnnotation::info(x)[,"MAF"] >= 0.05
    }
))
fr

Features {#Features}

As they inherit from the FilterRules class, these new classes benefit from accessors and methods defined for their parent class, including:

To account for the more complex structure of VCF objects, some of the new VCF filter rules classes implemented in the r Biocpkg("TVTB") package require additional information stored in new dedicated slots, associated with the appropriate accessors and setters. For instance:

Demonstration data {#DemoData}

For the purpose of demonstrating the utility and usage of VCF filter rules, a set of variants and associated phenotype information was obtained from the 1000 Genomes Project Phase 3 release. It can be imported as a CollapsedVCF object using the following code:

library(TVTB)
extdata <- system.file("extdata", package = "TVTB")
vcfFile <- file.path(extdata, "chr15.phase3_integrated.vcf.gz")
tabixVcf <- Rsamtools::TabixFile(file = vcfFile)
vcf <- VariantAnnotation::readVcf(file = tabixVcf)

VCF filter rules may be applied to ExpandedVCF objects equally:

evcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)

CollapsedVCF and ExpandedVCF {#VcfClasses}

As described in the documentation of the r Biocpkg("VariantAnnotation") package, the key difference between CollapsedVCF and ExpandedVCF objects ---both extending the VCF class---is the expansion of multi-allelic records into bi-allelic records, respectively. In other words (quoting the r Biocpkg("VariantAnnotation") documentation):

"CollapsedVCF objects contains the ALT data as a DNAStringSetList allowing for multiple alleles per variant. In contrast, the ExpandedVCF stores the ALT data as a DNAStringSet where the ALT column has been expanded to create a flat form of the data with one row per variant-allele combination."

This difference has implications for filter rules using the "ALT" field of the info slot, as demonstrated in a later section.

Fields available for the definition of filter rules {#VcfFields}

First, let us examine which fields (i.e. column names) are available in the VCF objects to create VCF filter rules:

fixedVcf <- colnames(fixed(vcf))
fixedVcf
infoVcf <- colnames(info(vcf))
infoVcf
csq <- ensemblVEP::parseCSQToGRanges(x = evcf)
vepVcf <- colnames(mcols(csq))
vepVcf

\bioccomment{ Although multi-allelic records present in CollapsedVCF objects are expanded into bi-allelic records in ExpandedVCF objects, names fields in the different slots are identical in the corresponding collapsed and expanded objects. }

Usage of VCF filter rules {#FilterUsage}

Filter rules using a single field {#SingleFieldFilter}

The value of a particular field can be used to define expressions that represent simple filter rules based on that value alone. Multiple rules may be stored in any one FilterRules objects. Ideally, VCF filter rules should be named to facilitate their use, but also as a reminder of the purpose of each particular rule. For instance, in the chunk of code below, two filter rules are defined using fields of the fixed slot:

fixedRules <- VcfFixedRules(exprs = list(
    pass = expression(FILTER == "PASS"),
    qual20 = expression(QUAL >= 20)
))
active(fixedRules)["qual20"] <- FALSE
summary(evalSeparately(fixedRules, vcf))

In the example above, all variants pass the active "pass" filter, while the deactivated rules "qual20") automatically returns TRUE for all variants.

\bioccomment{ active(fixedRules["qual20"]) <- FALSE would be wrong syntax, as it does not change the active state of the fixedRules object; instead, it changes the active state of a temporary object made of the selected filter rule. As a consequence, the fixedRules object would be left unchanged after this command. }

Filter rules using multiple fields {#MultipleFieldsFilter}

It is also possible for VCF filter rules to use multiple fields (of the same VCF slot) in a single expression. In the chunk of code below, the VCF filter rule identifies variants for which both the "REF" and "ALT" values (in the INFO slot) are one of the four nucleotides (i.e. a simple definition of single nucleotide polymorphisms; SNPs):

nucleotides <- c("A", "T", "G", "C")
SNPrule <- VcfFixedRules(exprs = list(
    SNP = expression(
    as.character(REF) %in% nucleotides &
        as.character(ALT) %in% nucleotides)
))
summary(evalSeparately(SNPrule, evcf, enclos = .GlobalEnv))

Some considerations regarding the above filter rule:

Calculations in filter rules {#CalculationFilter}

Expressions that define filter rules may also include calculations. In the chunk of code below, two simple VCF filter rules are defined using fields of the info slot:

infoRules <- VcfInfoRules(exprs = list(
    samples = expression(NS > (0.9 * ncol(evcf))),
    avgSuperPopAF = expression(
        (EAS_AF + EUR_AF + AFR_AF + AMR_AF + SAS_AF) / 5 > 0.05
    )
))
summary(evalSeparately(infoRules, evcf, enclos = .GlobalEnv))

\bioccomment{ Here, the global environment must be supplied as the filter rule requires the evcf object itself to access its number of columns (not only as the environment in which the rules must be evaluated). }

Functions in filter rules {#FunctionFilter}

It may be more convenient to define filters as function objects. For instance, the chunk of code below:

AFcutoff <- 0.05
popCutoff <- 2/3
filterFUN <- function(envir){
    # info(vcf) returns a DataFrame; rowSums below requires a data.frame
    df <- as.data.frame(envir)
    # Identify fields storing allele frequency in super-populations
    popFreqCols <- grep("[[:alpha:]]{3}_AF", colnames(df))
    # Count how many super-population have an allele freq above the cutoff
    popCount <- rowSums(df[,popFreqCols] > AFcutoff)
    # Convert the cutoff ratio to a integer count
    popCutOff <- popCutoff * length(popFreqCols)
    # Identifies variants where enough super-population pass the cutoff
    testRes <- (popCount > popCutOff)
    # Return a boolean vector, required by the eval method
    return(testRes)
}
funFilter <- VcfInfoRules(exprs = list(
    commonSuperPops = filterFUN
))
summary(evalSeparately(funFilter, evcf))

\bioccomment{ Filter rules written as functions may use values in the the global environment, without the need to supply it as the enclosing environment. }

Notably, the filterFUN function may also be applied separately to the info slot of VCF objects:

summary(filterFUN(info(evcf)))

Pattern matching in filter rules {#PatternFilter}

The grepl function is particularly suited for the purpose of FilterRules as they return a logical vector:

missenseFilter <- VcfVepRules(
    exprs = list(
        exact = expression(Consequence == "missense_variant"),
        grepl = expression(grepl("missense", Consequence))
        ),
    vep = "CSQ")
summary(evalSeparately(missenseFilter, evcf))

In the above chunk of code:

Using ALT data in the fixed slot of VCF objects {#AltField}

As detailed in an earlier section introducing the demonstration data, and more thoroughly in the documentation of the r Biocpkg("VariantAnnotation") package, CollapsedVCF and ExpandedVCF classes differ in the class of data stored in the "ALT" field of their respective fixed slot. As as result, VCF filter rules using data from this field must take into account the VCF class in order to handle the data appropriately:

ExpandedVCF objects {#ALTExpandedVCF}

A key aspect of ExpandedVCF objects is that the "ALT" field of their fixed slot may store only a single allele per record as a DNAStringSet object.

For instance, in an earlier section that demonstrated Filter rules using multiple raw fields, ALT data of the fixed slot in an ExpandedVCF object had to be re-typed from DNAStringSet to character before the %in% function could be applied.

Nevertheless, VCF filter rules may also make use of methods associated with the DNAStringSet class. For instance, genetic insertions may be identified using the fields "REF" and "ALT" fields of the fixed slot:

fixedInsertionFilter <- VcfFixedRules(exprs = list(
    isInsertion = expression(
        Biostrings::width(ALT) > Biostrings::width(REF)
    )
))
evcf_fixedIns <- subsetByFilter(evcf, fixedInsertionFilter)
as.data.frame(fixed(evcf_fixedIns)[,c("REF", "ALT")])

Here, the above VcfFixedRules is synonym to a distinct VcfVepRules using the Ensembl VEP prediction "VARIANT_CLASS":

vepInsertionFilter <- VcfVepRules(exprs = list(
    isInsertion = expression(VARIANT_CLASS == "insertion")
))
evcf_vepIns <- subsetByFilter(evcf, vepInsertionFilter)
as.data.frame(fixed(evcf_vepIns)[,c("REF", "ALT")])

CollapsedVCF objects {#ALTCollapsedVCF}

In contrast to ExpandedVCF, CollapsedVCF may contain more than one allele per record in their "ALT" field (fixed slot), represented by a DNAStringSetList object.

As a result, VCF filter rules using the "ALT" field of the info slot in CollapsedVCF objects may use methods dedicated to DNAStringSetList to handle the data. For instance, multi-allelic variants may be identified by the following VcfFixedRules:

multiallelicFilter <- VcfFixedRules(exprs = list(
    multiallelic = expression(lengths(ALT) > 1)
))
summary(eval(multiallelicFilter, vcf))

Combination of multiple types of VCF filter rules {#VcfFilterRules}

Any number of VcfFixedRules, VcfInfoRules, and VcfVepRules---or even VcfFilterRules themselves---may be combined into a larger object of class VcfFilterRules. Notably, the active state of each filter rule is transferred to the combined object. Even though the VcfFilterRules class acts as a container for multiple types of VCF filter rules, the resulting VcfFilterRules object also extends the FilterRules class, and as a result can be evaluated and used to subset VCF objects identically to any of the specialised more specialised classes.

During the creation of VcfFixedRules objects, each VCF filter rule being combined is marked with a type value, indicating the VCF slot in which the filter rule must be evaluated. This information is stored in the new type slot of VcfFixedRules objects. For instance, it is possible to combine two VcfFixedRules (containing two and one filter rules, respectively), one VcfInfoRules, and one VcfVepRules defined earlier in this vignette:

vignetteRules <- VcfFilterRules(
    fixedRules,
    SNPrule,
    infoRules,
    vepInsertionFilter
)
vignetteRules
active(vignetteRules)
type(vignetteRules)
summary(evalSeparately(vignetteRules, evcf, enclos = .GlobalEnv))

Clearly[^1], the VCF filter rules SNP and isInsertion are mutually exclusive, which explains the final 0 variants left after filtering. Conveniently, either of these rules may be deactivated before evaluating the remaining active filter rules:

[^1]: This statement below would be more evident if the summary method was displaying the result of evalSeparately in this vignette as it does it in an R session.

active(vignetteRules)["SNP"] <- FALSE
summary(evalSeparately(vignetteRules, evcf, enclos = .GlobalEnv))

As a result, the deactivated filter rule ("SNP") now returns TRUE for all variants, leaving a final 2 variants[^2] pass the remaining active filter rules:

[^2]: Again, this statement would benefit from the result of evalSeparately being displayed identically to an R session.

Finally, the following chunk of code demonstrates how VcfFilterRules may also be created from the combination of VcfFilterRules, either with themselves or with any of the classes that define more specific VCF filter rules. Notably, when VcfFilterRules objects are combined, the type and active value of each filter rule is transferred to the combined object:

Combine VcfFilterRules with VcfVepRules

combinedFilters <- VcfFilterRules(
    vignetteRules, # VcfFilterRules
    missenseFilter # VcfVepRules
)
type(vignetteRules)
type(combinedFilters)
active(vignetteRules)
active(missenseFilter)
active(combinedFilters)

Combine multiple VcfFilterRules with VcfFilterRules (and more)

To demonstrate this action, another VcfFilterRules must first be created. This can be achieve by simply re-typing a VcfVepRules defined earlier:

secondVcfFilter <- VcfFilterRules(missenseFilter)
secondVcfFilter

It is now possible to combine the two VcfFilterRules. Let us even combine another VcfInfoRules object in the same step:

manyRules <- VcfFilterRules(
    vignetteRules, # VcfFilterRules
    secondVcfFilter, # VcfFilterRules
    funFilter # VcfInfoRules
)
manyRules
active(manyRules)
type(manyRules)
summary(evalSeparately(manyRules, evcf, enclos = .GlobalEnv))

Critically, users must be careful to combine rules all compatible with the class of VCF object in which it will be evaluated (i.e. CollapsedVCF or ExpandedVCF).

Session info {#SessionInfo}

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References {#References}



Try the TVTB package in your browser

Any scripts or data that you put into this service are public.

TVTB documentation built on Nov. 8, 2020, 6:09 p.m.