knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

\newpage

What is Ideafix?

Ideafix is a package for variant refinement designed specifically for FFPE DNA sequencing data.

DNA obtained from FFPE samples suffers from significant levels of fragmentation, denaturation, cross-linking and chemical modifications, all of which can contribute to sequence artefacts. One of the most prevalent artefacts is deamination of cytosine residues to uracil, which, as a consequence of successive PCR amplification rounds, results in the C$>$T (or G$>$A anti-sense strand) variant. So far, there is no gold-standard technique to distinguish real variants from FFPE artifacts.

Ideafix is a decision tree-based variant refinement tool that identifies cytosine deaminations in lists of variants (vcf files) found in DNA sequencing data from FFPE specimens. It takes as an input a vcf file resulting from variant calling on an bam file from FFPE data, looks for C:G $>$ T:A variants with a VAF $<$ 30\% and identifies which of them are formalin-induced deaminations or artefacts. Ideafix outputs a table indicating the class label and the score for each variant, and allows the user to finally append the annotations to the original vcf file or simply output the table to a separate text file.

Ideafix is a tool based on fully supervised learning and in fact uses two different models to do the variant classification: XGBoost or Random Forest (RF). Both models show outstanding performance (AUC $>$ 0.95) so the user can decide which one to use indistinctively.

As a fully supervised model, Ideafix is trained over a series of 23 variant descriptors related to allele-frequency, sequencing quality, fragment length and read-orientation bias, among others. One of the main contributing feature to the model is FDeamC, which is a numeric descriptor that reflects read-pair orientation bias. The following figure summarizes what read-pair orientation bias is, why it arises and how it can be relevant for identifying cytosine deaminations:

Origin and display of read pair orientation bias in sequencing reads along time{width=100%}

Ideafix is described in detail in an article that is currently under review and we will provide the link to it as long as it gets published.

Installation

Ideafix can be installed from Github as follows:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("mmaitenat/ideafix", build_vignettes = TRUE)

Beware that devtools::install_github() does not build vignettes by default so we need to force it by setting build_vignettes to TRUE.

Note that installation may take a while as the size of the package is in the order of hundreds of Mbs.

Requirements

Ideafix needs the following programs to run:

It also depends on the following packages:

It is important to note that the h2o version must necessarily be that one. To install this specific version, we recommend that you install the version from source as follows:

if (!require("devtools")) install.packages("devtools")
devtools::install_version("h2o", version = "3.22.1.1", repos = "http://cran.us.r-project.org")
h2o_pck_url <- "https://cran.r-project.org/src/contrib/Archive/h2o/h2o_3.22.1.1.tar.gz"
install.packages(h2o_pck_url, repos = NULL, type = "source")

Ideafix also needs the following files:

Finally, Ideafix is only available for Linux and Mac systems.

Workflow

Ideafix classifies the C:G $>$ T:A variants with a VAF $<$ 30\% present in a vcf file into deaminations or non-deaminations following these steps:

  1. Variant calling using Mutect2
  2. Extraction of variant descriptors
  3. Classification of variants
  4. Analysis of the results (optional)
  5. Annotation of class label into original vcf file or to separate file (optional)

Ideafix allows to perform steps 1-4, and so the user needs to proceed previously with step 0. Below are explained these steps in more detail.

Step 0: Variant calling using Mutect2

Ideafix requires that variants in the FFPE sample to be analyzed have been called using Mutect2 caller with strand bias annotation enabled. Mutect2 can be run either in tumor-only or tumor-normal paired mode. For tumor-only mode, this would be as follows:

```{bash eval=FALSE} gatk4="java -jar /opt/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar" REF=~/Data/hg19/ucsc.hg19.fasta bamfile=mysample.bam sample_id=mysample vcf_filename=mysample.vcf

$gatk4 Mutect2 \ -R $REF \ -I $bamfile \ -tumor $sample_id \ -O mysample0.vcf \ --annotation StrandBiasBySample

$gatk4 FilterMutectCalls \ -V mysample0.vcf \ -O $vcf_filename

The code is almost the same in case we are doing a paired variant calling:

```{bash eval=FALSE}
tumor_bamfile=mysample_tumor.bam
tumor_sample_id=mysample_tumor
normal_bamfile=mysample_normal.bam
normal_sample_id=mysample_normal
vcf_filename=mysample.vcf

$gatk4 Mutect2 \
    -R $REF \
    -I $tumor_bamfile \
    -tumor $tumor_sample_id \
    -I $normal_bamfile \
    -normal $normal_sample_id \
    -O mysample0.vcf \
    --annotation StrandBiasBySample

$gatk4 FilterMutectCalls \
     -V mysample0.vcf \
     -O $vcf_filename         

The Mutect2 command runs the variant calling; the FilterMutectCalls applies filter-marks to the raw output of Mutect2. Note that FilterMutectCalls does not remove any variant from the vcf file; it simply adds a filter-mark to each of them: if it considers they are putative variants, it will be PASS, and if not PASS, the reason why it considers it is an artefact.

Ideafix only works on the PASS-filtering variants, as the rest are already marked as artifacts by the variant calling algorithm.

Step 1: Extraction of variant descriptors

Once the vcf file is ready, we can start off using Ideafix. Let us load the library:

library(ideafix)

The first step in the variant refinement is to extract the descriptors of the variants found in the vcf file. Ideafix extracts these descriptors from two files: the vcf file itself, and the fasta file to which data was aligned to.

For this tutorial, we have created two toy files: a vcf file (generated by selecting the variants in chromosome 15 from the vcf file generated as in Step 0) and the fasta file. As they are too heavy to be included in the package, we have made them available to you through this link.

Once you download them, you first need to extract the fasta file, which is compressed in tar.gz format. For doing so, you just need to type the following in the terminal, from the SampleData directory:

```{bash, eval=FALSE} tar -xzvf chr15.fasta.tar.gz

Or, if you prefer to use R:

```r
sampleData_dir <- "~/Downloads/SampleData" # change here to the correct SampleData directory
untar_cmd <- sprintf("tar -xzvf %s", file.path(sampleData_dir, "chr15.fasta.tar.gz"))
system(untar_cmd)

Once data is extracted, you must provide the full path to both files in the variables vcf_filename and ref_genome, as in this example:

vcf_filename <- "~/Downloads/SampleData/mysample_chr15.vcf"
ref_genome <- "~/Downloads/SampleData/chr15.fasta"

The final step is to obtain the descriptors as follows:

descriptors <- get_descriptors(vcf_filename = vcf_filename, fasta_filename = ref_genome)
descriptors <- readRDS(system.file("extdata", "descriptors.RDS", package = "ideafix"))

In this example, the resulting object is a data frame consisting of 28 columns or features for each variant in the vcf file. You can have a quick look on the data with the following command:

dplyr::glimpse(descriptors)

One can note that the descriptors variable contains 2,508 variants instead of the original 21,699 in mysample_chr15.vcf. This is because only these 2,508 were GATK PASS-filtering C:G > T:A variants with a VAF < 30%.

Step 2: Classification of variants

Now we are ready to classify the variants into deaminations or non-deaminations. As said, there are two ways to do this: using the XGBoost or the Random Forest model.

If we use the RF model:

predictions_RF <- classify_variants(variant_descriptors = descriptors, algorithm = "RF")

Equivalently, with XGBoost:

predictions_XGBoost <- classify_variants(variant_descriptors = descriptors, algorithm = "XGBoost")

In both cases, if everything went well, you will see the message "Classification went successfully".

The resulting object is again a data frame with 6 columns (chromosome, variant position in chromosome, reference allele, alternate allele, deamination score and class label). Of interest are the last two:

The DEAM_SCORE column contains a numeric number that reflects the certainty about the classification as deamination, 1 being the highest certainty. The DEAMINATION column is a categorical classification based on the score. Note that the threshold used for this classification follows the particular criterion of minimizing the number of false positives while keeping a minimum sensitivity of 0.99.

Each variant has a value for each of those six columns.

predictions_XGBoost
predictions_RF

The classification is finished. You can now optionally analyze the obtained labels (step 3) or simply write the results to a external file (step 4).

Step 3: Analysis of the results (optional)

We can now do some analyses on the classification output.

Let us first compare the classification produced by the two models:

table(factor(predictions_RF$DEAMINATION, levels = c("deamination", "non-deamination")), 
      factor(predictions_XGBoost$DEAMINATION, levels = c("deamination", "non-deamination")))

As can be seen, most of the variants have been classified as deaminations and the two models are in high agreement on their classification.

More interesting than directly relying the class labels (which are dependant on the previously mentioned criterion for the threshold) is working with the rankings.

Subsetting the set of non-deaminations

Let us suppose, for instance, that we are doing a molecular analysis on a sample and we need to analyze the relevance of the mutations on the sample. Moreover, the disease the sample comes from is known to be enriched in C:G $>$ T:A mutations and so we cannot directly discard all the mutations of this type. In this scenario, we want to further validate the results in the lab, but unfortunately the resources are limited and we can only validate, say, 10 variants. In such case, the best approach would be to select the 10 variants with lowest DEAM_SCORE and, thus, with the highest certainty of not being a deamination.

As there are two models, an option that combines the results of the two of them is to select those 10 variants based on the consensus of the models on the variant ranks:

library(dplyr)
predictions_XGBoost %>%
  arrange(DEAM_SCORE) %>%
  head(n = 10)

predictions_RF %>%
  arrange(DEAM_SCORE) %>%
  head(n = 10)
id <- order(rank(predictions_XGBoost$DEAM_SCORE) + rank(predictions_RF$DEAM_SCORE), decreasing = FALSE)
predictions_XGBoost[id[1:10], ] # same as predictions_RF[id[1:10], ]

Enrichment analysis

Enrichment Analysis seeks for pathways enriched in ranked gene lists. An enrichment analysis on those variants ordered in an increasing order by DEAM_SCORE could provide an overview of the dysfunctional pathways or mechanisms in that sample.

Using alternative classification thresholds (recommended)

The DEAMINATION column in the predictions object contains the class labels obtained after applying to the scores (DEAM_SCORE) the threshold that maximizes $F_1$ score in the dataset used to train the models. However, no classification threshold suits all users' needs and this is why we recommend that each one uses a threshold depending on its own needs.

The following lines show one possible alternative to select a custom threshold. First of all, we will observe the distribution of the scores:

library(ggplot2)
library(dplyr)
all_predictions <- bind_rows(list(RF = predictions_RF, XGBoost = predictions_XGBoost), .id = 'model')
all_predictions %>%
  group_by(model) %>%
  arrange(desc(DEAM_SCORE)) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  ggplot(aes(x = idx, y = DEAM_SCORE, colour = model)) +
  geom_line() +
  theme_minimal()
all_predictions %>%
  group_by(model) %>%
  arrange(desc(DEAM_SCORE)) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  ggplot(aes(x = DEAM_SCORE, colour = model)) +
  geom_histogram(binwidth = 0.01, position = "dodge") +
  theme_minimal()

As DEAM_SCORE is proportional to the degree of certainty on a variant being a deamination, this plot shows us that there is a large number of variants that are highly likely deaminations and that the two models make very similar predictions. An interesting alternative threshold would be that at which the score starts to drop rapidly, but the final election should be influenced by the steps to be taken with the variants hereafter. For instance, if the idea is to continue with a biological enrichment analysis, we can be quite permissive with the threshold as the analysis will filter additional variants. In such case, taking a close look at the curves, the threshold could be around 0.97.

thr_str <- 0.97
predictions_XGBoost_str <- predictions_XGBoost %>%
  mutate(DEAMINATION = ifelse(DEAM_SCORE >= thr_str, "deamination", "non-deamination"))
predictions_RF_str <- predictions_RF %>%
  mutate(DEAMINATION = ifelse(DEAM_SCORE >= thr_str, "deamination", "non-deamination"))

table(factor(predictions_RF_str$DEAMINATION, levels = c("deamination", "non-deamination")), 
      factor(predictions_XGBoost_str$DEAMINATION, levels = c("deamination", "non-deamination")))

If, instead, there are no additional filtering steps, it is a good idea to opt for a more conservative threshold, such as 0.75.

thr_sft <- 0.75
predictions_XGBoost_sft <- predictions_XGBoost %>%
  mutate(DEAMINATION = ifelse(DEAM_SCORE >= thr_sft, "deamination", "non-deamination"))
predictions_RF_sft <- predictions_RF %>%
  mutate(DEAMINATION = ifelse(DEAM_SCORE >= thr_sft, "deamination", "non-deamination"))

table(factor(predictions_RF_sft$DEAMINATION, levels = c("deamination", "non-deamination")), 
      factor(predictions_XGBoost_sft$DEAMINATION, levels = c("deamination", "non-deamination")))

We can compare the contingency tables between these labels and the ones originally provided (in which a thresholds between 0.95 and 0.96 were used):

table(factor(predictions_RF$DEAMINATION, levels = c("deamination", "non-deamination")), 
      factor(predictions_XGBoost$DEAMINATION, levels = c("deamination", "non-deamination")))

As expected, as we raise the threshold, less deaminations are called in favour of non-deaminations.

Step 4: Annotation of class label into original vcf file or to separate file (optional)

There are two options to output the class scores and labels. The most straightforward is to append them to the original vcf file. vcf format is the most common variant format and it is supported by almost all tools that work with this kind of data, so this is probably the easiest way of working with ideafix labels from here on. This can be done as follows:

outname <- "mysample_classified"
outfolder <- "."
annotate_deaminations(classification = predictions_XGBoost, format = "vcf", vcf_filename = vcf_filename, outfolder = outfolder, outname = outname)

The resulting file will have two new elements (related to the deamination score and deamination label) in the INFO field of each PASS-filtering C:G > T:A variant, as in the following example:

cmd <- sprintf("grep PASS %s.vcf | grep DEAM | head -1", file.path(outfolder, outname))
cat(system(cmd, intern = TRUE))
cmd <- sprintf("grep PASS %s | grep DEAM | head -1", system.file("extdata", "mysample_classified.vcf", package = "ideafix"))
cat(system(cmd, intern = TRUE))

The two elements are described in the header of the resulting vcf file:

cmd <- sprintf("bcftools view -h %s.vcf | grep DEAM", file.path(outfolder, outname))
system(cmd, intern = TRUE)
cmd <- sprintf("bcftools view -h %s | grep DEAM", system.file("extdata", "mysample_classified.vcf", package = "ideafix"))
system(cmd, intern = TRUE)

Instead, we can simply write the predictions_XGBoost(or the predictions_RF) object as it is to a different text file:

annotate_deaminations(classification = predictions_XGBoost, format = "tsv", outname = "mysample_classified")

If we do not specify any file name in outname, the filename will default to ideafix_labels.txt

annotate_deaminations(classification = predictions_XGBoost, format = "tsv")

Session Info

sessionInfo()


mmaitenat/ideafix documentation built on Sept. 18, 2021, 7:55 a.m.