test_ctDNA: Tests the ctDNA positivity of a sample

Description Usage Arguments Details Value See Also Examples

View source: R/test_ctDNA.R

Description

Given a set of reporter mutation, this functions counts the reads matching the reporter mutations in the sample to be tested, estimates the mismatch rate for the sample to be tested, and then runs a Monte Carlo simulation test to determine whether the tested sample is positive or negative.

Usage

1
2
3
4
5
6
7
8
test_ctDNA(mutations, bam, targets, reference, tag = "",
  ID_column = NULL, black_list = NULL, substitution_specific = TRUE,
  vaf_threshold = 0.1, min_base_quality = 30, max_depth = 1e+05,
  min_mapq = 40, bam_list = NULL, bam_list_tags = rep("",
  length(bam_list)), min_alt_reads = 1,
  min_samples = ceiling(length(bam_list)/10), n_simulations = 10000,
  pvalue_threshold = 0.05, seed = 123,
  informative_reads_threshold = 10000)

Arguments

mutations

A data frame with the reporter mutations. Should have the columns CHROM, POS, REF, ALT.

bam

path to bam file

targets

a data frame with the target regions. Must have three columns: chr, start and end

reference

the reference genome in BSgenome format

tag

the RG tag if the bam has more than one sample

ID_column

The name of the column that contains the ID of mutations in phase. All mutations in Phase should have the same ID in that column.

black_list

a character vector of genomic loci to filter. The format is chr_pos if substitution_specific is false or chr_pos_ref_alt if substitution_specific is true. If given, will override bam_list.

substitution_specific

logical, whether to have the loci of black_list by substitutions.

vaf_threshold

When calculating the background rate, the bases with higher than this VAF threshold will be ignored (real mutations/SNPs).

min_base_quality

minimum base quality for a read to be counted

max_depth

maximum depth above which sampling will happen

min_mapq

the minimum mapping quality for a read to be counted

bam_list

A vector containing the paths to bam files used to filter mutations. Mutations that have more than min_alt_reads in more than min_samples will be filtered. Using black_list is more recommended.

bam_list_tags

the RG tags for the bams included in bams list.

min_alt_reads

When bam_list is provided, this sets the minimum number of alternative allele reads for a sample to be counted.

min_samples

When number of samples having more than min_alt_reads exceeds this number, the mutation will be filtered.

n_simulations

the number of Monte Carlo simulations.

pvalue_threshold

the p-value threshold used to decide positivity or negativity.

seed

the random seed to make the Monte Carlo simulations reproducible.

informative_reads_threshold

the number of informative reads (unique reads mapping to specified mutations) under which the test will be undetermined.

Details

This is the main function to test minimal residual disease by ctDNA positivity in a follow-up sample (e.g. after treatment). The inputs include a bam file for the follow-up sample to be tested, a list of reporter mutations (detected for example before treatment in a ctDNA positive sample), and an optional black_list (recommended to use) computed from a list of bam files of healthy-like samples or bam_list of the bam_files to use instead of black_list.

The workflow includes the following steps:

1.

Filtering mutations (optional but recommended): The mutations in the input will be filtered removing the ones reported in the black list. If bam_list is provided, the mutations will be filtered according to the min_samples and min_alt_reads parameters. See filter_mutations.

2.

The background rate will be computed for the input bam. The black list will be plugged in to exclude the black-listed loci when computing the background rate. The black list can be substitution_specific or not, but in both cases, the background rate will be adjusted accordingly. See get_background_rate.

3.

Counting reference and alternative alleles of the reporter mutations in the bam file.

4.

Merging mutations in phase (optional but recommended): If the ID_column is specified in the mutations input, mutations with the same ID (in phase) will be merged. While doing so, it is expected that real traces of mutations will be exhibited jointly in the reads spanning phased mutations, whereas artifacts will not show in all the covered mutations in phase. Therefore, the mismatches that map only to a subset of the mutations in phase but not the others (which are covered and show reference allele) will be removed. This will lead to reduction of the observed mismatches, and therefore, the computed background rate will be adjusted accordingly: new rate = old rate * (1 - purification_probability). See merge_mutations_in_phase.

5.

Monte Carlo test: this approach was used by Newman et al., Nature Biotechnology 2016.

  • Given N reporter mutations each with depth Di, randomly sample variant allele reads Xi under the background rate p using binomial distribution Xi ~ Binom(Di, p).

  • Repeat n_simuations times.

  • Count the number of simulations, where simulated data equal or exceed observed data in jointly two measurements: (1) the average VAF for the N mutations, and (2) the number of mutations with non-zero VAF.

  • Compute an empirical p-value as (#successes + 1)/(#simulations + 1)

6.

Make a decision: If number of informative reads is lower than informative_reads_threshold parameter, the decision will be undetermined. Otherwise, the pvalue_threshold parameters will be used to determine positivity or negativity.

Value

a data frame with the following columns:

See Also

get_background_rate merge_mutations_in_phase create_black_list create_background_panel filter_mutations

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
## Load example data
data("mutations", package = "ctDNAtools")
data("targets", package = "ctDNAtools")
bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools")
bamT2 <- system.file("extdata", "T2.bam", package = "ctDNAtools")
bamN1 <- system.file("extdata", "N1.bam", package = "ctDNAtools")
bamN2 <- system.file("extdata", "N2.bam", package = "ctDNAtools")
bamN3 <- system.file("extdata", "N3.bam", package = "ctDNAtools")

## Use human reference genome from BSgenome.Hsapiens.UCSC.hg19 library
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))

## basic usage
test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100
)

## More options
test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, min_base_quality = 20, min_mapq = 30,
  vaf_threshold = 0.05
)

## Use phasing information
test_ctDNA(
  mutations = mutations, bam = bamT2,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING"
)

## Use a black list based on loci
bg_panel <- create_background_panel(
  bam_list = c(bamN1, bamN2, bamN3),
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  substitution_specific = FALSE
)

bl1 <- create_black_list(bg_panel,
  mean_vaf_quantile = 0.99,
  min_samples_one_read = 2, min_samples_two_reads = 1
)

test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING", black_list = bl1,
  substitution_specific = FALSE
)

## Use a substitution-specific black list
bg_panel <- create_background_panel(
  bam_list = c(bamN1, bamN2, bamN3),
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  substitution_specific = TRUE
)

bl2 <- create_black_list(bg_panel,
  mean_vaf_quantile = 0.99,
  min_samples_one_read = 2, min_samples_two_reads = 1
)

test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING", black_list = bl2,
  substitution_specific = TRUE
)

ctDNAtools documentation built on March 26, 2020, 7:39 p.m.