tallyVariants: Tally the positions in a BAM file

tallyVariants    R Documentation

Tally the positions in a BAM file

Description

Tallies the bases, qualities and read positions for every genomic position in a BAM file. By default, only the positions at which an alternate base has been detected are returned. The typical usage is to pass a BAM file, the genome, the (fixed) read_length and, if the variant calling should consider base quality, an appropriate high_base_quality cutoff.

Passing a which argument allows computing on only a subregion of the genome. which is a ‘RangesList’ or something coercible to one that limits the tally to that range or set of ranges. By default, the entire genome is processed.

For parallel evaluation (see BPPARAM), which can be a ‘GenomicRanges’ or a ‘GRangesList’. If which is a ‘GenomicRanges’ of length 1, it is tiled to create chunks for parallel evaluation. If it is longer than 1, each range becomes a chunk. If which is a ‘GRangesList’, each element (i.e. each ‘GenomicRanges’) becomes a chunk. The latter can be useful to ensure balanced worker load, e.g. in the case of regions covering multiple sequences (see equisplit).
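The three shapes of which described above can be sketched as follows. This is only an illustration: the sequence names and coordinates are made up, and the code builds the objects without running a tally.

```r
if (requireNamespace("GenomicRanges", quietly = TRUE) &&
    requireNamespace("IRanges", quietly = TRUE)) {
    ## One range: per the text above, it would be tiled into chunks.
    which.single <- GenomicRanges::GRanges("chr17",
                                           IRanges::IRanges(1, 1000000))
    ## Several ranges: each range would become its own chunk.
    which.multi <- GenomicRanges::GRanges(
        c("chr1", "chr2", "chr17"),
        IRanges::IRanges(start = c(1, 1, 1),
                         end = c(500000, 500000, 1000000)))
    ## GRangesList: each element is one chunk. The grouping is done by hand
    ## here so that both chunks cover 1 Mb in total.
    which.grouped <- GenomicRanges::GRangesList(small = which.multi[1:2],
                                                large = which.multi[3])
    length(which.multi)    # 3
    length(which.grouped)  # 2
}
```

Any of these objects would then be supplied through the which argument, in the same way the Examples section passes gmapR::TP53Which().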

Usage

## S4 method for signature 'BamFile'
tallyVariants(x, param = TallyVariantsParam(...), ...,
                                  BPPARAM = defaultBPPARAM())
## S4 method for signature 'BamFileList'
tallyVariants(x, ...)
## S4 method for signature 'character'
tallyVariants(x, ...)
TallyVariantsParam(genome,
                   read_pos_breaks = NULL,
                   high_base_quality = 0L,
                   minimum_mapq = 13L,
                   variant_strand = 1L, ignore_query_Ns = TRUE,
                   ignore_duplicates = TRUE,
                   mask = GRanges(), keep_extra_stats = TRUE,
                   read_length = NA_integer_,
                   read_pos = !is.null(read_pos_breaks),
                   high_nm_score = NA_integer_,
                   ...)

Arguments

x

An indexed BAM file, given as a path, a BamFile or a BamFileList object. For a BamFileList, the tallies are computed separately for each file, and the results are stacked with stackSamples into a single VRanges.

param

The parameters for the tallying process, as a BamTallyParam, typically constructed with TallyVariantsParam, see arguments below.

...

For tallyVariants, arguments to pass to TallyVariantsParam, listed below. For TallyVariantsParam, arguments to pass to BamTallyParam.

genome

The genome, either a GmapGenome or something coercible to one.

read_pos_breaks

The breaks used for tabulating the read positions at each genomic position. If this information is included (not NULL), qaVariants will use it during filtering.
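To make the idea of tabulating by breaks concrete, here is a small base R sketch. It assumes the breaks behave like those given to cut(); the read positions and break values are invented for illustration, and the boundary handling of the underlying tally may differ.

```r
## Hypothetical read positions at which one allele was observed.
read.positions <- c(1L, 5L, 12L, 40L, 74L)
## Hypothetical breaks defining three bins: (0,10], (10,65], (65,75].
read.pos.breaks <- c(0L, 10L, 65L, 75L)
## Count the observations falling into each bin.
bin.counts <- table(cut(read.positions, breaks = read.pos.breaks))
bin.counts
```

A vector such as read.pos.breaks is what would be passed as read_pos_breaks to TallyVariantsParam.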

high_base_quality

The minimum cutoff for whether a base is counted as high quality. By default, callVariants will use the high quality counts in the likelihood ratio test. Note that bam_tally will shift your quality scores by 33 regardless of their encoding. For Illumina (pre 1.8) qualities, this results in a range of 31-71. For Sanger/Illumina 1.8 qualities, this results in a range of 0-40/41. The default counts all bases as high quality. We typically use 56 for old Illumina, 23 for Sanger/Illumina 1.8.
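The arithmetic behind these ranges can be checked with a short base R sketch. The two helper functions are hypothetical (not part of VariantTools or gmapR), and the sketch assumes the shift is a subtraction of 33 from the ASCII code of each quality character, with an ASCII offset of 64 for Illumina before 1.8 and 33 for Sanger/Illumina 1.8.

```r
## Hypothetical helper: quality values after a fixed shift of 33.
shifted_quality <- function(qual_string) {
    utf8ToInt(qual_string) - 33L
}

## Hypothetical helper: express a Phred cutoff on the shifted scale,
## given the ASCII offset of the encoding (33 or 64).
shifted_cutoff <- function(phred, offset) {
    as.integer(phred + offset - 33L)
}

shifted_quality("!I")              # Sanger extremes: 0 40
shifted_quality("@h")              # old Illumina extremes: 31 71
shifted_cutoff(23L, offset = 33L)  # 23
shifted_cutoff(25L, offset = 64L)  # 56
```

Under these assumptions, the typical cutoffs of 23 and 56 correspond to Phred 23 and Phred 25 respectively.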

minimum_mapq

Minimum MAPQ of a read for it to be included in the tallies. This depends on the aligner; the default is reasonable for gsnap.

variant_strand

The number of strands on which an alternate base must be detected for a position to be returned. Setting this to at least 1 is highly recommended (otherwise, the result is huge and includes many uninteresting reference rows).

ignore_query_Ns

Whether to ignore N calls in the reads. Usually, there is no reason to set this to FALSE. If it is FALSE, beware of low quality datasets returning enormous results.

ignore_duplicates

Whether to ignore reads flagged as PCR/optical duplicates.

mask

A GRanges specifying a mask; all variants falling within the mask are discarded.

read_length

The expected read length, used for calculating the “median distance from nearest end” statistic. If not specified, an attempt is made to guess the read length from a random sample of the BAM file. If the read length is found to be variable, statistics depending on the read length are not calculated.

read_pos

Whether to tally read positions, which can be computationally intensive.

high_nm_score

If not NA, counts of reads with an NM (mismatch count) score equal to or greater than this value are returned in the count.high.nm and count.high.nm.ref columns.

keep_extra_stats

Whether to keep various summary statistics generated from the tallies; setting this to FALSE will save memory. The extra statistics are most useful for algorithm diagnostics and development.

BPPARAM

A BiocParallelParam object specifying the resources and strategy for parallelizing the tally operation over the chromosomes.
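As a rough sketch of what a BPPARAM value can look like, the following constructs two BiocParallel back ends. The worker count is arbitrary, and the final tallyVariants call is left as a comment because it assumes that bam and tally.param already exist.

```r
if (requireNamespace("BiocParallel", quietly = TRUE)) {
    ## Serial evaluation, convenient for debugging.
    serial.param <- BiocParallel::SerialParam()
    ## Two local worker processes; SnowParam is available on all platforms.
    snow.param <- BiocParallel::SnowParam(workers = 2L)
    ## Hypothetical call, assuming `bam` and `tally.param` are defined:
    ## raw.variants <- tallyVariants(bam, tally.param, BPPARAM = snow.param)
}
```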

Value

For tallyVariants, the tally GRanges.

For TallyVariantsParam, an object with parameters suitable for variant calling.

Note

The VariantTallyParam constructor is DEPRECATED.

Author(s)

Michael Lawrence, Jeremiah Degenhardt

Examples

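if (requireNamespace("gmapR", quietly = TRUE) &&
    requireNamespace("LungCancerLines", quietly = TRUE)) {
    ## Tally parameters: TP53 demo genome, a base quality cutoff suited to
    ## Sanger/Illumina 1.8 qualities, restricted to the TP53 region.
    tally.param <- TallyVariantsParam(gmapR::TP53Genome(),
                                      high_base_quality = 23L,
                                      which = gmapR::TP53Which())
    ## Example BAM files shipped with the LungCancerLines package.
    bams <- LungCancerLines::LungCancerBamFiles()
    ## Tally a single sample (H1993).
    raw.variants <- tallyVariants(bams$H1993, tally.param)
}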

lawremi/VariantTools documentation built on March 4, 2024, 11:54 a.m.