pileup: Use filters and output formats to calculate pile-up...

Description Usage Arguments Details Value Author(s) See Also Examples

Description

pileup uses PileupParam and ScanBamParam objects to calculate pileup statistics for a BAM file. The result is a data.frame with columns summarizing counts of reads overlapping each genomic position, optionally differentiated on nucleotide, strand, and position within read.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
pileup(file, index=file, ..., scanBamParam=ScanBamParam(),
       pileupParam=PileupParam())

## PileupParam constructor
PileupParam(max_depth=250, min_base_quality=13, min_mapq=0,
    min_nucleotide_depth=1, min_minor_allele_depth=0,
    distinguish_strands=TRUE, distinguish_nucleotides=TRUE,
    ignore_query_Ns=TRUE, include_deletions=TRUE, include_insertions=FALSE,
    left_bins=NULL, query_bins=NULL, cycle_bins=NULL)

phred2ASCIIOffset(phred=integer(),
    scheme= c("Illumina 1.8+", "Sanger", "Solexa", "Illumina 1.3+",
              "Illumina 1.5+"))

Arguments

file

character(1) or BamFile; BAM file path.

index

When file is character(1), an optional character(1) of BAM index file path; see scanBam.

...

Additional arguments, perhaps used by methods.

scanBamParam

An instance of ScanBamParam.

pileupParam

An instance of PileupParam.

max_depth

integer(1); maximum number of overlapping alignments considered for each position in the pileup.

min_base_quality

integer(1); minimum ‘QUAL’ value for each nucleotide in an alignment. Use phred2ASCIIOffset to help translate numeric or character values to these offsets.

min_mapq

integer(1); minimum ‘MAPQ’ value for an alignment to be included in pileup.

min_nucleotide_depth

integer(1); minimum count of each nucleotide (independent of other nucleotides) at a given position required for said nucleotide to appear in the result.

min_minor_allele_depth

integer(1); minimum count of all nucleotides other than the major allele at a given position required for a particular nucleotide to appear in the result.

distinguish_strands

logical(1); TRUE if result should differentiate between strands.

distinguish_nucleotides

logical(1); TRUE if result should differentiate between nucleotides.

ignore_query_Ns

logical(1); TRUE if non-determinate nucleotides in alignments should be excluded from the pileup.

include_deletions

logical(1); TRUE to include deletions in pileup.

include_insertions

logical(1); TRUE to include insertions in pileup.

left_bins

numeric; all same sign; unique positions within a read to delimit bins. Position within read is based on counting from the 5' end regardless of strand. Minimum of two values are required so at least one range can be formed. NULL (default) indicates no binning. Use negative values to count from the opposite end. Sorted order not required. Floating-point values are coerced to integer.

If you want to delimit bins based on sequencing cycles to, e.g., discard later cycles, query_bins probably gives the desired behavior.

query_bins

numeric; all same sign; unique positions within a read to delimit bins. Position within a read is based on counting from the 5' end when the read aligns to plus strand, counting from the 3' end when read aligns to minus strand. Minimum of two values are required so at least one range can be formed. NULL (default) indicates no binning. Use negative values to count from the opposite end. Sorted order not required. Floating-point values are coerced to integer.

phred

Either a numeric() (coerced to integer()) PHRED score (e.g., in the range (0, 41) for the ‘Illumina 1.8+’ scheme) or character() of printable ASCII characters. When phred is character(), it can be of length 1 with 1 or more characters, or of any length where all elements have exactly 1 character.

scheme

Encoding scheme, used to translate numeric() PHRED scores to their ASCII code. Ignored when phred is already character().

cycle_bins

DEPRECATED. See left_bins for identical behavior.

Details

NB: Use of pileup assumes familiarity with ScanBamParam, and use of left_bins and query_bins assumes familiarity with cut.

pileup visits each position in the BAM file, subject to constraints implied by which and flag of scanBamParam. For a given position, all reads overlapping the position that are consistent with constraints dictated by flag of scanBamParam and pileupParam are used for counting. pileup also automatically excludes reads with flags that indicate any of:

If no which argument is supplied to the ScanBamParam, behavior reflects that of scanBam: the entire file is visited and counted.

Use a yieldSize value when creating a BamFile instance to manage memory consumption when using pileup with large BAM files. There are some differences in how pileup behavior is affected when the yieldSize value is set on the BAM file. See more details below.

Many of the parameters of the pileupParam interact. A simple illustration is ignore_query_Ns and distinguish_nucleotides, as mentioned in the ignore_query_Ns section.

Parameters for pileupParam belong to two categories: parameters that affect only the filtering criteria (so-called ‘behavior-only’ policies), and parameters that affect filtering behavior and the schema of the results (‘behavior+structure’ policies).

distinguish_nucleotides and distinguish_strands when set to TRUE each add a column (nucleotide and strand, respectively) to the resulting data.frame. If both are TRUE, each combination of nucleotide and strand at a given position is counted separately. Setting only one to TRUE behaves as expected; for example, if only nucleotide is set to TRUE, each nucleotide at a given position is counted separately, but the distinction of on which strand the nucleotide appears is ignored.

ignore_query_Ns determines how ambiguous nucloetides are treated. By default, ambiguous nucleotides (typically ‘N’ in BAM files) in alignments are ignored. If ignore_query_Ns is FALSE, ambiguous nucleotides are included in counts; further, if ignore_query_Ns is FALSE and distinguish_nucleotides is TRUE the ‘N’ nucleotide value appears in the nucleotide column when a base at a given position is ambiguous.

By default, deletions with respect to the reference genome to which the reads were aligned are included in the counts in a pileup. If include_deletions is TRUE and distinguish_nucleotides is TRUE, the nucleotide column uses a ‘-’ character to indicate a deletion at a position.

The ‘=’ nucleotide code in the SEQ field (to mean ‘identical to reference genome’) is supported by pileup, such that a match with the reference genome is counted separately in the results if distinguish_nucleotides is TRUE.

CIGAR support: pileup handles the extended CIGAR format by only heeding operations that contribute to counts (‘M’, ‘D’, ‘I’). If you are confused about how the different CIGAR operations interact, see the SAM versions of the BAM files used for pileup unit testing for a fairly comprehensive human-readable treatment.

The “bin” arguments query_bins and left_bins allow users to differentiate pileup counts based on arbitrary non-overlapping regions within a read. pileup relies on cut to derive bins, but limits input to numeric values of the same sign (coerced to integers), including +/-Inf. If a “bin” argument is set pileup automatically excludes bases outside the implicit outer range. Here are some important points regarding bin arguments:

pileup obeys yieldSize on BamFile objects with some differences from how scanBam responds to yieldSize. Here are some points to keep in mind when using pileup in conjunction with yieldSize:

Value

For pileup a data.frame with 1 row per unique combination of differentiating column values that satisfied filter criteria, with frequency (count) of unique combination. Columns seqnames, pos, and count always appear; optional strand, nulceotide, and left_bin / query_bin columns appear as dictated by arguments to PileupParam.

Column details:

See scanBam for more detailed explanation of SAM fields.

If a which argument is specified for the scanBamParam, a which_label column (factor in the form ‘rname:start-end’) is automatically included. The which_label column is used to maintain grouping of results, such that two queries of the same genomic region can be differentiated.

Order of rows in data.frame is first by order of seqnames column based on the BAM index file, then non-decreasing order on columns pos, then nucleotide, then strand, then left_bin / query_bin.

PileupParam returns an instance of PileupParam class.

phred2ASCIIOffset returns a named integer vector of the same length or number of characters in phred. The names are the ASCII encoding, and the values are the offsets to be used with min_base_quality.

Author(s)

Nathaniel Hayden nhayden@fredhutch.org

See Also

For the relationship between PHRED scores and ASCII encoding, see https://en.wikipedia.org/wiki/FASTQ_format#Encoding.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
## The examples below apply equally to pileup queries for specific
## genomic ranges (specified by the 'which' parameter of 'ScanBamParam')
## and queries across entire files; the entire-file examples are
## included first to make them easy to find. The more detailed examples
## of pileup use queries with a 'which' argument.

library("RNAseqData.HNRNPC.bam.chr14")

fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
## Minimum base quality to be tallied
p_param <- PileupParam(min_base_quality = 10L)


## no 'which' argument to ScanBamParam: process entire file at once
res <- pileup(fl, pileupParam = p_param)
head(res)
table(res$strand, res$nucleotide)



## no 'which' argument to ScanBamParam with 'yieldSize' set on BamFile
bf <- open(BamFile(fl, yieldSize=5e4))
repeat {
    res <- pileup(bf, pileupParam = p_param)
    message(nrow(res), " rows in result data.frame")
    if(nrow(res) == 0L)
        break
}
close(bf)


## pileup for the first half of chr14 with default Pileup parameters
## 'which' argument: process only specified genomic range(s)
sbp <- ScanBamParam(which=GRanges("chr14", IRanges(1, 53674770)))
res <- pileup(fl, scanBamParam=sbp, pileupParam = p_param)
head(res)
table(res$strand, res$nucleotide)

## specify pileup parameters: include ambiguious nucleotides
## (the 'N' level in the nucleotide column of the data.frame)
p_param <- PileupParam(ignore_query_Ns=FALSE, min_base_quality = 10L)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
table(res$strand, res$nucleotide)

## Don't distinguish strand, require a minimum frequency of 7 for a
## nucleotide at a genomic position to be included in results.

p_param <- PileupParam(distinguish_strands=TRUE,
                       min_base_quality = 10L,
                       min_nucleotide_depth=7)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
table(res$nucleotide, res$strand)

## Any combination of the filtering criteria is possible: let's say we
## want a "coverage pileup" that only counts reads with mapping
## quality >= 13, and bases with quality >= 10 that appear at least 4
## times at each genomic position
p_param <- PileupParam(distinguish_nucleotides=FALSE,
                       distinguish_strands=FALSE,
                       min_mapq=13,
                       min_base_quality=10,
                       min_nucleotide_depth=4)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
res <- res[, c("pos", "count")] ## drop seqnames and which_label cols
plot(count ~ pos, res, pch=".")

## ASCII offsets to help specify min_base_quality, e.g., quality of at
## least 10 on the Illumina 1.3+ scale
phred2ASCIIOffset(10, "Illumina 1.3+")

## Well-supported polymorphic positions (minor allele present in at
## least 5 alignments) with high map quality
p_param <- PileupParam(min_minor_allele_depth=5,
                       min_mapq=40,
                       min_base_quality = 10,
                       distinguish_strand=FALSE)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
dim(res) ## reduced to our biologically interesting positions
head(xtabs(count ~ pos + nucleotide, res))

## query_bins



## basic use of positive bins: single pivot; count bases that appear in
## first 10 cycles of a read separately from the rest
p_param <- PileupParam(query_bins=c(0, 10, Inf), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)

## basic use of positive bins: simple range; only include bases in
## cycle positions 6-10 within read
p_param <- PileupParam(query_bins=c(5, 10), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)

## basic use of negative bins: single pivot; count bases that appear in
## last 3 cycle positions of a read separately from the rest.
p_param <- PileupParam(query_bins=c(-Inf, -4, -1), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)

## basic use of negative bins: drop nucleotides in last two cycle
## positions of each read
p_param <- PileupParam(query_bins=c(-Inf, -3), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)


## typical use: beginning, middle, and end segments; because of the
## nature of sequencing technology, it is common for bases in the
## beginning and end segments of each read to be less reliable. pileup
## makes it easy to count each segment separately.

## Assume determined ahead of time that for the data 1-7 makes sense for
## beginning, 8-12 middle, >=13 end (actual cycle positions should be
## tailored to data in actual BAM files).
p_param <- PileupParam(query_bins=c(0, 7, 12, Inf), ## note non-symmetric
                       min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt)
t(t(xt) / colSums(xt)) ## cheap normalization for illustrative purposes

## 'implicit outer range': in contrast to c(0, 7, 12, Inf), suppose we
##  still want to have beginning, middle, and end segements, but know
##  that the first three cycles and any bases beyond the 16th cycle are
##  irrelevant. Hence, the implicit outer range is (3,16]; all bases
##  outside of that are dropped.
p_param <- PileupParam(query_bins=c(3, 7, 12, 16), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt)
t(t(xt) / colSums(xt))


## single-width bins: count each cycle within a read separately.
p_param <- PileupParam(query_bins=seq(1,20), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt[ , c(1:3, 18:19)]) ## fit on one screen
print(t(t(xt) / colSums(xt))[ , c(1:3, 18:19)])

Rsamtools documentation built on Nov. 8, 2020, 8:11 p.m.