windowCounts: Count reads overlapping each window

View source: R/windowCounts.R

windowCountsR Documentation

Count reads overlapping each window

Description

Count the number of extended reads overlapping a sliding window at spaced positions across the genome.

Usage

windowCounts(bam.files, spacing=50, width=spacing, ext=100, shift=0,
	filter=10, bin=FALSE, param=readParam(), BPPARAM=SerialParam())

Arguments

bam.files

A character vector containing paths to sorted and indexed BAM files. Alternatively, a list of BamFile objects.

spacing

An integer scalar specifying the distance between consecutive windows.

width

An integer scalar specifying the width of the window.

ext

An integer scalar or a list of two integer scalars/vectors, containing the average length(s) of the sequenced fragments in each library.

shift

An integer scalar specifying how much the start of each window should be shifted to the left.

filter

An integer scalar for the minimum count sum across libraries for each window.

bin

A logical scalar indicating whether binning should be performed.

param

A readParam object containing read extraction parameters.

BPPARAM

A BiocParallelParam specifying how parallelization is to be performed across files.

Value

A RangedSummarizedExperiment object is returned containing one integer matrix. Each entry of the matrix contains the count for each library (column) at each window (row). The coordinates of each window are stored as the rowRanges. The total number of reads in each library are stored as totals in the colData, along with the read (rlen) and extension lengths (ext) for each library. Other window counting parameters (e.g., spacing, width, param) are stored in the metadata.

Defining the sliding windows

A window is defined as a genomic interval of size equal to width. The value of width can be interpreted as the width of the contact area between the DNA and protein. In practical terms, it determines the spatial resolution of the analysis. Larger windows count reads over a larger region which results in larger counts. This results in greater detection power at the cost of resolution.

By default, the first window on a chromosome starts at base position 1. This can be shifted to the left by specifying an appropriate value for shift. New windows are defined by sliding the current window to the right by the specified spacing. Increasing spacing will reduce the frequency at which counts are extracted from the genome. This results in some loss of resolution but it may be necessary when machine memory is limited.

If bin is set, settings are internally adjusted so that all reads are counted into non-overlapping adjacent bins of size width. Specifically, spacing is set to width and filter is capped at a maximum value of 1 (empty bins can be retained with filter=0). Only the 5' end of each read or the midpoint of each fragment (for paired-end data) is used in counting.

Read extraction and counting

Read extraction from the BAM files is governed by the param argument. This specifies whether reads are to be read in single- or paired-end mode, whether to apply a threshold to the mapping quality, and so on – see ?readParam for details. The strandedness of the output rowRanges is set based on the strand(s) from which the reads are extracted and counted. This is determined by the value of the forward slot in the input param object.

Fragments are inferred from reads by directional extension in single-end data (see below) or by identifying proper pairs in paired-end data (see readParam and getPESizes for more details). The number of fragments overlapping the window for each library is then counted for each window position. Windows will be removed if the count sum across all libraries is below filter. This reduces the memory footprint of the output by not returning empty or near-empty windows, which are usually uninteresting anyway.

Elaborating on directional extension

For single-end reads, directional extension is performed whereby each read is extended from its 3' end to the average fragment length, i.e., ext. This obtains a rough estimate of the interval of the fragment from which the read was derived. It is particularly useful for TF data, where extension specifically increases the coverage of peaks that exhibit strand bimodality. No extension is performed if ext is set to NA, such that the read length is used as the fragment length in that library.

If libraries have different fragment lengths, this can be accommodated by supplying a list of 2 elements to ext. The first element (named init.ext here, for convenience) should be an integer vector specifying the extension length for each library. The second element (final.ext) should be an integer scalar specifying the final fragment length. All reads are directionally extended by init.ext, and the resulting fragment is resized to final.ext by shrinking or expanding from the fragment midpoint. For a bimodal peak, scaling effectively aligns the subpeaks on a given strand across all libraries to a common location. This removes the most obvious differences in widths.

If any element of init.ext is NA, no extension is performed for the corresponding library. If final.ext is set to NA, no rescaling is performed from the library-specific fragment lengths. Values of init.ext are stored as the ext field in the colData of the output object, while final.ext is stored in the metadata.

Comments on ext for paired-end data

Directional extension is not performed for paired-end data, so the values in ext are not used directly. Hwoever, rescaling can still be performed to standardize fragment lengths across libraries by resizing each fragment from its midpoint. This will use the second element of ext as final.ext, if ext is specified as a list of length 2.

On a similar note, some downstream functions will use the extension length in the output colData as the average fragment length. Thus, to maintain compatibility, the ext field in colData is set to the average of the inferred fragment lengths for valid read pairs. These values will not be used in windowCounts, but instead, in functions like getWidths.

Author(s)

Aaron Lun

See Also

correlateReads, readParam, getPESizes

Examples

# A low filter is only used here as the examples have very few reads.
bamFiles <- system.file("exdata", c("rep1.bam", "rep2.bam"), package="csaw")
windowCounts(bamFiles, filter=1)
windowCounts(bamFiles, width=100, filter=1)

# Multiple extension lengths.
windowCounts(bamFiles, ext=list(c(50, 100), NA), filter=1)
windowCounts(bamFiles, ext=list(c(50, 100), 80), filter=1)

# Loading PE data.
bamFile <- system.file("exdata", "pet.bam", package="csaw")
windowCounts(bamFile, param=readParam(pe="both"), filter=1)
windowCounts(bamFile, param=readParam(pe="first"), filter=1)
windowCounts(bamFile, param=readParam(max.frag=100, pe="both"), filter=1)
windowCounts(bamFile, param=readParam(max.frag=100, pe="both", restrict="chrA"), filter=1)

LTLA/csaw documentation built on Dec. 11, 2023, 5:11 a.m.