pileLettersAt: Pile the letters of a set of aligned reads on top of a set of...

View source: R/pileLettersAt.R

pileLettersAtR Documentation

Pile the letters of a set of aligned reads on top of a set of genomic positions

Description

pileLettersAt extracts the letters/nucleotides of a set of reads that align to a set of genomic positions of interest. The extracted letters are returned as "piles of letters" (one per genomic position of interest) stored in an XStringSet (typically DNAStringSet) object.

Usage

pileLettersAt(x, seqnames, pos, cigar, at)

Arguments

x

An XStringSet (typically DNAStringSet) object containing N unaligned read sequences (a.k.a. the query sequences) reported with respect to the + strand.

seqnames

A factor-Rle parallel to x. For each i, seqnames[i] must be the name of the reference sequence of the i-th alignment.

pos

An integer vector parallel to x. For each i, pos[i] must be the 1-based position on the reference sequence of the first aligned letter in x[[i]].

cigar

A character vector parallel to x. Contains the extended CIGAR strings of the alignments.

at

A GPos object containing the genomic positions of interest. seqlevels(at) must be identical to levels(seqnames).

If at is not a GPos object, pileLettersAt will first try to turn it into one by calling the GPos() constructor function on it. So for example at can be a GRanges object (or any other GenomicRanges derivative), and, in that case, each range in it will be interpreted as a run of adjacent genomic positions. See ?GPos in the GenomicRanges package for more information.

Details

x, seqnames, pos, cigar must be 4 parallel vectors describing N aligned reads.

Value

An XStringSet (typically DNAStringSet) object parallel to at (i.e. with 1 string per genomic position).

Author(s)

Hervé Pagès

See Also

  • The pileup and applyPileups functions defined in the Rsamtools package, as well as the SAMtools mpileup command (available at http://samtools.sourceforge.net/ as part of the SAMtools project), for more powerful flexible alternatives.

  • The stackStringsFromGAlignments function for stacking the read sequences (or their quality strings) stored in a GAlignments object or a BAM file.

  • DNAStringSet objects in the Biostrings package.

  • GPos objects in the GenomicRanges package.

  • GAlignments objects.

  • cigar-utils for the CIGAR utility functions used internally by pileLettersAt.

Examples

## Input

##   - A BAM file:
bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
seqinfo(bamfile)  # to see the seqlevels and seqlengths
stackStringsFromBam(bamfile, param="seq1:1-21")  # a quick look at
                                                 # the reads

##   - A GPos object containing Genomic Positions Of Interest:
my_GPOI <- GPos(c("seq1:1-5", "seq1:21-21", "seq1:1575-1575",
                  "seq2:1513-1514"))

## Some preliminary massage on 'my_GPOI'

seqinfo(my_GPOI) <- merge(seqinfo(my_GPOI), seqinfo(bamfile))
seqlevels(my_GPOI) <- seqlevelsInUse(my_GPOI)

## Load the BAM file in a GAlignments object. Note that we load only
## the reads aligned to the sequences in 'seqlevels(my_GPOI)'. Also,
## in order to be consistent with applyPileups() and SAMtools (m)pileup,
## we filter out the following BAM records:
##   - secondary alignments (flag bit 0x100);
##   - reads not passing quality controls (flag bit 0x200);
##   - PCR or optical duplicates (flag bit 0x400).
## See ?ScanBamParam and the SAM Spec for more information. 

which <- as(seqinfo(my_GPOI), "GRanges")
flag <- scanBamFlag(isSecondaryAlignment=FALSE,
                    isNotPassingQualityControls=FALSE,
                    isDuplicate=FALSE)
what <- c("seq", "qual")
param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which)
gal <- readGAlignments(bamfile, param=param)
seqlevels(gal) <- seqlevels(my_GPOI) 

## Extract the read sequences (a.k.a. query sequences) and quality
## strings. Both are reported with respect to the + strand.

qseq <- mcols(gal)$seq
qual <- mcols(gal)$qual

nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
                            my_GPOI)
qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal),
                            my_GPOI)
mcols(my_GPOI)$nucl_piles <- nucl_piles
mcols(my_GPOI)$qual_piles <- qual_piles
my_GPOI 

## Finally, to summarize A/C/G/T frequencies at each position:
alphabetFrequency(nucl_piles, baseOnly=TRUE)

## Note that the pileup() function defined in the Rsamtools package
## can be used to obtain a similar result:
scanbam_param <- ScanBamParam(flag=flag, which=my_GPOI)
pileup_param <- PileupParam(max_depth=5000,
                            min_base_quality=0,
                            distinguish_strands=FALSE)
pileup(bamfile, scanBamParam=scanbam_param, pileupParam=pileup_param)

Bioconductor/GenomicAlignments documentation built on March 28, 2024, 9:59 a.m.