scanBam: Import, count, index, filter, sort, and merge 'BAM' (binary...
In Bioconductor/Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import

BamInput

R Documentation

Import, count, index, filter, sort, and merge ‘BAM’ (binary alignment) files.

Description

Import binary ‘BAM’ files into a list structure, with facilities for selecting what fields and which records are imported, and other operations to manipulate BAM files.

Usage


scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat()))

countBam(file, index=file, ..., param=ScanBamParam())

idxstatsBam(file, index=file, ...)

scanBamHeader(files, ...)
## S4 method for signature 'character'
scanBamHeader(files, ...)

asBam(file, destination=sub("\\.sam(\\.gz)?", "", file), ...)
## S4 method for signature 'character'
asBam(file, destination=sub("\\.sam(\\.gz)?", "", file),
    ..., overwrite=FALSE, indexDestination=TRUE)

asSam(file, destination=sub("\\.bam", "", file), ...)
## S4 method for signature 'character'
asSam(file, destination=sub("\\.bam", "", file),
    ..., overwrite=FALSE)

filterBam(file, destination, index=file, ...)
## S4 method for signature 'character'
filterBam(file, destination, index=file, ...,
    filter=FilterRules(), indexDestination=TRUE,
    param=ScanBamParam(what=scanBamWhat()))
    
sortBam(file, destination, ...)
## S4 method for signature 'character'
sortBam(file, destination, ..., byQname=FALSE,
    maxMemory=512, byTag=NULL, nThreads=1L)

indexBam(files, ...)
## S4 method for signature 'character'
indexBam(files, ...)

mergeBam(files, destination, ...)
## S4 method for signature 'character'
mergeBam(files, destination, ..., region = GRanges(),
    overwrite = FALSE, header = character(), byQname = FALSE,
    addRG = FALSE, compressLevel1 = FALSE, indexDestination = FALSE)

Arguments

`file`	The character(1) file name of the ‘BAM’ ('SAM' for `asBam`) file to be processed.
`files`	The character() file names of the ‘BAM’ file to be processed. For `mergeBam`, must satisfy `length(files) >= 2`.
`index`	The character(1) name of the index file of the 'BAM' file being processed; this is given without the '.bai' extension.
`destination`	The character(1) file name of the location where the sorted, filtered, or merged output file will be created. For `asBam` `asSam`, and `sortBam` this is without the “.bam” file suffix.
`region`	A GRanges() instance with `<= 1` elements, specifying the region of the BAM files to merged.
`...`	Additional arguments, passed to methods.
`overwrite`	A logical(1) indicating whether the destination can be over-written if it already exists.
`filter`	A `FilterRules` instance allowing users to filter BAM files based on arbitrary criteria, as described below.
`indexDestination`	A logical(1) indicating whether the created destination file should also be indexed.
`byQname`	A logical(1) indicating whether the sorted destination file should be sorted by Query-name (TRUE) or by mapping position (FALSE).
`header`	A character(1) file path for the header information to be used in the merged BAM file.
`addRG`	A logical(1) indicating whether the file name should be used as RG (read group) tag in the merged BAM file.
`compressLevel1`	A logical(1) indicating whether the merged BAM file should be compressed to zip level 1.
`maxMemory`	A numerical(1) indicating the maximal amount of memory (in MB) that the function is allowed to use.
`byTag`	A character(1) indicating whether the BAM file should be sorted by the supplied tag value.
`nThreads`	An integer(1) indicating the number of threads the function should use.
`param`	An instance of `ScanBamParam`. This influences what fields and which records are imported.

Details

The scanBam function parses binary BAM files; text SAM files can be parsed using R's scan function, especially with arguments what to control the fields that are parsed.

countBam returns a count of records consistent with param. If param is empty then entire file would be counted.

idxstatsBam visit the index in index(file), and quickly returns the number of mapped and unmapped reads on each seqname.

scanBamHeader visits the header information in a BAM file, returning for each file a list containing elements targets and text, as described below. The SAM / BAM specification does not require that the content of the header be consistent with the content of the file, e.g., more targets may be present that are represented by reads in the file. An optional character vector argument containing one or two elements of what=c("targets", "text") can be used to specify which elements of the header are returned.

asBam converts 'SAM' files to 'BAM' files, equivalent to samtools view -Sb file > destination. The 'BAM' file is sorted and an index created on the destination (with extension '.bai') when indexDestination=TRUE.

asSam converts 'BAM' files to 'SAM' files, equivalent to samtools view file > destination.

filterBam parses records in file. Records satisfying the bamWhich bamFlag and bamSimpleCigar criteria of param are accumulated to a default of yieldSize = 1000000 records (change this by specifying yieldSize when creating a BamFile instance; see BamFile-class). These records are then parsed to a DataFrame and made available for further filtering by user-supplied FilterRules. Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param. Each function must return a logical vector equal to the number of rows of the DataFrame. Return values are used to include (when TRUE) corresponding records in the filtered BAM file. The BAM file is created at destination. An index file is created on the destination when indexDestination=TRUE. It is more space- and time-efficient to filter using bamWhich, bamFlag, and bamSimpleCigar, if appropriate, than to supply FilterRules. filter may be a list of FilterRules instances, in which case destination must be a character vector of equal length. The original file is then separately filtered into destination[[i]], using filter[[i]] as the filter criterion.

sortBam sorts the BAM file given as its first argument, analogous to the “samtools sort” function.

indexBam creates an index for each BAM file specified, analogous to the ‘samtools index’ function.

mergeBam merges 2 or more sorted BAM files. As with samtools, the RG (read group) dictionary in the header of the BAM files is not reconstructed.

Details of the ScanBamParam class are provide on its help page; several salient points are reiterated here. ScanBamParam can contain a field what, specifying the components of the BAM records to be returned. Valid values of what are available with scanBamWhat. ScanBamParam can contain an argument which that specifies a subset of reads to return. This requires that the BAM file be indexed, and that the file be named following samtools convention as <bam_filename>.bai. ScanBamParam can contain an argument tag to specify which tags will be extracted.

Value

The scanBam,character-method returns a list of lists. The outer list groups results from each IntegerRanges list of bamWhich(param); the outer list is of length one when bamWhich(param) has length 0. Each inner list contains elements named after scanBamWhat(); elements omitted from bamWhat(param) are removed. The content of non-null elements are as follows, taken from the description in the samtools API documentation:

qname: This is the QNAME field in SAM Spec v1.4. The query name, i.e., identifier, associated with the read.
flag: This is the FLAG field in SAM Spec v1.4. A numeric value summarizing details of the read. See ScanBamParam and the flag argument, and scanBamFlag().
rname: This is the RNAME field in SAM Spec v1.4. The name of the reference to which the read is aligned.
strand: The strand to which the read is aligned.
pos: This is the POS field in SAM Spec v1.4. The genomic coordinate at the start of the alignment. Coordinates are ‘left-most’, i.e., at the 3' end of a read on the '-' strand, and 1-based. The position excludes clipped nucleotides, even though soft-clipped nucleotides are included in seq.
qwidth: The width of the query, as calculated from the cigar encoding; normally equal to the width of the query returned in seq.
mapq: This is the MAPQ field in SAM Spec v1.4. The MAPping Quality.
cigar: This is the CIGAR field in SAM Spec v1.4. The CIGAR string.
mrnm: This is the RNEXT field in SAM Spec v1.4. The reference to which the mate (of a paired end or mate pair read) aligns.
mpos: This is the PNEXT field in SAM Spec v1.4. The position to which the mate aligns.
isize: This is the TLEN field in SAM Spec v1.4. Inferred insert size for paired end alignments.
seq: This is the SEQ field in SAM Spec v1.4. The query sequence, in the 5' to 3' orientation. If aligned to the minus strand, it is the reverse complement of the original sequence.
qual: This is the QUAL field in SAM Spec v1.4. Phred-encoded, phred-scaled base quality score, oriented as seq.
groupid: This is an integer vector of unique group ids returned when asMates=TRUE in a BamFile object. groupid values are used to create the partitioning for a GAlignmentsList object.
mate_status: Returned (always) when asMates=TRUE in a BamFile object. This is a factor indicating status (mated, ambiguous, unmated) of each record.

idxstatsBam returns a data.frame with columns seqnames, seqlength, mapped (number of mapped reads on seqnames) and unmapped (number of unmapped reads).

scanBamHeader returns a list, with one element for each file named in files. The list contains two element. The targets element contains target (reference) sequence lengths. The text element is itself a list with each element a list corresponding to tags (e.g., ‘@SQ’) found in the header, and the associated tag values.

asBam, asSam return the file name of the destination file.

sortBam returns the file name of the sorted file.

indexBam returns the file name of the index file created.

filterBam returns the file name of the destination file created.

Author(s)

Martin Morgan <mtmorgan@fhcrc.org>. Thomas Unterhiner <thomas.unterthiner@students.jku.at> (sortBam).

References

http://samtools.sourceforge.net/

Examples


fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
                  mustWork=TRUE)

##
## scanBam
##

res0 <- scanBam(fl)[[1]] # always list-of-lists
names(res0)
length(res0[["qname"]])
lapply(res0, head, 3)
table(width(res0[["seq"]])) # query widths
table(res0[["qwidth"]], useNA="always") # query widths derived from cigar
table(res0[["cigar"]], useNA="always")
table(res0[["strand"]], useNA="always")
table(res0[["flag"]], useNA="always")

which <- IRangesList(seq1=IRanges(1000, 2000),
                     seq2=IRanges(c(100, 1000), c(1000, 2000)))
p1 <- ScanBamParam(which=which, what=scanBamWhat())
res1 <- scanBam(fl, param=p1)
names(res1)
names(res1[[2]])

p2 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth"))
res2 <- scanBam(fl, param=p2)
                
p3 <- ScanBamParam(
    what="flag",           # information to query from BAM file 
    flag=scanBamFlag(isMinusStrand=FALSE))
length(scanBam(fl, param=p3)[[1]]$flag)

##
## idxstatsBam
##

idxstatsBam(fl)

##
## filterBam
##

param <- ScanBamParam(
   flag=scanBamFlag(isUnmappedQuery=FALSE),
   what="seq")
dest <- filterBam(fl, tempfile(), param=param)
countBam(dest)  ## 3271 records

## filter to a single file
filter <- FilterRules(list(MinWidth = function(x) width(x$seq) > 35))
dest <- filterBam(fl, tempfile(), param=param, filter=filter)
countBam(dest)  ## 398 records
res3 <- scanBam(dest, param=ScanBamParam(what="seq"))[[1]]
table(width(res3$seq))

## filter 1 file to 2 destinations
filters <- list(
    FilterRules(list(long=function(x) width(x$seq) > 35)),
    FilterRules(list(short=function(x) width(x$seq) <= 35))
)
destinations <- replicate(2, tempfile())
dest <- filterBam(fl, destinations, param=param, filter=filters)
lapply(dest, countBam)

##
## sortBam
##

sorted <- sortBam(fl, tempfile())

##
## scanBamParam re-orders 'which'; recover original order
##

gwhich <- as(which, "GRanges")[c(2, 1, 3)]    # example data
cnt <- countBam(fl, param=ScanBamParam(which=gwhich))
reorderIdx <- unlist(split(seq_along(gwhich), seqnames(gwhich)))
cnt
cnt[reorderIdx,]

Bioconductor/Rsamtools documentation built on June 9, 2025, 9:46 a.m.