Maintain and use BAM files

Description

Use BamFile() to create a reference to a BAM file (and optionally its index). The reference remains open across calls to methods, avoiding costly index re-loading.

BamFileList() provides a convenient way of managing a list of BamFile instances.

Usage

## Constructors

BamFile(file, index=file, ..., yieldSize=NA_integer_, obeyQname=FALSE,
        asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA)
BamFileList(..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE,
            qnamePrefixEnd=NA, qnameSuffixStart=NA)

## Opening / closing

## S3 method for class 'BamFile'
open(con, ...)
## S3 method for class 'BamFile'
close(con, ...)

## accessors; also path(), index(), yieldSize()

## S4 method for signature 'BamFile'
isOpen(con, rw="")
## S4 method for signature 'BamFile'
isIncomplete(con)
## S4 method for signature 'BamFile'
obeyQname(object, ...)
obeyQname(object, ...) <- value
## S4 method for signature 'BamFile'
asMates(object, ...)
asMates(object, ...) <- value
## S4 method for signature 'BamFile'
qnamePrefixEnd(object, ...)
qnamePrefixEnd(object, ...) <- value
## S4 method for signature 'BamFile'
qnameSuffixStart(object, ...)
qnameSuffixStart(object, ...) <- value

## actions

## S4 method for signature 'BamFile'
scanBamHeader(files, ..., what=c("targets", "text"))
## S4 method for signature 'BamFile'
seqinfo(x)
## S4 method for signature 'BamFileList'
seqinfo(x)
## S4 method for signature 'BamFile'
filterBam(file, destination, index=file, ...,
    filter=FilterRules(), indexDestination=TRUE,
    param=ScanBamParam(what=scanBamWhat()))
## S4 method for signature 'BamFile'
indexBam(files, ...)
## S4 method for signature 'BamFile'
sortBam(file, destination, ..., byQname=FALSE, maxMemory=512)
## S4 method for signature 'BamFileList'
mergeBam(files, destination, ...)

## reading

## S4 method for signature 'BamFile'
scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat()))

## counting

## S4 method for signature 'BamFile'
idxstatsBam(file, index=file, ...)
## S4 method for signature 'BamFile'
countBam(file, index=file, ..., param=ScanBamParam())
## S4 method for signature 'BamFileList'
countBam(file, index=file, ..., param=ScanBamParam())
## S4 method for signature 'BamFile'
quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)

Arguments

...

Additional arguments.

For BamFileList, this can either be a single character vector of paths to BAM files, or several instances of BamFile objects. When a character vector of paths, a second named argument ‘index’ can be a character() vector of length equal to the first argument specifying the paths to the index files, or character() to indicate that no index file is available. See BamFile.

con

An instance of BamFile.

x, object, file, files

A character vector of BAM file paths (for BamFile) or a BamFile instance (for other methods).

index

character(1); the BAM index file path (for BamFile); ignored for all other methods on this page.

yieldSize

Number of records to yield each time the file is read from with scanBam. See ‘Fields’ section for details.

asMates

Logical indicating if records should be paired as mates. See ‘Fields’ section for details.

qnamePrefixEnd

Single character (or NA) marking the end of the qname prefix. When specified, all characters prior to and including the qnamePrefixEnd are removed from the qname. If the prefix is not found in the qname, the qname is not trimmed. Currently only implemented for mate-pairing (i.e., when asMates=TRUE in a BamFile).

qnameSuffixStart

Single character (or NA) marking the start of the qname suffix. When specified, all characters following and including the qnameSuffixStart are removed from the qname. If the suffix is not found in the qname, the qname is not trimmed. Currently only implemented for mate-pairing (i.e., when asMates=TRUE in a BamFile).
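
The trimming rules for qnamePrefixEnd and qnameSuffixStart can be mimicked in base R. The sketch below is illustrative only and is not the package's implementation: trim_qname is a hypothetical helper, the qnames are made up, and it assumes that the first occurrence of each marker character is the one that counts.

```r
## Hypothetical helper illustrating the documented trimming rules;
## Rsamtools applies the real trimming internally during mate-pairing.
trim_qname <- function(qname, prefixEnd = NA, suffixStart = NA) {
    if (!is.na(prefixEnd)) {
        ## drop everything up to and including the first 'prefixEnd'
        hit <- as.integer(regexpr(prefixEnd, qname, fixed = TRUE))
        found <- hit > 0L
        qname[found] <- substring(qname[found], hit[found] + 1L)
    }
    if (!is.na(suffixStart)) {
        ## drop the first 'suffixStart' and everything after it
        hit <- as.integer(regexpr(suffixStart, qname, fixed = TRUE))
        found <- hit > 0L
        qname[found] <- substring(qname[found], 1L, hit[found] - 1L)
    }
    qname                      # qnames without the marker are unchanged
}

## Made-up qnames: the two mates share a name only after trimming.
qn <- c("run7/read42.1", "run7/read42.2", "read99")
trimmed <- trim_qname(qn, prefixEnd = "/", suffixStart = ".")
trimmed
```

The third qname contains neither marker and is returned untouched, matching the "not found, not trimmed" rule.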

obeyQname

Logical indicating if the BAM file is sorted by qname. In Bioconductor > 2.12 paired-end files do not need to be sorted by qname. Instead use asMates=TRUE for reading paired-end data. See ‘Fields’ section for details.

value

Logical value for setting asMates and obeyQname in a BamFile instance.

what

For scanBamHeader, a character vector specifying that either or both of c("targets", "text") are to be extracted from the header; see scanBam for additional detail.

filter

A FilterRules instance. Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param. Each function must return a logical vector, usually of length equal to the number of rows of the DataFrame. Return values are used to include (when TRUE) corresponding records in the filtered BAM file.
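
As a minimal sketch of such a filter, assuming Rsamtools and its example file ex1.bam are installed, the rule below keeps records whose read sequence is longer than 35 bases. The rule name MinWidth and the threshold are illustrative choices, not defaults.

```r
library(Rsamtools)

fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)

## 'x' is the DataFrame of fields requested through 'param' (here only
## "seq"); the function returns one logical value per record.
filter <- FilterRules(list(MinWidth = function(x) width(x$seq) > 35))

dest <- filterBam(BamFile(fl), tempfile(fileext=".bam"),
                  param=ScanBamParam(what="seq"), filter=filter)

## compare record counts before and after filtering
n_all <- countBam(fl)$records
n_kept <- countBam(dest)$records
c(all=n_all, kept=n_kept)
```

Only the fields named in param are available inside the filter function, so a rule that inspects x$seq needs what="seq" (or a superset) in the ScanBamParam.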

destination

character(1) file path to write filtered reads to.

indexDestination

logical(1) indicating whether the destination file should also be indexed.

byQname, maxMemory

See sortBam.

param

An optional ScanBamParam instance to further influence scanning, counting, or filtering.

rw

Mode of file; ignored.

main.groups.only

See quickBamFlagSummary.

Objects from the Class

Objects are created by calls of the form BamFile().

Fields

The BamFile class inherits fields from the RsamtoolsFile class and has fields:

yieldSize:

Number of records to yield each time the file is read from using scanBam or, when length(bamWhich()) != 0, a threshold which yields records in complete ranges whose sum first exceeds yieldSize. Setting yieldSize on a BamFileList does not alter existing yield sizes set on the individual BamFile instances.

asMates:

A logical indicating if the records should be returned as mated pairs. When TRUE scanBam attempts to mate (pair) the records and returns two additional fields groupid and mate_status. groupid is an integer vector of unique group ids; mate_status is a factor with level mated for records successfully paired by the algorithm, ambiguous for records that are possibly mates but cannot be assigned unambiguously, or unmated for reads that did not have valid mates.

Mate criteria:

  • Bit 0x40 and 0x80: Segments are a pair of first/last OR neither segment is marked first/last

  • Bit 0x100: Both segments are secondary OR both not secondary

  • Bit 0x10 and 0x20: Segments are on opposite strands

  • mpos match: segment1 mpos matches segment2 pos AND segment2 mpos matches segment1 pos

  • tid match

Flags, tags and ranges may be specified in the ScanBamParam for fine tuning of results.
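
The criteria above can be expressed as a small base R predicate. This is a sketch of the documented rules, not the code Rsamtools runs: is_mate_candidate is a hypothetical helper, the records are made-up lists, and the strand rule is read as "the 0x10 bits differ and each record's 0x20 (mate reverse) bit agrees with the other record's 0x10 bit".

```r
## Hypothetical helper: each record is a list with integer fields
## flag, pos, mpos and tid.
is_mate_candidate <- function(a, b) {
    has <- function(rec, bit) bitwAnd(rec$flag, bit) != 0L

    ## 0x40 / 0x80: first/last pair, or neither segment marked
    a1 <- has(a, 0x40L); a2 <- has(a, 0x80L)
    b1 <- has(b, 0x40L); b2 <- has(b, 0x80L)
    first_last <- (a1 && b2) || (a2 && b1) || !(a1 || a2 || b1 || b2)

    ## 0x100: both secondary or both not secondary
    secondary <- has(a, 0x100L) == has(b, 0x100L)

    ## 0x10 / 0x20: opposite strands (assumed reading, see lead-in)
    strands <- has(a, 0x10L) != has(b, 0x10L) &&
        has(a, 0x20L) == has(b, 0x10L) &&
        has(b, 0x20L) == has(a, 0x10L)

    ## mpos and tid must match
    positions <- a$mpos == b$pos && b$mpos == a$pos
    same_tid <- a$tid == b$tid

    first_last && secondary && strands && positions && same_tid
}

r1 <- list(flag = 99L,  pos = 100L, mpos = 250L, tid = 0L)  # first, forward
r2 <- list(flag = 147L, pos = 250L, mpos = 100L, tid = 0L)  # last, reverse
r3 <- list(flag = 163L, pos = 250L, mpos = 100L, tid = 0L)  # last, forward

is_mate_candidate(r1, r2)
is_mate_candidate(r1, r3)
```

Here r1 and r2 satisfy every criterion, while r1 and r3 fail the strand test because both lie on the forward strand.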

obeyQname:

A logical(1) indicating if the file was sorted by qname. In Bioconductor > 2.12 paired-end files do not need to be sorted by qname. Instead set asMates=TRUE in the BamFile when using the readGAlignmentsList function from the GenomicAlignments package.

Functions and methods

BamFileList inherits additional methods from RsamtoolsFileList and SimpleList.

Opening / closing:

open.BamFile

Opens the (local or remote) path and, if bamIndex is not character(0), the index file. Returns a BamFile instance.

close.BamFile

Closes the BamFile con, returning (invisibly) the updated BamFile. The instance may be re-opened with open.BamFile.

isOpen

Tests whether the BamFile con has been opened for reading.

isIncomplete

Tests whether the BamFile con is neither closed nor at the end of the file.

Accessors:

path

Returns a character(1) vector of BAM path names.

index

Returns a character(0) or character(1) vector of BAM index path names.

yieldSize, yieldSize<-

Return or set an integer(1) vector indicating yield size.

obeyQname, obeyQname<-

Return or set a logical(1) indicating if the file was sorted by qname.

asMates, asMates<-

Return or set a logical(1) indicating if the records should be returned as mated pairs.

Methods:

scanBamHeader

Visit the path in path(file), returning the information contained in the file header; see scanBamHeader.

seqinfo, seqnames, seqlengths

Visit the path in path(file), returning a Seqinfo, character, or named integer vector containing information on the names and / or lengths of each sequence. Seqnames are ordered as they appear in the file.

scanBam

Visit the path in path(file), returning the result of scanBam applied to the specified path.

countBam

Visit the path(s) in path(file), returning the result of countBam applied to the specified path.

idxstatsBam

Visit the index in index(file), quickly returning a data.frame with columns seqnames, seqlength, mapped (number of mapped reads on seqnames) and unmapped (number of unmapped reads).
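
A short usage sketch for idxstatsBam, assuming Rsamtools is installed and that its example file ex1.bam ships with an index alongside it:

```r
library(Rsamtools)

fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
bf <- BamFile(fl)            # index path defaults to the BAM file path

## per-sequence read counts taken from the index
stats <- idxstatsBam(bf)
stats

## total number of mapped reads across all sequences
sum(stats$mapped)
```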

filterBam

Visit the path in path(file), returning the result of filterBam applied to the specified path. A single file can be filtered to one or several destinations, as described in filterBam.

indexBam

Visit the path in path(file), returning the result of indexBam applied to the specified path.

sortBam

Visit the path in path(file), returning the result of sortBam applied to the specified path.

mergeBam

Merge several BAM files into a single BAM file. See mergeBam for details; additional arguments supported by mergeBam,character-method are also available for BamFileList.

show

Compactly display the object.

Author(s)

Martin Morgan and Marc Carlson

See Also

  • The readGAlignments, readGAlignmentPairs, and readGAlignmentsList functions defined in the GenomicAlignments package.

  • summarizeOverlaps and findSpliceOverlaps-methods in the GenomicAlignments package for methods that work on BamFile and BamFileList objects.

Examples

##
## BamFile options.
##

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bf <- BamFile(fl)
bf

## When 'asMates=TRUE' scanBam() reads the data in as
## pairs. See 'asMates' above for details of the pairing
## algorithm.
asMates(bf) <- TRUE

## When 'yieldSize' is set, scanBam() will iterate
## through the file in chunks.
yieldSize(bf) <- 500 

## Some applications append a filename (e.g., NCBI Sequence Read 
## Archive (SRA) toolkit) or allele identifier to the sequence qname.
## This may result in a unique qname for each record which presents a
## problem when mating paired-end reads (identical qnames are one
## criterion for paired-end mating). 'qnamePrefixEnd' and
## 'qnameSuffixStart' can be used to trim an unwanted prefix or suffix.
qnamePrefixEnd(bf) <- "/"
qnameSuffixStart(bf) <- "." 

##
## Reading Bam files.
##

fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
                  mustWork=TRUE)
(bf <- BamFile(fl))
head(seqlengths(bf))                    # sequences and lengths in BAM file

if (require(RNAseqData.HNRNPC.bam.chr14)) {
    bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
    bfl
    bfl[1:2]                            # subset
    bfl[[1]]                            # select first element -- BamFile
    ## merged across BAM files
    seqinfo(bfl)
    head(seqlengths(bfl))
}


length(scanBam(fl)[[1]][[1]])  # all records

bf <- open(BamFile(fl))        # implicit index
bf
identical(scanBam(bf), scanBam(fl))
close(bf)

## Use 'yieldSize' to iterate through a file in chunks.
bf <- open(BamFile(fl, yieldSize=1000)) 
while (nrec <- length(scanBam(bf)[[1]][[1]]))
    cat("records:", nrec, "\n")
close(bf)

## Repeatedly visit multiple ranges in the BamFile. 
rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
bf <- open(BamFile(fl))
sapply(seq_len(length(rng)), function(i, bamFile, rng) {
    param <- ScanBamParam(which=rng[i], what="seq")
    bam <- scanBam(bamFile, param=param)[[1]]
    alphabetFrequency(bam[["seq"]], baseOnly=TRUE, collapse=TRUE)
}, bf, rng)
close(bf)
