library("Rsamtools")
Many users will find that the r Biocpkg("GenomicAlignments")
package provides
a more useful representation of BAM
files in R; the
r Biocpkg("GenomicFiles")
package is also useful for iterating through BAM
files.
The r Biocpkg("Rsamtools")
package provides an interface to BAM
files. BAM
files are produced by samtools and other software, and represent a flexible
format for storing 'short' reads aligned to reference genomes. BAM
files
typically contain sequence and base qualities, and alignment coordinates
and quality measures. BAM
files are appealing for several reasons. The
format is flexible enough to represent reads generated and aligned using
diverse technologies. The files are binary so that file access is
relatively efficient. BAM
files can be indexed, allowing ready access
to localized chromosomal regions. BAM
files can be accessed remotely,
provided the remote hosting site supports such access and a local index
is available. This means that specific regions of remote files can be
accessed without retrieving the entire (large!) file. A full description
is available in the BAM
format specification
(http://samtools.sourceforge.net/SAM1.pdf)
The main purpose of the r Biocpkg("Rsamtools")
is to import BAM
files into
R.r Biocpkg("Rsamtools")
also provides some facility for file access such as
record counting, index file creation, and filtering to create new files
containing subsets of the original. An important use case for
r Biocpkg("Rsamtools")
is as a starting point for creating objects suitable
for a diversity of work flows, e.g., AlignedRead objects in the
r Biocpkg("ShortRead")
package (for quality assessment and read manipulation),
or GAlignments objects in r Biocpkg("GenomicAlignments")
package (for RNA-seq and other applications). Those desiring more functionality
are encouraged to explore samtools and related software efforts.
ScanBamParam
The essential capability provided by r Biocpkg("Rsamtools")
is BAM
input.
This is accomplished with the scanBam
function. scanBam
takes as input the
name of the BAM
file to be parsed. In addition, the param
argument
determines which genomic coordinates of the BAM
file, and what components of
each record, will be input. Rparam is an instance of the ScanBamParam class.
To create a param object, call ScanBamParam
. Here we create a param
object
to extract reads aligned to three distinct ranges (one on seq1
, two on
seq2
). From each of read in those ranges, we specify that we would like to
extract the reference name (rname
, e.g., seq1
), strand, alignment position,
query (i.e., read) width, and query sequence:
which <- GRanges(c( "seq1:1000-2000", "seq2:100-1000", "seq2:1000-2000" )) ## equivalent: ## GRanges( ## seqnames = c("seq1", "seq2", "seq2"), ## ranges = IRanges( ## start = c(1000, 100, 1000), ## end = c(2000, 1000, 2000) ## ) ## ) what <- c("rname", "strand", "pos", "qwidth", "seq") param <- ScanBamParam(which=which, what=what)
Additional information can be found on the help page for ScanBamParam
. Reading
the relevant records from the BAM
file is accomplished with
bamFile <- system.file("extdata", "ex1.bam", package="Rsamtools") bam <- scanBam(bamFile, param=param)
Like scan
, scanBam
returns a list
of values. Each element of the list
corresponds to a range specified by the which
argument to ScanBamParam
.
class(bam) names(bam)
Each element is itself a list, containing the elements
specified by the what
and tag
arguments to ScanBamParam
.
class(bam[[1]]) names(bam[[1]])
The elements are either basic R or IRanges data types
sapply(bam[[1]], class)
A paradigm for collapsing the list-of-lists into a single list is
.unlist <- function (x) { ## do.call(c, ...) coerces factor to integer, which is undesired x1 <- x[[1L]] if (is.factor(x1)) { structure(unlist(x), class = "factor", levels = levels(x1)) } else { do.call(c, x) } } bam <- unname(bam) # names not useful in unlisted result elts <- setNames(bamWhat(param), bamWhat(param)) lst <- lapply(elts, function(elt) .unlist(lapply(bam, "[[", elt)))
This might be further transformed, e.g., to a DataFrame, with
head(do.call("DataFrame", lst))
Often, an alternative is to use a ScanBamParam object with desired fields
specified in what
as an argument to GenomicAlignments::readGAlignments
; the
specified fields are added as columns to the returned GAlignments .
BAM
index filesThe BAM
file in the previous example includes an index, represented by
a separate file with extension .bai
:
list.files(dirname(bamFile), pattern="ex1.bam(.bai)?")
Indexing provides two significant benefits. First, an index allows a BAM
file
to be efficiently accessed by range. A corollary is that providing a which
argument to scanBamPram
requires an index. Second, coordinates for extracting
information from a BAM
file can be derived from the index, so a portion of a
remote BAM
file can be retrieved with local access only to the index. For
instance, provided an index file exists on the local computer, it is possible to
retrieve a small portion of a BAM
file residing on the 1000 genomes HTTP
server. The url
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam
points to the BAM
file corresponding to individual NA19240 chromosome 6 Solexa
(Illumina) sequences aligned using MAQ. The remote file is very large (about 10
GB), but the corresponding index file is small (about 500 KB). With na19240url
set to the above address, the following retrieves just those reads in the
specified range
which <- GRanges("6:100000-110000") param <- ScanBamParam(which=which, what=scanBamWhat()) na19240bam <- scanBam(na19240url, param=param)
Invoking scanBam
without an index file, as above, first retrieves the index
file from the remote location, and then queries the remote file using the index;
for repeated queries, it is more efficient to retrieve the index file first
(e.g., with download.file
) and then use the local index as an argument to
scanBam
. Many BAM
files were created in a way that causes scanBam
to
report that the "EOF marker is absent"; this message can safely be ignored.
BAM
filesBAM
files may be read by functions in packages other than
r Biocpkg("Rsamtools")
, in particular the readGAlignments
family of
functions in r Biocpkg("GenomicAlignments")
.
Additional ways of interacting with BAM
files include scanBamHeader
(to
extract header information) and countBam
(to count records matching param
).
filterBam
filters reads from the source file according to the criteria of the
ScanBamParam parameter, writing reads passing the filter to a new file. The
function sorts a previously unsorted BAM
, while The function indexBam
creates an index file from a sorted BAM
file. readPileup
reads a pileup
file created by , importing SNP, indel, or all variants into a GRanges object.
BAM
files can be large, containing more information on more genomic regions
than are of immediate interest or than can fit in memory. The first strategy for
dealing with this is to select, using the what
and which
arguments to
scanBamParam
, just those portions of the BAM
file that are essential to the
current analysis, e.g., specifying what=c('rname', 'qname', 'pos')
when
wishing to calculate coverage of ungapped reads.
When selective input of BAM
files is still too memory-intensive, the file can
be processed in chunks, with each chunk distilled to the derived information of
interest. Chromosomes will often be the natural chunk to process. For instance,
here we write a summary function that takes a single sequence name (chromosome)
as input, reads in specific information from the BAM
file, and calculates
coverage over that sequence.
summaryFunction <- function(seqname, bamFile, ...) { param <- ScanBamParam( what=c('pos', 'qwidth'), which=GRanges(seqname, IRanges(1, 1e7)), flag=scanBamFlag(isUnmappedQuery=FALSE) ) x <- scanBam(bamFile, ..., param=param)[[1]] coverage(IRanges(x[["pos"]], width=x[["qwidth"]])) }
This might be used as follows; it is an ideal candidate for evaluation in
parallel, e.g., using the parallel package and srapply
function in
r Biocpkg("ShortRead")
.
seqnames <- paste("seq", 1:2, sep="") cvg <- lapply(seqnames, summaryFunction, bamFile) names(cvg) <- seqnames cvg
The result of the function (a coverage vector, in this case) will often be much
smaller than the input.
The r Biocpkg("GenomicFiles")
package implements strategies for iterating
through BAM
and other files, including in parallel.
The functions described in the previous section import data in to R. However,
sequence data can be very large, and it does not always make sense to read the
data in immediately. Instead, it can be useful to marshal references to the
data into a container and then act on components of the container. The
BamViews class provides a mechanism for creating 'views' into a set of BAM
files. The view itself is light-weight, containing references to the relevant
BAM
files and metadata about the view (e.g., the phenotypic samples
corresponding to each BAM
file).
One way of understanding a instance is as a rectangular data structure.
The columns represent BAM
files (e.g., distinct samples). The rows
represent ranges (i.e., genomic coordinates). For instance, a ChIP-seq
experiment might identify a number of peaks of high read counts.
To illustrate, suppose we have an interest in caffeine metabolism in humans. The 'rows' contain coordinates of genomic regions associated with genes in a KEGG caffeine metabolism pathway. The 'columns' represent individuals in the 1000 genomes project.
To create the 'rows', we identify possible genes that KEGG associates
with caffeine metabolism. Using the r Biocpkg("KEGGREST")
package,
the steps are
## uses KEGGREST, dplyr, and tibble packages org <- "hsa" caffeine_pathway <- KEGGREST::keggList("pathway", org) tibble::enframe(name = "pathway_id", value = "pathway") dplyr::filter(startsWith(.data$pathway, "Caffeine metabolism")) egid <- KEGGREST::keggLink(org, "pathway") %>% tibble::enframe(name = "pathway_id", value = "gene_id") dplyr::left_join(x = caffeine_pathway, by = "pathway_id") dplyr::mutate(gene_id = sub("hsa:", "", gene_id)) pull(gene_id)
At the time of writing, genes in the caffeine metabolism pathway are
egid <- c("10", "1544", "1548", "1549", "7498", "9")
Then we use the appropriate r Biocpkg("TxDb")
package to translate Entrez
identifiers to obtain ranges of interest (one could also use
r Biocpkg("biomaRt")
to retrieve coordinates for non-model organisms, perhaps
making a TxDb
object as outlined in the
r Biocpkg("GenomicFeatures")
vignette). We'll find that the names used for
chromosomes in the alignments differ from those used at
Ensembl, so seqlevels<-
is used to map between naming schemes and to drop
unused levels.
library(TxDb.Hsapiens.UCSC.hg18.knownGene) bamRanges <- transcripts( TxDb.Hsapiens.UCSC.hg18.knownGene, filter=list(gene_id=egid) ) seqlevels(bamRanges) <- # translate seqlevels sub("chr", "", seqlevels(bamRanges)) lvls <- seqlevels(bamRanges) # drop unused levels seqlevels(bamRanges) <- lvls[lvls %in% as.character(seqnames(bamRanges))] bamRanges
The bamRanges
'knows' the genome for which the ranges are
defined
unique(genome(bamRanges))
Here we retrieve a vector of BAM
file URLs (slxMaq09
)
from the package.
slxMaq09 <- local({ fl <- system.file("extdata", "slxMaq09_urls.txt", package="Rsamtools") readLines(fl) })
We now assemble the BamViews instance from these objects; we
also provide information to annotated the BAM
files (with the
bamSamples
function argument, which is a DataFrame
instance with each row corresponding to a BAM
file) and the
instance as a whole (with bamExperiment
, a simple named
list containing information structured as the user sees fit).
bamExperiment <- list(description="Caffeine metabolism views on 1000 genomes samples", created=date()) bv <- BamViews( slxMaq09, bamRanges=bamRanges, bamExperiment=bamExperiment ) metadata(bamSamples(bv)) <- list(description="Solexa/MAQ samples, August 2009", created="Thu Mar 25 14:08:42 2010") bv
The BamViews object can be queried for its component parts, e.g.,
bamExperiment(bv)
More usefully, methods in r Biocpkg("Rsamtools")
are designed to work with
BamViews objects, retrieving data from all files in the view. These operations
can take substantial time and require reliable network access.
To illustrate, the following code (not evaluated when this vignette was created)
downloads the index files associated with the bv
object
bamIndexDir <- tempfile() indexFiles <- paste(bamPaths(bv), "bai", sep=".") dir.create(bamIndexDir) bv <- BamViews( slxMaq09, file.path(bamIndexDir, basename(indexFiles)), # index file location bamRanges=bamRanges, bamExperiment=bamExperiment ) idxFiles <- mapply( download.file, indexFiles, bamIndicies(bv), MoreArgs=list(method="curl") )
and then queries the 1000 genomes project for reads overlapping our transcripts.
library(GenomicAlignments) olaps <- readGAlignments(bv)
The resulting object is about 11 MB in size. To avoid having to download this data each time the vignette is run, we instead load it from disk
library(GenomicAlignments) load(system.file("extdata", "olaps.Rda", package="Rsamtools")) olaps head(olaps[[1]])
There are r length(olaps[[1]])
reads in
r names(olaps)[[1]]
overlapping at least one of our
transcripts. It is easy to explore this object, for instance
discovering the range of read widths in each individual.
head(t(sapply(olaps, function(elt) range(qwidth(elt)))))
Suppose we were particularly interested in the first transcript, which
has a transcript id
r mcols(bamRanges(bv))[["tx_name"]][1]
. Here we
extract reads overlapping this transcript from each of our samples. As
a consequence of the protocol used, reads aligning to either strand
could be derived from this transcript. For this reason, we set the
strand of our range of interest to *
. We use the
endoapply
function, which is like lapply
but
returns an object of the same class (in this case,
r as.vector(class(olaps))
) as its first argument.
rng <- bamRanges(bv)[1] strand(rng) <- "*" olap1 <- endoapply(olaps, subsetByOverlaps, rng) olap1 <- lapply(olap1, "seqlevels<-", value=as.character(seqnames(rng))) head(olap1[[24]])
To carry the example a little further, we calculate coverage of each sample:
minw <- min(sapply(olap1, function(elt) min(start(elt)))) maxw <- max(sapply(olap1, function(elt) max(end(elt)))) cvg <- endoapply( olap1, coverage, shift=-start(ranges(bamRanges[1])), width=width(ranges(bamRanges[1])) ) cvg[[1]]
Since the example includes a single region of uniform width across all samples, we can easily create a coverage matrix with rows representing nucleotide and columns sample and, e.g., document variability between samples and nucleotides
m <- matrix(unlist(lapply(cvg, lapply, as.vector)), ncol=length(cvg)) summary(rowSums(m)) summary(colSums(m))
This vignette has summarized facilities in the r Biocpkg("Rsamtools")
package.
Important additional packages include r Biocpkg("GenomicRanges")
(for
representing and manipulating gapped alignments), r Biocpkg("ShortRead")
for
I/O and quality assessment of ungapped short read alignments,
r Biocpkg("Biostrings")
and r Biocpkg("BSgenome")
for DNA sequence and
whole-genome manipulation, r Biocpkg("IRanges")
for range-based manipulation,
and r Biocpkg("rtracklayer")
for I/O related to the UCSC genome browser. Users
might also find the interface to the integrative genome browser (IGV) in
r Biocpkg("SRAdb")
useful for visualizing BAM
files.
packageDescription("Rsamtools") sessionInfo()
Note: The following operations were performed at the time the
vignette was written; location of on-line resources, in particular the
organization of the 1000 genomes BAM
files, may have changed.
We are interested in collecting the URLs of a number of BAM
files
from the 1000 genomes project. Our first goal is to identify files
that might make for an interesting comparison. First, let's visit the
1000 genomes FTP site and discover available files. We'll use the
r CRANpkg("RCurl")
package to retrieve the names of all files in an
appropriate directory
library(RCurl) ftpBase <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/" indivDirs <- strsplit(getURL(ftpBase, ftplistonly=TRUE), "\n")[[1]] alnDirs <- paste(ftpBase, indivDirs, "/alignment/", sep="") urls0 <- strsplit(getURL(alnDirs, dirlistonly=TRUE), "\n")
From these, we exclude directories without any files in them, select
only the BAM
index (extension .bai
) files, and choose those
files that exactly six '.'
characters in their name.
urls <- urls[sapply(urls0, length) != 0] fls0 <- unlist(unname(urls0)) fls1 <- fls0[grepl("bai$", fls0)] fls <- fls1[sapply(strsplit(fls1, "\\."), length)==7]
After a little exploration, we focus on those files obtained form Solexa
sequencing, aligned using MAQ
, and archived in August, 2009; we remove the
.bai
extension, so that the URL refers to the underlying file rather than
index. There are 24 files.
urls1 <- Filter( function(x) length(x) != 0, lapply(urls, function(x) x[grepl("SLX.maq.*2009_08.*bai$", x)]) ) slxMaq09.bai <- mapply(paste, names(urls1), urls1, sep="", USE.NAMES=FALSE) slxMaq09 <- sub(".bai$", "", slxMaq09.bai)
As a final step to prepare for using a file, we create local copies of the index files. The index files are relatively small (about 190 Mb total).
bamIndexDir <- tempfile() dir.create(bamIndexDir) idxFiles <- mapply( download.file, slxMaq09.bai, file.path(bamIndexDir, basename(slxMaq09.bai)), MoreArgs=list(method="curl") )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.