In the former chapter, we have had an initial look at computer optimisation. In the present chapter, we will look into how read alignments are commonly manipulated and what are the principles followed (and caveats faced) by standard tools.
options(digits=2) library(RnaSeqTutorial) bamfiles <- dir(file.path(extdata(),"BAM"), pattern="*.bam$",full.names=TRUE)
Most down-stream analysis of short read sequences is based on reads
aligned to reference genomes. There are many aligners available,
including BWA
[@pmid20080505],[@pmid19451168],
Bowtie2
[@pmid19261174],
GSNAP
[@Wu:2005p5857],
STAR
[@Dobin:2013p5293];
merits of which are discussed in the literature. There are also
alignment algorithms implemented in Bioconductor (e.g.,
matchPDict
in the Biostrings
package
and the gmapR
, Rbowtie
,
Rsubread
packages); matchPDict
is
particularly useful for flexible alignment of moderately sized subsets
of data.
The following sections introduce core tools for working with high-throughput sequence data; key packages for representing reads and alignments are summarized in the following table.
| Package | Description |
|---------|-------------|
| ShortRead
| In addition to functionalities to manipulate raw read files, e.g.: the ShortReadQ
class and functions for manipulating fastq
files; this package offers the possibility to load numerous HTS formats classes. These are mostly sequencer manufacturer specific e.g.: sff
for 454 or pre-BAM aligner proprietary formats, e.g.: MAQ
or bowtie
. These functionalities rely heavily on Biostrings
and somewhat on Rsamtools
. |
| GenomicAlignments
| GAlignments
and GAlignmentPairs
store single- and paired-end aligned reads and extend on the GenomicRanges
functionalities. |
| Rsamtools
| Provides access to BAM alignment and other large sequence-related files. |
| rtracklayer
| Input and output of bed
, wig
and similar files. |
Read the man page of the GAlignments
and
GAlignmentPairs
classes and pay attention to the very
important comments on multi-reads and paired-end processing.
Really just ?GAlignments
. However, KEEP these details
in mind as they are essential and likely source of erroneous conclusions.
Remember the example of this morning lecture about RNA editing.
ShortRead
packageEarlier, you looked at the ShortRead
package to
manipulate raw reads and to perform Quality Assessment (QA) on raw data
files e.g.: fastq formatted files. These are not the only
functionalities from the ShortRead
package, which
offers as well the possibility to read in alignments files in many
different formats.
Two files of the pasilla dataset have been aligned using
bowtie
[@Langmead:2009p309], locate
them in the extdata folder of the RnaSeqTutorial
package.
bwtFiles <- dir(path=file.path(extdata(),"bowtie"), pattern="*.bwt$",full.names=TRUE)
Have a pick at one of the file and try to decipher its format. Hint: it
is a tab delimited format, so check the read.delim
function. As you may not want to read all the lines to get an idea,
lookup an appropriate argument for that.
You might want to check the bowtie manual for checking whether your guesses were correct. Here is how to read 10 lines of the first file.
read.delim(file=file.path(extdata(),"bowtie","SRR074430.bwt"), header=FALSE,nrows=10)
now, as was presented in the lecture use the
readAligned
function to read in the bowtie alignment
files.
alignedRead <- readAligned(dirPath=file.path(extdata(),"bowtie"), pattern="*.bwt$",type="Bowtie")
What is peculiar about the returned object? Determine its class. Can you tell where the data from both input files are?
We obtained a single object of the AlignedRead
class.
By looking at the documentation, i.e.: ?readAligned
in the Value section, we are told that all files are concatenated in a
single object with NO guarantee of order in which files are read. This
is convenient when we want to merge several sequencing runs of the same
sample but we need to be cautious and process independent sample by
individually calling the readAligned
function for
every sample.
Finally, taking another look at the lecture, select only the reads that
align to chromosome 2L. Hint, use the appropriate
SRFilter
filter.
alignedRead2L <- readAligned(dirPath=file.path(extdata(),"bowtie"), pattern="*.bwt$",type="Bowtie", filter=chromosomeFilter("2L"))
This concludes the overview of the ShortRead
package.
As the BAM format has become a de-facto standard, it is unlikely that
you end up using that package to process reads in R, but rather the
Rsamtools
package presented next.
Rsamtools
packageMost main-stream aligners produce output in SAM (text-based) or BAM format. A SAM file is a text file, with one line per aligned read, and fields separated by tabs. Here is an example of a single SAM line, split into fields.
fl <- system.file("extdata", "ex1.sam", package="Rsamtools") strsplit(readLines(fl, 1), "\t")[[1]]
Fields in a SAM file are summarized in the following table:
| Field | Name | Value | |-----------|-------|-----------------------------------------| | 1 | QNAME | Query (read) NAME | | 2 | FLAG | Bitwise FLAG, e.g., strand of alignment | | 3 | RNAME | Reference sequence NAME | | 4 | POS | 1-based leftmost POSition of sequence | | 5 | MAPQ | MAPping Quality (Phred-scaled) | | 6 | CIGAR | Extended CIGAR string | | 7 | MRNM | Mate Reference sequence NaMe | | 8 | MPOS | 1-based Mate POSition | | 9 | ISIZE | Inferred insert SIZE | | 10 | SEQ | Query SEQuence on the reference strand | | 11 | QUAL | Query QUALity | | 12+ | OPT | OPTional fields, format TAG:VTYPE:VALUE |
We recognize from the FASTQ format introduced earlier, the identifier
string, read sequences and qualities. The alignment is to a chromosome
‘seq1’ starting at position 1. The strand of alignment is encoded in the
‘flag’ field. The alignment record also includes a measure of mapping
quality, and a CIGAR string describing the nature of the alignment. In
this case, the CIGAR is 36M, indicating that the alignment consisted of
36 M
atches or mismatches, with no indels or gaps; indels are
represented by I
and D
; gaps (e.g., from alignments spanning
introns) by N
.
BAM files encode the same information as SAM files, but in a format that is more efficiently parsed by software; BAM files are the primary way in which aligned reads are imported in to R.
As introduced in the previous section, there are three different packages to read alignments in R:
ShortRead
GenomicAlignments
Rsamtools
The last two will be described in more details in the next paragraphs.
The readGAlignments
function from the
GenomicAlignments
package reads essential information
from a BAM file into an instance of the
GAlignments
class. The GAlignments
class has been designed to allow useful manipulation of many reads
(e.g., 20 million) under moderate memory requirements (e.g., 4 GB).
Use the readGAlignments
to read in the “ex1.bam” that
can be found in the “extdata” folder of the Rsamtools
package.
alnFile <- system.file("extdata", "ex1.bam", package="Rsamtools") aln <- readGAlignments(alnFile)
head(aln, 3)
The readGAlignments
function takes an additional
argument: param
that allows to streamline the
information retrieved from a BAM files, e.g.: to retrieve alignments
to known gene coordinates only.
A GAlignments
instance is like a data frame, but with
accessors as suggested by the column names. It is easy to query, e.g.,
the distribution of reads aligning to each strand, the length of reads (a.k.a.
width), or the alignment cigar strings
Summarize the strand, width and CIGAR information from the aln
object.
table(strand(aln)) table(width(aln)) head(sort(table(cigar(aln)), decreasing=TRUE))
The GenomicAlignments
package readGAlignments
function only parses some of the fields of a BAM file, and that may not
be appropriate for every usage. In such cases the scanBam
function in Rsamtools
provides greater flexibility.
The idea is to view BAM files as a kind of data base. Particular regions
of interest can be selected, and the information in the selection
restricted to particular fields. These operations are determined by the
values of a ScanBamParam
object, passed as the named
param
argument to scanBam
.
Consult the help page for ScanBamParam
, and construct
an object that restricts the information returned by a
scanBam
query to the aligned read DNA sequence. Your
solution will use the what
parameter of the
ScanBamParam
function.
Use the ScanBamParam
object to query a BAM file, and
calculate the GC content - using the gcFunction
function - of all aligned
reads. Summarize the GC content as a histogram:
param <- ScanBamParam(what="seq") seqs <- scanBam(bamfiles[[1]], param=param) readGC <- gcFunction(seqs[[1]][["seq"]]) hist(readGC)
The Rsamtools
package has more advanced
functionalities:
functions to count, index, filter, sort BAM files
functions to access the header only
the possibility to access SAM attributes (tags)
functions to manipulate the CIGAR string
the possibility to combine BAM libraries into a study set (BamViews)
…
Find out the function that permit to scan the BAM header and retrieve the header of the first BAM file in the bigdata() bam subfolder.
It contains the reference sequence length and names as well as the name, version and command line of the tool used for performing the alignments.
scanBamHeader(bamfiles[1])
The SAM/BAM format contains a tag: “NH” that defines the total number of valid alignments reported for a read. How can you extract that information from the same first bam file and plot it as an histogram?
param <- ScanBamParam(tag="NH") nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
So it seems a majority of our reads have multiple alignments! Some
processing might be required to deal with these; i.e.: if reads were
aligned to the transcriptome there exist tools that can deconvoluate the
transcript specific expression, for example MMSEQ [@Turro:2011p4762],
BitSeq [@bitseq], that last one existing as an R package too: BitSeq
.
Otherwise if reads were aligned to the genome, one should consider
filtering these multiple alignments to avoid introducing artifactual
noise in the subsequent analyses.
The CIGAR string contains interesting information, in particular, whether or not a given match contain indels. Using the first bam file, can you get a matrix of all seven CIGAR operations? And find out the intron size distribution?
param <- ScanBamParam(what="cigar") cigars <- scanBam(bamfiles[[1]], param=param)[[1]]$cigar cigar.matrix <- cigarOpTable(cigars) intron.size <- cigar.matrix[,"N"] intron.size[intron.size>0] plot(density(intron.size[intron.size>0])) hist(log10(intron.size[intron.size>0]),xlab="intron size (log10 bp)")
Look up the documentation for the BamViews
and using the leeBamViews
package, create a BamViews instance. Afterwards, use some of the
accessors of that object to retrieve e.g. the file paths or the sample
names
library(leeBamViews) bpaths = dir(system.file("bam", package="leeBamViews"), full=TRUE, patt="bam$") gt<-do.call(rbind,strsplit(basename(bpaths),"_"))[,1] geno<-substr(gt,1,nchar(gt)-1) lane<-substr(gt,nchar(gt),nchar(gt)) pd = DataFrame(geno=geno, lane=lane, row.names=paste(geno,lane,sep=".")) bs1 = BamViews(bamPaths=bpaths, bamSamples=pd, bamExperiment=list(annotation="org.Sc.sgd.db")) bamPaths(bs1) bamSamples(bs1)
Finally, extract the coverage for the locus 861250:863000 on chromosome “Scchr13” for every sample in the object
sel <- GRanges(seqnames = "Scchr13", IRanges(start = 861250, end = 863000), strand="+") covex = RleList(lapply(bamPaths(bs1), function(x){ coverage(readGAlignments(x))[sel][["Scchr13"]]}))
This offer an interesting way to process multiple sample at the same time when you’re interested in a particular locus.
There are extensive vignettes for Rsamtools
and
GenomicAlignments
packages.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.