Description Usage Arguments Details Value Author(s) See Also Examples
View source: R/sequenceLayer.R
sequenceLayer
can lay strings that belong to a given space (e.g.
the "query"
space) alongside another space (e.g. the
"reference"
space) by removing/injecting substrings from/into them,
using the supplied CIGARs.
Its primary use case is to lay the read sequences stored in a BAM file
(which are considered to belong to the "query"
space) alongside
the "reference"
space. It can also be used to remove the parts
of the read sequences that correspond to soft-clipping. More generally
it can lay strings that belong to any supported space alongside any other
supported space. See the Details section below for the list of supported
spaces.
1 2 3 | sequenceLayer(x, cigar, from="query", to="reference",
D.letter="-", N.letter=".",
I.letter="-", S.letter="+", H.letter="+")
|
x |
An XStringSet object containing strings that belong to a given space. |
cigar |
A character vector or factor of the same length as |
from, to |
A single string specifying one of the 8 supported spaces listed in the
Details section below. |
D.letter, N.letter, I.letter, S.letter, H.letter |
A single letter used as a filler for injections. More on this in the Details section below. |
The 8 supported spaces are: "reference"
,
"reference-N-regions-removed"
, "query"
,
"query-before-hard-clipping"
, "query-after-soft-clipping"
,
"pairwise"
, "pairwise-N-regions-removed"
,
and "pairwise-dense"
.
Each space can be characterized by the extended CIGAR
operations that are visible in it. A CIGAR operation is
said to be visible in a given space if it "runs along it",
that is, if it's associated with a block of contiguous positions in
that space (the size of the block being the length of the operation).
For example, the M/=/X operations are visible in all spaces,
the D/N operations are visible in the "reference"
space
but not in the "query"
space, the S operation is visible
in the "query"
space but not in the "reference"
or in the "query-after-soft-clipping"
space, etc...
Here are the extended CIGAR operations that are visible in each space:
reference: M, D, N, =, X
reference-N-regions-removed: M, D, =, X
query: M, I, S, =, X
query-before-hard-clipping: M, I, S, H, =, X
query-after-soft-clipping: M, I, =, X
pairwise: M, I, D, N, =, X
pairwise-N-regions-removed: M, I, D, =, X
pairwise-dense: M, =, X
sequenceLayer
lays a string that belongs to one space alongside
another by (1) removing the substrings associated with operations that
are not visible anymore in the new space, and (2) injecting
substrings associated with operations that become visible in the
new space. Each injected substring has the length of the operation
associated with it, and its content is controlled via the corresponding
*.letter
argument.
For example, when going from the "query"
space to the
"reference"
space (the default), the I- and S-substrings (i.e.
the substrings associated with I/S operations) are removed, and
substrings associated with D/N operations are injected. More precisely,
the D-substrings are filled with the letter specified in D.letter
,
and the N-substrings with the letter specified in N.letter
.
The other *.letter
arguments are ignored in that case.
An XStringSet object of the same class and length as x
.
Hervé Pagès
The stackStringsFromBam
function
for stacking the read sequences (or their quality strings)
stored in a BAM file on a region of interest.
The readGAlignments
function for loading read
sequences from a BAM file (via a GAlignments object).
The extractAt
and
replaceAt
functions in the Biostrings
package for extracting/replacing arbitrary substrings from/in a
string or set of strings.
cigar-utils for the CIGAR utility functions used internally
by sequenceLayer
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 | ## ---------------------------------------------------------------------
## A. FROM "query" TO "reference" SPACE
## ---------------------------------------------------------------------
## Load read sequences from a BAM file (they will be returned in a
## GAlignments object):
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="seq")
gal <- readGAlignments(bamfile, param=param)
qseq <- mcols(gal)$seq # the read sequences (aka query sequences)
## Lay the query sequences alongside the reference space. This will
## remove the substrings associated with insertions to the reference
## (I operations) and soft clipping (S operations), and will inject new
## substrings (filled with "-") where deletions from the reference (D
## operations) and skipped regions from the reference (N operations)
## occurred during the alignment process:
qseq_on_ref <- sequenceLayer(qseq, cigar(gal))
## A typical use case for doing the above is to compute 1 consensus
## sequence per chromosome. The code below shows how this can be done
## in 2 extra steps.
## Step 1: Compute one consensus matrix per chromosome.
qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
cm_by_chrom <- lapply(names(pos_by_chrom),
function(seqname)
consensusMatrix(qseq_on_ref_by_chrom[[seqname]],
as.prob=TRUE,
shift=pos_by_chrom[[seqname]]-1,
width=seqlengths(gal)[[seqname]]))
names(cm_by_chrom) <- names(pos_by_chrom)
## 'cm_by_chrom' is a list of consensus matrices. Each matrix has 17
## rows (1 per letter in the DNA alphabet) and 1 column per chromosome
## position.
## Step 2: Compute the consensus string from each consensus matrix.
## We'll put "+" in the strings wherever there is no coverage for that
## position, and "N" where there is coverage but no consensus.
cs_by_chrom <- lapply(cm_by_chrom,
function(cm) {
## Because consensusString() doesn't like consensus matrices
## with columns that contain only zeroes (and you will have
## columns like that for chromosome positions that don't
## receive any coverage), we need to "fix" 'cm' first.
idx <- colSums(cm) == 0
cm["+", idx] <- 1
DNAString(consensusString(cm, ambiguityMap="N"))
})
## consensusString() provides some flexibility to let you extract
## the consensus in different ways. See '?consensusString' in the
## Biostrings package for the details.
## Finally, note that the read quality strings can also be used as
## input for sequenceLayer():
param <- ScanBamParam(what="qual")
gal <- readGAlignments(bamfile, param=param)
qual <- mcols(gal)$qual # the read quality strings
qual_on_ref <- sequenceLayer(qual, cigar(gal))
## Note that since the "-" letter is a valid quality code, there is
## no way to distinguish it from the "-" letters inserted by
## sequenceLayer().
## ---------------------------------------------------------------------
## B. FROM "query" TO "query-after-soft-clipping" SPACE
## ---------------------------------------------------------------------
## Going from "query" to "query-after-soft-clipping" simply removes
## the substrings associated with soft clipping (S operations):
qseq <- DNAStringSet(c("AAAGTTCGAA", "TTACGATTAN", "GGATAATTTT"))
cigar <- c("3H10M", "2S7M1S2H", "2M1I1M3D2M4S")
clipped_qseq <- sequenceLayer(qseq, cigar,
from="query", to="query-after-soft-clipping")
sequenceLayer(clipped_qseq, cigar,
from="query-after-soft-clipping", to="query")
sequenceLayer(clipped_qseq, cigar,
from="query-after-soft-clipping", to="query",
S.letter="-")
## ---------------------------------------------------------------------
## C. BRING QUERY AND REFERENCE SEQUENCES TO THE "pairwise" or
## "pairwise-dense" SPACE
## ---------------------------------------------------------------------
## Load read sequences from a BAM file:
library(RNAseqData.HNRNPC.bam.chr14)
bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq",
which=GRanges("chr14", IRanges(1, 25000000)))
gal <- readGAlignments(bamfile, param=param)
qseq <- mcols(gal)$seq # the read sequences (aka query sequences)
## Load the corresponding reference sequences from the appropriate
## BSgenome package (the reads in RNAseqData.HNRNPC.bam.chr14 were
## aligned to hg19):
library(BSgenome.Hsapiens.UCSC.hg19)
rseq <- getSeq(Hsapiens, as(gal, "GRanges")) # the reference sequences
## Bring 'qseq' and 'rseq' to the "pairwise" space.
## For 'qseq', this will remove the substrings associated with soft
## clipping (S operations) and inject substrings (filled with "-")
## associated with deletions from the reference (D operations) and
## skipped regions from the reference (N operations). For 'rseq', this
## will inject substrings (filled with "-") associated with insertions
## to the reference (I operations).
qseq2 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise")
rseq2 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise")
## Sanity check: 'qseq2' and 'rseq2' should have the same shape.
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2)))
## A closer look at reads with insertions and deletions:
cigar_op_table <- cigarOpTable(cigar(gal))
head(cigar_op_table)
I_idx <- which(cigar_op_table[ , "I"] >= 2) # at least 2 insertions
qseq2[I_idx]
rseq2[I_idx]
D_idx <- which(cigar_op_table[ , "D"] >= 2) # at least 2 deletions
qseq2[D_idx]
rseq2[D_idx]
## A closer look at reads with skipped regions:
N_idx <- which(cigar_op_table[ , "N"] != 0)
qseq2[N_idx]
rseq2[N_idx]
## A variant of the "pairwise" space is the "pairwise-dense" space.
## In that space, all indels and skipped regions are removed from 'qseq'
## and 'rseq'.
qseq3 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise-dense")
rseq3 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise-dense")
## Sanity check: 'qseq3' and 'rseq3' should have the same shape.
stopifnot(identical(elementNROWS(qseq3), elementNROWS(rseq3)))
## Insertions were removed:
qseq3[I_idx]
rseq3[I_idx]
## Deletions were removed:
qseq3[D_idx]
rseq3[D_idx]
## Skipped regions were removed:
qseq3[N_idx]
rseq3[N_idx]
## ---------------------------------------------------------------------
## D. SANITY CHECKS
## ---------------------------------------------------------------------
SPACES <- c("reference",
"reference-N-regions-removed",
"query",
"query-before-hard-clipping",
"query-after-soft-clipping",
"pairwise",
"pairwise-N-regions-removed",
"pairwise-dense")
cigarWidth <- list(
function(cigar) cigarWidthAlongReferenceSpace(cigar),
function(cigar) cigarWidthAlongReferenceSpace(cigar,
N.regions.removed=TRUE),
function(cigar) cigarWidthAlongQuerySpace(cigar),
function(cigar) cigarWidthAlongQuerySpace(cigar,
before.hard.clipping=TRUE),
function(cigar) cigarWidthAlongQuerySpace(cigar,
after.soft.clipping=TRUE),
function(cigar) cigarWidthAlongPairwiseSpace(cigar),
function(cigar) cigarWidthAlongPairwiseSpace(cigar,
N.regions.removed=TRUE),
function(cigar) cigarWidthAlongPairwiseSpace(cigar, dense=TRUE)
)
cigar <- c("3H2S4M1D2M2I1M5N3M6H", "5M1I3M2D4M2S")
seq <- list(
BStringSet(c(A="AAAA-BBC.....DDD", B="AAAAABBB--CCCC")),
BStringSet(c(A="AAAA-BBCDDD", B="AAAAABBB--CCCC")),
BStringSet(c(A="++AAAABBiiCDDD", B="AAAAAiBBBCCCC++")),
BStringSet(c(A="+++++AAAABBiiCDDD++++++", B="AAAAAiBBBCCCC++")),
BStringSet(c(A="AAAABBiiCDDD", B="AAAAAiBBBCCCC")),
BStringSet(c(A="AAAA-BBiiC.....DDD", B="AAAAAiBBB--CCCC")),
BStringSet(c(A="AAAA-BBiiCDDD", B="AAAAAiBBB--CCCC")),
BStringSet(c(A="AAAABBCDDD", B="AAAAABBBCCCC"))
)
stopifnot(all(sapply(1:8,
function(i) identical(width(seq[[i]]), cigarWidth[[i]](cigar))
)))
sequenceLayer2 <- function(x, cigar, from, to)
sequenceLayer(x, cigar, from=from, to=to, I.letter="i")
identical_XStringSet <- function(target, current)
{
ok1 <- identical(class(target), class(current))
ok2 <- identical(names(target), names(current))
ok3 <- all(target == current)
ok1 && ok2 && ok3
}
res <- sapply(1:8, function(i) {
sapply(1:8, function(j) {
target <- seq[[j]]
current <- sequenceLayer2(seq[[i]], cigar,
from=SPACES[i], to=SPACES[j])
identical_XStringSet(target, current)
})
})
stopifnot(all(res))
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.