Description Usage Arguments Details Value Author(s) References See Also Examples
Perform a multiple sequence alignment of reads in the same UMI group, i.e., originating from the same DNA molecule.
1 2 | multiReadAlign(reads, groups, max.error=NA, match=0, mismatch=-1,
gapOpening=5, gapExtension=1, bandwidth=100, BPPARAM=SerialParam())
|
reads |
A DNAStringSet or QualityScaledDNAStringSet object containing read sequences. |
groups |
A factor specifying which reads belong to the same UMI group, e.g., from |
max.error |
A numeric scalar specifying the maximum base-calling error, above which bases will be masked during alignment. Currently ignored. |
match, mismatch |
Numeric scalars specifying the match and mismatch scores. |
gapOpening, gapExtension |
Numeric scalars specifying the gap opening and extension penalties. |
bandwidth |
Integer scalar specifying the bandwidth for banded alignment, corresponding to the maximum net difference in the numbers of insertions and deletions. |
BPPARAM |
A BiocParallelParam object controlling how paralellization should be performed. |
This function performs a multiple sequence alignment of reads from the same DNA molecule.
Reads should be grouped beforehand based on similarity in the UMI sequences, as described in umiGroup
.
The aim is to use the alignments to form a consensus sequence using consensusReadSeq
.
This allows us to obtain accurate sequences of the original molecule, overcoming the high error rate of ONT.
The banded T-Coffee algorithm from RSeqAn is used to perform efficient multiple sequence alignment of many reads.
Note that, for DNA sequences, we assume that all reads are on the same strand.
This should automatically be the case if adaptorAlign
was used with two different adaptors on either end of the read.
However, if the same adaptors were used on both ends, the strand of the sequence is unknown and may require other information to discern.
T-Coffee is not aware of quality strings, so we mask low-quality bases prior to the multiple sequence alignment.
All bases with error probabilities above max.error
are set to "N"
.
This ensures that poor base calls do not compromise the alignment quality.
If min.qual=NA
, no masking is performed.
All masks are removed in the output strings unless keep.mask=TRUE
, in which case the strings may contain "N"
's.
A DataFrame containing alignments
, a List of character vectors.
Each vector corresponds to a read group in groups
and contains the multiple sequence alignment strings for that group.
If reads
is a QualityScaledDNAStringSet, the output DataFrame will also contain qualities
, a List of quality score objects.
Each object contains the quality scores for each sequence in the MSA in the corresponding entry of alignments
.
Florian Bieberich, with modifications from Aaron Lun
Reinert K, Dadi TH, Ehrhardt M et al. (2017). The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J. Biotechnol. 261:157-168
Notredame C, Higgins DG, Heringa J (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1):205-17
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | reads <- DNAStringSet(c("ACACTGGTTCAGGT",
"ACACGGTTCAGGT",
"CGGACTGACACGGT",
"CGGGCTGACACGGT"))
multiReadAlign(reads, c(1,1,2,2))
qreads <- QualityScaledDNAStringSet(reads,
PhredQuality(c("23849723948733",
"*(&^&23498342",
")(*!@!343928&^",
"2($*&$*A323128")))
multiReadAlign(qreads, c(1,1,2,2))
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.