multiReadAlign: Multiple read alignment
In MarioniLab/sarlacc: Pipeline for Oxford Nanopore RNA-Seq Data Analysis

Description Usage Arguments Details Value Author(s) References See Also Examples

Perform a multiple sequence alignment of reads in the same UMI group, i.e., originating from the same DNA molecule.

1 2	multiReadAlign(reads, groups, max.error=NA, match=0, mismatch=-1, gapOpening=5, gapExtension=1, bandwidth=100, BPPARAM=SerialParam())

`reads`	A DNAStringSet or QualityScaledDNAStringSet object containing read sequences.
`groups`	A factor specifying which reads belong to the same UMI group, e.g., from `umiGroup`. Alternatively, a list of indices specifying reads in the same group.
`max.error`	A numeric scalar specifying the maximum base-calling error, above which bases will be masked during alignment. Currently ignored.
`match, mismatch`	Numeric scalars specifying the match and mismatch scores.
`gapOpening, gapExtension`	Numeric scalars specifying the gap opening and extension penalties.
`bandwidth`	Integer scalar specifying the bandwidth for banded alignment, corresponding to the maximum net difference in the numbers of insertions and deletions.
`BPPARAM`	A BiocParallelParam object controlling how paralellization should be performed.

This function performs a multiple sequence alignment of reads from the same DNA molecule. Reads should be grouped beforehand based on similarity in the UMI sequences, as described in umiGroup. The aim is to use the alignments to form a consensus sequence using consensusReadSeq. This allows us to obtain accurate sequences of the original molecule, overcoming the high error rate of ONT.

The banded T-Coffee algorithm from RSeqAn is used to perform efficient multiple sequence alignment of many reads. Note that, for DNA sequences, we assume that all reads are on the same strand. This should automatically be the case if adaptorAlign was used with two different adaptors on either end of the read. However, if the same adaptors were used on both ends, the strand of the sequence is unknown and may require other information to discern.

T-Coffee is not aware of quality strings, so we mask low-quality bases prior to the multiple sequence alignment. All bases with error probabilities above max.error are set to "N". This ensures that poor base calls do not compromise the alignment quality. If min.qual=NA, no masking is performed. All masks are removed in the output strings unless keep.mask=TRUE, in which case the strings may contain "N"'s.

A DataFrame containing alignments, a List of character vectors. Each vector corresponds to a read group in groups and contains the multiple sequence alignment strings for that group.

If reads is a QualityScaledDNAStringSet, the output DataFrame will also contain qualities, a List of quality score objects. Each object contains the quality scores for each sequence in the MSA in the corresponding entry of alignments.

Florian Bieberich, with modifications from Aaron Lun

Reinert K, Dadi TH, Ehrhardt M et al. (2017). The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J. Biotechnol. 261:157-168

Notredame C, Higgins DG, Heringa J (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1):205-17

umiGroup, consensusReadSeq

reads <- DNAStringSet(c("ACACTGGTTCAGGT", 
    "ACACGGTTCAGGT",
    "CGGACTGACACGGT",
    "CGGGCTGACACGGT"))

multiReadAlign(reads, c(1,1,2,2))

qreads <- QualityScaledDNAStringSet(reads,
    PhredQuality(c("23849723948733", 
    "*(&^&23498342",
    ")(*!@!343928&^",
    "2($*&$*A323128")))

multiReadAlign(qreads, c(1,1,2,2))