adaptorAlign: Align adaptors to reads

Description Usage Arguments Details Value Flipping reads by strand Author(s) See Also Examples

View source: R/adaptorAlign.R

Description

Perform pairwise alignments of adaptor sequences to the ends of the read sequences.

Usage

1
2
3
adaptorAlign(adaptor1, adaptor2, filepath, tolerance=250, gapOpening=5, 
    gapExtension=1, qual.type=c("phred", "solexa", "illumina"), 
    number=1e5, BPPARAM=SerialParam())

Arguments

adaptor1, adaptor2

A string or DNAString object containing the 5'-to-3' sequences of the adaptors on each end of the read.

filepath

A string containing the path to the FASTQ file, or a connection object to a FASTQ file.

tolerance

An integer scalar specifying the length of the ends of the reads to search for adaptors.

gapOpening

A numeric scalar specifying the gap opening penalty.

gapExtension

A numeric scalar specifying the gap extension penalty.

qual.type

String specifying the type of quality scores in filepath.

number

Integer scalar specifying the number of records to read at once from the FASTQ file, see ?FastqStreamer.

BPPARAM

A BiocParallelParam object specifying whether alignment should be parallelized.

Details

This function performs a local-global alignment of each adaptor to the ends of the read sequences (local for the read, global for the adaptor). This aims to identify the adaptors within the read sequence for trimming and/or UMI identification. Alignment is performed using a quality-aware algorithm akin to that in the pairwiseAlignment function from the Biostrings package. The default alignment parameters are chosen to account for the Nanopore's relatively high indel rate.

The adaptor sequences should be provided 5'-to-3', corresponding to the sequence on the ssDNA molecule. Reverse-complementing is performed automatically within the function to search both strands of the read sequence. There should usually be some experimental distinction between adaptor 1 and 2. For example, in RNA sequencing, adaptor 1 may be used for reverse transcription while adaptor 2 may be used for second-strand synthesis.

Any stretch of ambiguous IUPAC bases in the adaptor sequence is identified as a subsequence of interest. This can represent unique molecular identifiers or barcode sequences that may be read-specific. For each read, adaptorAlign will report the corresponding read subsequence that aligns to this stretch. These are reported in the same orientation as the adaptor sequence and can be directly compared across reads.

Value

A DataFrame where each row corresponds to a read in filepath and is named with the read name. It contains the following fields:

read.width

Integer vector containing the length of each read.

adaptor1:

A DataFrame of alignment information for adaptor 1, including the start and end positions of the alignment on the read; the alignment score; and the read subsequences aligned to any ambiguous bases in the adaptor.

adaptor2:

A similar DataFrame of alignment information for adaptor 2.

reversed:

A logical vector indicating whether each read has been reverse-complemented, see below.

The filepath, qual.type and tolerance used are stored in the metadata of the overall DataFrame. Alignment parameters and reference adaptor sequences are stored in the metadata of the each internal adaptor* DataFrame.

Flipping reads by strand

We standardize the alignment statistics in terms of the “canonical cDNA orientation”. In this canonical orientation, the sequence of adaptor 1 is located on the 5' end while the reverse complement of adaptor 2 is located on the 3' end. Any reads observed in the opposite orientation are conceptually reverse-complemented, as marked by the reversed field in the output. This standardization simplifies downstream analyses and comparisons between read sequences, e.g., in consensusReadSeq.

We decide whether or not a read is in the canonical orientation by considering the sum of bounded alignment scores for each orientation. To explain:

  1. Let us conceptually combine the two alignments of adaptor 1 and 2 to the (input) read sequence. This can be treated as a single combined alignment, by imagining a stretch of Ns between the two adaptors that contribute nothing to the score.

  2. The score of the combined alignment is equal to the sum of scores for the individual alignments, due to the additive nature of the Smith-Waterman procedure. However, if either individual score is negative, we set it to zero rather than adding a negative value. This mimics the effect of local alignment, where an adaptor is ignored on either end if it does not improve the score.

  3. We do the same for the scores for the reverse-complemented read sequence. If this is higher than the original score, we decide that the read needs to be reverse-complemented.

Start and end positions of the adaptor alignments are reported relative to the canonical orientation.

The interpretation of the alignment strings is slightly different, as they are reported relative to the supplied adaptor sequences. The starts of the alignment strings always refer to the 5' end of the reference adaptor sequence, regardless of which adaptor is involved. This simplifies the interpretation of the read sequence in downstream processing steps.

Author(s)

Florian Bieberich, Aaron Lun

See Also

tuneAlignment to pick the alignment parameters.

filterReads to remove reads with low-quality adaptor alignments.

realizeReads to extract the actual read sequence.

Examples

1
2
3
4
5
6
7
8
# Mocking up a data set.
a1 <- "AACGGGTCGNNNNNNNACGTACGTNNNNACGA" 
a2 <- "CGTGCTGCATCG"
fout <- tempfile(fileext=".fastq")
ref <- sarlacc:::mockReads(a1, a2, fout, nmolecules=3)

# Aligning it.
(out <- adaptorAlign(adaptor1=a1, adaptor2=a2, filepath=fout))

florian0512/SarlaccSeq documentation built on May 28, 2019, 8:39 p.m.