set.seed(1000)
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(sarlacc)

Introduction

The r Biocpkg("sarlacc") package is designed to perform error correction on long reads such as those generated with the Oxford Nanpore Technology (ONT). This is done by considering the unique molecular identifier (UMI) associated with each read. Reads with the same UMI are groups together and used to construct a consensus sequence, thereby correcting for any sequencing errors unique to each read. In this manner, we can overcome the higher error rate of ONT data while still retaining the long-read capabilities.

Setting up sequence data

First, we define the adaptor sequences that should be present on each end of the read sequences. Here, adaptor 1 contains two stretches of Ns, corresponding to the sample barcode (first stretch) and UMI (second stretch). Other IUPAC bases are also supported in situations where adaptors are constructed with some degree of degeneracy. Of course, it is also possible for adaptor 2 to contain barcodes and UMIs.

adaptor1 <- "ACGCAGATCGATCGATNNNNNNNNNNNNCGCGCGAGCTGACTNNNNGCACGACTCTGGTTTTTTTTTTTT"
adaptor2 <- "AAGGCCTTTTCCGACTCATGAA"

We also define the specific barcode sequences that could be present in the first stretch of Ns in adaptor1. Here, we assume that we have three different barcodes in our experiment:

all.barcodes <- c("AAGGAATTAAGG", "GGCCAACCGGTT", "CCGGTTGGCCAA")

To demonstrate the various r Biocpkg("sarlacc") functions in this vignette, we will mock up some data and save it as a FASTQ file.

set.seed(1000)
fastq <- tempfile(fileext=".fastq")
ref.seq <- sarlacc:::mockReads(adaptor1=adaptor1, adaptor2=adaptor2, 
    filepath=fastq, all.barcodes=all.barcodes) 
fastq 

Needless to say, this is not necessary for users with sequencing data from real experiments.

Aligning adaptors

Identifying alignment parameters

Our first task is to align the adaptors to the end of the reads. We determine suitable alignment parameters using the tuneAlignment function. This will identify the parameter combination that maximizes the separation between the real alignment scores and those for scrambled input sequences.

tuning <- tuneAlignment(adaptor1, adaptor2, fastq)
tuning$parameters

We confirm that these parameter choices are sensible by verifying the separation in the distribution of scores for the real alignments compared to the scrambled sequences.

real.dist <- density(tuning$scores$reads)
scram.dist <- density(tuning$scores$scrambled)
plot(real.dist, xlim=range(real.dist$x, scram.dist$x),
    ylim=range(real.dist$y, scram.dist$y), col="blue")
lines(scram.dist, col="red")

This step is time-consuming but does not need to be routinely performed in every analysis. It is only necessary to give an intuition for the suitable alignment parameters for a batch of data.

Performing the alignment

We now align the adaptor sequences to all of the reads using the adaptorAlign function^[We are using fixed parameters chosen from one run of the tuneAlignment. We could also supply the tuned parameters directly but will not do so for simplicity.]. For speed, the alignment will only consider the 250 bp on either end of the read, under the assumption that the adaptors should occur at the read ends. This can be further sped up by distributing jobs across multiple cores with the BPPARAM= argument.

aln.out <- adaptorAlign(adaptor1, adaptor2, fastq, gapOpening=4, gapExtension=1)
colnames(aln.out)

The output is a DataFrame that contains alignment information for each adaptor. For example, the adaptor1 field contains a nested DataFrame with alignment scores and positions for adaptor1 on the read. It also contains the read subsequences that aligned to the ambiguous regions in adaptor1 (i.e., the UMI and barcode).

aln.out$adaptor1

Note that the positions refer to coordinates on a "canonical orientation" of the read where the adaptor 1 occurs at the 5' end. Reads are marked as being reverse-complemented (if necessary) to match this canonical orientation, to ensure that adaptor 1 and 2 are located on the 5' and 3' ends respectively. Whether or not a read sequence was reverse-complemented is indicataed by the aln.out$reversed field:

summary(aln.out$reversed)

Filtering for adaptor-containing reads

We use the getAdaptorThresholds function to choose an appropriate score threshold for the presence of an adaptor. This involves scrambling the start and end sequences and repeating the alignment, with the aim of choosing a threshold that distinguishes between the real and scrambled scores. To expedite this process, we only perform this for a random set of 10000 reads.

thresh <- getAdaptorThresholds(aln.out, number=1e4)
thresh$threshold1
thresh$threshold2

We verify that the thresholds are suitable for adaptor 1:

real.dist <- density(thresh$scores1$reads)
scram.dist <- density(thresh$scores1$scrambled)
plot(real.dist, xlim=range(real.dist$x, scram.dist$x),
    ylim=range(real.dist$y, scram.dist$y), col="blue")
lines(scram.dist, col="red")
abline(v=thresh$threshold1, col="grey", lty=2)

... and for adaptor 2:

real.dist <- density(thresh$scores2$reads)
scram.dist <- density(thresh$scores2$scrambled)
plot(real.dist, xlim=range(real.dist$x, scram.dist$x),
    ylim=range(real.dist$y, scram.dist$y), col="blue")
lines(scram.dist, col="red")
abline(v=thresh$threshold2, col="grey", lty=2)

Based on these thresholds, we retain only those reads that have well-aligned adaptors on both sides^[We'll round the thresholds to fixed values, for simplicity and to simplify reproducibility.].

filtered <- filterReads(aln.out, 6, 11)
nrow(filtered)

Demultiplexing reads

Sample barcodes may be present in the adaptor to enable multiplexed sequencing of many samples. In this case, the first stretch of Ns represents the barcode so we take subseq$Sub1:

(obs.barcodes <- filtered$adaptor1$subseq$Sub1)

We assign reads back to their sample of origin by aligning the observed barcodes to the references in all.barcodes:

(debarcoded <- barcodeAlign(obs.barcodes, all.barcodes,
    gapOpening=4, gapExtension=1))

The function returns the read sequence at the barcode location, the assigned barcode, its alignment score and the gap to the score of the next-best aligned barcode. An unambiguous assignment to a single barcode should manifest as a high score and a large gap. We define thresholds for both of these metrics with getBarcodeThresholds:

(barcode.thresh <- getBarcodeThresholds(debarcoded, nmads=2))

These thresholds are designed to remove reads with low outlier scores or gaps. We use a lower nmads than the default to be more stringent. We check that the thresholds are reasonable:

par(mfrow=c(1,2))
hist(debarcoded$score, xlab="Barcode alignment score", col="grey80")
abline(v=barcode.thresh["score"], col="red", lty=2)
hist(debarcoded$gap, xlab="Barcode alignment gap", col="grey80")
abline(v=barcode.thresh["gap"], col="red", lty=2)

We then filter the reads to only retain those that were unambiguously assigned to a barcode:

was.assigned <- debarcoded$score > barcode.thresh["score"] &
    debarcoded$gap > barcode.thresh["gap"]
table(debarcoded$barcode, was.assigned)

... and split them into their samples of origin:

by.sample <- split(filtered[was.assigned,], debarcoded$barcode[was.assigned])
names(by.sample)

The rest of the analysis should proceed on each element of by.sample separately, as reads from different samples cannot originate from the same molecule. For simplicity, we will focus our analysis on the barcode with the most reads:

n.per.sample <- unlist(lapply(by.sample, nrow))
my.sample <- by.sample[[which.max(n.per.sample)]]
nrow(my.sample)

Defining read groups

Overview

We want to define groups of reads that are likely to have originated from the same cDNA molecule. This requires that (i) the reads have similar sequences, and (ii) the reads have similar UMIs. Thus, we group together reads by applying thresholds to the percentage of read sequence identity and to the edit distances between UMIs.

Note that there is always a balance between specificity and sensitivity when choosing thresholds:

For the purpose of error correction, more conservative thresholds are generally preferable. We do not want to use reads from multiple molecules for correction, especially if molecules differ in their base composition due to, e.g., allelic variation. (On the other hand, more relaxed thresholds may be desirable if the aim is to eliminate PCR duplicates.)

Based on the read sequence

We first extract the read sequence using realizeReads(). This standardizes the strand orientation and trims adaptors in preparation for further processing.

read.seq <- realizeReads(my.sample)
read.seq

A simple grouping strategy is to align the reads to a reference and define groups of reads based on their overlap with annotated features. For example, we could use minimap2 to align reads to a reference transcriptome:

minimap2 -d index.mmi transcriptome.fa

# Standardized orientation to reverse strand.
minimap2 -ax splice --rev-only index.mmi reads.fastq

Each set of reads mapping to the same transcript would then be defined as a single group. We would use the name of the reference sequence as a grouping factor, which can be extracted from SAM files using the sam2ranges() function.

A more complex approach would be to cluster reads together using software like vsearch. Each cluster is defined based on a minimum sequence similarity threshold to a centroid sequence. This is more computationally intensive but avoids the need for a pre-defined reference. We recommend using:

vsearch --cluster_fast reads.fastq -id 0.8 -iddef=0 --uc out.txt

where the output file out.txt can be easily read in using read.table() to create a grouping factor.

Most of these strategies require software that are not available as R packages. Thus, we need to save the read sequences to a FASTQ file for external processing, which can be done using writeXStringSet().

tmp.file <- tempfile(fileext=".fastq")
writeXStringSet(read.seq, qualities=quality(read.seq), 
    format="fastq", file=tmp.file)

For demonstration purposes, we will use groupings of reads based on their MOLECULE_ tags. Obviously, this would not be available in real data analyses, but setting up an alignment to a real reference transcriptome is beyond the scope of this vignette. The main point here is that a pre.groups vector of group IDs should be constructed, with one value per read indicating the group to which the read belongs.

pre.groups <- sub("MOLECULE_([0-9]+).*", "\\1", rownames(my.sample))
table(pre.groups)

Based on the UMI

The second ambiguous stretch on adaptor 1 represents the UMI for each read. All reads derived from the same transcript molecule should have the same UMI (give or take some differences due to sequencing errors).

(my.umis <- my.sample$adaptor1$subseq$Sub2)

We apply a threshold on the edit distance between UMI sequences to determine whether two reads have the same UMI. To obtain an appropriate threshold, we use the expectedDist function to compute the distribution of edit distances between (nominally identical) adaptor sequences:

# Identify the constant region flanking the UMI on adaptor 1. 
substr(adaptor1, 47, 50)

# Obtaining the aligned read sequence for a sample of 1000 reads.
chosen <- sample(nrow(aln.out), min(nrow(aln.out), 1000))
(constants <- extractSubseq(aln.out[chosen,], subseq1=list(starts=47, ends=50)))

# Computing the expected distance.
edist <- expectedDist(constants$adaptor1$Sub1)
hist(edist, xlab="Edit distances", col="grey80")

... and we set the threshold to the median edit distance (or 1, if the median is less than zero):

(ethresh <- max(1, median(edist)))

We group the UMIs for reads within each level of pre.groups:

groups <- umiGroup(my.umis, threshold1=ethresh, groups=pre.groups)
summary(lengths(groups))

If a second UMI is present (e.g., on the other adaptor), these can be handled by obtaining the subsequence from my.sample$adaptor2 repeating the expectedDist step. The second set of sequences and threshold can then be supplied as additional arguments to umiGroup.

Performing error correction

We perform multiple sequence alignments (MSAs) for each read group using the r Biocpkg("muscle") package. For large data sets, this is often the most time-consuming step and can be parallelized with BPPARAM.

msa.out <- multiReadAlign(read.seq, groups)

# Peeking at the largest MSA.
by.size <- order(lengths(msa.out$alignments), decreasing=TRUE)
lapply(msa.out$alignments[by.size[1]], subseq, 1, 70) 

We create consensus sequences from these MSAs, representing the error-corrected sequence. The quality scores are constructed from the qualities of the individual read sequences. Higher consensus qualities for a position indicate that many reads are in agreement.

cons.out <- consensusReadSeq(msa.out)
cons.out

This can be saved to a FASTQ file with writeXStringSet for further analyses.

Session information

sessionInfo()


MarioniLab/sarlacc documentation built on May 13, 2019, 12:51 p.m.