knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
Many contemporary metabarcoding workflows ignore the quality information provided in FASTQ data, and this restricts their ability to accurately resolve low-frequency amplicon diversity.
DADA2 is an approach (paper here) gaining popularity that integrates this quality information to build a model of sequence errors, allowing very precise inference of amplicon sequences. Here, we show how DADA2 can be implemented in the metabarTOAD wrapper.
This tutorial is designed for paired-end Illumina amplicon data generated with a 2-step PCR method as in Bista et al. 2017. We expect sufficient (>30 bp) overlap of the paired-end reads. DADA2 learns an error model specific to each Illumina run and performs amplicon inference according to that model, so samples from different runs should be analysed separately. This tutorial is designed to analyse data from a single primer set at a time.
For this tutorial you will need the following:
Cutadapt should be installed and working on your system. Keep a note of the directory path to the binary, or have it in your PATH. Not sure about the PATH? - go here.
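If you want to verify cutadapt is reachable before you start, you can call it from R (this assumes the binary is on your PATH; otherwise give the full path to system2()):

# Print the cutadapt version; an error means R cannot find the binary
system2("cutadapt", args = "--version")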
Make sure you start off with a clean folder in a new working directory. Check there is nothing in your directory with dir().
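In a fresh directory this should return an empty character vector:

dir()
#> character(0)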
Let's start by loading up the required package(s).
require(metabarTOAD)
require(dada2)
If you previously used the metabarTOAD workflow to generate 97% OTUs or denoised OTUs with UNOISE3 then the correct folder structure is already in place. Otherwise, use the command below to set up folders for the analysis.
Folders()
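You can confirm the folder structure was created with list.dirs(); the names used later in this tutorial include 1.rawreads, 0.backup and 7.DADA2.

# List the top-level analysis folders created by Folders()
list.dirs(".", recursive = FALSE)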
Now we have a folder structure, let's move our raw Illumina data into the 1.rawreads folder. If disk space allows you might wish to copy your raw files into the 0.backup folder as well, in case you want to start the analysis again from scratch.
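If you prefer to do the copying from within R, a minimal sketch is below, assuming your demultiplexed FASTQ files currently sit in a hypothetical raw_data/ folder:

# Hypothetical source folder; change to wherever your sequencer output lives
raw.files <- list.files("raw_data", pattern = "\\.fastq", full.names = TRUE)
file.copy(raw.files, "1.rawreads/")  # working copy for the analysis
file.copy(raw.files, "0.backup/")    # optional backup copy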
DADA2 requires sequences that have been stripped of primer regions. We can use the function below to strip primers from both sets of read pairs at the same time. This is important as many downstream applications (such as DADA2) rely on reads appearing in the same order in both the forward and reverse read files. If we stripped reads from each file individually we might filter out a single read from a read pair, resulting in unmatched pairs in our files.
dadaReadPrep(PrimerF = "YOUR_FORWARD_PRIMER",
             PrimerR = "YOUR_REVERSE_PRIMER",
             cutadaptdest = "/path/to/cutadapt",
             ncores = 7)
The function dadaReadPrep() will accept ambiguous nucleotides in primer data (Y, M, R, N etc.) and will automatically convert inosine (I) bases to N.
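To illustrate the inosine conversion (the primer sequence here is made up for the example):

# Inosine pairs with any base, so it is treated as N for matching purposes
primer <- "GGIACIGGITGAACIGTITA"  # hypothetical primer containing inosine
gsub("I", "N", primer)
#> [1] "GGNACNGGNTGAACNGTNTA"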
You should now find your stripped reads in the directory 7.DADA2/trimmed.reads/. Have a look and check as below.
list.files("/7.DADA2/trimmed.reads/")
You should see a list of your files. The rest of this workflow follows the usual DADA2 pipeline outlined here.
First let's get the file names of the forward and reverse reads. We assume here your files have the format ****_R1_001.fastq and ****_R2_001.fastq for the forward and reverse reads respectively.
# Path to the primer-trimmed reads produced by dadaReadPrep()
path <- "7.DADA2/trimmed.reads"
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
We can then extract the sample names as below.
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
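It is worth a quick sanity check that the extracted names match your samples:

head(sample.names)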
Now we should inspect the quality of the forward and reverse reads.
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])
As DADA2 uses the quality information in its algorithm we shouldn't worry too much about poor read quality. Here we should truncate where we see the dip in read quality. It is important to retain a minimum 30 bp overlap for merging the reads later in the pipeline, so make sure you understand how much overlap you have and therefore how much you can truncate each read by.
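As a worked example, suppose our amplicon is ~313 bp after primer removal (an assumption here, consistent with the 303-323 bp length filter applied later in this tutorial). Truncating the forward reads to 250 bp and the reverse reads to 180 bp leaves:

# Overlap remaining after truncation = truncF + truncR - amplicon length
amplicon.length <- 313   # assumed post-trimming amplicon length
truncF <- 250
truncR <- 180
truncF + truncR - amplicon.length  # 117 bp, comfortably above the 30 bp minimum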
Next we create objects containing the path for the output files for the next step.
filtFs <- file.path("7.DADA2/filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path("7.DADA2/filtered", paste0(sample.names, "_R_filt.fastq.gz"))
The next function filters our samples so our dataset contains only low-error sequences. Be careful with length truncation for markers that show a great deal of length variation (such as ITS). You can change the error threshold using the parameter maxEE. A lower number is more stringent, as we are filtering out reads that contain more than the listed number of Expected Errors. What are expected errors? Read Here.
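Briefly, the expected errors of a read are the sum of its per-base error probabilities, derived from the Q-scores as 10^(-Q/10). A small worked example with made-up quality scores:

# Expected errors = sum of per-base error probabilities from Q-scores
q.scores <- c(38, 38, 35, 30, 20, 12)  # hypothetical Q-scores for a short read
sum(10 ^ (-q.scores / 10))             # ~0.075 expected errors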
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(250,180),
                     maxN=0, maxEE=c(1,1), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=7)
The next step is to learn the error rate for the sequences. Remember that samples from different runs may have different error models and therefore should not be combined.
errF <- learnErrors(filtFs, multithread=7)
errR <- learnErrors(filtRs, multithread=7)
Now that we have built a model of the errors, we can visualise it as follows.
plotErrors(errF, nominalQ=TRUE)
This diagram details the error rates for each possible base transition (e.g. A2G - adenine erroneously read as guanine). The black points indicate the observed transitions in the data and the black line shows the continuous error rate that DADA2 has calculated. The red line shows what we would expect if the Illumina Q-scores were 100% correct; as you can see, they rarely are exactly right!
We judge the goodness of fit by checking that the black line tracks the black dots. Sometimes the error model produces peculiar curves, but as long as there is a fairly good fit you can use the error rate. If things look drastically wrong, consider using the option nbases=1e+09 in the learnErrors() function to use ten times the number of bases to derive the error model.
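That call would look something like this:

# Fit the error model on ten times the default number of bases
errF <- learnErrors(filtFs, nbases=1e+09, multithread=7)
errR <- learnErrors(filtRs, nbases=1e+09, multithread=7)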
Next we dereplicate the sequences to produce a list of observed sequences and their abundance for the core DADA2 function. We also name the samples in this object as required by the algorithm.
filt.files.F <- list.files("7.DADA2/filtered", pattern="F_filt.fastq.gz", full.names = TRUE)
filt.files.R <- list.files("7.DADA2/filtered", pattern="R_filt.fastq.gz", full.names = TRUE)
derepFs <- derepFastq(filt.files.F, verbose=TRUE)
derepRs <- derepFastq(filt.files.R, verbose=TRUE)
names(derepFs) <- sapply(strsplit(basename(filt.files.F), "_"), `[`, 1)
names(derepRs) <- sapply(strsplit(basename(filt.files.R), "_"), `[`, 1)
Now we implement the core algorithm on the forward and reverse reads separately.
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
Now we have our high-quality sample-by-sequence counts we need to merge the read pairs. We do this as below.
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
Inspect the merger data.frame from the first sample.
head(mergers[[1]])
Now we can construct a sequence table from the merged data.
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
Now we can inspect the sequence length distribution of our output data.
table(nchar(getSequences(seqtab)))
It looks like there are some very long and very short amplicons that are probably erroneous. We can get rid of them as below. Change the numbers as desired for different amplicons.
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 303:323]
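Re-running the length table confirms only sequences in the expected range remain:

# Check the length distribution after filtering
table(nchar(getSequences(seqtab2)))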
We should also check for chimeric sequences. We can do this as below.
seqtab.nochim <- removeBimeraDenovo(seqtab2, method="consensus", multithread=7, verbose=TRUE)
dim(seqtab.nochim)
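Chimeras can account for a large number of sequence variants but should make up only a small fraction of the total reads. It is worth checking the proportion retained:

# Proportion of total reads kept after chimera removal
sum(seqtab.nochim) / sum(seqtab2)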
We can track reads through our pipeline as below.
getN <- function(x) sum(getUniques(x))
track <- cbind(sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- names(dadaFs)
head(track)
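Finally, you may wish to save the chimera-free sequence table for downstream steps such as taxonomy assignment; the file name below is just a suggestion.

# Save the final sequence table; reload later with readRDS()
saveRDS(seqtab.nochim, "7.DADA2/seqtab.nochim.rds")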