knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
devtools::load_all() # TODO: Switch this to library(neonMicrobe) before publishing
setBaseDirectory(dirname(getwd()))
knitr::opts_knit$set(
  root.dir = NEONMICROBE_DIR_BASE()
)
makeDataDirectories(check_location = FALSE)
This vignette demonstrates how to use the functions and parameters in this package to process the raw NEON ITS sequence data into ASV tables using the DADA2 workflow. It is based on the DADA2 ITS Pipeline Tutorial (1.8). The functions showcased in this vignette only support processing the forward ITS reads, following recommendations by Pauvert et al. (2019).
It is assumed that you have already completed the steps in the "Download NEON Data" vignette, so that there are raw sequence files in the appropriate data directory.
Note that some of the code chunks in this .Rmd file are not actually run, because some DADA2 functions have relatively long runtimes. Therefore, outputs like the taxonomy table are not displayed.
library(neonMicrobe)
library(ShortRead)
library(Biostrings)
library(dada2)
library(dplyr)
library(ggplot2)
Setting the base directory using setBaseDirectory() will create a point of reference for the neonMicrobe package, so that it knows where to look for the raw sequence files and where to save the processed data. This should be the same base directory that you used in the "Download NEON Data" vignette. If you're continuing to use the same R session, you don't have to run this again.
print(getwd())
setBaseDirectory()
As explained in the "Download NEON Data" vignette, neonMicrobe is a metadata-first processing pipeline, so we begin by retrieving sequence metadata. Any fastq files that are represented in the metadata will be processed by the functions in the next section, and any fastq files that are not represented in the metadata will be ignored. Therefore, it is recommended that you use the same metadata you used to download the files in the first place, lest you end up with fewer or more processed samples than intended.
data("seqmeta_greatplains_its") # This is external data that comes with the package, much like "data(mtcars)" meta <- qcMetadata(seqmeta_greatplains_its, pairedReads = "N", rmFlagged = "Y") meta$date_ymd <- as.Date(format(as.Date(meta$collectDate, format="%Y-%m-%d %H:%M:%S"), "%Y-%m-%d"))
It is generally better to use DADA2 to process one sequencing run at a time, because different sequencing runs may have different error rates, and the dada denoising algorithm assumes homogeneous error rates across input samples. It is also simpler, for demonstration purposes, to show how to process just one sequencing run. In practice, the part of this section titled "Denoise reads using the dada algorithm" may be enclosed within a for loop that cycles through all sequencing runs in the metadata; a minimal sketch of such a loop appears near the end of this vignette.
unique_runs <- unique(meta$sequencerRunID)
meta_onerun <- meta[which(meta$sequencerRunID == unique_runs[1]), ]
nrow(meta_onerun)
Although you can specify filenames to process, here we'll just use the filenames included in this subset of the metadata. (Don't worry; any files that don't exist in the file system will be safely ignored.) Note that these filenames are just basenames, i.e. they contain no directory path information. This is fine, as we will see in the subsections below. In addition, we will only use 2 fastq files (i.e. 1 sample) in order to speed up the computation in this RMarkdown document.
fl_nm <- meta_onerun$rawDataFileName[which(
  meta_onerun$rawDataFileName %in% c("BMI_Plate37WellB2_ITS_BNM6G_R1.fastq.gz",
                                     "BMI_Plate37WellB2_ITS_BNM6G_R2.fastq.gz")
)]
fl_nm
Note for advanced users: Behind the scenes in the RMarkdown file, we are moving fastq files from an external data subdirectory into the raw sequences directory. This is necessary because when the vignettes are built for package installation, the fastq files are not yet in their expected locations. View the .Rmd file if you want to see this code; otherwise, feel free to ignore this.
ext_files <- file.path("ITS", c("BMI_Plate37WellB2_ITS_BNM6G_R1.fastq.gz", "BMI_Plate37WellB2_ITS_BNM6G_R2.fastq.gz")) files <- system.file("extdata", ext_files, package="neonMicrobe") file.copy(files, file.path(NEONMICROBE_DIR_SEQUENCE(), ext_files), overwrite=FALSE)
The processing pipeline begins by "pre-filtering" reads, removing any that contain ambiguous base calls ("N"). This step is recommended for improving the identification of primer sequences in the next step.
Selective removal of reads is conducted through qualityFilterITS(), which filters ITS reads using the dada2 function filterAndTrim(). In qualityFilterITS(), the argument in_subdir is "raw" -- a reserved value that tells the function to look for the input files in a special subdirectory of NEONMICROBE_DIR_SEQUENCE(). Any other value passed to in_subdir would tell it to look for the input files in the corresponding subdirectory of NEONMICROBE_DIR_MIDPROCESS(). To override this behavior and specify an input directory explicitly, pass a directory path to in_explicitdir. The same rules apply to out_subdir (which can be overridden with out_explicitdir), except that it has no specially reserved values.
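To make the convention concrete, here is how the input location resolves under the assumed directory layout (the "ITS" subfolder matches the paths used elsewhere in this vignette; this is an illustration, not package documentation):

# Assumed resolution of in_subdir values (illustrative, not run):
# in_subdir = "raw"     -> file.path(NEONMICROBE_DIR_SEQUENCE(), "ITS")
# in_subdir = "1_filtN" -> file.path(NEONMICROBE_DIR_MIDPROCESS(), "ITS", "1_filtN")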
In addition, qualityFilterITS() returns an integer matrix denoting the number of reads remaining after pre-filtering for each input file. qualityFilterITS() is also used to filter reads based on other criteria after primer trimming; this usage is discussed later in this section. For now, we use it to pre-filter reads with the following code:
prefilter_trackReads <- qualityFilterITS(
  fl_nm, "raw", "1_filtN",
  meta = seqmeta_greatplains_its,
  maxN = 0,
  multithread = TRUE # set multithread = FALSE on Windows computers
)
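You can inspect the returned matrix to see how many reads each file retained after pre-filtering (the column layout follows dada2::filterAndTrim(), which reports reads in and reads out):

# Number of reads before and after pre-filtering, per input file
head(prefilter_trackReads)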
trimPrimersITS() trims primer sequences from ITS reads using Cutadapt. Cutadapt allows us to remove not only the primer in its original orientation, but also the opposite end's primer in its reverse-complement orientation, which can appear due to the variable length of the ITS sequencing region. The argument structure of trimPrimersITS() is similar to that of qualityFilterITS(), except that it lacks the optional arguments related to quality filtering parameters.
By default, trimPrimersITS() assumes NEON's standard use of ITS primers ITS1f and ITS2 as the forward and reverse primers, respectively. If this needs to be changed, new primers can be specified using the primer_ITS_fwd and primer_ITS_rev arguments.
Occasionally, sequencing runs will produce reads in mixed orientation, meaning that "R1" and "R2" may sometimes refer to reverse and forward reads, respectively (as opposed to forward and reverse reads). It can be prudent, then, to run trimPrimersITS() twice: once to remove the forward and reverse primers as expected, and a second time to remove the reverse primers from R1 reads and the forward primers from R2 reads.
First, install Cutadapt, and replace the following variable with the location of the downloaded cutadapt file. I installed Cutadapt through Anaconda, so it ended up here:
cutadapt <- "/raid/users/claraqin/anaconda3/envs/cutadaptenv/bin/cutadapt" # insert your path here
trimPrimersITS(
  fl_nm, in_subdir = "1_filtN", out_subdir = "2_trimmed_once",
  meta = seqmeta_greatplains_its, cutadapt_path = cutadapt
)
trimPrimersITS(
  fl_nm, in_subdir = "2_trimmed_once", out_subdir = "2_trimmed",
  meta = seqmeta_greatplains_its, cutadapt_path = cutadapt,
  primer_ITS_fwd = "GCTGCGTTCTTCATCGATGC", # swapping the primers
  primer_ITS_rev = "CTTGGTCATTTAGAGGAAGTAA"
)
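As an optional sanity check, adapted from the DADA2 ITS Pipeline Tutorial on which this vignette is based, you can count how many reads still contain a primer sequence after trimming; zero hits indicate successful trimming. The trimmed-file location below is an assumption based on the directory conventions described earlier:

# Count reads in a fastq file that still contain the given primer.
# fixed = FALSE lets IUPAC ambiguity codes in the primer match.
primerHits <- function(primer, fn) {
  nhits <- Biostrings::vcountPattern(primer, ShortRead::sread(ShortRead::readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}
# Assumed location of the trimmed files, following the out_subdir convention:
trimmed_file <- file.path(NEONMICROBE_DIR_MIDPROCESS(), "ITS", "2_trimmed", fl_nm[1])
primerHits("CTTGGTCATTTAGAGGAAGTAA", trimmed_file) # forward primer (ITS1f)
primerHits("GCTGCGTTCTTCATCGATGC", trimmed_file)   # reverse primer (ITS2)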
Here, we use qualityFilterITS() again, this time to filter reads by their quality scores. You can learn more about these quality filtering parameters at ?dada2::filterAndTrim. Here are the definitions of some of the most commonly used parameters:
- truncLen: Default 0 (no truncation). Truncate reads after truncLen bases; reads shorter than this are discarded. NOTE: Using truncLen with ITS reads is discouraged because of natural sequence length variation in ITS.
- minLen: Default 20. Remove reads with length less than minLen. minLen is enforced after trimming and truncation.
- maxEE: Default Inf (no EE filtering). After truncation, reads with more than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: $EE = \sum_{l=1}^{L} 10^{-Q_l/10}$, where $L$ is the length of the read and $Q_l$ is the quality score of the base at position $l$.
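As a quick illustration of this formula, here is the expected-error calculation for a single hypothetical read (the quality scores are made up):

# Expected errors for one read, from its per-base Phred quality scores:
# EE = sum over positions of 10^(-Q/10)
q <- c(38, 38, 36, 32, 30, 27, 24, 20, 17, 14) # hypothetical quality scores
sum(10^(-q / 10)) # a read whose EE exceeds maxEE would be discarded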
How do you decide what quality filtering parameters to use? One way is to use the dada2 function plotQualityProfile() to visually inspect the quality profiles of a collection of reads. neonMicrobe also offers an extension of plotQualityProfile() called plotEEProfile(), which translates the quality scores of reads into cumulative expected errors. This is particularly useful for understanding how truncLen and maxEE interact to affect the read retention rate.
R1 reads:
r1_files <- file.path(NEONMICROBE_DIR_SEQUENCE(), "ITS", grep("_R1", fl_nm, value = TRUE))
plotEEProfile(r1_files, aggregate = TRUE)
R2 reads:
r2_files <- file.path(NEONMICROBE_DIR_SEQUENCE(), "ITS", grep("_R2", fl_nm, value = TRUE))
plotEEProfile(r2_files, aggregate = TRUE)
The reads in these samples are of reasonably high quality, but as is typical with the Illumina MiSeq platform, the reverse reads are lower in quality than the forward reads. We can set different parameters for the forward and reverse reads by entering a vector of two numbers into each of the quality filtering arguments we wish to specify: c([forward-read value], [reverse-read value]).
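To illustrate (not run here, since neonMicrobe processes only the forward ITS reads), this is how such paired values look when passed directly to dada2::filterAndTrim(); the file names and thresholds are hypothetical:

# Hypothetical illustration of the c(forward, reverse) convention:
# a looser maxEE for the lower-quality reverse reads
filterAndTrim(
  fwd = "R1.fastq.gz", filt = "filt_R1.fastq.gz",
  rev = "R2.fastq.gz", filt.rev = "filt_R2.fastq.gz",
  minLen = 50, maxEE = c(2, 8), multithread = TRUE
)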
In typical practice, only the forward ITS reads from Illumina MiSeq are used for ecological inference (Pauvert et al. 2019). Currently, neonMicrobe only supports processing of the forward ITS reads. Any filenames corresponding to reverse reads in the sequence metadata will be safely ignored by the functions showcased in this vignette.
Here, we require that ITS reads have a minimum length of 50 bp and a maximum of 2 expected errors.
filter_trackReads <- qualityFilterITS(
  fl_nm, in_subdir = "2_trimmed", out_subdir = "3_filtered",
  meta = seqmeta_greatplains_its,
  minLen = 50, maxEE = 2
)
runDadaITS() "denoises" ITS reads to generate an amplicon sequence variant (ASV) table using the dada algorithm. It first calculates the expected base-calling error rate of the reads, and then uses the dada divisive partitioning algorithm to iteratively separate reads into groups distinguished by their likelihood of referring to the same true sequence, until a threshold is reached. Once finished, the partitions are considered amplicon sequence variants (ASVs) and are used to populate a sequence abundance table. (Read the DADA2 methods paper, Callahan et al. (2016), for more details.)

runDadaITS() is actually a wrapper for three different dada2 functions that carry out the process just outlined: learnErrors(), derepFastq(), and dada(). Since so many functions are wrapped by runDadaITS(), it may be worth familiarizing yourself with its source code, as well as the documentation for the wrapped functions, to ensure it is behaving as expected.
The arguments to runDadaITS() follow the same conventions as trimPrimersITS() and qualityFilterITS() for the input variables, but runDadaITS() requires output file names instead of directory names. out_seqtab is the file path ending in ".Rds" where the ASV table will be written, and out_track is the file path ending in ".csv" where the read-tracking table will be written. By default, these refer to special subdirectories in NEONMICROBE_DIR_MIDPROCESS() and NEONMICROBE_DIR_TRACKREADS(), respectively, and the file-writing behavior can be suppressed by setting the corresponding argument to FALSE. The ASV table and read-tracking table are also returned by runDadaITS(), but since runDadaITS() takes so long to run, forgetting to assign its output to a variable could result in a lot of lost time; hence the output file name arguments.
dada_out <- runDadaITS(
  fl_nm, in_subdir = "3_filtered",
  meta = seqmeta_greatplains_its,
  verbose = FALSE, multithread = TRUE
)
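If you prefer to control exactly where the outputs are written, out_seqtab and out_track accept explicit file paths, as described above. A minimal sketch (the file names here are illustrative, not package defaults):

# Illustrative only: write the ASV table and read-tracking table to
# explicitly named files
dada_out <- runDadaITS(
  fl_nm, in_subdir = "3_filtered", meta = seqmeta_greatplains_its,
  out_seqtab = file.path(NEONMICROBE_DIR_MIDPROCESS(), "ITS", "4_seqtabs", "seqtab_run1.Rds"),
  out_track = file.path(NEONMICROBE_DIR_TRACKREADS(), "ITS", "track_run1.csv"),
  multithread = TRUE
)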
And there you have it: You've created an ASV table from the NEON ITS sequences!
It can be useful to track the loss of reads at each step in the processing pipeline, as it may allow you to pinpoint where something went wrong. Each of the processing functions shown above (trimPrimersITS(), qualityFilterITS(), and runDadaITS()) returns a read-tracking table. The tables can be combined using the following function.
track <- combineReadTrackingTablesITS(prefilter_trackReads, filter_trackReads, dada_out$track)
track
write.csv(
  track,
  file.path(NEONMICROBE_DIR_TRACKREADS(), "ITS",
            paste0("track_its_", sub(" ", "_", gsub(":", "", Sys.time())), ".csv"))
)
At this point, it is safe to clear some variables from the R environment, freeing up memory. This is especially important when processing multiple sequencing runs one after another.
rm(prefilter_trackReads)
rm(filter_trackReads)
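As promised earlier, here is a minimal sketch (not run) of how the steps in this section might be enclosed in a for loop over all sequencing runs in the metadata. It assumes the same subdirectory names and parameters used above:

# Minimal sketch: process each sequencing run separately
for (run in unique(meta$sequencerRunID)) {
  meta_run <- meta[which(meta$sequencerRunID == run), ]
  fl_run <- meta_run$rawDataFileName
  qualityFilterITS(fl_run, "raw", "1_filtN",
                   meta = seqmeta_greatplains_its, maxN = 0)
  trimPrimersITS(fl_run, in_subdir = "1_filtN", out_subdir = "2_trimmed",
                 meta = seqmeta_greatplains_its, cutadapt_path = cutadapt)
  qualityFilterITS(fl_run, in_subdir = "2_trimmed", out_subdir = "3_filtered",
                   meta = seqmeta_greatplains_its, minLen = 50, maxEE = 2)
  runDadaITS(fl_run, in_subdir = "3_filtered", meta = seqmeta_greatplains_its)
  # By default, each run's ASV table is written under NEONMICROBE_DIR_MIDPROCESS(),
  # so outputs need not be held in memory between runs
}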
If you've processed multiple sequencing runs separately (as you should!), then you will have created multiple ASV tables. You can merge ASV tables using the dada2 function mergeSequenceTables(). The example below shows how you might use mergeSequenceTables() if you processed multiple sequencing runs. However, if you are following this tutorial, which processes samples from only one sequencing run, there is no need to merge.
# If multiple sequencing runs were processed:
seqtab_filenames <- list.files(file.path(NEONMICROBE_DIR_MIDPROCESS(), "ITS", "4_seqtabs"),
                               pattern = ".Rds", full.names = TRUE)
seqtab_joined <- mergeSequenceTables(tables = seqtab_filenames)
# Also accepts other input formats, e.g. tables provided as in-memory objects
# If following this tutorial, which processes only one sequencing run:
seqtab_joined <- dada_out$seqtab
In this final section of the vignette, we assign taxonomy to each of the identified sequences. This is a computationally expensive step: it can take hours, or even days, to run. Because of this, we're going to reduce the size of the dataset by removing samples with fewer than 1000 reads, as well as any taxa with fewer than 10 counts in the whole dataset. This is optional, and the cutoffs here can be modified.
# Remove low-abundance taxa and low-quality samples
keep_taxa <- which(colSums(seqtab_joined) > 10)
keep_samples <- which(rowSums(seqtab_joined) > 1000)
seqtab_joined <- seqtab_joined[, keep_taxa]
seqtab_joined <- seqtab_joined[keep_samples, ]

# Check size of dataset
print(dim(seqtab_joined))
We will be assigning taxonomy using the UNITE reference database, which is the standard reference for fungal ITS sequences. Download the UNITE reference database if necessary.
unite.url <- "https://files.plutof.ut.ee/public/orig/E7/28/E728E2CAB797C90A01CD271118F574B8B7D0DAEAB7E81193EB89A2AC769A0896.gz"
download.file(unite.url, file.path(NEONMICROBE_DIR_TAXREF(), basename(unite.url)))
Assign taxonomy using the UNITE reference database:
tax <- assignTaxonomy(
  seqtab_joined,
  file.path(NEONMICROBE_DIR_TAXREF(), basename(unite.url)),
  verbose = TRUE, multithread = TRUE
)
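Since the row names of the taxonomy table are the full ASV sequences, it can be hard to read when printed directly. A small trick from the DADA2 tutorial for inspecting the first few assignments:

# Display the taxonomy table without its unwieldy sequence row names
tax_print <- tax
rownames(tax_print) <- NULL
head(tax_print)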
Save the taxonomy table to file:
saveRDS(
  tax,
  file.path(NEONMICROBE_DIR_MIDPROCESS(), "ITS", "5_tax", "NEON_ITS_tax_grasslands.Rds")
)
And you're done! To learn how to link together the ASV table and the taxonomy table, as well as some abiotic soil data, view the vignette "Add Environmental Variables to 16S Data". At present, there is no analogous vignette for the ITS data, but the 16S vignette lays out some guidelines that can be easily modified to work for the ITS data as well.