In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(neonMicrobe)
# devtools::load_all() # TODO: Switch this to library(neonMicrobe) before publishing
# setBaseDirectory(dirname(getwd()))
knitr::opts_knit$set(
  root.dir = NEONMICROBE_DIR_BASE()
)

makeDataDirectories(check_location = FALSE)

This vignette demonstrates how to use the functions and parameters in this package to process the raw NEON 16S sequence data into ASV tables using the DADA2 workflow. It is based on the DADA2 Pipeline Tutorial (1.16).

It is assumed that you have already completed the steps in the "Download NEON Data" vignette, so that there are raw sequence files in the appropriate data directory.

Note that some of the code chunks in this .Rmd file are not actually run, because some DADA2 functions have relatively long runtimes. Therefore, outputs like the taxonomy table are not displayed.

Load libraries

library(neonMicrobe)

library(ShortRead)
library(Biostrings)
library(dada2)
library(dplyr)
library(ggplot2)

Set base directory

Setting the base directory using setBaseDirectory() will create a point of reference for the neonMicrobe package, so that it knows where to look for the raw sequence files, and where to save the processed data. This should be the same base directory that you used in the "Download NEON Data" vignette. If you're continuing to use the same R session, then you don't have to run this again.

print(getwd())
setBaseDirectory()

Select metadata

As explained in the "Download NEON Data" vignette, neonMicrobe is a metadata-first processing pipeline. So we begin by retrieving sequence metadata. Any fastq files that are represented in the metadata will be processed by the functions in the next section, and any fastq files that are not represented in the metadata will be ignored. Therefore, it is recommended that you use the same metadata you used to download the files in the first place, lest you end up with fewer or more processed samples than intended.

data("seqmeta_greatplains_16s") # This is external data that comes with the package, much like "data(mtcars)"
meta <- qcMetadata(seqmeta_greatplains_16s, pairedReads = "Y", rmFlagged = "Y")
meta$date_ymd <- as.Date(format(as.Date(meta$collectDate, format="%Y-%m-%d %H:%M:%S"), "%Y-%m-%d"))

Process reads into ASV tables

It is generally better to use DADA2 to process one sequencing run at a time, because different sequencing runs may have different error rates, and the dada denoising algorithm assumes homogeneous error rates across input samples. Furthermore, it is simpler to show, for demonstration purposes, how you would process just one sequencing run. In practice, the part of this section titled "Denoise reads using the dada algorithm" may be enclosed within a for loop that cycles through all sequencing runs in the metadata.

unique_runs <- unique(meta$sequencerRunID)
meta_onerun <- meta[which(meta$sequencerRunID==unique_runs[1]),]
nrow(meta_onerun)

Although you can specify filenames to process, here we'll just use the filenames included in this subset of the metadata. (Don't worry; any files that don't exist in the file system will be safely ignored.) Note that these filenames are just basenames, i.e. they contain no directory path information. This is fine, as we will see in the subsections below. In addition, we will only use 2 fastq files (i.e. 1 sample) in order to speed up the computation in this RMarkdown document.

fl_nm <- meta_onerun$rawDataFileName[which(
  meta_onerun$rawDataFileName %in% 
    c("BMI_Plate37WellA12_16S_BJ8RK_R1.fastq.gz",
      "BMI_Plate37WellA12_16S_BJ8RK_R2.fastq.gz")
  )]
fl_nm

Note for advanced users: Behind the scenes in the RMarkdown file, we are moving fastq files from the an external data subdirectory into the raw sequences directory. This is necessary because when the vignettes are being built for package installation, the fastq files are not already in their expected locations. View the .Rmd file if you want to see this code; otherwise, feel free to ignore this.

ext_files <- file.path("16S",c("BMI_Plate37WellA12_16S_BJ8RK_R1.fastq.gz",
                               "BMI_Plate37WellA12_16S_BJ8RK_R2.fastq.gz"))
files <- system.file("extdata", ext_files, package="neonMicrobe")

file.copy(files, file.path(NEONMICROBE_DIR_SEQUENCE(), ext_files), overwrite=FALSE)

Trim primers from the 16S sequences

trimPrimers16S() trims primer sequences from 16S reads by using the dada2 function filterAndTrim(). trimPrimers16S() assumes that each read begins with its full primer sequence and operates by truncating the beginning of each read by the length of its primer sequence. Should this aspect of the NEON 16S sequence data change, this step will need to be modified.

In trimPrimers16S(), the argument in_subdir is "raw" -- a reserved value that tells the function to look for the input files in a special subdirectory of NEONMICROBE_DIR_SEQUENCE(). Any other value passed to in_subdir would tell it to look for the input files in the corresponding subdirectory of NEONMICROBE_DIR_MIDPROCESS(). To override this behavior and specify an input directory explicitly, pass a directory path to in_explicitdir. The same rules apply for out_subdir (which can be overridden with out_explicitdir), except that it has no specially reserved values.

By default, trimPrimers16S() assumes NEON's standard use of 16S primers Pro341F and Pro805R to be the forward and reverse primers, respectively. If this needs to be changed, new primers can be specified using the primer_16s_fwd and primer_16s_rev arguments.

trimPrimers16S() additionally returns an integer matrix denoting the number of reads remaining after primer-trimming for each input file.

trim_trackReads <- trimPrimers16S(
  fl_nm, in_subdir = "raw", out_subdir = "1_trimmed", meta = seqmeta_greatplains_16s, 
  multithread = TRUE # set multithread = FALSE on Windows computers though
)

Filter the 16S sequences

qualityFilter16S() filters 16S reads, also by using filterAndTrim(). The argument structure of qualityFilter16S() is similar to that of trimPrimers16S(), except for its optional arguments related to quality filtering parameters, which are passed to filterAndTrim(). You can learn more about these quality filtering parameters at ?dada2::filterAndTrim. Here are the definitions of some of the most commonly used parameters:

truncLen: Default 0 (no truncation). Truncate reads after truncLen bases. Reads shorter than this are discarded.
minLen: Default 20. Remove reads with length less than minLen. minLen is enforced after trimming and truncation.
maxEE: Default Inf (no EE filtering). After truncation, reads with higher than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: $EE = \sum_{l=1}^{L} 10^{-Q_l/10}$ , where $L$ is the length of the read.

How do you decide what quality filtering parameters to use? One way is to use the dada2 function plotQualityProfile() to visually inspect the quality profiles of a collection of reads. neonMicrobe also offers an extension of plotQualityProfile() called plotEEProfile(), which translates quality scores of reads into cumulative expected errors. This is particularly useful for understanding how truncLen and maxEE interact to affect the read retention rate.

R1 reads:

r1_files <- file.path(NEONMICROBE_DIR_SEQUENCE(), "16S", grep("_R1", fl_nm, value=TRUE))
plotEEProfile(r1_files, aggregate = TRUE)

R2 reads:

r2_files <- file.path(NEONMICROBE_DIR_SEQUENCE(), "16S", grep("_R2", fl_nm, value=TRUE))
plotEEProfile(r2_files, aggregate = TRUE)

The reads in these samples are of reasonably high quality, but as is typical with the Illumina MiSeq platform, the reverse reads are lower in quality than the forward reads. We can set different parameters for the forward and reverse reads by entering a vector of two numbers into each of the quality filtering arguments we wish to specify, c([forward-read value], [reverse-read value]).

filter_trackReads <- qualityFilter16S(
  fl_nm, in_subdir = "1_trimmed", out_subdir = "2_filtered",
  meta = seqmeta_greatplains_16s, truncLen = c(250, 230), maxEE = c(2, 8),
  multithread = TRUE
)

Denoise reads using the dada algorithm

runDada16S() picks or "denoises" 16S reads to generate an amplicon sequence variant (ASV) table by using the 'dada' algorithm. It first calculates the expected base-calling error rate of the reads, and then uses the 'dada' divisive partitioning algorithm to iteratively separate reads into groups distinguished by their likelihood of referring to the same true sequence, until a threshold is reached. Once finished, the partitions are considered amplicon sequence variants (ASVs), and are used to populate a sequence abundance table. (Read the DADA2 methods paper, Callahan et al. (2016), for more details.)

runDada16s() is actually a wrapper for four different dada2 functions that carry out the process just outlined: learnErrors(), derepFastq(), dada(), and mergePairs(). Since so many functions are wrapped by runDada16S(), it may be worth familiarizing yourself with the source code for runDada16(), as well as the documentation for the wrapped functions, to ensure it is behaving as expected.

The arguments to runDada16S() follow the same conventions as trimPrimers16S() and qualityFilter16S() for the input variables, but runDada16S() requires output file names instead of directory names. out_seqtab is the file path ending in ".Rds" where the ASV table will be written, and out_track is the file path ending in ".csv" where the read tracking table will be written. By default, these refer to special subdirectories in NEONMICROBE_DIR_MIDPROCESS() and NEONMICROBE_DIR_TRACKREADS(), respectively, and the file-writing behavior can be suppressed by setting the corresponding argument to FALSE. These ASV table and read-tracking table will also be returned by this runDada16S(), but since runDada16S() takes so long to run, forgetting to assign its output to a variable could result in a lot of lost time... hence, these output file name arguments.

dada_out <- runDada16S(
  fl_nm, in_subdir = "2_filtered", meta = seqmeta_greatplains_16s, 
  verbose = FALSE, multithread = TRUE
)

And there you have it: You've created an ASV table from the NEON 16S sequences!

Combine the read-tracking tables

It can be useful to track the loss of reads at each step in the processing pipeline, as it may allow you to pinpoint where something went wrong. Each of the processing functions shown above (trimPrimers16S(), qualityFilter16S(), and runDada16S()) returns a read-tracking table. The tables can be combined using the following function.

track <- combineReadTrackingTables16S(trim_trackReads, 
                                      filter_trackReads, 
                                      dada_out$track)
track

write.csv(
  track, 
  file.path(NEONMICROBE_DIR_TRACKREADS(), "16S",
            paste0("track_16s_", sub(" ", "_", gsub(":", "", Sys.time())), ".csv"))
)

At this point, it is safe to clear some variables from the R environment, freeing up memory. This is especially important when processing multiple sequencing runs one after another.

rm(trim_trackReads)
rm(filter_trackReads)

Merging ASV tables from across sequencing runs

If you've processed multiple sequencing runs separately (as you should!), then you will have created multiple ASV tables. You can merge ASV tables together using the dada2 function mergeSequenceTables(). The example below shows you how you might use mergeSequenceTables() if you processed multiple sequencing runs. However, if you are following this tutorial, which only processes samples from one sequencing run, then there is no need to mergeSequenceTables().

# If multiple sequencing runs were processed:
seqtab_filenames = list.files(file.path(NEONMICROBE_DIR_MIDPROCESS(), "16S", "3_seqtabs"), pattern = ".Rds", full.names = TRUE)
seqtab_joined <- mergeSequenceTables(tables = seqtab_filenames) # Also accepts other input formats, e.g. tables provided as in-memory objects

# If following this tutorial, which processes only one sequencing run:
seqtab_joined <- dada_out$seqtab

While mergeSequenceTables() checks for duplicate samples, it performs no ASV clustering. Clustering may be prudent when joining sequences across sequencing runs, because the 'dada' algorithm could produce an ASV in one run that is identical to an ASV in another run except that one of the ASVs is one or two base pairs longer. (Anecdotally, this can result in the detection of a mean effect on community composition where it does not exist!) ASVs that differ only in length can be combined using any "100% clustering" algorithm, and dada2 provides one through collapseNoMismatch(). Be warned, though, that at the time of writing (dada2 version 1.12.1), collapseNoMismatch() is not parallelizable, and does not scale well with an increasing number of ASVs.

seqtab_joined_collapsed <- collapseNoMismatch(seqtab_joined)
saveRDS(seqtab_joined_collapsed, 
        file.path(NEONMICROBE_DIR_MIDPROCESS(), "16S", "4_collapsed",
                  "NEON_16S_seqtab_nochim_grasslands_COLLAPSED.Rds"))
seqtab_joined_collapsed <- readRDS(file.path(NEONMICROBE_DIR_MIDPROCESS(), "16S", "4_collapsed", "NEON_16S_seqtab_nochim_grasslands_COLLAPSED.Rds"))

Assigning taxonomy

In this final section of the vignette, we assign taxonomy to each of the identified sequences. This is a computationally expensive step - it can take hours, or even days, to run. Because of this, we're going to reduce the size of the dataset, by removing samples with fewer than 1000 reads, as well as any taxa with fewer than 10 counts in the whole dataset. This is optional, and the cutoffs here can be modified.

# Remove low-abundance taxa and low-quality samples
keep_taxa <- which(colSums(seqtab_joined_collapsed) > 10)
keep_samples <- which(rowSums(seqtab_joined_collapsed) > 1000)
seqtab_joined_collapsed <- seqtab_joined_collapsed[,keep_taxa,drop=FALSE]
seqtab_joined_collapsed <- seqtab_joined_collapsed[keep_samples,,drop=FALSE]

# Check size of dataset
print(dim(seqtab_joined_collapsed))

We will be assigning taxonomy using the SILVA reference database. Download the SILVA reference database if necessary.

silva.url <- "https://zenodo.org/record/3986799/files/silva_nr99_v138_train_set.fa.gz"
download.file(silva.url, file.path(NEONMICROBE_DIR_TAXREF(), basename(silva.url)))

Assign taxonomy using the SILVA reference database:

tax <- assignTaxonomy(
  seqtab_joined_collapsed, file.path(NEONMICROBE_DIR_TAXREF(), basename(silva.url)), 
  verbose = TRUE, multithread = TRUE
)

Save the taxonomy table to file:

saveRDS(
  tax, 
  file.path(NEONMICROBE_DIR_MIDPROCESS(), "16S", "5_tax", 
            "NEON_16S_tax_grasslands_COLLAPSED.Rds")
)

And you're done! To learn how to link together the ASV table and the taxonomy table, as well as some abiotic soil data, view the next vignette in this series: "Add Environmental Variables to 16S Data".

claraqin/neonMicrobe documentation built on April 11, 2024, 11:47 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com