knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) devtools::load_all() # TODO: Switch this to library(neonMicrobe) before publishing setBaseDirectory(dirname(getwd())) knitr::opts_knit$set( root.dir = NEONMICROBE_DIR_BASE() )
This vignette demonstrates how to use the functions and parameters in this package to download the specific scope NEON soil microbe marker gene sequence data and associated data relevant to your analysis.
library(neonMicrobe)
library(plyr) library(dplyr) library(neonUtilities)
First, set your working directory to be the location where you want neonMicrobe
's directory structure to take root. Then, run setBaseDirectory()
.
dirname(getwd())
setBaseDirectory(getwd())
Set up the directory structure associated with the various NEON data products. This will generate (recursively) the following directory structure
[base directory] ├── data │ ├── raw_sequence │ ├── sequence_metadata │ ├── soil │ └── tax_ref ├── outputs │ ├── mid_process │ └── track_reads └── batch_outputs
Each of these directories may contain yet more subdirectories, but let's not worry about them yet. We'll explore them more in future vignettes.
makeDataDirectories()
neonMicrobe
is a metadata-first processing pipeline. What this means is that:
neonMicrobe
(especially the DADA2 wrappers and the batch constructor) taken in metadata as a primary argument.To download the metadata associated with the collection and analysis of NEON soil microbial marker gene sequences, use the downloadSequenceMetadata()
function. The downloadSequenceMetadata()
function downloads NEON data product "Soil microbe marker gene sequences" (NEON.DP1.10108.001) using the neonUtilities
package. This data product contains data tables related to the processing and generation of raw sequence data. The function returns a data.frame object with the data tables for the marker gene sequencing data product joined together:
- mmg_soilRawDataFiles - mmg_soilDnaExtraction - mmg_soilMarkerGeneSequencing - mmg_soilPcrAmplification
When 'all' is passed to the targetGene
argument, both the 16S and ITS metadata are downloaded in the same R object/file. For downstream analysis, limit the metadata to just the 16S or ITS records by filtering to either '16S rRNA' or 'ITS' in the data field 'targetGene'.
If time and space limitations are not issues for you, you can download the entire sequence dataset over the course of a few hours (depending on your download speed). Alternatively, downloadSequenceMetadata()
accepts arguments that can be used to specify the subset of the data you are interested in for your analysis.
In the following example, we run downloadSequenceMetadata()
with a number of arguments to narrow the range of data to be downloaded. Note that a copy of the output metadata file by default is automatically saved to the raw metadata directory /data/sequence_metadata/raw_metadata/
, but this can be changed using the outDir
argument.
meta_16s <- downloadSequenceMetadata(startYrMo = "2017-07", endYrMo = "2017-07", sites = c("CPER", "KONZ", "NOGP"), targetGene = "16S") meta_its <- downloadSequenceMetadata(startYrMo = "2017-07", endYrMo = "2017-07", sites = c("CPER", "KONZ", "NOGP"), targetGene = "ITS")
The following function performs basic QAQC checks on sequence metadata prior to downloading sequence data. This will reduce the number of sequence files that are downloaded to only those that will be used for analysis, thereby saving file space and reducing download times.
Specifically, this function will remove duplicates, quality-flagged samples, and (optionally) any R1 fastq files without corresponding R2 files.
meta_16s_qc <- qcMetadata(meta_16s, pairedReads = "Y", rmFlagged = "Y")
meta_its_qc <- qcMetadata(meta_its, pairedReads = "N", rmFlagged = "Y")
Now that we have the metadata table loaded into memory, we retrieve a table of unique raw data files and their sequencing run IDs. (Note that these chunks are not actually run in the .Rmd file, due to their relatively long runtime.)
downloadRawSequenceData(meta_16s_qc)
downloadRawSequenceData(meta_its_qc)
And with that, you've downloaded some fastq files! To learn how to process these fastq files into ASV tables using the DADA2 pipeline, while taking advantage of neonMicrobe
's organizational structure, see the vignette "Process 16S Sequences" or the vignette "Process ITS Sequences".
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.