In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
devtools::load_all() # TODO: Switch this to library(neonMicrobe) before publishing
setBaseDirectory(dirname(getwd()))
knitr::opts_knit$set(
  root.dir = NEONMICROBE_DIR_BASE()
)

This vignette demonstrates how to use the functions and parameters in this package to download the specific subset of NEON soil microbe marker gene sequence data, and the associated data, that is relevant to your analysis.

Load libraries

library(neonMicrobe)

library(plyr)
library(dplyr)
library(neonUtilities)

Set up directories

First, set your working directory to be the location where you want neonMicrobe's directory structure to take root. Then, run setBaseDirectory().

dirname(getwd())

setBaseDirectory(getwd())

Set up the directory structure associated with the various NEON data products. This will recursively generate the following directory structure:

[base directory]
├── data
│   ├── raw_sequence
│   ├── sequence_metadata
│   ├── soil
│   └── tax_ref
├── outputs
│   ├── mid_process
│   └── track_reads
└── batch_outputs

Each of these directories may contain yet more subdirectories, but let's not worry about them yet. We'll explore them more in future vignettes.

makeDataDirectories()
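
To make the layout above concrete, here is a minimal base-R sketch that builds the same top-level tree inside a temporary directory and checks that every folder exists. It is an illustration only, not how makeDataDirectories() is implemented; the folder name "neonMicrobe_demo" and the variables demo_base and demo_subdirs are hypothetical.

```r
# Illustrative sketch only: recreate the top-level layout shown above
# in a throwaway location, using base R. Names here are hypothetical.
demo_base <- file.path(tempdir(), "neonMicrobe_demo")

demo_subdirs <- c(
  file.path("data", "raw_sequence"),
  file.path("data", "sequence_metadata"),
  file.path("data", "soil"),
  file.path("data", "tax_ref"),
  file.path("outputs", "mid_process"),
  file.path("outputs", "track_reads"),
  "batch_outputs"
)

# recursive = TRUE creates intermediate folders such as "data" and "outputs"
for (d in demo_subdirs) {
  dir.create(file.path(demo_base, d), recursive = TRUE, showWarnings = FALSE)
}

# TRUE when every expected folder is present
all_present <- all(dir.exists(file.path(demo_base, demo_subdirs)))
all_present
```

The same dir.exists() check can be pointed at your real base directory after running makeDataDirectories() to confirm that the structure was created where you expected.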

Download data

Download metadata

neonMicrobe is a metadata-first processing pipeline. What this means is that:

- Before downloading any raw sequence data, you must first download its metadata.
- You can subset the raw sequence data before you even download it, by subsetting the sequence metadata.
- Many of the functions in neonMicrobe (especially the DADA2 wrappers and the batch constructor) take in metadata as a primary argument.

To download the metadata associated with the collection and analysis of NEON soil microbial marker gene sequences, use the downloadSequenceMetadata() function. The downloadSequenceMetadata() function downloads NEON data product "Soil microbe marker gene sequences" (NEON.DP1.10108.001) using the neonUtilities package. This data product contains data tables related to the processing and generation of raw sequence data. The function returns a data.frame object with the data tables for the marker gene sequencing data product joined together:

- mmg_soilRawDataFiles
- mmg_soilDnaExtraction
- mmg_soilMarkerGeneSequencing
- mmg_soilPcrAmplification

When 'all' is passed to the targetGene argument, both the 16S and ITS metadata are downloaded in the same R object/file. For downstream analysis, limit the metadata to just the 16S or ITS records by filtering to either '16S rRNA' or 'ITS' in the data field 'targetGene'.
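
As a sketch of that filtering step, the example below splits a combined metadata table by the 'targetGene' field using dplyr. The toy data frame stands in for the output of downloadSequenceMetadata(targetGene = "all"); its contents and the dnaSampleID values are made up for illustration, and only the 'targetGene' field and its two values come from the text above.

```r
library(dplyr)

# Toy stand-in for combined 16S + ITS metadata (values are invented)
toy_meta_all <- data.frame(
  dnaSampleID = c("sampleA", "sampleB", "sampleC", "sampleD"),
  targetGene  = c("16S rRNA", "ITS", "16S rRNA", "ITS"),
  stringsAsFactors = FALSE
)

# dplyr::filter is called explicitly to avoid masking by other packages
toy_meta_16s <- dplyr::filter(toy_meta_all, targetGene == "16S rRNA")
toy_meta_its <- dplyr::filter(toy_meta_all, targetGene == "ITS")

nrow(toy_meta_16s)
nrow(toy_meta_its)
```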

If time and space limitations are not issues for you, you can download the entire sequence dataset over the course of a few hours (depending on your download speed). Alternatively, downloadSequenceMetadata() accepts arguments that can be used to specify the subset of the data you are interested in for your analysis.

In the following example, we run downloadSequenceMetadata() with a number of arguments to narrow the range of data to be downloaded. Note that, by default, a copy of the output metadata file is saved to the raw metadata directory /data/sequence_metadata/raw_metadata/. This location can be changed using the outDir argument.

meta_16s <- downloadSequenceMetadata(startYrMo = "2017-07", endYrMo = "2017-07", 
                                     sites = c("CPER", "KONZ", "NOGP"), targetGene = "16S")
meta_its <- downloadSequenceMetadata(startYrMo = "2017-07", endYrMo = "2017-07", 
                                     sites = c("CPER", "KONZ", "NOGP"), targetGene = "ITS")

Quality control sequence metadata

The following function performs basic QAQC checks on sequence metadata prior to downloading sequence data. This will reduce the number of sequence files that are downloaded to only those that will be used for analysis, thereby saving file space and reducing download times.

Specifically, this function will remove duplicates, quality-flagged samples, and (optionally) any R1 fastq files without corresponding R2 files.

meta_16s_qc <- qcMetadata(meta_16s, pairedReads = "Y", rmFlagged = "Y")

meta_its_qc <- qcMetadata(meta_its, pairedReads = "N", rmFlagged = "Y")
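
To illustrate the kinds of checks described above, here is a rough dplyr sketch on a toy table: it drops duplicate rows, drops flagged records, and keeps only samples that have both an R1 and an R2 file. This is not the implementation of qcMetadata(); the column names sampleID, direction, and flagged are hypothetical and do not correspond to actual NEON field names.

```r
library(dplyr)

# Toy metadata with hypothetical columns (not real NEON field names)
toy_meta <- data.frame(
  sampleID  = c("s1", "s1", "s1", "s2", "s3", "s3"),
  direction = c("R1", "R2", "R2", "R1", "R1", "R2"),
  flagged   = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

toy_qc <- toy_meta %>%
  dplyr::distinct() %>%                 # remove exact duplicate rows
  dplyr::filter(!flagged) %>%           # remove quality-flagged records
  dplyr::group_by(sampleID) %>%
  dplyr::filter(all(c("R1", "R2") %in% direction)) %>%  # keep complete pairs only
  dplyr::ungroup()

toy_qc
```

In this toy table, only sample s1 survives: s2 has no R2 file and s3 is flagged.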

Download raw sequence data

Now that we have the quality-controlled metadata tables loaded into memory, we pass them to downloadRawSequenceData() to download the raw sequence files they list. (Note that these chunks are not actually run in the .Rmd file, due to their relatively long runtime.)

downloadRawSequenceData(meta_16s_qc)

downloadRawSequenceData(meta_its_qc)

And with that, you've downloaded some fastq files! To learn how to process these fastq files into ASV tables using the DADA2 pipeline, while taking advantage of neonMicrobe's organizational structure, see the vignette "Process 16S Sequences" or the vignette "Process ITS Sequences".

claraqin/neonMicrobe documentation built on April 11, 2024, 11:47 a.m.
