In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(neonMicrobe)
# devtools::load_all() # TODO: Switch this to library(neonMicrobe) before publishing
# setBaseDirectory(dirname(getwd()))
knitr::opts_knit$set(
  root.dir = NEONMICROBE_DIR_BASE()
)

makeDataDirectories(check_location = FALSE)

The goal of this vignette is to merge various NEON data products to link the outputs of the DADA2 pipeline (e.g. non-chimeric sequence table, taxonomic table) to environmental/sample data.

It is assumed that you have already completed the steps in the "Download NEON Data" vignette and "Process 16S Sequences" vignette, so that there is an ASV table in your file system.

Load libraries

library(neonMicrobe)

library(phyloseq)
library(dplyr)
library(ggplot2)

Set base directory

Setting the base directory using setBaseDirectory() will create a point of reference for the neonMicrobe package, so that it knows where to look for the raw sequence files, and where to save the processed data. This should be the same base directory that you used in the "Download NEON Data" vignette and "Process 16 Sequences" vignette. If you're continuing to use the same R session, then you don't have to run this again.

print(getwd())
setBaseDirectory()

Load outputs from DADA2

Load in the outputs from dada2. These would normally consist of the ASV table and the associated taxonomy table, but for the purposes of demonstration, we load only the ASV table in this vignette. In particular, we'll use the ASV table that came pre-loaded with neonMicrobe for use in examples.

data("seqtab_greatplains_16s")
seqtab_orig <- seqtab_greatplains_16s

For reference, this is what loading your own data into this workflow might look like, if you have previously saved them to an Rds file. Please note that if you actually run the following code chunk after having run the "process-16s-sequences" vignette, then this will yield a sequence table seqtab_orig containing only 1 sample. This is expected behavior. To include more samples or a different set of samples, use the code in the "process-16s-sequences" vignette as a template, modifying the sample selection under the "Process reads into ASV tables" section. Alternatively, simply use the preloaded seqtab_orig to follow along in the vignette for now.

seqtab_orig <- readRDS(file.path(NEONMICROBE_DIR_MIDPROCESS(), "16S", "4_collapsed",
                       "NEON_16S_seqtab_nochim_grasslands_COLLAPSED.Rds"))
taxa_orig <- readRDS(file.path(NEONMICROBE_DIR_MIDPROCESS(), "16S", "5_tax", 
                    "NEON_16S_tax_grasslands_COLLAPSED.Rds"))

Now, we add sequence metadata, and subset to some columns of interest. Again, we use data that was pre-loaded with neonMicrobe.

data("seqmeta_greatplains_16s")
meta <- seqmeta_greatplains_16s %>% distinct(dnaSampleID, .keep_all = T) %>% 
    dplyr::select(sequencerRunID, dnaSampleID, dataQF.rawFiles, internalLabID.seq,
                 siteID, collectDate, plotID, deprecatedVialID, geneticSampleID)
meta$date_ymd <- as.Date(format(as.Date(meta$collectDate, format="%Y-%m-%d %H:%M:%S"), "%Y-%m-%d"))

This section downloads data associated with soil samples, such as temperature, moisture, and pH.

soil_greatplains <- downloadSoilData(
  startYrMo = "2017-07", endYrMo = "2017-07", 
  sites = c("CPER", "KONZ", "NOGP")
) 
# This is the code that was used to create the pre-loaded data
# soil_greatplains.

data("soil_greatplains")

select_columns <- c("domainID", "siteID", "plotID", "plotType", "nlcdClass", "collectDate", 
                    "sampleTiming", "standingWaterDepth", "sampleID", 
                    "horizon", "soilTemp", "litterDepth", "sampleTopDepth", "sampleBottomDepth", 
                    "geneticSampleID", "soilInCaClpH")
soils <- soil_greatplains[, select_columns]
sampledata <- merge(meta, soils, all.x=T, by = "geneticSampleID")

# Add rownames to sample data table 
rownames(sampledata) <- sampledata$dnaSampleID

Large Phyloseq objects can use a lot of memory. Here, we remove samples and taxa that didn't sequence well, to reduce the size of the output object.

# Remove low-abundance taxa (optional)
keep <- which(colSums(seqtab_orig) > 20)
seqtab <- seqtab_orig[,keep]
print(dim(seqtab))

# Remove low-quality samples (optional)
keep <- which(rowSums(seqtab) > 5000)
seqtab <- seqtab[keep,]
print(dim(seqtab))

The phyloseq R package uses row names to match up sequence data and sample data. If for some reason they do not match up, you may have to do some additional wrangling here. (For example, if not all DNA samples had an associated dnaSampleID, thus forcing you to use the fastq file names as the row names in the sequence table, then you may have to use the metadata to create a different shared identifier.) Here, we simply subset to the row names that do match, and reorder them.

# Subset to samples present in both dataframes
common_samples <- intersect(rownames(sampledata), rownames(seqtab))
sampledata <- sampledata[common_samples,]
seqtab <- seqtab[common_samples,]

# Check rowname agreement for phyloseq
identical(rownames(seqtab), rownames(sampledata))

Now we put it all together into a phyloseq object!

# Combine into phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows=FALSE),
               sample_data(sampledata))

# To include taxonomic data, you would add a "tax_table" argument:
# ps <- phyloseq(otu_table(seqtab, taxa_are_rows=FALSE),
#                sample_data(sampledata),
#                tax_table(taxa_orig))


# store the DNA sequences of our ASVs in the refseq slot of the phyloseq object,
# and then rename our taxa to a short string
dna <- Biostrings::DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))

# Print phyloseq summary.
ps

Don't forget to save!

saveRDS(ps, file.path(NEONMICROBE_DIR_OUTPUTS(), "phyloseq", "16S", "NEON_16S_phyloseq_greatplains.Rds"))

If the saved object still takes up too much memory, consider removing or consolidating additional taxa, using agglomeration functions in dada2. The speedyseq package is also useful for manipulating large Phyloseq objects

# Create a smaller subset
# Remove taxa not seen more than 3 times in at least 10 of the samples. 
# This protects against an OTU with small mean & trivially large C.V. 
# (from phyloseq tutorial)
ps_subset = filter_taxa(ps, function(x) sum(x > 3) > 10, TRUE)
saveRDS(ps_subset, file.path(NEONMICROBE_DIR_OUTPUTS(), "phyloseq", "16S", "NEON_16S_phyloseq_greatplains_subset.Rds"))

claraqin/neonMicrobe documentation built on April 11, 2024, 11:47 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

claraqin/neonMicrobe
Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

Load libraries

Set base directory

Load outputs from DADA2

R Package Documentation

Browse R Packages

We want your feedback!

claraqin/neonMicrobe Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

Load libraries

Set base directory

Load outputs from DADA2

R Package Documentation

Browse R Packages

We want your feedback!

claraqin/neonMicrobe
Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data