knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)

Prerequisites

HMP2Data can be installed using BiocManager as follows.

# Check if BiocManager is installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Install HMP2Data package using BiocManager
BiocManager::install("HMP2Data")

Or the development version of the package can be installed from GitHub as shown below.

BiocManager::install("jstansfield0/HMP2Data")

The following packages will be used in this vignette to provide demonstrative examples of what a user might do with r BiocStyle::Biocpkg("HMP2Data").

library(HMP2Data)
library(phyloseq)
library(SummarizedExperiment)
library(MultiAssayExperiment)
library(dplyr)
library(ggplot2)
library(UpSetR)

Once installed, HMP2Data provides functions to access the HMP2 data.

MOMS-PI

The MOMS-PI data can be loaded as follows.

16S data

Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:

data("momspi16S_mtx")
momspi16S_mtx[1:5, 1:3]

Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:

data("momspi16S_tax")
colnames(momspi16S_tax)
momspi16S_tax[1:5, 1:3]
# Check if Greengene IDs match between the 16S and taxonomy data
# all.equal(rownames(momspi16S_mtx), rownames(momspi16S_tax)) # Should be TRUE

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("momspi16S_samp")
colnames(momspi16S_samp)
momspi16S_samp[1:5, 1:3]
# Check if sample names match between the 16S and sample data
# all.equal(colnames(momspi16S_mtx), rownames(momspi16S_samp)) # Should be TRUE

The momspi16S function assembles those matrices into a phyloseq object.

momspi16S_phyloseq <- momspi16S()
momspi16S_phyloseq

Here we have a phyloseq object containing the 16S rRNA seq data for r nrow(otu_table(momspi16S_phyloseq)) taxa and r ncol(otu_table(momspi16S_phyloseq)) samples.

Cytokine data

The MOMS-PI cytokine data can be loaded as a matrix, rownames are cytokine names, colnames are sample names:

data("momspiCyto_mtx")
momspiCyto_mtx[1:5, 1:5]

The cytokine data has r nrow(momspiCyto_mtx) rows and r ncol(momspiCyto_mtx) columns.

Load the cytokine sample annotation data as a matrix, rows are samples, columns are annotations:

data("momspiCyto_samp")
colnames(momspiCyto_samp)
momspiCyto_samp[1:5, 1:5]
# Check if sample names match between the 16S and sample data
# all.equal(colnames(momspiCyto_mtx), rownames(momspiCyto_samp)) # Should be TRUE

The function momspiCytokines will make a SummarizedExperiment containing cytokine data

momspiCyto <- momspiCytokines()
momspiCyto

The cytokine data contains data for r nrow(momspiCyto) cytokines over r ncol(momspiCyto) samples.

MultiAssayExperiment

We can construct a MultiAssayExperiment that will contain 16S and cytokine data for the common samples. Use the momspiMultiAssay function.

momspiMA <- momspiMultiAssay()
momspiMA

Subsetting the MultiAssayExperiment object

The 16S rRNA data can be extracted from the MultiAssayExperiment object as follows.

rRNA <- momspiMA[[1L]]

Or the cytokines data like so.

cyto <- momspiMA[[2L]]

The sample data can be accessed with the colData command.

colData(momspiMA) %>% head()

To extract only the samples with both 16S and cytokine data we can use the intersectColumns function.

completeMA <- intersectColumns(momspiMA)
completeMA

IBD

Load 16S data as a matrix, rows are SILVA IDs, columns are sample names:

data("IBD16S_mtx")
IBD16S_mtx[1:5, 1:5]

Load the SILVA taxonomy table as a matrix, rows are SILVA IDs, columns are taxonomic ranks:

data("IBD16S_tax")
colnames(IBD16S_tax)
IBD16S_tax[1:5, 1:5]

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("IBD16S_samp")
colnames(IBD16S_samp) %>% head()
IBD16S_samp[1:5, 1:5]

The IBD phyloseq object can be loaded as follows.

IBD <- IBD16S()
IBD

T2D

Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:

data("T2D16S_mtx")
T2D16S_mtx[1:5, 1:5]

Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:

data("T2D16S_tax")
colnames(T2D16S_tax)
T2D16S_tax[1:5, 1:5]

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("T2D16S_samp")
colnames(T2D16S_samp)
T2D16S_samp[1:5, 1:5]

The T2D phyloseq object can be loaded like so.

T2D <- T2D16S()
T2D

Features

Frequency Table Generation

The sample-level annotation data contianed in HMP2Data can be summarized using the table_two function. First, we need to make a list of the phyloseq and SummarizedExperiment objects which can then be entered into the table_two() table generating function.

list("MOMS-PI 16S" = momspi16S_phyloseq, "MOMS-PI Cytokines" = momspiCyto, "IBD 16S" = IBD, "T2D 16S" = T2D) %>% table_two()

We can also create a table of the breakdown of samples by visit number quantile.

list("MOMS-PI 16S" = momspi16S_phyloseq, "MOMS-PI Cytokines" = momspiCyto, "IBD 16S" = IBD, "T2D 16S" = T2D) %>% visit_table()

The patient-level characteristics can be summarized using the patient_table() function.

list("MOMS-PI 16S" = momspi16S_phyloseq, "MOMS-PI Cytokines" = momspiCyto, "IBD 16S" = IBD, "T2D 16S" = T2D) %>% patient_table()

Visits Histograms

Here we plot histograms of the number of samples at each visit for the data MOMS-PI 16S data. Note that there are r sum(momspi16S_samp$visit_number > 20) samples at visit numbers over 20.

# set up ggplots
p1 <- ggplot(momspi16S_samp, aes(x = visit_number)) + 
  geom_histogram(bins = 20, color = "#00BFC4", fill = "lightblue") +
  xlim(c(0,20)) + xlab("Visit number") + ylab("Count")
# theme(panel.background = element_blank(), panel.grid = element_blank())
p1

We can plot the histogram of the number of samples at each visit for all studies. Note that the X-axis is capped at count of 40 for better visualization.

# make data.frame for plotting
plot_visits <- data.frame(study = c(rep("MOMS-PI Cytokines", nrow(momspiCyto_samp)),
                     rep("IBD 16S", nrow(IBD16S_samp)),
                     rep("T2D 16S", nrow(T2D16S_samp))),
          visits = c(momspiCyto_samp$visit_number,
                     IBD16S_samp$visit_number,
                     T2D16S_samp$visit_number))
p2 <- ggplot(plot_visits, aes(x = visits, fill = study)) + 
  geom_histogram(position = "dodge", alpha = 0.7, bins = 30, color = "#00BFC4") + xlim(c(0, 40)) +
  theme(legend.position = c(0.8, 0.8))  + 
  xlab("Visit number") + ylab("Count")
p2

Note that there are r sum(plot_visits$visits > 40) samples at visit numbers over 40.

UpsetR plots

MOMS-PI 16S rRNA data

Here we plot the body sites which samples were taken from for each patient in the MOMS-PI 16S data.

# set up data.frame for UpSetR
momspi_upset <- aggregate(momspi16S_samp$sample_body_site, by = list(momspi16S_samp$subject_id), table)
tmp <- as.matrix(momspi_upset[, -1])
tmp <- (tmp > 0) *1
momspi_upset <- data.frame(patient = momspi_upset$Group.1, tmp)

# plot
upset(momspi_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

MOMS-PI Cytokines data

Here we plot the body sites which samples were taken from for each patient in the MOMS-PI cytokines data.

# set up data.frame for UpSetR
momspiCyto_upset <- aggregate(momspiCyto_samp$sample_body_site, by = list(momspiCyto_samp$subject_id), table)
tmp <- as.matrix(momspiCyto_upset[, -1])
tmp <- (tmp > 0) *1
momspiCyto_upset <- data.frame(patient = momspiCyto_upset$Group.1, tmp)

# plot
upset(momspiCyto_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

IBD data

The IBD patients only had samples taken from the feces.

T2D data

Here we plot the body sites which samples were taken from for each patient in the T2D 16S rRNA data.

# set up data.frame for UpSetR
T2D_upset <- aggregate(T2D16S_samp$sample_body_site, by = list(T2D16S_samp$subject_id), table)
tmp <- as.matrix(T2D_upset[, -1])
tmp <- (tmp > 0) *1
T2D_upset <- data.frame(patient = T2D_upset$Group.1, tmp)

# plot
upset(T2D_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

Exporting Data to CSV Format

To our knowledge, R and Bioconductor provide the most and best methods for the analysis of microbiome data. However, we realize they are not the only analysis environments and wish to provide methods to export the data from r BiocStyle::Biocpkg("HMP16SData") to CSV format.

Preparing the data

For SummarizedExperiment objects, we will need to separate the data and metadata first before exportation. First, make a data.frame for participant data.

momspi_cytokines_participants <- colData(momspiCyto)
momspi_cytokines_participants[1:5, ]

Then we need to pull out the data matrix.

momspi_cytokines <- assay(momspiCyto)
momspi_cytokines[1:5, 1:5]

For phyloseq objects the data, metadata, and taxonomy can be separated as follows. First, pull out metadata.

momspi_16S_participants <- sample_data(momspi16S_phyloseq)

Next, we can save the counts data as a matrix.

momspi16s_data <- as.matrix(otu_table(momspi16S_phyloseq))

Finally, the taxonomy table can be extracted.

momspi16s_taxa <- tax_table(momspi16S_phyloseq) %>% as.data.frame()

Save data as CSV

The data can be saved as .csv files as follows.

library(readr)
write_csv(data.frame(file_id = rownames(momspi_cytokines_participants), momspi_cytokines_participants), "momspi_cytokines_participants.csv")
write_csv(data.frame(momspi_cytokines), "momspi_cytokines.csv")

Session Info

sessionInfo()


dozmorovlab/HMP2Data documentation built on June 5, 2019, 5:12 p.m.