In jbisanz/qiime2R: qiime2R

Background

The qiime artifact is a method for storing the input and outputs for QIIME2 along with associated metadata and provenance information about how the object was formed. This method of storing objects has a number of obvious advantages; however, on the surface it does not lend itself to easy import to R for the R-minded data scientist. In reality, the .qza file is a compressed directory with an intuitive structure.

This package is trying to simplify the process of getting the artifact into R without discarding any of the associated data through a simple read_qza() function. The artifact is unpacked in to /tmp (or another directory if specified using tmp="/yourdirhere") and the raw data and associated metadata are read into a named list (see below). The object is then removed from the tmp dir (unless user specifies rm=F). Data are typically returned as either a matrix, data.frame, phylo object (trees), or DNAStringSets (nucleic acid sequences). In addition a qza_to_phyloseq() wrapper is provided to generate a phyloseq object for further analysis. It also provides a print_provenance() function for summarizing how the data was generated.

In this analysis we will use the qiime2R, phyloseq, and tidyverse libraries.

library(qiime2R)
library(phyloseq)
library(tidyverse)

Reading a .qza file

Here we are using data from the QIIME2 moving pictures tutorial. We start by reading the dada2-derived otu table.

download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/table.qza", "table.qza")
SequenceVariants<-read_qza("table.qza")

We can now view the associate data stored in this object by examining the names of the data stored within:

names(SequenceVariants)

The following information is stored within the object:

data - this contains your data in the R appropriate format, for example a feature table or phylogenetic tree
uuid - a unique identifier for the artifact
type - the data type stored within, see qiime2 documentation
format - the format the data is stored in, see qiime2 documentation
contents - the files stored within the object and their size
version - the version of the artifact, current testing has been on version 2018.4
provenance - information about how and where the data was generated

Understanding provenance

QIIME2 View can provide a graphical representation of this data; however, provenance is displayed as a nested list showing the input file dates, times, unique identifiers, and run time environments as below using the print_provenance() function. See qiime2 documentation for interpretation.

print_provenance(SequenceVariants)

Generating a phyloseq object

Many R users may wish to use phyloseq to help analyze their data. There is a wrapper function called qza_to_phyloseq() to assist with this purpose. We will download the associated tree, metadata, and taxonomy to build the object. At least two of the above items will be required to build a phyloseq object.

download.file("https://data.qiime2.org/2018.4/tutorials/moving-pictures/sample_metadata.tsv", "sample_metadata.tsv")
download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/taxonomy.qza", "taxonomy.qza")
download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/rooted-tree.qza", "rooted-tree.qza")

phyobj<-qza_to_phyloseq(features="table.qza", taxonomy = "taxonomy.qza", tree = "rooted-tree.qza", metadata="sample_metadata.tsv")

phyobj

Generating a TreeSummarizedExperiment (TSE) object

A TreeSummarizedExperiment is a generic and highly optimized container for complex data structures incorporating hierarchical information (such as phylogenetic trees and sample hierarchies) and reference sequences. The qza_to_tse() constructs such a data structure. We will download the associated features, tree, metadata, and taxonomy to build the object. To construct a TreeSummarizedExperiment, you'll at least need to provide some features.

download.file("https://data.qiime2.org/2018.4/tutorials/moving-pictures/sample_metadata.tsv", "sample_metadata.tsv")
download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/table.qza", "table.qza")
download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/taxonomy.qza", "taxonomy.qza")
download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/rooted-tree.qza", "rooted-tree.qza")

tse <- qza_to_tse(features="table.qza", taxonomy = "taxonomy.qza", tree = "rooted-tree.qza", metadata="sample_metadata.tsv")

tse

Using ggplot2 to visualize data

We can use ggplot2 to graph data generated by QIIME2 in R allowing for advanced graphing parameters. Using the Shannon Diversity metrics generated in the moving pictures tutorial, we will plot the diversity over time by participant. See ggplot2 documentation for the vast array of plotting options available.

download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/core-metrics-results/shannon_vector.qza", "shannon_vector.qza")

metadata<-
  read_tsv("sample_metadata.tsv", comment="#q2:types") #to exclude the column denoting the variable class

read_qza("shannon_vector.qza")$data %>%
  as.data.frame() %>%
  rownames_to_column("#SampleID") %>% # to allow a smooth joining with the metadata
  left_join(metadata) %>%
  ggplot(
    aes(
      x=DaysSinceExperimentStart, 
      y=shannon, 
      group=BodySite, 
      color=BodySite, 
      shape=ReportedAntibioticUsage)
        ) +
  geom_line() +
  geom_point() +
  facet_wrap(~Subject) + #make a separate plot for each subject
  theme_bw()

Similarily we could use this approach to match more metadata onto a PCoA analysis of say unweighted UniFrac Distances. When a PCoA is imported, data will contain two peices of information: .\$data\$ProportionExplained as well as the vectors themselves .\$data\$Vectors.

download.file("https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/core-metrics-results/unweighted_unifrac_pcoa_results.qza", "unweighted_unifrac_pcoa_results.qza")

pcoa<-read_qza("unweighted_unifrac_pcoa_results.qza")$data
pcoa$Vectors %>%
  rename("#SampleID"=SampleID) %>% # to match the metadata
  left_join(metadata) %>%
  ggplot(
    aes(
      x=PC1, 
      y=PC2, 
      color=BodySite,
      shape=Subject
        )
    ) +
  geom_point() +
  theme_bw() +
  xlab(paste0("PC1: ", pcoa$ProportionExplained["PC1"])) + #add variance explained to axis
  ylab(paste0("PC2: ", pcoa$ProportionExplained["PC2"]))