knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE, warning = FALSE, error = FALSE,
  fig.height = 4, fig.width = 6, fig.align = "center"
)

Introduction

Microbiome data generated by high through-put sequecing methods come with the hierachical phylogenic taxonomy informtation. Researches often need to group OTUs/features base on their phylogenic information on different levels in order to find any potential point of interests. The phylox is a R package allows user to look at the phylogenic sequencing data in different phylogenic levels. As it is named the core component of this package was built based on the phyloseq package. This tutorial shows a basic workflow of using this package to analyze microbiome data.

Installation and Prerequisites

The phylox package is build on top of the R/Bioconductor package phyloseq. Please make sure the phyloseq is installed in your enviroment before installing phylox. If you want to use the statistic function of this package, please also make sure the bioconductor package limma or DESeq2 is installed. To install packages from bioconductor, please run the following.

source("http://bioconductor.org/biocLite.R")
biocLite("phyloseq")
biocLite("limma")
biocLite("DESeq2")

The phyloseqExtra can only be installed from github.

devtools::install_github("zhuchcn/phylox")

Data Import

Data slots

The workflow starts with the OTU/feature clustered OTU/feature table with taxonomy table to each OTU/feature. To generate the OTU/feature table, please refer to the QIIME2 or the dada2 tutorial. As stated earlier that the phylox package is built on top of phyloseq. The phyloseq object is S4 class that can hold 5 data slots, and 3 of them are essential for the phylox, with the other 2 optional.

The 3 essential data slots are:

The 2 optional slots are:

From DADA2

The phyloseq object can be constructed from the DADA2 output very easily

ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), 
               sample_data(metadata), 
               tax_table(taxa))

The DADA2 package uses the reference sequence as the feature names. So you can create a refseq slots using the Biostrings package. The Biostirngs package can be installed from bioconductor. Although the reference sequences don't seem to be quite usful during the data analysis of microbiome data.

library(Biostrings)
refseq = DNAStringSet(colnames(seqtab.nochim))
taxa = taxa[colnames(seqtab.nochim)]    # to make sure that they match
colnames(seqtab.nochim) = str_c(        # give each feature a uniformed name
    "OTU", 
    str_pad(
        1:ncol(seqtab.nochim), 
        width = ceiling(log10(ncol(seqtab.nochim))),
        pad = "0", side = "left"
    )
)
rownames(taxa) = colnames(seqtab.nochim)
names(refseq)  = colnames(seqtab.nochim)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
               sample_data(metadata),
               tax_table(taxa),
               refseq)

SummarizedPhyloseq

The SummarizedPhyloseq is a S4 class inherits from the phyloseq-class from the phyloseq package, that contains 7 additional slots from species to kingdom to store the summarized abundance table.

library(phyloseq); library(phylox)
data(fatigue)
spy = summarizeFromPhyloseq(fatigue)
spy

The OTU table can also be converted to relative abundance.

fatigue_prop = transform_sample_counts(fatigue, function(x) x/sum(x))
spy_prop = summarizeFromPhyloseq(fatigue_prop)

Use the slot assessor to get the summarized OTU table on each level. Type ?kingdom_table to see more help

Statistic Analysis

Researches often want to do hypothesis testing between different study group, population, or any other sample variable. We provide a encapsulated function to do statistic analysis on count data using the DESeq2 package, or relative abundant data using limma package, and then store the statistic results in a SummarizedPhyloStats object. These functions require the DESeq2 or limma packages pre-installed in your enviroment. These two packages can be installed from the bioconductor. An advantage of these linear model based methods is they are flexible on study design and can handle repeated measures. This is especially usful in cross-over studies clinical studies.

design = model.matrix(data = as(sample_data(fatigue), "data.frame"),
                      ~Subject + 1)
spys_de = spy_to_deseq2(spy, design, resultsName = "SubjectPatient")
spys_de
spys_lm = spy_to_limma(spy_prop, design, transform = "log", p.value = 2, coef = 2)
spys_lm

Use the slot assessor to get the statistic result table on each level.

phylum_table(spys_de)

Visualization

Stacked Barplot

The plot_bar() function can be used to quickly generate barplots. The function takes 5 arguments, a SummarizedPhyloseq object, a character variable level to specify the phylogenic level, a character variable by from the sample meta-data column names, a logical variable show.legend, and a logical variable plotly. The output is a ggplot object so you can add any ggplot function on it.

plot_bar(spy_prop)

The default value for level is Phylum. If the by argument is not specified, the sample id will be used and each sample will be an individual bar.

The by variable can take up to 3 elements.

plot_bar(spy_prop, level = "Phylum", by = c("Subject", "Sex"))
plot_bar(spy_prop, level = "Genus", by = c("Subject", "Sex", "age_range"), show.legend = F, plotly = T )

Boxplot function

The boxplot function takes more arguments. Please refer to the help document for the usage of each arguments.

plot_box(spy_prop, level = "Genus", taxon = "g__Ruminococcus", by = "Subject")
plot_box(spy_prop, level = "Genus", taxon = "g__Ruminococcus", by = c("Subject","Sex"), jitter = 0.15)
plot_box(spy_prop, level = "Genus", taxon = "g__Ruminococcus", by = c("Subject", "Sex"), show.points = FALSE)
plot_box(spy_prop, level = "Genus", taxon = "g__Ruminococcus", by = c("Subject","Sex"), jitter = 0.15, box.size = 1, whisker.size = 1, point.alpha = 0.75, point.color = "steelblue", style = "academic")

The plot_box function also can draw lines by specifying the line argument. This is particually usful for repearted measures.

Heatmap

To use the heatmap function, please install the zheatmap package from github.

devtools::install_github("zhuchcn/zheatmap")
plot_heatmap(spy_prop, spys_lm, coef = "pvalue", cutoff = 0.1, anno.var = "Subject")

Cladogram with microbiomeViz

Cladogram is a very useful visualization method for microbiome data with different phylogenic levels. Most current popular cladogram visualization tools are not able to use flexible statistic methods fo annotation. Here we provide a solution combining with the microbiomeViz package. Below is an example using the DESeq2 result to annotate the cladogram

library(ggplot2);library(microbiomeViz)
tr = parsePhyloseq(fatigue, node.size.scale = 1.25)
p = tree.backbone(tr, size=1)
anno.data = create_annodata(spys_de, coef = "padj", cutoff = 0.05)
p = clade.anno(p, anno.data)
p

The cladogram is a ggplot object, so it can take the theme function from ggplot for further formating.

p + geom_point(
    data=data.frame(x=1:2,  y=1:2, color=c("Control","Patient")),
    aes(x=x, y=y, color=color), size=0, stroke=0) +
    guides(
        color = guide_legend(
            override.aes = list(size=3, color= rev(c("#00468BFF", "#ED0000FF")))
        )) +
    theme(
    legend.text = element_text(size = 11),
    plot.margin = margin(l=0, r=160, t=0, b=0)
)


zhuchcn/phylox documentation built on May 31, 2019, 5:14 p.m.