knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial will show you how microViz makes working with phyloseq objects easier.
library(dplyr) library(phyloseq) library(microViz)
phyloseq objects are probably the most commonly used data format for working with microbiome data in R.
The creator of phyloseq, Paul J. McMurdie, explains the structure of phyloseq objects and how to construct them on the phyloseq website.
biom format files can be imported to phyloseq with the import_biom
function. Such biom files are generated (or can be) from many processing tools including QIIME 1 / 2, MetaPhlAn, and NG-Tax.
mothur output is also directly supported via phyloseq's import_mothur
function.
DADA2 output can converted to phyloseq according to these DADA2 handoff instructions
Don't worry too much about getting all of your sample metadata into your biom file or phyloseq object at the start, as ps_join()
makes it easy to add sample data later.
# This tutorial will just use some example data # It is already available as a phyloseq object from the corncob package example_ps <- microViz::ibd example_ps
phyloseq checks that your sample and taxa names are consistent across the different slots of the phyloseq object. microViz provides phyloseq_validate()
to check for and fix other possible problems with your phyloseq that might cause problems in later analyses. It is recommended to run this at the start of your analyses, and fix any problems identified.
example_ps <- phyloseq_validate(example_ps, remove_undetected = TRUE) example_ps <- tax_fix(example_ps)
Once you have a valid phyloseq object, microViz provides several helpful functions for manipulating that object. The names and syntax of some functions will be familiar to users of dplyr, and reading the dplyr help pages may be useful for getting the most out of these functions.
Add a data.frame of metadata to the sample_data slot with ps_join()
Compute new sample_data variables with ps_mutate()
Subset or reorder variables in sample_data with ps_select()
Subset samples based on sample_data variables with ps_filter()
Reorder samples (can be useful for examining sample data or plotting) with:
ps_reorder()
: manually set sample order
ps_arrange()
: order samples using sample_data variables
ps_seriate()
: order samples according to microbiome similarity
Remove duplicated/repeated samples with ps_dedupe()
Remove samples with missing values in sample_data with ps_drop_incomplete()
Check out the examples section on each function's reference page for extensive usage examples.
Lets look at the sample data already in our example phyloseq.
# 91 samples, with 15 sample_data variables example_ps # return sample data as a tibble for pretty printing samdat_tbl(example_ps)
Maybe you want to only select participants who have IBD (not controls). You can do that by filtering samples based on the values of the sample data variable: ibd
example_ps %>% ps_filter(ibd == "ibd") # it is essential to use `==` , not just one `=` # notice that taxa that no longer appear in the remaining 67 samples have been removed!
More complicated filtering rules can be applied. Let's say you want female IBD patients with "mild" or "severe" activity, who are at least 13 years old.
partial_ps <- example_ps %>% ps_filter( gender == "female", activity %in% c("mild", "severe"), age >= 13 ) partial_ps
Let's have a look at the sample data of these participants. We will also arrange the samples grouped by disease and in descending age order, and select only a few interesting variables to show.
partial_ps %>% ps_arrange(DiseaseState, desc(age)) %>% ps_select(DiseaseState, age, matches("activ"), abx) %>% # selection order is respected samdat_tbl() # this adds the .sample_name variable
You can also sort sample by microbiome similarity with ps_seriate()
.
partial_ps %>% tax_agg("Genus") %>% ps_seriate(dist = "bray", method = "OLO_ward") %>% # these are the defaults comp_barplot(tax_level = "Genus", sample_order = "asis", n_taxa = 10) # note that comp_barplot with sample_order = "bray" will run # this ps_seriate call internally, so you don't have to!
You can also arrange samples by abundance of one of more microbes using ps_arrange()
with .target = "otu_table"
. Arranging by taxon can only be done at the current taxonomic rank, so we will aggregate to Genus level first.
# Arranging by decreasing Bacteroides abundance partial_ps %>% tax_agg("Genus") %>% ps_arrange(desc(Bacteroides), .target = "otu_table") %>% otu_get() %>% # get the otu table .[, 1:6] # show only a subset of the otu_table # Plot samples' compositions in this order partial_ps %>% tax_agg("Genus") %>% ps_arrange(desc(Bacteroides), .target = "otu_table") %>% comp_barplot(tax_level = "Genus", sample_order = "asis", n_taxa = 10) # Notice this is sorted by bacteroides counts # (this doesn't quite match relative abundance % due to sequencing depth variation)
ps_filter()
vs. phyloseq::subset_samples()
As well as filtering your samples, ps_filter()
might also modify the otu_table and tax_table of the phyloseq object (unlike phyloseq::subset_samples()
, which never does this).
Why does it do this?\
If you remove many samples from your dataset, often your phyloseq object will be left with taxa that never occur in any of the remaining samples (i.e. total counts of zero). ps_filter()
removes those absent taxa by default.
If you don't want this, you can set the .keep_all_taxa
argument to TRUE
in ps_filter
.
devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.