In richardsprague/actino: Phyloseq utilities for personal microbiome

library(tidyverse)
# the "sprague.map" variable below contains data that is read by these lines.
DATA_DIR <- Sys.getenv("UBIOME_DATADIR")
load(file.path(DATA_DIR,"ubiome_people","ub_data-Richard_Sprague","derived/spragueAllRanks.RData"))
knitr::opts_chunk$set(echo = TRUE)

This package helps you read and manipulate uBiome and other consumer-formatted microbiome files in the open source package Phyloseq.

Introduction

Phyloseq is an open source microbiome manipulation package in the BioConductor toolkit for R^[Read this beginner's guide: http://joey711.github.io/phyloseq-demo/phyloseq-demo.html]. Introduced in 2013^[McMurdie, Paul J., and Susan Holmes. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” Edited by Michael Watson. PLoS ONE 8, no. 4 (April 22, 2013): e61217. doi:10.1371/journal.pone.0061217.], it turns your microbiome data into an object-oriented representation that can be easily imported or exported to other common formats, and it supports many analysis techniques including calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics.

The combination of R, with RStudio, and Phyloseq is a powerful environment for microbiome investigation. Providing easy, interactive access to nearly every state-of-the-art algorithm in statistics, machine and deep learning, it can let uBiome researchers study our data quickly and iteratively, with exceptional plotting features that set the standard for publication-quality work.

Although the R language, compared to Python, can be more daunting for non-programmers, the package presented here should greatly simplify the process for scientists who already know the types of microbiome-related questions they want answered. If you are already familiar with the basic terms for how uBiome stores data in CSV files, you should be able to easily perform some powerful analysis and graphing by simply copying the environment and examples in this demo.

Prerequisites

If you are already a Python programmer, and you already have the Anaconda environment, you can simply install Anaconda RStudio. If you think you'll be doing regular serious programming for Python, you may want to study up on Anaconda anyway.

Otherwise, if you just want to get started as fast as possible on some of the examples in this book and don't want to deal just now with a bunch of other programming installs, just do this:

Download a (free) copy of open source licensed version of RStudio. Version are available for Mac, Windows, and Linux and they are all interchangable.
Install the latest (free) copy of R. You may as well install the latest possible version, at least version 3.4 or later.

Setting up the machine

I want all my R apps to use the same data directory for everything, but when I give out my R code, I don’t want to expose my hard drive structure.

Create the file .Rprofile in your home directory. You can add any initialization code you like, but I added the following, to set a global environment variable every time I launch:

Sys.setenv(UBIOME_DATADIR = "/path/to/my/ubiome/data")

NOTE: the file must have a blank carriage return as the last line.

Setting this one system variable makes it easy to find my microbiome data on any machine. In my folder /data, I keep separate folders for each person whose data I have. Inside each person's folder, I have a series of JSON files that contain the microbiome data for that person. I also have one mapfile (explained below) in Excel with a label and other metadata for each JSON file. Optionally I'll also include a directory /fastq where I keep fastq files (if any) for that user.

To read a single JSON file for myself, I use the following R commands:

library(jsonlite)
UBIOME_DATADIR <- Sys.getenv("UBIOME_DATADIR")
myJSON <- jsonlite::read_json(file.path(UBIOME_DATADIR,"/richard_sprague/richard_sprague-2017-01-01-gut-123456.json"))

I name all my microbiome JSON files using the convention "name-date-site-ssr.json" to make it easier to find the correct JSON file later. Even if you have only a few of these files, I strongly recommend adopting a consistent naming scheme.

Loading the package

If this is your first time to use Phyloseq, you will need to load a few prerequisites:

> install.packages(c(""ggplot2","dplyr", "tidyr", "readxl"))
> source("<https://bioconductor.org/biocLite.R>")
> biocLite("phyloseq")

Once these are installed, download the actino package from this repo using the devtools package:

> library(devtools)
> install_github("richardsprague/actino")

You may find that you lack a number of prerequisites, such as the Phyloseq package and a few common packages like dplyr and readxl. The system will warn you, and then you should install them yourself in the manner in which you are accustomed.

Actino includes a few built-in phyloseq objects, including kombucha.genus. The following commands will plot the abundances:

library("actino")
library(phyloseq)
plot_bar(prune_taxa(taxa_sums(kombucha.genus)>100000,kombucha.genus), fill = "Genus")

To load a new phyloseq object from a JSON file you will need a mapfile, which is an Excel file containing a row for each sample, and several columns including "SSR", "Date", "Label" and more.

mapfile <- data.frame
sprague.map %>% select(Username,Date,SSR,Site,Notes,Label) %>% filter(SSR %in% c(15372,15759,60454,27127)) %>% knitr::kable(caption="A Typical Mapfile (Excel)")

Load a standard uBiome-downloaded CSV file using this command:

experiment_to_phyloseq("myCSVfile.csv",mapfile)

and similarly, for a JSON file:

phyloseq_from_JSON_at_rank("myJSONfile.json",mapfile,rank="genus")

Either of these functions will generate a valid phyloseq object.

The package includes some sample data, which you can load like this:

data("kombucha.csv")
head(kombucha.csv)

The data, loaded into the variable kombucha.csv is a dataframe created from a CSV file downloaded from the uBiome "Advanced" tab (see the section on how to download your results as a CSV file). In this case, it's from one experiment of a user (me!) who drank several liters of kombucha each day to see the effect on his microbiome.

This is the raw data. To be useful for Phyloseq, it needs an accompanying file, called a mapfile, that maps attributes to each sample. The mapfile is a dataframe where the columns correspond to attributes of the data. You can have as many columns as you like, but one of the columns (usually the first one) must stand for the name of the sample. In most uBiome situations, the samples will be referred to by their SSR, and the actino package will expect a column named ssr in its mapfile. (Note: lowercase)

Let's create a very simple mapfile for kombucha.csv:

kombucha <- kombucha.csv  # simplify the name to make it easier to type
kombucha$ssr <- 1234 # I gave this sample the arbitrary ssr '1234'.
ssrs <- unique(kombucha$ssr)

ssrs

In this simple example, our CSV file contains only a single SSR, because it results from a single sample. Let's construct an equally simple map file.

The first column represents the sample name (aka SSR). We'll add a second column that represents the date of the sample. Finally, let's add a third column color that we’ll pretend is associated with some feature of the samples. Color of the stool? Color of the kombucha drink that day? Okay, if you don’t like my example features, so go ahead and add some other columns if you like.

mapfile<-data.frame(ssr=ssrs,date = Sys.Date()+seq(1,length(ssrs)), color = colors()[1:length(ssrs)])

mapfile

Although this example is a bit contrived, the result for Phyloseq is an entirely valid mapfile. Alternatively you could have made the map file manually too; Remember that in R it’s easy to create any dataframe from a CSV or Excel file. If you create one by hand, simply read it into R using the read.csv() function. Although in this demo, we are loading some pre-existing data, you could just as easily have loaded straight from the CSV file of your choice like this:

kombucha <- read.csv("myKombuchaDataFile.csv")
mapfile <- read.csv("myKombuchaMapFile.csv")

Once your data is available as an R dataframe, use the function actino::experiment_to_phyloseq() to convert it to Phyloseq:

kombucha.ps <- experiment_to_phyloseq(kombucha,mapfile)

kombucha.ps

And that's it! You now have a perfectly good Phyloseq object.

To run Phyloseq commands, you'll need to load the Phyloseq library, and after that everything's super-easy. In this example, we'll use another of the actino built-in data objects:

data("kombucha.genus")
kombucha.genus

This is a series of 19 samples taken from an experiment where I drank kombucha and tested my microbiome daily to see the effect. To generate this object, I downloaded my uBiome JSON data as separate files which I read one-by-one using actino::phyloseq_from_JSON_at_rank(). Two phyloseq objects can be combined using the function phyloseq::merge_phyloseq(). A series of these pairwise merges later, and I had all 19 samples into a single phyloseq object. The mapfile contained the following meta-data:

sample_variables(kombucha.genus)

The first five samples and their key metadata:

sample_data(kombucha.genus)[1:5,c("Date","Site","Label")]

Because kombucha.genus is a full-blown phyloseq object, I can do any of the standard phyloseq analysis on it, and like any other R command, the functions can be combined. For example, this single command generates a heatmap of those genera that make up more than 1% of the samples (i.e. 10,000 (out of 1 Million) normalized reads).

kombucha.gut <- subset_samples(kombucha.genus, Site == "gut") # just the gut samples

plot_heatmap(prune_taxa(taxa_sums(kombucha.gut)>10000,kombucha.gut),method = "CCA", distance = "none", sample.order = "Date", sample.label = "Date")

In real life, of course, you'll want to reorder the axes and perform some other transformations of the data. Phyloseq makes that trivially easy with some other useful commands:

head(get_taxa_unique(kombucha.ps, taxonomic.rank = "Genus"),10) # which are the unique taxa at Genus level?
#Note: if our data included other tax_ranks, we could have specified that.

ntaxa(kombucha.ps) # how many unique taxa were in the sample?

names(sort(taxa_sums(kombucha.ps),TRUE)[1:10]) # the ten most common genera

sample_data(subset_samples(kombucha.genus,grepl("[P|p]otato",sample_data(kombucha.genus)$Notes)))[,c("Date","Label")]

Clustering and PCA

kombucha.ord = ordinate(kombucha.genus, method="NMDS")
plot_ordination(kombucha.genus,kombucha.ord,label="Site")

By applying the label "Site" to each point of the ordination, we can see the natural clustering that occurs when showing different microbiomes on the same plot. Although each sample is different from another, gut samples are usually more similar to one another than skin or mouth samples, and this form of plot makes it easy to tell the difference.

More advanced commands

Phyloseq comes with zillions of additional and useful commands. Here are a few.

Rarefaction

Here's a simple example where we rarefy all samples to an even depth. Again, because the counts are already so low, this is a bit of a contrived example, but you can see how rarefaction did change the number of taxa in the samples.

kombucha.rare <- rarefy_even_depth(kombucha.genus)

d <- data.frame(tax_name=head(taxa_names(kombucha.genus)),before=head(taxa_sums(kombucha.ps)),after=head(taxa_sums(kombucha.rare)))
rownames(d)<-NULL
d