In zachcp/phylogeo: Mapping of Phyloseq Information

library("knitr")
opts_chunk$set(cache=FALSE, fig.width=9, fig.height=4, message=FALSE, warning=FALSE, comment = "#>")

About this vignette

phylogeo is based on top of phyloseq and therefore uses the same data model, an excellent overview of which can be found here. This vignette is simply a continuation of that tutorial. We create some random data from phyloseq including the following:

an OTU table,
a Taxonomy Table,
a Sample Data dataframe.
Latitude and Longitude information that is added to the Sample Data, and
A phylogenetic tree of the relationship betwen the OTUs contianed in the OTU table.

Phyloseq provides toolf for dealing with the first three items on this list. By adding location information and phylogenetic information you can take advantage fo the plotting tools in the phylogeo package.

Note: If these data-types don't make sense to you, it may be worthwhile to look at the phyloseq tutorials to get your bearings. For a grasp of what OTUs are and how they are used in metagenome analysis you could do worse than to start with Robert Edgars' page on OTUs. Edgar is the author of a number of highly used bioinformatic programs including Muscle and USearch (see here).

Generate phyloseq data and plot with phylogeo

Load phylogeo and get some sequence data from the `ape` package

In this vignette, sequence data from the ape package will be used, although you will probably have your own sequencing data to use. Everything except that sequence data is generated denovo using base R.

#load libraries
library(phyloseq)
library(phylogeo)
library(ape)

# import the woodmouse sequence data
data(woodmouse)

# set some global varialbe we will need to make our dataset
woodmousenames <- names(as.list(woodmouse))
samplenum = 15
samplenames =  paste0("Sample", 1:samplenum)

Create an OTU table

An operational taxonomic unit (OTU) table that a table holding count information. The row values are typicaly OTUs which in macro-biology are typically species, but in microbiology are semi-arbitrary cutoffs of sequence similarity that corresponds roughly to species. For 16S studies, these OTUs are typically 16S DNA amplicon sequences clustered at 97% identity. We can make an arbitrary dataset where out OTUs will be the woodmouse data, and the columns will be our samples.

otumat = matrix(sample(1:100, 75, replace = TRUE), 
                nrow = length(woodmousenames), 
                ncol = samplenum)
rownames(otumat) <- woodmousenames
colnames(otumat) <- samplenames
otumat

Now we need a pretend taxonomy table

The taxonomy table lets you assign taxonomic information for each OTU. The rownames of the taxonomy table must be the same as the OTU table. Each of the columns represent some value, typically corresponding to the classic Kindom, Phylum.... Species heirarchy. However, this table can be used to assign arbitrary data to OTUs as well. Here we will generate some random taxonomy data using classic phylogeny, but with some made-up, filler data.

taxmat = matrix(sample(letters, 75, replace = TRUE), 
                nrow = nrow(otumat), 
                ncol = 7)
rownames(taxmat) <- rownames(otumat)
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", 
                      "Family", "Genus", "Species")
taxmat

And some fake Sample Data

Sample data is information about your samples. The sample data rownames must corresond to the values in the OTU table. We will make some sample data with some arbitrary columns. These columns are often used to store informaiton about sample processing, or about the samples themselves.

sampledata <- matrix(sample(letters,samplenum*7, replace=TRUE), 
                     nrow = samplenum, 
                     ncol = 7)
rownames(sampledata) <- samplenames
# in your real data these columns might hold soil data, or processing information
colnames(sampledata) <- paste0("Category", 1:ncol(sampledata))
# generate some random lat/lon data and add it to the smaple data.
latitude = sample((-90:90), samplenum)
longitude = sample((-180:180), samplenum)
sampledata <- cbind(sampledata, latitude, longitude)
sampledata <- as.data.frame(sampledata)
sampledata

make a phylogenetic tree from the woodhouse sequences

To take advantage of the information stored in the DNA sequences we need to generate a phylogenetic tree. Once generated and loaded into the physeq object, phylogeo can use to explore whether there are geogrphic trends in the subclades. The members of the tree must be the same names as the names of the rows (OTUs) in the OTU table.

# muscle is used for alignments
musc <- muscle(woodmouse)
# nj will create a phylogenetic tree from the alignments
tree <- nj(musc)
plot(tree)

create the phyloseq object

All of the pieces now must be put together into the phyloseq pbject

phy <- phyloseq(sample_data = sample_data(sampledata),
                otu_table   = otu_table(otumat, taxa_are_rows=TRUE),
                tax_table   = tax_table(taxmat),
                phy_tree    = phy_tree(tree))
phy

Phylogeo can now be used to map this phyloseq object.

# map the locations
# random lat/lons yield samples all over
map_phyloseq(phy)
# map the locations
# color by sample data category; size by abundance
map_phyloseq(phy, color="Category1", size="Abundance")
# map the relationship between samples
map_network(phy, color="Category2", lines=TRUE)
# only keep highly related samples
# tip: if there are unrelated samples that drop out of the network analysis try  base_data = TRUE
map_network(phy, color="Category2",lines=TRUE, maxdist = 0.3)
# map the tree
# with this random data, perhaps it is not surprisign to wind all of the woodmouse sequences in each of the smapel locations.
map_tree(phy, color="Category1")
# map clusters
# cluster analyses can let you look at the possilbity of non random distributions of subclades.
# does, say,  a certain subclade only occur in North America?
map_clusters(phy)
# plot distance
# pairwise sample distances ecological distance vs. geographic distance
plot_distance(phy)