library("knitr") opts_chunk$set(cache=FALSE, fig.width=9, fig.height=4, message=FALSE, warning=FALSE, comment = "#>")
Analysis of data generated from microbial genome sequencing has provided a fertile ground for exploring the large fraction of bacterial life that remains uncultured. The phylogeo package provides a way to explore the geographic relationships of their microbial gene data by providing mapping functions to visualize relationships between metagenomic samples.
phylogeo is built on top of the powerful phyloseq package which provides a framework for organizing all of the relevant microbiome data likeoperational taxonomic unit data, taxonomic data, phylogenetic trees, and sample information. Furthermore, phyloseq uses the elegant ggplot plotting package.
phylogeo's primary requirement is that your phyloseq sample_data data contain Latitude and Longitude columns. (For basic information about the phyloseq data model see here). If your phyloseq samples are location tagged then the following functions become available to you:
map_phyloseq()
can visualize your sample locations on a mapmap_network()
will connect samples based on a specified ecological distancemap_tree()
will draw a paired plot of a phylogenetic tree and the location of those samplesmap_clusters()
will draw a paired plot of a phylogenetic tree and the location of those samplesplot_distance()
will plot the pairwise distances between samples using ecological and greatcircle distanceThe use and utility of these functions are explained below, after a description of the datasets included in phylogeo.
phylogeo has three datasets.
mountainsoil
is a dataset from a microbiome study focusing on microbial community succession following forest firesbatmicrobiome
is a dataset from a bat-guano microbiome study.epoxomicin_KS
is a dataset of Epoxomicin-like ketosynthase (KS) domains.These datasets have geographic coordinates and are suitable for illustrating
the features of the phylogeo package, and they can be loaded using
the data()
command.
library(phylogeo) data(mountainsoil) data(batmicrobiome) data(epoxomicin_KS)
map_phyloseq provides a way to quickly look at your data by mapping it.
Because no calculations are done to the underlying data, drawing a map
using this command is quite quick. The mapping in this command (and all commands)
is handled by the map_data
function of ggplot. This style of mapping
allows zooming in by region
using a lookup table of country-names.
Other ggplot goodies like coloring by a variable are all available as
is a set of convenience functions that allows you to jitter points that
are overlapping at the scale you are interested in. Two examples showing
how these features can be used to zoom into a region, separate samples
and look at the distribution of a variable ("PH") in the phyloseq
sample_data
table.
#high level world map map_phyloseq(mountainsoil) #zoomed in map, colored by pH and jittered in order to see the points better map_phyloseq(mountainsoil, region="china", jitter=TRUE, jitter.x=2,jitter.y=2, color="PH")
The goal of the map_network function is to visualize the relatedness of ecological samples. It does this using the same technique in phyloseq's plot_net and plot_network functions, namely, it will calculate an ecological distance between samples. Instead of using a network layout algorithm, however, phylogeo will place points by their geographic location. If the points are crowded, they can be jittered to space them apart. A number of configuration options are available to control the appearance of the points and lines.
#simple network map without lines map_network(batmicrobiome) #connect samples and visualize connections map_network(batmicrobiome, lines=TRUE, color="SCIENTIFIC_NAME", jitter=TRUE) #play with the jitter and cutoff to get a better view map_network(batmicrobiome, color="SCIENTIFIC_NAME", jitter=TRUE, jitter.x=20,jitter.y=1, lines=TRUE, line_weight = 0.2, size=2)
One current downside to the current mapping setup is that the region
keyword
only works on country names and cannot zoom to, say "the western hemisphere".
In this example the result is a lot of dead space. For figures you would want
to ggsave
a PDF/SVG and zoom to your region of interest. Down the line a more
nuanced mapping system might be considered.
The goal of map_tree is to provide access to both phylogenetic and geographic information in one place. To be useful, the data may need to be pared back to relatively small number of OTUs/samples.
#simple map + tree map_tree(epoxomicin_KS) #color the map to see locations map_tree(epoxomicin_KS, color="Geotype", jitter=TRUE, size=2, ladderize=TRUE)
#trim samples and locations #x <- prune_samples() #y <- subset_taxa() # map_tree(y, color="Geotype", jitter=TRUE)
map_clusters provides a way to explore the spatial distribution of subsets of your data. From your phylogentic tree (required), a distance matrix is used to make a kmeans calculation which will cluster your tree into k number of clusters. The members of these clusters will then be displayed on the map.
#clustered trees plus maps map_clusters(epoxomicin_KS, clusternum=4)
This function creates a scatterplot of distances comparing the geographic great-circle distance to the the ecological distance (default:jaccard).
plot_distance(mountainsoil)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.