library("knitr")
opts_chunk$set(cache = FALSE, fig.width = 9, fig.height = 4, message = FALSE, warning = FALSE, comment = "#>")
phylogeo is built on top of phyloseq and therefore uses the same data model. An excellent overview of that model can be found here, and this vignette is simply a continuation of that tutorial. We create some random data for phyloseq including the following:
Phyloseq provides tools for dealing with the first three items on this list. By adding location information and phylogenetic information you can take advantage of the plotting tools in the phylogeo package.
Note: If these data types don't make sense to you, it may be worthwhile to look at the phyloseq tutorials to get your bearings. For a grasp of what OTUs are and how they are used in metagenome analysis, you could do worse than to start with Robert Edgar's page on OTUs. Edgar is the author of a number of widely used bioinformatics programs, including MUSCLE and USEARCH (see here).
ape package
In this vignette, sequence data from the ape package will be used, although you will probably have your own sequencing data to use. Everything except that sequence data is generated de novo using base R.
# load libraries
library(phyloseq)
library(phylogeo)
library(ape)
# import the woodmouse sequence data
data(woodmouse)
# set some global variables we will need to make our dataset
woodmousenames <- names(as.list(woodmouse))
samplenum = 15
samplenames = paste0("Sample", 1:samplenum)
An operational taxonomic unit (OTU) table is a table holding count information. The rows are typically OTUs, which in macrobiology are usually species, but in microbiology are semi-arbitrary cutoffs of sequence similarity that correspond roughly to species. For 16S studies, these OTUs are typically 16S DNA amplicon sequences clustered at 97% identity. We can make an arbitrary dataset where our OTUs will be the woodmouse sequences, and the columns will be our samples.
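# draw one random count per cell (OTUs x samples) so that no values are recycled
otumat = matrix(sample(1:100, length(woodmousenames) * samplenum, replace = TRUE),
                nrow = length(woodmousenames), ncol = samplenum)
rownames(otumat) <- woodmousenames
colnames(otumat) <- samplenames
otumat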
The taxonomy table lets you assign taxonomic information to each OTU. The row names of the taxonomy table must match those of the OTU table. Each column represents some value, typically corresponding to a rank in the classic Kingdom, Phylum, ..., Species hierarchy. However, this table can be used to assign arbitrary data to OTUs as well. Here we will generate a taxonomy table with the classic ranks as columns, filled with made-up data.
# draw one random letter per cell (OTUs x 7 ranks) so that no values are recycled
taxmat = matrix(sample(letters, nrow(otumat) * 7, replace = TRUE), nrow = nrow(otumat), ncol = 7)
rownames(taxmat) <- rownames(otumat)
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
taxmat
Sample data is information about your samples. The row names of the sample data must correspond to the sample names in the OTU table (here, its column names). We will make some sample data with some arbitrary columns. These columns are often used to store information about sample processing, or about the samples themselves.
sampledata <- matrix(sample(letters, samplenum * 7, replace = TRUE), nrow = samplenum, ncol = 7)
rownames(sampledata) <- samplenames
# in your real data these columns might hold soil data, or processing information
colnames(sampledata) <- paste0("Category", 1:ncol(sampledata))
# convert to a data frame first so the coordinates stay numeric
# (cbind() onto the character matrix would turn them into strings)
sampledata <- as.data.frame(sampledata)
# generate some random lat/lon data and add it to the sample data
sampledata$latitude <- sample(-90:90, samplenum)
sampledata$longitude <- sample(-180:180, samplenum)
sampledata
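Because mapping depends on usable coordinates, it can be worth checking the latitude and longitude columns before going further. The following is a minimal base R sketch on a small made-up table (the object names `meta` and `coords_ok` are illustrative and not part of phylogeo). It shows how coordinates bound onto a character matrix end up as strings, and how to convert and range-check them.

```r
# Toy sample table with hypothetical values. cbind() on a character
# matrix silently turns the numeric coordinates into strings.
meta <- matrix(c("a", "b", "c"), nrow = 3, ncol = 1,
               dimnames = list(paste0("Sample", 1:3), "Category1"))
meta <- as.data.frame(cbind(meta,
                            latitude  = c(10, -45, 60),
                            longitude = c(20, 170, -120)),
                      stringsAsFactors = FALSE)

# Go through as.character() so the conversion also works for factor columns
meta$latitude  <- as.numeric(as.character(meta$latitude))
meta$longitude <- as.numeric(as.character(meta$longitude))

# TRUE for every sample whose coordinates are present and within range
coords_ok <- !is.na(meta$latitude) & !is.na(meta$longitude) &
  abs(meta$latitude) <= 90 & abs(meta$longitude) <= 180
coords_ok
```

Samples flagged `FALSE` would need their coordinates fixed or removed before mapping.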
To take advantage of the information stored in the DNA sequences we need to generate a phylogenetic tree. Once the tree is generated and loaded into the phyloseq object, phylogeo can use it to explore whether there are geographic trends in the subclades. The tip labels of the tree must match the row names (OTUs) of the OTU table.
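# muscle is used for alignments (ape calls the external MUSCLE program, which must be installed)
musc <- muscle(woodmouse)
# nj builds a neighbor-joining tree from a distance matrix,
# so pairwise distances are computed from the alignment first
tree <- nj(dist.dna(musc))
plot(tree)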
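The naming requirements described so far (taxonomy rows match OTU rows, sample data rows match OTU columns, tree tips match OTU rows) can be checked before the pieces are combined. Below is a minimal base R sketch using small made-up inputs. `check_names` is a hypothetical helper written for this illustration, not a phyloseq or phylogeo function.

```r
# Hypothetical helper: report which of the three naming requirements hold
check_names <- function(otu, tax, samp, tip_labels) {
  c(
    taxonomy = setequal(rownames(tax), rownames(otu)),
    samples  = setequal(rownames(samp), colnames(otu)),
    tree     = setequal(tip_labels, rownames(otu))
  )
}

# Small made-up inputs standing in for the real tables
otu_ids    <- paste0("OTU", 1:4)
sample_ids <- paste0("Sample", 1:3)
otu  <- matrix(1:12, nrow = 4, dimnames = list(otu_ids, sample_ids))
tax  <- matrix(letters[1:8], nrow = 4,
               dimnames = list(otu_ids, c("Genus", "Species")))
samp <- data.frame(Category1 = c("x", "y", "z"), row.names = sample_ids)

# All names agree
checks <- check_names(otu, tax, samp, tip_labels = otu_ids)
# One tip label does not correspond to any OTU
bad <- check_names(otu, tax, samp, tip_labels = c(otu_ids[1:3], "OTU9"))
checks
bad
```

With the objects from this vignette, the tip labels of an ape tree are available as `tree$tip.label`, so the call would look like `check_names(otumat, taxmat, sampledata, tree$tip.label)`.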
All of the pieces must now be put together into the phyloseq object.
phy <- phyloseq(sample_data = sample_data(sampledata),
                otu_table = otu_table(otumat, taxa_are_rows = TRUE),
                tax_table = tax_table(taxmat),
                phy_tree = phy_tree(tree))
phy
# map the locations
# random lat/lons yield samples all over
map_phyloseq(phy)

# map the locations
# color by sample data category; size by abundance
map_phyloseq(phy, color = "Category1", size = "Abundance")

# map the relationship between samples
map_network(phy, color = "Category2", lines = TRUE)

# only keep highly related samples
# tip: if there are unrelated samples that drop out of the network analysis, try base_data = TRUE
map_network(phy, color = "Category2", lines = TRUE, maxdist = 0.3)

# map the tree
# with this random data, perhaps it is not surprising to find all of the
# woodmouse sequences in each of the sample locations
map_tree(phy, color = "Category1")

# map clusters
# cluster analyses let you look at the possibility of non-random distributions of subclades
# does, say, a certain subclade only occur in North America?
map_clusters(phy)

# plot distance
# pairwise sample distances: ecological distance vs. geographic distance
plot_distance(phy)
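To make the idea of comparing pairwise ecological distance with geographic distance concrete, here is a minimal base R sketch on made-up data. It is an illustration only and not phylogeo's implementation. The haversine formula for great-circle distance and Bray-Curtis dissimilarity are assumptions chosen for the example, and `haversine_km`, `bray_curtis`, and `dist_table` are hypothetical names.

```r
# Great-circle distance in km between points given in decimal degrees
# (haversine formula, assuming a spherical Earth of radius 6371 km)
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Bray-Curtis dissimilarity between two vectors of OTU counts
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Made-up counts (OTUs in rows, samples in columns) and coordinates
set.seed(1)
counts <- matrix(sample(1:100, 20, replace = TRUE), nrow = 5,
                 dimnames = list(paste0("OTU", 1:5), paste0("Sample", 1:4)))
lat <- c(0, 0, 45, -30)
lon <- c(0, 90, 10, 150)

# One row per unordered pair of samples
pairs  <- t(combn(ncol(counts), 2))
geo_km <- haversine_km(lat[pairs[, 1]], lon[pairs[, 1]],
                       lat[pairs[, 2]], lon[pairs[, 2]])
eco    <- apply(pairs, 1, function(p) bray_curtis(counts[, p[1]], counts[, p[2]]))

dist_table <- data.frame(sample1 = colnames(counts)[pairs[, 1]],
                         sample2 = colnames(counts)[pairs[, 2]],
                         geo_km  = geo_km,
                         eco     = eco)
dist_table
```

Plotting `eco` against `geo_km` would give a scatter plot of the same general kind; with random data no trend is expected.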