r library("knitr")
r opts_chunk$set(cache=FALSE, fig.width=9, message=FALSE, warning=FALSE)
library("MicrobiomeExperiment")
data(GlobalPatterns) data(esophagus) data(enterotype) data(soilrep)
data(GlobalPatterns)
GlobalPatterns
Table of component constructor functions for building component data objects
Function | Input Class | Output Description
--- | --- | ---
otu_table
| numeric matrix | otu_table
object storing OTU abundance
otu_table
| data.frame | otu_table
object storing OTU abundance
sample_data
| data.frame | sample_data
object storing sample variables
tax_table
| character matrix | taxonomyTable
object storing taxonomic identities
tax_table
| data.frame | taxonomyTable
object storing taxonomic identities
read_tree
| file path char | phylo-class tree, read from file
read.table
| table file path | A matrix or data.frame (Std R
core function)
---
MicrobiomeExperiment constructors: functions for building/merging MicrobiomeExperiment objects.
Function | Input Class | Output Description
--- | --- | ---
MicrobiomeExperiment
| Two or more component objects | MicrobiomeExperiment-class, experiment-level object
merge_MicrobiomeExperiment
| Two or more component or MicrobiomeExperiment-class objects | Combined instance of MicrobiomeExperiment-class
---
The following example illustrates using the constructor methods for component data tables.
otu1 <- otu_table(raw_abundance_matrix, taxa_are_rows=FALSE) sam1 <- sample_data(raw_sample_data.frame) tax1 <- tax_table(raw_taxonomy_matrix) tre1 <- read_tree(my_tree_file)
Once you've converted the data tables to their appropriate class, combining them into one object requires only one additional function call, MicrobiomeExperiment()
:
ex1b <- MicrobiomeExperiment(my_otu_table, my_sample_data, my_taxonomyTable, my_tree)
You do not need to have all four data types in the example above in order to combine them into one validity-checked experiment-level MicrobiomeExperiment-class object. The MicrobiomeExperiment()
method will detect which component data classes are present, and build accordingly. Downstream analysis methods will access the required components using MicrobiomeExperiment's accessors, and throw an error if something is missing. For most downstream methods you will only need to supply the combined, MicrobiomeExperiment-class object (the output of MicrobiomeExperiment()
), usually as the first argument.
ex1c <- MicrobiomeExperiment(my_otu_table, my_sample_data)
Whenever an instance of the MicrobiomeExperiment-class is created by MicrobiomeExperiment --- for example, when we use the import_qiime()
function to import data, or combine manually imported tables using MicrobiomeExperiment()
--- the row and column indices representing taxa or samples are internally checked/trimmed for compatibility, such that all component data describe exactly (and only) the same OTUs and samples.
The MicrobiomeExperiment project includes support for two complete different categories of merging.
merge_samples()
, merge_taxa()
merge_MicrobiomeExperiment()
Once you have a MicrobiomeExperiment object available, many accessor functions are available to query aspects of the data set. The function name and its purpose are summarized in the Accessor Functions Table.
Accessor functions for MicrobiomeExperiment objects.
Function | Returns
--- | ---
[
| Standard extraction operator. Works on otu_table
, sample_data
, and taxonomyTable
access
| General slot accessor function for MicrobiomeExperiment-package
get_taxa
| Abundance values of all taxa in sample i'
get_sample| Abundance values of taxa
i' for all samples
get_taxa_unique
| A unique vector of the observed taxa at a particular taxonomic rank
get_variable
| An individual sample variable vector/factor
nsamples
| Get the number of samples described by an object
ntaxa
| Get the number of OTUs (taxa) described by an object
otu_table
| Build or access otu_table objects
rank_names
| Get the names of the available taxonomic ranks
sample_data
| Build or access sample_data
objects
sample_names
| The names of all samples
taxa_names
| The names of all taxa
sample_sums
| The sum of the abundance values of each sample
sample_variables
| The names of sample variables
taxa_sums
| The sum of the abundance values of each taxa
taxa_are_rows
| TRUE
if taxa are row indices in otu_table
tax_table
| A taxonomy table
phy_tree
| Access the tree contained in a MicrobiomeExperiment object
---
Trimming high-throughput phylogenetic sequencing data can be useful, or even necessary, for certain types of analyses. However, it is important that the original data always be available for reference and reproducibility; and that the methods used for trimming be transparent to others, so they can perform the same trimming or filtering steps on the same or related data. To facilitate this, MicrobiomeExperiment contains many ways to trim/filter the data from a phylogenetic sequencing project. Because matching indices for taxa and samples is strictly enforced, subsetting one of the data components automatically subsets the corresponding indices from the others. Variables holding trimmed versions of your original data can be declared, and further trimmed, without losing track of the original data.
In general, most trimming should be accomplished using the S4 methods prune_taxa()
or prune_samples()
.
topN <- 20
For example, lets make a new object that only holds the most abundant r topN
taxa in the experiment. To accomplish this, we will use the prune_taxa()
function.
data(GlobalPatterns) most_abundant_taxa <- sort(taxa_sums(GlobalPatterns), TRUE)[1:topN] ex2 <- prune_taxa(names(most_abundant_taxa), GlobalPatterns)
Now we can ask the question, "what taxonomic Family are these OTUs?" (Subsetting still returns a taxonomyTable
object, which is summarized. We will need to convert to a vector)
topFamilies <- tax_table(ex2)[, "Family"] as(topFamilies, "vector")
The previous example was a relatively simple filtering in which we kept only the most abundant r topN
in the whole experiment. But what if we wanted to keep the most abundant r topN
taxa of each sample? And of those, keep only the taxa that are also found in at least one-third of our samples? What if we wanted to keep only those taxa that met some across-sample criteria?
For this more complicated filtering MicrobiomeExperiment contains a function, genefilter_sample
, that takes as an argument a MicrobiomeExperiment object, as well as a list of one or more filtering functions that will be applied to each sample in the abundance matrix (otu_table
), as well as an integer argument, A
, that specifies for how many samples the filtering function must return TRUE
for a particular taxa to avoid removal from the object. A supporting function filterfun_sample
is also included in MicrobiomeExperiment to facilitate creating a properly formatted function (enclosure) if more than one function is going to be applied simultaneously. genefilter_sample
returns a logical vector suitable for sending directly to prune_taxa
for the actual trimming.
Here is an example on a completely fabricated otu_table
called testOTU
.
testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1<- filterfun_sample(topk(2)) wh1 <- genefilter_sample(testOTU, f1, A=2) wh2 <- c(T, T, T, F, F) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
Here is a second example using the included dataset, GlobalPatterns
. The most abundant taxa are kept only if they are in the most abundant 10\% of taxa in at least half of the samples in dataset GlobalPatterns
. Note that it is not necessary to subset GlobalPatterns
in order to do this filtering. The S4 method prune_taxa
subsets each of the relavent component objects, and returns the complex object back.
data(GlobalPatterns) f1<- filterfun_sample(topp(0.1)) wh1 <- genefilter_sample(GlobalPatterns, f1, A=(1/2*nsamples(GlobalPatterns))) sum(wh1) ex2 <- prune_taxa(wh1, GlobalPatterns)
print(ex2)
If instead of the most abundant fraction of taxa, you are interested in the most abundant fraction of individuals (aka sequences, observations), then the topf
function is appropriate. For steep rank-abundance curves, topf
will seem to be much more conservative (trim more taxa) because it is based on the cumulative sum of relative abundance. It does not guarantee that a certain number or fraction of total taxa (richness) will be retained.
data(GlobalPatterns) f1<- filterfun_sample(topf(0.9)) wh1 <- genefilter_sample(GlobalPatterns, f1, A=(1/3*nsamples(GlobalPatterns))) sum(wh1) prune_taxa(wh1, GlobalPatterns)
The filter_taxa
function is directly analogous to the genefilter
function for microarray filtering, but is used for filtering OTUs from MicrobiomeExperiment objects. It applies an arbitrary set of functions -- as a function list, for instance, created by genefilter::filterfun
-- as across-sample criteria, one OTU at a time. It can be thought of as an extension of the genefilter-package (from the Bioconductor repository) for MicrobiomeExperiment objects. It takes as input a MicrobiomeExperiment object, and returns a logical vector indicating whether or not each OTU passed the criteria. Alternatively, if the prune
option is set to r FALSE
, it returns the already-trimmed version of the MicrobiomeExperiment object.
Inspect the following example. Note that the functions genefilter
and kOverA
are from the genefilter package.
data("enterotype") library("genefilter") flist<- filterfun(kOverA(5, 2e-05)) ent.logi <- filter_taxa(enterotype, flist) ent.trim <- filter_taxa(enterotype, flist, TRUE) identical(ent.trim, prune_taxa(ent.logi, enterotype)) identical(sum(ent.logi), ntaxa(ent.trim)) filter_taxa(enterotype, flist, TRUE)
It is possible to subset the samples in a MicrobiomeExperiment object based on the sample variables using the subset_samples()
function. For example to subset GlobalPatterns
such that only certain environments are retained, the following line is needed (the related tables are subsetted automatically as well):
ex3 <- subset_samples(GlobalPatterns, SampleType%in%c("Freshwater", "Ocean", "Freshwater (creek)")) ex3
For this example only a categorical variable is shown, but in principle a continuous variable could be specified and a logical expression provided just as for the subset
function. In fact, because sample_data
component objects are an extension of the data.frame class, they can also be subsetted with the subset
function:
subset(sample_data(GlobalPatterns), SampleType%in%c("Freshwater", "Ocean", "Freshwater (creek)"))
It is possible to subset by specific taxonomic category using the subset_taxa()
function. For example, if we wanted to subset GlobalPatterns
so that it only contains data regarding the phylum Firmicutes:
ex4 <- subset_taxa(GlobalPatterns, Phylum=="Firmicutes") ex4
Can also randomly subset, for example a random subset of 100 taxa from the full dataset.
randomSpecies100 <- sample(taxa_names(GlobalPatterns), 100, replace=FALSE) ex5 <- prune_taxa(randomSpecies100, GlobalPatterns)
Sample-wise transformation can be achieved with the transform_sample_counts()
function. It requires two arguments, (1) the MicrobiomeExperiment object that you want to transform, and the function that you want to use to perform the transformation. Any arbitrary function can be provided as the second argument, as long as it returns a numeric vector with the same length as its input. In the following trivial example, we create a second object, ex2
, that has been "transformed" by the identity function such that it is actually identical to GlobalPatterns
.
data(GlobalPatterns) ex2 <- transform_sample_counts(GlobalPatterns, I)
For certain kinds of analyis we may want to transform the abundance data. For example, for RDA we want to transform abundance counts to within-sample ranks, and to further include a threshold beyond which all taxa receive the same rank value. The ranking for each sample is performed independently, so that the rank of a particular taxa within a particular sample is not influenced by that sample's total quantity of sequencing relative to the other samples in the project.
The following example shows how to perform such a thresholded-rank transformation of the abundance table in the complex MicrobiomeExperiment object GlobalPatterns
with an arbitrary threshold of 500.
ex4<- transform_sample_counts(GlobalPatterns, threshrankfun(500))
Suppose we are skeptical about the importance of OTU-level distinctions in our dataset. For this scenario, MicrobiomeExperiment includes a taxonomic-agglommeration method,tax_glom()
, which merges taxa of the same taxonomic category for a user-specified taxonomic level. In the following code, we merge all taxa of the same Genus, and store that new object as ex6
.
ex6 <- tax_glom(GlobalPatterns, taxlevel="Genus")
Similarly, our original example object (GlobalPatterns
) also contains a phlyogenetic tree corresponding to each OTU, which we could also use as a means to merge taxa in our dataset that are closely related. In this case, we specify a threshold patristic distance. Taxa more closely related than this threshold are merged. This is especially useful when a dataset has many taxa that lack a taxonomic assignment at the level you want to investigate, a problem when using tax_glom()
. Note that for datasets with a large number of taxa, tax_glom
will be noticeably faster than tip_glom
. Also, keep in mind that tip_glom
requires that its first argument be an object that contains a tree, while tax_glom
instead requires a taxonomyTable
(See MicrobiomeExperiment classes).
ex7 <- tip_glom(GlobalPatterns, speciationMinLength = 0.05)
Command output not provided here to save time during compilation of the vignette. The user is encouraged to try this out on your dataset, or even this example, if interested. It may take a while to run on the full, untrimmed data.
Please check the MicrobiomeExperiment installation tutorial for help with installation. This is likely to be the first place news and updated information about installation will be posted, as well. Also check out the rest of the MicrobiomeExperiment homepage on GitHub, as this is the best place to post issues, bug reports, feature requests, contribute code, etc.
For running parallel implementation of functions/methods in MicrobiomeExperiment (e.g. UniFrac(GlobalPatterns, parallel=TRUE)
), you will need also to install a function for registering a parallel "backend". Only one working parallel backend is needed, but there are several options, and the best one will depend on the details of your particular system. The "doParallel" package is a good place to start. Any one of the following lines from an R
session will install a backend package.
install.packages("doParallel") install.packages("doMC") install.packages("doSNOW") install.packages("doMPI")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.