r library("knitr") r opts_chunk$set(cache=FALSE, fig.width=9, message=FALSE, warning=FALSE)

Load MicrobiomeExperiment

library("MicrobiomeExperiment")

Example Data

data(GlobalPatterns)
data(esophagus)
data(enterotype)
data(soilrep) 

MicrobiomeExperiment Object Summaries

data(GlobalPatterns)
GlobalPatterns

Convert raw data to MicrobiomeExperiment components

Table of component constructor functions for building component data objects


Function | Input Class | Output Description --- | --- | --- otu_table | numeric matrix | otu_table object storing OTU abundance otu_table | data.frame | otu_table object storing OTU abundance sample_data | data.frame | sample_data object storing sample variables tax_table | character matrix | taxonomyTable object storing taxonomic identities tax_table | data.frame | taxonomyTable object storing taxonomic identities read_tree | file path char | phylo-class tree, read from file read.table | table file path | A matrix or data.frame (Std R core function) ---

MicrobiomeExperiment constructors: functions for building/merging MicrobiomeExperiment objects.


Function | Input Class | Output Description --- | --- | --- MicrobiomeExperiment | Two or more component objects | MicrobiomeExperiment-class, experiment-level object merge_MicrobiomeExperiment| Two or more component or MicrobiomeExperiment-class objects | Combined instance of MicrobiomeExperiment-class ---

The following example illustrates using the constructor methods for component data tables.

otu1 <- otu_table(raw_abundance_matrix, taxa_are_rows=FALSE)
sam1 <- sample_data(raw_sample_data.frame) 
tax1 <- tax_table(raw_taxonomy_matrix)
tre1 <- read_tree(my_tree_file)

MicrobiomeExperiment() function: building complex MicrobiomeExperiment objects

Once you've converted the data tables to their appropriate class, combining them into one object requires only one additional function call, MicrobiomeExperiment():

ex1b <- MicrobiomeExperiment(my_otu_table, my_sample_data, my_taxonomyTable, my_tree)

You do not need to have all four data types in the example above in order to combine them into one validity-checked experiment-level MicrobiomeExperiment-class object. The MicrobiomeExperiment() method will detect which component data classes are present, and build accordingly. Downstream analysis methods will access the required components using MicrobiomeExperiment's accessors, and throw an error if something is missing. For most downstream methods you will only need to supply the combined, MicrobiomeExperiment-class object (the output of MicrobiomeExperiment() ), usually as the first argument.

ex1c <- MicrobiomeExperiment(my_otu_table, my_sample_data)

Whenever an instance of the MicrobiomeExperiment-class is created by MicrobiomeExperiment --- for example, when we use the import_qiime() function to import data, or combine manually imported tables using MicrobiomeExperiment() --- the row and column indices representing taxa or samples are internally checked/trimmed for compatibility, such that all component data describe exactly (and only) the same OTUs and samples.

Merge

The MicrobiomeExperiment project includes support for two complete different categories of merging.

Accessor functions

Once you have a MicrobiomeExperiment object available, many accessor functions are available to query aspects of the data set. The function name and its purpose are summarized in the Accessor Functions Table.

Accessor functions for MicrobiomeExperiment objects.


Function | Returns --- | --- [ | Standard extraction operator. Works on otu_table, sample_data, and taxonomyTable access | General slot accessor function for MicrobiomeExperiment-package get_taxa | Abundance values of all taxa in sample i'get_sample| Abundance values of taxai' for all samples get_taxa_unique | A unique vector of the observed taxa at a particular taxonomic rank get_variable | An individual sample variable vector/factor nsamples | Get the number of samples described by an object ntaxa | Get the number of OTUs (taxa) described by an object otu_table | Build or access otu_table objects rank_names | Get the names of the available taxonomic ranks sample_data | Build or access sample_data objects sample_names | The names of all samples taxa_names | The names of all taxa sample_sums | The sum of the abundance values of each sample sample_variables | The names of sample variables taxa_sums | The sum of the abundance values of each taxa taxa_are_rows | TRUE if taxa are row indices in otu_table tax_table | A taxonomy table phy_tree | Access the tree contained in a MicrobiomeExperiment object ---

Trimming, subsetting, filtering MicrobiomeExperiment data

Trimming: prune_taxa()

Trimming high-throughput phylogenetic sequencing data can be useful, or even necessary, for certain types of analyses. However, it is important that the original data always be available for reference and reproducibility; and that the methods used for trimming be transparent to others, so they can perform the same trimming or filtering steps on the same or related data. To facilitate this, MicrobiomeExperiment contains many ways to trim/filter the data from a phylogenetic sequencing project. Because matching indices for taxa and samples is strictly enforced, subsetting one of the data components automatically subsets the corresponding indices from the others. Variables holding trimmed versions of your original data can be declared, and further trimmed, without losing track of the original data.

In general, most trimming should be accomplished using the S4 methods prune_taxa() or prune_samples().

Simple filtering example

topN <- 20

For example, lets make a new object that only holds the most abundant r topN taxa in the experiment. To accomplish this, we will use the prune_taxa() function.

data(GlobalPatterns)
most_abundant_taxa <- sort(taxa_sums(GlobalPatterns), TRUE)[1:topN]
ex2 <- prune_taxa(names(most_abundant_taxa), GlobalPatterns)

Now we can ask the question, "what taxonomic Family are these OTUs?" (Subsetting still returns a taxonomyTable object, which is summarized. We will need to convert to a vector)

topFamilies <- tax_table(ex2)[, "Family"]
as(topFamilies, "vector")

Arbitrarily complex abundance filtering

The previous example was a relatively simple filtering in which we kept only the most abundant r topN in the whole experiment. But what if we wanted to keep the most abundant r topN taxa of each sample? And of those, keep only the taxa that are also found in at least one-third of our samples? What if we wanted to keep only those taxa that met some across-sample criteria?

genefilter_sample(): Filter by Within-Sample Criteria

For this more complicated filtering MicrobiomeExperiment contains a function, genefilter_sample, that takes as an argument a MicrobiomeExperiment object, as well as a list of one or more filtering functions that will be applied to each sample in the abundance matrix (otu_table), as well as an integer argument, A, that specifies for how many samples the filtering function must return TRUE for a particular taxa to avoid removal from the object. A supporting function filterfun_sample is also included in MicrobiomeExperiment to facilitate creating a properly formatted function (enclosure) if more than one function is going to be applied simultaneously. genefilter_sample returns a logical vector suitable for sending directly to prune_taxa for the actual trimming.

Here is an example on a completely fabricated otu_table called testOTU.

testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE)
f1<- filterfun_sample(topk(2))
wh1 <- genefilter_sample(testOTU, f1, A=2)
wh2 <- c(T, T, T, F, F)
prune_taxa(wh1, testOTU)
prune_taxa(wh2, testOTU)

Here is a second example using the included dataset, GlobalPatterns. The most abundant taxa are kept only if they are in the most abundant 10\% of taxa in at least half of the samples in dataset GlobalPatterns. Note that it is not necessary to subset GlobalPatterns in order to do this filtering. The S4 method prune_taxa subsets each of the relavent component objects, and returns the complex object back.

data(GlobalPatterns)
f1<- filterfun_sample(topp(0.1))
wh1 <- genefilter_sample(GlobalPatterns, f1, A=(1/2*nsamples(GlobalPatterns)))
sum(wh1)
ex2 <- prune_taxa(wh1, GlobalPatterns)
print(ex2)

If instead of the most abundant fraction of taxa, you are interested in the most abundant fraction of individuals (aka sequences, observations), then the topf function is appropriate. For steep rank-abundance curves, topf will seem to be much more conservative (trim more taxa) because it is based on the cumulative sum of relative abundance. It does not guarantee that a certain number or fraction of total taxa (richness) will be retained.

data(GlobalPatterns)
f1<- filterfun_sample(topf(0.9))
wh1 <- genefilter_sample(GlobalPatterns, f1, A=(1/3*nsamples(GlobalPatterns)))
sum(wh1)
prune_taxa(wh1, GlobalPatterns)

filter_taxa(): Filter by Across-Sample Criteria

The filter_taxa function is directly analogous to the genefilter function for microarray filtering, but is used for filtering OTUs from MicrobiomeExperiment objects. It applies an arbitrary set of functions -- as a function list, for instance, created by genefilter::filterfun -- as across-sample criteria, one OTU at a time. It can be thought of as an extension of the genefilter-package (from the Bioconductor repository) for MicrobiomeExperiment objects. It takes as input a MicrobiomeExperiment object, and returns a logical vector indicating whether or not each OTU passed the criteria. Alternatively, if the prune option is set to r FALSE, it returns the already-trimmed version of the MicrobiomeExperiment object.

Inspect the following example. Note that the functions genefilter and kOverA are from the genefilter package.

data("enterotype")
library("genefilter")
flist<- filterfun(kOverA(5, 2e-05))
ent.logi <- filter_taxa(enterotype, flist)
ent.trim <- filter_taxa(enterotype, flist, TRUE)
identical(ent.trim, prune_taxa(ent.logi, enterotype)) 
identical(sum(ent.logi), ntaxa(ent.trim))
filter_taxa(enterotype, flist, TRUE)

subset_samples(): Subset by Sample Variables

It is possible to subset the samples in a MicrobiomeExperiment object based on the sample variables using the subset_samples() function. For example to subset GlobalPatterns such that only certain environments are retained, the following line is needed (the related tables are subsetted automatically as well):

ex3 <- subset_samples(GlobalPatterns, SampleType%in%c("Freshwater", "Ocean", "Freshwater (creek)"))
ex3

For this example only a categorical variable is shown, but in principle a continuous variable could be specified and a logical expression provided just as for the subset function. In fact, because sample_data component objects are an extension of the data.frame class, they can also be subsetted with the subset function:

subset(sample_data(GlobalPatterns), SampleType%in%c("Freshwater", "Ocean", "Freshwater (creek)"))

subset_taxa(): subset by taxonomic categories

It is possible to subset by specific taxonomic category using the subset_taxa() function. For example, if we wanted to subset GlobalPatterns so that it only contains data regarding the phylum Firmicutes:

ex4 <- subset_taxa(GlobalPatterns, Phylum=="Firmicutes")
ex4

random subsample abundance data

Can also randomly subset, for example a random subset of 100 taxa from the full dataset.

randomSpecies100 <- sample(taxa_names(GlobalPatterns), 100, replace=FALSE)
ex5 <- prune_taxa(randomSpecies100, GlobalPatterns)

Transform abundance data

Sample-wise transformation can be achieved with the transform_sample_counts() function. It requires two arguments, (1) the MicrobiomeExperiment object that you want to transform, and the function that you want to use to perform the transformation. Any arbitrary function can be provided as the second argument, as long as it returns a numeric vector with the same length as its input. In the following trivial example, we create a second object, ex2, that has been "transformed" by the identity function such that it is actually identical to GlobalPatterns.

data(GlobalPatterns)
ex2 <- transform_sample_counts(GlobalPatterns, I)

For certain kinds of analyis we may want to transform the abundance data. For example, for RDA we want to transform abundance counts to within-sample ranks, and to further include a threshold beyond which all taxa receive the same rank value. The ranking for each sample is performed independently, so that the rank of a particular taxa within a particular sample is not influenced by that sample's total quantity of sequencing relative to the other samples in the project.

The following example shows how to perform such a thresholded-rank transformation of the abundance table in the complex MicrobiomeExperiment object GlobalPatterns with an arbitrary threshold of 500.

ex4<- transform_sample_counts(GlobalPatterns, threshrankfun(500))

Phylogenetic smoothing

tax_glom()

Suppose we are skeptical about the importance of OTU-level distinctions in our dataset. For this scenario, MicrobiomeExperiment includes a taxonomic-agglommeration method,tax_glom(), which merges taxa of the same taxonomic category for a user-specified taxonomic level. In the following code, we merge all taxa of the same Genus, and store that new object as ex6.

ex6 <- tax_glom(GlobalPatterns, taxlevel="Genus")

tip_glom()

Similarly, our original example object (GlobalPatterns) also contains a phlyogenetic tree corresponding to each OTU, which we could also use as a means to merge taxa in our dataset that are closely related. In this case, we specify a threshold patristic distance. Taxa more closely related than this threshold are merged. This is especially useful when a dataset has many taxa that lack a taxonomic assignment at the level you want to investigate, a problem when using tax_glom(). Note that for datasets with a large number of taxa, tax_glom will be noticeably faster than tip_glom. Also, keep in mind that tip_glom requires that its first argument be an object that contains a tree, while tax_glom instead requires a taxonomyTable (See MicrobiomeExperiment classes).

ex7 <- tip_glom(GlobalPatterns, speciationMinLength = 0.05)

Command output not provided here to save time during compilation of the vignette. The user is encouraged to try this out on your dataset, or even this example, if interested. It may take a while to run on the full, untrimmed data.

Installation

Installation

Please check the MicrobiomeExperiment installation tutorial for help with installation. This is likely to be the first place news and updated information about installation will be posted, as well. Also check out the rest of the MicrobiomeExperiment homepage on GitHub, as this is the best place to post issues, bug reports, feature requests, contribute code, etc.

Installing Parallel Backend

For running parallel implementation of functions/methods in MicrobiomeExperiment (e.g. UniFrac(GlobalPatterns, parallel=TRUE)), you will need also to install a function for registering a parallel "backend". Only one working parallel backend is needed, but there are several options, and the best one will depend on the details of your particular system. The "doParallel" package is a good place to start. Any one of the following lines from an R session will install a backend package.

install.packages("doParallel")
install.packages("doMC")
install.packages("doSNOW")
install.packages("doMPI")


HCBravoLab/MicrobiomeExperiment documentation built on May 17, 2019, 8:47 p.m.