knitr::opts_chunk$set(cache = FALSE,
                      fig.width = 9,
                      message = FALSE,
                      warning = FALSE)

mia implements tools for microbiome analysis based on the SummarizedExperiment [@SE], SingleCellExperiment [@SCE] and TreeSummarizedExperiment [@TSE] infrastructure. Data wrangling and analysis are the main scope of this package.

Installation

To install mia, install BiocManager first, if it is not installed. Afterwards use the install function from BiocManager.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mia")

Load mia

library("mia")

Loading a TreeSummarizedExperiment object

A few example datasets are available via mia. For this vignette the GlobalPatterns dataset is loaded first.

data(GlobalPatterns, package = "mia")
tse <- GlobalPatterns
tse

Functions for working with microbiome data

One of the main topics for analysing microbiome data is the application of taxonomic data to describe features measured. The interest lies in the connection between individual bacterial species and their relation to each other.

mia does not rely on a specific object type to hold taxonomic data, but uses specific columns in the rowData of a TreeSummarizedExperiment object. taxonomyRanks can be used to construct a character vector of available taxonomic levels. This can be used, for example, for subsetting.

# print the available taxonomic ranks
colnames(rowData(tse))
taxonomyRanks(tse)
# subset to taxonomic data only
rowData(tse)[,taxonomyRanks(tse)]

The columns are recognized case insensitive. Additional functions are available to check for validity of taxonomic information or generate labels based on the taxonomic information.

table(taxonomyRankEmpty(tse, "Species"))
head(getTaxonomyLabels(tse))

For more details see the man page ?taxonomyRanks.

Merging and agglomeration based on taxonomic information.

Agglomeration of data based on these taxonomic descriptors can be performed using functions implemented in mia. In addition to the aggValue functions provide by TreeSummarizedExperiment mergeFeaturesByRank is available. mergeFeaturesByRank does not require tree data to be present.

mergeFeaturesByRank constructs a factor to guide merging from the available taxonomic information. For more information on merging have a look at the man page via ?mergeFeatures.

# agglomerate at the Family taxonomic rank
x1 <- mergeFeaturesByRank(tse, rank = "Family")
## How many taxa before/after agglomeration?
nrow(tse)
nrow(x1)

Tree data can also be shrunk alongside agglomeration, but this is turned of by default.

# with agglomeration of the tree
x2 <- mergeFeaturesByRank(tse, rank = "Family",
                        agglomerateTree = TRUE)
nrow(x2) # same number of rows, but
rowTree(x1) # ... different
rowTree(x2) # ... tree

For mergeFeaturesByRank to work, taxonomic data must be present. Even though only one rank is available for the enterotype dataset, agglomeration can be performed effectively de-duplicating entries for the genus level.

data(enterotype, package = "mia")
taxonomyRanks(enterotype)
mergeFeaturesByRank(enterotype)

To keep data tidy, the agglomerated data can be stored as an alternative experiment in the object of origin. With this synchronized sample subsetting becomes very easy.

altExp(tse, "family") <- x2

Keep in mind, that if you set na.rm = TRUE, rows with NA or similar value (defined via the empty.fields argument) will be removed. Depending on these settings different number of rows will be returned.

x1 <- mergeFeaturesByRank(tse, rank = "Species", na.rm = TRUE)
altExp(tse,"species") <- mergeFeaturesByRank(tse, rank = "Species", na.rm = FALSE)
dim(x1)
dim(altExp(tse,"species"))

For convenience the function splitByRanks is available, which agglomerates data on all ranks selected. By default all available ranks will be used. The output is compatible to be stored as alternative experiments.

altExps(tse) <- splitByRanks(tse)
tse
altExpNames(tse)

Constructing a tree from taxonomic data

Constructing a taxonomic tree from taxonomic data stored in rowData is quite straightforward and uses mostly functions implemented in TreeSummarizedExperiment.

taxa <- rowData(altExp(tse,"Species"))[,taxonomyRanks(tse)]
taxa_res <- resolveLoop(as.data.frame(taxa))
taxa_tree <- toTree(data = taxa_res)
taxa_tree$tip.label <- getTaxonomyLabels(altExp(tse,"Species"))
rowNodeLab <- getTaxonomyLabels(altExp(tse,"Species"), make_unique = FALSE)
altExp(tse,"Species") <- changeTree(altExp(tse,"Species"),
                                   rowTree = taxa_tree,
                                   rowNodeLab = rowNodeLab)

Transformation of assay data

Transformation of count data stored in assays is also a main task when work with microbiome data. transformAssay can be used for this and offers a few choices of available transformations. A modified object is returned and the transformed counts are stored in a new assay.

assayNames(enterotype)
anterotype <- transformAssay(enterotype, method = "log10", pseudocount = 1)
assayNames(enterotype)

For more details have a look at the man page ?transformAssay.

Sub-sampling to equal number of counts per sample. Also known as rarefying.

data(GlobalPatterns, package = "mia")

tse.subsampled <- subsampleCounts(GlobalPatterns, 
                                  min_size = 60000, 
                                  name = "subsampled",
                                  replace = TRUE,
                                  seed = 1938)
tse.subsampled

Alternatively, one can save both original TreeSE and subsampled TreeSE within a MultiAssayExperiment object.

library(MultiAssayExperiment)
mae <- MultiAssayExperiment(c("originalTreeSE" = GlobalPatterns,
                              "subsampledTreeSE" = tse.subsampled))
mae
# To extract specifically the subsampled TreeSE
experiments(mae)$subsampledTreeSE

Community indices

In the field of microbiome ecology several indices to describe samples and community of samples are available. In this vignette we just want to give a very brief introduction.

Functions for calculating alpha and beta diversity indices are available. Using estimateDiversity multiple diversity indices are calculated by default and results are stored automatically in colData. Selected indices can be calculated individually by setting index = "shannon" for example.

tse <- estimateDiversity(tse)
colnames(colData(tse))[8:ncol(colData(tse))]

Beta diversity indices are used to describe inter-sample connections. Technically they are calculated as dist object and reduced dimensions can be extracted using cmdscale. This is wrapped up in the runMDS function of the scater package, but can be easily used to calculated beta diversity indices using the established functions from the vegan package or any other package using comparable inputs.

library(scater)
altExp(tse,"Genus") <- runMDS(altExp(tse,"Genus"), 
                              FUN = vegan::vegdist,
                              method = "bray",
                              name = "BrayCurtis", 
                              ncomponents = 5, 
                              assay.type = "counts", 
                              keep_dist = TRUE)

JSD and UniFrac are implemented in mia as well. calculateJSD can be used as a drop-in replacement in the example above (omit the method argument as well) to calculate the JSD. For calculating the UniFrac distance via calculateUniFrac either a TreeSummarizedExperiment must be used or a tree supplied via the tree argument. For more details see ?calculateJSD, ?calculateUnifrac or ?calculateDPCoA.

runMDS performs the decomposition. Alternatively runNMDS can also be used.

Other indices

estimateDominance and estimateEvenness implement other sample-wise indices. The function behave equivalently to estimateDiversity. For more information see the corresponding man pages.

Utility functions

To make migration and adoption as easy as possible several utility functions are available.

Data loading functions

Functions to load data from biom files, QIIME2 output, DADA2 objects [@dada2] or phyloseq objects are available.

library(phyloseq)
data(esophagus, package = "phyloseq")
esophagus
esophagus <- makeTreeSEFromPhyloseq(esophagus)
esophagus

For more details have a look at the man page, for examples ?makeTreeSEFromPhyloseq.

General wrapper functions

Row-wise or column-wise assay data subsetting.

# Specific taxa
assay(tse['522457',], "counts") |> head()
# Specific sample
assay(tse[,'CC1'], "counts") |> head()

Selecting most interesting features

getTopFeatures returns a vector of the most top abundant feature IDs.

data(esophagus, package = "mia")
top_taxa <- getTopFeatures(esophagus,
                       method = "mean",
                       top = 5,
                       assay.type = "counts")
top_taxa

Generating tidy data

To generate tidy data as used and required in most of the tidyverse, meltAssay can be used. A data.frame in the long format will be returned.

molten_data <- meltAssay(tse,
                         assay.type = "counts",
                         add_row_data = TRUE,
                         add_col_data = TRUE
)
molten_data

Session info

sessionInfo()

References



microbiome/mia documentation built on May 17, 2024, 2:18 a.m.