knitr::opts_chunk$set(echo = TRUE)
The R package MiRKATS contains functions to test an association between the microbiota (taxonomic profiles of a microbial community) and suvival outcomes via an kernel matrix. The kernel matrix is based upon distance or dissimilarity metrics designed to summarize ecological and/or microbial communities, possibly incorporating phylogenetic information.
CompQuadForm, survival, and GUniFrac packages are required for MiRKAT-S. All three required packages are available on CRAN (https://cran.r-project.org/web/packages/). In this vignette, we use the throat data from GUniFrac for demonstration. In our demonstration, kernels are constructed from the family of UniFrac distances and the Bray-Curtis dissimilarity; however, additional options for relevant kernels are mentioned.
We use the throat microbiome data from package GUniFrac to demonstrate the use of MiRKAT-S. These data were generated to study the effect of smoking on the microbiota of the upper respiratory tract in 60 individuals, 28 smokers and 32 nonsmokers. As such, the original dataset includes samples from the microbiome of the nasopharynx and oropharynx on each side of the body. The dataset used here consists of the oropharynx samples from the left side of the body.
Because the dataset has a binary phenotype (smoking) rather than a measure of censored time to event outcomes, we consider smoking status and gender as covariates and generate null outcome data from the Exponential distribution. Specifically, we generate survival times as $S \sim \text{Exponential}(1 + I(\text{smoke}) + I(\text{male}))$, and censoring times as $C \sim \text{Exponential}(0.75)$. Then the outcome measure consists of $T = \text{min}(S, C)$ and $\Delta = I(S \leq C)$. This simulation procedure results in approximately 33% censoring.
# Load the data library(MiRKATS, quietly=TRUE) library(GUniFrac, quietly=TRUE) data(throat.tree) data(throat.otu.tab) data(throat.meta) # Prepare covariates set.seed(1) Male = rep(0, nrow(throat.meta)) Male[throat.meta$Sex == "Male"] <- 1 Smoker = rep(0, nrow(throat.meta)) Smoker[throat.meta$SmokingStatus == "Smoker"] <- 1 # Simulate outcomes # Here, outcome is associated with covariates but unassociated with microbiota # 33% censoring SurvTime <- rexp(60, (1 + Smoker + Male)) CensTime <- rexp(60, 0.75) Delta <- as.numeric( SurvTime <= CensTime ) ObsTime <- pmin(SurvTime, CensTime)
Many options exist for distance or dissimilarity metrics. Below, we calculate the distances used in the manuscript: the unweighted and weighted UniFrac distances, the generalized UniFrac distance with $\alpha=0.5$, and the Bray-Curtis dissimilarity. Choice of distance metric (and therefore kernel) is discussed following the example.
unifracs <- GUniFrac(throat.otu.tab, throat.tree, alpha=c(0, 0.5, 1))$unifracs D.weighted <- unifracs[,,"d_1"] D.unweighted <- unifracs[,,"d_UW"] D.generalized <- unifracs[,,"d_0.5"] D.braycurtis <- as.matrix(vegdist(throat.otu.tab, method="bray"))
The D2K function in MiRKAT-S converts distance matrices to kernel matrices via [ K = -\frac{1}{2} \left(I - \frac{1 1'}{n}\right) D^2 \left(I - \frac{1 1'}{n}\right). ] Here, $I$ is the identity matrix and $1$ is an $n$-vector of ones. To ensure that $K$ is positive semi-definite, we replace negative eigenvalues with zero. That is, we perform an eigenvalue decomposition $K = U \Lambda U$, where $\Lambda = \text{diag}(\lambda_1,...,\lambda_n)$, and then reconstruct the kernel matrix using the nonnegative eigenvalues $\Lambda^ = \text{diag}(\text{max}(\lambda_1,0),...,\text{max}(\lambda_n,0))$ so that $K = U \Lambda^ U$.
Alternatively, a distance matrix can be used as input to MiRKATS (argument kd) with argument distance=TRUE; the distance will be automatically converted to a kernel matrix exactly as described above.
K.weighted <- D2K(D.weighted) K.unweighted <- D2K(D.unweighted) K.generalized <- D2K(D.generalized) K.braycurtis <- D2K(D.braycurtis)
Here, we run MiRKAT-S to test the association between the microbiota and simulated survival times, adjusting for gender and smoking status.
# use kernel matrix with distance=FALSE MiRKATS(kd = K.generalized, distance = FALSE, obstime = ObsTime, delta = Delta, covar = cbind(Male, Smoker)) # equivalently, use distance matrix with distance=TRUE MiRKATS(kd = D.generalized, distance = TRUE, obstime = ObsTime, delta = Delta, covar = cbind(Male, Smoker))
The argument "distance" indicates whether "kd" is a distance matrix (TRUE) or kernel matrix (FALSE). The output is the p-value for the test using Davies' exact method, which computes the p-value based on a mixture of chi-square distributions. We use a small-sample correction to account for the modest sample sizes and sparse OTU count matrices that often result from studies of the microbiome.
Because there are only 60 individuals in this dataset, permutation is a reasonable choice for p-value calculation, as MiRKAT-S may be slightly anti-conservative with very small samples. MiRKAT-S will generate a warning when permutation is not used for sample sizes $n \leq 50$. "nperm" indicates the number of permutations to perform to generate the p-value (default = 1000).
MiRKATS(kd = K.generalized, distance = FALSE, obstime = ObsTime, delta = Delta, covar = cbind(Male, Smoker), perm=TRUE, nperm=1000)
How to choose an appropriate distance matrix and kernel for testing is a difficult question. However, it is important, since the distance metric used to generate the kernel for MiRKAT-S strongly affects the power of the test. In particular, MiRKAT-S has highest power when the form of association between the microbiota and the outcome assumed by the kernel matches the true form of association.
In the case of the UniFrac families and the Bray-Curtis dissimilarity, the factors at play are (1) the abundance of the associated taxa and (2) whether closely related taxa (phylogenetically) tend to be related or not related to the outcome as a group. For example, the following are some of the distance metrics that have been used for studies of the microbiome:
tabl <- " | Distance | Phylogeny? | Abundance? | Other notes | Reference | |:---------------|:--------:|:--------:|---------------------------|:-----:| | Unweighted UniFrac | Yes | No | | 1 | | Weighted UniFrac | Yes | Yes | | 2 | | Generalized UniFrac | Yes | (Yes) | Parameter alpha defines extent to which abundance is taken into account | 3 | | Jaccard | No | No | 1 - (taxa in both)/(taxa in either); typically presence/absence, but can be extended to an abundance-weighted version | 4,5 | | Bray-Curtis | No | Yes | Similar to Jaccard, but uses counts | 6 | " cat(tabl) # output the table in a format good for HTML/PDF/docx conversion
In the table above, "Yes" indicates the distance or dissimilarity metric has the feature; "(Yes)" indicates that the feature is present either in some variations of the metric or is present to some extent; and "No" indicates that the feature is not present.
Depending on which of these characteristics are expected to be present in a particular study (based on prior knowledge or intuition), an appropriate distance or dissimilarity can be selected. If the study is exploratory and strong protection of type 1 error is not needed, several distance metrics can be explored. Depending on which one(s) are highly significant, some information can be gained about the nature of any association between the microbiota and the outcome.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.