Vignette for ZIBBSeqDiscovery"

Introduction

This vignette gives an example about using ZIBBSeqDiscovery to analysis microbiome counts data and evaluate the associate between phenotype of interest and the composition of the counts. ZIBBSeqDiscovery employs zero-inflated beta binomial to model the distribution of the microbiome counts data, and apply moment corrected correlation (MCC) approach to adjust the p values.

Data requirement

At a minimum, ZIBBSeqDiscovery requires a count data matrix, predictor (design) matrix, and zero inflation related matrix.

Notation and models

Let $Y = (y_{ij}) \in \mathbb{Z}^{m \times n}$ be the count matrix, and each element $y_{ij}$ represents the count of OTU $i$ in sample $j$ ($i = 1,\ldots,m, j=1,\ldots,n$). Let $s_j = \sum_{i=1}^m y_{ij}$ be the library size for sample $j$. We model the distribution of microbiome counts data using a zero-inflated beta binomial as,

$$f(y_{ij}|\alpha_{1ij}, \alpha_{2ij}, \pi_{ij}) = \pi_{ij} I_{y_{ij}=0} + (1 - \pi_{ij}) {\binom{s_j}{y_{ij}}} \frac{\mathrm{B}(y_{ij}+\alpha_{1ij},s_j-y_{ij}+\alpha_{2ij})}{\mathrm{B}(\alpha_{1ij},\alpha_{2ij})}$$

We can interpret the above modelling as: assume the count $y_{ij}$ comes from a mixture of zero and a beta binomial distribution. The beta-binomial model assumes that $y_{ij}$ follows a binomial distribution $Bin(s_j, \mu_{ij})$, and $\mu_{ij} \sim Beta(\alpha_{1ij}, \alpha_{2ij})$.

We will then re-parameterize $(\alpha_{1ij}, \alpha_{2ij})$ as: $\alpha_{1ij} = E(\mu_{ij}) (1 - \phi_i)/\phi_i$, and $\alpha_{2ij} = (1 - E(\mu_{ij})) (1 - \phi_i)/\phi_i$. The re-parameterized parameters $(E(\mu_{ij}), \phi_i)$ are easy to interpret: $E(\mu_{ij})$ is the expectation of probability that a single OTU count in sample $j$ maps to OTU $i$, and $\phi_i$ is the overdispersion compared to the binomial for OTU $i$.

Two link functions are employed in order to include the effects of covariates,

Data analysis example

Data source

We use the real kostic data set ([Kostic, 2012]) which has 2505 OTUs and 185 samples. The interested phenotype here is the health status for samples. To save computing time, we will only use the first 300 OTUs in the example below.

Demo example

First, let's load the package and data

rm(list = ls())
library(ZIBBSeqDiscovery)
data(kostic.x)
data(kostic.y)

We need to remove OTUs which has zero count across all samples.

kostic.x <- kostic.x[which(rowSums(kostic.x)>0),]
kostic.x <- kostic.x[1:300, ]

Now, let's construct the design matrix and zero inflation related matrix.

kostic.y <- cbind(1, kostic.y=="Tumor")
kostic.z <- cbind(1, log(colSums(kostic.x)))

We first fit the ZIBB model with free approach.

out.free <- fitZIBB(kostic.x, kostic.y, kostic.z, mode="free")

And then fit with constrained approaching using the estimation from free approach as initial values.

out.constrained <- fitZIBB(kostic.x, kostic.y, kostic.z, mode="constrained", 
                           gn=3, betastart=out.free$betahat, 
                           psi.start=out.free$psi, eta.start=out.free$zeroCoef)

Note it will take several minutes to run through the fitZIBB function depending on the configuration of your computer. Finally, use MCC method to adjust the p values (i.e. replace NAs in the p values by the results from MCC).

out.free.mcc <- mcc.adj(out.free, kostic.x, kostic.y, kostic.z, K=4)
out.constrained.mcc <- mcc.adj(out.constrained, kostic.x, kostic.y, kostic.z, K=4)

In order to check the effects of MCC adjustment, we count how many NAs in the reported p values before and after MCC adjustment. Note that NA appears mostly in the cases when the OTU counts are zero in most of the samples. So we check the cases such that OTU has 180, 181, 182, 183, and 184 zero counts across the 185 samples respectively.

kostic.x.0 <- rowSums(kostic.x==0)
for (i in 1:5) {
  idx <- kostic.x.0 == (185-i)
  if (i==1) {
    sum.df <- data.frame(zero.counts = 185-i, N = sum(idx), 
                         Number.NA.free = sum(is.na(out.free$p[idx])), 
                         Number.NA.constrained = sum(is.na(out.constrained$p[idx])), 
                         Number.NA.free.MCC = sum(is.na(out.free.mcc$p[idx])), 
                         Number.NA.constrained.MCC = sum(is.na(out.constrained.mcc$p[idx])))
  } else {
    sum.df <- rbind(sum.df, c(185-i, sum(idx), sum(is.na(out.free$p[idx])),
                              sum(is.na(out.constrained$p[idx])),
                              sum(is.na(out.free.mcc$p[idx])),
                              sum(is.na(out.constrained.mcc$p[idx]))))
  }
}
print.data.frame(sum.df, right = FALSE)

We should notice that the NAs are fewer after the MCC adjustment.

References

Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., Ojesina, A. I., Jung, J., Bass, A. J., Tabernero, J., et al. (2012). Genomic analysis identifies association of fusobacterium with colorectal carcinoma. Genome research 22, 292-298.



Try the ZIBBSeqDiscovery package in your browser

Any scripts or data that you put into this service are public.

ZIBBSeqDiscovery documentation built on May 2, 2019, 4:21 a.m.