knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The \texttt{countprop} package allows estimation of several types of proportionality metrics for count-basedcompositional data such as 16S, metagenomic, and single-cell sequencing data. The package includes functions that allow standard empirical estimates of these proportionality metrics, as well as estimates based on the multinomial logit-normal model.
First, we'll define the model. Assume $n$ samples and $J+1$ features. Suppose the counts for sample $i$ for feature $j$ are denoted by $y_{ij}$ for $i=1,\dots,n$ and $j=1,\dots,J+1$. They are modelled using the multinomial distribution: ```{=latex} \begin{align} y_i &\sim \text{Multinomial}(M_i; p_{i1},\dots,p_{i(J+1)}), \end{align}
where $M_i=\sum_{j=1}^{J+1} y_{ij}$ and proportion vector $\pb_i=(p_{i1},\dots,p_{i(J+1)})$. The proportions themselves are modelled using a logit-normal model, which can be formulated through a set of latent vectors $\left( w_{i1}, \dots, w_{iJ} \right)$ which are related to the proportions by: ```{=latex} \begin{align*} p_{ij} &= \alr^{-1}\lp\wb_i\rp \\ &= \begin{cases} \frac{\exp\lc w_{ij} \rc}{1+\sum_{j=1}^J \exp\lc w_{ij} \rc} & \mbox{if } j=1,\dots,J \\ \frac{1}{1+\sum_{j=1}^J \exp\lc w_{ij} \rc} & \mbox{if } j=J+1. \end{cases} \end{align*}
The latent vectors are distributed as multivariate normal: ```{=latex} \begin{align} \left( w_{i1}, \dots, w_{iJ} \right) &\sim \mbox{MV-Normal}_J \left( \mub, \Sigmab \right). \end{align}
The read-depths are assumed to be distributed as log-normal: ```{=latex} \begin{align*} M_i \sim \text{Log-Normal} \left(\mu_\ell, \sigma_\ell^2 \right). \end{align*}
Finally, to guard against spurious correlations, we apply the $L_1$-penalty to the inverse covariance matrix $\Sigmab$ (i.e.\ the ``graphical lasso" penalty). ```{=latex} \begin{align} \ell\lp w_{i1}, \dots, w_{iJ} \rp = \log\det \Sigmab^{-1} - \tr(S \Sigmab^{-1}) - \lambda || \Sigmab^{-1} ||_1 \end{align}
# Fitting the model The \texttt{countprop} package has a built-in function to estimate the model parameters. First, let's load the countprop library and look at the first few lines of the murine single cell sequencing dataset included with the package: ```r library(countprop) data(singlecell) head(singlecell, 2)
To fit the multinomial logit-normal model, we can use the \texttt{mleLR()} function:
mle <- mleLR(singlecell) # Maximum likelihood estimates of model parameters mle$mu mle$Sigma.inv
For the \texttt{mleLR()} function, it is necessary to specify a value for $\lambda$, which is the graphical lasso penalty parameter. The default is 0. However, we can also run multiple values of $\lambda$ to find which one leads to the best fit based on the Extended Bayesian Information Criterion (EBIC). To do this, we use the \texttt{mlePath()} function. This allows us to choose the number of $\lambda$ values we want to run the model on (\texttt{n.lambda} parameter). This can also be parallelized by setting \texttt{n.cores>1}. Once we've obtained the model fit, we can visualize the EBIC values for each $\lambda$ value using \texttt{ebicPlot()}.
mle2 <- mlePath(singlecell, n.lambda=10, n.cores=1) mle2$min.idx # Index of smallest lambda value # Plot EBIC for different lambda values ebicPlot(mle2)
In this case, the optimal value of $\lambda$ is the one in the first position of the lambda vector. When the optimal $\lambda$ value is the smallest one considered, then it's possible that an even smaller $\lambda$ value would be optimal and was not considered. In this case, the argument \texttt{lambda.min.ratio} can be reduced from its default of \texttt{0.1}:
mle3 <- mlePath(singlecell, n.lambda=10, lambda.min.ratio = 0.0001, n.cores=1) mle3$min.idx ebicPlot(mle3)
The minimum EBIC now corresponds to the 3rd smallest value of $\lambda$.
Once the model parameters have been estimated, the model-based proportionality metrics can be estimated:
# Variation matrix logitNormalVariation(mle3$est.min$mu, mle3$est.min$Sigma) # Phi matrix logitNormalVariation(mle3$est.min$mu, mle3$est.min$Sigma, type="phi") # Rho matrix logitNormalVariation(mle3$est.min$mu, mle3$est.min$Sigma, type="rho")
The package also provides the standard naive (empirical) estimates of the proportionality metrics.
# Naive (empirical) variation matrix naiveVariation(singlecell) # Naive (empirical) Phi matrix naiveVariation(singlecell, type="phi") # Naive (empirical) Rho matrix naiveVariation(singlecell, type="rho")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.