```r
knitr::opts_chunk$set(
  screenshot.force = FALSE,
  echo             = TRUE,
  rows.print       = 5,
  message          = FALSE,
  warning          = FALSE
)
```

This vignette documents the data format used in **PLNmodels** by `PLN` and its variants. It also shows how to create an object in the proper format for further analyses from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class objects.

We illustrate the format using the trichoptera data set, a full description of which can be found in the corresponding vignette.

```r
library(PLNmodels)
data(trichoptera)
```

The trichoptera data set is a list made of two data frames: `Abundance` (hereafter referred to as the *counts*) and `Covariate` (hereafter the *covariates*).

```r
str(trichoptera, max.level = 1)
```

The covariates include, among others, the wind, pressure and humidity.

```r
names(trichoptera$Covariate)
```

In the PLN framework, we model the counts from the covariates, let's say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called *formula interface* and it would thus be natural to write something like

```r
PLN(Abundance ~ Wind + Pressure, data = trichoptera)
```

Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically **multivariate**: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame).
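For intuition, note that base R's `lm()` already accepts a matrix response on the LHS of a formula; the toy sketch below (simulated `x` and `Y2`, unrelated to PLN's actual machinery) shows what a matrix LHS looks like in practice:

```r
## A matrix response on the LHS of a formula (base-R illustration only)
set.seed(1)
x  <- rnorm(10)
Y2 <- cbind(y1 = x + rnorm(10), y2 = 2 * x + rnorm(10))
coef(lm(Y2 ~ x))  ## one column of coefficients per response dimension
```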

We must therefore prepare a data structure where `Abundance` refers to a count *matrix* whereas `Wind` and `Pressure` refer to *vectors*, before feeding it to `PLN`. That's the purpose of `prepare_data`.

```r
trichoptera2 <- prepare_data(
  counts     = trichoptera$Abundance,
  covariates = trichoptera$Covariate
)
str(trichoptera2)
```

If you look carefully, you can notice a few differences between `trichoptera` and `trichoptera2`:

- the first is a `list` whereas the second is a `data.frame`[^1];
- `Abundance` is a matrix-column of `trichoptera2` that you can extract using the usual functions `[` and `[[` to retrieve the count matrix;
- `trichoptera2` has an additional `Offset` column (more on that later).
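For instance, the following sketch pulls the count matrix back out of the matrix-column (standard extraction, nothing specific to PLNmodels):

```r
## The Abundance matrix-column behaves like any other column
counts <- trichoptera2[["Abundance"]]  ## or trichoptera2$Abundance
class(counts)                          ## a matrix
dim(counts)                            ## samples x species
```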

## Offsets

It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets is still debated and may vary depending on the field. There are nevertheless a few popular methods:

- Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample;
- Cumulative Sum Scaling (CSS), introduced in [@CSS], where the offset of a sample is the cumulative sum of counts in that sample, up to a quantile determined in a data-driven way;
- Relative Log-Expression (RLE), implemented in [@DESeq2], where all samples are used to compute a reference sample, each sample is compared to the reference sample using log-ratios and the offset is the median log-ratio;
- Geometric Mean of Pairwise Ratios (GMPR), introduced in [@GMPR], where each sample is compared to every other sample to compute a median ratio and the offset of a sample is the geometric mean of those pairwise ratios;
- Wrench, introduced in [@Kumar2018] and fully implemented in the Wrench package, where all samples are used to compute reference proportions and each sample is compared to the reference using ratios (and **not log-ratios**) of proportions to compute compositional correction factors. In that case, the offset is the product of (geometrically centered) compositional factors and (geometrically centered) depths.

Each of these offsets can be computed from a count matrix using the `compute_offset` function, by changing its `offset` argument:

```r
## same as compute_offset(trichoptera$Abundance, offset = "TSS")
compute_offset(trichoptera$Abundance)
```

In this particular example, the counts are too sparse and the more sophisticated offset methods all fail (numeric output hidden):

compute_offset(trichoptera$Abundance, "CSS") compute_offset(trichoptera$Abundance, "RLE") compute_offset(trichoptera$Abundance, "GMPR")

We can mitigate this problem for the RLE offset by adding pseudocounts to the counts, although doing so has its own drawbacks.

compute_offset(trichoptera$Abundance, "RLE", pseudocounts = 1)

A better solution consists in using only positive counts to compute the offsets:

compute_offset(trichoptera$Abundance, "RLE", type = "poscounts")

Finally, we can use Wrench to compute the offsets:

compute_offset(trichoptera$Abundance, "Wrench")

**Note:** TSS is the only method that produces offsets on the same scale as the counts; all other methods produce offsets that are (hopefully) *proportional* to library sizes but on a different scale. To force the offsets to be on the same scale as the counts for all methods, you can use the option `scale = "count"`.

compute_offset(trichoptera$Abundance, "Wrench", scale = "count")

## `prepare_data`

We already learned that `prepare_data` can join counts and covariates into a single data frame. It can also compute offsets through `compute_offset` and does so by default with `offset = "TSS"`, hence the `Offset` column in `trichoptera2`. You can change the offset method and provide additional arguments that will be passed on to `compute_offset`.

```r
str(prepare_data(trichoptera$Abundance, trichoptera$Covariate,
                 offset = "RLE", pseudocounts = 1))
```

Different communities use different standards for count data, with samples stored either as rows or as columns of the count matrix. `prepare_data` uses heuristics to guess the orientation of the count matrix (or fails informatively when it cannot) and automatically transposes it if needed.
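As an illustration, feeding the transposed (species × samples) count table should give the same result, assuming the heuristic can tell the two dimensions apart on these data:

```r
## Same call with a transposed count table: prepare_data guesses the
## orientation and transposes back if needed
str(prepare_data(t(trichoptera$Abundance), trichoptera$Covariate))
```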

Finally, `prepare_data` enforces sample-consistency between the counts and the covariates and automatically trims away:

- samples for which only covariates or only counts are available;
- samples with no positive counts.

For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.

```r
nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample
                  trichoptera$Covariate[-49, ] ## remove last sample
                  ))
```

## `prepare_data_from_[phyloseq|biom]`

Community composition data are quite popular in microbial ecology and usually stored in flat files using the biom format and/or imported in R as phyloseq-class objects [@phyloseq] using the Bioconductor phyloseq package.

We show here how to import data from a biom file (or biom-class object) and from a phyloseq-class object.

Reading from a biom file requires the bioconductor package biomformat. This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a biom file using the following steps:

- read your biom file with `biomformat::read_biom()`;
- extract the count table with `biomformat::biom_data()`;
- extract the covariates with `biomformat::sample_metadata()` (or build your own);
- feed them to `prepare_data` as illustrated below:

```r
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager")
# }
# BiocManager::install("biomformat")
library(biomformat)
biomfile <- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat")
biom     <- biomformat::read_biom(biomfile)
## extract counts
counts <- as(biomformat::biom_data(biom), "matrix")
## extract covariates (or prepare your own)
covariates <- biomformat::sample_metadata(biom)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```

Likewise, preparing data from a phyloseq-class object requires the bioconductor package phyloseq. This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a phyloseq object using the following steps:

- extract the count table with `phyloseq::otu_table()`;
- extract the covariates with `phyloseq::sample_data()` (or build your own);
- feed them to `prepare_data` as illustrated below:

```r
## If phyloseq is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager")
# }
# BiocManager::install("phyloseq")
library(phyloseq)
data("enterotype")
## extract counts
counts <- as(phyloseq::otu_table(enterotype), "matrix")
## extract covariates (or prepare your own)
covariates <- phyloseq::sample_data(enterotype)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```

## Mathematical details

We detail here the mathematical background behind the various offsets and the way they are computed. Denote by $\mathbf{Y} = (Y_{ij})$ the count matrix, where $Y_{ij}$ is the count of species $j$ in sample $i$. Assume that there are $p$ species and $n$ samples in total. The offset of sample $i$ is denoted $O_i$ and computed in the following way.

### Total Sum Scaling (TSS)

Offsets are simply the total counts of a sample (frequently called depths in the metabarcoding literature):
$$
O_i = \sum_{j=1}^p Y_{ij}
$$
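In R this is a one-liner on the count matrix, sketched here on the trichoptera data:

```r
## TSS offsets are the row sums of the count matrix
head(rowSums(trichoptera$Abundance))
```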

### Cumulative Sum Scaling (CSS)

Positive counts are used to compute sample-specific quantiles $q_i^l$ and cumulative sums $s_i^l$, defined as
$$
q_i^l = \min \left\{ q \text{ such that } \sum_j 1_{Y_{ij} \leq q} \geq l \sum_j 1_{Y_{ij} > 0} \right\} \qquad s_i^l = \sum_{j: Y_{ij} \leq q_i^l} Y_{ij}
$$
The sample-specific quantiles are then used to compute reference quantiles, defined as $q^l = \text{median}_i \{ q_i^l \}$, and the median absolute deviation around the quantile $q^l$, defined as $d^l = \text{median}_i |q_i^l - q^l|$. The method then searches for the smallest quantile level $l$ at which it detects instability, defined as a large relative increase in the $d^l$. Formally, $\hat{l}$ is the smallest $l$ satisfying $\frac{d^{l+1} - d^l}{d^l} \geq 0.1$. The sample-specific scaling offsets are then chosen as:
$$
O_i = s_i^{\hat{l}} \Big/ \text{median}_i \{ s_i^{\hat{l}} \}
$$
Dividing by the median of the $s_i^{\hat{l}}$ ensures that offsets are centered around $1$ and capture size differences with respect to the reference sample. Note also that the reference quantiles $q^l$ can be computed using either the median (default, as in the original @CSS paper) or the mean, by specifying `reference = mean`, as implemented in `metagenomeSeq`.
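A simplified sketch of the two building blocks for a single level $l$, using R's default `quantile()` as an approximation of $q_i^l$ (the real implementation lives in `compute_offset(..., "CSS")`):

```r
## Approximate q_i^l and s_i^l for one sample y and one level l
css_blocks <- function(y, l) {
  q_l <- quantile(y[y > 0], probs = l)  ## sample-specific quantile (approx.)
  s_l <- sum(y[y <= q_l])               ## cumulative sum up to that quantile
  c(q = unname(q_l), s = s_l)
}
## one column of (q, s) values per sample, at level l = 0.5
apply(trichoptera$Abundance, 1, css_blocks, l = 0.5)
```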

### Relative Log Expression (RLE)

A reference sample $(q_j)_j$ is first built by computing the geometric mean of each species count:
$$
q_j = \exp \left( \frac{1}{n} \sum_{i} \log(Y_{ij}) \right)
$$
Each sample is then compared to the reference sample to compute one ratio per species, and the final offset $O_i$ is the median of those ratios:
$$
O_i = \text{median}_j \frac{Y_{ij}}{q_j}
$$
The method fails when no species is shared across all samples (as all $q_j$ are then $0$) or when a sample shares less than 50% of its species with the reference (in which case the median of the ratios may be null or infinite). The problem can be alleviated by adding pseudocounts to the $Y_{ij}$ with `pseudocounts = 1` or by using only positive counts in the computations (`type = "poscounts"`).
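These two formulas translate almost line for line into R; a minimal sketch, with a pseudocount of 1 added up front to avoid $\log(0)$ (mirroring `pseudocounts = 1`):

```r
## Bare-bones RLE: geometric-mean reference, then median ratio per sample
rle_offset <- function(Y) {
  log_ref <- colMeans(log(Y))  ## log of the reference q_j
  apply(Y, 1, function(y) median(exp(log(y) - log_ref)))  ## median_j Y_ij / q_j
}
rle_offset(trichoptera$Abundance + 1)  ## pseudocounts avoid log(0)
```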

### Geometric Mean of Pairwise Ratios (GMPR)

This method is similar to RLE but does not create a reference sample. Instead, each sample is compared to every other sample to compute a median ratio (similar to RLE), restricted to the species present in both samples:
$$
r_{ii'} = \text{median}_{j: Y_{ij} Y_{i'j} > 0} \frac{Y_{ij}}{Y_{i'j}}
$$
The offset of a sample is then the geometric mean of all its pairwise ratios $r_{ii'}$:
$$
O_i = \left( \prod_{i' \neq i} r_{ii'} \right)^{1/(n-1)}
$$
The method fails when there is only one sample in the data set or when a sample shares no species with any other.
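A bare-bones sketch of this computation (double loop, no attempt at efficiency; PLNmodels' own implementation is `compute_offset(..., "GMPR")`):

```r
## Median pairwise ratios over shared species, then geometric mean per sample
gmpr_offset <- function(Y) {
  Y <- as.matrix(Y)
  n <- nrow(Y)
  r <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (k in seq_len(n)) {
      if (i == k) next
      shared <- Y[i, ] > 0 & Y[k, ] > 0  ## species present in both samples
      if (any(shared)) r[i, k] <- median(Y[i, shared] / Y[k, shared])
    }
  }
  apply(r, 1, function(x) exp(mean(log(x), na.rm = TRUE)))  ## geometric mean
}
gmpr_offset(trichoptera$Abundance)
```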

### Wrench

This method is fully detailed in @Kumar2018 and we only provide a bare-bones implementation corresponding to the default parameters of `Wrench::wrench()`. Assume that samples belong to $K$ discrete groups and note $g_i$ the group of sample $i$. Wrench is based on the following (simplified) log-normal model for counts:
$$
Y_{ij} \sim \pi_{ij} \delta_0 + (1 - \pi_{ij})\log\mathcal{N}(\mu_{ij}, \sigma^2_j)
$$
where the $Y_{ij}$ are independent and the mean $\mu_{ij}$ is decomposed as:
$$
\mu_{ij} = \underbrace{\log{p_{0j}}}_{\text{log-ref. prop.}} + \underbrace{\log{d_i}}_{\text{log-depth}} + \underbrace{\log{\zeta_{0g_i}}}_{\text{log effect of group } g_i} + \underbrace{a_{i}}_{\text{(f|m)ixed effect}} + \underbrace{b_{ij}}_{\text{mixed effect}}
$$
where the random effects are independent centered Gaussians and the depth is the total sum of counts:
$$
\begin{align*}
d_i & = \sum_{j=1}^p Y_{ij} \\
b_{ij} & \sim \mathcal{N}(0, \eta^2_{g_i})
\end{align*}
$$

The **net** log fold change $\theta_{ij}$ of the **proportion ratio** $r_{ij} = Y_{ij} / (d_i \, p_{0j})$ of species $j$ relative to the reference is $\log(\theta_{ij}) \overset{\Delta}{=} \mathbb{E}[\log(r_{ij}) | a_i, b_{ij}] = \log{\zeta_{0g_i}} + a_i + b_{ij}$. We can decompose it as $\theta_{ij} = \Lambda_i^{-1} v_{ij}$ where $\Lambda_i^{-1}$ is the *compositional correction factor* and $v_{ij}$ is the fold change of **true abundances**.

With the above notations, the net fold change compounds both the fold change of true abundances and the compositional correction factor. Under the assumption that the $b_{ij}$ are centered, $\log(\hat{\Lambda}_i)$ can be estimated through a robust average of the $\hat{\theta}_{ij}$, which can themselves be computed from the log-ratios of proportions.

We detail here how the different parameters and/or effects are estimated.

- The reference proportions $p_{0j}$ are constructed as averages of the sample proportions $p_{ij}$, and the ratios are derived from both quantities:
$$
p_{ij} = \frac{Y_{ij}}{\sum_{j=1}^p Y_{ij}} \qquad p_{0j} = \frac{1}{n} \sum_{i=1}^n p_{ij} \qquad r_{ij} = \frac{p_{ij}}{p_{0j}}
$$
- The probabilities of absence $\pi_{ij}$ are estimated by fitting the following Bernoulli models:
$$
1_{\{Y_{ij} = 0\}} \sim \mathcal{B}(\pi_{j}^{d_i})
$$
and setting $\hat{\pi}_{ij} = \hat{\pi}_j^{d_i}$.
- The species variances $\sigma^2_j$ are estimated by fitting the following linear model (with no zero-inflation component):
$$
\log Y_{ij} \sim \log(d_i) + \mu_{g_i} + \mathcal{N}(0, \sigma^2_j)
$$
Note that in the original `Wrench::wrench()`, the log-depth $\log(d_i)$ is used as a predictor, but I believe it makes more sense to use it as an offset.
- Set the group proportions $p_{gj}$ and group ratios $r_{gj}$ to:
$$
p_{gj} = \frac{\sum_{i : g_i = g} Y_{ij}}{\sum_{j, i : g_i = g} Y_{ij}} \qquad r_{gj} = \frac{p_{gj}}{p_{0j}}
$$
- Estimate the location and dispersion parameters as:
$$
\hat{\zeta}_{0g} = \frac{\sum_{j=1}^p r_{gj}}{p} \qquad \log{r_{g.}} = \frac{\sum_{j: r_{gj} \neq 1} \log{r_{gj}}}{\sum_{j: r_{gj} \neq 0} 1} \qquad \hat{\eta}_{g}^2 = \frac{\sum_{j: r_{gj} \neq 1} \left(\log{r_{gj}} - \log{r_{g.}}\right)^2}{\sum_{j: r_{gj} \neq 0} 1}
$$
- Estimate the mixed effects as shrunken (and scaled) averages of the ratios:
$$
\hat{a}_i = \frac{\sum_{j=1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} \left(\log{r_{ij}} - \log{\hat{\zeta}_{0g_i}}\right)}{\sum_{j=1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j}} \qquad \hat{b}_{ij} = \frac{\hat{\eta}^2_{g_i}}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} \left( \log{r_{ij}} - \log\hat{\zeta}_{0g_i} - \hat{a}_i \right)
$$
- Estimate the regularized ratios as:
$$
\hat{\theta}_{ij} = \exp\left( \log\hat{\zeta}_{0g_i} + \hat{a}_i + \hat{b}_{ij} \right)
$$
- Estimate the compositional correction factors as (weighted) means of the regularized (and possibly corrected) ratios:
$$
\hat{\Lambda}_i = \begin{cases}
\sum_{j=1}^p \hat{\theta}_{ij} \Big/ p & \text{if type = "simple"} \\
\sum_{j=1}^p \hat{\theta}_{ij} e^{-\hat{\sigma}_j^2 / 2} / w_{ij} \Big/ \sum_{j=1}^p 1/w_{ij} & \text{if type = "wrench"}
\end{cases}
$$
where $w_{ij} = (1 - \hat{\pi}_{ij})(\hat{\pi}_{ij} + e^{\hat{\sigma}_j^2 + \hat{\eta}_i^2} - 1)$. The correction term $e^{\hat{\sigma}_j^2 / 2}$ arises from the relation $\mathbb{E}[r_{ij} | r_{ij} > 0] = \theta_{ij} e^{\sigma_j^2/2}$ and the weights $w_{ij}$ are marginal variances: $\mathbb{V}[r_{ij}] = w_{ij}$.

The offsets are then the product of (geometrically centered) compositional correction factors and (geometrically centered) depths:
$$
O_i = \frac{\hat{\Lambda}_i}{\left(\prod_{i = 1}^n \hat{\Lambda}_i\right)^{1/n}} \times \frac{d_i}{\left(\prod_{i = 1}^n d_i\right)^{1/n}}
$$
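This final combination step is easy to sketch in R on toy values (the `geom_centre()` helper and the numbers below are ours, for illustration only):

```r
## Geometrically centre each factor (divide by its geometric mean),
## then multiply, as in the formula above
geom_centre <- function(x) x / exp(mean(log(x)))
Lambda_hat  <- c(0.8, 1.1, 1.3)    ## toy compositional correction factors
depths      <- c(1000, 2500, 4000) ## toy depths
geom_centre(Lambda_hat) * geom_centre(depths)
```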

[^1]: although a `data.frame` is technically a `list`
