knitr::opts_chunk$set( screenshot.force = FALSE, echo = TRUE, rows.print = 5, message = FALSE, warning = FALSE)

This vignette documents the data format used in **PLNmodel** by `PLN`

and its variants. It also shows how to create an object in the proper format for further analyses from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class objects.

We illustrate the format using trichoptera data set, a full description of which can be found in the corresponding vignette.

library(PLNmodels) data(trichoptera)

The trichoptera data set is a list made of two data frames: `Abundance`

(hereafter referred to as the *counts*) and `Covariate`

(hereafter the *covariates*).

str(trichoptera, max.level = 1)

The covariates include, among others, the wind, pressure and humidity.

names(trichoptera$Covariate)

In the PLN framework, we model the counts from the covariates, let's say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called *formula interface* and it would thus be naturel to write something like

PLN(Abundance ~ Wind + Pressure, data = trichoptera)

Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically **multivariate**: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame).

We must therefore prepare a data structure where `Abundance`

refers to a count *matrix* whereas `Wind`

and `Pressure`

refer to *vectors* before feeding it to `PLN`

. That's the purpose of `prepare_data`

.

trichoptera2 <- prepare_data(counts = trichoptera$Abundance, covariates = trichoptera$Covariate) str(trichoptera2)

If you look carefully, you can notice a few difference between `trichoptera`

and `trichoptera2`

:

- the first is a
`list`

whereas the second is a`data.frame`

[^1]; `Abundance`

is a matrix-column of`trichoptera2`

that you can extract using the usual functions`[`

and`[[`

to retrieve the count matrix;`trichoptera2`

has an additional`Offset`

column (more on that later).

It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets in still debated and may vary depending on the field. There are nevertheless a few popular methods:

- Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample
- Cumulative Sum Scaling (CSS), introduced in [@CSS], where the offset of a sample if the cumulative sum of counts in that sample, up to a quantile determined in a data driven way.
- Relative Log-Expression (RLE), implemented in [@DESeq2], where all samples are used to compute a reference sample, each sample is compared to the reference sample using log-ratios and the offset is the median log-ratio.
- Geometric Mean of Pairwise Ratio (GMPR), introduced in [@GMPR] where each sample is compared to each other to compute a median log-ratio and the offset of a sample is the geometric means of those pairwise ratios.

Each of these offset be computed from a counts matrix using the `compute_offset`

function and changing its `offset`

argument:

## same as compute_offset(trichoptera$Abundance, offset = "TSS") compute_offset(trichoptera$Abundance)

In this particular example, the counts are too sparse and sophisticated offset methods all fail (numeric output hidden)

compute_offset(trichoptera$Abundance, "CSS") compute_offset(trichoptera$Abundance, "RLE") compute_offset(trichoptera$Abundance, "GMPR")

We can mitigate this problem for the RLE offset by adding pseudocounts to the counts although doing so has its own drawbacks.

compute_offset(trichoptera$Abundance, "RLE", pseudocounts = 1)

`prepare_data`

We'll already learned that `prepare_data`

can join counts and covariates into a single data.frame. It can also compute offset through `compute_offset`

and does so by default with `offset = "TSS"`

, hence the `Offset`

column in `trichoptera2`

. You can change the offset method and provide additional arguments that will passed on to `compute_offset`

.

str(prepare_data(trichoptera$Abundance, trichoptera$Covariate, offset = "RLE", pseudocounts = 1))

Different communities use different standard for the count data where samples are either or columns of the counts matrix. `prepare_data`

uses heuristics to guess the direction of the counts matrix (or fail informatively doing so) and automatically transpose it if needed.

Finally, `prepare_data`

enforces sample-consistency between the counts and the covariates and automatically trims away:
- samples for which only covariates or only counts are available;
- samples with no positive counts

For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.

nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample trichoptera$Covariate[-49,] ## remove last sample ))

`prepare_data_from_[phyloseq|biom]`

Community composition data are quite popular in microbial ecology and usually stored in flat files using the biom format and/or imported in R as phyloseq-class objects [@phyloseq] using the Bioconductor phyloseq package.

We show here how to import data from a biom file (or biom-class object) and form a phyloseq-class object.

Reading from a biom file requires the bioconductor package biomformat. This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a biom file using the following steps:

- read your biom file with
`biomformat::read_biom()`

- extract the count table with
`biomformat::biom_data()`

- extract the covariates with
`biomformat::sample_metadata()`

(or build your own) - feed them to
`prepare_data`

as illustrated below:

## If biomformat is not installed, uncomment the following lines # if (!requireNamespace("BiocManager", quietly = TRUE)) { # install.packages("BiocManager") # } # BiocManager::install("biomformat") library(biomformat) biomfile <- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat") biom <- biomformat::read_biom(biomfile) ## extract counts counts <- as(biomformat::biom_data(biom), "matrix") ## extract covariates (or prepare your own) covariates <- biomformat::sample_metadata(biom) ## prepare data my_data <- prepare_data(counts = counts, covariates = covariates) str(my_data)

Likewise, preparing data from a phyloseq-class object requires the bioconductor package phyloseq. This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a phyloseq object using the following steps:

- extract the count table with
`phyloseq::otu_table()`

- extract the covariates with
`phyloseq::sample_data()`

(or build your own) - feed them to
`prepare_data`

as illustrated below:

## If biomformat is not installed, uncomment the following lines # if (!requireNamespace("BiocManager", quietly = TRUE)) { # install.packages("BiocManager") # } # BiocManager::install("phyloseq") library(phyloseq) data("enterotype") ## extract counts counts <- as(phyloseq::otu_table(enterotype), "matrix") ## extract covariates (or prepare your own) covariates <- phyloseq::sample_data(enterotype) ## prepare data my_data <- prepare_data(counts = counts, covariates = covariates) str(my_data)

We detail here the mathematical background behind the various offsets and the way they are computed. Note $\mathbf{C} = (c_{ij})$ the counts matrix where $c_{ij}$ is the count of species $j$ in sample $i$. Assume that there are $J$ species in total. The offset of sample $i$ is noted $O_i$ and computed in the following way.

Offsets are simply the total counts of a sample: $$ O_i = \sum_{j=1}^J c_{ij} $$

Positive counts are used to compute sample-specific quantiles $q_i^l$ and cumulative sums $s_i^l$ defined as
$$
q_i^l = \min {q \text{ such that } \sum_j 1_{c_{ij} \leq q} \geq l \sum_j 1_{c_{ij} > 0} } \qquad s_i^l = \sum_{j: c_{ij} \leq q_i^l} c_{ij}
$$
The sample-specific quantiles are then used to compute reference quantiles defined as $q^l = \text{median} {q^i_l}$ and median average deviation around the quantile $q^l$ as $d^l = \text{median} |q_i^l - q^l|$. The method then searches for the smallest quantile $l$ for which it detects instability, defined as large relative increase in the $d^l$. Formally, $\hat{l}$ is the smallest $l$ satisfying $\frac{d^{l+1} - d^l}{d^l} \geq 0.1$. The scaling sample-specific offset are then chosen as:
$$
O_i = s_i^{\hat{l}} / \text{median}_i { s_i^{\hat{l}} }
$$
Dividing by the median of the $s_i^{\hat{l}}$ ensures that offsets are centered around $1$ and compare sizes differences with respect to the reference sample. Note also that the reference quantiles $q^l$ can be computed using either the median (default, as in the original @CSS paper) or the mean, by specifying `reference = mean`

, as implemented in `metagenomeseq`

.

A reference sample $(q_j)*j$ is first built by computing the geometric means of each species count:
$$
q_j = \exp \left( \frac{1}{n} \sum*{i} \log(c_{ij})\right)
$$
Each sample is then compared to the reference sample to compute one ratio per species and the final offset $O_i$ is the median of those ratios:
$$
O_i = \text{median}*j \frac{c*{ij}}{q_j}
$$
The method fails when no species is shared across all sample (as all $q_j$ are then $0$) or when a sample shares less than 50% of species with the reference (in which case the median of the ratios may be null or infinite). The problem can be alleviated by adding pseudocounts to the $c_{ij}$ with `pseudocounts = 1`

.

This method is similar to RLE but does create a reference sample. Instead, each sample is compared to each other to compute a median ratio (similar to RLE)
$$
r_{ii'} = {\text{median}}*{j: c*{ij}.c_{i'j} > 0} \frac{c_{ij}}{c_{i'j}}
$$
The offset is then taken as the median of all the $r_{ii'}$:
$$
O_i = \text{median}*{i' != i} r*{ii'}
$$
The method fails when there is only one sample in the data set or when a sample shares no species with any other.

[^1]: although a `data.frame`

is technically a `list`

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.