In mcalgaro93/benchdamic: Benchmark of differential abundance methods on microbiome data

knitr::opts_chunk$set(
    warning = FALSE, 
    error = FALSE, 
    message = FALSE)

Introduction

This vignette provides an introductory example of how to work with the analysis framework first proposed in [@calgaro].

The package is named benchdamic, acronym for "BENCHmarking of Differential Abundance detection methods for MICrobial data". Not only does the package structure allow the users to test a variety of commonly used methods for differential abundance analysis, but it also enables them to set benchmarks including custom methods on their datasets. Performances of each method are evaluated with respect to i) suitability of distributional assumptions, ii) ability to control false discoveries, iii) concordance of the findings, and iv) enrichment of differentially abundant microbial species in specific conditions. Each step of the assessment is flexible when it comes to the choice of differential abundance methods, their parameters, and input data types. Various graphic outputs lead the users to an informed decision when evaluating the most suitable method to use for their data.

Installation

To install this package, start R (version "4.2") and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("benchdamic")

or use:

if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("mcalgaro93/benchdamic")

Then, load some packages for basic functions and data:

library(benchdamic)
# Parallel computation
library(BiocParallel)
# Generate simulated data
library(SPsimSeq)
# Data management
library(phyloseq)
library(SummarizedExperiment)
library(plyr)
# Graphics and tables
library(ggplot2)
library(cowplot)
library(kableExtra)

Data loading

All datasets used in benchdamic are downloaded using the HMP16SData Bioconductor package [@HMP16SData].

GOF and TIEC analyses require a homogeneous group of samples (e.g., only samples from a specific experimental condition, phenotype, treatment, body site, etc.). For this purpose the ps_stool_16S dataset is used (use help("ps_stool_16S") for details). It contains 16S data from:

32 stool samples from participants of the Human Microbiome Project;
71 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 71 taxa corresponding to 71 genera).

data("ps_stool_16S")
ps_stool_16S

For Concordance and Enrichment analyses the ps_plaque_16S dataset is used (use help("ps_plaque_16S") for details). It contains 16S data from:

30 participants of the Human Microbiome Project;
samples collected from subgingival plaque and supragingival plaque for each subject (a total of 60 samples);
88 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 88 taxa corresponding to 88 genera).

data("ps_plaque_16S")
ps_plaque_16S

Goodness of Fit

Assumption: Many DA detection methods are based on parametric distributions.

Research question: Which are the parametric distributions that can fit both the proportion of zeros and the counts in your data?

GOF structure

As different methods rely on different statistical distributions to perform DA analysis, the goodness of fit (GOF) of the statistical models underlying some of the DA methods on a 16S dataset is assessed. For each model, its ability to correctly estimate the average counts and the proportion of zeroes by taxon is evaluated.

Five distributions are considered: (1) the negative binomial (NB) used in edgeR and DESeq2 [@edger; @deseq2], (2) the zero-inflated negative binomial (ZINB) used in ZINB-WaVE [@zinbwave], (3) the truncated Gaussian Hurdle model of MAST [@mast], (4) the zero-inflated Gaussian (ZIG) mixture model of metagenomeSeq [@zig], and (5) the Dirichlet-Multinomial (DM) distribution, which underlies the ALDEx2 Monte-Carlo sampling [@aldex2] and is the multivariate extension of the beta-binomial distribution used by corncob [@corncob].

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figGOF). To help with the reading: green boxes represent the inputs or the outputs, red boxes are the methods, and blue boxes are the main parameters of those methods.

knitr::include_graphics("./GOF_structure.svg")

Parametric distributions

Negative Binomial and Zero-Inflated Negative Binomial Models

For any $\mu \ge 0$ and $\theta > 0$, let $f_{NB}(\cdot;\mu,\theta)$ denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean $\mu$ and inverse dispersion parameter $\theta$, namely:$$ f_{NB}(y;\mu,\theta) = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)}\left(\frac{\theta}{\theta+\mu} \right)^\theta\left(\frac{\mu}{\mu+\theta} \right)^y, \forall y \in \mathbb{N} $$Note that another parametrization of the NB PMF is in terms of the dispersion parameter $\psi = \theta^{-1}$ (although $\theta$ is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is $\mu$ and its variance is:$$ \sigma^2 = \mu + \frac{\mu^2}{\theta} = \mu+\psi\mu^2 $$In particular, the NB distribution reduces to a Poisson distribution when $\psi=0 \iff \theta=+ \infty$.
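As a quick sanity check of this parametrization, the following base R sketch writes the NB PMF with mean $\mu$ and inverse dispersion $\theta$ (so that the success probability is $\theta/(\theta+\mu)$) and compares it with stats::dnbinom() called with size = theta and mu = mu. The helper name f_nb and the numeric values are illustrative assumptions, not part of benchdamic.

```r
# Illustrative helper (not a benchdamic function): NB PMF on the log scale
f_nb <- function(y, mu, theta) {
    exp(lgamma(y + theta) - lgamma(y + 1) - lgamma(theta) +
        theta * log(theta / (theta + mu)) +
        y * log(mu / (mu + theta)))
}

mu <- 3        # arbitrary mean
theta <- 1.5   # arbitrary inverse dispersion
y <- 0:2000    # wide enough support for the tail to be negligible

# Agreement with the built-in NB density in the (size, mu) parametrization
same_pmf <- all.equal(f_nb(y, mu, theta), dnbinom(y, size = theta, mu = mu))

# Numerical variance, to be compared with mu + mu^2 / theta
pmf <- f_nb(y, mu, theta)
var_numeric <- sum(y^2 * pmf) - sum(y * pmf)^2
var_formula <- mu + mu^2 / theta
```

Increasing theta in this sketch shrinks the extra term mu^2 / theta, which is the Poisson limit mentioned above.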

For any $\pi\in[0,1]$, let $f_{ZINB}(\cdot;\mu,\theta,\pi)$ be the PMF of the ZINB distribution given by:

$$ f_{ZINB}(y;\mu,\theta,\pi) = \pi\delta_0(y)+(1-\pi)f_{NB}(y;\mu,\theta), \forall y\in\mathbb{N} $$

where $\delta_0(\cdot)$ is the Dirac function. Here, $\pi$ can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.
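The mixture can be written in a few lines of base R. The sketch below is only an illustration of the ZINB definition (the helper f_zinb, the argument name zero_infl standing for $\pi$, and the parameter values are assumptions); it shows that the zero probability is inflated with respect to the plain NB while the PMF still sums to one.

```r
# Illustrative helper (not a benchdamic function): ZINB PMF as a mixture of a
# point mass at zero and an NB with mean mu and inverse dispersion theta
f_zinb <- function(y, mu, theta, zero_infl) {
    zero_infl * (y == 0) + (1 - zero_infl) * dnbinom(y, size = theta, mu = mu)
}

mu <- 3
theta <- 1.5
zero_infl <- 0.2   # plays the role of pi in the formula

p_zero_nb <- dnbinom(0, size = theta, mu = mu)
p_zero_zinb <- f_zinb(0, mu, theta, zero_infl)

# The mixture is still a proper PMF
total_mass <- sum(f_zinb(0:1000, mu, theta, zero_infl))
```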

To fit these distributions on real count data the edgeR [@edger] and zinbwave [@zinbwave] packages are used. In benchdamic they are implemented in the fitNB() and fitZINB() functions.

Zero-Inflated Gaussian Model

The raw count for sample j and feature i is denoted by $c_{ij}$. The zero-inflated model is defined for the continuity-corrected logarithm of the raw count data, $y_{ij} = log_2(c_{ij}+1)$, as a mixture of a point mass at zero $I_{0}(y)$ and a count distribution $f_{count}(y;\mu_i,\sigma^2_i)$, the density of a $N(\mu_i,\sigma^2_i)$. Given mixture parameters $\pi_j$, the density of the ZIG distribution for feature i, in sample j with $s_j$ total counts is: $$f_{ZIG}(y_{ij};s_j,\beta,\mu_i,\sigma^2_i) = \pi_j(s_j)\cdot I_{0}(y_{ij})+(1-\pi_j(s_j))\cdot f_{count}(y_{ij};\mu_i,\sigma^2_i)$$

The mean model is specified as:$$E(y_{ij})=\pi_{j} + (1-\pi_j)\cdot\left(b_{i0}+\eta_ilog_2\left( \frac{s_j^{\hat{l}}}{N}+1 \right) \right)$$

In this case, parameter $b_{i0}$ is the intercept of the model, while the term including the logged normalization factor $log_2\left(\frac{s_j^{\hat{l}}}{N}+1 \right)$ captures feature-specific normalization factors through parameter $\eta_i$. In detail, $s_j^{\hat{l}}$ is the median scaling factor resulting from the Cumulative Sum Scaling (CSS) normalization procedure. $N$ is a constant fixed by default at 1000, but it should be close to the scaling factors in order to act as a reference. For this reason, a good choice is the median of the scaling factors (which is used instead of 1000). The mixture parameters $\pi_j(s_j)$ are modeled as a binomial process:

$$log\frac{\pi_j}{1-\pi_j} = \beta_0+\beta_1\cdot log(s_j)$$

To fit this distribution on real count data the metagenomeSeq package [@zig] is used. In benchdamic it is implemented in the fitZIG() function.

Truncated Gaussian Hurdle Model

The original field of application of this method was the single-cell RNAseq data, where $y = log_2(TPM+1)$ expression matrix was modeled as a two-part generalized regression model [@mast]. In microbiome data that starting point translates to a $y_{ij} = log_2\left(counts_{ij}\cdot\frac{10^6}{libSize_{j}}+1 \right)$ or a $log_2\left(counts_{ij}\cdot\frac{ median(libSize)}{libSize_{j}}+1\right)$.
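The two starting points can be compared on a toy count matrix. This base R sketch only illustrates the formulas above (it is not the code used by fitHURDLE(), and all object names are made up for the example); taxa are on the rows and samples on the columns.

```r
set.seed(42)
# Toy count matrix: 5 taxa (rows) x 4 samples (columns)
counts <- matrix(rpois(5 * 4, lambda = 10), nrow = 5,
    dimnames = list(paste0("taxon", 1:5), paste0("sample", 1:4)))
lib_size <- colSums(counts)

# Relative abundances: each column j is divided by libSize_j
rel_abundance <- sweep(counts, 2, lib_size, "/")

# Counts per million, then log2 with continuity correction
y_cpm <- log2(rel_abundance * 1e6 + 1)
# Same idea, but rescaling to the median library size
y_median <- log2(rel_abundance * median(lib_size) + 1)
```

With small library sizes such as the ones in this toy example, the $10^6$ rescaling inflates every value well beyond the original count range, while the median rescaling keeps the transformed values close to $log_2(counts+1)$.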

The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.

Given normalized, possibly thresholded, abundance $y_{ij}$, the rate of presence and the level of abundance for the samples where the taxon is present are modeled as conditionally independent for each taxon $i$. Define the indicator $z_{ij}$, indicating whether taxon $i$ is present in sample $j$ (i.e., $z_{ij} = 0$ if $y_{ij} = 0$ and $z_{ij} = 1$ if $y_{ij} > 0$). We fit logistic regression models for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y|Z=1)$ independently, as follows:

$$ logit(Pr(Z_{ij}=1))=X_j\beta_i^D $$

$$ (Y_{ij}|Z_{ij}=1) \sim N(X_j\beta^C_i,\sigma^2_i)$$
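The two-part structure can be sketched for a single taxon with plain glm() and lm() calls. This is a simplified illustration on simulated data, not the MAST implementation (which adds its own regularization and testing machinery); the variable names and simulation settings are assumptions.

```r
set.seed(1)
n <- 60
group <- factor(rep(c("grp1", "grp2"), each = n / 2))

# Simulate one taxon: presence/absence, then a Gaussian level when present
present <- rbinom(n, size = 1, prob = ifelse(group == "grp1", 0.5, 0.8))
y <- present * rnorm(n, mean = ifelse(group == "grp1", 4, 5), sd = 1)
z <- as.integer(y > 0)

# Discrete part: logistic regression on the presence indicator
fit_discrete <- glm(z ~ group, family = binomial())
# Continuous part: Gaussian linear model on the samples where the taxon is present
fit_continuous <- lm(y ~ group, subset = z == 1)

# Estimated probability of a zero for each sample
p_zero_hat <- 1 - fitted(fit_discrete)
zero_rate_by_group <- tapply(z == 0, group, mean)
```

With a single grouping covariate, the fitted zero probability in each group coincides with the observed zero rate of that group, which is why a hurdle model reproduces the proportion of zeros by construction.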

To estimate this distribution on real count data the MAST package [@mast] is used. In benchdamic it is implemented in the fitHURDLE() function.

Dirichlet-Multinomial Mixture Model

The probability mass function of a $n$ dimensional multinomial sample $y = (y_1,...,y_n)^T$ with library size $libSize = \sum_{i=1}^ny_i$ and parameter $p=(p_1,...,p_n)$ is:

$$ f(y;p)= {libSize\choose y}\prod_{i=1}^np_i^{y_i} $$

The mean-variance structure of the MN model doesn't allow over-dispersion, which is common in real data. The DM distribution models the probability parameter $p$ in the MN model by a Dirichlet distribution. The probability mass of an n-category count vector $y$ over $libSize$ trials under DM with parameter $\alpha=(\alpha_1,...,\alpha_n)$, $\alpha_i>0$ and proportion vector $p \in \Delta_n=\{(p_1,...,p_n):p_i\ge0,\sum_ip_i=1 \}$ is:

$$ f(y|\alpha)={libSize\choose y}\frac{\prod_{i=1}^n(\alpha_i)_{y_i}}{(\sum_i\alpha_i)_{libSize}} $$

where $(a)_k = a(a+1)\cdots(a+k-1)$ denotes the rising factorial.

The mean value for the $i^{th}$ taxon and $j^{th}$ sample of the count matrix is given by $libSize_j\cdot \frac{\alpha_{ij}}{\sum_i \alpha_{ij}}$.
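The mean formula can be checked by simulation in base R. The sketch below draws DM counts in two steps (a Dirichlet vector obtained by normalizing independent gamma variables, then a multinomial draw) and compares the empirical means with $libSize\cdot\alpha_i/\sum_i\alpha_i$. The helper r_dm and the parameter values are illustrative assumptions; the MGLM package is not used here.

```r
set.seed(1)
alpha <- c(2, 5, 3)   # arbitrary DM parameters for 3 taxa
lib_size <- 1000

# Theoretical mean of each taxon
expected_mean <- lib_size * alpha / sum(alpha)

# Illustrative helper: one DM draw = Dirichlet proportions + multinomial counts
r_dm <- function(lib_size, alpha) {
    g <- rgamma(length(alpha), shape = alpha)
    as.vector(rmultinom(1, size = lib_size, prob = g / sum(g)))
}

# 5000 simulated samples: taxa on the rows, samples on the columns
draws <- replicate(5000, r_dm(lib_size, alpha))
empirical_mean <- rowMeans(draws)

# Every simulated sample has the requested library size
lib_size_ok <- all(colSums(draws) == lib_size)
```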

To estimate this distribution on real count data the MGLM package [@MGLM; @MGLMpackage] is used. In benchdamic it is implemented in the fitDM() function.

Comparing estimated and observed values

The goodness of fit for the previously described distributions is assessed comparing estimated and observed values. For each taxon the following measures are compared:

the Mean Difference (MD) i.e. the difference between the estimated mean and the observed mean abundance (log scale);
the Zero Probability Difference (ZPD) i.e. the difference between the estimated probability of observing a zero and the observed proportion of samples which have zero counts.

To easily compare estimated and observed mean values the natural logarithm transformation, with the continuity correction ($log(counts+1)$), is well suited, indeed it reduces the count range making the differences more stable.
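To make the two measures concrete, the sketch below computes MD and ZPD on simulated over-dispersed counts, using a Poisson model as a deliberately simple stand-in for the estimated distribution (Poisson is not one of the models fitted by benchdamic, and the column names Y and Y0 are chosen only for this illustration).

```r
set.seed(7)
n_taxa <- 20
n_samples <- 50
# Over-dispersed toy counts: taxa on the rows, samples on the columns
counts <- matrix(rnbinom(n_taxa * n_samples, mu = 5, size = 0.5),
    nrow = n_taxa)

# Observed values: log of the average count (+1) and proportion of zeros
observed <- data.frame(
    Y = log1p(rowMeans(counts)),
    Y0 = rowMeans(counts == 0))

# Estimated values under a Poisson model fitted taxon by taxon
mu_hat <- rowMeans(counts)
estimated <- data.frame(
    Y = log1p(mu_hat),
    Y0 = dpois(0, lambda = mu_hat))

# Differences: estimated minus observed
differences <- data.frame(
    MD = estimated$Y - observed$Y,
    ZPD = estimated$Y0 - observed$Y0)
```

In this toy setting the Poisson model matches the mean exactly (MD equal to zero) but, on average, underestimates the proportion of zeros (negative ZPD), which is the kind of lack of fit the two measures are meant to expose.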

Except for the fitHURDLE() function, which performs a CPM transformation on the counts (or the one with the median library size), and the fitZIG() function which models the $log_2(counts+1)$, the other methods, fitNB(), fitZINB(), and fitDM(), model the $counts$ directly. For these reasons, fitHURDLE()'s output should not be compared directly to the observed $log(counts+1)$ mean values as for the other methods. Instead, the logarithm of the observed CPM (or the one with the median library size) should be used.

Here is an example of how to fit a Truncated Gaussian hurdle model:

example_HURDLE <- fitHURDLE(
    object = ps_stool_16S,
    scale = "median"
)
head(example_HURDLE)

The values above are those estimated by the fitHURDLE() function. Some NA values could be present due to taxa sparsity. The observed counts are prepared for the comparison by the internally used prepareObserved() function, which needs the scale parameter when the HURDLE model is considered (if scale = "median", the median library size is used to scale counts instead of $10^6$):

observed_hurdle <- prepareObserved(
    object = ps_stool_16S, 
    scale = "median")
head(observed_hurdle)

These are different from the non-scaled observed values:

head(prepareObserved(object = ps_stool_16S))

The function to compute MD and ZPD values is meanDifferences():

head(meanDifferences(
    estimated = example_HURDLE,
    observed = observed_hurdle
))

A wrapper function to simultaneously perform the estimates and the mean differences is fitModels():

GOF_stool_16S <- fitModels(
    object = ps_stool_16S,
    models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
    scale_HURDLE = c("median", "default"),
    verbose = FALSE # TRUE is always suggested
)

Exploiting the internal structure of the fitModels()'s output the Root Mean Squared Error (RMSE) values for MD values can be extracted (the lower, the better):

plotRMSE(GOF_stool_16S, difference = "MD", plotIt = FALSE)

Similarly, they are extracted for ZPD values:

plotRMSE(GOF_stool_16S, difference = "ZPD", plotIt = FALSE)

Visualization

Mean Differences

To plot estimated and observed values the plotMD() function can be used (Figure \@ref(fig:plotGOFMD)). No systematic trend is expected; moreover, the closer the values are to the dotted line (representing equality between observed and estimated values), the better the goodness of fit of the model.

plotMD(
    data = GOF_stool_16S,
    difference = "MD",
    split = TRUE
)

If some warning messages are shown with this graph, they are likely due to sparse taxa. To address this, the number of NA values generated by each model can be investigated (which are 24 for each HURDLE model):

plyr::ldply(GOF_stool_16S, function(model) 
    c("Number of NAs" = sum(is.na(model))), 
    .id = "Distribution")

To summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is also displayed for each model. For the HURDLE_default model, a quite different range of values of mean differences is displayed because of the excessive default scaling proposed (1 million). It is also possible to plot only a subset of the estimated models (Figure \@ref(fig:plotGOFMDnoHurdleDefault)).

plotMD(
    data = GOF_stool_16S[1:5],
    difference = "MD",
    split = TRUE
)

From Figure \@ref(fig:plotGOFMDnoHurdleDefault), the DM distribution slightly overestimates the logarithm of the average counts for low values, while the HURDLE_median distribution presents an overestimation that increases as the observed values increase. The ZIG and, especially, the NB and ZINB distributions produce estimated values very similar to the observed ones. Similarly, to plot the differences for the Zero Probability/Proportion the plotMD() function is used (Figure \@ref(fig:plotGOFZPD)).

plotMD(
    data = GOF_stool_16S[1:5],
    difference = "ZPD",
    split = TRUE
)

From Figure \@ref(fig:plotGOFZPD), the ZIG and NB models underestimate the probability of observing a zero for sparse features, while the HURDLE_median model presents a perfect fit, as the probability of observing a zero is the zero rate itself by construction. The DM and ZINB models produce estimated values very similar to the observed ones. MDs and ZPDs are also available in Figure \@ref(fig:plotGOFMDcollapsed) with a different output layout:

plot_grid(plotMD(data = GOF_stool_16S[1:5], difference = "MD", split = FALSE),
    plotMD(data = GOF_stool_16S[1:5], difference = "ZPD", split = FALSE),
    ncol = 2
)

RMSE

As already mentioned, to summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is used. The summary statistics for the overall performance are visible in Figure \@ref(fig:plotGOFRMSE):

plot_grid(plotRMSE(GOF_stool_16S, difference = "MD"),
    plotRMSE(GOF_stool_16S, difference = "ZPD"),
    ncol = 2
)

The lower the RMSE value, the better the goodness of fit of the model.
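For reference, the RMSE of a vector of differences takes one line of base R. The sketch below is not the plotRMSE() internals; the helper rmse, the choice of dropping NA values, and the toy vectors are assumptions made for the example.

```r
# Illustrative helper: root mean squared error of a vector of differences.
# NA values (e.g., from very sparse taxa) are dropped, which is an assumption.
rmse <- function(differences) {
    sqrt(mean(differences^2, na.rm = TRUE))
}

# Toy mean differences for two hypothetical models
md_model_a <- c(0.10, -0.20, 0.05, NA)
md_model_b <- c(1.00, -1.50, 0.80, 0.90)

rmse_values <- c(model_a = rmse(md_model_a), model_b = rmse(md_model_b))
```

Here model_a would be preferred, since its differences are closer to zero on average.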

Discussion about GOF

The Goodness of Fit chapter is focused on some existing parametric models: NB, ZINB, HURDLE, ZIG, DM. The assumption of this analysis is that if a model estimates the data well, then a method based on that model may be a possibly good choice for studying the differential abundance. Other distributions could also be investigated (Poisson, Zero-Inflated Poisson...) but what about DA methods which are based on non-parametric models such as ANCOM? We can't use the GOF framework to compare the parametric models to non-parametric models. However, non-parametric methods may work well in real scenarios due to their added robustness and other evaluations are necessary in order not to favor one group of methods over another.

DA methods

Differential abundance analysis is the core of benchdamic. DA analysis steps can be performed either directly, using the DA_<name_of_the_method>() functions, or indirectly, using the set_<name_of_the_method>() functions.

set_<name_of_the_method>() functions allow users to create lists of instructions for DA methods which can be used by the runDA(), runMocks(), and runSplits() functions (more details in each chapter).

This framework grants a higher flexibility allowing users to set up the instructions for many DA methods only at the beginning of the analysis. If some modifications are needed, the users can re-set the methods or modify the list of instructions directly.

A list of the available methods is presented below (Table \@ref(tab:availableMethodsTable)). They are native to different application fields such as RNA-Seq, single-cell RNA-Seq, or Microbiome data analysis. Some basic information is reported for each DA method; for more details please refer to the manual of each function.

available_methods <- read.csv(file = "./benchdamic_methods.csv", sep = ";")
kable(x = available_methods, 
    caption = "DA methods available in benchdamic.",
    col.names = c("Method (package)", "Short description", "Test",
        "Normalization / Transformation", "Suggested input", 
        "Output", "Application"), booktabs = TRUE) %>%
    kable_styling(latex_options = "scale_down") %>%
    row_spec(0, bold = TRUE, color = "black") %>%
    column_spec(c(1,5,6), width = "3cm", color = "black") %>% 
    column_spec(2:4, width = "6cm", color = "black") %>%
    landscape()

Please remember that the data pre-processing, including QC analysis, filtering steps, and normalization, are not topics treated in benchdamic. In real life situations those steps precede the DA analysis and they are of extreme importance to obtain reliable results.

Some exceptions are present for the normalization step. In benchdamic, the norm_edgeR(), norm_DESeq2(), norm_CSS(), and norm_TSS() functions are implemented to add the normalization/scaling factors needed by DA methods to the phyloseq or TreeSummarizedExperiment objects. As for DA methods, a list of normalization instructions, including the previous functions, can be set using set_<normalization_name>() or setNormalizations(). To run the normalization instructions, the runNormalizations() function can be used (more examples will follow).

Many DA methods already contain options to normalize or transform counts. If more complex normalizations/transformations are needed, all the DA methods support the use of TreeSummarizedExperiment objects. In practice, users can put the modified count matrix in a named assay (the counts assay is the default one which contains the raw counts) and run the DA method on that assay using the parameter assay_name = "assay_to_use".

Add a custom DA method

To add a custom method to the benchmark, it must:

include a verbose = TRUE (or FALSE) parameter to let the user know what the method is doing;
return a pValMat matrix which contains the raw p-values and adjusted p-values in rawP and adjP columns respectively;
return a statInfo matrix which contains the summary statistics for each feature, such as the logFC, standard errors, test statistics and so on;
return a name which contains the complete name of the used method.

An example is proposed:

DA_yourMethod <- function(
        object, 
        assay_name = "counts",
        param1,
        param2,
        verbose = TRUE)
{
    if(verbose)
        message("Reading data")
    # Extract the data from phyloseq or TreeSummarizedExperiment
    counts_metadata <- get_counts_metadata(
        object = object, 
        assay_name = assay_name)
    counts <- counts_metadata[[1]] # First position = counts 
    metadata <- counts_metadata[[2]] # Second position = metadata

    ### your method's code 
    # Many things here
    if(verbose)
        message("I'm doing this step.")
    # Many other things here
    ### end of your method's code

    if(verbose)
        message("Extracting important statistics")
    # contains the p-values
    vector_of_pval <- NA 
    # contains the adjusted p-values
    vector_of_adjusted_pval <- NA 
    # contains the OTU, or ASV, or other feature names. 
    # Usually extracted from the rownames of the count data
    name_of_your_features <- NA 
    # contains the logFCs
    vector_of_logFC <- NA 
    # contains other statistics
    vector_of_statistics <- NA 

    if(verbose)
        message("Preparing the output")
    pValMat <- data.frame("rawP" = vector_of_pval,
                          "adjP" = vector_of_adjusted_pval)
    statInfo <- data.frame("logFC" = vector_of_logFC,
                           "statistics" = vector_of_statistics)
    name <- "write.here.the.name"
    # Be sure that the algorithm hasn't changed the order of the features. If it
    # happens, re-establish the original order.
    rownames(pValMat) <- rownames(statInfo) <- name_of_your_features

    # Return the output as a list
    return(list("pValMat" = pValMat, "statInfo" = statInfo, "name" = name))
} # END - function: DA_yourMethod

Once the custom method is set, it can be run by using the DA_yourMethod() function or manually, by setting a list of instructions of the custom method with the desired combination of parameters:

my_custom_method <- list(
    customMethod.1 = list( # First instance
        method = "DA_yourMethod", # The name of the function to call
        assay_name = "counts",
        param1 = "A", # Its combination of parameters
        param2 = "B"), # No need of verbose and object parameters
    customMethod.2 = list( # Second instance
        method = "DA_yourMethod", 
        assay_name = "counts",
        param1 = "C",
        param2 = "D")
    # Other instances
)

The method field, containing the name of the method to call, is mandatory, while the verbose parameter and the object are not needed.
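To see why the method field is enough to drive the analysis, the sketch below dispatches a list of instructions with do.call(). It is a self-contained toy and not how runDA(), runMocks(), or runSplits() are implemented: the function DA_toy, the runner run_instructions, and the plain list used as object (a count matrix plus a grouping vector instead of a phyloseq object) are all hypothetical.

```r
# Hypothetical toy method following the output contract described above
DA_toy <- function(object, test = "wilcox", verbose = TRUE) {
    if (verbose) message("Running DA_toy with test = ", test)
    logged <- log1p(object$counts)
    is_grp2 <- object$group == "grp2"
    pvals <- apply(logged, 1, function(x) {
        if (test == "t") {
            t.test(x[is_grp2], x[!is_grp2])$p.value
        } else {
            wilcox.test(x[is_grp2], x[!is_grp2], exact = FALSE)$p.value
        }
    })
    pValMat <- data.frame(rawP = pvals, adjP = p.adjust(pvals, method = "BH"))
    # Difference of mean log-counts, used here as a simple effect size
    statInfo <- data.frame(
        logFC = rowMeans(logged[, is_grp2]) - rowMeans(logged[, !is_grp2]))
    rownames(pValMat) <- rownames(statInfo) <- rownames(object$counts)
    list(pValMat = pValMat, statInfo = statInfo, name = paste0("toy.", test))
}

# Hypothetical runner: 'method' selects the function, the rest are arguments;
# object and verbose are supplied by the runner, not by the instructions
run_instructions <- function(instructions, object, verbose = FALSE) {
    lapply(instructions, function(instr) {
        args <- instr[names(instr) != "method"]
        do.call(instr$method, c(list(object = object, verbose = verbose), args))
    })
}

set.seed(10)
toy_object <- list(
    counts = matrix(rpois(6 * 10, lambda = 20), nrow = 6,
        dimnames = list(paste0("taxon", 1:6), NULL)),
    group = rep(c("grp1", "grp2"), each = 5))
toy_instructions <- list(
    toy.1 = list(method = "DA_toy", test = "wilcox"),
    toy.2 = list(method = "DA_toy", test = "t"))
results <- run_instructions(toy_instructions, toy_object)
```

Each element of results keeps the name of its instruction and contains the pValMat, statInfo, and name entries, so several parameter combinations of the same method can sit side by side.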

Type I Error Control

Assumption: Many DA methods do not control the number of false discoveries.

Research question: Which are the DA methods which can control the number of false positives in your data?

TIEC structure

The Type I Error is the probability of a statistical test to call a feature DA when it is not, under the null hypothesis. To evaluate the Type I Error rate Control (TIEC) for each differential abundance detection method:

using the createMocks() function, homogeneous samples (e.g., only the samples from one experimental group) are randomly assigned to a group ('grp1' or 'grp2');
DA methods are run to find differences between the two mock groups using runMocks();
the number of DA features for each method is counted, these are False Positives (FP) by construction;
steps 1-3 are repeated many times (N = 3, but at least 1000 is suggested) and the results are averaged using the createTIEC() function.

In this setting, the p-values of a perfect test should be uniformly distributed between 0 and 1 and the false positive rate (FPR or observed $\alpha$), which is the observed proportion of significant tests, should match the nominal value (e.g., $\alpha=0.05$).
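The logic of the procedure can be reproduced on simulated data with base R only. The sketch below is an illustration of the idea, not a replacement for createMocks(), runMocks(), and createTIEC(): it assumes a balanced random assignment to the two mock groups, uses a Welch t-test on log-transformed counts as the only DA method, and all object names are made up.

```r
set.seed(123)
n_taxa <- 50
n_samples <- 32
n_mocks <- 20      # kept small for speed, many more would be needed in practice
alpha <- 0.05

# Homogeneous toy samples: no true difference exists between any two groups
counts <- matrix(rnbinom(n_taxa * n_samples, mu = 20, size = 1),
    nrow = n_taxa)
logged <- log1p(counts)

fpr_per_mock <- vapply(seq_len(n_mocks), function(i) {
    # Step 1: random (balanced) assignment of the samples to two mock groups
    grp <- sample(rep(c("grp1", "grp2"), length.out = n_samples))
    # Step 2: one test per taxon between the mock groups
    p <- apply(logged, 1, function(x)
        t.test(x[grp == "grp1"], x[grp == "grp2"])$p.value)
    # Step 3: every significant taxon is a false positive by construction
    mean(p < alpha)
}, numeric(1))

# Average over the mock comparisons: the observed alpha
observed_alpha <- mean(fpr_per_mock)
```

A method that controls the type I error should give an observed_alpha close to the nominal 0.05; values well above it indicate a liberal test, values well below it a conservative one.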

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figTIEC).

knitr::include_graphics("./TIEC_structure.svg")

Create mock comparisons

Using createMocks() function, samples are randomly grouped, N = 3 times. A higher N is suggested (at least 1000) but in that case a longer running time is required.

set.seed(123)
my_mocks <- createMocks(
    nsamples = phyloseq::nsamples(ps_stool_16S),
    N = 3
) # At least N = 1000 is suggested

Set up normalizations and DA methods

Once the mocks have been generated, DA analysis is performed. Firstly, some normalization factors, such as TMM from edgeR and CSS from metagenomeSeq, and some size factors such as poscounts from DESeq2 are added to the phyloseq object (or TreeSummarizedExperiment object). This can be done, manually, using the norm_edgeR(), norm_DESeq2(), and norm_CSS() methods:

ps_stool_16S <- norm_edgeR(
    object = ps_stool_16S,
    method = "TMM"
)
ps_stool_16S <- norm_DESeq2(
    object = ps_stool_16S,
    method = "poscounts"
)
ps_stool_16S <- norm_CSS(
    object = ps_stool_16S,
    method = "CSS"
)

Or automatically, using the setNormalizations() and runNormalizations() methods:

my_normalizations <- setNormalizations(
    fun = c("norm_edgeR", "norm_DESeq2", "norm_CSS"), 
    method = c("TMM", "poscounts", "CSS"))
ps_stool_16S <- runNormalizations(normalization_list = my_normalizations, 
    object = ps_stool_16S, verbose = TRUE)

Some messages "Found more than one "phylo" class in cache..." could be shown after running the previous functions. They are caused by duplicated class names between phyloseq and tidytree packages and can be ignored.

After the normalization/size factors have been added to the phyloseq or TreeSummarizedExperiment object, the user could decide to filter rare taxa which do not carry much information. In this example vignette a simple filter is applied to keep only features with a count in at least 3 samples:

ps_stool_16S <- phyloseq::filter_taxa(
    physeq = ps_stool_16S, 
    flist = function(x) sum(x > 0) >= 3, prune = TRUE)
ps_stool_16S
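# The same prevalence rule can be written for a plain count matrix. This is
# only an illustrative base R equivalent on toy data (taxa on the rows); the
# object names are assumptions and no phyloseq object is involved.

```r
set.seed(3)
# Sparse toy counts: 8 taxa (rows) x 6 samples (columns)
counts <- matrix(rnbinom(8 * 6, mu = 1, size = 0.3), nrow = 8,
    dimnames = list(paste0("taxon", 1:8), paste0("sample", 1:6)))

# Keep the taxa with a non-zero count in at least 3 samples
keep <- rowSums(counts > 0) >= 3
filtered <- counts[keep, , drop = FALSE]
```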

Some zero-inflated negative binomial weights using the weights_ZINB() function are computed. They can be used as observational weights in the generalized linear model frameworks of DA_edgeR(), DA_DESeq2(), and DA_limma(), as described in [@zinbweights].

zinbweights <- weights_ZINB(
    object = ps_stool_16S,
    K = 0,
    design = "~ 1"
)

For each mock comparison (i.e., each row of the my_mocks object), a set of DA methods is run. In this demonstrative example the following DA methods are used:

basic t and wilcox tests;
edgeR with TMM scaling factors [@edger] with and without ZINB weights [@zinbwave; @zinbweights];
DESeq2 with poscounts normalization factors [@deseq2] with and without ZINB weights [@zinbwave; @zinbweights];
limma-voom with TMM scaling factors [@limma; @voom; @limmarnaseq] with and without ZINB weights [@zinbwave; @zinbweights];
ALDEx2 with all and iqlr data transformation (denom parameter) performing the wilcox test [@aldex2];
metagenomeSeq with CSS normalization factors using both the fitFeatureModel (for a zero-inflated log-normal distribution, mixture model, as suggested in the package vignette) and the fitZig (for a zero-inflated gaussian distribution, mixture model) algorithms [@zig];
corncob with a focus on average differences (not dispersion, regulated by phi.formula and phi.formula_null parameters) using both Wald and LRT tests [@corncob];
MAST with both rescalings, default (i.e. $10^6$, for CPMs) and median [@mast];
Seurat with LogNormalize and CLR normalization/transformations, t and wilcox tests, and $10^5$ as scaling factor [@seurat];
ANCOMBC2 based on ANCOM-II preprocessing of zero counts but with the addition of a linear regression framework and sampling fraction bias correction (BC parameter) [@ancom-bc; @ancom-ii];
dearseq with asymptotic test [@dearseq];
linda with winsorization for the outliers and pseudo-count addition to handle zeros [@linda];
Maaslin2 with TSS normalization, LOG transformation, and LM analysis method [@maaslin2];
ZicoSeq with default parameters [@ZicoSeq].

Among the available methods, NOISeq [@noiseq] has not been used since it does not return p-values but only adjusted ones. Similarly, mixMC [@mixMC] has not been used since it does not return p-values. Many combinations of parameters are still possible for all the methods.

my_basic <- set_basic(pseudo_count = FALSE,
    contrast = c("group", "grp2", "grp1"), 
    test = c("t", "wilcox"), 
    paired = FALSE, 
    expand = TRUE)

my_edgeR <- set_edgeR(
    pseudo_count = FALSE,
    group_name = "group",
    design = ~ group,
    robust = FALSE,
    coef = 2,
    norm = "TMM",
    weights_logical = c(TRUE, FALSE),
    expand = TRUE)

my_DESeq2 <- set_DESeq2(
    pseudo_count = FALSE,
    design = ~ group,
    contrast = c("group", "grp2", "grp1"),
    norm = "poscounts",
    weights_logical = c(TRUE, FALSE),
    alpha = 0.05,
    expand = TRUE)

my_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ group,
    coef = 2,
    norm = "TMM",
    weights_logical = c(FALSE, TRUE),
    expand = TRUE)

my_ALDEx2 <- set_ALDEx2(
    pseudo_count = FALSE,
    design = "group",
    mc.samples = 128,
    test = "wilcox",
    paired.test = FALSE, 
    denom = c("all", "iqlr"), 
    contrast = c("group", "grp2", "grp1"),
    expand = TRUE)

my_metagenomeSeq <- set_metagenomeSeq(
    pseudo_count = FALSE, 
    design = "~ group", 
    coef = "groupgrp2", 
    norm = "CSS", 
    model = c("fitFeatureModel", "fitZig"),
    expand = TRUE)

my_corncob <- set_corncob(
    pseudo_count = FALSE,
    formula = ~ group,
    formula_null = ~ 1,
    phi.formula = ~ group,
    phi.formula_null = ~ group,
    test = c("Wald", "LRT"),
    boot = FALSE,
    coefficient = "groupgrp2")

my_MAST <- set_MAST(
    pseudo_count = FALSE,
    rescale = c("default", "median"),
    design = "~ 1 + group",
    coefficient = "groupgrp2",
    expand = TRUE)

my_Seurat <- set_Seurat(
    pseudo_count = FALSE,
    test = c("t", "wilcox"),
    contrast = c("group", "grp2", "grp1"),
    norm = c("LogNormalize", "CLR"),
    scale.factor = 10^5,
    expand = TRUE
)

my_ANCOM <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "group", 
    contrast = c("group", "grp2", "grp1"),
    BC = TRUE, 
    expand = TRUE   
)

my_dearseq <- set_dearseq(
    pseudo_count = FALSE,covariates = NULL,
    variables2test = "group", 
    preprocessed = FALSE, 
    test = "asymptotic",
    expand = TRUE)

my_linda <- set_linda(
    formula = "~ group", 
    contrast = c("group", "grp2", "grp1"),
    is.winsor = TRUE,
    zero.handling = "pseudo-count",
    alpha = 0.05,
    expand = TRUE)

my_Maaslin2 <- set_Maaslin2(
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = "group",
    contrast = c("group", "grp2", "grp1"),
    expand = TRUE)

my_ZicoSeq <- set_ZicoSeq(contrast = c("group", "grp2", "grp1"),
    feature.dat.type = "count", 
    is.winsor = TRUE, outlier.pct = 0.03, winsor.end = "top",
    is.post.sample = TRUE, post.sample.no = 25, perm.no = 99,
    ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, 
    link.func = list(function(x) sign(x) * (abs(x))^0.5))

my_methods <- c(my_basic, my_edgeR, my_DESeq2, 
    my_limma, my_metagenomeSeq, my_corncob, 
    my_ALDEx2, my_MAST, my_Seurat, my_ANCOM, my_dearseq,
    my_linda, my_Maaslin2, my_ZicoSeq)

After concatenating all the DA instructions, they are run on the mock comparisons using the runMocks() function:

bpparam <- BiocParallel::SerialParam()
# Random grouping each time
Stool_16S_mockDA <- runMocks(
    mocks = my_mocks, 
    method_list = my_methods, 
    object = ps_stool_16S, 
    weights = zinbweights, 
    verbose = FALSE, 
    BPPARAM = bpparam)

If some warnings are reported, verbose = TRUE can be used to obtain the method name and the mock comparison where the warnings occurred.

The structure of the output in this example is the following:

Comparison1 to Comparison3 on the first level, which contains:
- Methods' output lists on the second level:
  - pValMat which contains the matrix of raw p-values and adjusted p-values in rawP and adjP columns respectively;
  - statInfo which contains the matrix of summary statistics for each feature, such as the logFC, standard errors, test statistics and so on;
  - dispEsts which contains the dispersion estimates for methods like edgeR and DESeq2;
  - name which contains the complete name of the used method.

The list of methods can be run in parallel leveraging the BiocParallel package. In detail, parallelization is supported through the MulticoreParam() function as long as ANCOM-based functions are run on a single core (n_cl = 1 parameter) due to a different parallelization management of those functions.

# Example of a run without ancom-based methods
ancom_index <- which(grepl(pattern = "ANCOM", 
    names(my_methods)))
bpparam = BiocParallel::MulticoreParam()
Stool_16S_mockDA <- runMocks(
    mocks = my_mocks, 
    method_list = my_methods[-ancom_index], 
    object = ps_stool_16S, 
    weights = zinbweights, 
    verbose = FALSE, 
    BPPARAM = bpparam)

ANCOM-based methods are usually the most time-consuming. Parallel computing is still possible as long as it is managed directly by those methods (n_cl parameter). In the following example, each mock dataset is analyzed in serial mode but ANCOM is run on more than one core.

# Modify the n_cl parameter
my_ANCOM_parallel <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "group", 
    contrast = c("group", "grp2", "grp1"),
    BC = TRUE, 
    n_cl = 2, # Set this number according to your machine
    expand = TRUE   
)

bpparam = BiocParallel::SerialParam()
Stool_16S_mockDA_ANCOM <- runMocks(
    mocks = my_mocks,
    method_list = my_ANCOM_parallel, # Only ANCOM
    object = ps_stool_16S,
    weights = zinbweights,
    verbose = FALSE,
    BPPARAM = bpparam)

Add a new DA method later in the analysis

At a later time, the user may want to add the results of another group of methods to those already obtained. For example, a new version of limma:

my_new_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ group,
    coef = 2,
    norm = "CSS",
    weights_logical = FALSE)

This returns a new set of limma instructions and a warning for using CSS normalization factors instead of those native to edgeR.

First of all, the same mocks and the same object must be used to obtain the new results. To run the new instructions the runMocks() function is used:

Stool_16S_mockDA_new_limma <- runMocks(
    mocks = my_mocks, 
    method_list = my_new_limma, 
    object = ps_stool_16S, 
    verbose = FALSE, BPPARAM = bpparam)

To put everything together a mapply() function is used to exploit the output structures:

Stool_16S_mockDA_merged <- mapply(
    Stool_16S_mockDA, # List of old results
    Stool_16S_mockDA_new_limma, # List of new results
    FUN = function(old, new){
        c(old, new) # Concatenate the elements
    }, SIMPLIFY = FALSE)
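The merge relies on both result lists having the same comparisons in the same order. As a minimal sketch, the toy lists below stand in for two runMocks() outputs (their contents and the method names methodA, methodB, and methodC are hypothetical) and show what the concatenation produces:

```r
# Toy stand-ins for two runMocks() outputs (hypothetical contents)
old_results <- list(
    Comparison1 = list(methodA = "A1", methodB = "B1"),
    Comparison2 = list(methodA = "A2", methodB = "B2"))
new_results <- list(
    Comparison1 = list(methodC = "C1"),
    Comparison2 = list(methodC = "C2"))

# Concatenate the method lists comparison by comparison
merged <- mapply(
    old_results, new_results,
    FUN = function(old, new) c(old, new),
    SIMPLIFY = FALSE)

# Each comparison now lists the old methods followed by the new one
sapply(merged, names)
```

Because mapply() pairs elements by position, it is worth checking that names(old_results) and names(new_results) are identical before merging.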

Counting the False Positives

The createTIEC() function counts the FPs and evaluates the p-values distributions:

TIEC_summary <- createTIEC(Stool_16S_mockDA)

A list of 5 data.frames is produced:

df_pval is a 5 columns and number_of_features x methods x comparisons rows data frame. The five columns are called Comparison, Method, variable (which contains the feature names), pval and padj;
df_FPR is a 5 columns and methods x comparisons rows data frame. For each set of method and comparison, the proportion of FPs, considering 3 thresholds (0.01, 0.05, 0.1), is reported;
df_FDR is a 4 columns and number of methods rows data frame. For each method, the average False Discovery Rate is computed by averaging the results across all comparisons (considering 3 thresholds: 0.01, 0.05, and 0.1);
df_QQ contains the average p-value for each theoretical quantile, i.e. the QQ-plot coordinates to compare the mean observed p-values distribution across comparisons with the theoretical uniform distribution. Indeed, the observed p-values should follow a uniform distribution under the null hypothesis of no differentially abundant features;
df_KS is a 5 columns and methods x comparisons rows data frame. For each set of method and comparison, the Kolmogorov-Smirnov test statistics and p-values are reported in KS and KS_pval columns respectively.

Visualization

False Positive Rate

The false positive rate (FPR or observed $\alpha$), which is the observed proportion of significant tests, should match the nominal value because all the findings are FPs by construction. In this example edgeR.TMM, edgeR.TMM.weighted, limma.TMM.weighted, and metagenomeSeq.CSS.fitZig are above all the thresholds (liberal behavior), whereas the ALDEx2.all.wilcox.unpaired and basic_t methods are below (conservative behavior) or in line with the thresholds (Figure \@ref(fig:FPRplot)).

cols <- createColors(variable = levels(TIEC_summary$df_pval$Method))
plotFPR(df_FPR = TIEC_summary$df_FPR, cols = cols)
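To make the quantity behind this plot concrete, the following base R sketch computes an observed FPR per mock comparison from simulated uniform p-values. It is an illustration under stated assumptions, not the createTIEC() implementation: the data are simulated, and counting p-values at or below the threshold (rather than strictly below) is an assumption.

```r
set.seed(1)
n_features <- 200
n_mocks <- 3
# Null raw p-values: one column per mock comparison (simulated, uniform)
raw_p <- matrix(runif(n_features * n_mocks), ncol = n_mocks,
    dimnames = list(NULL, paste0("Comparison", seq_len(n_mocks))))

alphas <- c(0.01, 0.05, 0.1)
# Observed proportion of p-values at or below each threshold, per comparison
fpr <- sapply(alphas, function(a) colMeans(raw_p <= a))
colnames(fpr) <- paste0("FPR_", alphas)
fpr
```

For a well-calibrated method each column should be close to its nominal threshold, with more noise when the number of features is small.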

False Discovery Rate

The false discovery rate $FDR = E\left[\frac{FP}{FP+TP}\right]$ is the expected value of the ratio between the false positives and all the positives. By construction, mock comparisons should not contain any TPs and, when all the hypotheses are null, FDR and FWER (Family Wise Error Rate) coincide. For each set of method and comparison, the FDR is set equal to 1 (if at least 1 DA feature is found) or 0 (if no DA features are found). Hence, the estimated FDR is computed by averaging the values across all the mock comparisons. The more mock comparisons are used, the more precise the estimated FDR will be. Just as alpha is set as a threshold for the p-value to control the FPR, a threshold can be set for the adjusted p-value, which is the FDR analog of the p-value. FDR values should match the nominal values represented by the red dashed lines. In this example, the number of mock comparisons is set to 3, so the estimates are imprecise (Figure \@ref(fig:FDRplot)).

plotFDR(df_FDR = TIEC_summary$df_FDR, cols = cols)
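The averaging described above can be sketched in base R as follows. The p-values are simulated, and the Benjamini-Hochberg adjustment is an assumption made for illustration; each DA method may apply its own correction.

```r
set.seed(2)
n_features <- 200
n_mocks <- 3
raw_p <- matrix(runif(n_features * n_mocks), ncol = n_mocks)
# Benjamini-Hochberg adjustment within each mock comparison (assumed here)
adj_p <- apply(raw_p, 2, p.adjust, method = "BH")

thresholds <- c(0.01, 0.05, 0.1)
# TRUE if a comparison reports at least one discovery, FALSE otherwise
any_discovery <- sapply(thresholds, function(thr) colSums(adj_p <= thr) > 0)
# Averaging across mock comparisons gives the estimated FDR per threshold
fdr_hat <- colMeans(any_discovery)
names(fdr_hat) <- paste0("FDR_", thresholds)
fdr_hat
```

With only 3 mock comparisons, each estimate can only take the values 0, 1/3, 2/3, or 1, which is why a larger number of mocks is needed for a usable estimate.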

QQ-Plot

The p-values distribution under the null hypothesis should be uniform. This is qualitatively summarized in the QQ-plot in Figure \@ref(fig:QQplot) where the bisector represents a perfect correspondence between observed and theoretical quantiles of p-values. For each theoretical quantile, the corresponding observed quantile is obtained averaging the observed p-values' quantiles from all mock datasets. The plotting area is zoomed-in to show clearly the area between 0 and 0.1.

Methods above the bisector show a conservative behavior, while methods below the bisector show a liberal one.

The starting point is determined by the total number of features. In our example the starting point for the theoretical p-values is computed as 1 divided by the number of taxa, rounded to the second digit. In real experiments, where the number of taxa is higher, the starting point is closer to zero.

plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 0.1), cols = cols) + 
    guides(colour = guide_legend(ncol = 1))

As the number of methods increases, distinguishing their curves becomes more difficult. For this reason it is also possible to plot each method separately (Figure \@ref(fig:QQplotsplit)).

plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 1), cols = cols, split = TRUE)
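The QQ-plot coordinates can be reproduced conceptually with a few lines of base R. This is a sketch on simulated p-values, assuming that the theoretical quantiles start at 1 divided by the number of features, as described above; it is not the code used to build df_QQ.

```r
set.seed(3)
n_features <- 50
n_mocks <- 3
raw_p <- matrix(runif(n_features * n_mocks), ncol = n_mocks)

# Theoretical uniform quantiles: 1/n, 2/n, ..., 1
theoretical <- seq_len(n_features) / n_features
# Sort p-values within each mock, then average rank by rank
observed <- rowMeans(apply(raw_p, 2, sort))

qq <- data.frame(theoretical = theoretical, observed = observed)
head(qq)
# Points above the bisector (observed > theoretical) suggest
# conservative behavior, points below suggest liberal behavior
```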

Kolmogorov-Smirnov test

Departure from uniformity is quantitatively evaluated through the Kolmogorov-Smirnov test, which is reported for each method across all mock datasets using the plotKS() function in Figure \@ref(fig:KSplot).

plotKS(df_KS = TIEC_summary$df_KS, cols = cols)

High KS values indicate departure from uniformity, while low values indicate closeness. All the clues seen in Figures \@ref(fig:QQplot) and \@ref(fig:QQplotsplit) are confirmed by the KS statistics: metagenomeSeq.CSS.fitZig, which was very liberal, has the distribution of p-values farthest from uniformity among the tested methods. ALDEx2-based methods also show high KS values, consistent with their very conservative behavior.
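As an illustration of how the KS statistic reacts to liberal and conservative p-value distributions, the sketch below applies stats::ks.test() against the uniform distribution. The three sets of p-values are simulated (the Beta distributions are arbitrary choices made for the example), so the numbers are not comparable with those in the figure.

```r
set.seed(4)
p_uniform <- runif(500)              # well-behaved null p-values
p_liberal <- rbeta(500, 0.5, 1)      # too many small p-values
p_conservative <- rbeta(500, 1, 0.5) # too many large p-values

ks <- sapply(
    list(uniform = p_uniform, liberal = p_liberal,
        conservative = p_conservative),
    function(p) {
        res <- stats::ks.test(p, "punif")
        c(KS = unname(res$statistic), KS_pval = res$p.value)
    })
round(t(ks), 3)
```

Note that the KS statistic alone does not tell whether a method is liberal or conservative: both kinds of departure inflate it, so it should be read together with the QQ-plot.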

Log distribution of p-values

Looking at the p-values' log-scale can also be informative. This is because behavior in the tail may be poor even when the overall p-value distribution is uniform, with a few unusually small p-values in an otherwise uniform distribution. Figure \@ref(fig:LogPplot) displays the distributions of all the p-values (in negative log scale) generated by each DA method across all the mock comparisons.

plotLogP(df_pval = TIEC_summary$df_pval, cols = cols)

Similarly, Figure \@ref(fig:AveLogPplot) exploits the structure of the df_QQ data.frame generated by the createTIEC() function to display the distribution of the p-values (in negative log scale) generated by each DA method, averaged among mock comparisons (only three in this vignette). As this second graphical representation is based on only 1 averaged p-value for each quantile, it is also less influenced by anomalously large values.

plotLogP(df_QQ = TIEC_summary$df_QQ, cols = cols)

In Figures \@ref(fig:LogPplot) and \@ref(fig:AveLogPplot), the IDEAL distribution of $-\log_{10}(p\text{-value})$ is reported in red as the first method. To highlight the behavior in the tail, 3 percentiles (0.9, 0.95, 0.99) are reported using red-shaded vertical segments for each method. If the method's distribution of negative log-transformed p-values or average p-values is still uniform at the 3 selected quantiles of the tail, the 3 red vertical segments will align with the respective dotted lines. Methods are ordered using the distances between the observed quantiles and the ideal ones. Usually, a method whose red segments lie to the left of the IDEAL ones is conservative (e.g., ALDEx2.iqlr.wilcox.unpaired and MAST.default). Indeed, for those methods, small p-values are fewer than expected. On the contrary, methods with red segments to the right of the IDEAL ones are liberal (e.g., edgeR.TMM). Mixed results are also possible: a method can have a lower quantile for one threshold and higher quantiles for the others (e.g., limma.TMM).
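The position of the ideal segments follows directly from uniformity: if $p \sim U(0, 1)$, the $q$-th percentile of $-\log_{10}(p)$ is $-\log_{10}(1 - q)$. The base R sketch below compares these ideal values with the percentiles observed on simulated null p-values (one hypothetical method, not a benchdamic output).

```r
probs <- c(0.9, 0.95, 0.99)
# For uniform p-values, the q-th percentile of -log10(p) is -log10(1 - q)
ideal <- -log10(1 - probs)

set.seed(5)
p_obs <- runif(1e4)  # simulated null p-values for one hypothetical method
observed <- quantile(-log10(p_obs), probs = probs, names = FALSE)

data.frame(percentile = probs, ideal = ideal, observed = observed)
```

Observed values smaller than the ideal ones correspond to segments on the left (conservative), larger values to segments on the right (liberal).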

Discussion about TIEC

Putting all the previous graphical representations together gives a general overview of methods' ability to control FPs and p-values distribution under the null hypothesis (i.e. no differential abundance). It is clear that only methods that produce p-values can be included in this analysis. While figures \@ref(fig:QQplot) and \@ref(fig:QQplotsplit) have a main exploratory scope regarding the p-values distribution based on quantile-quantile comparison, figures \@ref(fig:FPRplot), \@ref(fig:KSplot), and \@ref(fig:AveLogPplot) are able to rank methods according to False Positive Rate, uniformity of p-values distribution, and departure from uniformity in the tail. The latter graphical representations could be used as a first tool to establish which DA method to consider for further analyses and which DA methods to exclude. Finally, the figure \@ref(fig:FDRplot) can be used to assess FDR control under the null scenario. This exposes the problem of a few extremely small p-values among a collection that looks roughly uniform. If that is the case, the Type I error would be under control but the FDR would be inflated.

Concordance

Assumption: Applying different methods to the same data may produce different results.

Questions: How much do the methods agree with each other? How much does a method agree with itself?

Concordance structure

To measure the ability of each method to produce replicable results from a dataset with two or more groups:

1. samples are divided to obtain the Subset1 and Subset2 datasets using the createSplits() function;
2. DA methods are run on both subsets using the runSplits() function;
3. the Concordance At the Top metric (CAT) between the lists of p-values is computed to obtain the Between Methods Concordance (BMC) and the Within Method Concordance (WMC);
4. steps 1-3 are repeated many times (N = 2, but at least 100 are suggested) and the results are averaged using the createConcordance() function.

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figconcordance).

knitr::include_graphics("./concordance_structure.svg")

Split datasets

Using the createSplits() function, the ps_plaque_16S dataset is randomly divided in half. In this dataset, samples are paired: 1 sample for supragingival plaque and 1 sample for subgingival plaque are considered for each subject. The paired parameter, which contains the name of the variable describing the subject IDs, is passed to the function so that paired samples end up in the same split. In this specific case, the two groups of samples are balanced between conditions, reflecting the starting dataset. If the starting dataset had been unbalanced, the balanced option would have made it possible to choose whether or not to keep the two splits unbalanced.

set.seed(123)

# Make sure that groups and subject IDs are factors
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- 
    factor(sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE)
sample_data(ps_plaque_16S)$RSID <- 
    factor(sample_data(ps_plaque_16S)$RSID)

my_splits <- createSplits(
  object = ps_plaque_16S,
  varName = "HMP_BODY_SUBSITE",
  paired = "RSID",
  balanced = TRUE,
  N = 2
) # At least 100 is suggested

The structure produced by the createSplits() function consists of a list of two matrices: Subset1 and Subset2. Each matrix contains the randomly chosen sample IDs. The number of rows of both matrices is equal to the number of comparisons/splits (2 in this example, but at least 100 are suggested).
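The idea of a paired split can be sketched in base R by sampling subjects rather than samples. The metadata below are hypothetical (10 subjects, 2 samples each) and the code only illustrates the principle; it is not how createSplits() is implemented.

```r
set.seed(123)
# Hypothetical paired design: 10 subjects, 2 samples each
meta <- data.frame(
    sample = paste0("S", 1:20),
    subject = factor(rep(paste0("ID", 1:10), each = 2)),
    site = rep(c("Supragingival", "Subgingival"), times = 10))

# Randomly assign whole subjects to one half or the other
subjects <- levels(meta$subject)
half1 <- sample(subjects, size = length(subjects) / 2)
subset1 <- meta$sample[meta$subject %in% half1]
subset2 <- meta$sample[!meta$subject %in% half1]

# Both samples of a subject always land in the same subset,
# so each subset stays balanced between the two sites
table(meta$site[meta$sample %in% subset1])
```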

Set up normalizations and DA methods

For some of the methods implemented in this package it is possible to perform differential abundance testing for repeated-measures experimental designs (e.g., by adding the subject ID in the model formula of DESeq2).

Once again, to set the differential abundance methods to use, the set_<name_of_the_method>() methods can be exploited. For a faster demonstration, differential abundance methods without weighting are used:

my_edgeR_noWeights <- set_edgeR(
    group_name = "HMP_BODY_SUBSITE", 
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM")

my_DESeq2_noWeights <- set_DESeq2(
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"), 
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    norm = "poscounts")

my_limma_noWeights <- set_limma(
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM")

my_ALDEx2 <- set_ALDEx2(
    pseudo_count = FALSE,
    design = "HMP_BODY_SUBSITE",
    mc.samples = 128,
    test = "wilcox",
    paired.test = TRUE, 
    denom = "all", 
    contrast = c("HMP_BODY_SUBSITE", "Supragingival Plaque", "Subgingival Plaque"))

my_MAST <- set_MAST(
    pseudo_count = FALSE,
    rescale = "median",
    design = "~ 1 + RSID + HMP_BODY_SUBSITE",
    coefficient = "HMP_BODY_SUBSITESupragingival Plaque")

my_dearseq <- set_dearseq(
    pseudo_count = FALSE, 
    covariates = NULL,
    variables2test = "HMP_BODY_SUBSITE",
    sample_group = "RSID", 
    test = "asymptotic", 
    preprocessed = FALSE)

my_ANCOM <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "HMP_BODY_SUBSITE",
    rand_formula = "(1|RSID)",
    lme_control = lme4::lmerControl(),
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"),
    BC = c(TRUE, FALSE))

my_linda <- set_linda(
    formula = "~ HMP_BODY_SUBSITE + (1|RSID)",
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"),
    is.winsor = TRUE, 
    zero.handling = "pseudo-count",
    alpha = 0.05)

my_Maaslin2 <- set_Maaslin2(
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = "HMP_BODY_SUBSITE",
    random_effects = "RSID",
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"))

# Temporarily removed due to issues with ID_variables
# my_mixMC <- set_mixMC(
#     pseudo_count = 1,
#     ID_variable = "RSID",
#     contrast = c("HMP_BODY_SUBSITE",
#     "Supragingival Plaque", "Subgingival Plaque"))

my_ZicoSeq <- set_ZicoSeq(
    contrast = c("HMP_BODY_SUBSITE",
        "Supragingival Plaque", "Subgingival Plaque"),
    strata = "RSID",
    feature.dat.type = "count", 
    is.winsor = TRUE, outlier.pct = 0.03, winsor.end = "top",
    is.post.sample = TRUE, post.sample.no = 25, perm.no = 99,
    ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, 
    link.func = list(function(x) sign(x) * (abs(x))^0.5))

my_methods_noWeights <- c(
    my_edgeR_noWeights, 
    my_DESeq2_noWeights, 
    my_limma_noWeights,  
    my_ALDEx2,
    my_MAST,
    my_dearseq,
    my_ANCOM,
    my_linda,
    my_Maaslin2,
    # my_mixMC, # Temporary bug with multilevel
    my_ZicoSeq)

Similarly, to set the normalization methods, the setNormalizations() function can be used. In this case it has already been set up for the TIEC analysis:

str(my_normalizations)

The runSplits() function generates the subsets and performs DA analysis on the features with at least 1 (min_counts > 0) count in more than 2 samples (min_samples > 2):

# Set the parallel framework
# Remember that ANCOMBC based methods are compatible only with SerialParam()
bpparam <- BiocParallel::SerialParam()

# Make sure the subject ID variable is a factor
phyloseq::sample_data(ps_plaque_16S)[, "RSID"] <- as.factor(
    phyloseq::sample_data(ps_plaque_16S)[["RSID"]])

Plaque_16S_splitsDA <- runSplits(
    split_list = my_splits, 
    method_list = my_methods_noWeights, 
    normalization_list = my_normalizations, 
    object = ps_plaque_16S, 
    min_counts = 0, min_samples = 2, 
    verbose = FALSE,
    BPPARAM = bpparam)

Many warning messages could be shown after running the previous function. As suggested before, verbose = TRUE can be used to obtain the method name and the comparison where the warnings occurred. In this case they are probably due to the low sample size of the example dataset, but they should be carefully addressed in real data analysis.

The structure of the output in this example is the following:

Subset1 and Subset2 on the first level, which contains:
- Comparison1 to Comparison2 output lists on the second level:
  - Methods' results on the third level: edgeR with TMM scaling factors, DESeq2 with poscounts normalization factors, limma-voom with TMM scaling factors (all the 3 previous methods have the Subject identifier in the design formula), ALDEx2 with paired wilcox test and denom equals to all, MAST with median scaling and the subject identifier in the design formula, dearseq for repeated measures with asymptotic test, ANCOM with and without bias correction, linDA with winsorization and pseudo-count addition, Maaslin2 with TSS normalization, LOG transformation and LM analysis method, mixMC (temporarily removed), and ZicoSeq with default parameters. Their outputs are organized as always:
    - pValMat which contains the matrix of raw p-values and adjusted p-values in rawP and adjP columns respectively;
    - statInfo which contains the matrix of summary statistics for each feature, such as the logFC, standard errors, test statistics and so on;
    - dispEsts which contains the dispersion estimates for methods like edgeR and DESeq2;
    - name which contains the complete name of the used method.

Add a new DA method later in the analysis

Again, at a later time the user may want to add the results of another group of methods to those already obtained. First of all, the same splits and the same object must be used to obtain the new results:

my_basic <- set_basic(
    pseudo_count = FALSE, 
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"), 
    test = "wilcox", 
    paired = TRUE)

Plaque_16S_splitsDA_basic <- runSplits(
    split_list = my_splits, 
    method_list = my_basic, 
    normalization_list = NULL, 
    object = ps_plaque_16S,
    min_counts = 0, min_samples = 2, 
    verbose = FALSE)

To put everything together, two nested mapply() calls can be used to exploit the output structures:

Plaque_16S_splitsDA_all <- mapply(
    Plaque_16S_splitsDA, # List of old results
    Plaque_16S_splitsDA_basic, # List of new results
    FUN = function(subset_old, subset_new){
        mapply( 
            subset_old, 
            subset_new,
            FUN = function(old, new){ 
                return(c(old, new))
        }, SIMPLIFY = FALSE)
    }, SIMPLIFY = FALSE)
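Since the nested merge pairs subsets and comparisons by position, a quick consistency check before merging can prevent silent mismatches. The sketch below uses toy nested lists (the method names m1 and m2 and their values are hypothetical) shaped like the runSplits() output:

```r
# Toy stand-ins for two runSplits() outputs (hypothetical contents)
old <- list(
    Subset1 = list(Comparison1 = list(m1 = 1), Comparison2 = list(m1 = 2)),
    Subset2 = list(Comparison1 = list(m1 = 3), Comparison2 = list(m1 = 4)))
new <- list(
    Subset1 = list(Comparison1 = list(m2 = 10), Comparison2 = list(m2 = 20)),
    Subset2 = list(Comparison1 = list(m2 = 30), Comparison2 = list(m2 = 40)))

# Subsets and comparisons must line up before merging
stopifnot(identical(names(old), names(new)))
stopifnot(identical(lapply(old, names), lapply(new, names)))

merged <- mapply(old, new, FUN = function(subset_old, subset_new) {
    mapply(subset_old, subset_new,
        FUN = function(o, n) c(o, n), SIMPLIFY = FALSE)
}, SIMPLIFY = FALSE)

# Methods are concatenated inside each subset/comparison pair
names(merged$Subset2$Comparison1)
```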

Comparing the concordances

For each pair of methods the concordance is computed by the createConcordance() function. It produces a long format data frame object with several columns:

comparison which indicates the comparison number;
n_features which indicates the total number of taxa in the comparison dataset;
name of method1;
name of method2;
rank;
concordance which is defined as the cardinality of the intersection of the top rank elements of each list, divided by rank, i.e., $\frac{|L_{1:rank} \cap M_{1:rank}|}{rank}$, where L and M represent the lists of features ranked by p-value for method1 and method2 respectively. A noise value ($<10^{-10}$) is added to each p-value (or statistic) in order to avoid duplicated values which could not be ordered.

concordance <- createConcordance(
    object = Plaque_16S_splitsDA_all,
    slot = "pValMat",
    colName = "rawP",
    type = "pvalue"
)

head(concordance)
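To see how the concordance values arise, the following base R sketch computes the Concordance At the Top for two small named vectors of p-values. The helper cat_curve() and the p-values are hypothetical and written only for this illustration; createConcordance() additionally handles noise, multiple comparisons, and averaging.

```r
# Concordance at the top for two named vectors of p-values
cat_curve <- function(p1, p2, max_rank) {
    shared <- intersect(names(p1), names(p2))
    top1 <- names(sort(p1[shared]))  # features ordered by method 1
    top2 <- names(sort(p2[shared]))  # features ordered by method 2
    ranks <- seq_len(min(max_rank, length(shared)))
    data.frame(
        rank = ranks,
        concordance = sapply(ranks, function(r)
            length(intersect(top1[seq_len(r)], top2[seq_len(r)])) / r))
}

p_method1 <- c(taxA = 0.001, taxB = 0.01, taxC = 0.02, taxD = 0.30, taxE = 0.70)
p_method2 <- c(taxA = 0.002, taxB = 0.40, taxC = 0.03, taxD = 0.01, taxE = 0.90)
cat_curve(p_method1, p_method2, max_rank = 5)
```

Here the two methods agree on the top feature (concordance 1 at rank 1), share only one of their top two features (0.5 at rank 2), and necessarily reach 1 at the last rank, when both lists contain every feature.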

The createConcordance() method is very flexible. In the example below the concordances are built using the log fold changes or other statistics instead of the p-values. To do so, it is necessary to know the column names generated by each differential abundance method in the statInfo matrix.

Firstly, the method order is extracted using the name slot:

names(Plaque_16S_splitsDA_all$Subset1$Comparison1)

Then, the column names of the statInfo slot are investigated:

cat("edgeR.TMM", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$edgeR.TMM$statInfo)
cat("DESeq2.poscounts", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$DESeq2.poscounts$statInfo)
cat("limma.TMM", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$limma.TMM$statInfo)
cat("ALDEx2.all.wilcox.paired", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ALDEx2.all.wilcox.paired$
    statInfo)
cat("MAST.median", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$MAST.median$statInfo)
cat("dearseq.repeated.asymptotic", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$dearseq.repeated.asymptotic$
    statInfo)
cat("ANCOM", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ANCOM$statInfo)
cat("ANCOMBC2", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ANCOM.BC$statInfo)
cat("linda", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$linda.win0.03.pc0.5$statInfo)
cat("Maaslin2", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$Maaslin2.TSSnorm.LOGtrans.LM$statInfo)
# cat("mixMC", "\n")
# names(Plaque_16S_splitsDA_all$Subset1$Comparison1$mixMC.pc1$statInfo)
cat("ZicoSeq", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ZicoSeq.winsor0.03top.ref0.5.excl0.2$statInfo)
cat("basic.wilcox.paired", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$basic.wilcox.paired$statInfo)

All methods except DESeq2, ALDEx2, dearseq, ANCOM, ANCOMBC2, linda, Maaslin2, mixMC, and ZicoSeq contain the log fold change values in the logFC column of the statInfo matrix. Knowing this, the alternative concordance data frame can be built using:

concordance_alternative <- createConcordance(
    object = Plaque_16S_splitsDA_all,
    slot = "statInfo",
    colName = c("logFC", "log2FoldChange", "logFC", 
        "effect", "logFC", "rawP", "W", 
        "lfc_HMP_BODY_SUBSITESupragingival Plaque", 
        "log2FoldChange", "coef", 
        # "importance", # mixMC
        "effect", "logFC"),
    type = c("logfc", "logfc", "logfc", "logfc", 
        "logfc", "pvalue", "logfc",
        "logfc", "logfc", "logfc",
        #"logfc", # mixMC
        "logfc", "logfc")
)

Visualization

Starting from the table of concordances, the plotConcordance() function can produce 2 graphical results visible in Figure \@ref(fig:plotConcordance):

the dendrogram of methods, clustered by the area over the concordance bisector in concordanceDendrogram slot;
the heatmap of the between and within method concordances in concordanceHeatmap slot. For each tile of the symmetric heatmap, which corresponds to a pair of methods, the concordance from rank 1 to a threshold rank is drawn.

The area between the curve and the bisector is colored to highlight concordant methods (blue) and non-concordant ones (red). The two graphical results should be drawn together for the best experience.

pC <- plotConcordance(concordance = concordance, threshold = 30)
cowplot::plot_grid(plotlist = pC, ncol = 2, align = "h", axis = "tb",
    rel_widths = c(1, 3))

The WMC and BMC from rank 1 to rank 30 are reported in the plot above. The maximum rank reached in all split comparisons, i.e. the number of taxa for which all methods were able to compute p-values in every comparison, is more than 40 (use table(concordance$rank) to find out). However, a custom threshold of 30 was supplied.

It is common that WMC values (in red rectangles) are lower than BMC ones. Indeed, BMC is computed between different methods on the same data, while WMC is computed for a single method run on different datasets (some taxa are dataset-specific).

dearseq.repeated.asymptotic and Maaslin2.TSSnorm.LOGtrans.LM methods show the highest BMC values, and they are also concordant with limma.TMM, edgeR.TMM, linda.win0.03.pc0.5, and basic.wilcox.paired. In contrast, DESeq2.poscounts and MAST.median are not concordant with the other methods.

Regarding the WMC, ALDEx2.all.wilcox.paired has the highest value while MAST.median has the lowest, maybe because it has not been implemented properly for repeated-measures designs. Other methods have comparable WMC values.

Discussion about Concordance

Random splits make it possible to evaluate concordance between methods and within a method. These analyses do not assess the correctness of the discoveries. Even the method with the highest WMC could nonetheless consistently identify false positive DA taxa. For this reason, the concordance analysis framework should be used as a tool to detect groups of similar methods.

Enrichment analysis

Assumption: Previous analyses did not assess the correctness of the discoveries.

Question: If some prior knowledge about the experiment is available, would the findings be coherent with that knowledge?

Enrichment structure

While the lack of ground truth makes it challenging to assess the validity of DA results in real data, enrichment analysis can provide an alternative solution to rank methods in terms of their ability to identify, as significant, taxa that are known to be differentially abundant between two groups. To run methods, the runDA() function is used. Leveraging the prior knowledge (if present), the correctness of the findings is checked using the createEnrichment() and createPositives() functions. Many graphical outputs are available through the plotContingency(), plotEnrichment(), plotMutualFindings(), and plotPositives() functions.

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figenrichment).

knitr::include_graphics("./enrichment_structure.svg")

A priori knowledge

Here, we leveraged the peculiar environment of the gingival site [@plaque_dynamics]:

the supragingival biofilm is directly exposed to the open atmosphere of the oral cavity, favoring the growth of aerobic species;
in the subgingival biofilm, the atmospheric conditions gradually become strict anaerobic, favoring the growth of anaerobic species.

From the comparison of the two sites, an abundance of aerobic microbes in the supragingival plaque and of anaerobic bacteria in the subgingival plaque is expected. DA analysis should reflect this difference by finding an enrichment of aerobic (anaerobic) bacteria among the DA taxa with a positive (negative) log-fold-change.

Firstly, the microbial metabolism information is necessary. These data come from the GitHub repository of the [@cigarettes] research article (https://github.com/waldronlab/nychanesmicrobiome), but they can be loaded using data("microbial_metabolism"):

data("microbial_metabolism")
head(microbial_metabolism)

The microbial genus and its type of metabolism are specified in the first and second column respectively. To match each taxon of the phyloseq object to its type of metabolism the next chunk of code can be used:

# Extract genera from the phyloseq tax_table slot
genera <- tax_table(ps_plaque_16S)[, "GENUS"]

# Genera as rownames of microbial_metabolism data.frame
rownames(microbial_metabolism) <- microbial_metabolism$Genus

# Match OTUs to their metabolism
priorInfo <- data.frame(genera, "Type" =  microbial_metabolism[genera, "Type"])
unknown_metabolism <- is.na(priorInfo$Type)
priorInfo[unknown_metabolism, "Type"] <- "Unknown"

# Relabel 'F Anaerobic' to 'F_Anaerobic' to remove space
priorInfo$Type <- factor(priorInfo$Type, 
    levels = c("Aerobic","Anaerobic","F Anaerobic","Unknown"), 
    labels = c("Aerobic","Anaerobic","F_Anaerobic","Unknown"))

# Add a more informative names column
priorInfo[, "newNames"] <- paste0(rownames(priorInfo), "|",
    priorInfo[, "GENUS"])

Set up normalizations and DA methods

Both the normalization/scaling factors and the DA methods' instructions are available since the dataset is the same as the one used in the previous section.

In the concordance analysis, normalization factors were added inside the runSplits() function, so the original object ps_plaque_16S does not contain the values. The normalization/scaling factors are added to the object:

ps_plaque_16S <- runNormalizations(my_normalizations, object = ps_plaque_16S)

A simple filter to remove rare taxa is applied:

ps_plaque_16S <- phyloseq::filter_taxa(physeq = ps_plaque_16S, 
    flist = function(x) sum(x > 0) >= 3, prune = TRUE)
ps_plaque_16S
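The filtering function keeps a taxon only when it has a non-zero count in at least 3 samples. Its effect can be checked on a toy count matrix (hypothetical values, taxa in rows) with base R:

```r
# Toy count matrix: taxa in rows, samples in columns
counts <- matrix(
    c(0, 0, 5, 0,    # taxon1: present in 1 sample
      3, 1, 0, 2,    # taxon2: present in 3 samples
      7, 4, 9, 1),   # taxon3: present in 4 samples
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:4)))

# Same rule as above: non-zero counts in at least 3 samples
keep <- apply(counts, 1, function(x) sum(x > 0) >= 3)
counts[keep, , drop = FALSE]
```

Only taxon2 and taxon3 pass the filter; taxon1 is dropped regardless of how large its single non-zero count is.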

Unlike the Type I Error Control and Concordance analyses, the enrichment analysis relies on a single phyloseq or TreeSummarizedExperiment object (no mocks, no splits, no comparisons). For this reason many methods can be assessed without computational trade-offs (e.g., ANCOM without sampling fraction bias correction and methods which use ZINB weights).

The observational weights are computed:

plaque_weights <- weights_ZINB(object = ps_plaque_16S, design = ~ 1,
    zeroinflation = TRUE)

The existing instructions are concatenated with the instructions of methods which use observational weights:

my_edgeR <- set_edgeR(
    group_name = "HMP_BODY_SUBSITE", 
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM", 
    weights_logical = TRUE)

my_DESeq2 <- set_DESeq2(
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"), 
    design = ~ 0 + RSID + HMP_BODY_SUBSITE, 
    norm = "poscounts",
    weights_logical = TRUE)

my_limma <- set_limma(
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM",
    weights_logical = TRUE)

my_methods <- c(my_methods_noWeights, my_edgeR, my_DESeq2, my_limma)

All the ingredients are ready to run DA methods:

Plaque_16S_DA <- runDA(method_list = my_methods, 
    object = ps_plaque_16S, weights = plaque_weights, verbose = FALSE)

Testing the enrichment

The Plaque_16S_DA object contains the results for the methods. In order to extract p-values, the optional direction of DA (DA vs non-DA, or UP Abundant vs DOWN Abundant), and to add any a priori information, the createEnrichment() function can be used.

In the direction argument, which is set to NULL by default, the column name containing the direction (e.g., logfc, logFC, logFoldChange...) of each method's statInfo matrix can be supplied.

Firstly, the order of methods is investigated:

names(Plaque_16S_DA)

Following the methods' order, the direction parameter is supplied together with other parameters:

threshold_pvalue, threshold_logfc, and top (optional), to set differential abundance thresholds;
slot, colName, and type, which specify where to apply the above thresholds;
priorKnowledge, enrichmentCol, and namesCol, to add enrichment information to DA analysis.

The createEnrichment() function, with the direction parameter for all methods except dearseq (which has only p-values), is used:

enrichment <- createEnrichment(
    object = Plaque_16S_DA[-c(6)], 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Type",
    namesCol = "newNames", 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    direction = c(
        "logFC", # edgeR
        "log2FoldChange", # DEseq2
        "logFC", # limma
        "effect", # ALDEx2
        "logFC", # MAST
        "direction", # ANCOM
        "lfc_HMP_BODY_SUBSITESupragingival Plaque", # ANCOMBC2
        "log2FoldChange", # linda
        "coef", # Maaslin2
        # "importance", # mixMC
        "effect", # ZicoSeq
        "logFC", # edgeR with weights
        "log2FoldChange", # DESeq2 with weights
        "logFC"), # limma with weights
    threshold_pvalue = c(0.1, 0.1, 0.1, 0.1, 0.1,
        0.4, # ANCOM threshold on 1-W/(ntaxa-1) 0.4 = liberal
        0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), # other methods
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = TRUE
)

The resulting enrichment object is a list with one element per tested method. Each element contains the following slots:

the data slot contains information for each feature. P-values, adjusted p-values (or other statistics) in stats column, log fold changes (or other statistics, if specified) in direction column, differential abundance information in the DA column (according to the thresholds), the variable of interest for the enrichment analysis, and the name of the feature in the feature column;
in the tables slot a maximum of 2 x (levels of enrichment variable) contingency tables (2x2) are present;
in the tests slot, the list of Fisher exact tests produced by the fisher.test() function are saved for each contingency table;
in the summaries slot, the first elements of the contingency tables and the respective p-values are collected for graphical purposes.
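The kind of test stored in the tests slot can be sketched in base R. The example below builds one 2x2 contingency table from toy data and runs a one-sided Fisher exact test; the feature labels and counts are hypothetical, and the table layout is an assumption rather than the exact layout produced by createEnrichment().

```r
# Toy feature-level table (hypothetical): DA call and enrichment level
toy <- data.frame(
    DA = c(rep("DOWN Abundant", 8), rep("non-DA", 12)),
    Type = c(rep("Anaerobic", 7), "Aerobic",
        rep("Anaerobic", 3), rep("Aerobic", 9)))

# 2x2 table: DOWN Abundant vs non-DOWN Abundant features,
# Anaerobic vs non-Anaerobic features
tab <- table(
    DOWN = factor(toy$DA == "DOWN Abundant", levels = c(TRUE, FALSE)),
    Anaerobic = factor(toy$Type == "Anaerobic", levels = c(TRUE, FALSE)))
tab

# One-sided test: are Anaerobic features over-represented among the
# DOWN Abundant ones?
test <- fisher.test(tab, alternative = "greater")
test$p.value
```

The alternative = "greater" choice mirrors the alternative argument used in the createEnrichment() call above: only over-representation is of interest.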

Visualization

Contingency tables

Considering one of the methods, DESeq2.poscounts, 8 contingency tables are obtained. Both UP Abundant and DOWN Abundant taxa are found and the enrichment variable has Aerobic, Anaerobic, F_Anaerobic, and Unknown levels. For each level, 2 contingency tables can be built: one for DOWN Abundant vs non-DOWN Abundant features and one for UP Abundant vs non-UP Abundant features. The enrichment is tested using the Fisher exact test. The plotContingency() function summarizes all this information (Figure \@ref(fig:plotContingency)).

plotContingency(enrichment = enrichment, 
    levels_to_plot = c("Aerobic", "Anaerobic"), 
    method = "DESeq2.poscounts")

Enrichment plot

To summarize enrichment analysis for all the methods simultaneously, the plotEnrichment() function can be used. Only Aerobic and Anaerobic levels are plotted in Figure \@ref(fig:plotEnrichment):

plotEnrichment(enrichment = enrichment, enrichmentCol = "Type", 
    levels_to_plot = c("Aerobic", "Anaerobic"))

Since Subgingival Plaque is the reference level for each method, the coefficients extracted from the methods refer to the Supragingival Plaque class. The majority of methods identify, as expected, a statistically significant ($0.001 < p \le 0.05$) amount of DOWN Abundant Anaerobic features in Supragingival Plaque (Figure \@ref(fig:plotEnrichment)). Moreover, many of them find an enriched amount of UP Abundant Aerobic genera in Supragingival Plaque. Unexpectedly, both DESeq2.poscounts and DESeq2.poscounts.weighted find many Anaerobic genera as UP Abundant; these could be FPs.

Mutual Findings

To investigate the DA features, the plotMutualFindings() function can be used (Figure \@ref(fig:plotMutualFindings)). The levels_to_plot argument selects which levels of the enrichment variable to plot, while the n_methods argument restricts the plot to features found as DA by more than n_methods methods (more than 1 method in this example).

plotMutualFindings(enrichment, enrichmentCol = "Type", 
    levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)
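The idea behind the n_methods filter can be sketched in base R with toy DA calls. The method names and calls below are hypothetical, and the code only mimics the selection logic described above; it is not the plotMutualFindings() implementation.

```r
# Toy DA calls (hypothetical): one named logical vector per method
calls <- list(
    methodA = c(OTU1 = TRUE, OTU2 = TRUE, OTU3 = FALSE, OTU4 = FALSE),
    methodB = c(OTU1 = TRUE, OTU2 = FALSE, OTU3 = FALSE, OTU4 = TRUE),
    methodC = c(OTU1 = TRUE, OTU2 = TRUE, OTU3 = FALSE, OTU4 = FALSE))

# Number of methods calling each feature DA
n_calls <- rowSums(do.call(cbind, calls))

# Keep the features found as DA by more than n_methods methods
n_methods <- 1
mutual <- names(n_calls)[n_calls > n_methods]
mutual
```

Raising n_methods keeps only the findings shared by more methods, which is a simple way to focus on the most consistent DA features.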

In this example (Figure \@ref(fig:plotMutualFindings)), many Anaerobic genera and 6 Aerobic genera are found as DA by more than 1 method simultaneously. Among them, all methods find the Prevotella, Treponema, Fusobacterium, and Dialister genera DOWN Abundant in Supragingival Plaque, while they find the Actinomyces genus UP Abundant, even if it has an aerobic metabolism. Similarly, all methods find the aerobic genera Corynebacterium, Lautropia, and Neisseria UP Abundant in Supragingival Plaque.

True and False Positives

To evaluate the overall performance, a statistic based on the difference between the putative True Positives (TP) and the putative False Positives (FP) is used. To build the matrix to plot, the createPositives() function can be used. In detail, the correctness of the DA features is evaluated by comparing the direction of the top-ranked features to the expected direction supplied by the user in the TP and FP lists. The procedure is performed for several values of the top parameter in order to observe a trend, if present:

positives <- createPositives(
    object = Plaque_16S_DA[-c(6)], 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Type",
    namesCol = "newNames", 
    slot = "pValMat", 
    colName = "rawP", 
    type = "pvalue",
    direction = c(
        "logFC", # edgeR
        "log2FoldChange", # DEseq2
        "logFC", # limma
        "effect", # ALDEx2
        "logFC", # MAST
        "direction", # ANCOM
        "lfc_HMP_BODY_SUBSITESupragingival Plaque", # ANCOMBC2
        "log2FoldChange", # linda
        "coef", # Maaslin2
        # "importance", # mixMC
        "effect", # ZicoSeq
        "logFC", # edgeR with weights
        "log2FoldChange", # DESeq2 with weights
        "logFC"), # limma with weights
    threshold_pvalue = c(0.1, 0.1, 0.1, 0.1, 0.1,
        0.4, # ANCOM threshold on 1-W/(ntaxa-1) 0.4 = liberal
        0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), # other methods
    threshold_logfc = 0,
    top = seq.int(from = 0, to = 30, by = 3),
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DOWN Abundant", "Anaerobic"), c("UP Abundant", "Aerobic")),
    FP = list(c("DOWN Abundant", "Aerobic"), c("UP Abundant", "Anaerobic"))
)
head(positives)
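The TP - FP statistic can be illustrated with a toy ranking in base R. The data, the helper count_pairs(), and the column names are hypothetical; the sketch only reproduces the idea of counting expected and unexpected (direction, type) pairs among the top-ranked features, not the internals of createPositives().

```r
# Toy ranked results for one method (hypothetical values)
toy <- data.frame(
    rawP = c(0.001, 0.002, 0.004, 0.01, 0.03, 0.2),
    DA = c("DOWN Abundant", "UP Abundant", "UP Abundant",
        "DOWN Abundant", "UP Abundant", "non-DA"),
    Type = c("Anaerobic", "Aerobic", "Anaerobic",
        "Anaerobic", "Aerobic", "Aerobic"))

# Expected (TP) and unexpected (FP) combinations of direction and type
TP <- list(c("DOWN Abundant", "Anaerobic"), c("UP Abundant", "Aerobic"))
FP <- list(c("DOWN Abundant", "Aerobic"), c("UP Abundant", "Anaerobic"))

# Hypothetical helper: count the rows matching any of the (DA, Type) pairs
count_pairs <- function(df, pairs) {
    sum(vapply(pairs, function(p) {
        sum(df$DA == p[1] & df$Type == p[2])
    }, numeric(1)))
}

# TP - FP among the top-ranked features, for several values of top
toy <- toy[order(toy$rawP), ]
tops <- c(2, 4, 6)
diffs <- vapply(tops, function(k) {
    top_k <- head(toy, k)
    count_pairs(top_k, TP) - count_pairs(top_k, FP)
}, numeric(1))
names(diffs) <- paste0("top", tops)
diffs
```

A difference that keeps growing with top suggests that the additional features are mostly in line with the prior knowledge.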

The plotPositives() function can be used to summarize the methods' performances (Figure \@ref(fig:plotPositives)). Higher values usually represent better performance. In our example, all methods show similar values of the statistic for the top 10 ranked features.

plotPositives(positives)

Conservative and high-in-FP methods are located in the lower part of Figure \@ref(fig:plotPositives). mixMC is also in the lower part of the figure, probably because it aims to find the minimum number of features with discriminant power (sparse PLS-DA), so its findings depend on the dataset (sometimes a few features are sufficient to perfectly separate the groups). The highest performances are those of limma.TMM.weighted, edgeR.TMM.weighted, and ZicoSeq.winsor0.03top.post25.ref0.5.excl0.2. This means that their findings are in line with the a priori knowledge supplied by the user.

Enrichment without direction

When the user has a custom method that does not return the direction of the differential abundance (e.g., NOISeq), or when the direction of DA is not of interest, the sole information about DA and non-DA features can be used. The createEnrichment() function is used without the direction parameter for all methods:

enrichment_nodir <- createEnrichment(
    object = Plaque_16S_DA, 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Type",
    namesCol = "newNames", 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    threshold_pvalue = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
        0.4, # ANCOM 
        0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = FALSE
)

To summarize enrichment analysis for all the methods simultaneously, the plotEnrichment() function is used. All levels are plotted in Figure \@ref(fig:plotEnrichmentnodir).

plotEnrichment(enrichment = enrichment_nodir, enrichmentCol = "Type")

The highest number of DA features belongs to the Anaerobic metabolism, followed by F_Anaerobic and Aerobic. The method that finds the most DA features is DESeq2.poscounts, while ANCOM and ALDEx2 are the most conservative (even though the threshold based on $1-\frac{W}{ntaxa-1}$ is set at a liberal value of 0.4).
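The ANCOM statistic mentioned above can be computed by hand on toy values. In the sketch below the W statistics are hypothetical, and the use of `<=` for the comparison with the threshold is an assumption.

```r
# Toy ANCOM W statistics (hypothetical values) for 5 taxa: for each taxon,
# W can be at most ntaxa - 1
W <- c(OTU1 = 4, OTU2 = 3, OTU3 = 1, OTU4 = 0, OTU5 = 2)
ntaxa <- length(W)

# Statistic used in place of a p-value: small values mean strong evidence
stat <- 1 - W / (ntaxa - 1)
stat

# A 0.4 threshold calls DA the taxa with W / (ntaxa - 1) of at least 0.6
DA <- stat <= 0.4
names(DA)[DA]
```

Lower thresholds on this statistic correspond to stricter cut-offs on W, so 0.4 is on the liberal side.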

As for the enrichment analysis with DA direction, the plotMutualFindings() function can be used here too (Figure \@ref(fig:plotMutualFindingsnodir)). The levels_to_plot argument selects which levels of the enrichment variable to plot, while the n_methods argument restricts the plot to features found as DA by more than n_methods methods (more than 1 method in this example).

plotMutualFindings(enrichment_nodir, enrichmentCol = "Type", 
    levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)

In this example (Figure \@ref(fig:plotMutualFindingsnodir)), many OTUs are found as DA in Supragingival Plaque by all methods.

Enrichment analysis for simulated data

To enlarge the scope of the enrichment analysis, simulations could be used, e.g., by using the user's dataset as a template to generate simulated data in which the DA features are known, and providing this information as prior knowledge.

As an example, the SPsimSeq package is used (the tool to use is up to the user) to simulate a single dataset (n.sim = 1) from the ps_plaque_16S dataset, where two body subsites are available (without considering the paired design). The data are simulated with the following properties:

100 features (n.genes = 100);
50 samples (tot.samples = 50);
the samples are equally divided into 2 groups of 25 samples each (group.config = c(0.5, 0.5));
all samples are from a single batch (batch.config = 1);
20% DA features (pDE = 0.2);
the DA features have a log-fold-change of at least 0.5 (lfc.thrld = 0.5).

data("ps_plaque_16S")
counts_and_metadata <- get_counts_metadata(ps_plaque_16S)
plaque_counts <- counts_and_metadata[["counts"]]
plaque_metadata <- counts_and_metadata[["metadata"]]

set.seed(123)

sim_list <- SPsimSeq(
    n.sim = 1, 
    s.data = plaque_counts,
    group = plaque_metadata[, "HMP_BODY_SUBSITE"],
    n.genes = 100, 
    batch.config = 1,
    group.config = c(0.5, 0.5), 
    tot.samples = 50, 
    pDE = 0.2, 
    lfc.thrld = 0.5, 
    model.zero.prob = FALSE,
    result.format = "list")

Simulated data are organised into a TreeSummarizedExperiment object:

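# Single-bracket subsetting keeps each table wrapped in a one-element list,
# so the resulting column names are prefixed (e.g., colData.Group)
sim_obj <- TreeSummarizedExperiment::TreeSummarizedExperiment(
    assays = list("counts" = sim_list[[1]][["counts"]]),
    rowData = sim_list[[1]]["rowData"],
    colData = sim_list[[1]]["colData"]
)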
# Group as factor
SummarizedExperiment::colData(sim_obj)[, "colData.Group"] <- as.factor(
    SummarizedExperiment::colData(sim_obj)[, "colData.Group"])

The a priori information is readily available from sim_list[[1]][["rowData"]]:

priorInfo <- sim_list[[1]][["rowData"]]
priorInfo$Reality <- ifelse(priorInfo[, "DE.ind"], "is DA", "is not DA")

Once again, normalization/scaling factors are added:

sim_obj <- runNormalizations(
    normalization_list = my_normalizations, 
    object = sim_obj, 
    verbose = TRUE)

Rare and low variance taxa are filtered:

taxa_to_keep <- apply(assays(sim_obj)[["counts"]], 1, 
    function(x) sum(x > 0) >= 3 & sd(x) > 1)
sim_obj <- sim_obj[taxa_to_keep, ]
priorInfo <- priorInfo[taxa_to_keep, ]
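The effect of this filtering rule can be checked on a small count matrix. The toy matrix below is hypothetical; the rule is the same one applied to sim_obj above.

```r
# Toy count matrix (hypothetical): 4 taxa in rows, 6 samples in columns
counts <- rbind(
    taxon1 = c(0, 0, 0, 0, 12, 30),   # present in only 2 samples
    taxon2 = c(1, 1, 1, 1, 1, 1),     # no variability
    taxon3 = c(0, 5, 20, 3, 0, 40),   # prevalent and variable
    taxon4 = c(10, 12, 9, 50, 7, 0))  # prevalent and variable

# Keep taxa with at least 3 non-zero samples and a standard deviation
# greater than 1
keep <- apply(counts, 1, function(x) sum(x > 0) >= 3 & sd(x) > 1)
keep

# The same logical index filters the counts and any feature-level table
counts_filtered <- counts[keep, , drop = FALSE]
rownames(counts_filtered)
```

Using a single logical index for both the object and priorInfo keeps the features and their prior knowledge aligned.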

Observational weights are computed:

sim_weights <- weights_ZINB(
    object = sim_obj, 
    design = ~ 1, 
    zeroinflation = TRUE)

DA methods are set up. The paired design is not considered and all the methods are used. The contrast, design, group, coef, and all the other parameters involved in the experimental design definition are changed:

my_basic <- set_basic(pseudo_count = FALSE,
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"), 
    test = c("t", "wilcox"), 
    paired = FALSE, 
    expand = TRUE)

my_edgeR <- set_edgeR(
    pseudo_count = FALSE,
    group_name = "colData.Group",
    design = ~ colData.Group,
    robust = FALSE,
    coef = 2,
    norm = "TMM",
    weights_logical = c(TRUE, FALSE),
    expand = TRUE)

my_DESeq2 <- set_DESeq2(
    pseudo_count = FALSE,
    design = ~ colData.Group,
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    norm = "poscounts",
    weights_logical = c(TRUE, FALSE),
    alpha = 0.1,
    expand = TRUE)

my_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ colData.Group,
    coef = 2,
    norm = "TMM",
    weights_logical = c(FALSE, TRUE),
    expand = TRUE)

my_ALDEx2 <- set_ALDEx2(
    pseudo_count = FALSE,
    design = "colData.Group",
    mc.samples = 128,
    test = "wilcox",
    paired.test = FALSE, 
    denom = c("all", "iqlr"), 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    expand = TRUE)

my_metagenomeSeq <- set_metagenomeSeq(
    pseudo_count = FALSE, 
    design = "~ colData.Group", 
    coef = "colData.GroupSupragingival Plaque", 
    norm = "CSS", 
    model = "fitFeatureModel",
    expand = TRUE)

my_corncob <- set_corncob(
    pseudo_count = FALSE,
    formula = ~ colData.Group,
    formula_null = ~ 1,
    phi.formula = ~ colData.Group,
    phi.formula_null = ~ colData.Group,
    test = c("Wald", "LRT"),
    boot = FALSE,
    coefficient = "colData.GroupSupragingival Plaque")

my_MAST <- set_MAST(
    pseudo_count = FALSE,
    rescale = c("default", "median"),
    design = "~ 1 + colData.Group",
    coefficient = "colData.GroupSupragingival Plaque",
    expand = TRUE)

my_Seurat <- set_Seurat(
    pseudo_count = FALSE,
    test = c("t", "wilcox"),
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    norm = c("LogNormalize", "CLR"),
    scale.factor = 10^5,
    expand = TRUE
)

my_ANCOM <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "colData.Group", 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    BC = c(TRUE, FALSE), 
    alpha = 0.1,
    expand = TRUE   
)

my_dearseq <- set_dearseq(
    pseudo_count = FALSE,
    covariates = NULL,
    variables2test = "colData.Group", 
    preprocessed = FALSE, 
    test = c("permutation", "asymptotic"),
    expand = TRUE)

my_NOISeq <- set_NOISeq(
    pseudo_count = FALSE, 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    norm = c("rpkm", "tmm"), 
    expand = TRUE)

my_linda <- set_linda(
    formula = "~ colData.Group", 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    is.winsor = TRUE,
    zero.handling = "pseudo-count",
    alpha = 0.1,
    expand = TRUE)

my_Maaslin2 <- set_Maaslin2(
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = "colData.Group", 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    expand = TRUE)

my_mixMC <- set_mixMC(
    pseudo_count = 1,
    contrast = c("colData.Group", "Supragingival Plaque",
        "Subgingival Plaque"),
    expand = TRUE
)

my_ZicoSeq <- set_ZicoSeq(
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    feature.dat.type = "count", strata = NULL,
    is.winsor = TRUE, outlier.pct = 0.03, winsor.end = "top",
    is.post.sample = TRUE, post.sample.no = 25, perm.no = 99,
    ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, 
    link.func = list(function(x) sign(x) * (abs(x))^0.5))

my_methods <- c(my_basic, my_edgeR, my_DESeq2, 
    my_limma, my_metagenomeSeq, my_corncob, 
    my_ALDEx2, my_MAST, my_Seurat, my_ANCOM, 
    my_dearseq, my_NOISeq, my_linda, my_Maaslin2, my_mixMC, 
    my_ZicoSeq)

DA methods are run using the runDA() function:

sim_DA <- runDA(
    method_list = my_methods, 
    object = sim_obj, 
    weights = sim_weights, 
    verbose = FALSE)

The createEnrichment() function is used without the direction parameter for all methods. A 0.1 threshold on the adjusted p-values is chosen to define DA and non-DA taxa for most methods, while a 0.4 threshold is used for ANCOM and a 0.9 threshold for mixMC:

enrichment_nodir <- createEnrichment(
    object = sim_DA, 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Reality",
    namesCol = NULL, 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    threshold_pvalue = c(
        rep(0.1,19), # adjP thresholds 
        0.4, # adjP threshold for ANCOM on 1-W/(ntaxa-1)
        rep(0.1,7), # adjP thresholds for other methods
        0.9, # adjP threshold for mixMC 1 - stability 
        0.1), # ZicoSeq
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = FALSE
)
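Positional threshold vectors such as the one above are easy to misalign when methods are added or removed. A possible alternative, sketched below, builds the vector from the method names. The names used here are hypothetical stand-ins for the output of names(sim_DA), and the assumption that the W-based ANCOM results are the ones named ANCOM.counts should be checked against the actual names.

```r
# Hypothetical method names, standing in for names(sim_DA)
method_names <- c("basic.t.counts", "edgeR.counts.TMM", "ANCOM.counts",
    "ANCOM.counts.BC", "mixMC.pc1", "ZicoSeq.example")

# Default threshold for every method, then the method-specific exceptions
thresholds <- setNames(rep(0.1, length(method_names)), method_names)
thresholds[method_names == "ANCOM.counts"] <- 0.4  # 1 - W/(ntaxa - 1)
thresholds[grepl("^mixMC", method_names)] <- 0.9   # 1 - stability

# One value per method, in the same order as the methods
stopifnot(length(thresholds) == length(method_names))
thresholds
```

Passing unname(thresholds) as threshold_pvalue would keep the call equivalent to a positional vector, while the names make the assignment easy to inspect beforehand.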

To summarize the enrichment analysis for all the methods simultaneously, the plotEnrichment() function can be used. The numbers of both "is DA" and "is not DA" features are plotted in Figure \@ref(fig:plotEnrichmentnodirsim). Their interpretation is straightforward: the "is DA" features are the positives, while the "is not DA" features are the negatives. Since only the features called DA are counted, the positives reported in Figure \@ref(fig:plotEnrichmentnodirsim) are the True Positives, while the negatives are the FPs.

plotEnrichment(enrichment = enrichment_nodir, enrichmentCol = "Reality")

In this example, fewer than half of the methods find an enriched amount of truly DA features without any false discovery, with dearseq.counts.permutation.1000 in first position. On the contrary, the basic.t.counts, DESeq2.counts.poscounts.weighted, limma.counts.TMM.weighted, metagenomeSeq.counts.CSS.fitFeatureModel, ALDEx2.counts.iqlr.wilcox.unpaired, MAST.counts.default, MAST.counts.median, ANCOM.counts, and NOISeq.counts.tmm methods do not find any DA feature. This could be strongly related to the template taxa chosen to simulate the DA features.

To further assess the methods' power, the createPositives() function can be used, specifying as TPs the features called DA that were simulated as DA, and as FPs the features called DA that were simulated as not DA (Figure \@ref(fig:plotPositivessim)).

We use a threshold_pvalue = 0.1 (0.4 for ANCOM, 0.9 for mixMC) to call a feature DA based on its adjusted p-value. We compute the difference between TPs and FPs for several top thresholds (from 5 to 30, by 5) in order to observe a trend:

positives_nodir <- createPositives(
    object = sim_DA, 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Reality",
    namesCol = NULL, 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    threshold_pvalue = c(
        rep(0.1,19), # adjP thresholds 
        0.4, # adjP threshold for ANCOM on 1-W/(ntaxa-1)
        rep(0.1,7), # adjP thresholds for other methods
        0.9, # adjP threshold for mixMC 1 - stability 
        0.1), # ZicoSeq
    threshold_logfc = 0,
    top = seq(5, 30, by = 5),
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DA", "is DA")),
    FP = list(c("DA", "is not DA"))
)

Since the number of simulated DA features is 20, the maximum number of TPs is 20, and it is added as a horizontal line to the figure.

plotPositives(positives = positives_nodir) +
    facet_wrap( ~ method) + 
    theme(legend.position = "none") +
    geom_hline(aes(yintercept = 20), linetype = "dotted", color = "red") +
    geom_hline(aes(yintercept = 0), color = "black") +
    ylim(NA, 21)

From Figure \@ref(fig:plotPositivessim) it is clearly visible that linda.win0.03.pc0.5 and DESeq2.counts.poscounts reach the highest values of the difference, followed by dearseq.counts.permutation.1000, dearseq.counts.asymptotic, and corncob.counts.Wald. As already mentioned, the desired level of power which a method should be able to reach is represented by the red dotted line, i.e., the total number of simulated DA features (20 in our case). The methods listed above, in this specific example, have the highest power. In contrast, methods characterized by flat lines have a fixed number of features with an adjusted p-value lower than the threshold. If their lines are above the zero line, the number of True Positives is greater than the number of FPs. On the contrary, if their lines are below the zero line, the number of FPs is greater. For example, edgeR.counts.TMM.weighted and ANCOM.counts.BC have negative values, possibly due to poor weight estimates for weight-based methods and/or to specific characteristics of this template.

Discussion about Enrichment

The enrichment analysis toolbox provides many methods to study DA in a dataset.

Firstly, when some prior knowledge is available, it allows the evaluation of the methods' power. Among the possible applications, it is especially useful to investigate conservative methods: are they calling only the most obvious taxa (also found by the other methods) or are they finding something new? The main drawback is that the availability of prior knowledge is limited, especially for new datasets. For this reason, enrichment analysis could also be used in addition to simulation tools. Indeed, through parametric, semi-parametric, or non-parametric assumptions it is possible to obtain an emulation of the prior knowledge.

Secondly, thanks to functions like plotMutualFindings() and plotEnrichment(), which produce graphical results like Figure \@ref(fig:plotEnrichmentnodir) and Figure \@ref(fig:plotMutualFindingsnodir), it is also possible to use the enrichment analysis to study the distribution of the findings across classes of taxa (e.g., by using the phylum of the features as prior knowledge, it would be possible to study whether a phylum is characterized by an increased number of DA features compared to another phylum) or, more simply, to draw biological conclusions based only on the taxa found as DA by the majority of the methods.

Session Info

sessionInfo()

References

mcalgaro93/benchdamic documentation built on Nov. 28, 2024, 2:16 p.m.
