ANOVA-Like Differential Expression tool for high throughput sequencing data

Introduction to r Biocpkg("ALDEx2")

This guide provides an overview of the R package ALDEx version 2 (r Biocpkg("ALDEx2")) for differential (relative) abundance analysis of high throughput sequencing count-compositional data^[all high throughput sequencing data are compositional [@gloorFrontiers:2017] because of constraints imposed by the instruments]. The package was developed and used initially for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-RNA-Seq)^[@macklaim:2013], but testing showed that it performed very well with traditional RNA-Seq datasets^[@Quinn:2018aa], 16S rRNA gene variable region sequencing^[@bian:2017] and selective growth-type (SELEX) experiments^[@mcmurrough:2014;@Wolfs:2016aa]. In principle, the analysis method should be applicable to nearly any type of data that is generated by high-throughput sequencing that generates tables of per-feature counts for each sample[@fernandes:2014]: in addition to the examples outlined above, this would include ChIP-Seq or metagenome sequencing.

What r Biocpkg("ALDEx2") does differently

The r Biocpkg("ALDEx2") package estimates per-feature technical variation within each sample using Monte-Carlo instances drawn from the Dirichlet distribution. Sampling from this distribution returns the posterior probability distribution of the observed data under a repeated sampling model. All outputs from r Biocpkg("ALDEx2") are outputs from the posterior distributions, either expected values or confidence intervals.

r Biocpkg("ALDEx2") uses the centred log-ratio (clr) transformation (or closely related log-ratio transforms) which ensures the data are scale invariant and sub-compositionally coherent^[@Aitchison:1986]. The scale invariance property removes the need for a between sample data normalization step since the data are all placed on a consistent numerical co-ordinate. The sub-compositional coherence property ensures that the answers obtained are consistent when parts of the dataset are removed (e.g., removal of rRNA reads from RNA-seq studies or rare OTU species from 16S rRNA gene amplicon studies). All feature abundance values are expressed relative to the geometric mean abundance of other features in a sample.This is conceptually similar to a quantitative PCR where abundances are expressed relative to a standard: in the case of the clr transformation, the standard is the per-sample geometric mean abundance. See Aitchison (1986) for a complete description.

In contrast, most commonly used tools to analyze differential abundance of high throughput sequencing (HTS) data make the assumption that throughput sequencing data are delivered as counts^[@Anders:2013aa;@Anders:2010;@Gierlinski:2015aa]. Much work has been done to normalize the datasets such that they approximate this assumption^[@Bullard:2010;@Dillies:2013]. Even so, that the data are counts can be simply disproven by running the same library on instruments with different capacities and attempting to normalize. The relative relationships between the features (genes, OTUs, functions) are preserved but the actual count values are not.

With this simple realization, a number of groups have realized that HTS datasets are actually count-compositions^[@lovell:2011;@Friedman:2012;@fernandes:2013;@fernandes:2014;@gloorFrontiers:2017;@Quinn206425] which have dramatically different statistical properties than do counts^[@Aitchison:1986;@pawlowsky2015modeling;@pawlowsky2011compositional]. Thus, r Biocpkg("ALDEx2") is an R package for differential relative abundance that takes into account the count-compositional nature of these data.

Installation

There are two ways to download and install the most current of r Biocpkg("ALDEx2"). The most recent version of the package will be found at github.com/ggloor/ALDEx_bioc. The package will run with only the base R packages and a minimal Bioconductor install, and is capable of running several functions with the parallel' package if installed. It has been tested with version R version 3, but should run on version 2.12 onward as long as dependencies are met. It is recommended that the package be run on the most up-to-date R and Bioconductor versions.r Biocpkg("ALDEx2")will make use of the BiocParallel package if possible, otherwise,r Biocpkg("ALDEx2")` will run in serial mode.

install.packages("devtools")
devtools::install_github("ggloor/ALDEx_bioc")
source("https://bioconductor.org/biocLite.R")
biocLite("ALDEx2")

Quick Start: aldex with 2 groups:

The r Biocpkg("ALDEx2") package in Bioconductor is modular and is suitable for the comparison of many different experimental designs. This is achieved by exposing the underlying centre log-ratio transformed Dirichlet Monte-Carlo replicate values to make it possible for anyone to add the specific R code for their experimental design --- a guide to these values is outlined below.

However, r Biocpkg("ALDEx2") contains an aldex wrapper function that can perform many simple analyses. This wrapper will link the modular elements together to emulate r Biocpkg("ALDEx2") prior to the modular approach. In the simplest incarnation, which we will use below, aldex does a two-sample t-test and calculates effect sizes. If the test is 't', then effect should be set to TRUE. The 't' option evaluates the data as a two-factor experiment using both the Welch's t and the Wilcoxon rank tests. In more complex incarnations for multi-sample tests using ANOVA modules the effect size should not be calculated and then the effect parameter should be FALSE. The 'kw' option evaluates the data as a one-way ANOVA using the glm and Kruskal-Wallace tests. All tests include a Benjamini-Hochberg correction of the raw P values. The data can be plotted onto Bland-Altmann^[@altman:1983] (MA) or effect (MW) plots^[@gloor:effect] for for two-way tests using the aldex.plot function. See the end of the modular section for examples of the plots.

Case study t-test of a growth selection type (SELEX) experiment.

This section contains an analysis of a dataset collected where a single gene library was made that contained 1600 sequence variants at 4 codons in the sequence^[@mcmurrough:2014]. These variants were cloned into an expression vector at equimolar amounts. The wild-type version of the gene conferred resistance to a topoisomerase toxin. Seven independent growths of the gene library were conducted under selective and non-selective conditions and the resulting abundances of each variant was read out by sequencing a pooled, barcoded library on an Illumina MiSeq. The data table is included as selex_table.txt in the package. In this data table, there are 1600 features and 14 samples. The analysis takes approximately 2 minutes and memory usage tops out at less than 1Gb of RAM on a mobile i7 class processor when we use 128 Dirichlet Monte-Carlo Instances (DMC). For speed concerns we use only the first 400 features and perform only 16 DMC. The command used for r Biocpkg("ALDEx2") is presented below:

First we load the library and the included selex dataset. Then we set the comparison groups. This must be a vector of conditions in the same order as the samples in the input counts table. The aldex command is calling several other functions in the background, and each of them returns diagnostics. These diagnostics are suppressed for this vignette.

library(ALDEx2)
data(selex)
#subset only the last 400 features for efficiency
selex.sub <- selex[1:400,]

conds <- c(rep("NS", 7), rep("S", 7))
x.all <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE,
     include.sample.summary=FALSE, denom="all", verbose=FALSE)

par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch", xlab="Log-ratio abundance",
    ylab="Difference")
aldex.plot(x.all, type="MW", test="welch", xlab="Dispersion",
    ylab="Difference")

Using r Biocpkg("ALDEx2") modules

The modular approach exposes the underlying intermediate data so that users can generate their own tests. The simple approach outlined above just calls aldex.clr, aldex.ttest, aldex.effect in turn and then merges the data into one dataframe called x.all. We will show these modules in turn, and then examine additional modules.

The aldex.clr module

The workflow for the modular approach first generates random instances of the centred log-ratio transformed values. There are three inputs: counts table, a vector of conditions, and the number of Monte-Carlo (DMC) instances; and several parameters: a string indicating if iqlr, zero or all feature are used as the denominator is required, and level of verbosity (TRUE or FALSE). We recommend 128 or more mc.samples for the t-test, 1000 for a rigorous effect size calculation, and at least 16 for ANOVA.^[in fact we recommend that the number of samples in the smallest group multiplied by the number of DMC be equal at least 1000 in order to generate a reasonably stable estimate of the posterior distribution]

This operation is fast.

# the output is in the S3 object 'x'
x <- aldex.clr(selex.sub, conds, mc.samples=16, denom="all", verbose=F)

The aldex.ttest module

The next operation performs the Welch's t and Wilcoxon rank test for the situation when there are only two conditions. There are only two inputs: the aldex object from aldex.clr and whether a paired test should be conducted or not (TRUE or FALSE).

This operation is reasonably fast.

x.tt <- aldex.ttest(x, paired.test=FALSE, verbose=FALSE)

The aldex.kw module

Alternatively to the t-test, the user can perform the glm and Kruskal Wallace tests for one-way ANOVA of two or more conditions. Here there are only two inputs: the aldex object from aldex.clr, and the vector of conditions. Note that this is slow! and is not evaluated for this documentation.

x.kw <- aldex.kw(x)

The aldex.effect module

Finally, we estimate effect size and the within and between condition values in the case of two conditions. This step is required for plotting, in our lab we base our conclusions primarily on the output of this function^[@macklaim:2013;@mcmurrough:2014;@bian:2017].. There is one input: the aldex object from aldex.clr, ; and several parameters: a flag as to whether to include values for all samples or not are used as the denominator, and the level of verbosity. It is also possible to include the 95% confidence interval information for the effect size estimate with the flag CI=TRUE. This can be helpful when deciding whether to include or exclude specific features from consideration. We find that a large effect but that is an outlier for the extremes of the effect distribution can be false positives.

x.effect <- aldex.effect(x, CI=T, verbose=FALSE)

The aldex.plot module

Finally, the t-test and effect data are merged into one object.

x.all <- data.frame(x.tt,x.effect)

And the data are plotted. We see that the plotted data in Figure 1 and 2 are essentially the same.

par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch")
aldex.plot(x.all, type="MW", test="welch")

The effect confidence interval

As noted above, the r Biocpkg("ALDEx2") package generates a posterior distribution of the probability of observing the count given the data collected. Here we show the importance of this approach by examining the 95% CI of the effect size. Throughout, we use a standardized effect size, similar to the Cohen's d metric, although our effect size is more robust and more conservative (being approximately 0.7 Cohen's d when the data are Normally distributed)^[Fernandes Gloor unpublished].

Comparing the outputs of Figures 2 and 3 shows a very important point: there is a tremendous amount of latent variation in sequencing data \emph{simply from random sampling alone} We see in Figure 2 that there are a few features that have an expected q value that is statistically significantly different, which are both relatively rare and which have a relatively small difference. Even when identifying features based on the expected effect size can be misleading. We find that the safest approach is to identify those features that where the 95% CI of the effect size does not cross 0.

Examining the figures we see that the features that are the rarest are least likely to reproduce the effect size with simple random sampling. The behaviour of the 95% CI metric is exactly in line with our intuition: the precision of estimation of rare features is poor---if you want more confidence in rare features you must spend more money to sequence more deeply.

With this approach we are accepting the biological variation in the data as received^[i.e. we are not inferring any additional biological variation: i.e., the experimental design is always as given], but are identifying those features where simple random sampling of the library would be expected to give the same result every time. This is the approach that was used in [@macklaim:2013], the results of which were independently validated and found to be very robust [@nelson:2015vaginal].

sgn <- sign(x.effect$effect.low) == sign(x.effect$effect.high)
par(mfrow=c(1,2))
plot(x.effect$rab.all, x.effect$diff.btw, pch=19, cex=0.3, col="grey", xlab="Abundance", ylab="Difference", main="Bland-Altman")
points(x.effect$rab.all[abs(x.effect$effect) >=2], x.effect$diff.btw[abs(x.effect$effect) >=2], pch=19, cex=0.5, col="red")
points(x.effect$rab.all[sgn], x.effect$diff.btw[sgn], cex=0.7, col="blue")

plot(x.effect$diff.win, x.effect$diff.btw, pch=19, cex=0.3, col="grey",xlab="Dispersion", ylab="Difference", main="Effect")
points(x.effect$diff.win[abs(x.effect$effect) >=2], x.effect$diff.btw[abs(x.effect$effect) >=2], pch=19, cex=0.5, col="red")
points(x.effect$diff.win[sgn], x.effect$diff.btw[sgn], cex=0.7, col="blue")
abline(0,2, lty=2, col="grey")
abline(0,-2, lty=2, col="grey")

Complex study designs and the aldex.glm module

The aldex.glm module has been included so that the probabilistic compositional approach can be used for complex study designs. This module is substantially slower than the two-comparison tests above, but we think it is worth it if you have complex study designs.

Essentially, the approach is the modular approach above but using a model matrix and covariates supplied to the glm function in R. The values returned are the expected values of the glm function given the inputs. In the example below, we are measuring the predictive value of variables A and B independently. See the documentation for the R formula function, or http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf for more information.

Validation of features that are differential under any of the variables identified by the aldex.glm function should be performed using the aldex.effect function as a post-hoc test.

Note that aldex.clr will accept the denom="all" or a user-defined vector of denominator offsets when a model matrix is supplied. Therefore, when it is intended that the downstream analysis is the aldex.glm function only those two denominator options are available. This will be addressed in a future release.

data(selex)
selex.sub <- selex[1:500, ]
 covariates <- data.frame("A" = sample(0:1, 14, replace = TRUE),
     "B" = c(rep(0, 7), rep(1, 7)),
     "Z" = sample(c(1,2,3), 14, replace=TRUE))
 mm <- model.matrix(~ A + Z + B, covariates)
 x <- aldex.clr(selex.sub, mm, mc.samples=4, denom="all", verbose=F)
 glm.test <- aldex.glm(x, mm)
 glm.effect <- aldex.glm.effect(x)

The aldex.glm.effect function will calculate the effect size for all models in the matrix that are binary. The effect size calculations for each binary predictor are output to a named list. These effect sizes, and outputs from aldex.glm can be plotted as in the example code block below which plots the BH-corrected glm values for the actual test case vs. the effect size for the binary predictor.

aldex.plot(glm.effect[["B"]], test="effect", cutoff=2)
sig <- glm.test[,20]<0.05
points(glm.effect[["B"]]$diff.win[sig],
    glm.effect[["B"]]$diff.btw[sig], col="blue")
sig <- glm.test[,20]<0.2
points(glm.effect[["B"]]$diff.win[sig],
    glm.effect[["B"]]$diff.btw[sig], col="blue")

\clearpage

ALDEx2 outputs

Expected values

a r Biocpkg("ALDEx2") returns expected values for summary statistics. It is important to note that ALDEx uses Bayesian sampling from a Dirichlet distribution to estimate the underlying technical variation. This is controlled by the number of \texttt{mc.samples}, in practice we find that setting this to 16 or 128 is sufficient for most cases as r Biocpkg("ALDEx2") is estimating the expected value of the distributions^[@fernandes:2013;@fernandes:2014;@gloorAJS:2016].

In practical terms, r Biocpkg("ALDEx2") takes the biological observations as given, but infers technical variation (sequencing the same sample again) multiple times using the aldex.clr function. Thus, the expected values returned are those that would likely have been observed \emph{if the same samples had been run multiple times}. The user is cautioned that the number of features called as differential will vary somewhat between runs because of this sampling procedure. However, only features with values close to the chosen significance cutoff will vary between runs.

Several papers have suggested that r Biocpkg("ALDEx2") is unable to properly control for the false discovery rate since the P values returned do not follow a random uniform distribution, but rather tend to cluster near a value of 0.5^[@hawinkel2017;]@Thorsen:2016aa. These studies indicate that point estimate approaches are very sensitive to particular experimental designs and differences in sparsity and read depth. However, r Biocpkg("ALDEx2") is not sensitive to these characteristics of the data, but seem to under-report the true FDR. The criticisms miss the mark on r Biocpkg("ALDEx2") because r Biocpkg("ALDEx2") reports the \emph{expected} P value across the Dirichlet Monte-Carlo replicates. Features that are differential simply because of the vagaries of random sampling will indeed have a random uniform P value as point estimates, but will have an expected P value after repeated random sampling of 0.5. In contrast, features that are differential because of true biological variation are robust to repeated random sampling. Thus, r Biocpkg("ALDEx2") identifies as differential only those features where simple random sampling (the minimal NULL hypothesis) cannot explain the difference.

In our experience, we observe that r Biocpkg("ALDEx2") returns a set of features that is very similar to the set returned as the intersect of multiple independent tools---a common recommendation when examining HTS datasets^[@Soneson:2013]

Explaining the outputs

variant | we.ep | we.eBH | wi.ep | wi.eBH | kw.ep -|-|-|-|-|- A:D:A:D | 4.03010e-01 | 0.63080705 | 0.239383012 | 0.43732819 | 0.21532060 A:D:A:E | 1.15463e-01 | 0.34744596 | 0.040901806 | 0.15725841 | 0.03745315 A:E:A:D | 8.98797e-05 | 0.00329076 | 0.000582750 | 0.00820759 | 0.00174511

variant | kw.eBH | glm.ep | glm.eBH | rab.all | rab.win.NS | rab.win.S -|-|-|-|-|-|- A:D:A:D | 0.3932743 | 3.61061e-01 | 5.23582e-01 | 1.42494 | 1.30886 | 2.45384 A:D:A:E | 0.1486590 | 8.12265e-02 | 1.92292e-01 | 1.71230 | 1.49767 | 4.23315 A:E:A:D | 0.0245786 | 7.73660e-08 | 3.35492e-06 | 3.97484 | 1.41163 | 11.02154

variant | diff.btw | diff.win | effect | overlap -|-|-|-|- A:D:A:D | 1.12261 | 1.72910 | 0.471043 | 0.267260701 A:D:A:E | 2.73090 | 2.38134 | 1.034873 | 0.135857781 A:E:A:D | 9.64287 | 2.85008 | 3.429068 | 0.000156632

In the list below, the aldex.ttest function returns the values highlighted with $\ast$, the aldex.kw function returns the values highlighted with $\circ$, and the aldex.effect function returns the values highlighted with $\diamondsuit$.

  1. $\ast$ we.ep - Expected P value of Welch's t test
  2. $\ast$ we.eBH - Expected Benjamini-Hochberg corrected P value of Welch's t test
  3. $\ast$ wi.ep - Expected P value of Wilcoxon rank test
  4. $\ast$ wi.eBH - Expected Benjamini-Hochberg corrected P value of Wilcoxon test
  5. $\circ$ kw.ep - Expected P value of Kruskal-Wallace test
  6. $\circ$ kw.eBH - Expected Benjamini-Hochberg corrected P value of Kruskal-Wallace test
  7. $\circ$ glm.ep - Expected P value of glm test
  8. $\circ$ glm.eBH - Expected Benjamini-Hochberg corrected P value of glm test
  9. $\diamondsuit$ rab.all - median clr value for all samples in the feature
  10. $\diamondsuit$ rab.win.NS - median clr value for the NS group of samples
  11. $\diamondsuit$ rab.win.S - median clr value for the S group of samples
  12. $\diamondsuit$ rab.X1_BNS.q50 - median expression value of features in sample X1_BNS if include.item.summary=TRUE
  13. $\diamondsuit$ dif.btw - median difference in clr values between S and NS groups
  14. $\diamondsuit$ dif.win - median of the largest difference in clr values within S and NS groups
  15. $\diamondsuit$ effect - median effect size: diff.btw / max(diff.win) for all instances
  16. $\diamondsuit$ overlap - proportion of effect size that overlaps 0 (i.e. no effect)

A word about effect size and overlap

The effect size metric used by r Biocpkg("ALDEx2") is a standardized distributional effect size metric developed specifically for this package. The measure is somewhat robust, allowing up to 20% of the samples to be outliers before the value is affected, returns an effect size that is 71% the size of Cohen's d on a Normal distribution, and requires at worst twice the number of samples as fully parametric methods (which are not robust) would to estimate values with the same precision. The metric is equally valid for Normal, random uniform and Cauchy distributions^(@Gloor submitted).

We prefer to use the effect size whenever possible rather than statistical significance since an effect size tells the scientist what they want to know---"what is reproducibly different between groups"; this is emphatically not something that P values deliver. We find that using the effect size returns a consistent set of true positive features irregardless of sample size, unlike P value based methods. Furthermore, over half of the the false positive features that are observed at low sample sizes have and effect size $> 0.5 \times \mathrm{E}$ the chosen effect size cutoff $\mathrm{E}$. This is true regardless of the source of the dataset (@Gloor submitted).

We suggest that an effect size cutoff of 1 or greater be used when analyzing HTS datasets. If preferred the user can also set a fold-change cutoff as is commonly done with P value based methods.

The plot below shows the relationship between effect size and P values and BH-adjusted P values in the test dataset.

par(mfrow=c(1,2))
plot(x.all$effect, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Effect size", ylab="P value", main="Effect size plot")
points(x.all$effect, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")
legend(15,1, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))

plot(x.all$diff.btw, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Difference", ylab="P value", main="Volcano plot")
points(x.all$diff.btw, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")

Alternative plotting of outputs

The built-in aldex.plot function described above will usually be sufficient, but for more user control the example in Figure 4 shows a plot that shows which features are found by the Welchs' or Wilcoxon test individually (blue) or by both (red).

# identify which values are significant in both the t-test and glm tests
found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)

# identify which values are significant in fewer than all tests
found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)

# plot the within and between variation of the data
plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=0.3, col=rgb(0,0,0,0.3),
 xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
 cex=0.7, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
 cex=0.7, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)

\clearpage

Troubleshooting datasets

Correcting for asymmetric datasets

In some cases we observe that the data returned by the centre log-ratio can be asymmetric. This occurs when the data are extremely asymmetric, such as when one group is largely composed of features that are absent in the other group. In this case the geometric mean will not accurately represent the appropriate basis of comparison for each group. An asymmetry can arise for many reasons: in RNA-seq it could arise because samples in one group contain a plasmid and the samples in the other group do not; in metagenomics or 16S rRNA gene sequencing it can arise when the samples in the two groups are taken from different environments; in a selex experiment it can arise because the two groups are under different selective constraints. The asymmetry can manifest either as a difference in sparsity (i.e., one group contains more 0 value features than the other) or as a systematic difference in abundance. When this occurs the geometric mean of each sample and group can be markedly different, and thus an inherent skew in the dataset can occur that leads to false positive and false negative feature calls. Asymmetry generally shows as a the centre of mass of the histogram for the x.all\$diff.btw or x.all\$effect being not centred around zero\footnote{Wu et al (in prep)}. We recommend that all datasets be examined for asymmetry.

The approach taken by r Biocpkg("ALDEx2") is to identify those features that are relatively invariant in all features in the entire dataset even though many features could be asymmetric between the groups. Fundamentally, the log-ratio approach requires that the denominator across all samples be comparable. The output of aldex.clr contains the offset of the features used for the denominator in the @denom slot.

Methods to correct for asymmetry

The aldex.clr function incorporates multiple approaches to select the denominator that can best deal with asymmetric datasets:

IMPORTANT: no rows should contain all 0 values as they will be removed by the aldex.clr function

  1. all: The default is to calculate the geometric mean of all features using the centred log-ratio of Aitchison^[Aitchison:1986]. This is the default method for the compositional data analysis approach.

  2. iqlr: The _iqlr method identifies those features that exhibit reproducible variance in the entire dataset. This is called the inter-quartile log-ratio or iqlr approach. For this, a uniform prior of 0.5 is applied to the dataset, the clr transformation is applied, and the variance of each feature is calculated. Those features that have variance values that fall between the first and third quartiles of variance for all features in all groups in the dataset are retained. When aldex.clr is called, the geometric mean abundance of only the retained features is calculated and used as the denominator for log-ratio calculations. Modelling shows that this approach is effective in dealing with datasets with up to 25% of the features being asymmetric. The approach has the advantage it has little or no effect on symmetric datasets and so is a safe approach if the user is unsure if the data is mildly asymmetric.

  3. lvha: This method identifies those features that in the bottom quartile for variance in each group and the top quartile for relative abundance for each sample and across the entire dataset. This method is appropriate when the groups are very asymmetric, but there are some features that are expected to be relatively constant. The basic idea here is to identify those features that are relatively constant across all samples, similar to the features that would be chosen as internal standards in qPCR. Experience suggests that meta-genomic and meta-transcriptomic datasets can benefit from this method of choosing the denominator. This method does not work with the selex dataset, since no features fit the criteria.

  4. zero: This approach identifies and uses only those features that are non-zero in each group. In this approach the per-group non-zero features are used when aldex.clr calculates the geometric mean in the clr transformation. This method is appropriate when the groups are very asymmetric, but the experimentalist must ask whether the comparison is valid in these extreme cases.

  5. user: The last new approach is to let the user define the set of `invariant' features. In the case of meta-rna-seq, it could be argued that the levels of housekeeping genes should be standard for all samples. In this case the user could define the row indices that correspond to the particular set of housekeeping genes to use as the standard. \emph{It is important that no row in the dataset contains all 0 values for any feature when this method is used.}

  6. iterate: This method identifies those features that are not statistically significantly different between groups using the statistical test of choice. It may be combined with other methods.

Figure 5 shows the effect of the iqlr correction on the example dataset. When the denominator is all, we see that the bulk of the points fall off the midpoint (dotted line), but that the bulk of the points are centered around 0 for the iqlr and lvha analysis. Thus, we have a demonstrably better centring of the data in the latter two methods. Practically speaking, we alter the p values and effect sizes of features near the margin of significance following iqlr or lvha transformation. The effect is largest for those features that are close to the bulk datapoints.

First the code:

# small synthetic dataset for illustration
# denominator features in x@denominator

data(synth2)
blocks <- c(rep("N", 10),rep("S", 10))
x <- aldex.clr(synth2, blocks, denom="all")
x.e <- aldex.effect(x)
plot(x.e$diff.win, x.e$diff.btw, pch=19, col=rgb(0,0,0,0.1), cex=0.5, xlab="dispersion", ylab="difference", main="all")
points(x.e$diff.win[x@denom], x.e$diff.btw[x@denom], pch=19, col=rgb(0.8,0.5,0,0.7), cex=0.5)
points(x.e$diff.win[47:86], x.e$diff.btw[47:86], col=rgb(0.8,0,0,0.7), cex=0.5)
points(x.e$diff.win[980:1000], x.e$diff.btw[980:1000], col=rgb(0.8,0,0,0.7), cex=0.5)
abline(0,1)
abline(0,-1)
abline(h=0, col="gray", lty=2)
# plot it
par(mfrow=c(1,3))
data(synth2)
blocks <- c(rep("N", 10),rep("S", 10))
x <- aldex.clr(synth2, blocks, denom="all")
x.e <- aldex.effect(x)

plot(x.e$diff.win, x.e$diff.btw, pch=19, col=rgb(0,0,0,0.1), cex=0.5, xlab="dispersion", ylab="difference", main="all")
points(x.e$diff.win[x@denom], x.e$diff.btw[x@denom], pch=19, col=rgb(0.8,0.5,0,0.7), cex=0.5)
points(x.e$diff.win[47:86], x.e$diff.btw[47:86], col=rgb(0.8,0,0,0.7), cex=0.5)
points(x.e$diff.win[980:1000], x.e$diff.btw[980:1000], col=rgb(0.8,0,0,0.7), cex=0.5)
abline(0,1)
abline(0,-1)
abline(h=0, col="gray", lty=2)

x <- aldex.clr(synth2, blocks, denom="iqlr")
x.e <- aldex.effect(x)
plot(x.e$diff.win, x.e$diff.btw, pch=19, col=rgb(0,0,0,0.1), cex=0.5, xlab="dispersion", ylab="difference", main="iqlr")
points(x.e$diff.win[x@denom], x.e$diff.btw[x@denom], pch=19, col=rgb(0.8,0.5,0,0.7), cex=0.5)
points(x.e$diff.win[47:86], x.e$diff.btw[47:86], col=rgb(0.8,0,0,0.7), cex=0.5)
points(x.e$diff.win[980:1000], x.e$diff.btw[980:1000], col=rgb(0.8,0,0,0.7), cex=0.5)
abline(0,1)
abline(0,-1)
abline(h=0, col="gray", lty=2)

x <- aldex.clr(synth2, blocks, denom="lvha")
x.e <- aldex.effect(x)
plot(x.e$diff.win, x.e$diff.btw, pch=19, col=rgb(0,0,0,0.1), cex=0.5, xlab="dispersion", ylab="difference", main="lvha")
points(x.e$diff.win[x@denom], x.e$diff.btw[x@denom], pch=19, col=rgb(0.8,0.5,0,0.7), cex=0.5)
points(x.e$diff.win[47:86], x.e$diff.btw[47:86], col=rgb(0.8,0,0,0.7), cex=0.5)
points(x.e$diff.win[980:1000], x.e$diff.btw[980:1000], col=rgb(0.8,0,0,0.7), cex=0.5)
abline(0,1)
abline(0,-1)
abline(h=0, col="gray", lty=2)

\clearpage

Contributors

I am grateful that r Biocpkg("ALDEx2") has taken on a life of its own.

Version information

Version 1.04 of ALDEx was the version used for the analysis in Macklaim et al.^[@macklaim:2013;@fernandes:2013]. This version was suitable only for two-sample two-group comparisons, and provided only effect size estimates of difference between groups. ALDEx v1.0.4 is available at: \begin{verbatim}https://github.com/ggloor/ALDEx2/blob/master/ALDEx_1.0.4.tar.gz\end{verbatim}. No further changes are expected for that version since it can be replicated completely within r Biocpkg("ALDEx2") by using only the aldex.clr and aldex.effect commands.

Versions 2.0 to 2.05 were development versions that enabled P value calculations. Version 2.06 of r Biocpkg("ALDEx2") was the version used for the analysis in^[@fernandes:2014]. This version enabled large sample comparisons by calculating effect size from a random sample of the data rather than from an exhaustive comparison.

Version 2.07 of r Biocpkg("ALDEx2") was the initial the modular version that exposed the intermediate calculations so that investigators could write functions to analyze different experimental designs. As an example, this version contains an example one-way ANOVA module. This is identical to the version submitted to Bioconductor as 0.99.1.

Future releases of r Biocpkg("ALDEx2") now use the Bioconductor versioning numbering.

sessionInfo()


Try the ALDEx2 package in your browser

Any scripts or data that you put into this service are public.

ALDEx2 documentation built on Nov. 8, 2020, 8:05 p.m.