Frequently Asked Questions
In propr: Calculating Proportionality Between Vectors of Compositional Data

Why does proportionality change when I remove features?

We use the term sub-compositional coherence to describe a method in which the results do not change when features (e.g., genes) are added or removed. The numerator portion of $\rho$ (i.e., the VLR) is sub-compositionally coherent. However, $\rho$ itself is not sub-compositionally coherent because the denominator portion of $\rho$ (and $\phi$ and $\phi_s$) scales the VLR by the variance of the centered log-ratio (clr) transformed features. The clr-transformation is not sub-compositionally coherent because the removal of a feature will change geometric mean reference for each composition (i.e., each sample) and therefore its transformation; consequently, the variance of each transformed feature vector changes too. An exception to this is when using the additive log-ratio (alr) transformation instead of clr.

Does the centered log-ratio have any limitations?

The major limitation of proportionality analysis is that its interpretation depends on the log-ratio transformation. If the majority of genes remain unchanged across all conditions, the clr approximates an unchanged reference and one can interpret results in absolute terms.

Excerpted from Erb et al. 2016:

This is known as the centered log-ratio transformation with g(x) the geometric mean over the genes for the given condition. The problem with this transformation is that it is sub-compositionally incoherent Aitchison (2003), so results will change to some extent when using subsets of genes for the analysis. In some cases, g(x) can approximate an unchanged reference. This applies whenever the majority of genes remains unchanged across conditions, so the unchanged genes will dominate the behaviour of the reference. Note that this is also the condition needed for a normalization by effective library size Robinson and Oshlack (2012).

What if the centered log-ratio isn't for me?

One option is to use a house-keeping gene or RNA spike-in that has an a priori known fixed abundance. If this is available, you can use the additive log-ratio (alr) transformation to effectively "back-calculate" absolute abundances. In practice, this is sometimes infeasible. Another option is to use a novel transformation introduced in Fernandes et al. 2013 called the inter-quartile log-ratio (iqlr) transformation.

Excerpted from the ALDEx2 vignette:

The [approach] is to include as the denominator for the geometric mean those features that are relatively invariant across all samples. This is termed the ‘iqlr’ method, and takes as the denominator the geometric mean of those features with variance calculated from the clr that are between the first and third quartile. This approach can be used until the asymmetry becomes so severe that more than 25% of the features are asymmetric between the groups. The iqlr approach has the advantage that it gives essentially the same answer as using the entire set of features in symmetric datasets.

Greg Gloor and I have coordinated to make the iqlr transformation and the Monte Carlo (MC) instances from ALDEx2 available for proportionality analysis in propr. You can use the aldex2propr function to build a propr object from an aldex.clr object. In practice, the code might look something like this:

data(caneToad.counts)
data(caneToad.groups)
counts <- as.data.frame(t(caneToad.counts))
x <- ALDEx2::aldex.clr(counts, caneToad.groups, denom = "iqlr")
rho <- propr::aldex2propr(x, how = "rho")

Note that this will average $\rho$ (or $\phi$ or $\phi_s$) across many MC instances which may or may not be desired. In theory, using MC instances should lessen the impact of zero or low counts on your final result. In practice, proportionality analysis will take a lot longer to run. Note, however, that you can also use the iqlr transformation without any MC instances:

rho <- propr(caneToad.counts, metric = "rho", ivar = "iqlr")

When using aldex2propr (or the iqlr transformation), please make sure to cite the appropriate ALDEx2 paper(s) to thank Greg for his generous contribution!

Why can't we just use the VLR by itself?

Quoting Lovell et al. 2015 (quoting Friedman et al. 2012):

Aitchison proposed logratio variance, var(log(x/y)), as a measure of association for variables that carry only relative information. When x and y are exactly proportional var(log(x/y)) = 0, but when x and y are not exactly proportional, “it is hard to interpret as it lacks a scale. That is, it is unclear what constitutes a large or small value… (does a value of 0.1 indicate strong dependence, weak dependence, or no dependence?)”

Why does my `smear` plot show perfectly proportional pairs?

Depending on how genes (or transcripts) get mapped, annotations with similar open reading frames (ORFs) can wrongly appear as if they have proportional abundance only because some sequences map to multiple annotations equally well. This is especially a problem when analyzing transcriptomic data: transcript isoforms tend to appear proportional with one another by virtue of their shared ORF. The discovery of proportional isoforms could reflect a real biological event (i.e., transcript isoforms under shared regulatory control) or an artifact of mapping. One could remove the set of extremely proportional pairs (e.g., $\rho > .995$) from the set of highly proportional pairs (e.g., $\rho > .95$):

data(caneToad.counts)
rho <- perb(caneToad.counts)
autoprop <- rho[">", .995]@pairs
highprop <- rho[">", .95]@pairs
rho@pairs <- setdiff(highprop, autoprop)

References

Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.
Fernandes, Andrew D., Jean M. Macklaim, Thomas G. Linn, Gregor Reid, and Gregory B. Gloor. “ANOVA-like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq.” PloS One 8, no. 7 (2013): e67019. http://dx.doi.org/10.1371/journal.pone.0067019.
Friedman, Jonathan, and Eric J. Alm. “Inferring Correlation Networks from Genomic Survey Data.” PLoS Computational Biology 8, no. 9 (2012): e1002687. http://dx.doi.org/10.1371/journal.pcbi.1002687.
Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.