An Introduction to Proportionality
In propr: Calculating Proportionality Between Vectors of Compositional Data

Introduction

The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, correlation lacks validity when applied to relative data, including count data generated by next-generation sequencing. This vignette describes three metrics of proportionality that measure dependence between relative features using compositional data analysis: $\phi$ (Lovell 2015), $\rho$ (Lovell 2015; Erb 2016), and $\phi_s$ (Erb 2016; Quinn 2017). Unlike correlation, proportionality gives the same result for both relative and absolute data. Meanwhile, pairs that are strongly proportional in relative space are also strongly correlated in absolute space. Proportionality helps avoid the pitfall of spurious correlation.

Theory

Let $A_i$ and $A_j$ each represent a log-ratio transformed feature vector measured across $N$ samples. We then define the metrics $\phi$, $\rho$, and $\phi_s$:

$$\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}$$

$$\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}$$

$$\phi_s(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i + A_j)}$$

Log-ratio transformation is applied to each subject vector $\textrm{x}$ with $D$ features (e.g., genes or microbes). In this vignette, we consider the centered log-ratio (clr) transformation and the additive log-ratio (alr) transformation. In clr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, $g(\textrm{x}) = \sqrt[D]{x_i...x_D}$. In alr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and a chosen reference feature, $x_D$. We define these accordingly:

$$\textrm{clr(x)} = \left[\ln\frac{x_i}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]$$

$$\textrm{alr(x)} = \left[\ln\frac{x_i}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]$$

Note that this package also implements the interquartile log-ratio transformation (iqlr) as used by the ALDEx2 package. For more information on integrating propr with ALDEx2, we refer the reader to the "Frequently Asked Questions" vignette.

Calculating proportionality

The measures $\phi$ and $\rho$ differ in three ways. First, the values of $\phi$ range from $[0, \infty)$ (with lower $\phi$ values indicating more proportionality) while the values of $\rho$ range from $[-1, 1]$ (with greater $|\rho|$ values indicating more proportionality and negative $\rho$ values indicating inverse proportionality). Second, $\phi$ lacks symmetry, although one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). Third, $\rho$ corrects for the individual variance of each feature in the pair, rather than for just one of the features. On the other hand, $\phi_s$ is a naturally symmetric variant of $\phi$ that also corrects for the individual variance of each feature in the pair. We advise against interpretting the negative proportionality directly; it does not always behave like one would expect!

Let us begin by building an arbitrary data set of 4 features (e.g., genes) measured across 100 subjects. In this example, the feature pairs "a" and "b" will be highly proportional.

set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
                c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))

Let $D$ represent any number of features measured across $N$ observations exposed to a binary or continuous event (for example, case-control status, treatment status, treatment dose, or time). These functions convert a "count matrix" with $N$ rows and $D$ columns into a proportionality matrix of $D$ rows and $D$ columns measuring proportionality for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of $\phi$) or a correlation matrix (in the case of $\rho$). The propr function returns a proportionality matrix bundled within an object of the class propr. This object contains four key slots:

@counts A matrix. Stores the original "count matrix" input.
@logratio A matrix. Stores the log-ratio transformed "count matrix".
@matrix A matrix. Stores the proportionality metrics.
@pairs A vector. Indexes proportionality of interest.

library(propr)
phi <- propr(X, metric = "phi", symmetrize = TRUE)
rho <- propr(X, metric = "rho", ivar = 0)
phs <- propr(X, metric = "phs", ivar = 0)

Note that the log-ratio transformation, by its nature, fails if the input data contain any zero values. By default, this function replaces all zero values with 1. Alternatively, the user may set the parameter alpha greater than zero to approximate log-ratios in the presence of zeros (via the Box-Cox transformation). However, the topic of zero replacement is controversial. Proceed carefully when analyzing data that contain zero values.

How proportional is proportional enough?

By default, the propr function creates 100 permutations of the original data set, optionally used later by updateCutoffs to permute the false discovery rate (FDR) for a set of proportionality cutoffs. This method works for all proportionality measures. In the example below, there are no false discoveries across the range of cutoffs.

updateCutoffs(rho, cutoff =  seq(.05, .95, .3))

Choose the largest cutoff that keeps the FDR below 0.05.

Subsetting propr objects

We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (i.e., the @matrix slot) based on an inequality operator and a reference value.

In this first example, we use [ to index the matrix by $\rho > .95$. This indexes the location of all values satisfying that inequality (i.e., in the lower left triangle of the matrix), and saves those indices to the @pairs slot. Indexing helps guide some of the bundled visualization methods in lieu of copy-on-modify subsetting. Note that indexing an already indexed object appends the new index to the previous index (via a union merge).

rho99 <- rho[">", .95]
rho99@pairs

Alternatively, using the subset method, we can subset an entire propr object based on a vector of feature indices or names. However, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large data sets.

In this second example, we subset by the feature names "a" and "b".

rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix

The convenience function, simplify, can subset an entire propr object based on the index saved in its @pairs slot. This function converts the saved index into a paired list of coordinates and passes them along to the subset method. As such, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large data sets. Unlike subset, simplify returns an object with the @pairs slot updated. Most users will find simplify preferable to subset.

simplify(rho99)

Computational burden

High-throughput genomic sequencing has the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix sized $D^2$, this method uses a lot of RAM when applied to real biological data sets. To address this, propr harnesses the power of C++ (via the Rcpp package) to minimize the run-time and RAM overhead. Below, we provide a table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should account for up to 25% more RAM for subsequent [ indexing.

| Features | Peak RAM (MiB) | |---------:|---------------:| |1000 | 8| |2000 | 31| |4000 | 123| |8000 | 491| |16000 | 1959| |24000 | 4405| |32000 | 7829| |64000 | 31301| |100000 | 76406|

Limitations

Although we developed this package with biological count data in mind, compositional count data is not truly compositional in that count data contain integer values only. As such, measuring "Gene A" as $1$ in one subject and "Gene B" as $2$ in another subject (i.e., the feature vector $[1, 2]$) does not carry the same information as measuring "Gene A" as $1000$ in one subject and "Gene B" as $2000$ in another subject (i.e., the feature vector $[1000, 2000]$) due to how additive variation affects the relative abundance of small counts more than large counts (Quinn 2017). We advise the investigator to proceed with caution when analyzing low counts.

References

Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.
Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.
Quinn, Thomas P., Mark F. Richardson, David Lovell, and Tamsyn M. Crowley. “Propr: An R-Package for Identifying Proportionally Abundant Features Using Compositional Data Analysis.” Scientific Reports 7, no. 1 (November 24, 2017): 16252. https://doi.org/10.1038/s41598-017-16520-0.