In this vignette, we use a real dataset to show how we can apply differential proportionality analysis to RNA-seq count data. We place a particular emphasis here on documenting the differential proportionality measures available through this package. This vignette discusses the propd
method.
Let us consider two non-zero positive vectors, $X$ and $Y$, measuring the relative abundance of raw sequence counts. One way to understand whether two vectors associate with one another is to consider the variance of the log-ratios (VLR), a compositionally valid measure of association that makes up part of the definition of the proportionality metrics. Although we can calculate VLR using log-ratio transformed data, the geometric mean or unchanged reference cancels. As such, we can use these counts without any transformation:
$$\textrm{VLR(X, Y)} = \textrm{var}(\log(\textrm{X} / \textrm{Y}))$$
The propd
method uses the VLR to test for differential proportionality. Recall that the reason we do not use the VLR for proportionality analysis is that we cannot compare the VLR for one pair against the VLR for another pair (thus we defined $\phi$ and $\rho$ as a modification to the VLR that gives it scale). However, in differential proportionality analysis, we compare the VLR for one pair across groups. Specifically, we extract the fraction of variance (interpreted as the residual variance of proportional features) where proportionality holds in both or only one of two groups.
We consider here two forms of differential proportionality. The first, which we call disjointed proportionality, considers the case where the proportionality of a pair holds in both groups, but the ratio between the partners changes between the groups (i.e., the slope of the proportionality changes). The second, which we call emergent proportionality, considers the case where there is proportionality in only one of the groups (i.e., the strength of the proportionality changes).
Given two groups, sized $k$ and $n - k$, we define $\theta_d$, a measure of disjointed proportionality, as the pooled (weighted) VLR within the two groups divided by the total VLR:
$$\theta_d(\textrm{X}, \textrm{Y}) = \frac{(k-1)\textrm{VLR}_1 + (n-k-1)\textrm{VLR}_2}{(n-1)\textrm{VLR}}$$
Likewise, we define $\theta_e$, a measure of emergent proportionality, as the fraction of variance that remains when subtracting the fraction of the dominating group variance:
$$\theta_e(\textrm{X}, \textrm{Y}) = 1 - \frac{\mathrm{max}[(k-1)\textrm{VLR}_1,(n-k-1)\textrm{VLR}_2]}{(n-1)\textrm{VLR}}$$
The propd
function estimates differential proportionality by calculating $\theta$ for all feature pairs. This function takes the following arguments as input:
Below, we run propd
using the iris
dataset.
library(propr) data(iris) keep <- iris$Species %in% c("setosa", "versicolor") counts <- iris[keep, 1:4] * 10 group <- ifelse(iris[keep, "Species"] == "setosa", "A", "B") pd <- propd(counts, group, alpha = NA, p = 100)
The resultant propd
object contains both $\theta_d$ and $\theta_e$ metrics (among others), although only $\theta_d$ is active by default. While a $\theta$ is active, it forms the basis for permutation testing (i.e., FDR estimation) and visualization. Users can easily change which $\theta$ is active using the functions setDisjointed
and setEmergent
.
theta_d <- setDisjointed(pd) theta_e <- setEmergent(pd)
Once the $\theta$ of interest is active, the user can estimate FDR using the updateCutoffs
function.
theta_d <- updateCutoffs(theta_d, cutoff = seq(0.05, 0.95, 0.3)) theta_e <- updateCutoffs(theta_e, cutoff = seq(0.05, 0.95, 0.3))
In order to reduce RAM overhead, the propd
object never stores the intermediate $\theta$ values for permutation testing. However, when a propd
object is created, it contains all the randomized group assignments needed for permutation testing, meaning that each updateCutoffs
run effectively uses the same random seed. One could exploit the data contained in the @permutes
slot to reproduce the intermediate calculations if needed.
To understand differential proportionality, we use two propd
objects, called pd.d
(for $\theta_d$) and pd.e
(for $\theta_e$), built from the bundled caneToad.counts
RNA-seq data (Rollins 2015). Specifically, we use the results of the propd
function as applied to cane toad transcripts with at least 40 counts in all 20 samples (thus removing any transcripts with 0 counts), subsetted to include only pairs with the top 1000 smallest $\theta$.
Note that in this vignette, we never apply updateCutoffs
to either data object. When estimating FDR, it is necessary to use an unfiltered propd
object to keep estimations unbiased.
data(pd.d, package = "propr") # top 1000 disjointed pairs data(pd.e, package = "propr") # top 1000 emergent pairs
We begin now by looking at disjointed proportionality in more detail. Based on its definition, we see that low values of $\theta_d$ select pairs where the total VLR far exceeds the weighted sum of the within-group VLRs. Often, the within-group VLRs are about the same size. However, this is not a requirement so long as the within-group VLRs are both small compared to the total VLR.
Below, we use the getResults
function to tabulate important pairwise measurements. Then, we show a scatter plot of the abundance for features "39" and "37", as colored by the experimental group, with the slopes of the trend lines equal to the ratio means.
tab <- getResults(pd.d) plot(pd.d@counts[, 39], pd.d@counts[, 37], col = ifelse(pd.d@group == "WA", "red", "blue")) grp1 <- pd.d@group == "WA" grp2 <- pd.d@group != "WA" abline(a = 0, b = pd.d@counts[grp1, 37] / pd.d@counts[grp1, 39], col = "red") abline(a = 0, b = pd.d@counts[grp2, 37] / pd.d@counts[grp2, 39], col = "blue")
Here, we see that these two features change proportionally across the samples within each group (as expected based on their small values of $\textrm{VLR}_1$ and $\textrm{VLR}_2$). However, when ignoring the group labels, the relationship between these two features appears noisy. Although "37" (y-axis) has increased in expression relative to "39" (x-axis), "37" is no less coordinated with "39". This change in ratio abundance is apparent when viewed through a per-sample projection.
plot(pd.d@counts[, 37] / pd.d@counts[, 39], col = ifelse(pd.d@group == "WA", "red", "blue"))
This figure shows a clear difference in the ratio abundances between the groups. It also highlights the analogy between disjointed proportionality and differential expression, although the interpretation of differentially abundant ratios differs considerably. Possible biological explanations for this event might include a reduction in the amount of mRNA degradation, a change in isoform splice bias, or an increase in the activity of a transcription factor.
In contrast, emergent proportionality has more in common with a test for differences in correlation coefficients. That is, emergent proportionality occurs when a pair is proportional in one group but not the other, such that the group with no proportionality contributes most of the total variance.
Below, we use the getResults
function again to tabulate important pairwise measurements. Then, we show a scatter plot of the abundance for features "106" and "2", as colored by the experimental group, with the slopes of the trend lines equal to the ratio means.
tab <- getResults(pd.e) plot(pd.e@counts[, 106], pd.e@counts[, 2], col = ifelse(pd.d@group == "WA", "red", "blue")) grp1 <- pd.e@group == "WA" grp2 <- pd.e@group != "WA" abline(a = 0, b = pd.e@counts[grp1, 2] / pd.e@counts[grp1, 106], col = "red") abline(a = 0, b = pd.e@counts[grp2, 2] / pd.e@counts[grp2, 106], col = "blue")
Here, we see that these two features change proportionally across the samples within one group but not the other. In other words, the experimental condition appears to have removed the coordination between the transcripts. Moreover, when ignoring the group labels, we see that one group happens to dominate the total VLR. Interestingly, this has happened here without much change in the average abundance ratio, as apparent when viewed through a per-sample projection.
plot(pd.e@counts[, 2] / pd.e@counts[, 106], col = ifelse(pd.d@group == "WA", "red", "blue"))
This figure confirms that the feature pair varies far more in one group than the other, all while the mean ratio abundances do not change considerably. Note that an increase in $\theta_e$ tends to impart a decrease in $\theta_d$. Precisely, $\theta_e$ relates to $\theta_d$ via the function:
$$\vartheta_\mathrm{e}=1-\vartheta + \frac{\mathrm{min}[(k-1)\textrm{VLR}_1,(n-k-1)\textrm{VLR}_2]}{(n-1)\textrm{VLR}}$$
From this, we establish the inequality:
$$1-\vartheta\le\vartheta_\mathrm{e}\le1-\vartheta/2,$$
As such, one can use $1 - \theta_e$ for a stricter definition of disjointed proportionality. This is implemented in propd
as $\theta_f$, which one can set active using the setActive
function.
pd.f <- setActive(pd, what = "theta_f")
The parallel
function shows the sample-wise distribution of log-ratio abundances for each pair relative to a reference feature. The include
argument can specify any feature by name. Note that it is not possible to tell from this figure which features have changed in absolute abundance: it is possible for a reference to change relative to its neighbors, for all neighbors to change relative to its reference, or for a reference and its neighbors to have changed simultaneously.
parallel(pd.d, cutoff = .15, include = "c19327_g2_i3")
parallel(pd.e, include = "c27054_g5_i1")
Erb, Ionas, Thomas Quinn, David Lovell, and Cedric Notredame. “Differential Proportionality - A Normalization-Free Approach To Differential Gene Expression.” bioRxiv, May 5, 2017, 134536. http://dx.doi.org/10.1101/134536.
Rollins, Lee A., Mark F. Richardson, and Richard Shine. “A Genetic Perspective on Rapid Evolution in Cane Toads (Rhinella Marina).” Molecular Ecology 24, no. 9 (May 2015): 2264-76. http://dx.doi.org/10.1111/mec.13184.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.