In this vignette, we use a real dataset to show how to apply proportionality analysis to RNA-seq count data. We place a particular emphasis here on documenting $\rho$ and the visualization tools included with this package. Although the user may feel eager to start here, we strongly recommend first reading the companion vignette, "An Introduction to Proportionality".
As a use case, we analyze raw RNA-seq counts from a published study on cane toad evolution and adaptation (Rollins 2015). This dataset contains the transcript counts for 20 toads, sampled from two locations in the wild. Sugar cane farmers introduced cane toads to Australia in 1935 as a pest control measure, but these toads quickly became pests themselves. Starting in Queensland (QLD), they have since spread out into parts of Western Australia (WA). The two locations sampled, which we will treat as the experimental groups, include the early settlement region in QLD and the front of expansion into WA. In this analysis, we try to tease apart the genomic differences between the regional and invasive toads. To begin, we load the propr
library along with the example data.
library(propr) data(caneToad.counts) data(caneToad.groups)
library(propr) data(caneToad.groups) data(top.counts) data(top.lr) best <- new("propr") best@counts <- as.data.frame(top.counts) best@logratio <- as.data.frame(top.lr) best@matrix <- propr:::lr2rho(top.lr) best@metric <- "rho" best <- best[">", .995]
Next, we calculate proportionality between all transcripts in the dataset. When working with a large number of transcripts (57,580 in this case), we may first wish to filter non-informative transcripts to minimize the computational burden of analysis. However, while the numerator portion of $\rho$ remains fixed regardless of pre-filtering, the denominator portion of $\rho$ does change. In other words, although the numerator portion is sub-compositionally coherent, the denominator portion is not. Therefore, pre-filtering before calculating $\rho$ impacts the final result.
In some circumstances, it is simply infeasible to calculate a proportionality matrix using all features. For example, calculating pairwise proportionality here requires at least 32 GB of RAM, not counting the additional RAM required for indexing and visualization. However, beginning with version 2.0.3, we can now pre-filter while calculating $\rho$, which does not impact the final result. For this pre-filter step, we remove all transcripts that do not have at least 10 counts in at least 10 samples. We base this on the intuition that lowly-expressed transcripts likely have high relative error in cross-sample comparisons due to random noise from the assay technology and alignment protocol. Note that beginning with version 2.2.0, caneToad.counts
is bundled as already pre-filtered in order to satisfy strict file size limits imposed by CRAN.
keep <- apply(caneToad.counts, 2, function(x) sum(x >= 10) >= 10) rho <- propr(caneToad.counts, metric = "rho", select = keep)
Next, we index the most highly proportional pairs based on a cutoff (preferably chosen with updateCutoffs
to minimize false discovery). Here, we use an arbitrarily high threshold so that the vignette renders quickly.
best <- rho[">", .995]
After indexing, it is helpful to check the pairwise distribution of proportional pairs using the plot
(or, equivalently, smear
) function. As an "index-aware" function, plot
only plots feature pairs indexed with [
. When interpreting this figure, a smear of straight diagonal lines confirms that the pairs are proportional. This means that as one transcript increases in log-ratio transformed abundance, so does the other.
plot(best)
You can also inspect how the indexed pairs cluster using the dendrogram
function. Specifically, this function clusters features based on a hierarchical clustering of the proportionality matrix. In this package, we define the dissimilarity measure as as.dist(1-abs(rho@matrix))
. Like plot
, this function is "index-aware" and only plots feature pairs indexed with [
. Take note that this dendrogram matches the dendrogram used to label co-clusters in the prism
, bokeh
, and bucket
plots below, but differs from the dendrogram produced by the snapshot
plot. Heatmap intensity is not scaled.
dendrogram(best)
The remaining visualization methods do not restrict plotting to indexed pairs, but instead incorporate all features in the propr
object. To exclude features that do not comprise at least one indexed pair, we simplify
the indexed propr
object.
best <- simplify(best)
Compositional data are not distributed in Euclidean space, but rather exist in a space known as the Aitchison simplex (Fernandes 2014). The centered log-ratio transformation moves these data into Euclidean space, allowing us to summarize the data with conventional statistics.
First, we use pca
to visualize the first two principal components. The group
argument allows us to color the sample IDs by group membership.
pca(best, group = caneToad.groups)
Second, we look at snapshot
, a function for visualizing the intensity of the log-ratio transformed data across samples. Heatmap intensity is not scaled.
snapshot(best)
Finally, we look at the prism
and bokeh
functions for visualizing the co-clustering of proportional features. We mention these plots together because they share some key similarities. First, they are all "index-naive". Second, they identify the feature pairs where both constituents co-cluster (with the total number of clusters toggled by k
). Third, they return a vector of cluster memberships for all features in the propr
object.
The prism
function plots the variance of the ratio of the log-ratio transformed feature pair (VLR) versus the sum of the individual variances of each log-ratio transformed feature (VLS). The ratio of the VLR to the VLS equals $1 - \rho$. As such, we use here seven rainbow colored lines to indicate where $\rho = [.01, .05, .50, 0, 1.50, 1.95, 1.99]$, going from red to violet. A low VLR with a high VLS suggests that the feature pair remains in an equilibrium despite high variability among the individual features.
clusts <- prism(best, k = 5)
The bokeh
function plots pairs across the individual variances of the constituent log-ratio transformed features. For clarity of visualization, this figure projects the data on a log-fold scale. Therefore, the highly variable co-clusters appear in the top-right of the figure while the lowly variable co-clusters appear in the bottom-left. Meanwhile, highly proportional pairs tend to aggregate around the $y=x$ diagonal.
clusts <- bokeh(best, k = 5)
These plots can help us conceptualize high-dimensional data and select a highly proportional module for further analysis. In this example, we choose co-cluster 2 for further analysis because it shows high proportionality in the setting of high individual feature variance.
We can extract co-cluster 2 from the propr
object using the subset
method.
sub <- subset(best, select = (clusts == 2))
We can use the pca
function to see how well this cluster differentiates the two experimental groups. We see here that the first two principal components of this highly proportional module results in good separation between the experimental groups. This matches the separation achieved in the source publication which used edgeR
for feature selection (Rollins 2015), suggesting that our unsupervised method identified a feature subset that is relevant to the experimental condition.
pca(sub, group = caneToad.groups)
Having identified a cluster that separates the experimental groups, the next step in this pipeline might involve a gene set enrichment analysis (GSEA) of the transcripts participating in this highly proportional module. The application of GSEA is beyond the scope of this vignette, although we can easily extract the names of the transcripts that belong to this cluster.
transcripts <- colnames(sub@logratio)
Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.
Fernandes, Andrew D., Jennifer Ns Reid, Jean M. Macklaim, Thomas A. McMurrough, David R. Edgell, and Gregory B. Gloor. “Unifying the Analysis of High-Throughput Sequencing Datasets: Characterizing RNA-Seq, 16S rRNA Gene Sequencing and Selective Growth Experiments by Compositional Data Analysis.” Microbiome 2 (2014): 15. http://dx.doi.org/10.1186/2049-2618-2-15.
Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.
Rollins, Lee A., Mark F. Richardson, and Richard Shine. “A Genetic Perspective on Rapid Evolution in Cane Toads (Rhinella Marina).” Molecular Ecology 24, no. 9 (May 2015): 2264-76. http://dx.doi.org/10.1111/mec.13184.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.