title: "Introduction to Compositional Normalization with sleuth-ALR"
author: "Warren A. McGee"
date: "r format(Sys.Date(), '%m/%d/%Y')
"
abstract: >
Most approaches to detecting differentially expressed features from bulk RNA-Seq
data treat the data as count data, and further these approaches make the
assumption when normalizing the data that the total RNA per cell is the same
across every sample. In reality, because of the protocol for generating RNA-Seq
data, the number of fragments sequenced is not proportional to the total RNA per
present. Thus, the data is inherently compositional, meaning that only units
expressing percentages/proportions are meaningful and that values only carry
relative information, even though absolute RNA copy numbers are often of
interest. Sleuth-ALR is an extension of the popular sleuth package that
explicitly models the data as compositional. This vignette explains the
use of this package and demonstrates a tpyical workflow for and interpretation
of an RNA-Seq analysis, starting from results after pseudoalignment (HDF5 files
produced by kallisto or Salmon).
sleuthALR package version: r packageVersion("sleuthALR")
output:
rmarkdown::html_vignette:
toc: true
fig_width: 5
bibliography: library.bib
vignette: >
%\VignetteIndexEntry{Introduction to Compositional Normalization with sleuth-ALR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
\usepackage[utf8]{inputenc}
knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, dev = "png", message = FALSE, error = FALSE, collapse = T, comment = "#>") library(sleuthALR)
When conducting an RNA-Seq differential analysis, most pipelines do the following with the raw FASTQ data: 1. Estimate the expression of features (transcripts, genes, etc.) 2. Normalize the data to allow comparison of samples 3. Statistical modeling to detect differences between conditions of interest
Most approaches treat RNA-Seq data as count data, i.e. the number of fragments that map to a particular feature, (see [@Love2014] and [@Anders2013] for examples advocating this approach). Most normalization approaches, to quote [@Robinson2010] directly, normalize "the overall expression levels of genes[features] between samples under the assumption that the majority of them are not DE [differentially expressed]".
In reality, as described in [@Lovell2011], [@Lovell2015], and [@Quinn2018], RNA-Seq data is actually compositional data, meaning that the only meaningful units are proportions or percentages (i.e. what percent of the total fragments map to a feature?). This is due to the constraints introduced by the library preparation protocol only preparing a sample of the total RNA, usually selecting for a subpopulation of RNAs, and sequencing to an arbitrary sequencing depth. Thus, only relative information (how is one feature behaving relative to another feature) is available.
Here, "compositional normalization" is defined as any approach that uses principles from Compositional Data Analysis, and features pre-determined to be reliable negative control features, to normalize RNA-Seq data for inter-sample comparison and statistical modeling. The most common examples of negative control features are external spike-in RNAs (e.g. ERCC spike-in mix), and validated reference genes. This allows for conclusions about changes in copy numbers per cell from the inherently compositional RNA-Seq.
The basic idea is analogous to how reference genes are used in qPCR: after choosing one or more "reference features", all other features in a sample are normalized to the reference feature(s) by taking the ratio of each feature's expression to the expression of the reference feature(s), and then taking the logarithm.
For more information about compositional normalization, please refer to the forthcoming preprint.
Sleuth-ALR has implemented compositional normalization for the popular sleuth package. In this vignette, we will introduce the full pipeline of a sleuth analysis using sleuth-ALR.
First we make sure the yeastStarvationData
package is installed, and if not, install it from on Github:
bool <- suppressMessages(suppressWarnings(require("yeastStarvationData"))) if (!bool) { devtools::install_github('warrenmcg/yeastStarvationData') }
This package contains processed data from [@Marguerat2012], which conducted an experiment in fission yeast (S. pombe) to see the effects of nitrogen starvation on gene expression. They grew yeast cells in either Edinburgh minimal media (EMM), or EMM without a nitrogen source, for 24 hours. They prepared poly-A-enriched RNA-Seq libraries from two independent cultures, and from the same samples also used an aliquot to estimate the copy numbers per cell of all yeast genes using a NanoString assay. See [@Marguerat2012] for more details, and [@Lovell2015] for an informative re-analysis of the data, illustrating the dangers of analyzing this data without the external information provided by the Nanostring assay.
Once the package is installed, we load the relevant data that we will need for this analysis:
data(denom, package = "yeastStarvationData") data(yeastAnnos, package = "yeastStarvationData") data(yeastS2C, package = "yeastStarvationData") sys_path <- system.file("extdata", package = "yeastStarvationData", mustWork = TRUE) yeastS2C$path <- file.path(sys_path, yeastS2C$accession)
denom
is a character vector containing the transcript that had the most consistent estimate of copy numbers per cell
across all yeast genes estimated using Nanostring; this will be used for our compositional normalization example.
yeastAnnos
is a data frame with fission yeast annotations taken from Ensembl Fungi Genomes release 37.
yeastS2C
is a "samples-to-covariates" data frame with sample metadata taken from ArrayExpress.
The path
column contains the raw data from [@Marguerat2012] processed using kallisto 0.44.0 and
annotations taken from Ensembl.
Only two functions are necessary to perform a full sleuth-ALR analysis: make_lr_sleuth_object
and sleuth_alr_results
.
The make_lr_sleuth_object
is a wrapper function that performs the steps of the sleuth pipeline: sleuth_prep
, and
optionally sleuth_fit
, sleuth_wt
, and sleuth_lrt
. Before doing so, though, it generates the necessary transformation
and normalization functions that perform compositional normalization.
The sleuth_alr_results
function is a wrapper for sleuth_results
with one main additional purpose.
When doing p-value aggregation (as described in [@Yi2017]), p-values are weighted by the average normalized expression value of
each transcript. The method requires the weights to be zero or greater. However, after compositional normalization, normalized
expression values can be negative, which would lead to invalid results during p-value aggregation. The sleuth_alr_results
method changes the default weighting function so that normalized values are transformed from logratios to ratios, which are
guaranteed to be greater than or equal to zero.
To run a basic sleuth-ALR analysis, run the following code:
alr_so <- make_lr_sleuth_object(sample_to_covariates = yeastS2C, target_mapping = yeastAnnos, beta = 'conditionnoN', full_model = ~condition, null_model = ~1, denom_name = denom, run_models = TRUE ) wt_res <- sleuth_alr_results(obj = alr_so, test = "conditionnoN") lrt_res <- sleuth_alr_results(obj = alr_so, test = "reduced:full", test_type = "lrt")
Some of the options used in sleuth_prep
are present here, and have the same purpose: sample_to_covariates
,
target_mapping
, full_model
, and aggregation_column
. See ?sleuth_prep
for more information on these arguments.
make_lr_sleuth_object
can take any arguments used by sleuth_prep
or sleuth_fit
(except the arguments specifying
the normalization and transformation functions, as those are produced internally). sample_to_covariates
is the only
required argument.
Two other options provide additional information to perform a full
sleuth-ALR analysis: null_model
to provide a null model to compare for a likelihood ratio test using sleuth_lrt
(in this case, a model using condition as a covariate versus an intercept-only model), and beta
to provide
a beta for the Wald test (in this case, noN vs control).
The unique argument for sleuth-ALR is denom_name
: this argument specifies the reference feature(s) to be used for
normalization.
For sleuth_alr_results
, the interface is the same as sleuth_results
, and all of the work is done under the hood.
If you have a validated negative control feature (e.g. external spike-ins, a validated reference gene, etc), then using sleuth-ALR with these features will result in much improved estimate of the changes occurring in the RNA copy numbers, since the data is now anchored to features with approximately constant absolute expression.
However, if a dataset does not have such a feature available (the vast majority of already published bulk RNA-Seq datasets are in this category), the data will never be able to directly reveal how RNA copy numbers are changing. It will only be able to reveal how features are changing relative to other features. There is some innovative work look at differential proportionality [@Erb2018], which would not need normalization. However, for those that wish to still use the traditional pipeline and normalization, sleuth-ALR offers a solution with the trade-off of interpretability: one can use a feature with the most consistent abundance (estimated counts, TPMs, etc), or a set of features that with approximately consistence abundances, then these could be used for normalization. In this situation, the new interpretation is "How are features changing relative to the global trend?" or "How are features changing relative to the total change in this RNA population (e.g. poly-A+ RNA)?"
If there is a global trend in total RNA per cell, the data cannot in principle reveal that information. However, it can tell you how features are changing relative to that trend. There will be features that are increasing or decreasing in an extreme that will be detected as such, but there will be a large number of features that have more modest changes or no change at all that will be detected as changing or changing in the opposite direction. These can still yield interesting biology, as the competitive approach taken by GSEA [@Maciejewski2014] can attest, but one would have to let go of putting stock into the magnitude or even the direction of change.
See the preprint for more information.
Coming Soon!
Coming Soon!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.