4. RNA-Seq expression example

# To examine objects:
# devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/V4_airway/html")

knitr::opts_chunk$set(cache=TRUE, autodep=TRUE)

Let's look at the airway dataset as an example of a typical small-scale RNA-Seq experiment. In this experiment, four Airway Smooth Muscle (ASM) cell lines are treated with the asthma medication dexamethasone.

The function weitrix_calibrate_all will be used to assign precision weights to log Counts Per Million values.

library(weitrix)
library(ComplexHeatmap)
library(EnsDb.Hsapiens.v86)
library(edgeR)
library(limma)
library(reshape2)
library(tidyverse)
library(airway)

set.seed(1234)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
# The most reliable backed is to use serial processing
BiocParallel::register( BiocParallel::SerialParam() )
data("airway")
airway

Initial processing

Initial steps are the same as for a differential expression analysis.

counts <- assay(airway,"counts")

design <- model.matrix(~ dex + cell, data=colData(airway))

good <- filterByExpr(counts, design=design) 
table(good)

airway_dgelist <- DGEList(counts[good,]) %>% calcNormFactors()

airway_lcpm <- cpm(airway_dgelist, log=TRUE, prior.count=1)

log2 CPM values have been calculated, with an added "prior" count of (on average) 1 read, so that zeros aren't sent to negative infinity.

Conversion to weitrix

airway_weitrix <- as_weitrix(airway_lcpm)

# Include row and column information
colData(airway_weitrix) <- colData(airway)
rowData(airway_weitrix) <- 
    mcols(genes(EnsDb.Hsapiens.v86))[rownames(airway_weitrix),c("gene_name","gene_biotype")]

airway_weitrix

Calibration

To calibrate, we need predicted expression levels so we can calculate residuals. The function weitrix_components can provide linear model fits to each gene. (We will see a more advanced use of this function later.)

fit <- weitrix_components(airway_weitrix, design=design)

Currently the weitrix has weights uniformly equal to 1. Examining residuals, we see a variance trend relative to the linear model prediction.

Each dot in the plot below is a residual weighted by sqrt(weight). The x-axis is the linear model prediction. The y-axis is the weighted residual (all weights are currently 1). The red lines show the mean and the mean +/- one standard deviation.

weitrix_calplot(airway_weitrix, fit, covar=mu, guides=FALSE)

We will use the function weitrix_calibrate_all to set the weights by fitting a gamma GLM with log link function to the weighted squared residuals. 1 over the predictions from this GLM are used as weights. Here we fit a natural spline based on the linear model predictions from fit, which are referred to in the formula as mu. well_knotted_spline is a wrapper around splines::ns for natural splines with improved choice of knot locations.

airway_cal <- weitrix_calibrate_all(
    airway_weitrix, 
    design = fit, 
    trend_formula = ~well_knotted_spline(mu,5))

metadata(airway_cal)

weitrix_calplot(airway_cal, fit, covar=mu)

The trend lines (red) for the calibrated weitrix are now uniformly close to 1, 0, -1 (guide lines shown in blue). Weights can now be treated as inverse residual variance.

Another way to check this is to plot the unweighted residuals against 1/sqrt(weight), i.e. the residual standard deviation.

weitrix_calplot(airway_cal, fit, funnel=TRUE)

We can also check each sample individually.

weitrix_calplot(airway_cal, fit, covar=mu, cat=col)

Advanced calibration

If there was an odd sample, we could use a more complex trend formula such as ~col*well_knotted_spline(mu,4):

airway_cal <- weitrix_calibrate_all(
    airway_weitrix, 
    design = fit, 
    trend_formula = ~col*well_knotted_spline(mu,4))

Similar to limma voom

While the precise details differ, what we have done is very similar to the voom function in limma.

weitrix_calplot can be used with voomed data as well.

airway_voomed <- voom(airway_dgelist, design, plot=TRUE)

weitrix_calplot(airway_voomed, design, covar=mu)

voom achieves a similar result to weitrix_calibrate_all (but note the input to voom is a DGEList of counts, not a matrix that is already log transformed). limma EList objects can be converted to weitrix objects with as_weitrix. Weitrix objects can be converted to limma EList objects with weitrix_elist.

Exploration

RNA-Seq expression is well trodden ground. The main contribution of the weitrix package here is to aid exploration by discovering components of variation, providing not just column scores but row loadings and respecting precision weights.

Find genes with excess variation

weitrix_sd_confects will find genes with excess variation in a calibrated weitrix.

confects <- weitrix_sd_confects(airway_cal)
confects

"effect" is root-mean-square variation in residuals relative to a fitted model in excess of what is expected from the calibrated weights. Here the model only has an intercept term, so the residuals represent variation from the weighted mean. "confect" scores are lower confidence bounds on the effect, adjusted for multiple testing using the topconfects method.

In single-cell experiments, this should be a good way to find marker genes.

Top genes can then be examined to find a reason for their variation. For example, we see that XIST is highly expressed in a particular cell type.

interesting <- confects$table$index[1:20]

centered <- weitrix_x(airway_cal) - rowMeans(weitrix_x(airway_cal))
rownames(centered) <- rowData(airway_cal)$gene_name
Heatmap(
    centered[interesting,], 
    name="log2 RPM\nvs row mean", 
    cluster_columns=FALSE)

Find components of variation

The code below will find various numbers of components, from 1 to 6. In each case, the components discovered have varimax rotation applied to their gene loadings to aid interpretability. The result is a list of Components objects.

comp_seq <- weitrix_components_seq(airway_cal, p=6, n_restarts=1)
comp_seq

We can compare the proportion of variation explained to what would be explained in a completely random weitrix. Random normally distributed values are generated with variances equal to one over the weights.

rand_weitrix <- weitrix_randomize(airway_cal)
rand_comp <- weitrix_components(rand_weitrix, p=1, n_restarts=1)

components_seq_screeplot(comp_seq, rand_comp)

Examining components

Up to 4 components may be justified.

comp <- comp_seq[[4]]

comp$col[,-1] %>% melt(varnames=c("Run","component")) %>%
    left_join(as.data.frame(colData(airway)), by="Run") %>%
    ggplot(aes(y=cell, x=value, color=dex)) + 
    geom_vline(xintercept=0) + 
    geom_point(alpha=0.5, size=3) + 
    facet_grid(~ component) +
    labs(title="Sample scores for each component", x="Sample score", y="Cell line", color="Treatment")

comp$row[,-1] %>% melt(varnames=c("name","component")) %>%
    ggplot(aes(x=comp$row[name,"(Intercept)"], y=value)) + 
    geom_point(cex=0.5, alpha=0.5) + 
    facet_wrap(~ component) +
    labs(title="Gene loadings for each component vs average log2 RPM", x="average log2 RPM", y="gene loading")

Without varimax rotation, components may be harder to interpret

If varimax rotation isn't used, weitrix_components and weitrix_components_seq will produce a Principal Components Analysis, with components ordered from most to least variance explained.

Without varimax rotation the treatment effect is still mostly in the first component, but has also leaked a small amount into the other components.

comp_nonvarimax <- weitrix_components(airway_cal, p=4, use_varimax=FALSE)

comp_nonvarimax$col[,-1] %>% melt(varnames=c("Run","component")) %>%
    left_join(as.data.frame(colData(airway)), by="Run") %>%
    ggplot(aes(y=cell, x=value, color=dex)) + 
    geom_vline(xintercept=0) + 
    geom_point(alpha=0.5, size=3) + 
    facet_grid(~ component) +
    labs(title="Sample scores for each component, no varimax rotation", x="Sample score", y="Cell line", color="Treatment")

col can potentially be used as a design matrix

If you're not sure of the experimental design, for example the exact timing of a time series or how evenly a drug treatment was applied, the extracted component might actually be more accurate.

Note that this ignores uncertainty about the col matrix itself.

This may be useful for hypothesis generation -- finding some potentially interesting genes, while discounting noisy or lowly expressed genes -- but don't use it as proof of significance.

First by the topconfects method. This will find the largest confident effect sizes, while still correcting for multiple testing.

weitrix_confects(airway_cal, comp$col, "C1")

If you prefer limma and p-values:

airway_elist <- weitrix_elist(airway_cal)

fit <- 
    lmFit(airway_elist, comp$col) %>% 
    eBayes()

fit$df.prior
fit$s2.prior

topTable(fit, "C1")

all_top <- topTable(fit, "C1", n=Inf, sort.by="none")
plotMD(fit, "C1", status=all_top$adj.P.Val <= 0.01)


Try the weitrix package in your browser

Any scripts or data that you put into this service are public.

weitrix documentation built on Nov. 8, 2020, 8:10 p.m.