# To examine objects: # devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/V4_airway/html") knitr::opts_chunk$set(cache=TRUE, autodep=TRUE)
Let's look at the airway dataset as an example of a typical small-scale RNA-Seq experiment. In this experiment, four Airway Smooth Muscle (ASM) cell lines are treated with the asthma medication dexamethasone.
weitrix_calibrate_all will be used to assign precision weights to log Counts Per Million values.
library(weitrix) library(ComplexHeatmap) library(EnsDb.Hsapiens.v86) library(edgeR) library(limma) library(reshape2) library(tidyverse) library(airway) set.seed(1234) # BiocParallel supports multiple backends. # If the default hangs or errors, try others. # The most reliable backed is to use serial processing BiocParallel::register( BiocParallel::SerialParam() )
Initial steps are the same as for a differential expression analysis.
counts <- assay(airway,"counts") design <- model.matrix(~ dex + cell, data=colData(airway)) good <- filterByExpr(counts, design=design) table(good) airway_dgelist <- DGEList(counts[good,]) %>% calcNormFactors() airway_lcpm <- cpm(airway_dgelist, log=TRUE, prior.count=1)
log2 CPM values have been calculated, with an added "prior" count of (on average) 1 read, so that zeros aren't sent to negative infinity.
airway_weitrix <- as_weitrix(airway_lcpm) # Include row and column information colData(airway_weitrix) <- colData(airway) rowData(airway_weitrix) <- mcols(genes(EnsDb.Hsapiens.v86))[rownames(airway_weitrix),c("gene_name","gene_biotype")] airway_weitrix
To calibrate, we need predicted expression levels so we can calculate residuals. The function
weitrix_components can provide linear model fits to each gene. (We will see a more advanced use of this function later.)
fit <- weitrix_components(airway_weitrix, design=design)
Currently the weitrix has weights uniformly equal to 1. Examining residuals, we see a variance trend relative to the linear model prediction.
Each dot in the plot below is a residual weighted by sqrt(weight). The x-axis is the linear model prediction. The y-axis is the weighted residual (all weights are currently 1). The red lines show the mean and the mean +/- one standard deviation.
weitrix_calplot(airway_weitrix, fit, covar=mu, guides=FALSE)
We will use the function
weitrix_calibrate_all to set the weights by fitting a gamma GLM with log link function to the weighted squared residuals. 1 over the predictions from this GLM are used as weights. Here we fit a natural spline based on the linear model predictions from
fit, which are referred to in the formula as
well_knotted_spline is a wrapper around
splines::ns for natural splines with improved choice of knot locations.
airway_cal <- weitrix_calibrate_all( airway_weitrix, design = fit, trend_formula = ~well_knotted_spline(mu,5)) metadata(airway_cal) weitrix_calplot(airway_cal, fit, covar=mu)
The trend lines (red) for the calibrated weitrix are now uniformly close to 1, 0, -1 (guide lines shown in blue). Weights can now be treated as inverse residual variance.
Another way to check this is to plot the unweighted residuals against 1/sqrt(weight), i.e. the residual standard deviation.
weitrix_calplot(airway_cal, fit, funnel=TRUE)
We can also check each sample individually.
weitrix_calplot(airway_cal, fit, covar=mu, cat=col)
If there was an odd sample, we could use a more complex trend formula such as
airway_cal <- weitrix_calibrate_all( airway_weitrix, design = fit, trend_formula = ~col*well_knotted_spline(mu,4))
While the precise details differ, what we have done is very similar to the
voom function in limma.
weitrix_calplot can be used with voomed data as well.
airway_voomed <- voom(airway_dgelist, design, plot=TRUE) weitrix_calplot(airway_voomed, design, covar=mu)
voom achieves a similar result to
weitrix_calibrate_all (but note the input to
voom is a DGEList of counts, not a matrix that is already log transformed). limma EList objects can be converted to weitrix objects with
as_weitrix. Weitrix objects can be converted to limma EList objects with
RNA-Seq expression is well trodden ground. The main contribution of the weitrix package here is to aid exploration by discovering components of variation, providing not just column scores but row loadings and respecting precision weights.
weitrix_sd_confects will find genes with excess variation in a calibrated weitrix.
confects <- weitrix_sd_confects(airway_cal) confects
"effect" is root-mean-square variation in residuals relative to a fitted model in excess of what is expected from the calibrated weights. Here the model only has an intercept term, so the residuals represent variation from the weighted mean. "confect" scores are lower confidence bounds on the effect, adjusted for multiple testing using the
In single-cell experiments, this should be a good way to find marker genes.
Top genes can then be examined to find a reason for their variation. For example, we see that XIST is highly expressed in a particular cell type.
interesting <- confects$table$index[1:20] centered <- weitrix_x(airway_cal) - rowMeans(weitrix_x(airway_cal)) rownames(centered) <- rowData(airway_cal)$gene_name Heatmap( centered[interesting,], name="log2 RPM\nvs row mean", cluster_columns=FALSE)
The code below will find various numbers of components, from 1 to 6. In each case, the components discovered have varimax rotation applied to their gene loadings to aid interpretability. The result is a list of Components objects.
comp_seq <- weitrix_components_seq(airway_cal, p=6, n_restarts=1) comp_seq
We can compare the proportion of variation explained to what would be explained in a completely random weitrix. Random normally distributed values are generated with variances equal to one over the weights.
rand_weitrix <- weitrix_randomize(airway_cal) rand_comp <- weitrix_components(rand_weitrix, p=1, n_restarts=1) components_seq_screeplot(comp_seq, rand_comp)
Up to 4 components may be justified.
comp <- comp_seq[] comp$col[,-1] %>% melt(varnames=c("Run","component")) %>% left_join(as.data.frame(colData(airway)), by="Run") %>% ggplot(aes(y=cell, x=value, color=dex)) + geom_vline(xintercept=0) + geom_point(alpha=0.5, size=3) + facet_grid(~ component) + labs(title="Sample scores for each component", x="Sample score", y="Cell line", color="Treatment") comp$row[,-1] %>% melt(varnames=c("name","component")) %>% ggplot(aes(x=comp$row[name,"(Intercept)"], y=value)) + geom_point(cex=0.5, alpha=0.5) + facet_wrap(~ component) + labs(title="Gene loadings for each component vs average log2 RPM", x="average log2 RPM", y="gene loading")
If varimax rotation isn't used,
weitrix_components_seq will produce a Principal Components Analysis, with components ordered from most to least variance explained.
Without varimax rotation the treatment effect is still mostly in the first component, but has also leaked a small amount into the other components.
comp_nonvarimax <- weitrix_components(airway_cal, p=4, use_varimax=FALSE) comp_nonvarimax$col[,-1] %>% melt(varnames=c("Run","component")) %>% left_join(as.data.frame(colData(airway)), by="Run") %>% ggplot(aes(y=cell, x=value, color=dex)) + geom_vline(xintercept=0) + geom_point(alpha=0.5, size=3) + facet_grid(~ component) + labs(title="Sample scores for each component, no varimax rotation", x="Sample score", y="Cell line", color="Treatment")
colcan potentially be used as a design matrix
If you're not sure of the experimental design, for example the exact timing of a time series or how evenly a drug treatment was applied, the extracted component might actually be more accurate.
Note that this ignores uncertainty about the
col matrix itself.
This may be useful for hypothesis generation -- finding some potentially interesting genes, while discounting noisy or lowly expressed genes -- but don't use it as proof of significance.
First by the topconfects method. This will find the largest confident effect sizes, while still correcting for multiple testing.
weitrix_confects(airway_cal, comp$col, "C1")
If you prefer limma and p-values:
airway_elist <- weitrix_elist(airway_cal) fit <- lmFit(airway_elist, comp$col) %>% eBayes() fit$df.prior fit$s2.prior topTable(fit, "C1") all_top <- topTable(fit, "C1", n=Inf, sort.by="none") plotMD(fit, "C1", status=all_top$adj.P.Val <= 0.01)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.