This document shows typical Topconfects usage with limma, edgeR, or DESeq2.
The first step is to load a dataset. Here, we're looking at RNA-seq data that
investigates the response of Arabodopsis thaliana to a bacterial pathogen.
Besides the experimental and control conditions, there is also a batch effect.
This dataset is also examined in section 4.2 of the edgeR
user manual, and
I've followed the initial filtering steps in the edgeR
manual.
knitr::opts_chunk$set(fig.width=7, fig.height=6)
library(topconfects) library(NBPSeq) library(edgeR) library(limma) library(dplyr) library(ggplot2) data(arab) # Retrieve symbol for each gene info <- AnnotationDbi::select( org.At.tair.db::org.At.tair.db, rownames(arab), "SYMBOL") %>% group_by(TAIR) %>% summarize( SYMBOL=paste(unique(na.omit(SYMBOL)),collapse="/")) arab_info <- info[match(rownames(arab),info$TAIR),] %>% select(-TAIR) %>% as.data.frame rownames(arab_info) <- rownames(arab) # Extract experimental design from sample names Treat <- factor(substring(colnames(arab),1,4)) %>% relevel(ref="mock") Time <- factor(substring(colnames(arab),5,5)) y <- DGEList(arab, genes=as.data.frame(arab_info)) # Keep genes with at least 3 samples having an RPM of more than 2 keep <- rowSums(cpm(y)>2) >= 3 y <- y[keep,,keep.lib.sizes=FALSE] y <- calcNormFactors(y)
design <- model.matrix(~Time+Treat) design[,] fit <- voom(y, design) %>% lmFit(design)
Find largest fold changes that we can be confident of at FDR 0.05.
confects <- limma_confects(fit, coef="Treathrcc", fdr=0.05) confects
Here the usual logFC
values estimated by limma
are shown as dots, with lines
to the confect
value.
confects_plot(confects)
confects_plot_me
overlays the confects (red/blue) on a Mean-Difference Plot
(grey) (as might be produced by limma::plotMD
). As we should expect, the very
noisy differences with low mean expression are removed if we look at the
confects.
confects_plot_me(confects)
Let's compare this to the ranking we obtain from topTable
.
fit_eb <- eBayes(fit) top <- topTable(fit_eb, coef="Treathrcc", n=Inf) rank_rank_plot(confects$table$name, rownames(top), "limma_confects", "topTable")
You can see that the top 19 genes from topTable are all within the top 40 for topconfects ranking, but topconfects has also highly ranked some other genes. These have a large effect size, and sufficient if not overwhelming evidence of this.
An MD-plot highlighting the positions of the top 40 genes in both rankings also illustrates the differences between these two ways of ranking genes.
plotMD(fit, legend="bottomleft", status=paste0( ifelse(rownames(fit) %in% rownames(top)[1:40], "topTable ",""), ifelse(rownames(fit) %in% confects$table$name[1:40], "confects ","")))
An analysis in edgeR produces similar results. Note that only quasi-likelihood testing from edgeR is supported.
y <- estimateDisp(y, design, robust=TRUE) efit <- glmQLFit(y, design, robust=TRUE)
A step of 0.05 is used here merely so that the vignette will build quickly.
edger_confects
calls edgeR::glmTreat
repeatedly, which is necessarily slow.
In practice a smaller value such as 0.01 should be used.
econfects <- edger_confects(efit, coef="Treathrcc", fdr=0.05, step=0.05) econfects
confects_plot(econfects) confects_plot_me(econfects)
etop <- glmQLFTest(efit, coef="Treathrcc") %>% topTags(n=Inf) plotMD(efit, legend="bottomleft", status=paste0( ifelse(rownames(efit) %in% econfects$table$name[1:40], "confects ", ""), ifelse(rownames(efit) %in% rownames(etop)[1:40], "topTags ","")))
DESeq2 does its own filtering of lowly expressed genes, so we start from the original count matrix. The initial steps are as for a normal DESeq2 analysis.
library(DESeq2) dds <- DESeqDataSetFromMatrix( countData = arab, colData = data.frame(Time, Treat), rowData = arab_info, design = ~Time+Treat) dds <- DESeq(dds)
The contrast or coefficient to test is specified as in the DESeq2::results
function. The step of 0.05 is merely so that this vignette will build quickly,
in practice a smaller value such as 0.01 should be used. deseq2_confects
calls results
repeatedly, and in fairness results
has not been optimized for this.
dconfects <- deseq2_confects(dds, name="Treat_hrcc_vs_mock", step=0.05)
DESeq2 offers shrunken estimates of LFC. This is another sensible way of ranking genes. Let's compare them to the confect values.
shrunk <- lfcShrink(dds, coef="Treat_hrcc_vs_mock", type="ashr") dconfects$table$shrunk <- shrunk$log2FoldChange[dconfects$table$index] dconfects
DESeq2 filters some genes, these are placed last in the table. If your intention
is to obtain a ranking of all genes, you should disable this with
deseq2_confects(..., cooksCutoff=Inf, independentFiltering=FALSE)
.
table(dconfects$table$filtered) tail(dconfects$table)
Shrunk LFC estimates are shown in red.
confects_plot(dconfects) + geom_point(aes(x=shrunk, size=baseMean, color="lfcShrink"), alpha=0.75)
lfcShrink
aims for a best estimate of the LFC, whereas confect is a
conservative estimate. lfcShrink
can produce non-zero values for genes which
can't be said to significantly differ from zero -- it doesn't do double duty as
an indication of significance -- whereas the confect value will be NA
in this
case. The plot below compares these two quantities. Only un-filtered genes are
shown (see above).
filter(dconfects$table, !filtered) %>% ggplot(aes( x=ifelse(is.na(confect),0,confect), y=shrunk, color=!is.na(confect))) + geom_point() + geom_abline() + coord_fixed() + theme_bw() + labs(color="Significantly\nnon-zero at\nFDR 0.05", x="confect", y="lfcShrink using ashr")
rank_rank_plot(confects$table$name, econfects$table$name, "limma confects", "edgeR confects") rank_rank_plot(confects$table$name, dconfects$table$name, "limma confects", "DESeq2 confects") rank_rank_plot(econfects$table$name, dconfects$table$name, "edgeR confects", "DESeq2 confects")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.