DOCNAME = knitr::current_input() knitr::opts_chunk$set(autodep = TRUE, cache = FALSE, cache.path = paste0("cache/", DOCNAME, "/"), cache.comments = TRUE, echo = TRUE, error = FALSE, fig.align = "center", fig.path = paste0("../reports/figures/", DOCNAME, "/"), fig.width = 10, fig.height = 8, message = FALSE, warning = FALSE)
This is a bulk RNAseq analysis template using the DESEQ2 package
library(DESeq2) library("RColorBrewer") library(pheatmap) library(ggplot2) library(pasilla) # this package contains RNAseq dataset
In order to procede with the analysis we need two dataframes: - Count matrix dataframe with gene names as rows (and rownames) and samples as columns (and column names). - Sample metadata dataframe with samples as rows (and rownames) and metadata variables as columns (and column names).
Ideally, the sample metadata will contain information about the conditions, treatments, replicates, etc.
The pasilla dataset contains a count matrix and the metadata, which is a good example to use as a template
pasCts <- system.file("extdata", "pasilla_gene_counts.tsv", package="pasilla", mustWork=TRUE) pasAnno <- system.file("extdata", "pasilla_sample_annotation.csv", package="pasilla", mustWork=TRUE)
The count matrix looks like this:
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id")) head(cts)
While the sample metadata looks like this:
coldata <- read.csv(pasAnno, row.names=1) coldata <- coldata[,c("condition","type")] head(coldata)
Making the metadata factors will establish the order of the metadata variables. If you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. This is important since the first level of the metadata will be the reference sample/condition for the DE analysis.
coldata$condition <- factor(coldata$condition) paste("The order of the conditions is:", paste(unique(coldata$condition), collapse=", ")) coldata$type <- factor(coldata$type) paste("The order of the types is:", paste(unique(coldata$type), collapse=", "))
If you want to change the order of the factor levels you can use one of these two. NOTE: this has to be done before running the DESeq() function.
#coldata$condition <- factor(coldata$condition, levels = c("untreated","treated")) coldata$condition <- relevel(coldata$condition, ref = "untreated") #Specifies "untreated" as the reference level
We create the DESeq object from the count matrix and the metadata. We specify that the analysis will be based on the design column.
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ condition) dds
We can reduce the computational resorces of the analysis by removing genes that are very hardly expressed.
keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,]
You can still change the order of the factors of the object "dds" before running the DESeq() function.
Also, you can collapse technical replicates using the function collapseReplicates().
Now we can run the DEA using the DESeq() function.
dds <- DESeq(dds) res <- results(dds) head(data.frame(res))
Information about the results columns can be retrieved using the following snippet:
mcols(res)$description
In this example, we only have two conditions: "untreated" and "treated", so there is no need to specify which comparison we would like to make. If your experiment contains more than two conditions, there are different ways of specifying which comparison you would like to make.
First, the "contrast" argument will take a vector that specifies the metadata and the order of the comparison. In this case, you can swap the order of the condition levels.
res <- results(dds, contrast=c("condition","untreated","treated")) #head(data.frame(res))
Second, "name" argument will use the results of the resultsNames() function. This function creates possible combinations of conditions, but always using the reference condition for the comparison.
resultsNames(dds) res <- results(dds, name="condition_treated_vs_untreated") #head(data.frame(res))
The output of the res() function is the one you want to use/save to identify differentially expressed genes using log2 Fold Change or adjusted p-value thresholds.
There is a chance that the adjusted p-value is NA. If you will work on this results, it might be helpful to change all NA to an adjusted p-value of 1.
res$padj[is.na(res$padj)] <- 1
We can have a quick look of the adjusted p-value results by using the following custom function:
results_summary <- function(x, alpha = 0.05, LFC = 1) { ngenes <- nrow(x) signif <- sum(x$padj < alpha, na.rm = T) up <- sum(x$padj < alpha & x$log2FoldChange > LFC, na.rm = T) down <- sum(x$padj < alpha & x$log2FoldChange < -LFC, na.rm = T) results <- c(paste0("Number of genes: ", ngenes), paste0("Number of genes with adjusted p-value < ",alpha,": ", signif, " (", round((signif/ngenes)*100,digits = 2),"%)"), " Of those:", paste0(" with LFC < ", -LFC, ": ", down, " (", round((down/ngenes)*100,digits = 2),"%)"), paste0(" with LFC > ", LFC, ": ", up, " (", round((up/ngenes)*100,digits = 2),"%)")) writeLines(paste(results, collapse = "\n")) return(paste(results, collapse = "\n")) } res_summary <- results_summary(res, alpha = 0.05, LFC = 1)
Extracting results based on adjusted p-value and LFC. this way we will get a filteres table with genes with LFC > 1 or LFC < -1 with an adjusted p-value of 0.05
LFC <- 1 adj_pvalue <- 0.05 sig_res <- res[res$padj < adj_pvalue & abs(res$log2FoldChange) > LFC,] #write.table(x = sig_res, "./significant_results.tsv", quote = F, col.names = T, row.names = T, sep = "\t")
It can be useful to examine the counts of reads for a single gene across the groups. A simple function for making this plot is plotCounts, which normalizes counts by the estimated size factors (or normalization factors if these were used) and adds a pseudocount of 1/2 to allow for log scale plotting. The counts are grouped by the variables in intgroup, where more than one variable can be specified. Here we specify the gene which had the smallest p value from the results table created above. You can select the gene to plot by rowname or by numeric index.
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", returnData=TRUE) ggplot(d, aes(x=condition, y=count, color = condition)) + geom_point(position=position_jitter(w=0.1,h=0)) + scale_y_log10(breaks=c(25,100,400)) + theme_bw()
In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. One makes use of the concept of variance stabilizing transformations (VST), and the other is the regularized logarithm or rlog, which incorporates a prior on the sample differences. Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.
vsd <- vst(dds) #Variance Stabilizing Transformation, vst is faster with larger number of samples rld <- rlog(dds) #Regularized
We can do some sanity checks of the samples. We can see how they correlate to each other in a heatmap or see their Principal Components with a PCA
This plot shows how far away are each sample from each other. The darker the blue, the closer they are.
sampleDists <- dist(t(assay(vsd))) sampleDistMatrix <- as.matrix(sampleDists) rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-") colnames(sampleDistMatrix) <- NULL colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) pheatmap(sampleDistMatrix, clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists, col=colors)
PCA plot using the first two components
pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE) percentVar <- round(100 * attr(pcaData, "percentVar")) ggplot(pcaData, aes(PC1, PC2, color=condition, shape=type)) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar[1],"% variance")) + ylab(paste0("PC2: ",percentVar[2],"% variance")) + coord_fixed() + theme_bw()
We can visualize different heatmaps depending on what we are interested on. We can see the top most expressed genes, the top most variable genes, and the DE genes.
ntop <- 20 select <- order(rowSds(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:ntop] df <- as.data.frame(colData(dds)[,c("condition","type")]) pheatmap(assay(vsd)[select,], cluster_rows=T, show_rownames=FALSE, scale = "row", cluster_cols=FALSE, annotation_col=df)
ntop <- 20 select <- order(rowMeans(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:ntop] df <- as.data.frame(colData(dds)[,c("condition","type")]) pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE, cluster_cols=FALSE, annotation_col=df)
select <- rownames(sig_res) df <- as.data.frame(colData(dds)[,c("condition","type")]) pheatmap(assay(vsd)[rownames(vsd) %in% select,], cluster_rows=TRUE, show_rownames=FALSE, scale = "row", cluster_cols=FALSE, annotation_col=df)
For fold change plots, it is useful to shrunk log2 fold changes, removing the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds.
resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm") resLFC$padj[is.na(resLFC$padj)] <- 1
alpha <- 0.05 ggplot(data.frame(resLFC)) + theme_bw() + labs(title = "MA plot", subtitle = res_summary, color = paste0("Adjusted p-value < ",alpha)) + geom_point(aes(x = log10(baseMean), y = log2FoldChange, color = factor((padj < alpha), levels = c("TRUE","FALSE"))), size = 1)
alpha = 0.05 LFC = 1 ggplot(data.frame(resLFC)) + theme_bw() + labs(color ="Significant genes", title = "Volcano plot", subtitle = res_summary) + geom_hline(yintercept = -log10(alpha), linetype = "dashed") + geom_vline(xintercept = c(-LFC,LFC), linetype = "dashed") + geom_point(aes(x = log2FoldChange, y = -log10(padj), color = factor((padj < alpha & abs(log2FoldChange) > LFC), levels = c("TRUE","FALSE"))), size = 1)
devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.