In brickmanlab/commonR: Helper scripts used in Brickman Lab

DOCNAME = knitr::current_input()
knitr::opts_chunk$set(autodep        = TRUE,
                      cache          = FALSE,
                      cache.path     = paste0("cache/", DOCNAME, "/"),
                      cache.comments = TRUE,
                      echo           = TRUE,
                      error          = FALSE,
                      fig.align      = "center",
                      fig.path       = paste0("../reports/figures/", DOCNAME, "/"),
                      fig.width      = 10,
                      fig.height     = 8,
                      message        = FALSE,
                      warning        = FALSE)

Introduction

This is a bulk RNAseq analysis template using the DESEQ2 package

Load libraries

library(DESeq2)
library("RColorBrewer")
library(pheatmap)
library(ggplot2)
library(pasilla) # this package contains RNAseq dataset

Load data

In order to procede with the analysis we need two dataframes: - Count matrix dataframe with gene names as rows (and rownames) and samples as columns (and column names). - Sample metadata dataframe with samples as rows (and rownames) and metadata variables as columns (and column names).

Ideally, the sample metadata will contain information about the conditions, treatments, replicates, etc.

The pasilla dataset contains a count matrix and the metadata, which is a good example to use as a template

pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)

The count matrix looks like this:

cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
head(cts)

While the sample metadata looks like this:

coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
head(coldata)

Making the metadata factors will establish the order of the metadata variables. If you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. This is important since the first level of the metadata will be the reference sample/condition for the DE analysis.

coldata$condition <- factor(coldata$condition)
paste("The order of the conditions is:", paste(unique(coldata$condition), collapse=", "))
coldata$type <- factor(coldata$type)
paste("The order of the types is:", paste(unique(coldata$type), collapse=", "))

Metadata factor levels

If you want to change the order of the factor levels you can use one of these two. NOTE: this has to be done before running the DESeq() function.

#coldata$condition <- factor(coldata$condition, levels = c("untreated","treated"))
coldata$condition <- relevel(coldata$condition, ref = "untreated") #Specifies "untreated" as the reference level

Create DESeq object

We create the DESeq object from the count matrix and the metadata. We specify that the analysis will be based on the design column.

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds

Pre-filtering

We can reduce the computational resorces of the analysis by removing genes that are very hardly expressed.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

You can still change the order of the factors of the object "dds" before running the DESeq() function.

Also, you can collapse technical replicates using the function collapseReplicates().

Differential expression analysis

Now we can run the DEA using the DESeq() function.

dds <- DESeq(dds)
res <- results(dds)
head(data.frame(res))

Information about the results columns can be retrieved using the following snippet:

mcols(res)$description

In this example, we only have two conditions: "untreated" and "treated", so there is no need to specify which comparison we would like to make. If your experiment contains more than two conditions, there are different ways of specifying which comparison you would like to make.

First, the "contrast" argument will take a vector that specifies the metadata and the order of the comparison. In this case, you can swap the order of the condition levels.

res <- results(dds, contrast=c("condition","untreated","treated"))
#head(data.frame(res))

Second, "name" argument will use the results of the resultsNames() function. This function creates possible combinations of conditions, but always using the reference condition for the comparison.

resultsNames(dds)
res <- results(dds, name="condition_treated_vs_untreated")
#head(data.frame(res))

Extracting results

The output of the res() function is the one you want to use/save to identify differentially expressed genes using log2 Fold Change or adjusted p-value thresholds.

There is a chance that the adjusted p-value is NA. If you will work on this results, it might be helpful to change all NA to an adjusted p-value of 1.

res$padj[is.na(res$padj)] <- 1

We can have a quick look of the adjusted p-value results by using the following custom function:

results_summary <- function(x, alpha = 0.05, LFC = 1) {
  ngenes <- nrow(x)
  signif <- sum(x$padj < alpha, na.rm = T)
  up <- sum(x$padj < alpha & x$log2FoldChange > LFC, na.rm = T)
  down <- sum(x$padj < alpha & x$log2FoldChange < -LFC, na.rm = T)
  results <- c(paste0("Number of genes: ", ngenes),
    paste0("Number of genes with adjusted p-value < ",alpha,": ", signif, " (", round((signif/ngenes)*100,digits = 2),"%)"),
    "  Of those:",
    paste0("    with LFC < ", -LFC, ": ", down, " (", round((down/ngenes)*100,digits = 2),"%)"),
    paste0("    with LFC > ", LFC, ": ", up, " (", round((up/ngenes)*100,digits = 2),"%)"))
  writeLines(paste(results, collapse = "\n"))
  return(paste(results, collapse = "\n"))
}
res_summary <- results_summary(res, alpha = 0.05, LFC = 1)

Save significant results

Extracting results based on adjusted p-value and LFC. this way we will get a filteres table with genes with LFC > 1 or LFC < -1 with an adjusted p-value of 0.05

LFC <- 1
adj_pvalue <- 0.05

sig_res <- res[res$padj < adj_pvalue & abs(res$log2FoldChange) > LFC,]
#write.table(x = sig_res, "./significant_results.tsv", quote = F, col.names = T, row.names = T, sep = "\t")

Visualizations

Plot counts

It can be useful to examine the counts of reads for a single gene across the groups. A simple function for making this plot is plotCounts, which normalizes counts by the estimated size factors (or normalization factors if these were used) and adds a pseudocount of 1/2 to allow for log scale plotting. The counts are grouped by the variables in intgroup, where more than one variable can be specified. Here we specify the gene which had the smallest p value from the results table created above. You can select the gene to plot by rowname or by numeric index.

d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", 
                returnData=TRUE)
ggplot(d, aes(x=condition, y=count, color = condition)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400)) + theme_bw()

Data transformations

In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. One makes use of the concept of variance stabilizing transformations (VST), and the other is the regularized logarithm or rlog, which incorporates a prior on the sample differences. Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.

vsd <- vst(dds) #Variance Stabilizing Transformation, vst is faster with larger number of samples
rld <- rlog(dds) #Regularized

Samples comparisons

We can do some sanity checks of the samples. We can see how they correlate to each other in a heatmap or see their Principal Components with a PCA

Heatmap of the sample-to-sample distances

This plot shows how far away are each sample from each other. The darker the blue, the closer they are.

sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

Principal component plot of the samples

PCA plot using the first two components

pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=type)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed() + theme_bw()

Count matrix heatmaps

We can visualize different heatmaps depending on what we are interested on. We can see the top most expressed genes, the top most variable genes, and the DE genes.

Top most variable genes

ntop <- 20
select <- order(rowSds(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:ntop]
df <- as.data.frame(colData(dds)[,c("condition","type")])
pheatmap(assay(vsd)[select,], cluster_rows=T, show_rownames=FALSE, scale = "row",
         cluster_cols=FALSE, annotation_col=df)

Top expressed genes

ntop <- 20
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:ntop]
df <- as.data.frame(colData(dds)[,c("condition","type")])
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)

Differentially expressed genes

select <- rownames(sig_res)
df <- as.data.frame(colData(dds)[,c("condition","type")])
pheatmap(assay(vsd)[rownames(vsd) %in% select,], cluster_rows=TRUE, show_rownames=FALSE, scale = "row",
         cluster_cols=FALSE, annotation_col=df)

Fold change plots

For fold change plots, it is useful to shrunk log2 fold changes, removing the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds.

resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm")
resLFC$padj[is.na(resLFC$padj)] <- 1

MA plot

alpha <- 0.05

ggplot(data.frame(resLFC)) + theme_bw() + 
  labs(title = "MA plot", subtitle = res_summary,
       color = paste0("Adjusted p-value < ",alpha)) +
  geom_point(aes(x = log10(baseMean), y = log2FoldChange, 
                 color = factor((padj < alpha), levels = c("TRUE","FALSE"))), size = 1)

Volcano plot

alpha = 0.05
LFC = 1
ggplot(data.frame(resLFC)) + theme_bw() + 
  labs(color ="Significant genes", title = "Volcano plot", subtitle = res_summary) + 
  geom_hline(yintercept = -log10(alpha), linetype = "dashed") + 
  geom_vline(xintercept = c(-LFC,LFC), linetype = "dashed") + 
  geom_point(aes(x = log2FoldChange, y = -log10(padj),
                 color = factor((padj < alpha & abs(log2FoldChange) > LFC), levels = c("TRUE","FALSE"))), size = 1)

Session info

devtools::session_info()

brickmanlab/commonR documentation built on Dec. 19, 2021, 11:44 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

brickmanlab/commonR
Helper scripts used in Brickman Lab

In brickmanlab/commonR: Helper scripts used in Brickman Lab

Introduction

Load libraries

Load data

Metadata factor levels

Create DESeq object

Pre-filtering

Differential expression analysis

Extracting results

Save significant results

Visualizations

Plot counts

Data transformations

Samples comparisons

Heatmap of the sample-to-sample distances

Principal component plot of the samples

Count matrix heatmaps

Top most variable genes

Top expressed genes

Differentially expressed genes

Fold change plots

MA plot

Volcano plot

Session info

R Package Documentation

Browse R Packages

We want your feedback!

brickmanlab/commonR Helper scripts used in Brickman Lab

In brickmanlab/commonR: Helper scripts used in Brickman Lab

Introduction

Load libraries

Load data

Metadata factor levels

Create DESeq object

Pre-filtering

Differential expression analysis

Extracting results

Save significant results

Visualizations

Plot counts

Data transformations

Samples comparisons

Heatmap of the sample-to-sample distances

Principal component plot of the samples

Count matrix heatmaps

Top most variable genes

Top expressed genes

Differentially expressed genes

Fold change plots

MA plot

Volcano plot

Session info

R Package Documentation

Browse R Packages

We want your feedback!

brickmanlab/commonR
Helper scripts used in Brickman Lab