knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center" ) library(DESeq2) library(ComplexHeatmap) library(RColorBrewer) library(tidyverse) library(cowplot) library(pbda)
Matrices can contain characters or logical values, but we're really only interested in numeric matrices:
# Build 4x3 matrix with row-wise series of integers M <- matrix(1:12, nrow = 4, byrow = TRUE) M
To return the value in the first row, second column:
M[1,2]
With named rows and columns:
# Assign column names colnames(M) <- c("col1", "col2", "col3") # Assign row names row.names(M) <- c("row1", "row2", "row3", "row4") M["row1","col2"]
Index ranges and character vectors of row/column names also work:
M[1:2,2:3] M[c("row1","row2"),c("col2","col3")]
As does negative selection:
# Exclude rows 1 and 4 M[-c(1,4),]
Full rows and columns can be extracted as well:
# print 1st row M[1,] # print 2nd column M[,2] # print 2nd and 3rd rows M[2:3,]
For example, taking the base 2 logarithm of each value:
M_log <- log2(M) M_log
Or normalizing all columns relative to an arbitrary reference column:
M_norm <- M/M[,1] M_norm
mat
.mat_sub
, with columns 2-10 and rows 40-60 from the mat
.mat_sub
.rowMeans()
function, subtract the row-wise mean from each value in mat_sub
.Step | Tools | Description --------------------------- | -------------------- | -------------------------------------------- Align reads to genome | STAR, HISAT2 | Align reads to genomic sequence, retaining strand information if present Assign reads to features | Subread, StringTie | Assign aligned reads to genes and/or transcripts Calculate differential expression | DESeq2, edgeR, Ballgown | Fit count matrices to model to determine differential expression
Step | Tools | Description --------------------------- | -------------------- | -------------------------------------------- Estimate transcript abundance | Salmon / Kallisto | Estimate transcript abundance directly on transcriptome sequences Calculate differential expression | DESeq2, edgeR, Ballgown | Fit count matrices to model to determine differential expression
# View first few rows of count matrix head(drug_resistant_counts) # View phenotype data drug_resistant_coldata
SummarizedExperiment objects provide a structured, self-contained way to store phenotype, gene and assay data, along with analysis intermediates and parameters.
knitr::include_graphics("img/class-4-se.png")
# Create DESeqDataSet from drug_resistant data dds <- DESeqDataSetFromMatrix(countData = drug_resistant_counts, colData = drug_resistant_coldata, design = ~drug) # View DESeqDataSet dds # View design dds@design # View colData colData(dds) # View counts assay(dds) %>% head()
DESeq2 performs internal filtering to remove genes with low expression, but explicitly filtering extreme cases beforehand can speed up processing.
# number of genes before filtering nrow(dds) # remove genes with 0-1 total reads dds <- dds[rowSums(counts(dds)) > 1,] # number of genes after filtering nrow(dds)
Many analysis methods for multidimensional data rely on variance being consistent across mean values. Raw count data has the greatest variance at high mean values (largest raw values), while log2-transformed data has the greatest variance at low mean values (largest fold-changes). To get around this, DESeq2 provides two different functions - vst()
and rlog()
- that stabilize variance across the mean.
# rlog transform rld <- rlog(dds, blind = FALSE)
Sample distance plots are used to assess overall similarity between samples: which samples are similar to each other, which are different? We calculate the euclidean distance between samples using dist()
on the rlog-transformed data to ensure roughly equal contributions from all genes.
# Transpose count matrix and calculate distances using dist() sampleDists <- assay(rld) %>% t() %>% dist() # Convert distance dataframe to matrix sampleDistMatrix <- as.matrix(sampleDists) # Choose continuous palette from RColorBrewer colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255) # Plot sample distance heatmap with ComplexHeatmap Heatmap(sampleDistMatrix, cluster_rows = F, cluster_columns = F, col = colors)
Another method for visualizing relationships in multidimensional data is principle component analysis. PCA identifies clusters of variables (in this case, genes) that explain the majority of the variance present. PCA plots are x-y plots of the first two principle components; they spread the samples out in the two directions that explain most of the differences.
``` {r pca, fig.height=5, fig.width=6}
plotPCA(rld, intgroup = "drug") + scale_color_brewer(palette = "Set1")
## Calculating differential expression DESeq2 provides a wrapper function `DESeq()` that takes a DESeqDataSet and calculates differential expression. Progress through analysis steps are printed as the function runs. ``` {r fig.height=5, fig.width=6} # Run DESeq2 dds <- DESeq(dds)
DESeq2 results are extracted with the results()
function.
``` {r fig.height=5, fig.width=6}
res <- results(dds)
summary(res)
res_tibble <- as_tibble(res, rownames = "gene")
## MA-plots MA plots - plots of mean expression (x-axis) vs. log2-fold change (y-axis) - are useful for visualizing the distribution of gene expression in your comparisons. ``` {r fig.height=5, fig.width=6} # Plot MA without moderating log2 fold-changes plotMA(res, ylim = c(-6,6)) # Return results with moderated log2 fold-changes. res_ma <- lfcShrink(dds, contrast=c("drug","none","A"), res=res, type='normal') # Plot MA plotMA(res_ma, ylim = c(-6,6))
DESeq2 can plot normalized counts using the plotCounts()
function.
``` {r fig.height=5, fig.width=6}
plotCounts(dds, gene = "ESR1", intgroup=c("drug"))
dds_plot <- plotCounts(dds, gene = "ESR1", intgroup=c("drug"), returnData = TRUE)
ggplot(dds_plot, aes(x = drug, y = count, color = drug)) + scale_y_log10() + geom_point(cex = 3)
## Plotting heatmaps of differentially expressed genes The ComplexHeatmap package can be used to plot differential gene expression across samples/replicates. ``` {r fig.height=5, fig.width=6} # extract the 20 most significantly differentially expressed genes top_genes <- res_tibble %>% arrange(padj) %>% head(20) # subset rlog matrix based on gene list mat <- assay(rld)[ top_genes$gene, ] # normalize expression to per-gene average mat <- mat - rowMeans(mat) # plot heatmap Heatmap(mat)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.