knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center" ) library(tidyverse) library(ComplexHeatmap) library(pbda) library(cowplot)
library(tidyverse) library(ComplexHeatmap) library(pbda)
Goals:
1) Learn additional operations on matrices.
2) Demonstrate principles to effectively visualize large datasets with heatmaps.
3) Use clustering algorithms to identify patterns in the data.
4) Exploring datasets with PCA
For some of the exercises today we will be using a single cell RNA-seq dataset that
is part of the pbda
package. The esc_mat
matrix contains read counts
for each gene in each cell. The read counts are an abundance measurement for mRNAs in each cell.
This dataset contains r ncol(esc_mat)
cells from mouse embryos
at various stages of development (Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells). This dataset, as well as many
others are available in a preprocessed form from the Hemberg lab (
https://hemberg-lab.github.io/scRNA.seq.datasets/). Note that the number of cells has been reduced for complexity reasons.
A common task with matrices is to compute a row-wise or column-wise summary. Tidyverse verbs don't naively support operations on matrices. base R provides some useful summary functions, as well as the matrixStats
package.
colSums(esc_mat) %>% head() rowSums(esc_mat) %>% head() summary(esc_mat[, 1:3])
There are a few base R plotting functions that are very useful for exploratory analysis with matrices. For example the hist()
function can be used to quickly generate a histogram from a matrix. Various clustering functions also provide a plot()
method that can produce useful summaries. Often I use base R plots for interactive work, then use ggplot (or heatmaps) to make focused publication quality figures.
hist(esc_mat)
How would we apply an arbitrary function to operate row-wise or column-wise?
In base R there is a function called apply
.
?apply
We can use apply to calculate rowSums and colSums.
apply(esc_mat, 1, sum) %>% head() apply(esc_mat, 2, sum) %>% head()
We can also define custom functions and use them in apply. (See also lapply
, vapply
, sapply
, etc. for variants for operating on different data structures)
#predefine function my_sum <- function(x){ sum(x) + 42 } ex <- apply(esc_mat, 1, my_sum) ex[1:5] # or use an anonymous function ex2 <- apply(esc_mat, 1, function(a){ sum(a) + 42 }) ex2[1:5]
Next we are going to remove genes (rows) with low values from the esc_mat
matrix and save this smaller matrix as filtered_mat
.
to_keep <- (rowSums(esc_mat > 0) > 10) filtered_mat <- esc_mat[to_keep, ] nrow(filtered_mat) # or alternatively
or using apply
to_keep <- apply(esc_mat, 1, function(x){ sum(x > 0) > 10 }) filtered_mat <- esc_mat[to_keep, ] nrow(filtered_mat)
Normalize each the expression values for each cell. Divide each column by the total number of counts in each column, multiply by 10000, and log transform (use the log1p
function). Save the new matrix as norm_mat
.
# normalize a matrix norm_mat <- apply(esc_mat, 2, function(x) x / sum(x)) norm_mat <- 10000 * norm_mat norm_mat <- log(norm_mat + 1) # or log1p(norm_mat) # another approach lognorm <- function(vec){ norm_values <- 10000 * (vec / sum(vec)) log(norm_values + 1) } norm_mat <- apply(esc_mat, 2, lognorm)
# to data.frame df <- as.data.frame(esc_mat) df <- df[1:3, 1:5] # to tibble as_tibble(df, rownames = "gene") # or equivalently rownames_to_column(df, "gene") %>% as_tibble() # as one step tbl_data <- esc_mat %>% as.data.frame() %>% as_tibble(rownames = "gene") tbl_data[1:5, 1:5]
Converting data.frames to matrices is also easily accomplished using as.matrix()
tbl_mat <- tbl_data %>% as.data.frame() rownames(tbl_mat) <- tbl_mat$gene tbl_mat <- tbl_mat[, -1] tbl_mat <- as.matrix(tbl_mat) tbl_mat[1:3, 1:3] class(tbl_mat) # in one step tbl_mat <- tbl_data %>% as.data.frame() %>% column_to_rownames("gene") %>% as.matrix()
Converting vectors to tibbles is also very useful.
# make a vector from colnames col_ids <- colnames(tbl_mat) #build tibble directly tibble(sample_names = col_ids) #additional columns can be added as long as all of the vectors are the same length tibble(sample_names = col_ids, id = 1:72)
generating data.frames
data.frame(sample_names = col_ids, id = 1:72, stringsAsFactors = FALSE) # very important
Tidy the norm_mat
matrix into long format and extract out the timepoint information from the column name (e.g. zy
, early2cell
, etc.). Plot the expression values for each time point (geom_violin
or geom_boxplot
).
tidy_mat <- norm_mat %>% as.data.frame() %>% as_tibble(rownames = "gene") %>% pivot_longer(cols = -gene, names_to = "sample") %>% separate(sample, c("timepoint", "rep"), sep = "_") ggplot(tidy_mat, aes(timepoint, value)) + geom_violin()
How do we visualize this dataset to examine relationships between samples? Visualizing the entire matrix will be computationally expensive and likely completely unintepretable as it will obscure interesting patterns. Ideally we'd like to reduce the dataset to less than a few thousand genes for visualization.
How do we select which features to plot in a heatmap?
1) Use statistics to find features that are significant based on some hypothesis (i.e. run DESeq2
for RNA-seq).
2) Pick features of interest related to the hypothesis.
3) Select features with high variance. Low variance features are probably not very interesting
4) Down sample the matrix to a smaller size
gene_variance <- apply(norm_mat, 1, var) %>% sort(decreasing = TRUE) # top 30 most variable genes vargenes <- names(gene_variance)[1:30] head(vargenes)
var_mat <- norm_mat[vargenes, ] Heatmap(var_mat, name = "log scale", cluster_rows = FALSE, cluster_columns = FALSE)
Use clustering techniques to naturally order the data.
var_mat <- norm_mat[vargenes, ] Heatmap(var_mat, name = "log scale", cluster_rows = TRUE, cluster_columns = FALSE)
Often visualizing gene expression abundance can obscure some patterns, as lower expression values are not as strongly represented. A common data transformation is mean-centering and scaling each gene. This is also known as a z-score.
(x - mean(x)) / sd(x)
By generating z-scores, the gene expression data now represents standard deviations away from the mean. A built in function in R called scale
can generate z-scores. This is also easy to do using an apply function.
?scale
The scale function performs scaling per column in a matrix. However, it is common in genomics for matrices to be represented with samples (i.e. cells) as columns and genes as rows. This is done as commonly the number of genes (i.e. features) is much greater than the number of samples, and so it is conviently (and computationally easier) to represent data in this format.
However, is this a tidy format?
Are the variables stored as rows or columns?
var_mat[1:3, 1:3] t(var_mat) %>% .[1:3, 1:3] zmat <- t(scale(t(var_mat), center = TRUE, scale = TRUE)) # confirm that columns are the same after scaling all(colnames(zmat) == colnames(var_mat))
Heatmap(zmat, name = "z-score", cluster_rows = TRUE, cluster_columns = FALSE)
Multiple heatmaps can be generated and plotted together if they share genes. For example, if we wanted to compare the zscore values to the log expression values.
# save each heatmap as an object h1 <- Heatmap(zmat, name = "z-score", cluster_rows = TRUE, cluster_columns = FALSE) h2 <- Heatmap(var_mat, name = "log", cluster_rows = TRUE, cluster_columns = FALSE) # use the + operator to combine h1 + h2
Heatmaps can be saved in a variety of formats using built in r functions, pdf
, png
, svg
, and jpeg
. ggplots
plots can be saved with ggsave
or the cowplot wrapper save_plot
.
pdf("figure1.pdf") # name of plot h1 + h2 # code that generates the plot dev.off() # saves the plot png("figure1.png") h1 + h2 dev.off() svg("figure1.svg") h1 + h2 dev.off() tiff("figure1.tiff") h1 + h2 dev.off()
p <- ggplot(mtcars, aes(cyl, mpg)) + geom_point() save_plot("mtcars.pdf", p)
Create a heatmap containing the top 200 variables genes, using log normalized data. Consult the help menu (?Heatmap) to generate a heatmap and save it to a file.
Your heatmap should be made in the following manner:
1) cluster by row 2) hide the rownames 3) add a title to the legend 4) add a title to the columns
What appears to be happening between the early2cell
and mid2cell
stages?
gene_variance <- apply(norm_mat, 1, var) %>% sort(decreasing = T) vargenes <- names(gene_variance)[1:200] var_mat <- norm_mat[vargenes, ] Heatmap(var_mat, name = "log scale", cluster_rows = TRUE, show_row_names = FALSE, cluster_columns = FALSE, column_title = "Samples", row_title = "Genes")
Reviews:
By default many Heatmap packages perform clustering to aid in identifying patterns in the data.
We can see the euclidean distances between samples using dist
.
dist(t(var_mat[1:5, 1:5]))
The hierarchical clustering is performed by default in ComplexHeatmap using hclust
with the output of dist
as input. hclust
provides a plotting method using plot()
to visualized the results as a dendogram.
hc <- hclust(dist(t(var_mat))) plot(hc)
We can change the distance metric used for the Heatmap by changing the clustering_distance_rows
argument. A correlation measure, for example Spearman
, is not sensitive to abundance, and can recover the shared patterns in the data.
cor(var_mat[1:5, 1:5], method = "spearman")
# use pretty colors vcols <- viridis::viridis(256) Heatmap(var_mat, col = vcols, clustering_distance_rows = "spearman", cluster_rows = TRUE, cluster_columns = FALSE, show_row_names = FALSE)
Heatmaps are also very useful for quickly examining the similarities between samples
cor_mat <- cor(var_mat, method = "pearson") Heatmap(cor_mat, col = vcols)
How do we select out individual clusters from hierachical clustering? How do we define clusters?
h1 <- Heatmap(var_mat, col = vcols, name = "log scale", cluster_rows = TRUE, show_row_names = FALSE, cluster_columns = FALSE, row_title = "Genes") h1
First let's generate the dendogram:
hc <- hclust(dist(var_mat)) plot(hc)
Next we can use cutree()
to cut the tree into groups
?cutree
# split into 5 groups kclusters <- cutree(hc, k = 5) plot(hc, labels = kclusters) # split based on height of tree hclusters <- cutree(hc, h = 20) plot(hc, labels = hclusters) hclusters[1:5]
Now that we have assigned clusters we can replot our heatmaps split out by each cluster, using the split
argument.
h1 <- Heatmap(var_mat, col = vcols, name = "log scale", cluster_rows = TRUE, show_row_names = FALSE, cluster_columns = FALSE, split = kclusters, row_title = "Genes") h1
K-means clustering is another simple, yet powerful clustering techinuqe. It also scales to much larger datasets than hierarchical clustering.
Kmeans is easily implemented using kmeans()
clusters <- kmeans(var_mat, centers = 5) names(clusters)
Cluster assignments can be extracted using $cluster
clusters$cluster %>% head()
To recover a reproducible kmeans classification set a seed for the random number generator
set.seed(42) clusters <- kmeans(var_mat, centers = 5) clusters$cluster %>% head()
1) Make a Heatmap split by each kmeans cluster
2) Use your tidying and ggplot skills to plot the gene expression values per cluster, by timepoint.
km <- kmeans(var_mat, centers = 5) km_clusters <- km$cluster Heatmap( var_mat, split = km_clusters, show_row_names = FALSE, cluster_columns = FALSE )
cluster_df <- tibble( gene = names(km_clusters), cluster = km_clusters ) tidy_mat <- var_mat %>% as.data.frame() %>% as_tibble(rownames = "gene") plt_dat <- left_join(tidy_mat, cluster_df, by = "gene") plt_dat %>% pivot_longer(cols = zy_1:lateblast_4) %>% mutate(name = str_remove(name, "_[0-9]")) %>% ggplot(aes(name, value)) + geom_violin(aes(fill = name)) + facet_grid(~cluster) + theme_cowplot() + theme(axis.text = element_blank())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.