knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
library(tidyverse)
library(ComplexHeatmap)
library(pbda)
library(cowplot)
library(tidyverse)
library(ComplexHeatmap)
library(pbda)

Goals:

1) Learn additional operations on matrices.

2) Demonstrate principles to effectively visualize large datasets with heatmaps.

3) Use clustering algorithms to identify patterns in the data.

4) Exploring datasets with PCA

Dataset description

For some of the exercises today we will be using a single cell RNA-seq dataset that is part of the pbda package. The esc_mat matrix contains read counts for each gene in each cell. The read counts are an abundance measurement for mRNAs in each cell.

This dataset contains r ncol(esc_mat) cells from mouse embryos at various stages of development (Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells). This dataset, as well as many others are available in a preprocessed form from the Hemberg lab ( https://hemberg-lab.github.io/scRNA.seq.datasets/). Note that the number of cells has been reduced for complexity reasons.

Apply functions per row or per column of a matrix

A common task with matrices is to compute a row-wise or column-wise summary. Tidyverse verbs don't naively support operations on matrices. base R provides some useful summary functions, as well as the matrixStats package.

colSums(esc_mat) %>% head()
rowSums(esc_mat) %>% head()

summary(esc_mat[, 1:3])

There are a few base R plotting functions that are very useful for exploratory analysis with matrices. For example the hist() function can be used to quickly generate a histogram from a matrix. Various clustering functions also provide a plot() method that can produce useful summaries. Often I use base R plots for interactive work, then use ggplot (or heatmaps) to make focused publication quality figures.

hist(esc_mat)

Apply family of functions

How would we apply an arbitrary function to operate row-wise or column-wise?

In base R there is a function called apply.

?apply

We can use apply to calculate rowSums and colSums.

apply(esc_mat, 1, sum) %>% head()
apply(esc_mat, 2, sum) %>% head()

We can also define custom functions and use them in apply. (See also lapply, vapply, sapply, etc. for variants for operating on different data structures)

#predefine function
my_sum <- function(x){
  sum(x) + 42
}

ex <- apply(esc_mat, 1, my_sum)

ex[1:5]

# or use an anonymous function 
ex2 <- apply(esc_mat, 1, function(a){
  sum(a) + 42
})

ex2[1:5]

Filtering a matrix to exclude noisy low-abundance genes

Next we are going to remove genes (rows) with low values from the esc_mat matrix and save this smaller matrix as filtered_mat.

to_keep <- (rowSums(esc_mat > 0) > 10) 

filtered_mat <- esc_mat[to_keep, ]
nrow(filtered_mat)
# or alternatively

or using apply

to_keep <- apply(esc_mat, 1, function(x){
  sum(x > 0) > 10
})

filtered_mat <- esc_mat[to_keep, ]

nrow(filtered_mat)

Normalizing data

Exercise

Normalize each the expression values for each cell. Divide each column by the total number of counts in each column, multiply by 10000, and log transform (use the log1p function). Save the new matrix as norm_mat.

# normalize a matrix
norm_mat <- apply(esc_mat, 2, function(x) x / sum(x)) 
norm_mat <- 10000 * norm_mat
norm_mat <- log(norm_mat + 1) # or log1p(norm_mat)


# another approach 
lognorm <- function(vec){
  norm_values <- 10000 * (vec / sum(vec))
  log(norm_values + 1)
}

norm_mat <- apply(esc_mat, 2, lognorm)

Converting matrices to data.frames and tibbles

# to data.frame
df <- as.data.frame(esc_mat)

df <- df[1:3, 1:5]

# to tibble
as_tibble(df, rownames = "gene")

# or equivalently
rownames_to_column(df, "gene") %>% 
  as_tibble()

# as one step 
tbl_data <- esc_mat %>%
  as.data.frame() %>% 
  as_tibble(rownames = "gene")

tbl_data[1:5, 1:5]

Converting data.frames to matrices is also easily accomplished using as.matrix()

tbl_mat <- tbl_data %>% 
  as.data.frame()

rownames(tbl_mat) <- tbl_mat$gene
tbl_mat <- tbl_mat[, -1]

tbl_mat <- as.matrix(tbl_mat)

tbl_mat[1:3, 1:3]
class(tbl_mat)

# in one step
tbl_mat <- tbl_data %>% 
  as.data.frame() %>% 
  column_to_rownames("gene") %>% 
  as.matrix()

Converting vectors to tibbles is also very useful.

# make a vector from colnames
col_ids <- colnames(tbl_mat)

#build tibble directly
tibble(sample_names = col_ids)

#additional columns can be added as long as all of the vectors are the same length
tibble(sample_names = col_ids,
       id = 1:72)

generating data.frames

data.frame(sample_names = col_ids,
           id = 1:72,
           stringsAsFactors = FALSE) # very important

Exercise

Tidy the norm_mat matrix into long format and extract out the timepoint information from the column name (e.g. zy, early2cell, etc.). Plot the expression values for each time point (geom_violin or geom_boxplot).

tidy_mat <- norm_mat %>% 
  as.data.frame() %>% 
  as_tibble(rownames = "gene") %>% 
  pivot_longer(cols = -gene, names_to = "sample")  %>%
  separate(sample, c("timepoint", "rep"), sep = "_")

ggplot(tidy_mat, aes(timepoint, value)) + 
   geom_violin() 

Generating Heatmaps using ComplexHeatmap

How do we visualize this dataset to examine relationships between samples? Visualizing the entire matrix will be computationally expensive and likely completely unintepretable as it will obscure interesting patterns. Ideally we'd like to reduce the dataset to less than a few thousand genes for visualization.

How do we select which features to plot in a heatmap?

1) Use statistics to find features that are significant based on some hypothesis (i.e. run DESeq2 for RNA-seq).

2) Pick features of interest related to the hypothesis.

3) Select features with high variance. Low variance features are probably not very interesting

4) Down sample the matrix to a smaller size

gene_variance <- apply(norm_mat, 1, var) %>%
  sort(decreasing = TRUE)

# top 30 most variable genes
vargenes <- names(gene_variance)[1:30]

head(vargenes)
var_mat <- norm_mat[vargenes, ]

Heatmap(var_mat, 
        name = "log scale",
        cluster_rows = FALSE,
        cluster_columns = FALSE)

Use clustering techniques to naturally order the data.

var_mat <- norm_mat[vargenes, ]

Heatmap(var_mat, 
        name = "log scale",
        cluster_rows = TRUE,
        cluster_columns = FALSE)

Often visualizing gene expression abundance can obscure some patterns, as lower expression values are not as strongly represented. A common data transformation is mean-centering and scaling each gene. This is also known as a z-score.

(x - mean(x)) / sd(x)

By generating z-scores, the gene expression data now represents standard deviations away from the mean. A built in function in R called scale can generate z-scores. This is also easy to do using an apply function.

?scale

The scale function performs scaling per column in a matrix. However, it is common in genomics for matrices to be represented with samples (i.e. cells) as columns and genes as rows. This is done as commonly the number of genes (i.e. features) is much greater than the number of samples, and so it is conviently (and computationally easier) to represent data in this format.

However, is this a tidy format?

Are the variables stored as rows or columns?

var_mat[1:3, 1:3]

t(var_mat) %>% .[1:3, 1:3]


zmat <- t(scale(t(var_mat), 
                center = TRUE, 
                scale = TRUE)) 

# confirm that columns are the same after scaling
all(colnames(zmat) == colnames(var_mat))
Heatmap(zmat, 
        name = "z-score",
        cluster_rows = TRUE,
        cluster_columns = FALSE)

Multiple heatmaps can be generated and plotted together if they share genes. For example, if we wanted to compare the zscore values to the log expression values.

# save each heatmap as an object
h1 <- Heatmap(zmat, 
        name = "z-score",
        cluster_rows = TRUE,
        cluster_columns = FALSE)

h2 <- Heatmap(var_mat,
        name = "log",
        cluster_rows = TRUE,
        cluster_columns = FALSE)

# use the + operator to combine
h1 + h2

Heatmaps can be saved in a variety of formats using built in r functions, pdf, png, svg, and jpeg. ggplots plots can be saved with ggsave or the cowplot wrapper save_plot.

pdf("figure1.pdf") # name of plot
  h1 + h2 # code that generates the plot
dev.off() # saves the plot

png("figure1.png") 
  h1 + h2 
dev.off() 

svg("figure1.svg") 
  h1 + h2 
dev.off() 

tiff("figure1.tiff") 
  h1 + h2 
dev.off() 
p <- ggplot(mtcars, aes(cyl, mpg)) +
  geom_point()

save_plot("mtcars.pdf", p)

Exercises

Create a heatmap containing the top 200 variables genes, using log normalized data. Consult the help menu (?Heatmap) to generate a heatmap and save it to a file.

Your heatmap should be made in the following manner:

1) cluster by row 2) hide the rownames 3) add a title to the legend 4) add a title to the columns

What appears to be happening between the early2cell and mid2cell stages?

gene_variance <- apply(norm_mat, 1, var) %>%
  sort(decreasing = T)

vargenes <- names(gene_variance)[1:200]

var_mat <- norm_mat[vargenes, ]

Heatmap(var_mat, 
        name = "log scale",
        cluster_rows = TRUE,
        show_row_names = FALSE,
        cluster_columns = FALSE,
        column_title = "Samples", 
        row_title = "Genes")

Clustering

Reviews:

By default many Heatmap packages perform clustering to aid in identifying patterns in the data.

We can see the euclidean distances between samples using dist.

dist(t(var_mat[1:5, 1:5]))

The hierarchical clustering is performed by default in ComplexHeatmap using hclust with the output of dist as input. hclust provides a plotting method using plot() to visualized the results as a dendogram.

hc <- hclust(dist(t(var_mat)))
plot(hc)

We can change the distance metric used for the Heatmap by changing the clustering_distance_rows argument. A correlation measure, for example Spearman, is not sensitive to abundance, and can recover the shared patterns in the data.

cor(var_mat[1:5, 1:5], method = "spearman")
# use pretty colors
vcols <- viridis::viridis(256)

Heatmap(var_mat, 
        col = vcols,
        clustering_distance_rows = "spearman",
        cluster_rows = TRUE, 
        cluster_columns = FALSE,
        show_row_names = FALSE)

Heatmaps are also very useful for quickly examining the similarities between samples

cor_mat <- cor(var_mat, method = "pearson")

Heatmap(cor_mat,
        col = vcols)

Extracting out clusters

How do we select out individual clusters from hierachical clustering? How do we define clusters?

h1 <- Heatmap(var_mat, 
              col = vcols,
              name = "log scale",
        cluster_rows = TRUE,
        show_row_names = FALSE,
        cluster_columns = FALSE,
        row_title = "Genes")
h1

First let's generate the dendogram:

hc <- hclust(dist(var_mat))
plot(hc)

Next we can use cutree() to cut the tree into groups

?cutree
# split into 5 groups
kclusters <- cutree(hc, k = 5)

plot(hc, labels = kclusters)

# split based on height of tree
hclusters <- cutree(hc, h = 20)

plot(hc, labels = hclusters)

hclusters[1:5]

Now that we have assigned clusters we can replot our heatmaps split out by each cluster, using the split argument.

h1 <- Heatmap(var_mat,
              col = vcols, 
              name = "log scale",
              cluster_rows = TRUE,
              show_row_names = FALSE,
              cluster_columns = FALSE,
              split = kclusters, 
              row_title = "Genes")
h1

K-means

K-means clustering is another simple, yet powerful clustering techinuqe. It also scales to much larger datasets than hierarchical clustering.

Kmeans is easily implemented using kmeans()

clusters <- kmeans(var_mat, centers = 5)
names(clusters)

Cluster assignments can be extracted using $cluster

clusters$cluster %>% head()

To recover a reproducible kmeans classification set a seed for the random number generator

set.seed(42)
clusters <- kmeans(var_mat, centers = 5)
clusters$cluster %>% head()

Exercise:

1) Make a Heatmap split by each kmeans cluster

2) Use your tidying and ggplot skills to plot the gene expression values per cluster, by timepoint.

km <- kmeans(var_mat, centers = 5)
km_clusters <- km$cluster

Heatmap(
  var_mat, 
  split = km_clusters,
  show_row_names = FALSE,
  cluster_columns = FALSE
)
cluster_df <- tibble(
  gene = names(km_clusters),
  cluster = km_clusters
)

tidy_mat <- var_mat %>% 
  as.data.frame() %>% 
  as_tibble(rownames = "gene")

plt_dat <- left_join(tidy_mat, 
          cluster_df,
          by = "gene")

plt_dat %>% 
  pivot_longer(cols = zy_1:lateblast_4) %>% 
  mutate(name = str_remove(name, "_[0-9]")) %>% 
  ggplot(aes(name, value)) +
    geom_violin(aes(fill = name)) +
    facet_grid(~cluster) +
  theme_cowplot() +
  theme(axis.text = element_blank())


IDPT7810/practical-data-analysis documentation built on July 8, 2022, 9:32 p.m.