library(heatmaply)
library(knitr)
knitr::opts_chunk$set(
   # cache = TRUE,
   dpi = 60,
  comment = "#>",
  tidy = FALSE)

# http://stackoverflow.com/questions/24091735/why-pandoc-does-not-retrieve-the-image-file
# < ! -- rmarkdown v1 -->

Author: Tal Galili ( Tal.Galili@gmail.com )

Introduction

A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers are encoded as a grid of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing values patterns, and more.

Interactive heatmaps allow the inspection of specific value by hovering the mouse over a cell, as well as zooming into a region of the heatmap by dragging a rectangle around the relevant area.

This work is based on the ggplot2 and plotly.js engine. It produces similar heatmaps as d3heatmap, with the advantage of speed (plotly.js is able to handle larger size matrix), and the ability to zoom from the dendrogram.

Installation

To install the stable version on CRAN:

install.packages('heatmaply')

To install the GitHub version:

# You'll need devtools
install.packages.2 <- function (pkg) if (!require(pkg)) install.packages(pkg);
install.packages.2('devtools')
# make sure you have Rtools installed first! if not, then run:
#install.packages('installr'); install.Rtools()

devtools::install_github("ropensci/plotly") 
devtools::install_github('talgalili/heatmaply')

And then you may load the package using:

library("heatmaply")

Basic usage

Default

The default settings in heatmaply attempt to be both useful yet not too computationally intensive. Here is an example based on the mtcars dataset:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

library(heatmaply)
heatmaply(mtcars)

The margins parameter

By default heatmaply tries to fix the margins of the plot based on the length of the labels, but this system is not perfect. Hence, we sometimes need to manually fix the margins (hopefully this will be fixed in future versions of plotly). This is also helpful when including xlab/ylab/main texts.

heatmaply(mtcars, xlab = "Features", ylab = "Cars", 
        main = "An example of title and xlab/ylab",
        margins = c(60,100,40,20))

We can use the margins parameter with correlation heatmaps. Notice the use of limits to set the range of the colors, and how we color the branches (see dendextend::color_branches for further detail on k_row and k_col):

heatmaply(cor(mtcars), margins = c(40, 40),
          k_col = 2, k_row = 2,
          limits = c(-1,1))
# Better to use:
# heatmaply_cor(cor(mtcars), margins = c(40, 40),
#           k_col = 2, k_row = 2)

Data transformation (scaling, normalize, and percentize)

The variables in mtcars includes values that reflect different types of measurement, each with its own range (and meaning) of values. In such a case, it is best to transform the data so to have all the variables have comparable values.

scale

If we would assume all variables come from some normal distribution, then scaling (i.e.: subtract the mean and divide by the standard deviation) would bring them all close to the standard normal distribution. In such a case, each value would reflect the distance from the mean in units of standard deviation. The "scale" parameter in heatmaply supports column and row scaling, and can be used as follows:

heatmaply(mtcars, xlab = "Features", ylab = "Cars", 
          scale = "column",
        main = "Data transformation using 'scale'",
        margins = c(60,100,40,20))

normalize

When variables in the data comes from possibly different (and non-normal) distributions, other transformations may be in order. For example, scaling on a binary variable with many 0's and just a few 1's will lead to a column with a very extreme value that would cause the color legend to be very skewed by that variable, leading to a masking of the distribution of the rest of the variables.

Another possibility is to use the normalize function to brings data to the 0 to 1 scale by subtracting the minimum and dividing by the maximum of all observations. This preserves the shape of each variable's distribution while making them easily comparable on the same "scale". Using the function on mtcars easily reveals columns with only two (am, vs) or three (gear, cyl) variables compared with variables that have a higher resolution of possible values:

heatmaply(normalize(mtcars), xlab = "Features", ylab = "Cars", 
        main = "Data transformation using 'normalize'",
        margins = c(60,100,40,20))

percentize

An alternative to normalize is the percentize function. This is similar to ranking the variables, but instead of keeping the rank values, divide them by the maximal rank. This is done by using the ecdf of the variables on their own values, bringing each value to its empirical percentile. The benefit of the percentize function is that each value has a relatively clear interpretation, it is the percent of observations that got that value or below it.

heatmaply(percentize(mtcars), xlab = "Features", ylab = "Cars", 
        main = "Data transformation using 'percentize'",
        margins = c(60,100,40,20))

Notice that for binary variables (0 and 1), percentize will turn all 0 values to their proportion and all 1 values will remain 1. This means the transformation is not symmatric for 0 and 1. Hence, if scaling for clustering, it might be better to use rank for dealing with tie values (if no ties are present, then percentize will perform similarly to rank).

is.na10 (missing values)

Reviewing missing values can easily be done using the is.na10 function. When using it with heatmaply, it is often helpful to use grid_gap = 1.

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1, 
          showticklabels = c(T,F),
            k_col =3, k_row = 3,
            margins = c(55, 30), 
            colors = c("grey80", "grey20"))
# Better to use:
# heatmaply_na(airquality, k_col =3, k_row = 3,
            # margins = c(55, 30))


# warning - using grid_color cannot handle a large matrix!
# airquality[1:10,] %>% is.na10 %>% 
#   heatmaply(color = c("white","black"), grid_color = "grey",
#             k_col =3, k_row = 3,
#             margins = c(40, 50)) 
# airquality %>% is.na10 %>% 
#   heatmaply(color = c("grey80", "grey20"), # grid_color = "grey",
#             k_col =3, k_row = 3,
#             margins = c(40, 50)) 
# 

Changing color palettes

We can use colors other than the default viridis. For example, we may want to use other color palettes in order to get divergent colors for the correlations (these will, sadly, be less useful for colorblind people):

# divergent_viridis_magma <- c(rev(viridis(100, begin = 0.3)), magma(100, begin = 0.3))
# rwb <- colorRampPalette(colors = c("darkred", "white", "darkgreen"))
# library(RColorBrewer)
# # display.brewer.pal(11, "BrBG")
# BrBG <- colorRampPalette(brewer.pal(11, "BrBG"))
# Spectral <- colorRampPalette(brewer.pal(11, "Spectral"))

heatmaply(cor(mtcars), margins = c(40, 40),
          k_col = 2, k_row = 2,
          colors = BrBG,
          limits = c(-1,1))
# Better to use:
# heatmaply_cor(cor(mtcars), margins = c(40, 40),
#           k_col = 2, k_row = 2)

Another example for using colors:

heatmaply(percentize(mtcars), margins = c(40, 130),
          colors = heat.colors(100))

Or even more customized colors using scale_fill_gradient_fun:

heatmaply(mtcars, margins = c(40, 130), 
          scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(low = "blue", high = "red", midpoint = 200, limits = c(0, 500)))

Customized dendrograms and side annotation

Various seriation options

heatmaply uses the seriation package to find an optimal ordering of rows and columns. Optimal means to optimize the Hamiltonian path length that is restricted by the dendrogram structure. This, in other words, means to rotate the branches so that the sum of distances between each adjacent leaf (label) will be minimized. This is related to a restricted version of the travelling salesman problem.

The default options is "OLO" (Optimal leaf ordering) which optimizes the above criterion (in O(n^4)). Another option is "GW" (Gruvaeus and Wainer) which aims for the same goal but uses a potentially faster heuristic. The option "mean" gives the output we would get by default from heatmap functions in other packages such as gplots::heatmap.2. The option "none" gives us the dendrograms without any rotation that is based on the data matrix.

# The default of heatmaply:
heatmaply(percentize(mtcars)[1:10,], margins = c(40, 130),
          seriate = "OLO")
# Similar to OLO but less optimal (since it is a heuristic)
heatmaply(percentize(mtcars)[1:10,], margin = c(40, 130),
          seriate = "GW")
# the default by gplots::heatmaply.2
heatmaply(percentize(mtcars)[1:10,], margins = c(40, 130),
          seriate = "mean")
# the default output from hclust
heatmaply(percentize(mtcars)[1:10,],  margins = c(40, 130),
          seriate = "none")

This works heavily relies on the seriation package (their vignette is well worth the read), and also lightly on the dendextend package (see vignette )

Customized dendrograms using dendextend

A user can supply their own dendrograms for the rows/columns of the heatmaply using the Rowv and the Colv parameters:

x  <- as.matrix(datasets::mtcars)

# now let's spice up the dendrograms a bit:
library(dendextend)

row_dend  <- x %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 3) %>% set("branches_lwd", c(1,3)) %>%
   ladderize
#    rotate_DendSer(ser_weight = dist(x))
col_dend  <- x %>% t %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 2) %>% set("branches_lwd", c(1,2)) %>%
   ladderize
#    rotate_DendSer(ser_weight = dist(t(x)))

heatmaply(percentize(x), Rowv = row_dend, Colv = col_dend)

Replicating the dendrogram ordering of heatmap.2

The following example shows how to get the same result in heatmaply as with heatmap.2:

x  <- as.matrix(datasets::mtcars)
gplots::heatmap.2(x, trace = "none", col = viridis(100), key = FALSE)

With heatmaply, the only difference is the side of the row dendrogram. This is because the ggplotly function from plotly does not (yet) handle axes placed in different locations than the default.

library(heatmaply)
heatmaply(x, seriate = "mean")

We can get a more similar version by using the following:

library(heatmaply)
heatmaply(x, seriate = "mean", 
          row_dend_left = TRUE, plot_method = "plotly",
          margins = c(40,NA,NA,NA))

Some options may not always work nicely when Using plot_method = "plotly", but for your purpose it might work well enough. Also, using this option would mean the heatmap will be faster and could handle larger matrix sizes.

Adding annotation based on additional factors using RowSideColors

With heatmap.2

# Example for using RowSideColors

x  <- as.matrix(datasets::mtcars)
rc <- colorspace::rainbow_hcl(nrow(x))

library(gplots)
library(viridis)
heatmap.2(x, trace = "none", col = viridis(100),
          RowSideColors=rc, key = FALSE)

With heatmaply

heatmaply(x, seriate = "mean",
          RowSideColors=rc)

# heatmaply(x, seriate = "mean",
#           RowSideColors=factor(rc))

A more sophisticated heatmap (the hover at the top row doesn't work due to an issue with plotly. We hope this would get resolved in the future):

heatmaply(x[,-c(8,9)], seriate = "mean",
          col_side_colors = c(rep(0,5), rep(1,4)),
          row_side_colors = x[,8:9])

Sidenotes

We considred using the fastcluster package for clustering, but the time gain was too small compared to the rest of the bottlenecks in the package.

# library(microbenchmark)
# 
# 
# library(heatmaply)
# x <- matrix(1:1000, 500, 2)
# 
# microbenchmark(
#   heatmaply(x, hclustfun = stats::hclust),
#   heatmaply(x, hclustfun = fastcluster::hclust),
#   times = 10
# )
# 
# x <- matrix(1:1000, 1000, 2)
# microbenchmark(
#   stats::hclust(dist(x)),
#   fastcluster::hclust(dist(x)),
#   times = 10
# )

Saving your heatmaply into a file

You can save an interactive version of your heatmaply into an HTML file using the following code:

dir.create("folder")
library(heatmaply)
heatmaply(mtcars, file = "folder/heatmaply_plot.html")
browseURL("folder/heatmaply_plot.html")

Similar code can be used for saving a static file (png/jpeg/pdf)

dir.create("folder")
library(heatmaply)
# Before the first time using this code you may need to first run:
# webshot::install_phantomjs()
heatmaply(mtcars, file = "folder/heatmaply_plot.png")
browseURL("folder/heatmaply_plot.png")

If you only wish to save the file, without plotting it in the console, you can assign the output to a temporatay object name:

# This saves the file, but does not plot it in the RStudio viewer
tmp <- heatmaply(mtcars, file = "folder/heatmaply_plot.png")
rm(tmp)

Credit

This package is thanks to the amazing work done by MANY people in the open source community. Beyond the many people working on the pipeline of R, thanks should go to the plotly team, and especially to Carson Sievert and others working on the R package of plotly. Also, many of the design elements were inspired by the work done on heatmap, heatmap.2 and d3heatmap, so special thanks goes to the R core team, Gregory R. Warnes, and Joe Cheng from RStudio. The dendrogram side of the package is based on the work in dendextend, in which special thanks should go to Andrie de Vries for his original work on the ggdendro package was the first to bring dendrograms to ggplot2 (this later evolved into the richer ggdend objects, as implemented in dendextend). Thanks should also go to Alan O'Callaghan for his many contributions to getting the package to work better with plotly, as well as for Jonathan Sidi for his work on the shinyHeatmply package. Lastely, my thanks goes to Yoav Benjamini for his support and helpful comments on this work.

Contact

You are welcome to:

Latest news

You can see the most recent changes to the package in the NEWS.md file

Session info

sessionInfo()


ggmota/heatmaply documentation built on May 17, 2019, 7:06 p.m.