knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Clustermap is a collection of R functions for hierarchical clustering, determination of the number of clusters, and heatmap visualisation. The package produces high-quality annotated heatmaps, makes the user determine the visual design, and has a user-friendly interface. Even users with little R experience will be able to produce pretty figures at a fast pace, while more experienced users will appreciate the modular design that allows the package to be extended with new functionality.
Clustermap is designed with the philosophy that technical parameters and quantities calculated from the data should be hidden from sight whenever possible, while still being available to the advanced user. All technical parameters have default values that works well for most applications, but they can be changed easily when desired. Full access to such parameters and the modular design allows advanced users to integrate their own code with Clustermap.
Clustermap offers a variety of solutions and options through a simple, intuitive interface. Colors in heatmaps and annotation color bars are easily changed, requiring no prior knowledge of color handling in R. Heatmaps may be clustered row-wise, column-wise or both, and the user determines which of the cluster dendrograms should be plotted and where to plot them relative to the heatmap. The number of clusters may be found automatically with one of the two provided algorithms Gap or PART, or may be set manually by the user. Diagnostic plots such as silhouette plots and Gap plots are fully integrated in the package and can be produced in an instant to guide the manual choice of the number of clusters.
This manual will present some general features of the package clustermap
and provide a more in-depth description of the functions implemented in the package. Finally, a number of examples of increasing sophistication and complexity will guide the user through many of the features of the package. These examples are designed to contain only the essential code and can be used as program patterns to be copied and pasted into one's own code.
To install clustermap from GitHub, you need the package devtools. Once you have installed devtools, you can install clustermap using the following code.
library(devtools) install_github("cbsteen/clustermap")
Clustermap accepts any numerical matrix as input for the heatmap, and any vector of numerical values or character values for annotation bars. Missing values (NA's) are accepted in both cases and will be shown in a user-defined color (default is grey for both heatmap and annotation bars).
The main output from Clustermap is a plot depicting a clustered heatmap with or without annotation. Additional output may include diagnostic plots (e.g. to assess then number of clusters) as well as tables providing information about the cluster identify of individual rows and columns. By default, the output is shown in a graphics window on the screen; however, the output is easily re-directed to file by encapsulating the code with appropriate statements as shown below. Try help(Devices)
in R to read more about graphics device options; note that the options available may depend on your computing platform and your version of R.
pdf("Figure.pdf") <code to plot a clustered heatmap> dev.off()
The main plot produced by Clustermap has the following general structure:
Core region: this is where the heatmap is shown
Inner margin: surrounds the heatmap on all sides and this is where the annotation bars are shown, providing additional information about rows or columns
Outer margin: surrounds the inner margin on all sides and this is where cluster trees, row/column labels and the legend for the annotation bars are shown
The size of the inner and outer margins is automatically set by Clustermap during the initialization, where the user declares what type of information the plot will contain. For example, if the user wants to show a heatmap with no additional information (not even cluster trees), then inner and outer margins will be set to (essentially) zero. If the user wants to show a heatmap with cluster trees on the left hand side of the heatmap, then the left outer margin will be set to 0.3 (meaning 30% of the width of the core region). Likewise, if the user wants to show a heatmap with annotation bars above the heatmap, then the top inner margin will be set to 0.1 (meaning 10% of the height of the core region). Margin sizes defined by Clustermap are not always appropriate for the problem at hand. For example, the user may want to show many annotation bars above the heatmap, in which case the default top inner margin could be too narrow. During the initialization of the plot, the user may add or subtract space from inner and outer margins using the arguments inner
and outer
. For example, the argument inner=c(0, 0, 0.3, 0)
would extend the top inner margin by 30% of the height of the core region. To see what the current inner and outer margins are set to, you can look at the system variable .CLUSTERMAP$margin.
Note that system variables are not defined before you have performed plot initialization.
Clustermap plots are constructed by supplying a sequence of commands, each performing a particular computational task or adding a particular feature to the draw. This approach offers an alternative to "all-in-one" solutions performing most of the computations and the graphical rendering by a single call to a function with a huge number of arguments. The Clustermap approach allows the user to control (and if desired save results from) every step of the process, and individual elements can be left out or included as desired. In order to keep the code simple and intuitive, most of the parameters and variables that need to be transferred between function calls are stored in a global environment variable .CLUSTERMAP and are hidden to the user.
A certain ordering of the function calls must be respected in order to ensure that all required parameters are defined prior to a given function call. For example, a call to draw.tree() to plot a cluster dendrogram can only be done after a call to hcluster() to compute the clustering. There are multiple advantages of separating the clustering computation and the plotting of the cluster, some of which will be illustrated in the examples later.
As a general guideline, the functions in clustermap should be called in the following order:
1. draw.init # Always required 2. hcluster # If clustering is required 3. subclust # If subcluster identification is required 4. draw.hmap # If heatmap is required 5. draw.tree # If annotation is required draw.text set.color draw.cbar 6. draw.cbar.key # If a legend for the color bars is required 7. draw.hmap.key # If a color map for the heatmap is required 8. draw.silhouette # Diagnostic plots draw.mean.silhouette get.silhouette draw.gap
The function draw.init
erases any memory of past calls made to any of the functions in the package, so you get a fresh start. The function hcluster
is used to cluster either columns or rows and you specify what type of clustering you would like (linkage, distance, etc). The function subclust
may or may not be required in your analysis; what it does is to determine how many distinct clusters are actually present in the preceding clustering. And then comes a series of plot functions to show the heatmap, any color bars you may want to show on the left/right or bottom/top of the heatmap, the cluster trees, a color key (draw.hmap.key
) that explains what values the heatmap colors represent, etc.
The packages graphics
and gclus
are utilized by clustermap
. If these are not already installed on your computer, give the following command to R:
install.packages("graphics") install.packages("gclus")
or use the Package Installer inside R, found under the menu item ‘Packages & Data’.
To change the colors used in the heatmap, provide your preferred color scale as an argument to draw.hmap
. For example, the call
draw.hmap(X, colorscale="green-white-red")
plots a heatmap for the input data matrix X, using a color scale ranging from green (for negative values) via white (for value 0) to red (for positive values). The highest green intensity is achieved for values in X that are equal to all lower than -xmax, while the highest red intensity is achieved for values in X that are equal to or higher than xmax, where xmax is a positive constant. The default value of xmax is max(abs(X)). To overrule the default value, use the argument xmax in draw.hmap as in this example:
draw.hmap(X, colorscale="yellow-white-blue", xmax=1)
For the color specification, you may provide any three color names supported by R; here are a few examples:
blue-white-red
green-black-red
lightgrey-grey-darkgrey
magenta-cyan-pink
To obtain a textual overview of all colors available, give the command colors() or to see a visual overview over colors available, see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
Clustermap: an extensible R package for clustering and generating annotated heatmaps. Ole Christian Lingjærde*, Chloé Beate Steen*, Miriam Ragle Aure, Vilde Drageset Haakensen (submitted)
* Corresponding authors, contributed equally to this work.
clustermap
functions`draw.init Call this first to set up correct margins etc
draw.hmap
Plot heatmap using your preferred colors
draw.cbar
Add color bar to one of the sides (to add extra info)
draw.tree
Add column or row dendrogram to a heatmap
hcluster
Perform hierarchical clustering of rows or columns
subclust
Guided or automatic identification of subclusters
get.subclust
Get subtype labels (for original row/col ordering of X)
set.color
Set color coding for accompanying data
draw.hmap.key
Plot color key for heatmap
draw.cbar.key
Plot color key for color bar (annotation bars)
draw.text
Add text to one of the sides (for row/column labels)
draw.silhouette
Silhouette plot
draw.mean.silhouette
Mean silhouette score plot
draw.gap
Gap score plot
blocks
Generate an example data set with 110 rows and 40 columns
draw.init
is used to initiate a draw.
The purpose is to make room on the four sides of the heatmap for additional elements such as cluster dendrograms, color bars with additional info, and labels. Supply as arguments the sides where you want such elements, with the coding
side=1 below the heatmap side=2 left side of the heatmap side=3 above the heatmap side=4 right side of the heatmap
If you want a tree (=dendrogram) on the left and top, let tree=c(2,3)
.
If you want text labels on the right, let text=4
.
If you want color bars below the plot, let cbar=1
.
Examples:
draw.init() # Dendrogram on the left and on the top draw.init(tree=c()) # Make no room for additional elements draw.init(tree=3, text=4) # Dendrogram on the top, labels on the right
The last two arguments of draw.init
(inner
and outer
) do not overrule the other parameters, but allows the user to add (or even subtract) a specific amount of space in the inner and outer margins on a specified side of the heatmap. They may be used alone, or in combination with other parameters. They should always be vectors of length 4 giving the desired margin extension for each of the four sides. An example of use would be to extend an inner margin by a certain amount because we plan to show many bar plots of accompanying data there that we want to give some space. Note that inner=c(0.4,0.4,0.4,0.4) means that each inner margin should be extended by a length corresponding to 0.4 times the total length (in that direction) of the heatmap itself.
Use the function subclust
to identify subclusters in a dendrogram. The effect is that later calls to draw.tree
will show the subclusters in different colors. You can either provide the desired number of clusters using the argument k
, or you can set k=NA
to automatically estimate the number of cluster with one of two algorithms: gap (Tibshirani et al (2001), J Roy Statist Soc B, 63: 411-423) or PART (Nilsen et al (2012), Stat Appl Genet Mol Biol, 12: 637-652).
Using gap (Tibshirani et al., 2001), the result is an estimate of the optimal 'flat cut' of the dendrogram, while using PART (Nilsen et al., 2012), the result is an estimate of the optimal 'nonflat cut' of the dendrogram. In essence the PART algorithm is an extension of gap that applies gap recursively on subclusters previously identified. B
and min.size
are used by PART (only if k=NA) and determines the number of permutations (B=25
is low, but computation time increases with increasing B
; consider B=200
or higher for high-quality estimates) and the least number of elements in a single cluster (min.size=5
is quite arbitrary and you may want to change this). Note that PART may call some individuals as outliers and these will be shown in a separate color but may not be close to each other and will stick out as "clusters" of size potentially smaller than min.size
.
subclust
retrieves a vector of cluster labels after identification of subclusters. By default, labels are returned in the same order as shown in the cluster tree (order="tree"
). To obtain labels in the original ordering in the input data X, use order="orig"
.
By default, the function returns a vector of cluster labels. If you supply row/column labels using the argument labels
the function returns a data frame with two columns, the first being the cluster labels and the second being the supplied row/column labels.
draw.hmap()
plota heatmaps of a numerical data matrix X. It is recommended that you print to a file (e.g. a pdf-file) instead to increase speed (and use fast=F). Use the argument colorscale to control how numerical values are represented as colors in the heatmap. The argument should be three (valid R) colors separated by hyphens.
Some examples: "green-black-red", "yellow-black-blue", "black-white-black" (to ignore signs), and "white-grey-darkgrey".
draw.text()
is used to add labels to the sides of the heatmap. The first argument should be a vector of the same length as the number of columns in the heatmap (when side=1 or side=3), and a vector of the same length as the number of rows in the heatmap (when side=2 or side=4). The size of the text can be changed with the cex
argument. Text labels can be truncated to a certain maximal length (for example 4 characters) with the maxchar
argument.
clustermap
The following code sets up a simple example data set that can be used to run the examples below. It creates a hypothetical gene expression matrix consisting of 110 genes as rows, and 40 patient samples as columns. The y1
vector contains information about cancer subtype of each sample. The y2
vector contains information about stage, and finally the y3
vector about immune score.
library(clustermap) # Make package functions available tmp = blocks() # Contains the 110x40-matrix X and 40-vectors y1,y2,y3 X = tmp$X y1 = tmp$y1 y2 = tmp$y2 y3 = tmp$y3
The following code creates a simple heatmap with no clustering, using the default color-scale ("blue-white-red") and adds the key for the color scale.
draw.init() draw.hmap(X) draw.hmap.key()
The following example shows how to cluster columns in a heatmap using the hcluster()
function, and how to add a tree (dendogram) resulting from the column-wise clustering on the top side of the heatmap (side=3
).
draw.init(tree=3) hcluster(X, clust="col", distance="pearson", linkage="complete") draw.hmap(X) draw.tree(side=3) draw.hmap.key() frame()
This example is the same as the previous one, except that this time the clustering is performed row-wise, and the tree is added on the left side of the heatmap (side=2
).
draw.init(tree=2) hcluster(X, clust="row", distance="euclidean", linkage="complete") draw.hmap(X) draw.tree(side=2) draw.hmap.key()
Similar example as above, where the clustering is performed both row-wise and column-wise, and trees are added on the left side and on the top of the heatmap.
draw.init(tree=c(2,3)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) draw.hmap.key()
The following example adds a color bar at the bottom of heatmap, with a discrete color scales made up of 5 colors, using the y1
vector. The color bar is labeled "Subtype".
draw.init(tree=c(2,3), cbar=1, ckey=T, legend=T, text=4) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) draw.cbar(z1, side=1) draw.cbar.key() draw.hmap.key()
The legend is large compared to the heatmap, so we can make it smaller by increasing the distance between the upper part of the heatmap and the upper part of the key, by setting the vsize
argument to 0.3
.
draw.init(tree=c(2,3), cbar=1, ckey=T, legend=T, text=4) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) draw.cbar(z1, side=1) draw.cbar.key(vusep=0.5) draw.hmap.key()
The following example shows how to add multiple color bars below a heatmap, to provide information about cancer subtype (z1
), stage (z2
) and immune score (z3
). This is done by providing multiple set.color
objects, here called z1
, z2
and z3
to the draw.cbar
function.
draw.init(tree=c(2,3), cbar=1, legend=T, text=4) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) z2 = set.color(y2, label="Stage", type="discrete", color=rainbow(8)) z3 = set.color(y3, label="Immune score", type="continuous", color="white-blue") draw.cbar(z1, z2, z3, border="black", side=1) draw.cbar.key() draw.hmap.key()
The following code adds the rownames of the dataset X
as row labels to the heatmap, using the function draw.text
. The size the text is set using the argument cex
, and the placement of the text using the argument side
, where here 4
refers to the right side of the heatmap.
xsub = X[sample(40),] draw.init(tree=c(2,3), cbar=3, legend=T, text=4) hcluster(xsub, clust="row", distance="euclidean", linkage="complete") hcluster(xsub, clust="col", distance="euclidean", linkage="complete") draw.hmap(xsub) draw.tree(side=2) draw.tree(side=3) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) z2 = set.color(y2, label="Stage", type="discrete", color=palette()) z3 = set.color(y3, label="Immune score", type="continuous", color="black-yellow") draw.cbar(z1, z2, z3, border="black", side=3) draw.text(rownames(xsub), cex=0.7, side=4) draw.cbar.key(hsep=0.15) draw.hmap.key()
Similarly to the example above, labels can be added to columns using the draw.text()
function. The size the text is set using the argument cex
, and the placement of the text using the argument side
, where here 3
refers to the top of the heatmap.
pdf("example_1.pdf") draw.init(tree=c(2,3), cbar=3, text=c(1,4)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) z2 = set.color(y2, label="Stage", type="discrete", color=palette()) z3 = set.color(y3, label="Immune score", type="continuous", color="black-yellow") draw.cbar(z1, z2, z3, border="black", side=3) draw.text(colnames(X), cex=0.7, side=1) draw.hmap.key() dev.off()
The following example illustrates how to cut a dendrogram into k=3
clusters and show in draw. The function subclust()
is used to identify 3 clusters column-wise (clust="col"
). To highlight the new tree with the colored clusters, the lwd
argument in draw.tree
is set to 2
to make the branches thicker.
draw.init(tree=c(2,3)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") subclust(3, clust="col") draw.hmap(X) draw.tree(side=2) draw.tree(side=3, lwd=2) draw.hmap.key()
The example above used a set number of clusters (k=3
). But the number of clusters can also be automatically estimated using for example the GAP algorithm, which is default in the subclust()
function. The other option is to use the PART algorithm, not shown here. The estimated clusters can then be visualized. Both row-wise and column-wise subclustering is shown.
draw.init(tree=c(2,3)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") subclust(NA, clust="col") subclust(NA, clust="row") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) draw.hmap.key()
The previous example estimated the number of subclusters with the default algorithm gap. Here an example is shown where the number of subclusters is determined with the PART algorithm, set by the method
argument of the subclust function
. Both row-wise and column-wise subclustering is shown.
draw.init(tree=c(2,3)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") subclust(NA, clust="col", method="part") subclust(NA, clust="row", method="part") draw.hmap(X) draw.tree(side=2) draw.tree(side=3) draw.hmap.key()
After estimating subclusters, the user can test the association between additional info parameters and subclusters. This is done using the draw.cbar()
function, which adds additional information to heatmap, and by specifying the pvalue.method
, here "fisher"
, a p-value showing the association between the subclusters and the color bar added is displayed on the heatmap.
draw.init(tree=c(2,3), cbar=3, legend= T, text=4) hcluster(X, clust="col", distance="euclidean", linkage="complete") hcluster(X, clust="row", distance="euclidean", linkage="complete") subclust(NA, clust="col", method="part") subclust(NA, clust="row", method="part") draw.hmap(X) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) z2 = set.color(y2, label="Stage", type="discrete", color=palette()) z3 = set.color(y3, label="Immune score", type="continuous", color="black-yellow") draw.cbar(z1, z2, z3, pvalue=T, pvalue.method="fisher", border="black", side=3) draw.tree(side=2) draw.tree(side=3) draw.cbar.key() draw.hmap.key()
Silhouette plots are used to assess the quality of the subclustering performed on the data. A silhouette plot displays how close each point in a cluster is to points in the closest neighboring clusters. Using the same clustering as in Example 12, we show the silhouette plot of the column-wise clustering. Three clusters seem reasonable in this case.
draw.init(cbar=3, text=4, legend=T) hcluster(X, clust="col", distance="euclidean", linkage="complete") hcluster(X, clust="row", distance="euclidean", linkage="complete") subclust(NA, clust="col", method="part") subclust(NA, clust="row", method="part") draw.silhouette()
The Gap statistics is a measure of clustering quality. Using the draw.gap()
function, the user can plot the Gap statistics and identify the optimal number of clusters. The optimal number of clusters is highlighted in red.
pdf("Example14.pdf") draw.init(cbar=3, text=4, legend=T) hcluster(X, clust="col", distance="euclidean", linkage="complete") hcluster(X, clust="row", distance="euclidean", linkage="complete") subclust(NA, clust="col", method="part") subclust(NA, clust="row", method="part") draw.gap() dev.off()
It may be useful to apply the clustering estimated using one dataset, to another dataset of the same size. For instance, one may wish to follow patient samples over time, using the original subgrouping identified in the first samples. This is done by plotting a heatmap using another dataset with the draw.hmap()
function, right after having performed the clustering on the original dataset with hcluster()
.
draw.init(tree=c(2,3)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") Xnew = tanh(X) # Xnew can be any numerical matrix of same size as X draw.hmap(Xnew) draw.tree(side=2) draw.tree(side=3) draw.hmap.key()
Multiple heatmaps can be displayed on the same page using the frame()
function.
xmax = max(2*X, na.rm=T) par(mfrow=c(2,2)) panel.init(mfrow=c(2,2)) draw.init(tree=c(2,3), ckey=F, outer=c(0.1, 0.1, 0.1, 0.1)) hcluster(X, clust="row", distance="euclidean", linkage="complete") hcluster(X, clust="col", distance="euclidean", linkage="complete") draw.hmap(X, xmax=xmax) draw.tree(side=2) draw.tree(side=3) draw.hmap.key() frame() draw.init(tree=c(2,3), ckey=F, outer=c(0.1, 0.1, 0.1, 0.1)) hcluster(X*2, clust="row", distance="pearson", linkage="complete") hcluster(X*2, clust="col", distance="pearson", linkage="complete") draw.hmap(X*2, xmax=xmax) draw.tree(side=2) draw.tree(side=3) draw.hmap.key() frame() draw.init(tree=c(2,3), ckey=F, outer=c(0.1, 0.1, 0.1, 0.1)) hcluster(X/2, clust="row", distance="euclidean", linkage="average") hcluster(X/2, clust="col", distance="euclidean", linkage="average") draw.hmap(X/2, xmax=xmax) draw.tree(side=2) draw.tree(side=3) draw.hmap.key() frame()
Clustering is not restricted to gene expression data. Here, a more complex matrix from an image is clustered based on (R,G,B) values.
library(magick) pict = bellagio() panel.init(mfrow=c(1,3)) draw.init(ckey=F) draw.hmap(pict$X) frame() frame() draw.init(ckey=F) draw.hmap(pict$Xrc) frame() frame() draw.init(ckey=F) Xcol = col2num(pict$Xrc, clust="col") # In each column replace color by (R,G,B) Xrow = col2num(pict$Xrc, clust="row") # In each row replace color by (R,G,B) hcluster(Xcol, clust="col", linkage="average") # Cluster columns using (R,G,B) values hcluster(Xrow, clust="row", linkage="average") # Cluster rows using (R,G,B) values draw.hmap(pict$Xrc, origin=1) frame() frame()
xsub = tanh(X[sample(40),]) draw.init(tree=c(2,3), cbar=3, legend=T, text=4) hcluster(xsub, clust="row", distance="euclidean", linkage="complete") hcluster(xsub, clust="col", distance="euclidean", linkage="complete") draw.hmap(xsub) draw.tree(side=2) draw.tree(side=3) z1 = set.color(y1, label="Subtype", type="discrete", color=c("red", "pink", "darkblue", "lightblue", "green")) z2 = set.color(y2, label="Stage", type="discrete", color=palette()) z3 = set.color(y3, label="Immune score", type="continuous", color="black-yellow") draw.cbar(z1, z2, z3, pvalue=T, pvalue.method="fisher", border="black", side=3) draw.text(rownames(xsub), cex=0.7, side=4) draw.cbar.key(hsep=0.15) draw.hmap.key()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.