BlockR: Statistical Interpretation of Two-way Clustered Heatmaps

knitr::opts_chunk$set(fig.width=8, fig.height=6)
options(width=96)
## Prepend numbered "Figure N:" labels to captions only in HTML output.
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
.tag <- function(N, cap) ifelse(.format == "html",
                                paste("Figure", N, ":", cap),
                                cap)

Introduction

For more than 20 years, since the earliest days of microarray studies, two-way clustered heatmaps have been an important tool for visualization of omics data (Weinstein97; Eisen98). Surprisingly, in all that time, the interpretation of such heatmaps has relied almost exclusively on informal "eyeball" tests looking for rectangular regions containing features and samples with similar levels of expression. A study in 2017, which focused on assessing alternative visual representations of the data, conducted surveys and interviews and found that "[p]ractitioners most frequently looked for blocks of cells or bands of rows and/or columns in the heatmap" (Engle2017). At present, we do not know of any studies that attempt to assign statistical significance to these color blocks or bands.

We find this state of affairs surprising, in part, because many statistical methods have been developed to assess the row and column dendrograms that appear on the borders of clustered heatmaps. These methods are primarily directed toward estimating the number of clusters present in the data. (For a survey of such methods, see (Charrad14).)

In this paper, we propose a statistical method to estimate the number of rectangular "blocks" or "tiles" present in the heatmap. In effect, our method is equivalent to simultaneously estimating the number of row-clusters and the number of column-clusters in the data. Our approach is based on using linear regression to estimate the mean expression in each block after cutting the row and column dendrograms to define clusters. It is inspired, in part, by the common use of either the Akaike Information Criterion (AIC) or the Bayes Information Criterion (BIC) to select the optimal number of parameters in a linear model by imposing a penalty based on the number of parameters.
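For a tile model with $K$ parameters fit to $N$ observations with independent normal errors, the maximized likelihood depends on the data only through the mean squared residual (MSR), so (up to an additive constant) these criteria take the forms used in the computations later in this vignette:

$$\mathrm{AIC} = N \log(\mathrm{MSR}) + 2K, \qquad \mathrm{BIC} = N \log(\mathrm{MSR}) + K \log N.$$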

Our main innovation in this application comes from the realization that not all of the data present in a typical clustered heatmap is independent. Certainly, data arising from different patient samples is statistically independent. But data arising from different genes, proteins, or other omics features is highly unlikely to be independent. If we were to treat it as independent, then the AIC or BIC penalties arising from the number of parameters would be inadequate to overcome the log likelihood estimates. In that situation, we would tend to choose unrealistically large estimates of the number of clusters and blocks or tiles present in the data. Instead, we introduce methods to estimate the empirical number of degrees of freedom that we can use instead of the full size of the data matrix used for the heatmap.
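Concretely, if we believe that only $W$ of the features behave as statistically independent, then we replace the full matrix size by an effective sample size $N_{\mathrm{eff}} = W \times (\text{number of samples})$, so that, for example, the BIC-style criterion becomes

$$N_{\mathrm{eff}} \log(\mathrm{MSR}) + K \log(N_{\mathrm{eff}}),$$

which is the form explored in the Reweighting section below.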

set.seed(625996)
options(width=88)
options(SweaveHooks = list(fig = function() par(bg='white')))

Methods

library("BlockR")

Data

Our second data set contains mRNA sequencing data from 1,635 patients having one of six kinds of cancer studied as part of The Cancer Genome Atlas (TCGA): head and neck squamous cell cancer (HNSC), esophageal cancer (ESCA), stomach adenocarcinoma (STAD), pancreatic adenocarcinoma (PAAD), colon cancer (COAD), and rectal cancer (READ). These data were log2-transformed and then filtered to retain only the 3,962 genes whose mean expression was at least 6 and whose standard deviation was at least 1. The expression of each gene was then standardized to have mean 0 and standard deviation 1.

data("giTCGA")
dim(giTCGA)
table(giType)
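The preprocessing described above can be sketched as follows. This is an illustration only: rawGI is a hypothetical unfiltered matrix of log2-transformed expression values (one row per gene), which is not shipped with the package.

## Hypothetical preprocessing sketch; 'rawGI' is an assumed log2-scale
## genes-by-samples matrix, not part of the BlockR package.
keep <- apply(rawGI, 1, mean) >= 6 & apply(rawGI, 1, sd) >= 1  # gene filter
filtered <- rawGI[keep, ]
standardized <- t(scale(t(filtered)))  # each gene: mean 0, sd 1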

Statistical Methods

blocker
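The blocker function drives the analyses below. Conceptually, it cuts the row and column dendrograms at every combination of cluster counts, replaces each resulting rectangular tile by its mean, and records the mean squared residual for each combination. The following is a conceptual sketch under that interpretation, not the package implementation; the real blocker also performs the clustering itself, and BlockR:::blockerLoop accepts precomputed dendrograms.

## Conceptual sketch of the tile-fitting loop; not the BlockR source code.
sketchBlocker <- function(X, rowTree, colTree, rmax, cmax) {
  out <- expand.grid(ProteinGroups = 1:rmax, SampleGroups = 1:cmax)
  out$MSE <- NA_real_
  for (k in seq_len(nrow(out))) {
    RR <- cutree(rowTree, k = out$ProteinGroups[k])  # row (feature) clusters
    CC <- cutree(colTree, k = out$SampleGroups[k])   # column (sample) clusters
    fit <- X
    for (r in unique(RR)) {
      for (s in unique(CC)) {
        fit[RR == r, CC == s] <- mean(X[RR == r, CC == s])  # tile mean
      }
    }
    out$MSE[k] <- mean((X - fit)^2)  # mean squared residual for this tiling
  }
  out
}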

Library Packages

Our analysis uses the following R packages.

library(viridisLite)
library(xtable)
library(gplots)
library(mclust)
library(ClassDiscovery)
library(ClassComparison)

We select "clade" colors from palettes created in the Polychrome package.

library("Polychrome")
data(Dark24)
data(Light24)
cols <- c(Dark24, Light24)
rm(Dark24, Light24)

Random matrices (GI TCGA)

We repeat this analysis for the gastrointestinal TCGA data set.

f <- "randomGI.rda"
if (file.exists(f)) {
  load(f)
} else {
  nruns <- 100
  cat("Simulating\n", file=stderr())
  temp <- list()
  for (i in 1:nruns) {
    cat("run ", i, "\n", file=stderr())
    dataset <- matrix(rnorm(prod(dim(giTCGA))), ncol=ncol(giTCGA))
    temp[[i]] <- blocker(dataset)
  }
  randomGI <- array(NA, dim=c(nrow(temp[[1]]), ncol(temp[[1]]), nruns))
  for (i in 1:nruns) {
    randomGI[,,i] <- as.matrix(temp[[i]])
  }
  rm(nruns, i, temp, dataset)
  save(randomGI, file=f)
}
rm(f)

Now we compute the mean values obtained over multiple random matrices. Plots of the mean squared residuals (MSR) as a function of the number of protein groups and sample groups show that the MSR decreases in a regular pattern as the number of groups increases (Figure 1).

ranMeanGI <- apply(randomGI, 1:2, mean)
ranPivotGI <- pivot(ranMeanGI)
colnames(ranMeanGI) <- c("ProteinGroups", "SampleGroups", "MeanMSE")
ranMeanGI <- as.data.frame(ranMeanGI)
attach(ranMeanGI)
interaction.plot(ProteinGroups, SampleGroups, MeanMSE, type='b')
detach()

As a second null model, we scramble the real data by independently permuting the values within each row of the data matrix, which preserves each feature's marginal distribution while destroying the correlation structure. This computation is also cached.

f <- "scrambGI.rda"
if (file.exists(f)) {
  load(f)
} else {
  cat("Scrambling\n", file=stderr())
  temp<- list()
  for (i in 1:5) {
    cat("Run ", i, "\n", file=stderr())    
    dataset <- matrix(NA, ncol=ncol(giTCGA), nrow=nrow(giTCGA))
    for (j in 1:nrow(giTCGA)) {
      dataset[j,] <- sample(giTCGA[j,])
    }
    temp[[i]] <- blocker(dataset)
  }
  scrambGI <- array(NA, dim=c(nrow(temp[[1]]), ncol(temp[[1]]), 5))
  for (i in 1:5) {
    scrambGI[,,i] <- as.matrix(temp[[i]])
  }
  rm(i, j, temp, dataset)
  save(scrambGI, file=f)
}
rm(f)
scrMeanGI <- apply(scrambGI, 1:2, mean)
scrPivotGI <- pivot(scrMeanGI)
colnames(scrMeanGI) <- c("ProteinGroups", "SampleGroups", "MeanMSE")
scrMeanGI <- as.data.frame(scrMeanGI)
attach(scrMeanGI)
interaction.plot(SampleGroups, ProteinGroups, MeanMSE, type='b')
detach()
plot(ranMeanGI$MeanMSE, scrMeanGI$MeanMSE, pch=16,
     xlab="Random Matrices", ylab="Scrambled Matrices")
abline(0,1, col='red')

Scoring

The underlying statistical model used to fit the tiles is that the (clustered) matrix can be divided into uniform tiles, each with its own mean expression. The residual errors that are "unexplained" by these mean values are assumed to be independently and identically distributed (IID), having a normal distribution with mean zero and common standard deviation. Given the estimated parameters (tile means) from this model, we can compute the log likelihood of the data. Note that we actually compute $-2$ log(likelihood), since that is the standard term that shows up in definitions of AIC and BIC. We have defined a function to perform this computation for each number of protein and sample groups, and we apply this function to both the random and scrambled matrices.
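The package provides n2loglik for this computation. As an illustration only (not the package source), assuming the input table has columns ProteinGroups, SampleGroups, and MeanMSE, the calculation might be sketched as:

## Illustrative sketch of the -2 log(likelihood), AIC, and BIC computation;
## the constant term in the normal likelihood is dropped.
sketchN2loglik <- function(tab, N) {
  K <- tab$ProteinGroups * tab$SampleGroups   # naive parameter count
  neg2ll <- N * log(tab$MeanMSE)              # -2 log(L), up to a constant
  data.frame(K = K, neg2ll = neg2ll,
             AIC = neg2ll + 2 * K,            # penalty of 2 per parameter
             BIC = neg2ll + K * log(N))       # penalty of log(N) per parameter
}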

N <- prod(dim(AMLRPPA))
ranLL <- n2loglik(ranMean, N)
scrLL <- n2loglik(scrMean, N)

Now we can plot the (-2) log-likelihood, the AIC, and the BIC as functions of the number of estimated parameters (Figure 7). Note that this number is a naive estimate of the number of degrees of freedom used to fit the data, since it ignores the fact that the data was already used to run clustering algorithms. Ideally, for random matrices, the minimum of the AIC, BIC, or both would be at the starting point where there is only one protein group and one sample group.

opar <- par(mfrow=c(2,2))
attach(ranLL)
plot(K, neg2ll, xlab="Degrees of Freedom", ylab="-2 Log(likelihood)")
plot(K, AIC, xlab="Degrees of Freedom", ylab="AIC")
plot(K, BIC, xlab="Degrees of Freedom", ylab="BIC")
detach()
attach(ranMean)
interaction.plot(SampleGroups, ProteinGroups, MeanMSE, type='b')
detach()
par(opar)

Two-way Dendrogram Cuts on Real Data

Now we cluster the data subset consisting of our actual measurements on proteins in the Akt-PI3K pathway.

sampleClades <- hclust(distanceMatrix(t(AMLRPPA), metric="pearson"), 
                       method="ward.D2")
proteinClades <- hclust(distanceMatrix(AMLRPPA, metric="pearson"), 
                        method="ward.D2")

We fit a block-tile model on the real data, and compute the log likelihood.

storage <- blocker(AMLRPPA)
ourLL <- n2loglik(storage, N)
buffalo <- BlockR:::blockerLoop(AMLRPPA, sampleClades, proteinClades, 9, 23)
dim(storage)
dim(buffalo)
summary(storage)
summary(buffalo)
all.equal(storage, buffalo)

Now we look at the number of clades that would be selected by AIC or BIC on the real data. We expect both of these methods to overestimate the number of clades, since they do not properly correct for the approximate number of parameters used to fit the dendrograms in the first place.

wa <- which(ourLL$AIC==min(ourLL$AIC))
storage[wa,]

wb <- which(ourLL$BIC==min(ourLL$BIC))
storage[wb,]

Now we define our ad hoc "empirical information criterion" (EIC), which no one else currently uses because we just invented it. We start by fitting a linear model to relate the observed log-likelihoods on the real data to the "empirically derived theoretical" log-likelihoods from random data. (At the moment, we do not care whether a linear model is the best way to relate these two quantities; we are only using it to get a sensible estimate of a "magic" correction factor.) We can, of course, plot the two quantities against each other to see if they are anywhere near linear (Figure 5). We are hoping for a magic number in the range 2 to 10, which we might believe and might eventually be able to derive by some theoretical argument.

lmod <- lm(ourLL$neg2ll ~ ranLL$neg2ll)
MAGIC <- coef(lmod)[2]
MAGIC

Now we define the EIC to be the number obtained by subtracting a penalty equal to the "magic" multiple of the random-matrix $-2$ log-likelihood from the observed $-2$ log-likelihood on the real data. The "optimal" number of sample and protein clades is then found by minimizing the EIC.
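In symbols, with $M$ denoting the fitted slope (the "magic" factor) and $n_p$, $n_s$ the numbers of protein and sample clades,

$$\mathrm{EIC}(n_p, n_s) = \bigl[-2\log L_{\mathrm{obs}}(n_p, n_s)\bigr] - M \bigl[-2\log L_{\mathrm{rand}}(n_p, n_s)\bigr].$$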

EIC <- ourLL$neg2ll - MAGIC*ranLL$neg2ll
w <- which(EIC == min(EIC))
storage[w,]
plot(ranLL$neg2ll, ourLL$neg2ll, pch=16,
     xlab="Random Log(likelihood)", ylab="Actual Log(likelihood)")
abline(0,1)
abline(coef(lmod), col="blue")

We can plot the EIC as a function of the naïve number of degrees of freedom (Figure 9). This plot can be thought of as a collection of overlaid plots for different numbers of protein clades.

plot(ourLL$K, EIC, pch=16, xlab="Degrees of Freedom", 
     ylab="(new) Empirical Information Criterion")
NP <- max(storage$ProteinGroups)
for (i in 1:NP) {
  w0 <- which(storage$ProteinGroups == i)
  lines(ourLL$K[w0], EIC[w0], col=cols[i])
}
legend("bottomright", paste("n =", 1:NP), col=cols[1:NP], lwd=2)

GI TCGA

Let's do the same thing with the TCGA GI data set.

Now we cluster the actual gene-expression measurements.

sampleCladesGI <- hclust(distanceMatrix(t(giTCGA), metric="pearson"), 
                       method="ward.D2")
proteinCladesGI <- hclust(distanceMatrix(giTCGA, metric="pearson"), 
                        method="ward.D2")

We fit a block-tile model on the real data, and compute the log likelihood.

f <- "storageGI.rda"
if (file.exists(f)) {
  load(f)
} else {
  storageGI <- blocker(giTCGA)
  save(storageGI, file = f)
}
rm(f)
N <- prod(dim(giTCGA))
ranLLGI <- n2loglik(ranMeanGI, N)
scrLLGI <- n2loglik(scrMeanGI, N)
ourLLGI <- n2loglik(storageGI, N)

Now we look at the number of clades that would be selected by AIC or BIC on the real data. We expect both of these methods to overestimate the number of clades, since they do not properly correct for the approximate number of parameters used to fit the dendrograms in the first place.

wagi <- which(ourLLGI$AIC==min(ourLLGI$AIC))
storageGI[wagi,]

wbgi <- which(ourLLGI$BIC==min(ourLLGI$BIC))
storageGI[wbgi,]

As before, we define the EIC for this data set by fitting a linear model relating the observed log-likelihoods on the real data to the "empirically derived theoretical" log-likelihoods from random data, and we take the slope as the "magic" correction factor.

lmodGI <- lm(ourLLGI$neg2ll ~ ranLLGI$neg2ll)
MAGICgi <- coef(lmodGI)[2]
MAGICgi

Again, the EIC is the observed $-2$ log-likelihood minus the "magic" multiple of the random-matrix $-2$ log-likelihood, and the "optimal" number of sample and protein clades is found by minimizing it.

EICgi <- ourLLGI$neg2ll - MAGICgi*ranLLGI$neg2ll
wgi <- which(EICgi == min(EICgi))
storageGI[wgi,]
plot(ranLLGI$neg2ll, ourLLGI$neg2ll, pch=16,
     xlab="Random Log(likelihood)", ylab="Actual Log(likelihood)")
abline(0,1)
abline(coef(lmodGI), col="blue")

We can plot the EIC as a function of the naïve number of degrees of freedom (Figure 9). This plot can be thought of as a collection of overlaid plots for different numbers of protein clades.

plot(ourLLGI$K, EICgi, pch=16, xlab="Degrees of Freedom", 
     ylab="(new) Empirical Information Criterion")
NPgi <- max(storageGI$ProteinGroups)
for (i in 1:NPgi) {
  w0 <- which(storageGI$ProteinGroups == i)
  lines(ourLLGI$K[w0], EICgi[w0], col=cols[i])
}
legend("bottomright", paste("n =", 1:NPgi), col=cols[1:NPgi], lwd=2)

Modeling the Tiles (EIC)

Having now computed what we think may be a reasonable estimate of the optimal number of protein and sample clades, we can fit the block-tile model for the full data matrix.

np <- storage$ProteinGroups[w]
ns <- storage$SampleGroups[w]
pgroup <- paste("p", cutree(proteinClades, k=np), sep='')
sgroup <- paste("s", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=rep(pgroup, each=nrow(AMLRPPA)),
                    S=rep(sgroup, times=ncol(AMLRPPA)))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
## Display a heatmap with the clustered dendrograms and clade color bars;
## values are truncated at +/- K before plotting so that outliers do not
## dominate the color scale.
showme <- function(dataset, dir='none', main='', K=2, colmap=jetColors(64), symKey=TRUE) {
  temp <- t(truncscale(dataset, dir=dir, K=K))
  heatmap.2(temp,
          Rowv=rowv,  Colv=colv,
          ColSideColors=cols[cutree(sampleClades, ns)],
          RowSideColors=cols[cutree(proteinClades, np)],
          col=colmap, scale="none", zlim=c(-K, K),
          labCol=NA, cexRow=1, symkey=symKey, 
          density.info="none", trace="none", main=main, xlab="", ylab="")
}
cr0 <- colorRampPalette(c("cyan", "#77aa77", "yellow"))
crp <- colorRampPalette(c("blue", "#0088ff", cr0(5), "#ff8800", "red"))
cmap <- crp(128)

srp <- colorRampPalette(c("white", "purple"))
smap <- srp(128)
opar <- par(mai=c(1, 0.3, 0.1, 0.3))
image(seq(-2, 2, length=1024), 1:1, matrix(1:1024), col=cmap, 
      ylab='', xlab='', yaxt='n')
par(opar)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Modeling the Tiles (BIC)

For comparison purposes, we can also use the estimated number of clades coming from the Bayes Information Criterion.

np <- storage$ProteinGroups[wb]
ns <- storage$SampleGroups[wb]
pgroup <- paste("P", cutree(proteinClades, k=np), sep='')
sgroup <- paste("S", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=rep(pgroup, each=nrow(AMLRPPA)),
                    S=rep(sgroup, times=ncol(AMLRPPA)))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Modeling the Tiles (Alternate EIC)

For comparison purposes, we can also use the estimated number of clades coming from a modified form of the EIC that uses a less stringent criterion.

MAGIC <- coef(lmod)[2] / sqrt(2)
EIC2 <- ourLL$neg2ll - MAGIC*ranLL$neg2ll
w2 <- which(EIC2 == min(EIC2))
storage[w2,]

np <- storage$ProteinGroups[w2]
ns <- storage$SampleGroups[w2]
pgroup <- paste("p", cutree(proteinClades, k=np), sep='')
sgroup <- paste("s", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=rep(pgroup, each=nrow(AMLRPPA)),
                    S=rep(sgroup, times=ncol(AMLRPPA)))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Yet Another Alternative

(Taken from the TCGA analysis.) We are going to fit models for everything up to 40 clusters in each direction. Again, this computation takes a long time, so we cache the results.

f <- "AMLRPPA-MSR.rda"
hm <- list(Colv = proteinClades, Rowv = sampleClades)
rmax <- 40
cmax <- 40
if (file.exists(f)) {
  load(f)
} else {
  MSR <- matrix(NA, nrow = rmax, ncol = cmax)
  for (rn in 1:rmax) {
    cat(rn, "\n", file = stderr())
    for (cn in 1:cmax) {
  #    cat("\t", cn, "\n", file = stderr())
      RR <- factor(paste("R", cutree(as.hclust(hm$Rowv), k=rn), sep = ""))
      CC <- factor(paste("C", cutree(as.hclust(hm$Colv), k=cn), sep = ""))
      P <- 0*AMLRPPA
      for (R in levels(RR)) {
        for (C in levels(CC)) {
          P[RR == R, CC == C] <- mean(AMLRPPA[RR==R, CC==C])
        }
      }
      Res <- matrix(AMLRPPA - P, nrow = nrow(AMLRPPA))
      MSR[rn, cn] <- mean(Res^2)
    }
  }
  save(MSR, file = f)
}
rm(f)

Mean Square Residuals

We know that using more parameters to fit the data will always reduce the mean squared residuals, so the location of the smallest MSR will necessarily be in the upper right corner of the next plot.

image(MSR, col = viridis(64), main = "MSR")

AIC

Now we compute and plot the Akaike Information Criterion (AIC). We do not expect the AIC penalty arising from the number of parameters to change the results much, if at all.

kk <- outer(1:rmax, 1:cmax, "*")
nn <- prod(dim(AMLRPPA))

aic <- nn*log(MSR) + 2*kk
image(aic, col = viridis(64), main = "AIC")

BIC

There should be a bigger penalty from the Bayes Information Criterion (BIC). However, the next plot shows that it still doesn't change the results.

bic <- nn*log(MSR) + kk*log(nn)
which(bic == min(bic), arr.ind = TRUE)
image(bic, col= viridis(64), main = "BIC")

Reweighting

We suspect that part of the problem is that we are treating every data point in the entire matrix as independent. While that is true for the patients (columns), it is not likely to be true for the genes (rows). We will evaluate different weightings corresponding to different arbitrary estimates of the number of degrees of freedom arising from the genes.

wt <- seq_len(ncol(AMLRPPA))  # candidate effective numbers of independent genes
locs <- sapply(wt, function(W) {
  nn <- W*ncol(AMLRPPA)
  tric <- nn*log(MSR) + kk*log(nn)
  inds = which(tric == min(tric), arr.ind=TRUE)
  inds
})

Each weighting potentially gives rise to a different optimal number of row and column clusters. However, we actually only observe a fairly limited number of cluster values:

U <- unique(t(locs))
dim(U)
U
X <- apply(U, 1, prod)
Y <- MSR[U]
plot(X, Y, pch = 19, lwd=2, type ="b",
     xlab="Number of parameters",
     ylab = "MSR")
plot(X, Y/X, pch = 19, lwd=2, type ="b",
     xlab="Number of parameters",
     ylab = "Relative MSR")

Most of the improvement in the MSR arises from the first two steps, suggesting that the third pair of cluster numbers, U[3,], may have some optimal properties.

W <- wt[40]
W
nn <- W*ncol(AMLRPPA)
tric <- nn*log(MSR) + kk*log(nn)
inds = which(tric == min(tric), arr.ind=TRUE)
inds

image(1:40, 1:40, tric, col= viridis(64), main = "TRIC")
points(inds[1], inds[2], pch=16, col="red", cex=0.5)

Appendix

sessionInfo()

References

Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace Jr AJ, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD. An information-intensive approach to the molecular pharmacology of cancer. Science. 1997 Jan 17;275(5298):343-9. doi: 10.1126/science.275.5298.343.

Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8. doi: 10.1073/pnas.95.25.14863.

Engle S, Whalen S, Joshi A, Pollard KS. Unboxing cluster heatmaps. BMC Bioinformatics. 2017 Feb 15;18(Suppl 2):63. doi: 10.1186/s12859-016-1442-6.

Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software. 2014;61(6):1-36. URL: http://www.jstatsoft.org/v61/i06/.


