BlockR: Statistical Interpretation of Two-way Clustered Heatmaps

knitr::opts_chunk$set(fig.width=8, fig.height=6)
options(width=96)
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
# Caption helper: prefix "Figure N:" when rendering to HTML; plain caption otherwise.
.tag <- function(N, cap) ifelse(.format == "html",
                                paste("Figure", N, ":", cap),
                                cap)

Introduction

For more than 20 years, since the earliest days of microarray studies, two-way clustered heatmaps have been an important tool for visualization of omics data (Weinstein97; Eisen98). Surprisingly, in all that time, the interpretation of such heatmaps has relied almost exclusively on informal "eyeball" tests looking for rectangular regions containing features and samples with similar levels of expression. A study in 2017, which focused on assessing alternative visual representations of the data, conducted surveys and interviews and found that "[p]ractitioners most frequently looked for blocks of cells or bands of rows and/or columns in the heatmap" (Engle2017). At present, we do not know of any studies that attempt to assign statistical significance to these color blocks or bands.

We find this state of affairs surprising, in part, because many statistical methods have been developed to assess the row and column dendrograms that appear on the borders of clustered heatmaps. These methods are primarily directed toward estimating the number of clusters present in the data. (For a survey of such methods, see (Charrad14).)

In this paper, we propose a statistical method to estimate the number of rectangular "blocks" or "tiles" present in the heatmap. In effect, our method is equivalent to simultaneously estimating the number of row-clusters and the number of column-clusters in the data. Our approach is based on using linear regression to estimate the mean expression in each block after cutting the row and column dendrograms to define clusters. It is inspired, in part, by the common use of either the Akaike Information Criterion (AIC) or the Bayes Information Criterion (BIC) to select the optimal number of parameters in a linear model by imposing a penalty based on the number of parameters.

Our main innovation in this application comes from the realization that not all of the data in a typical clustered heatmap is independent. Data arising from different patient samples is statistically independent. But data arising from different genes, proteins, or other omics features is highly unlikely to be independent. If we were to treat it as independent, then the AIC or BIC penalties arising from the number of parameters would be too small to offset the apparent gains in log-likelihood. In that situation, we would tend to choose unrealistically large estimates of the number of clusters and blocks or tiles present in the data. Instead, we introduce methods to estimate an empirical number of degrees of freedom that we can use in place of the full size of the data matrix underlying the heatmap.

Note: The previous paragraph may, at this point, be overly optimistic. We will test a couple of ideas here, and hope to figure out which one actually works.

set.seed(625996)
options(width=88)
options(SweaveHooks = list(fig = function() par(bg='white')))

Methods

library("BlockR")

Data

Our first data set is a reverse phase protein array (RPPA) data set generated using samples from patients with Acute Myeloid Leukemia (AML). The RPPA data has been quantified and normalized as described previously (REFERENCES). The full data set contained 719 samples from 511 patients with AML; these have previously been combined to adjust for the effects of fresh-vs-frozen and diagnosis-vs-relapse samples. The original data set contained measurements on a total of 203 different antibodies. Here, we use only 44 proteins known to be associated with AKT-PI3K signaling.

data("AMLRPPA")
dim(AMLRPPA)

Statistical Methods

blocker
BlockR:::blockerLoop
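
The chunk above prints the package implementations. For orientation, here is a minimal sketch of our own (not the package code) of the computation described in this paper: cut the two dendrograms at every pair of group counts and record the mean squared residual after subtracting each tile's mean. We use Euclidean distance for brevity; the clustering choices actually used in this vignette (Pearson distance, Ward linkage) appear below.

# Illustrative sketch only -- not the BlockR implementation.
sketchBlocker <- function(data, maxProtein = 9, maxSample = 23) {
  data <- as.matrix(data)                       # samples in rows, proteins in columns
  sampleTree  <- hclust(dist(data), method = "ward.D2")     # cluster rows (samples)
  proteinTree <- hclust(dist(t(data)), method = "ward.D2")  # cluster columns (proteins)
  grid <- expand.grid(ProteinGroups = seq_len(maxProtein),
                      SampleGroups  = seq_len(maxSample))
  grid$MSR <- apply(grid, 1, function(k) {
    pg <- cutree(proteinTree, k = k["ProteinGroups"])  # column (protein) labels
    sg <- cutree(sampleTree,  k = k["SampleGroups"])   # row (sample) labels
    fit <- data
    for (p in unique(pg)) for (s in unique(sg)) {
      fit[sg == s, pg == p] <- mean(data[sg == s, pg == p])  # tile mean
    }
    mean((data - fit)^2)                               # mean squared residual
  })
  grid
}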

Library Packages

Our analysis uses the following R packages.

library(viridisLite)                      # for viridis
suppressMessages(library(gplots))         # for heatmap.2
suppressMessages(library(ClassDiscovery)) # for distanceMatrix

We select "clade" colors from palettes created in the Polychrome package.

library("Polychrome")
data(Dark24)
data(Light24)
cols <- c(Dark24, Light24)
rm(Dark24, Light24)

Results

Random noise matrices (AML RPPA)

Here we apply the "blocker" method to random matrices of the same size as the AML RPPA data.

f <- "randoms.rda"
if (file.exists(f)) {
  load(f)
} else {
  nruns <- 200
  cat("Simulating\n", file=stderr())
  temp <- list()
  for (i in 1:nruns) {
    cat(i, "\n", file=stderr())
    dataset <- matrix(rnorm(prod(dim(AMLRPPA))), ncol=ncol(AMLRPPA))
    temp[[i]] <- blocker(dataset)
  }
  randoms <- array(NA, dim=c(nrow(temp[[1]]), ncol(temp[[1]]), nruns))
  for (i in 1:nruns) {
    randoms[,,i] <- as.matrix(temp[[i]])
  }
  rm(nruns, i, temp, dataset)
  save(randoms, file=f)
}
rm(f)

Now we compute the mean values obtained over multiple random matrices. Plots of the mean squared residuals (MSR) as a function of the number of protein groups and sample groups show that the MSR decreases in a regular pattern as the number of groups increases (Figure 1).

ranMean <- apply(randoms, 1:2, mean)
ranPivot <- pivot(ranMean)
colnames(ranMean) <- c("ProteinGroups", "SampleGroups", "MeanMSE")
ranMean <- as.data.frame(ranMean)
attach(ranMean)
interaction.plot(SampleGroups, ProteinGroups, MeanMSE, type='b')
detach()

Scrambled matrices (AML RPPA)

Here we apply the "blocker" method to scrambled versions of the AML RPPA data. We have scrambled the data within each column (protein), which preserves the distribution of the protein data but breaks all of the correlations.

f <- "scrambled.rda"
if (file.exists(f)) {
  load(f)
} else {
  cat("Scrambling\n", file=stderr())
  temp<- list()
  for (i in 1:100) {
    cat(i, "\n", file=stderr())    
    dataset <- matrix(NA, ncol=ncol(AMLRPPA), nrow=nrow(AMLRPPA))
    for (j in 1:ncol(AMLRPPA)) {
      dataset[,j] <- sample(AMLRPPA[,j])
    }
    temp[[i]] <- blocker(dataset)
  }
  scrambled <- array(NA, dim=c(nrow(temp[[1]]), ncol(temp[[1]]), 100))
  for (i in 1:100) {
    scrambled[,,i] <- as.matrix(temp[[i]])
  }
  rm(i, j, temp, dataset)
  save(scrambled, file=f)
}
rm(f)

Now we compute the mean values obtained over multiple scrambled matrices. Plots of the MSR as a function of the number of protein groups and sample groups show that the MSR declines in a regular pattern as the number of groups increases (Figure 2).

scrMean <- apply(scrambled, 1:2, mean)
scrPivot <- pivot(scrMean)
colnames(scrMean) <- c("ProteinGroups", "SampleGroups", "MeanMSE")
scrMean <- as.data.frame(scrMean)
attach(scrMean)
interaction.plot(SampleGroups, ProteinGroups, MeanMSE, type='b')
detach()

We can also compare the MSR from the random matrices and the column-scrambled matrices (Figure 3). While clearly strongly related, there is a systematic shift (of about -0.003) that makes the MSR for the random matrices slightly larger than the MSR for the scrambled matrices.

plot(ranMean$MeanMSE, scrMean$MeanMSE, pch=16,
     xlab="Random Matrices", ylab="Scrambled Matrices")
abline(0, 1, col='red', lwd=2)            # identity line
abline(-0.00323, 1, col = "purple", lwd = 2)  # identity line shifted by the estimated offset
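
The quoted offset can be estimated directly; a quick check of ours (not part of the original analysis) is:

# Two equivalent ways to estimate the systematic shift between the two curves.
mean(scrMean$MeanMSE - ranMean$MeanMSE)
coef(lm(scrMean$MeanMSE ~ offset(ranMean$MeanMSE)))  # intercept with slope fixed at 1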

Scoring

The underlying statistical model used to fit the tiles is that the (clustered) matrix can be divided into uniform tiles, each with its own mean expression. The residual errors that are "unexplained" by these mean values are assumed to be independently and identically distributed (IID), having a normal distribution with mean zero and common standard deviation. Given the estimated parameters (tile means) from this model, we can compute the log likelihood of the data. Note that we actually compute $-2$ log(likelihood), since that is the quantity that appears in the definitions of AIC and BIC. We have defined a function to perform this computation for each number of protein and sample groups, and we apply this function to both the random and scrambled matrices.
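
For reference, these are the standard formulas; we assume n2loglik implements them with the plug-in variance estimate $\widehat{\sigma}^2$ equal to the MSR. With $N$ data points and $K$ tile means,

$$-2\log L = N\,\log\!\left(2\pi\,\widehat{\sigma}^2\right) + N, \qquad
\mathrm{AIC} = -2\log L + 2K, \qquad
\mathrm{BIC} = -2\log L + K\,\log N.$$

Up to an additive constant that does not depend on the number of groups, $-2\log L = N\,\log(\mathrm{MSR})$; that is the form used explicitly in the "Yet Another Alternative" section below.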

Random matrices

N <- prod(dim(AMLRPPA))
ranLL <- n2loglik(ranMean, N)
n2ll <- matrix(ranLL$neg2ll, ncol = 9)
which(n2ll == min(n2ll), arr.ind = TRUE)
aic  <- matrix(ranLL$AIC, ncol = 9)
which(aic == min(aic), arr.ind = TRUE)
bic  <- matrix(ranLL$BIC, ncol = 9)
which(bic == min(bic), arr.ind = TRUE)

Now we can plot the $-2$ log-likelihood, the AIC, and the BIC as functions of the number of estimated parameters (Figure 4). Note that this number is a naive estimate of the number of degrees of freedom used to fit the data, since it ignores the fact that the data were already used to run the clustering algorithms. Ideally, for random matrices, the minimum of the AIC, the BIC, or both would be at the starting point, where there is only one protein group and one sample group. In practice, neither of those ideals holds up. The AIC says to use the maximum number of clusters tested (9 protein, 23 sample). The BIC does better, but still says there are 3 protein and 3 sample clusters.

opar <- par(mfrow=c(2,2))
attach(ranLL)
plot(K, neg2ll, xlab="Degrees of Freedom", ylab="-2 Log(likelihood)")
plot(K, AIC, xlab="Degrees of Freedom", ylab="AIC")
plot(K, BIC, xlab="Degrees of Freedom", ylab="BIC")
detach()
attach(ranMean)
interaction.plot(SampleGroups, ProteinGroups, MeanMSE, type='b')
detach()
par(opar)

Scrambled matrices

scrLL <- n2loglik(scrMean, N)
n2ll <- matrix(scrLL$neg2ll, ncol = 9)
which(n2ll == min(n2ll), arr.ind = TRUE)
aic  <- matrix(scrLL$AIC, ncol = 9)
which(aic == min(aic), arr.ind = TRUE)
bic  <- matrix(scrLL$BIC, ncol = 9)
which(bic == min(bic), arr.ind = TRUE)

The scrambled data is as bad as the random data. AIC again yields 9 and 23 clusters, while BIC again yields 3 and 3.

opar <- par(mfrow=c(2,2))
attach(scrLL)
plot(K, neg2ll, xlab="Degrees of Freedom", ylab="-2 Log(likelihood)")
plot(K, AIC, xlab="Degrees of Freedom", ylab="AIC")
plot(K, BIC, xlab="Degrees of Freedom", ylab="BIC")
detach()
attach(scrMean)
interaction.plot(SampleGroups, ProteinGroups, MeanMSE, type='b')
detach()
par(opar)

Two-way Dendrogram Cuts on Real Data

Now we cluster the data subset consisting of our actual measurements on proteins in the AKT-PI3K pathway.

sampleClades <- hclust(distanceMatrix(t(AMLRPPA), metric="pearson"), 
                       method="ward.D2")
proteinClades <- hclust(distanceMatrix(AMLRPPA, metric="pearson"), 
                        method="ward.D2")

We fit a block-tile model on the real data, and compute the log likelihood.

storage <- blocker(AMLRPPA)
ourLL <- n2loglik(storage, N)
buffalo <- BlockR:::blockerLoop(AMLRPPA, sampleClades, proteinClades, 9, 23)
dim(storage)
dim(buffalo)
summary(storage)
summary(buffalo)
all.equal(storage, buffalo)

Now we look at the number of clades that would be selected by AIC or BIC on the real data. We expect both of these methods to overestimate the number of clades, since they do not properly correct for the approximate number of parameters used to fit the dendrograms in the first place.

Modeling the Tiles (AIC)

wa <- which(ourLL$AIC==min(ourLL$AIC))
storage[wa,]

The Akaike Information Criterion (AIC) tells us to use the maximum number of clusters tested.

np <- storage$ProteinGroups[wa]
ns <- storage$SampleGroups[wa]
pgroup <- paste("P", cutree(proteinClades, k=np), sep='')
sgroup <- paste("S", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=rep(pgroup, each=nrow(AMLRPPA)),
                    S=rep(sgroup, times=ncol(AMLRPPA)))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
showme <- function(dataset, dir='none', main='', K=2, colmap=jetColors(64), symKey=TRUE) {
  temp <- t(truncscale(dataset, dir=dir, K=K))  # transpose: proteins as heatmap rows
  heatmap.2(temp,
            Rowv=rowv, Colv=colv,
            ColSideColors=cols[cutree(sampleClades, ns)],   # sample clade colors
            RowSideColors=cols[cutree(proteinClades, np)],  # protein clade colors
            col=colmap, scale="none", zlim=c(-K, K),
            labCol=NA, cexRow=1, symkey=symKey,
            density.info="none", trace="none", main=main, xlab="", ylab="")
}
cr0 <- colorRampPalette(c("cyan", "#77aa77", "yellow"))
crp <- colorRampPalette(c("blue", "#0088ff", cr0(5), "#ff8800", "red"))
cmap <- crp(128)

srp <- colorRampPalette(c("white", "purple"))
smap <- srp(128)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Modeling the Tiles (BIC)

For comparison purposes, we can also use the estimated number of clades coming from the Bayes Information Criterion.

wb <- which(ourLL$BIC==min(ourLL$BIC))
storage[wb,]

This keeps the same number of protein clades (9), but reduces the number of sample clades (to 17).

np <- storage$ProteinGroups[wb]
ns <- storage$SampleGroups[wb]
pgroup <- paste("P", cutree(proteinClades, k=np), sep='')
sgroup <- paste("S", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=rep(pgroup, each=nrow(AMLRPPA)),
                    S=rep(sgroup, times=ncol(AMLRPPA)))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Modeling the Tiles (EIC)

Now we define our ad hoc "empirical information criterion" (EIC), which no one else currently uses because we just invented it. We start by fitting a linear model to relate the observed $-2$ log-likelihoods on the real data to the "empirically derived theoretical" values from the column-scrambled data (which, as Figure 3 showed, closely track the purely random results). We are hoping for a magic number in the range 2--10, which we might believe and might eventually be able to derive by some theoretical argument.

At the moment, we do not care if a linear model is the best way to relate these two quantities. We are only using this to get some sensible estimate of a "magic" correction factor. We can, of course, plot the two quantities against each other to see if they are anywhere near linear (Figure 14). Note that a straight line appears to be a pretty poor representation of the data relationships.

#lmod <- lm((-ourLL$neg2ll) ~ sqrt(-scrLL$neg2ll))
lmod <- lm(ourLL$neg2ll ~ scrLL$neg2ll)
MAGIC <- coef(lmod)[2]
MAGIC
#plot(sqrt(-scrLL$neg2ll), (-ourLL$neg2ll), pch=16,
#     xlab="Random Log(likelihood)", ylab="Actual Log(likelihood)")
plot(scrLL$neg2ll, ourLL$neg2ll, pch=16,
     xlab="Scrambled -2 Log(likelihood)", ylab="Observed -2 Log(likelihood)")
abline(0, 1)                           # identity line
abline(coef(lmod), col="blue", lwd=2)  # fitted linear model

Now we define the EIC to be the number obtained by subtracting a penalty equal to the "magic" multiple of the scrambled-matrix $-2$ log-likelihood from the corresponding value observed on the real data. The "optimal" number of sample and protein clades is then found by minimizing the EIC.
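
In symbols, if $\ell_{\mathrm{obs}}$ and $\ell_{\mathrm{scr}}$ denote the observed and scrambled values of $-2\log L$ at a given pair of group counts, and $\lambda$ is the fitted slope stored in MAGIC, then

$$\mathrm{EIC} = \ell_{\mathrm{obs}} - \lambda\,\ell_{\mathrm{scr}}.$$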

EIC <- ourLL$neg2ll - MAGIC*scrLL$neg2ll
w <- which(EIC == min(EIC))
storage[w,]

This model gives only 3 protein clades and only 7 sample clades. That might be overly conservative, unlike the overly liberal BIC and AIC estimates.

We can plot the EIC as a function of the naive number of degrees of freedom (Figure 15). This plot can be thought of as a collection of overlaid plots for different numbers of protein clades.

plot(ourLL$K, EIC, pch=16, xlab="Degrees of Freedom", 
     ylab="(new) Empirical Information Criterion")
NP <- max(storage$ProteinGroups)
for (i in 1:NP) {
  w0 <- which(storage$ProteinGroups == i)
  lines(ourLL$K[w0], EIC[w0], col=cols[i])
}
legend("bottomright", paste("n =", 1:NP), col=cols[1:NP], lwd=2)

Having now computed what we think may be a reasonable estimate of the optimal number of protein and sample clades, we can fit the block-tile model for the full data matrix.

np <- storage$ProteinGroups[w]
ns <- storage$SampleGroups[w]
pgroup <- paste("p", cutree(proteinClades, k=np), sep='')
sgroup <- paste("s", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=factor(rep(pgroup, each=nrow(AMLRPPA))),
                    S=factor(rep(sgroup, times=ncol(AMLRPPA))))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Modeling the Tiles (Alternate EIC)

For comparison purposes, we can also use the estimated number of clades coming from a modified form of the EIC that uses a less stringent criterion.

MAGIC <- coef(lmod)[2] / sqrt(2)
EIC2 <- ourLL$neg2ll - MAGIC*scrLL$neg2ll
w2 <- which(EIC2 == min(EIC2))
storage[w2,]

This method wants us to use 6 protein clusters and 15 sample clusters. Maybe we are closing in on the right values.

np <- storage$ProteinGroups[w2]
ns <- storage$SampleGroups[w2]
pgroup <- paste("P", cutree(proteinClades, k=np), sep='')
sgroup <- paste("S", cutree(sampleClades, k=ns), sep='')
rowv <- as.dendrogram(proteinClades)
colv <- as.dendrogram(sampleClades)
tdata <- data.frame(Y=as.vector(as.matrix(AMLRPPA)),
                    P=rep(pgroup, each=nrow(AMLRPPA)),
                    S=rep(sgroup, times=ncol(AMLRPPA)))
model <- lm(Y ~ P*S, data=tdata)
anova(model)
kut <- 1.6
showme(AMLRPPA, "none", "Scaled Data", K=kut, colmap=cmap)
rebox <-matrix(model$fit, ncol=ncol(AMLRPPA))
dimnames(rebox) <- dimnames(AMLRPPA)
showme(rebox, "none", "Tiles", K=1, colmap=cmap)
resbox <-matrix(model$res, ncol=ncol(AMLRPPA))
dimnames(resbox) <- dimnames(AMLRPPA)
showme(resbox, "none", "Absolute Residuals", K=kut, colmap=cmap)
showme(resbox, "col", "Relative Residuals", K=kut, colmap=cmap)

Yet Another Alternative

(Taken from the TCGA analysis.) We are going to fit models for everything up to 40 clusters in each direction. Again, this computation takes a long time, so we cache the results.

f <- "AMLRPPA-MSR.rda"
hm <- list(Colv = proteinClades, Rowv = sampleClades)
rmax <- 40
cmax <- 40
if (file.exists(f)) {
  load(f)
} else {
  M <- as.matrix(AMLRPPA)  # work with a plain matrix so subsetting and mean() behave
  MSR <- matrix(NA, nrow = rmax, ncol = cmax)
  for (rn in 1:rmax) {
    cat(rn, "\n", file = stderr())
    for (cn in 1:cmax) {
      RR <- factor(paste("R", cutree(as.hclust(hm$Rowv), k=rn), sep = ""))
      CC <- factor(paste("C", cutree(as.hclust(hm$Colv), k=cn), sep = ""))
      P <- 0*M                          # matrix of fitted tile means
      for (R in levels(RR)) {
        for (C in levels(CC)) {
          P[RR == R, CC == C] <- mean(M[RR == R, CC == C])
        }
      }
      MSR[rn, cn] <- mean((M - P)^2)    # mean squared residual for this (rn, cn)
    }
  }
  save(MSR, file = f)
}
rm(f)

Mean Squared Residuals

We know that using more parameters to fit the data will always reduce the mean squared residuals, so the location of the smallest MSR will necessarily be in the upper right corner of the next plot.

image(MSR, col = viridis(64), main = "MSR")

AIC

Now we compute and plot the Akaike Information Criterion (AIC). We do not expect the AIC penalty arising from the number of parameters to change the results much, if at all.

kk <- outer(1:rmax, 1:cmax, "*")
nn <- prod(dim(AMLRPPA))

aic <- nn*log(MSR) + 2*kk
image(aic, col = viridis(64), main = "AIC")

BIC

There should be a bigger penalty from the Bayes Information Criterion (BIC). However, the next plot shows that it still doesn't change the results.

bic <- nn*log(MSR) + kk*log(nn)
which(bic == min(bic), arr.ind = TRUE)
image(bic, col= viridis(64), main = "BIC")

Reweighting

We suspect that part of the problem is that we are treating every data point in the entire matrix as independent. While that assumption is plausible across patient samples, it is highly unlikely to hold across the proteins. We will evaluate different weightings corresponding to different arbitrary estimates of the number of degrees of freedom contributed by the proteins.
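
Concretely (a restatement of the chunk below, not a new method), we replace the full matrix size $N$ in the BIC with an effective size $n_W = W \cdot c$, where $c$ is the number of columns of AMLRPPA and $W$ is a hypothesized number of independent degrees of freedom, and then minimize

$$\mathrm{BIC}_W = n_W\,\log(\mathrm{MSR}) + K\,\log n_W$$

over the grid of cluster counts, repeating the search for each candidate $W$.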

wt <- round(seq(1, ncol(AMLRPPA)))   # candidate effective degrees of freedom
locs <- sapply(wt, function(W) {
  nn <- W*ncol(AMLRPPA)              # effective number of data points
  tric <- nn*log(MSR) + kk*log(nn)   # reweighted BIC-like criterion
  which(tric == min(tric), arr.ind = TRUE)
})

Each weighting potentially gives rise to a different optimal number of row and column clusters. However, we actually only observe a fairly limited number of cluster values:

U <- unique(t(locs))
dim(U)
U
X <- apply(U, 1, prod)
Y <- MSR[U]
plot(X, Y, pch = 19, lwd=2, type ="b",
     xlab="Number of parameters",
     ylab = "MSR")
plot(X, Y/X, pch = 19, lwd=2, type ="b",
     xlab="Number of parameters",
     ylab = "Relative MSR")

Most of the improvement in the MSR arises from the first two steps, suggesting that the third pair of cluster numbers (the third row of U above) may have some optimal properties.
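
To see how sensitive the selected optimum is to the weighting, one can tabulate how many values of W pick each distinct pair; this is an illustrative check of ours, not part of the original analysis.

# Count how many weightings W select each distinct (row, column) optimum.
table(apply(t(locs), 1, paste, collapse = " x "))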

W <- wt[40]
W
nn <- W*ncol(AMLRPPA)
tric <- nn*log(MSR) + kk*log(nn)
inds = which(tric == min(tric), arr.ind=TRUE)
inds

image(1:40, 1:40, tric, col= viridis(64), main = "TRIC")
points(inds[1], inds[2], pch=16, col="red", cex=0.5)

Appendix

sessionInfo()

References

Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace Jr AJ, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD. An information-intensive approach to the molecular pharmacology of cancer. Science. 1997 Jan 17;275(5298):343-9. doi: 10.1126/science.275.5298.343.

Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8. doi: 10.1073/pnas.95.25.14863.

Engle S, Whalen S, Joshi A, Pollard KS. Unboxing cluster heatmaps. BMC Bioinformatics. 2017 Feb 15;18(Suppl 2):63. doi: 10.1186/s12859-016-1442-6.

Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software. 2014;61(6):1-36. URL: http://www.jstatsoft.org/v61/i06/.


