prediction.strength: Prediction strength for estimating number of clusters
In fpc: Flexible Procedures for Clustering

Description

Computes the prediction strength of a clustering of a dataset into different numbers of components. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. See Details.

Various clustering methods can be used, see argument clustermethod. In Tibshirani and Walther (2005), only classification to the nearest centroid is discussed, but more methods are offered here, see argument classification.

Usage

  prediction.strength(xdata, Gmin=2, Gmax=10, M=50,
                      clustermethod=kmeansCBI,
                      classification="centroid", centroidname=NULL,
                      cutoff=0.8, nnk=1,
                      distances=inherits(xdata,"dist"), count=FALSE, ...)

  ## S3 method for class 'predstr'
  print(x, ...)

Arguments

`xdata`	data (something that can be coerced into a matrix).
`Gmin`	integer. Minimum number of clusters. Note that the prediction strength for 1 cluster is trivially 1, which is automatically included if `Gmin>1`. Therefore `Gmin<2` is useless.
`Gmax`	integer. Maximum number of clusters.
`M`	integer. Number of times the dataset is divided into two halves.
`clustermethod`	an interface function (the function name, not a string containing the name, has to be provided!). This defines the clustering method. See the "Details"-section of `clusterboot` and `kmeansCBI` for the format. Clustering methods for `prediction.strength` must have a `k`-argument for the number of clusters, must operate on n times p data matrices and must otherwise follow the specifications in `clusterboot`. Note that `prediction.strength` won't work with CBI-functions that implicitly already estimate the number of clusters such as `pamkCBI`; use `claraCBI` if you want to run it for pam/clara clustering.
`classification`	string. This determines how non-clustered points are classified to given clusters. Options are explained in `classifnp` and `classifdist`, the latter for dissimilarity data. Certain classification methods are connected to certain clustering methods. `classification="averagedist"` is recommended for average linkage, `classification="centroid"` is recommended for k-means, clara and pam (with distances it will work with `claraCBI` only), `classification="knn"` with `nnk=1` is recommended for single linkage and `classification="qda"` is recommended for Gaussian mixtures with flexible covariance matrices.
`centroidname`	string. Indicates the name of the component of `CBIoutput$result` that contains the cluster centroids in case of `classification="centroid"`, where `CBIoutput` is the output object of `clustermethod`. If `clustermethod` is `kmeansCBI` or `claraCBI`, centroids are recognised automatically if `centroidname=NULL`. If `centroidname=NULL` and `distances=FALSE`, cluster means are computed as the cluster centroids.
`cutoff`	numeric between 0 and 1. The optimal number of clusters is the maximum one with prediction strength above `cutoff`.
`nnk`	number of nearest neighbours if `classification="knn"`, see `classifnp`.
`distances`	logical. If `TRUE`, data will be interpreted as dissimilarity matrix, passed on to clustering methods as `"dist"`-object, and `classifdist` will be used for classification.
`count`	logical. `TRUE` will print current number of clusters and simulation run number on the screen.
`x`	object of class `predstr`.
`...`	arguments to be passed on to the clustering method.

Details

The prediction strength for a certain number of clusters k under a random partition of the dataset into halves A and B is defined as follows. Both halves are clustered with k clusters. Then the points of A are classified to the clusters of B. In the original paper this is done by assigning every observation in A to the closest cluster centroid in B (corresponding to `classification="centroid"`), but other methods are possible, see `classifnp`. A pair of points of A in the same A-cluster is defined to be correctly predicted if both points are classified into the same cluster on B. The same is done with the points of B relative to the clustering on A. The prediction strength for each of the two clusterings is the minimum (taken over all clusters) relative frequency of correctly predicted pairs of points of that cluster. The final mean prediction strength statistic is the mean over all 2M clusterings, where M is the number of random partitions.
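
The definition above can be sketched in base R as follows. This is only an illustration, not the code used inside `prediction.strength`: it assumes k-means clustering via `stats::kmeans` with nearest-centroid classification, and the helper names `nearest_centroid` and `half_strength` are made up for this sketch. Treating a cluster with fewer than two points as perfectly predicted is also an assumption; the package may handle that case differently.

```r
# Assign every row of `points` to the closest row of `centers`
# (squared Euclidean distance). Hypothetical helper for this sketch.
nearest_centroid <- function(points, centers) {
  d <- sapply(seq_len(nrow(centers)), function(j)
    rowSums(sweep(points, 2, centers[j, ])^2))
  d <- matrix(d, nrow = nrow(points))   # keep matrix shape for 1 row or 1 centre
  max.col(-d, ties.method = "first")    # column index of the smallest distance
}

# Prediction strength of half `a` relative to the clustering of half `b`.
half_strength <- function(a, b, k) {
  cl_a <- kmeans(a, centers = k, nstart = 10)   # clustering of A
  cl_b <- kmeans(b, centers = k, nstart = 10)   # clustering of B
  pred <- nearest_centroid(a, cl_b$centers)     # A classified to B's clusters
  per_cluster <- sapply(seq_len(k), function(j) {
    p <- pred[cl_a$cluster == j]                # predicted labels within A-cluster j
    n <- length(p)
    if (n < 2) return(1)                        # assumption: no pairs counts as 1
    # share of ordered pairs of distinct points given the same predicted label
    (sum(table(p)^2) - n) / (n * (n - 1))
  })
  min(per_cluster)                              # minimum over the clusters
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
M <- 5
k <- 2
ps <- replicate(M, {
  idx <- sample(nrow(x), nrow(x) %/% 2)         # random split into halves
  a <- x[idx, , drop = FALSE]
  b <- x[-idx, , drop = FALSE]
  c(half_strength(a, b, k), half_strength(b, a, k))
})
mean(ps)                                        # mean over all 2M values
```

Each column of `ps` holds the two values from one split (A relative to B, and B relative to A), so `mean(ps)` corresponds to the mean prediction strength for that k.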

Value

`prediction.strength` returns an object of class `predstr`, which is a list with components

`predcorr`	list of vectors of length `M` with relative frequencies of correct predictions (clusterwise minimum). Every list entry refers to a certain number of clusters.
`mean.pred`	means of `predcorr` for all numbers of clusters.
`optimalk`	optimal number of clusters.
`cutoff`	see above.
`method`	a string identifying the clustering method.
`Gmax`	see above.
`M`	see above.
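
How `optimalk` relates to `mean.pred` and `cutoff` can be illustrated with a small sketch. The values in `mean_pred` are invented for illustration only, and the strict comparison `>` is an assumption based on the wording "above `cutoff`"; this is not taken from the package source.

```r
# Invented mean prediction strengths; position i stands for i clusters.
# The entry for 1 cluster is 1 by definition.
mean_pred <- c(1, 0.93, 0.85, 0.62, 0.55)
cutoff <- 0.8

# Largest number of clusters whose mean prediction strength exceeds the cutoff.
optimal_k <- max(which(mean_pred > cutoff))
optimal_k
```

Because the value for one cluster is always 1, the rule always yields at least one candidate for any `cutoff` below 1.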

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.

Examples

  options(digits=3)
  set.seed(98765)
  iriss <- iris[sample(150,20),-5]
  prediction.strength(iriss,2,3,M=3)
  prediction.strength(iriss,2,3,M=3,clustermethod=claraCBI)
# The examples are fast, but of course M should really be larger.

fpc documentation built on Jan. 14, 2026, 9:07 a.m.