clusterbenchstats: Run and validate many clusterings
In fpc: Flexible Procedures for Clustering

Description

This runs the methodology explained in Hennig (2019) and Akhanli and Hennig (2020). It runs a user-specified set of clustering methods (CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset, and computes many cluster validation indexes. To explore the variation of these indexes, random clusterings of the data are generated, and the validation indexes are standardised using the random clusterings so that they become comparable and differences between values become interpretable.
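As a rough illustration of what a random clustering can look like, the following base R sketch draws k observations at random as centroids and assigns every observation to its nearest centroid. The helper `random_kcentroids` is hypothetical and not part of fpc; it only mimics the idea behind `stupidkcentroids` and makes no claim to reproduce its implementation.

```r
# Sketch of a random "k-centroids" clustering (base R only).
# random_kcentroids is a hypothetical helper, not an fpc function.
random_kcentroids <- function(x, k) {
  x <- as.matrix(x)
  # Pick k distinct observations at random to serve as centroids.
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  # Squared Euclidean distance from every observation to every centroid
  # (one column per centroid).
  d2 <- matrix(
    vapply(seq_len(k),
           function(j) rowSums(sweep(x, 2, centres[j, ])^2),
           numeric(nrow(x))),
    nrow = nrow(x)
  )
  # Assign each observation to its nearest centroid.
  max.col(-d2, ties.method = "first")
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 made-up observations in 2 dimensions
labels <- random_kcentroids(x, k = 3)
table(labels)
```

Repeating such a draw many times gives a reference distribution of index values against which the indexes of the real clusterings can be judged.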

The function print.valstat can be used to provide weights for the cluster validation statistics, and will then compute a weighted validation index that can be used to compare all clusterings.
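The weighted index can be pictured with a small base R sketch. It assumes that the aggregate is a weighted mean of the standardised statistics, as the `print.valstat` help page describes it; the numbers and the choice of three statistics are made up for illustration.

```r
# Toy matrix of standardised validation statistics:
# rows are clusterings, columns are statistics (all values are made up).
stats <- rbind(
  kmeans_2  = c(avewithin = 0.5, sepindex = 1.0, asw = 0.0),
  average_2 = c(avewithin = 1.5, sepindex = -1.0, asw = 2.0)
)

# One weight per statistic; a weight of 0 drops that statistic.
w <- c(avewithin = 1, sepindex = 0, asw = 1)

# Weighted mean of the statistics for every clustering.
aggregated <- as.vector(stats %*% w) / sum(w)
names(aggregated) <- rownames(stats)
aggregated
```

Such an aggregate is only meaningful when the statistics are on a common scale, which is why the standardised values are used rather than the raw ones.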

See the examples for how to get the indexes A1 and A2 from Akhanli and Hennig (2020).

Usage

clusterbenchstats(data,G,diss = inherits(data, "dist"),
                  scaling=TRUE, clustermethod,
                  methodnames=clustermethod,
                  distmethod=rep(TRUE,length(clustermethod)),
                  ncinput=rep(TRUE,length(clustermethod)),
                  clustermethodpars,
                  npstats=FALSE,
                  useboot=FALSE,
                  bootclassif=NULL,
                  bootmethod="nselectboot",
                  bootruns=25,
                  trace=TRUE,
                  pamcrit=TRUE,snnk=2,
                  dnnk=2,
                  nnruns=100,kmruns=100,fnruns=100,avenruns=100,
                  multicore=FALSE,cores=detectCores()-1,
                  useallmethods=TRUE,
                  useallg=FALSE,...)

## S3 method for class 'clusterbenchstats'
print(x,...)

Arguments

`data`	data matrix or `dist`-object.
`G`	vector of integers. Numbers of clusters to consider.
`diss`	logical. If `TRUE`, the data matrix is assumed to be a distance/dissimilarity matrix, otherwise it is an observations times variables matrix.
`scaling`	either a logical or a numeric vector of length equal to the number of columns of `data`. If `FALSE`, data won't be scaled, otherwise `scaling` is passed on to `scale` as argument `scale`.
`clustermethod`	vector of strings specifying names of CBI-functions (see `kmeansCBI`). These are the clustering methods to be applied.
`methodnames`	vector of strings with user-chosen names for clustering methods, one for every method in `clustermethod`. These can be used to distinguish different methods run by the same CBI-function but with different parameter values such as complete and average linkage for `hclustCBI`.
`distmethod`	vector of logicals, of the same length as `clustermethod`. `TRUE` means that the clustering method operates on distances. If `diss=TRUE`, all entries have to be `TRUE`. Otherwise, if an entry is `TRUE`, the corresponding method will be applied to `dist(data)`.
`ncinput`	vector of logicals, of the same length as `clustermethod`. `TRUE` indicates that the corresponding clustering method requires the number of clusters as input and will not estimate the number of clusters itself. Only methods for which this is `TRUE` can be used with `useboot=TRUE`.
`clustermethodpars`	list of the same length as `clustermethod`. Specifies parameters for all involved clustering methods. Its kth entry is passed to clustermethod number k. An entry can be empty if all defaults are used for that clustering method. However, the last entry must not be empty (if there is nothing else to specify, set one parameter of the last clustering method to its default value). The number of clusters does not need to be specified here.
`npstats`	logical. If `TRUE`, `distrsimilarity` is called and the two validity statistics computed there are added. These require `diss=FALSE`.
`useboot`	logical. If `TRUE`, a stability index (either `nselectboot` or `prediction.strength`) will be involved.
`bootclassif`	If `useboot=TRUE`, a vector of strings indicating the classification methods to be used with the stability index for the different methods indicated in `clustermethod`, see the `classification` argument of `nselectboot` and `prediction.strength`.
`bootmethod`	either `"nselectboot"` or `"prediction.strength"`; stability index to be used if `useboot=TRUE`.
`bootruns`	integer. Number of resampling runs. If `useboot=TRUE`, passed on as `B` to `nselectboot` or `M` to `prediction.strength`. Note that these are applied to all `kmruns+nnruns+avenruns+fnruns` random clusterings on top of the regular ones, which may take a lot of time if `bootruns` and these values are chosen large.
`trace`	logical. If `TRUE`, some runtime information is printed.
`pamcrit`	logical. If `TRUE`, the average distance of points to their respective cluster centroids is computed (criterion of the PAM clustering method, validation criterion `pamc`); centroids are chosen so that they minimise this criterion for the given clustering. Passed on to `cqcluster.stats`.
`snnk`	integer. Number of neighbours used in coefficient of variation of distance to nearest within cluster neighbour, the `cvnnd`-statistic (clusters with `snnk` or fewer points are ignored for this). Passed on to `cqcluster.stats` as argument `nnk`.
`dnnk`	integer. Number of nearest neighbours to use for dissimilarity to the uniform if `npstats=TRUE`; passed on to `distrsimilarity` as argument `nnk`.
`nnruns`	integer. Number of runs of `stupidknn` (random clusterings). With `useboot=TRUE` one may want to choose this lower than the default for reasons of computation time.
`kmruns`	integer. Number of runs of `stupidkcentroids` (random clusterings). With `useboot=TRUE` one may want to choose this lower than the default for reasons of computation time.
`fnruns`	integer. Number of runs of `stupidkfn` (random clusterings). With `useboot=TRUE` one may want to choose this lower than the default for reasons of computation time.
`avenruns`	integer. Number of runs of `stupidkaven` (random clusterings). With `useboot=TRUE` one may want to choose this lower than the default for reasons of computation time.
`multicore`	logical. If `TRUE`, parallel computing is used through the function `mclapply` from package `parallel`; read warnings there if you intend to use this; it won't work on Windows.
`cores`	integer. Number of cores for parallelisation.
`useallmethods`	logical, to be passed on to `cgrestandard`. If `FALSE`, only random clustering results are used for standardisation. If `TRUE`, clustering results from all methods are used.
`useallg`	logical, to be passed on to `cgrestandard`. If `TRUE`, standardisation uses results from all numbers of clusters in `G`. If `FALSE`, standardisation of results for a specific number of clusters only uses results from that number of clusters.
`...`	further arguments to be passed on to `cqcluster.stats` through `clustatsum` (no effect in `print.clusterbenchstats`).
`x`	object of class `"clusterbenchstats"`.
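The following sketch shows one way to set up the method-related arguments so that their lengths agree, here with `hclustCBI` run twice with different linkage. It uses base R only; the chosen methods and names are illustrative, and the length checks are a convenience added for this sketch, not something `clusterbenchstats` is claimed to perform.

```r
# kmeansCBI once, hclustCBI twice (complete and average linkage).
clustermethod <- c("kmeansCBI", "hclustCBI", "hclustCBI")
methodnames   <- c("kmeans", "complete", "average")

# One entry per method. kmeansCBI keeps its defaults (empty entry);
# the last entry must not be empty.
clustermethodpars <- list(
  list(),
  list(method = "complete"),
  list(method = "average")
)

# All methods are given the raw data and need the number of clusters.
distmethod <- rep(FALSE, length(clustermethod))
ncinput    <- rep(TRUE, length(clustermethod))

# Sanity checks on the argument lengths (added for this sketch only).
stopifnot(
  length(methodnames) == length(clustermethod),
  length(clustermethodpars) == length(clustermethod),
  length(distmethod) == length(clustermethod),
  length(ncinput) == length(clustermethod),
  length(clustermethodpars[[length(clustermethodpars)]]) > 0
)
```

Distinct entries in `methodnames` are what keep the two `hclustCBI` runs apart in the output.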

Value

The output of clusterbenchstats is a large list of lists with the components cm, stat, sim, qstat and sstat:

`cm`	output object of `cluster.magazine`, see there for details. Clustering of all methods and numbers of clusters on the dataset `data`.

`stat`	object of class `"valstat"`, see `valstat.object` for details. Unstandardised cluster validation statistics.
`sim`	output object of `randomclustersim`, see there. Validity indexes from random clusterings used for standardisation of validation statistics on `data`.
`qstat`	object of class `"valstat"`, see `valstat.object` for details. Cluster validation statistics standardised by random clusterings, output of `cgrestandard` based on percentages, i.e., with `percentage=TRUE`.
`sstat`	object of class `"valstat"`, see `valstat.object` for details. Cluster validation statistics standardised by random clusterings, output of `cgrestandard` based on mean and standard deviation (called Z-score standardisation in Akhanli and Hennig (2020)), i.e., with `percentage=FALSE`.
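The two kinds of standardisation can be sketched in base R for a single validation index. All values are made up, and the formulas are simplified assumptions: only random clusterings form the reference set here, and the percentage variant is taken as the plain share of reference values below the observed one. The exact computation in `cgrestandard` may differ in detail.

```r
# Made-up values of one validation index.
random_vals <- c(0.10, 0.20, 0.30, 0.40, 0.50)   # from random clusterings
observed    <- c(kmeans = 0.45, average = 0.25)  # from the real clusterings

# Z-score type (cf. sstat): centre and scale by the mean and
# standard deviation of the reference values.
z <- (observed - mean(random_vals)) / sd(random_vals)

# Percentage type (cf. qstat): share of reference values that lie
# below each observed value.
q <- vapply(observed, function(v) mean(random_vals < v), numeric(1))

rbind(z = z, q = q)
```

Both transformations put different indexes on a common scale, so that a value can be read as "how much better than a random clustering".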

Note

This may require a lot of computing time and also memory for datasets that are not small, as most indexes require computation and storage of distances.
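A back-of-the-envelope estimate of the memory needed for the distances alone can be sketched as follows. It assumes distances are held as one double (8 bytes) per pair of observations, as in a `dist` object; `dist_gib` is a hypothetical helper, and the working memory used by the validation computations comes on top of this.

```r
# Approximate size in GiB of a "dist" object on n observations:
# n * (n - 1) / 2 pairwise distances at 8 bytes each.
# dist_gib is a hypothetical helper for this sketch.
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 2^30

dist_gib(c(1000, 10000, 50000))

# The number of stored distances can be checked on a small example.
length(dist(matrix(0, nrow = 100, ncol = 2)))  # 100 * 99 / 2 pairs
```

Because the size grows quadratically in the number of observations, a tenfold increase in n costs roughly a hundred times the memory.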

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Examples

  library(fpc)
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod <- c("kmeansCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
  clustermethodpars <- list()
  clustermethodpars[[2]] <- list()
  clustermethodpars[[2]]$method <- "average"
# Last element of clustermethodpars needs to have an entry!
  methodname <- c("kmeans","average")
  cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodnames=methodname,distmethod=rep(FALSE,2),
    clustermethodpars=clustermethodpars,nnruns=1,kmruns=1,fnruns=1,avenruns=1)
  print(cbs)
  print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.

# Now using bootstrap stability assessment as in Akhanli and Hennig (2020):
  bootclassif <- c("centroid","averagedist")
  cbsboot <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodnames=methodname,distmethod=rep(FALSE,2),
    clustermethodpars=clustermethodpars,
    useboot=TRUE,bootclassif=bootclassif,bootmethod="nselectboot",
    bootruns=2,nnruns=1,kmruns=1,fnruns=1,avenruns=1,useallg=TRUE)
  print(cbsboot)
## Not run: 
# Index A1 in Akhanli and Hennig (2020) (need these weights choices):
  print(cbsboot$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0))
# Index A2 in Akhanli and Hennig (2020) (need these weights choices):
  print(cbsboot$sstat,aggregate=TRUE,weights=c(0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0))

## End(Not run)

# Results from nselectboot:
  plot(cbsboot$stat,cbsboot$sim,statistic="boot")

fpc documentation built on Sept. 24, 2024, 9:07 a.m.