clusterbenchstats: Run and validate many clusterings

View source: R/cquality20.R

clusterbenchstats    R Documentation


Description

This runs the methodology explained in Hennig (2019) and Akhanli and Hennig (2020). It runs a user-specified set of clustering methods (CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset, and computes many cluster validation indexes. To explore the variation of these indexes, random clusterings of the data are generated, and the validation indexes are standardised using the random clusterings, so that they become comparable and differences between values become interpretable.

The function print.valstat can be used to provide weights for the cluster validation statistics, and will then compute a weighted validation index that can be used to compare all clusterings.

See the examples for how to get the indexes A1 and A2 from Akhanli and Hennig (2020).
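As a rough illustration of the weighted validation index mentioned above, the following base R sketch aggregates standardised statistics by a weighted mean. The matrix zstats, its row and column names, and the helper aggregate_index are hypothetical and exist only for this illustration. print.valstat may differ in detail, so see ?print.valstat for the actual behaviour.

```r
# Hypothetical standardised statistics: rows are clusterings,
# columns are validation statistics (larger values taken as better).
zstats <- matrix(c(1.2, 0.3, -0.5,
                   0.4, 1.1,  0.8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("kmeans", "average"),
                                 c("stat1", "stat2", "stat3")))

# Hypothetical helper (not part of fpc): weighted mean of the statistics.
aggregate_index <- function(stats, weights) {
  stopifnot(length(weights) == ncol(stats), sum(weights) > 0)
  drop(stats %*% weights) / sum(weights)
}

# Weight 1 for stat1 and stat3, weight 0 for stat2 (stat2 is ignored).
agg <- aggregate_index(zstats, weights = c(1, 0, 1))
print(agg)
```

A weight of 0 removes a statistic from the aggregate, which is how the 0/1 weight vectors in the examples select the statistics that enter an index.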

Usage

clusterbenchstats(data, G, diss = inherits(data, "dist"),
                  scaling = TRUE, clustermethod,
                  methodnames = clustermethod,
                  distmethod = rep(TRUE, length(clustermethod)),
                  ncinput = rep(TRUE, length(clustermethod)),
                  clustermethodpars,
                  npstats = FALSE,
                  useboot = FALSE,
                  bootclassif = NULL,
                  bootmethod = "nselectboot",
                  bootruns = 25,
                  trace = TRUE,
                  pamcrit = TRUE, snnk = 2,
                  dnnk = 2,
                  nnruns = 100, kmruns = 100, fnruns = 100, avenruns = 100,
                  multicore = FALSE, cores = detectCores() - 1,
                  useallmethods = TRUE,
                  useallg = FALSE, ...)

## S3 method for class 'clusterbenchstats'
print(x,...)

Arguments

data

data matrix or dist-object.

G

vector of integers. Numbers of clusters to consider.

diss

logical. If TRUE, the data matrix is assumed to be a distance/dissimilarity matrix; otherwise it is an observations times variables matrix.

scaling

either a logical or a numeric vector of length equal to the number of columns of data. If FALSE, data won't be scaled, otherwise scaling is passed on to scale as argument scale.

clustermethod

vector of strings specifying names of CBI-functions (see kmeansCBI). These are the clustering methods to be applied.

methodnames

vector of strings with user-chosen names for clustering methods, one for every method in clustermethod. These can be used to distinguish different methods run by the same CBI-function but with different parameter values such as complete and average linkage for hclustCBI.

distmethod

vector of logicals, of the same length as clustermethod. TRUE means that the clustering method operates on distances. If diss=TRUE, all entries have to be TRUE. Otherwise, if an entry is TRUE, the corresponding method will be applied to dist(data).

ncinput

vector of logicals, of the same length as clustermethod. TRUE indicates that the corresponding clustering method requires the number of clusters as input and will not estimate the number of clusters itself. Only methods for which this is TRUE can be used with useboot=TRUE.

clustermethodpars

list of the same length as clustermethod. Specifies parameters for all involved clustering methods. Its kth entry is passed to clustermethod number k. An entry can be empty if all defaults are used for a clustering method. However, the last entry is not allowed to be empty (you may just set a parameter of the last clustering method to its default value if you don't want to specify anything else)! The number of clusters does not need to be specified here.

npstats

logical. If TRUE, distrsimilarity is called and the two validity statistics computed there are added. These require diss=FALSE.

useboot

logical. If TRUE, a stability index (either nselectboot or prediction.strength) will be involved.

bootclassif

If useboot=TRUE, a vector of strings indicating the classification methods to be used with the stability index for the different methods indicated in clustermethod, see the classification argument of nselectboot and prediction.strength.

bootmethod

either "nselectboot" or "prediction.strength"; stability index to be used if useboot=TRUE.

bootruns

integer. Number of resampling runs. If useboot=TRUE, passed on as B to nselectboot or M to prediction.strength. Note that these are applied to all kmruns+nnruns+avenruns+fnruns random clusterings on top of the regular ones, which may take a lot of time if bootruns and these values are chosen large.

trace

logical. If TRUE, some runtime information is printed.

pamcrit

logical. If TRUE, the average distance of points to their respective cluster centroids is computed (criterion of the PAM clustering method, validation criterion pamc); centroids are chosen so that they minimise this criterion for the given clustering. Passed on to cqcluster.stats.

snnk

integer. Number of neighbours used in coefficient of variation of distance to nearest within cluster neighbour, the cvnnd-statistic (clusters with snnk or fewer points are ignored for this). Passed on to cqcluster.stats as argument nnk.

dnnk

integer. Number of nearest neighbors to use for dissimilarity to the uniform in case that npstats=TRUE; nnk-argument to be passed on to distrsimilarity.

nnruns

integer. Number of runs of stupidknn (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.

kmruns

integer. Number of runs of stupidkcentroids (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.

fnruns

integer. Number of runs of stupidkfn (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.

avenruns

integer. Number of runs of stupidkaven (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.

multicore

logical. If TRUE, parallel computing is used through the function mclapply from package parallel; read warnings there if you intend to use this; it won't work on Windows.

cores

integer. Number of cores for parallelisation.

useallmethods

logical, to be passed on to cgrestandard. If FALSE, only random clustering results are used for standardisation. If TRUE, clustering results from all methods are used.

useallg

logical, to be passed on to cgrestandard. If TRUE, standardisation uses results from all numbers of clusters in G. If FALSE, standardisation of results for a specific number of clusters only uses results from that number of clusters.

...

further arguments to be passed on to cqcluster.stats through clustatsum (no effect in print.clusterbenchstats).

x

object of class "clusterbenchstats".
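The arguments clustermethod, methodnames, distmethod, ncinput and clustermethodpars are parallel, with one entry per clustering method. The sketch below sets them up for three runs, assuming that hclustCBI accepts the linkage through a method parameter (as in the Examples) and that "complete" is passed through to hclust. It only builds and checks the inputs and does not call clusterbenchstats.

```r
# Three runs: k-means, plus hclustCBI twice with different linkages.
clustermethod <- c("kmeansCBI", "hclustCBI", "hclustCBI")
methodnames   <- c("kmeans", "complete", "average")

# One parameter list per method. Entry 1 stays empty (all defaults),
# but the last entry must not be empty.
clustermethodpars <- list()
clustermethodpars[[2]] <- list(method = "complete")
clustermethodpars[[3]] <- list(method = "average")

# All three methods are applied to the data matrix, not to dist(data),
# and all of them take the number of clusters as input.
distmethod <- rep(FALSE, length(clustermethod))
ncinput    <- rep(TRUE, length(clustermethod))

# Check that the parallel arguments line up before starting a long run.
k <- length(clustermethod)
stopifnot(length(methodnames) == k,
          length(distmethod) == k,
          length(ncinput) == k,
          length(clustermethodpars) == k,
          length(clustermethodpars[[k]]) > 0)
```

Distinct methodnames are what keep the two hclustCBI runs apart in the printed output.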

Value

The output of clusterbenchstats is a big list of lists comprising the lists cm, stat, sim, qstat and sstat:

cm

output object of cluster.magazine, see there for details. Clusterings of all methods and numbers of clusters on the dataset data.

stat

object of class "valstat", see valstat.object for details. Unstandardised cluster validation statistics.

sim

output object of randomclustersim, see there. Validity indexes from random clusterings, used for standardisation of validation statistics on data.

qstat

object of class "valstat", see valstat.object for details. Cluster validation statistics standardised by random clusterings, output of cgrestandard based on percentages, i.e., with percentage=TRUE.

sstat

object of class "valstat", see valstat.object for details. Cluster validation statistics standardised by random clusterings, output of cgrestandard based on mean and standard deviation (called Z-score standardisation in Akhanli and Hennig (2020)), i.e., with percentage=FALSE.
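The difference between the two standardised outputs can be sketched in base R for a single validation index. The values below are simulated and hypothetical. The percentage variant is written here simply as the share of random clusterings with a smaller index value, which is an assumption; the exact formulas used by cgrestandard may differ, so see ?cgrestandard.

```r
set.seed(1)

# Hypothetical values of one validation index on 100 random clusterings.
random_values <- rnorm(100, mean = 0.3, sd = 0.05)

# Hypothetical value of the same index for a clustering of interest.
observed <- 0.42

# Z-score type standardisation (in the spirit of sstat): centre and
# scale by the mean and standard deviation of the random clusterings.
z_score <- (observed - mean(random_values)) / sd(random_values)

# Percentage type standardisation (in the spirit of qstat): share of
# random clusterings whose index value is below the observed one.
pct_below <- mean(random_values < observed)

print(c(z_score = z_score, pct_below = pct_below))
```

Both versions put different indexes on a common scale, which is what makes a weighted aggregate of several indexes meaningful.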

Note

This may require a lot of computing time and also memory for datasets that are not small, as most indexes require computation and storage of distances.
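To get a feeling for the memory side of this note, the following base R sketch estimates the storage needed for one set of pairwise distances, assuming they are held as double precision values in the lower triangle, as in a dist object. The helper dist_bytes is hypothetical, and the estimate ignores object overhead and any temporary copies made during the computations.

```r
# Bytes needed for the n * (n - 1) / 2 pairwise distances of n
# observations, at 8 bytes per double.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

n <- c(1000, 10000, 50000)
print(data.frame(n = n, gigabytes = dist_bytes(n) / 1e9))
```

Storage grows quadratically in the number of observations, so 10000 observations already need about 0.4 GB for a single distance object.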

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

See Also

valstat.object, cluster.magazine, kmeansCBI, cqcluster.stats, clustatsum, cgrestandard

Examples

  
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod <- c("kmeansCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
  clustermethodpars <- list()
  clustermethodpars[[2]] <- list()
  clustermethodpars[[2]]$method <- "average"
# Last element of clustermethodpars needs to have an entry!
  methodname <- c("kmeans","average")
  cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodnames=methodname,distmethod=rep(FALSE,2),
    clustermethodpars=clustermethodpars,nnruns=1,kmruns=1,fnruns=1,avenruns=1)
  print(cbs)
  print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.

# Now using bootstrap stability assessment as in Akhanli and Hennig (2020):
  bootclassif <- c("centroid","averagedist")
  cbsboot <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodnames=methodname,distmethod=rep(FALSE,2),
    clustermethodpars=clustermethodpars,
    useboot=TRUE,bootclassif=bootclassif,bootmethod="nselectboot",
    bootruns=2,nnruns=1,kmruns=1,fnruns=1,avenruns=1,useallg=TRUE)
  print(cbsboot)
## Not run: 
# Index A1 in Akhanli and Hennig (2020) (need these weights choices):
  print(cbsboot$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0))
# Index A2 in Akhanli and Hennig (2020) (need these weights choices):
  print(cbsboot$sstat,aggregate=TRUE,weights=c(0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0))

## End(Not run)

# Results from nselectboot:
  plot(cbsboot$stat,cbsboot$sim,statistic="boot")

fpc documentation built on Sept. 24, 2024, 9:07 a.m.