plot.valstat: Simulation-standardised plot and print of cluster validation...
In fpc: Flexible Procedures for Clustering

plot.valstat

R Documentation

Simulation-standardised plot and print of cluster validation statistics

Description

Visualisation and print function for cluster validation output compared to results on simulated random clusterings. The print method can also be used to compute and print an aggregated cluster validation index.

Unlike for many other plot methods, the additional arguments of plot.valstat are essential. print.valstat should make good sense with the defaults, but for computing the aggregate index need to be set.

Usage

## S3 method for class 'valstat'
plot(x,simobject=NULL,statistic="sindex",
                            xlim=NULL,ylim=c(0,1),
                            nmethods=length(x)-5,
                            col=1:nmethods,cex=1,pch=c("c","f","a","n"),
                            simcol=rep(grey(0.7),4),
                         shift=c(-0.1,-1/3,1/3,0.1),include.othernc=NULL,...)


## S3 method for class 'valstat'
print(x,statistics=x$statistics,
                          nmethods=length(x)-5,aggregate=FALSE,
                          weights=NULL,digits=2,
                          include.othernc=NULL,...)

Arguments

`x`	object of class `"valstat"`, such as sublists `stat, qstat, sstat` of `clusterbenchstats`-output.
`simobject`	list of simulation results as produced by `randomclustersim` and documented there; typically sublist `sim` of `clusterbenchstats`-output.
`statistic`	one of `"avewithin","mnnd","variation", "diameter","gap","sindex","minsep","asw","dindex","denscut", "highdgap","pg","withinss","entropy","pamc","kdnorm","kdunif","dmode"`; validation statistic to be plotted.
`xlim`	passed on to `plot`. Default is the range of all involved numbers of clusters, minimum minus 0.5 to maximum plus 0.5.
`ylim`	passed on to `plot`.
`nmethods`	integer. Number of clustering methods to involve (these are those from number 1 to `nmethods` specified in `x$name`).
`col`	colours used for the different clustering methods.
`cex`	passed on to `plot`.
`pch`	vector of symbols for random clustering results from `stupidkcentroids`, `stupidkfn`, `stupidkaven`, `stupidknn`. To be passed on to `plot`.
`simcol`	vector of colours used for random clustering results in order `stupidkcentroids`, `stupidkfn`, `stupidkaven`, `stupidknn`.
`shift`	numeric vector. Indicates the amount to which the results from `stupidkcentroids`, `stupidkfn`, `stupidkaven`, `stupidknn` are plotted to the right of their respective number of clusters (negative numbers plot to the left).
`include.othernc`	this indicates whether methods should be included that estimated their number of clusters themselves and gave a result outside the standard range as given by `x$minG` and `x$maxG`. If not `NULL`, this is a list of integer vectors of length 2. The first number is the number of the clustering method (the order is determined by argument `x$name`), the second number is the number of clusters for those methods that estimate the number of clusters themselves and estimated a number outside the standard range. Normally what will be used here, if not `NULL`, is the output parameter `cm$othernc` of `clusterbenchstats`, see also `cluster.magazine`.
`statistics`	vector of character strings specifying the validation statistics that will be included in the output (unless you want to restrict the output for some reason, the default should be fine.
`aggregate`	logical. If `TRUE`, an aggegate validation statistic will be computed as the weighted mean of the involved statistic. This requires `weights` to be set. In order for this to make sense, values of the validation statistics should be comparable, which is achieved by standardisation in `clusterbenchstats`. Accordingly, `x` should be the `qstat` or `sstat`-component of the `clusterbenchstats`-output rather than the `stat`-component.
`weights`	vector of numericals. Weights for computation of the aggregate statistic in case that `aggregate=TRUE`. The order of clustering methods corresponding to the weight vector is given by `x$name`.
`digits`	minimal number of significant digits, passed on to `print.table`.
`...`	no effect.

Details

Whereas print.valstat, at least with aggregate=TRUE makes more sense for the qstat or sstat-component of the clusterbenchstats-output rather than the stat-component, plot.valstat should be run with the stat-component if simobject is specified, because the simulated cluster validity statistics are unstandardised and need to be compared with unstandardised values on the dataset of interest.

print.valstat will print all values for all validation indexes and the aggregated index (in case of aggregate=TRUE and set weights will be printed last.

Value

print.valstats returns the results table as invisible object.

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Examples

  
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod=c("kmeansCBI","hclustCBI","hclustCBI")
  clustermethodpars <- list()
  clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
  clustermethodpars[[2]]$method <- "ward.D2"
  clustermethodpars[[3]]$method <- "single"
  methodname <- c("kmeans","ward","single")
  cbs <-  clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodname=methodname,distmethod=rep(FALSE,3),
    clustermethodpars=clustermethodpars,nnruns=2,kmruns=2,fnruns=2,avenruns=2)
  plot(cbs$stat,cbs$sim)
  plot(cbs$stat,cbs$sim,statistic="dindex")
  plot(cbs$stat,cbs$sim,statistic="avewithin")
  pcbs <- print(cbs$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0))
# Some of the values are "NaN" because due to the low number of runs of
# the stupid clustering methods there is no variation. If this happens
# in a real application, nnruns etc. should be chosen higher than 2.
# Also useallg=TRUE in clusterbenchstats may help.
#
# Finding the best aggregated value:
  mpcbs <- as.matrix(pcbs[[17]][,-1])
  which(mpcbs==max(mpcbs),arr.ind=TRUE)
# row=1 refers to the first clustering method kmeansCBI,
# col=2 refers to the second number of clusters, which is 3 in g=2:3.

fpc documentation built on Sept. 24, 2024, 9:07 a.m.