randomclustersim: Simulation of validity indexes based on random clusterings
In fpc: Flexible Procedures for Clustering

randomclustersim

R Documentation

Description

For a given dataset, this simulates random clusterings using `stupidkcentroids`, `stupidknn`, `stupidkfn`, and `stupidkaven`. It then computes and stores a set of cluster validity indexes for every clustering.
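To make the description concrete, the following is a minimal sketch of what a single simulation run amounts to conceptually: draw one random clustering and compute its validity indexes. It assumes that `stupidknn(d, k)` returns a clustering vector and that `clustatsum` forwards `nnk` to `cqcluster.stats`; it illustrates the idea only and is not necessarily how `randomclustersim` is implemented internally.

```r
library(fpc)

set.seed(20000)
face <- rFace(10, dMoNo = 2, dNoEy = 0, p = 2)
dface <- dist(face)

# One random clustering with 2 clusters (assumed to be an integer vector
# with one cluster label per observation).
cl <- stupidknn(dface, 2)

# Validity indexes for this single clustering; nnk = 2 mirrors the small
# value used in the Examples section for this tiny dataset.
stats <- clustatsum(dface, cl, nnk = 2)
stats$asw
```

`randomclustersim` repeats this kind of step for every number of clusters in `G` and for each of the four random clustering schemes, and collects the resulting indexes.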

Usage

  randomclustersim(datadist,datanp=NULL,npstats=FALSE,useboot=FALSE,
                      bootmethod="nselectboot",
                      bootruns=25, 
                      G,nnruns=100,kmruns=100,fnruns=100,avenruns=100,
                      nnk=4,dnnk=2,
                      pamcrit=TRUE, 
                      multicore=FALSE,cores=detectCores()-1,monitor=TRUE)

Arguments

`datadist`	distances on which the validation measures are based; `dist` object or distance matrix.
`datanp`	optional observations-by-variables data matrix, see `npstats`.
`npstats`	logical. If `TRUE`, `distrsimilarity` is called and the two statistics computed there are added to the output. These are based on `datanp` and require `datanp` to be specified.
`useboot`	logical. If `TRUE`, a stability index (either `nselectboot` or `prediction.strength`) is computed as well.
`bootmethod`	either `"nselectboot"` or `"prediction.strength"`; stability index to be used if `useboot=TRUE`.
`bootruns`	integer. Number of resampling runs. If `useboot=TRUE`, passed on as `B` to `nselectboot` or `M` to `prediction.strength`.
`G`	vector of integers. Numbers of clusters to consider.
`nnruns`	integer. Number of runs of `stupidknn`.
`kmruns`	integer. Number of runs of `stupidkcentroids`.
`fnruns`	integer. Number of runs of `stupidkfn`.
`avenruns`	integer. Number of runs of `stupidkaven`.
`nnk`	`nnk`-argument to be passed on to `cqcluster.stats`.
`dnnk`	`nnk`-argument to be passed on to `distrsimilarity`.
`pamcrit`	`pamcrit`-argument to be passed on to `cqcluster.stats`.
`multicore`	logical. If `TRUE`, parallel computing is used through the function `mclapply` from package `parallel`. Read the warnings there if you intend to use this. It does not work on Windows.
`cores`	integer. Number of cores for parallelisation.
`monitor`	logical. If `TRUE`, it will print some runtime information.
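Because `multicore=TRUE` does not work on Windows, a call can be guarded as in the sketch below. The variable names `use_mc` and `n_cores` are illustrative only, and the small run counts are taken from the Examples section to keep the run short.

```r
library(fpc)
library(parallel)

# Use parallel computing only where it is supported (not on Windows).
use_mc <- .Platform$OS.type != "windows"

# detectCores() can return NA; always keep at least one core.
nc <- detectCores()
n_cores <- if (is.na(nc)) 1L else max(1L, nc - 1L)

set.seed(20000)
face <- rFace(10, dMoNo = 2, dNoEy = 0, p = 2)
rmx <- randomclustersim(dist(face), datanp = face, npstats = TRUE, G = 2:3,
                        nnruns = 2, kmruns = 2, fnruns = 1, avenruns = 1,
                        nnk = 2,
                        multicore = use_mc, cores = n_cores, monitor = FALSE)
names(rmx)
```

With `mclapply`, `set.seed` alone may not make parallel runs reproducible; see the documentation of package `parallel` for details.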

Value

List with components

`nn`	list, indexed by number of clusters. Every entry is a data frame with `nnruns` observations, one for every simulation run of `stupidknn`. The variables of the data frame are `avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy`; if `pamcrit=TRUE`, also `pamc`; if `npstats=TRUE`, also `kdnorm, kdunif`. All of these are cluster validation indexes, documented as values of `clustatsum`.
`fn`	list, indexed by number of clusters. Every entry is a data frame with `fnruns` observations, one for every simulation run of `stupidkfn`. The variables of the data frame are `avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy`; if `pamcrit=TRUE`, also `pamc`; if `npstats=TRUE`, also `kdnorm, kdunif`. All of these are cluster validation indexes, documented as values of `clustatsum`.
`aven`	list, indexed by number of clusters. Every entry is a data frame with `avenruns` observations, one for every simulation run of `stupidkaven`. The variables of the data frame are `avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy`; if `pamcrit=TRUE`, also `pamc`; if `npstats=TRUE`, also `kdnorm, kdunif`. All of these are cluster validation indexes, documented as values of `clustatsum`.
`km`	list, indexed by number of clusters. Every entry is a data frame with `kmruns` observations, one for every simulation run of `stupidkcentroids`. The variables of the data frame are `avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy`; if `pamcrit=TRUE`, also `pamc`; if `npstats=TRUE`, also `kdnorm, kdunif`. All of these are cluster validation indexes, documented as values of `clustatsum`.
`nnruns`	number of runs of `stupidknn` carried out.
`fnruns`	number of runs of `stupidkfn` carried out.
`avenruns`	number of runs of `stupidkaven` carried out.
`kmruns`	number of runs of `stupidkcentroids` carried out.
`boot`	if `useboot=TRUE`, stability value; `stabk` for method `nselectboot`; `mean.pred` for method `prediction.strength`.
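The structure of the returned list can be explored as in the following sketch, which reuses the call from the Examples section. The names `nn3`, `schemes`, and `all3` are illustrative only. The code assumes, as documented above, that each entry is a data frame and that all four components share the same variables.

```r
library(fpc)

set.seed(20000)
face <- rFace(10, dMoNo = 2, dNoEy = 0, p = 2)
rmx <- randomclustersim(dist(face), datanp = face, npstats = TRUE, G = 2:3,
                        nnruns = 2, kmruns = 2, fnruns = 1, avenruns = 1,
                        nnk = 2, monitor = FALSE)

# Components are indexed by number of clusters, so [[3]] holds the
# results for the random clusterings with 3 clusters.
nn3 <- rmx$nn[[3]]
nrow(nn3)    # one row per stupidknn run
names(nn3)   # the validity indexes

# Stack the 3-cluster results of all four schemes, tagging each row
# with the scheme that produced it.
schemes <- c("nn", "fn", "aven", "km")
all3 <- do.call(rbind, lapply(schemes, function(s) {
  cbind(scheme = s, rmx[[s]][[3]])
}))
summary(all3$asw)
```

Stacking the schemes this way gives a single reference distribution per index and number of clusters, which is one possible basis for comparing the index values of a clustering of interest against random clusterings.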

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York, 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Examples

  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  rmx <- randomclustersim(dist(face),datanp=face,npstats=TRUE,G=2:3,
    nnruns=2,kmruns=2, fnruns=1,avenruns=1,nnk=2)
## Not run: 
  rmx$km # Produces slightly different but basically identical results on ATLAS

## End(Not run)
  rmx$aven
  rmx$fn
  rmx$nn

fpc documentation built on Sept. 24, 2024, 9:07 a.m.