otrimlesimg: Adequacy approach for number of clusters for OTRIMLE
In otrimle: Robust Model-Based Clustering


otrimlesimg computes Optimally Tuned Robust Improper Maximum Likelihood Clustering (OTRIMLE; see otrimle) for a range of values of the number of clusters, both on the original data and on artificial datasets simulated from the model parameters estimated on the original data. The summary methods present and evaluate the results so that the smallest adequate number of clusters can be found. This is the smallest number for which the value of the density-based cluster quality statistic Q on the original data is compatible with its distribution on the artificial datasets with the same number of clusters. See Hennig and Coretto (2021) for details.

otrimlesimg(dataset, G=1:6, multicore=TRUE,
ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20,
sim.est.logicd=FALSE, 
monitor=1)

## S3 method for class 'otrimlesimgdens'
summary(object, noisepenalty=0.05, sdcutoff=2, ...)

## S3 method for class 'summary.otrimlesimgdens'
print(x, ...)

## S3 method for class 'summary.otrimlesimgdens'
plot(x, plot="criterion", penx=NULL,
peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)

`dataset`	something that can be coerced into an observations times variables matrix. The dataset.
`G`	vector of integers (normally starting from 1). Numbers of clusters to be considered.
`multicore`	logical. If `TRUE`, parallel computing is used through the function `mclapply` from package `parallel`; read warnings there if you intend to use this; it won't work on Windows.
`ncores`	integer. Number of cores for parallelisation.
`erc`	A number greater than or equal to one specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. See `otrimle`.
`beta0`	A non-negative constant, penalty term for noise, to be passed as `beta` to `otrimle`, see documentation there.
`simruns`	integer. Number of replicate artificial datasets drawn from each model.
`sim.est.logicd`	logical. If `TRUE`, the logarithm of the improper constant density `logicd`, see `otrimle`, is re-estimated when running `otrimle` on the artificial datasets. Otherwise the value estimated on the original data is taken as fixed. `TRUE` requires much longer computation time, but can be seen as generating more realistic variation.
`monitor`	0 or 1. If 1, progress messages are printed on screen.
`noisepenalty`	number between 0 and 1. `p_0` in Hennig and Coretto (2021); normally small. The method prefers treating a proportion `<=noisepenalty` of the points as outliers over adding a cluster.
`sdcutoff`	numerical. `c` in formula (7) in Hennig and Coretto (2021). A clustering is treated as adequate if its value of the density-based cluster quality measure Q calibrated (i.e., mean/sd-standardised) by the values on the artificial datasets generated from the estimated model is `<=sdcutoff`.
`plot`	`"criterion"` or `"noise"`, see details.
`penx`	`FALSE`, `NULL`, or numerical. x-coordinate at which the simplicity ordering of the clusterings is placed (as text in the plot). If `FALSE`, the ordering is not added to the plot. If `NULL`, a default guess is made for a good position (which doesn't always work well).
`peny`	`NULL`, or numerical. y-coordinate at which the simplicity ordering of the clusterings is placed (as text in the plot). If `NULL`, a default guess is made for a good position (which doesn't always work well).
`pencex`	numeric. Magnification factor (parameter `cex` to be passed on to `legend`) for simplicity ordering, see parameter `penx`.
`cutoff`	logical. If `TRUE`, the `"criterion"`-plot shows the cutoff value below which numbers of clusters are adequate, see details.
`ylim`	vector of two numericals, range of the y-axis to be passed on to `plot`. If `NULL`, the range is chosen automatically (but can be different from the `plot` default).
`object`	an object of class `'otrimlesimgdens'` obtained from calling `otrimlesimg`
`x`	an object of class `'summary.otrimlesimgdens'` obtained from calling `summary` function over an object of class `'otrimlesimgdens'` obtained from calling `otrimlesimg`.
`...`	optional parameters to be passed on to `plot`.
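
The note on `multicore` above says that `mclapply` does not work on Windows. The following is a minimal sketch, not part of the package, of how portable values for `multicore` and `ncores` could be derived before calling `otrimlesimg`. The names `use_multicore`, `physical_cores` and `n_cores` are hypothetical.

```r
library(parallel)  # ships with base R and provides detectCores()

# mclapply relies on forking, which is not available on Windows
use_multicore <- .Platform$OS.type != "windows"

# detectCores() may return NA; fall back to one core in that case,
# and never go below one core on single-core machines
physical_cores <- detectCores(logical = FALSE)
n_cores <- if (is.na(physical_cores)) 1L else max(1L, physical_cores - 1L)

# These values could then be passed on, for example as
# otrimlesimg(dataset, multicore = use_multicore, ncores = n_cores)
```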

The method is fully described in Hennig and Coretto (2021). Choosing an optimal number of clusters requires two tuning constants: the smallest percentage of additional noise that the user is willing to trade in for adding another cluster (p_0 in the paper, noisepenalty here), and the critical value for adequacy of the standardised density-based quality measure Q (c in the paper, sdcutoff here). Both are provided to the summary function, which is required to choose the best (simplest adequate) number of clusters.

The plot function plot.summary.otrimlesimgdens can produce two plots. If plot="criterion", the standardised density-based cluster quality measure Q is plotted against the number of clusters. The values for the simulated artificial datasets are shown as points, the values for the original dataset as a line. If cutoff=TRUE, the critical values (see above) are added as red crosses; a number of clusters is adequate if the value of the original data is below the critical value, i.e., Q is not significantly larger than for the artificial datasets generated from the fitted model. Using penx, the ordered numbers of clusters from the simplest to the least simple can also be indicated in the plot, where simplicity is defined as the number of clusters plus the estimated noise proportion divided by noisepenalty, see above. The chosen number of clusters is the simplest adequate one, meaning that a low number of clusters and a low noise proportion are preferred.

If plot="noise", the noise proportion (black) and the simplicity (red) are plotted against the number of clusters.
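
The selection rule described above can be illustrated with a small worked example. This is a sketch on made-up numbers, not the package's internal code: the values of `standens` and `npr` are hypothetical, and returning `NA` when no number of clusters is adequate is an assumption made here for illustration.

```r
# Hypothetical values for G = 1:4 (illustration only, not real output)
G            <- 1:4
standens     <- c(5.1, 2.6, 1.2, 0.4)      # standardised Q on the original data
npr          <- c(0.20, 0.10, 0.03, 0.02)  # estimated noise proportions
noisepenalty <- 0.05
sdcutoff     <- 2

# A number of clusters is adequate if its standardised Q is <= sdcutoff
adequate <- standens <= sdcutoff

# Simplicity: number of clusters plus noise proportion / noisepenalty
# (smaller values mean simpler clusterings)
simplicity <- G + npr / noisepenalty

# Ordering from the simplest to the least simple number of clusters
penorder <- G[order(simplicity)]

# Simplest adequate number of clusters (NA if none is adequate; assumption)
bestG <- if (any(adequate)) {
  G[adequate][which.min(simplicity[adequate])]
} else {
  NA_integer_
}
```

With these numbers, G = 3 and G = 4 are adequate, and G = 3 has the smaller simplicity value (3.6 against 4.4), so it is the one selected.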

otrimlesimg returns a list of type "otrimlesimgdens" containing the components result, simresult, simruns.

`result`	output object of `otrimleg` (list of results on original data) run with the parameters provided to `otrimlesimg`.
`simresult`	list of length `simruns` of output objects of `otrimleg` for all the simulated artificial datasets.
`simruns`	input parameter `simruns`.

summary.otrimlesimgdens returns a list of type "summary.otrimlesimgdens" with components G, simeval, ssimruns, npr, nprdiff, logicd, denscrit, peng, penorder, bestG, sdcutoff, bestresult, cluster.

`G`	`otrimlesimg` input parameter `G` (numbers of clusters).
`simeval`	list with components `denscrit, meandens, sddens, standens, errors`, defined below.
`ssimruns`	`otrimlesimg` input parameter `simruns`.
`npr`	vector of estimated noise proportions on the original data for all numbers of clusters, `exproportion[1]` from `otrimle`.
`nprdiff`	vector for all numbers of clusters of differences between estimated smallest cluster proportion and noise proportion on the original data.
`logicd`	vector of logs of improper constant density values on the original data for all numbers of clusters.
`denscrit`	vector over all numbers of clusters of density-based cluster quality statistics Q on original data as provided by the `measure`-component of `kerndensmeasure`.
`peng`	vector of simplicity values (see Details) over all numbers of clusters.
`penorder`	simplicity order of number of clusters.
`bestG`	best (i.e., most simple adequate) number of clusters.
`sdcutoff`	input parameter `sdcutoff`.
`result`	output of `otrimle` for the best number of clusters `bestG`.
`cluster`	clustering vector for the best number of clusters `bestG`. `0` corresponds to noise/outliers.
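
Because `0` in `cluster` codes noise rather than a regular cluster, it helps to label it explicitly when tabulating. A minimal sketch follows; the vector below is made up and only stands in for the `cluster` component of a real summary object.

```r
# Hypothetical clustering vector (0 = noise/outliers), illustration only
cluster <- c(1, 1, 2, 0, 2, 2, 1, 0, 2, 1)

# Cluster sizes, with the noise component labelled explicitly
sizes <- table(factor(cluster, levels = 0:2, labels = c("noise", "1", "2")))

# Share of observations classified as noise
noise_share <- mean(cluster == 0)
```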

Components of summary.otrimlesimgdens output component simeval:

`denscritmatrix`	maximum number of clusters times `simruns` matrix of `denscrit`-vectors for all clusterings on simulated data.
`meandens`	vector over numbers of clusters of robust estimator of the mean of `denscrit` over simulated datasets, computed by `scaleTau2`.
`sddens`	vector over numbers of clusters of robust estimator of the standard deviation of `denscrit` over simulated datasets, computed by `scaleTau2`.
`standens`	vector over numbers of clusters of `denscrit` of the original data standardised by `meandens` and `sddens`.
`errors`	vector over numbers of clusters of numbers of times that `otrimle` led to an error.

plot.summary.otrimlesimgdens will return the output of par() before anything was changed by the plot function.
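
The calibration behind `standens` can be sketched on a toy matrix. Note the assumptions: the package computes `meandens` and `sddens` with `scaleTau2`, whereas this sketch swaps in base R's `median` and `mad` as simpler robust stand-ins so that it runs without extra packages. All numbers are made up.

```r
# Toy denscrit values: 3 numbers of clusters (rows) x 5 simulated datasets
offsets <- c(-0.2, -0.1, 0, 0.1, 0.2)
denscritmatrix <- rbind(0.5 + offsets, 0.4 + offsets, 0.3 + offsets)

# Hypothetical Q values on the original data for the same numbers of clusters
denscrit <- c(0.9, 0.6, 0.35)

# Robust location and scale per number of clusters
# (median/mad used here instead of scaleTau2; see the note above)
meandens <- apply(denscritmatrix, 1, median)
sddens   <- apply(denscritmatrix, 1, mad)

# Q on the original data, standardised by the simulated distribution
standens <- (denscrit - meandens) / sddens
```

Under the default `sdcutoff=2`, only the first row of this toy example would exceed the cutoff.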

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi: 10.1080/01621459.2015.1100996

Coretto, P. and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html

Hennig, C. and P. Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.

otrimle, rimle, otrimleg, kerndensmeasure

## otrimlesimg is computer intensive, so only a small data subset
## is used for speed.
library(otrimle)
data(banknote)
selectdata <- c(1:30,101:110,117:136,160:161)
set.seed(555566)
x <- banknote[selectdata,5:7]
   
## simruns=2 chosen for speed. This is not recommended in practice. 
obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0)
sobanknote <- summary(obanknote)
print(sobanknote)
plot(sobanknote,plot="criterion",penx=1.4)
plot(sobanknote,plot="noise",penx=1.4)
plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])