otrimlesimg computes Optimally Tuned Robust Improper Maximum Likelihood Clustering (OTRIMLE), see otrimle, for a range of values of the number of clusters, and also for artificial datasets simulated from the model parameters estimated on the original data. The summary methods present and evaluate the results so that a smallest adequate number of clusters can be found as the smallest one for which the value of the density-based cluster quality statistic Q on the original data is compatible with its distribution on the artificial datasets with the same number of clusters; see Hennig and Coretto (2021) for details.
otrimlesimg(dataset, G=1:6, multicore=TRUE,
            ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20,
            sim.est.logicd=FALSE, monitor=1)

## S3 method for class 'otrimlesimgdens'
summary(object, noisepenalty=0.05, sdcutoff=2, ...)

## S3 method for class 'summary.otrimlesimgdens'
print(x, ...)

## S3 method for class 'summary.otrimlesimgdens'
plot(x, plot="criterion", penx=NULL,
     peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)
dataset |
something that can be coerced into an observations times variables matrix. The dataset. |
G |
vector of integers (normally starting from 1). Numbers of clusters to be considered. |
multicore |
logical. If |
ncores |
integer. Number of cores for parallelisation. |
erc |
A number greater than or equal to one specifying the maximum
allowed ratio between within-cluster covariance matrix
eigenvalues. See |
beta0 |
A non-negative constant, penalty term for noise, to be
passed as |
simruns |
integer. Number of replicate artificial datasets drawn from each model. |
sim.est.logicd |
logical. If |
monitor |
0 or 1. If 1, progress messages are printed on screen. |
noisepenalty |
number between 0 and 1. |
sdcutoff |
numerical. |
plot |
|
penx |
|
peny |
|
pencex |
numeric. Magnification factor (parameter |
cutoff |
logical. If |
ylim |
vector of two numericals, range of the y-axis to be passed
on to |
object |
an object of class |
x |
an object of class |
... |
optional parameters to be passed on to |
The method is fully described in Hennig and Coretto (2021). Two tuning constants are required for choosing an optimal number of clusters: the smallest percentage of additional noise that the user is willing to trade in for adding another cluster (p_0 in the paper, noisepenalty here) and the critical value (c in the paper, sdcutoff here) for adequacy of the standardised density-based quality measure Q. Both are provided to the summary function, which is required to choose the best (simplest adequate) number of clusters.
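In symbols, the adequacy check can be sketched as below. The notation is introduced here for illustration only: Q_G is the value of Q on the original data with G clusters, and \hat{m}_G and \hat{s}_G stand for robust estimates of the mean and standard deviation of Q over the artificial datasets with G clusters. The exact form of the standardisation is an assumption suggested by the names of the output components meandens, sddens and standens; the precise definition is in Hennig and Coretto (2021).

```latex
% Assumed standardisation of Q for G clusters (illustrative notation)
Q^{*}_{G} = \frac{Q_{G} - \hat{m}_{G}}{\hat{s}_{G}},
\qquad
G \text{ is adequate if } Q^{*}_{G} < c ,
\quad c = \texttt{sdcutoff}.
```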
The plot function plot.summary.otrimlesimgdens can produce two plots. If plot="criterion", the standardised density-based cluster quality measure Q is plotted against the number of clusters. The values for the simulated artificial datasets are shown as points, the values for the original dataset as a line. If cutoff=TRUE, the critical values (see above) are added as red crosses; a number of clusters is adequate if the value for the original data is below the critical value, i.e., Q is not significantly larger than for the artificial datasets generated from the fitted model. Using penx, the ordered numbers of clusters from the simplest to the least simple can also be indicated in the plot, where simplicity is defined as the number of clusters plus the estimated noise proportion divided by noisepenalty, see above. The chosen number of clusters is the simplest adequate one, meaning that a low number of clusters and a low noise proportion are preferred.
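The selection rule described above can be illustrated with a few lines of base R. This is only a sketch of the rule as stated in the text, not the package's internal code: the input numbers are made up, and the variable names merely mirror the output components npr, peng, penorder and bestG documented under Value.

```r
## Hypothetical inputs for G = 1,...,4 (made-up numbers, not package output)
G            <- 1:4
npr          <- c(0.20, 0.10, 0.03, 0.02)  # estimated noise proportions
standens     <- c(5.1, 1.5, 2.6, 0.4)      # standardised Q on the original data
noisepenalty <- 0.05
sdcutoff     <- 2

## Simplicity: number of clusters plus noise proportion divided by
## noisepenalty; smaller values mean a simpler solution
peng <- G + npr / noisepenalty

## Numbers of clusters ordered from the simplest to the least simple
penorder <- G[order(peng)]

## A number of clusters is adequate if its standardised Q is below the cutoff
adequate <- standens < sdcutoff

## Simplest adequate number of clusters (NA if none is adequate)
bestG <- penorder[adequate[order(peng)]][1]
```

With these made-up numbers G=3 is the simplest solution but is not adequate, so the rule picks G=2, the next simplest one that passes the cutoff.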
If plot="noise", the noise proportion (black) and the simplicity (red) are plotted against the number of clusters.
otrimlesimg returns a list of type "otrimlesimgdens" containing the components result, simresult, simruns.
result |
output object of |
simresult |
list of length |
simruns |
input parameter |
summary.otrimlesimgdens returns a list of type "summary.otrimlesimgdens" with components G, simeval, ssimruns, npr, nprdiff, logicd, denscrit, peng, penorder, bestG, sdcutoff, bestresult, cluster.
G |
|
simeval |
list with components |
ssimruns |
|
npr |
vector of estimated noise proportions on the original data
for all numbers of clusters, |
nprdiff |
vector for all numbers of clusters of differences between estimated smallest cluster proportion and noise proportion on the original data. |
logicd |
vector of logs of improper constant density values on the original data for all numbers of clusters. |
denscrit |
vector over all numbers of clusters of density-based
cluster quality statistics Q
on original data as provided by the |
peng |
vector of simplicity values (see Details) over all numbers of clusters. |
penorder |
simplicity order of number of clusters. |
bestG |
best (i.e., most simple adequate) number of clusters. |
sdcutoff |
input parameter |
result |
output of |
cluster |
clustering vector for the best number of
clusters |
Components of the summary.otrimlesimgdens output component simeval:
denscritmatrix |
maximum number of clusters times |
meandens |
vector over numbers of clusters of robust estimator of
the mean of |
sddens |
vector over numbers of clusters of robust estimator of
the standard deviation of |
standens |
vector over numbers of clusters of |
errors |
vector over numbers of clusters of numbers of times that
|
Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi: 10.1080/01621459.2015.1100996
Coretto, P. and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Hennig, C. and P. Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
otrimle, rimle, otrimleg, kerndensmeasure
## Load the package that provides otrimlesimg and the banknote data.
library(otrimle)
## otrimlesimg is computer intensive, so only a small data subset
## is used for speed.
data(banknote)
selectdata <- c(1:30,101:110,117:136,160:161)
set.seed(555566)
x <- banknote[selectdata,5:7]
## simruns=2 chosen for speed. This is not recommended in practice.
obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0)
sobanknote <- summary(obanknote)
print(sobanknote)
plot(sobanknote,plot="criterion",penx=1.4)
plot(sobanknote,plot="noise",penx=1.4)
## Scatterplot of the data; cluster 0 (noise) is shown as "N".
plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])