In fsdaR: Robust Data Analysis Through Monitoring and Dynamic Visualization

tclustIC

R Documentation

Performs cluster analysis by calling `tclustfsda` for different numbers of groups `k` and restriction factors `c`

Description

Computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA) for different values of k (number of groups) and different values of c (restriction factor), for a prespecified level of trimming. The last two letters of the function name stand for 'Information Criterion'. To reduce the effect of random subsampling, for a given k the same subsets are used for each value of c.
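
As a rough illustration of the description above, the following base R sketch enumerates the (k, c) grid implied by the default `kk` and `cc` and picks the pair with the smallest criterion value. The matrix `ic` is a placeholder built from an arbitrary formula, not output of `tclustIC`; the layout (rows for k, columns for c) and the "smaller is better" rule are assumptions made for the example.

```r
# Default grids taken from the Usage section
kk <- 1:5
cc <- c(1, 2, 4, 8, 16, 32, 64, 128)

# Every (k, c) pair for which a criterion value would be computed
grid <- expand.grid(k = kk, c = cc)

# Placeholder criterion matrix (NOT real tclustIC output):
# rows correspond to kk, columns to cc (assumed layout).
ic <- outer(kk, cc, function(k, c) (k - 3)^2 + abs(log2(c) - 2))

# Locate the cell with the smallest value (assumed selection rule)
idx <- arrayInd(which.min(ic), dim(ic))
best_k <- kk[idx[1, 1]]
best_c <- cc[idx[1, 2]]
```

With the default grids there are 40 combinations, which is why reusing the same subsets across values of c for a given k keeps the comparison less noisy.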

Usage

tclustIC(
  x,
  kk = 1:5,
  cc = c(1, 2, 4, 8, 16, 32, 64, 128),
  alpha = 0,
  whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"),
  nsamp,
  refsteps = 15,
  reftol = 1e-14,
  equalweights = FALSE,
  msg = TRUE,
  nocheck = FALSE,
  plot = FALSE,
  startv1 = 1,
  restrtype = c("eigen", "deter"),
  UnitsSameGroup,
  numpool,
  cleanpool,
  trace = FALSE,
  ...
)

Arguments

`x`	An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
`kk`	An integer vector specifying the numbers of mixture components (clusters) for which the information criteria are to be calculated. By default `kk=1:5`.
`cc`	A vector specifying the values of the restriction factor which have to be considered. By default `cc=c(1, 2, 4, 8, 16, 32, 64, 128)`.
`alpha`	Global trimming level. A scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If `alpha=0` all observations are considered. By default `alpha=0`. In more detail, if `0 < alpha < 1` clustering is based on `h = fix(n * (1-alpha))` observations (`fix` denotes truncation toward zero), else if `alpha` is an integer greater than 1 clustering is based on `h = n - floor(alpha)` observations.
`whichIC`	A character value which specifies which information criteria must be computed for each `k` (number of groups) and each value of the restriction factor `c`. Possible values for `whichIC` are:
"MIXMIX": a mixture model is fitted and the mixture likelihood is used to compute the information criterion. This option corresponds to the use of the Bayesian Information Criterion (BIC). In the output just the matrix `MIXMIX` is given.
"MIXCLA": a mixture model is fitted but the classification likelihood is used to compute the information criterion. This option corresponds to the use of the Integrated Complete Likelihood (ICL). In the output just the matrix `MIXCLA` is given.
"CLACLA": everything is based on the classification likelihood. This information criterion will be called CLA. In the output just the matrix `CLACLA` is given.
"ALL": both classification and mixture likelihood are used. In this case all three information criteria CLA, ICL and BIC are computed. In the output all three matrices `MIXMIX`, `MIXCLA` and `CLACLA` are given.
`nsamp`	If a scalar, it contains the number of subsamples which will be extracted. If `nsamp = 0` all subsets will be extracted. Remark: if the number of all possible subsets is smaller than 300 the default is to extract all subsets, otherwise just 300. If `nsamp` is a matrix, its rows contain the indexes of the subsets which have to be extracted. In this case `nsamp` can be conveniently generated by the function `subsets()`. `nsamp` can have `k` columns or `k * (p + 1)` columns. If `nsamp` has `k` columns, the `k` initial centroids in iteration `i` are given by `X[nsamp[i, ], ]` and the covariance matrices are equal to the identity. If `nsamp` has `k * (p + 1)` columns, the initial centroids and covariance matrices in iteration `i` are computed as follows, with `X1 <- X[nsamp[i, ], ]`:
`colMeans(X1[1:(p + 1), ])` contains the initial centroid for group 1;
`cov(X1[1:(p + 1), ])` contains the initial covariance matrix for group 1;
`colMeans(X1[(p + 2):(2*p + 2), ])` contains the initial centroid for group 2;
`cov(X1[(p + 2):(2*p + 2), ])` contains the initial covariance matrix for group 2;
...
`colMeans(X1[((k-1)*(p+1) + 1):(k*(p+1)), ])` contains the initial centroid for group k;
`cov(X1[((k-1)*(p+1) + 1):(k*(p+1)), ])` contains the initial covariance matrix for group k.
REMARK: if `nsamp` is not a scalar, the option `startv1` given below is ignored. More precisely, if `nsamp` has `k` columns `startv1 = 0` is used, else if `nsamp` has `k*(p+1)` columns `startv1 = 1` is used.
`refsteps`	Number of refining iterations in each subsample. Default is `refsteps=15`. `refsteps = 0` means "raw-subsampling" without iterations.
`reftol`	Tolerance of the refining steps. The default value is `reftol=1e-14`.
`equalweights`	A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If `equalweights=TRUE` we are (ideally) assuming equally sized groups, else if `equalweights=FALSE` (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.
`msg`	Controls whether messages are displayed on the screen. If `msg=TRUE` (default) messages are displayed on the screen. If `msg=2`, detailed messages are displayed, for example the information at iteration level.
`nocheck`	Whether to skip the check of the input arguments. If `nocheck=TRUE` no check is performed on matrix `x`. The default is `nocheck=FALSE`.
`plot`	If `plot=TRUE`, plots of the BIC (MIXMIX), ICL (MIXCLA) and CLA (CLACLA) curves are shown on the screen. Which plots are shown depends on the input option `whichIC`.
`startv1`	How to initialize centroids and covariance matrices. Scalar. If `startv1=1` then initial centroids and covariance matrices are based on `(p+1)` randomly chosen observations, else each centroid is initialized taking a random row of the input data matrix and covariance matrices are initialized with identity matrices. The default value is `startv1=1`. Remark 1: in order to start from a point which is in the required parameter space, eigenvalue restrictions are immediately applied. Remark 2: option `startv1` is used only if `nsamp` is a scalar (for more details see the help associated with `nsamp`).
`restrtype`	Type of restriction to be applied on the cluster scatter matrices. Valid values are `"eigen"` (default) or `"deter"`. `"eigen"` implies restriction on the eigenvalues while `"deter"` implies restriction on the determinants.
`UnitsSameGroup`	List of the units which must (whenever possible) have a particular label. For example `UnitsSameGroup=c(20, 26)` means that the group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2 (unless it is found that unit 26 already belongs to group 1). In general, the group which contains unit `UnitsSameGroup[r]`, where `r = 2, ..., length(kk)-1`, is labelled with number `r` (unless it is found that unit `UnitsSameGroup[r]` has already been assigned to groups `1, 2, ..., r-1`).
`numpool`	The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer.
`cleanpool`	Logical, indicating whether the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so as to save the time required to open a new pool.
`trace`	Whether to print intermediate results. Default is `trace=FALSE`.
`...`	Potential further arguments passed to lower level functions.
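
The rules given for `alpha` and for a matrix-valued `nsamp` can be sketched in base R as follows. This is only an illustration under stated assumptions: `trim_count` and `init_from_subset` are hypothetical helpers, not functions of fsdaR; `trunc()` stands in for `fix`; and each group is assumed to take a consecutive block of `p + 1` rows of the selected subset.

```r
# Hypothetical helper: number of observations kept after trimming,
# following the rule stated for `alpha`.
trim_count <- function(n, alpha) {
  if (alpha > 0 && alpha < 1) {
    trunc(n * (1 - alpha))  # proportion: trunc() plays the role of fix()
  } else {
    n - floor(alpha)        # count to trim; alpha = 0 keeps all n
  }
}

# Hypothetical helper: initial centroids and covariance matrices from
# one row of an `nsamp` matrix with k * (p + 1) columns.
# Assumes group j uses rows ((j-1)*(p+1)+1):(j*(p+1)) of the subset.
init_from_subset <- function(X, idx, k) {
  p <- ncol(X)
  stopifnot(length(idx) == k * (p + 1))
  X1 <- X[idx, , drop = FALSE]
  lapply(seq_len(k), function(j) {
    rows <- ((j - 1) * (p + 1) + 1):(j * (p + 1))
    block <- X1[rows, , drop = FALSE]
    list(center = colMeans(block), cov = cov(block))
  })
}

# Toy data (n = 20, p = 2), for illustration only
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)
idx <- sample(nrow(X), 2 * (2 + 1))  # one subset for k = 2 groups
init <- init_from_subset(X, idx, k = 2)
```

Here `init[[j]]$center` and `init[[j]]$cov` hold the starting values for group `j`; in the actual routine the eigenvalue restrictions would still have to be applied to these starting values.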

Value

An S3 object of class `tclustic.object`.

Author(s)

FSDA team, valentin.todorov@chello.at

References

Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.

Examples

 ## Not run: 
 data(geyser2)
 (out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1))
 summary(out)
 
## End(Not run)

 ## Not run: 
 data(flea)
 Y <- as.matrix(flea[, 1:(ncol(flea)-1)])    # select only the numeric variables
 rownames(Y) <- 1:nrow(Y)
 head(Y)

 (out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100, numpool=1))
 summary(out)
 
## End(Not run)

fsdaR documentation built on May 29, 2024, 5:35 a.m.