# stpcm: Skew-t Parsimonious Clustering Models (in mixture: Mixture Models for Clustering and Classification)

## Description

Carries out model-based clustering or classification using some or all of the 14 parsimonious Skew-t clustering models (STPCM).

## Usage

```r
stpcm(data=NULL, G=1:3, mnames=NULL, start=2, label=NULL,
      veo=FALSE, da=c(1.0), nmax=1000, atol=1e-8, mtol=1e-8,
      mmax=10, burn=5, pprogress=FALSE, pwarning=FALSE,
      stochastic = FALSE)
```

## Arguments

| Argument | Description |
| --- | --- |
| `data` | A matrix or data frame such that rows correspond to observations and columns correspond to variables. Note that this function currently only works with multivariate data (p > 1). |
| `G` | A sequence of integers giving the number of components to be used. |
| `mnames` | The models (i.e., covariance structures) to be used. If `NULL` then all 14 are fitted. |
| `start` | If `0` then the random soft function is used for initialization. If `1` then the random hard function is used for initialization. If `2` then the kmeans function is used for initialization. If a matrix (`is.matrix`) then it is used as the initialization matrix, as long as it has non-negative elements. Note: only models whose number of components matches the number of columns of this matrix will be fit. |
| `label` | If `NULL` then the data has no known groups. If an integer vector (`is.integer`) then some of the observations have known groups. If `label[i]=k` then observation `i` belongs to group `k`. If `label[i]=0` then observation `i` has no known group. See Examples. |
| `veo` | Stands for "Variables exceed observations". If `TRUE`, the model is still fitted when the number of variables in the model exceeds the number of observations. |
| `da` | Stands for Deterministic Annealing. A vector of doubles. |
| `nmax` | The maximum number of iterations each EM algorithm is allowed to use. |
| `atol` | A number specifying the epsilon value for the convergence criteria used in the EM algorithms. For each algorithm, the criterion is based on the difference between the log-likelihood at an iteration and an asymptotic estimate of the log-likelihood at that iteration. This asymptotic estimate is based on the Aitken acceleration and details are given in the References. |
| `mtol` | A number specifying the epsilon value for the convergence criteria used in the M-step in the EM algorithms. |
| `mmax` | The maximum number of iterations each M-step is allowed in the GEM algorithms. |
| `burn` | The burn-in period for imputing data. (Missing observations are removed and a model is estimated separately before placing an imputation step within the EM.) |
| `pprogress` | If `TRUE`, print the progress of the function. |
| `pwarning` | If `TRUE`, print the warnings. |
| `stochastic` | If `TRUE`, run the stochastic E-step variant. |
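
The stopping rule described for `atol` can be sketched in base R. This is only an illustration of a common Aitken-style criterion, assuming the asymptotic estimate is compared with the current log-likelihood; the exact expression used inside `stpcm` is not given on this page, and the helper name `aitken_converged` is hypothetical.

```r
# Hypothetical helper (not part of the mixture package): Aitken-style
# stopping rule applied to a trace of log-likelihood values.
# ll   : numeric vector of log-likelihoods, one per EM iteration
# atol : tolerance, playing the role of the `atol` argument
aitken_converged <- function(ll, atol = 1e-8) {
  k <- length(ll)
  if (k < 3) return(FALSE)                 # three values are needed for the ratio
  d1 <- ll[k] - ll[k - 1]                  # most recent increase
  d0 <- ll[k - 1] - ll[k - 2]              # previous increase
  if (d1 == 0) return(TRUE)                # log-likelihood has stopped changing
  if (d0 == 0 || d1 == d0) return(FALSE)   # ratio undefined or equal to 1
  a <- d1 / d0                             # Aitken acceleration
  l_inf <- ll[k - 1] + d1 / (1 - a)        # asymptotic log-likelihood estimate
  abs(l_inf - ll[k]) < atol
}

# Synthetic trace that approaches -100 geometrically (illustrative values only)
ll <- -100 - 50 * 0.5^(1:60)
aitken_converged(ll[1:3])   # early iterations: not converged
aitken_converged(ll)        # late iterations: converged
```

A smaller `atol` requires the log-likelihood to be closer to its estimated limit before the loop stops, so it trades run time for precision.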

## Details

The `data` are either clustered or classified using Skew-t mixture models with some or all of the 14 parsimonious covariance structures described in Celeux & Govaert (1995). The algorithms given by Celeux & Govaert (1995) are used for 12 of the 14 models; the "EVE" and "VVE" models use the algorithms given in Browne & McNicholas (2014). Starting values are very important to the successful operation of these algorithms, so care must be taken in the interpretation of results.
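
The three-letter model names such as "VVV" and "EVE" refer to constraints on the eigen-decomposition of each component covariance matrix, Sigma = lambda * D A D', where lambda is the volume, A the shape and D the orientation; each letter states whether that term is equal across components (E), variable (V) or fixed to the identity (I). The base-R sketch below only illustrates the decomposition itself for one matrix. The helper name `decompose_cov` and the example matrix are made up for illustration and are not part of the package.

```r
# Hypothetical helper: split a covariance matrix into volume, shape and
# orientation so that Sigma = lambda * D %*% A %*% t(D) with det(A) = 1.
decompose_cov <- function(Sigma) {
  e <- eigen(Sigma, symmetric = TRUE)
  p <- nrow(Sigma)
  lambda <- prod(e$values)^(1 / p)               # volume: det(Sigma)^(1/p)
  list(lambda = lambda,
       A = diag(e$values / lambda, nrow = p),    # shape, scaled to unit determinant
       D = e$vectors)                            # orientation (eigenvectors)
}

Sigma <- matrix(c(4, 1, 1, 2), nrow = 2)         # illustrative 2 x 2 covariance
parts <- decompose_cov(Sigma)

# Rebuild the matrix from its three parts and compare with the input
rebuilt <- parts$lambda * parts$D %*% parts$A %*% t(parts$D)
all.equal(rebuilt, Sigma)
det(parts$A)
```

Under this reading, a model such as "EVE" would share `lambda` and `D` across components while letting `A` differ.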

## Value

An object of class `vgpcm` is a list with components:

| Component | Description |
| --- | --- |
| `map` | A vector of integers indicating the maximum a posteriori classifications for the best model. |
| `model_objs` | A list of all estimated models with parameters returned from the C++ call. |
| `best_model` | An object of class `vgpcm_best` containing the number of groups for the best model, the covariance structure, and the Bayesian Information Criterion (BIC) value. |
| `loglik` | The log-likelihood values from fitting the best model. |
| `z` | A matrix giving the raw values upon which `map` is based. |
| `BIC` | A G by mnames by 3 dimensional array with values pertaining to BIC calculations. (legacy) |
| `gpar` | A list object for each cluster pertaining to parameters. (legacy) |
| `startobject` | The type of object inputted into `start`. |
| `row_tags` | If there were NAs in the original dataset, a vector of indices referencing the row of the imputed vectors is given. |
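
The relationship between `z` and `map` can be illustrated in base R. This sketch assumes `map` is the row-wise arg max of `z`, which is the usual definition of a maximum a posteriori classification; the page itself only says that `z` holds the raw values on which `map` is based. The matrix below is a made-up toy example, not package output.

```r
# Toy posterior-probability matrix: 4 observations (rows), 3 components (columns)
z <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.8, 0.1,
              0.3, 0.3, 0.4,
              0.5, 0.4, 0.1), ncol = 3, byrow = TRUE)

# Assign each observation to the component with the largest posterior weight
map_from_z <- apply(z, 1, which.max)
map_from_z

# Cluster sizes implied by the hard assignment
table(map_from_z)
```

Rows whose largest entry in `z` is well below 1 are the observations whose classification is least certain.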

#### Best Model

An object of class `stpcm_best` is a list with components:

| Component | Description |
| --- | --- |
| `model_type` | A string containing summarized information about the type of model estimated (covariance structure and number of groups). |
| `model_obj` | An internal list containing all parameters returned from the C++ call. |
| `BIC` | Bayesian Information Criterion (positive scale, bigger is better). |
| `loglik` | Log-likelihood from the estimated model. |
| `nparam` | Number of parameters in the model. |
| `startobject` | The type of object inputted into `start`. |
| `G` | An integer representing the number of groups. |
| `cov_type` | A string representing the type of covariance matrix (see 14 models). |
| `status` | Convergence status of the EM algorithm according to Aitken's acceleration. |
| `map` | A vector of integers indicating the maximum a posteriori classifications for the best model. |
| `row_tags` | If there were NAs in the original dataset, a vector of indices referencing the row of the imputed vectors is given. |
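
The "positive scale, bigger is better" convention for `BIC` can be sketched as follows. This assumes the common form 2 * loglik - nparam * log(n); the page does not state the formula, the helper name `bic_positive` is hypothetical, and the numbers are made up purely for illustration.

```r
# Hypothetical helper: BIC on the scale where larger values indicate a better model.
# loglik : maximized log-likelihood
# nparam : number of free parameters
# n      : number of observations
bic_positive <- function(loglik, nparam, n) {
  2 * loglik - nparam * log(n)
}

# Two made-up fits to n = 2000 observations: the larger model gains little
# log-likelihood but pays for 12 extra parameters.
bic_small <- bic_positive(loglik = -5200, nparam = 11, n = 2000)
bic_large <- bic_positive(loglik = -5195, nparam = 23, n = 2000)

bic_small > bic_large   # the smaller model is preferred on this scale
```

On this scale, model selection amounts to taking the maximum over the fitted combinations of `G` and covariance structure.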

#### Internal Objects

All classes contain an internal list called `model_obj` or `model_objs` with the following components:

| Component | Description |
| --- | --- |
| `zigs` | The a posteriori matrix. |
| `G` | An integer representing the number of groups. |
| `sigs` | A vector of covariance matrices for each group. |
| `mus` | A vector of location vectors for each group. |
| `alphas` | A vector containing skewness vectors for each group. |
| `gammas` | A vector containing estimated gamma parameters for each group. |

## Note

Dedicated `print`, `plot` and `summary` functions are available for objects of class `vgpcm`.

## Author(s)

Nik Pocuca, Ryan P. Browne and Paul D. McNicholas.

Maintainer: Paul D. McNicholas <mcnicholas@math.mcmaster.ca>

## References

McNicholas, P.D. (2016). Mixture Model-Based Classification. Boca Raton: Chapman & Hall/CRC Press.

Browne, R.P. and McNicholas, P.D. (2014). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification 8(2), 217-226.

Wei, Y., Tang, Y. and McNicholas, P.D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics and Data Analysis 130, 18-41.

Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition 28(5), 781-793.

## Examples

```r
data("sx3")
## Not run: 
### estimate "VVV" "EVE"
ax = stpcm(sx3, G=1:3, mnames=c("VVV","EVE"), start=0)
summary(ax)
ax

### estimate all 14 covariance structures
ax = stpcm(sx3, G=1:3, mnames=NULL, start=0)
summary(ax)
ax

### model based classification
sx3.label = c(rep(1,1000),rep(2,1000))
plot(sx3, col=sx3.label)
axl = stpcm(sx3, G=2, mnames=c("VVV", "EVE"), label=sx3.label)
summary(axl)

## End(Not run)
```
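
The `label` argument also allows partially labelled data, with `0` marking observations that have no known group. The base-R sketch below builds such a vector, assuming the same 1000/1000 group layout as the classification example; the choice of which labels to hide is arbitrary, and the final call is left as a comment because it needs the mixture package and the `sx3` data.

```r
# Known groups for 2000 observations, stored as integers
full_label <- as.integer(c(rep(1, 1000), rep(2, 1000)))

# Hide half of the labels at random: 0 means "no known group"
set.seed(1)
hidden <- sample(length(full_label), size = 1000)
partial_label <- full_label
partial_label[hidden] <- 0L

# Counts of unlabelled (0) and labelled (1, 2) observations
table(partial_label)

# With the mixture package loaded, the fit would then be requested as:
# axl <- stpcm(sx3, G = 2, mnames = c("VVV", "EVE"), label = partial_label)
```

Observations with a non-zero label keep their group during fitting, while those labelled `0` are classified by the model.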

mixture documentation built on April 19, 2021, 5:07 p.m.