# parse: Model-based Clustering with PARSE

In PARSE: Model-Based Clustering with Regularization Methods for High-Dimensional Data

## Description

The PAirwise Reciprocal fuSE (PARSE) penalty was proposed by Wang, Zhou and Hoeting (2016). Within the framework of model-based clustering, PARSE aims to identify the pairwise informative variables for clustering, which is especially useful for high-dimensional data.

## Usage

    parse(tuning, K = NULL, lambda = NULL, y, N = 100, kms.iter = 100,
          kms.nstart = 100, eps.diff = 1e-5, eps.em = 1e-5,
          model.crit = 'gic', backward = TRUE, cores = 2)
    parse(tuning = NULL, K, lambda, y, N = 100, kms.iter = 100,
          kms.nstart = 100, eps.diff = 1e-5, eps.em = 1e-5,
          model.crit = 'gic', backward = TRUE, cores = 2)

## Arguments

| Argument | Description |
|---|---|
| tuning | A vector of length 2 or a matrix with 2 columns. The first column is the number of clusters K and the second column is the tuning parameter λ in the penalty term. If this is missing, then K and lambda must be provided. |
| K | The number of clusters K. |
| lambda | The tuning parameter λ in the penalty term. |
| y | A p-dimensional data matrix. Each row is an observation. |
| N | The maximum number of iterations in the EM algorithm. The default value is 100. |
| kms.iter | The maximum number of iterations in the K-means algorithm that generates the starting values for the EM algorithm. |
| kms.nstart | The number of starting values in K-means. |
| eps.diff | The lower bound on the pairwise difference of two mean values. Any difference below it is treated as 0. |
| eps.em | The lower bound for the stopping criterion. |
| model.crit | The criterion used to select the number of clusters K. It is either 'bic' for the Bayesian Information Criterion or 'gic' for the Generalized Information Criterion. |
| backward | If TRUE, use the backward selection algorithm; otherwise search over all possible subsets. |
| cores | The number of cores that can be used for parallel computing. |
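As an illustration of the tuning argument, the sketch below builds a two-column matrix of candidate (K, λ) pairs in base R. It assumes that each row of tuning is one combination to be evaluated, as the argument description suggests; the candidate values themselves are arbitrary.

```r
# Candidate values (illustrative only, not recommendations).
K_values <- 1:3
lambda_values <- c(0, 0.5, 1)

# One row per (K, lambda) pair: column 1 is K, column 2 is lambda.
tuning <- as.matrix(expand.grid(K = K_values, lambda = lambda_values))

dim(tuning)      # 9 rows, 2 columns
head(tuning, 3)
```

Following the first usage form, such a matrix would presumably be supplied as parse(tuning = tuning, y = y), with K and lambda left at their NULL defaults.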

## Details

The j-th variable is defined as pairwise informative for a pair of clusters $C_k$ and $C_{k'}$ if $\mu_{kj} \neq \mu_{k'j}$. A variable is globally informative if it is pairwise informative for at least one pair of clusters. The model-based clustering here assumes that all clusters share the same diagonal covariance matrix. The PARSE penalty has the following form:

$$\sum_{j=1}^{d}\sum_{k<k'}|\mu_{kj} - \mu_{k'j}|^{-1}\, \mathbf{I}(\mu_{kj} \neq \mu_{k'j}),$$

where d is the number of variables in the data.
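To make the penalty concrete, the following base R sketch evaluates the double sum for a small matrix of cluster means. The helper parse_penalty and its eps argument are hypothetical and not part of the package; eps only mimics the role described for eps.diff, and the means are assumed to be stored with clusters in rows and variables in columns.

```r
# Hypothetical helper: evaluates the PARSE penalty for a K x d matrix of
# cluster means (rows = clusters, columns = variables).
parse_penalty <- function(mu, eps = 1e-5) {
  K <- nrow(mu)
  if (K < 2) return(0)
  total <- 0
  for (j in seq_len(ncol(mu))) {
    for (k in 1:(K - 1)) {
      for (kp in (k + 1):K) {
        d_kk <- abs(mu[k, j] - mu[kp, j])
        # Differences at or below eps are treated as zero (indicator = 0).
        if (d_kk > eps) total <- total + 1 / d_kk
      }
    }
  }
  total
}

mu <- rbind(c(0, 0, 1),
            c(2, 0, 1),
            c(4, 0, 3))
parse_penalty(mu)   # (1/2 + 1/4 + 1/2) + 0 + (1/2 + 1/2) = 2.25

# A variable is globally informative if any pair of cluster means differs.
informative <- apply(mu, 2, function(m) diff(range(m)) > 1e-5)
informative         # TRUE FALSE TRUE
```

In this toy matrix the second variable contributes nothing to the penalty because all of its cluster means coincide, so it is not informative for any pair of clusters.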

The estimation uses the backward searching algorithm embedded in the EM algorithm. Since the EM algorithm depends on the starting values, we use the estimates from K-means with multiple starting points as the starting values. Please check the paper for details of the algorithm. This function uses parallel computing to estimate the cluster means for each dimension. By default 2 cores are used; users can change this through the cores argument.
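The K-means initialization described above can be sketched with stats::kmeans. This is not the package's internal code; it only shows one plausible way to turn a multi-start K-means fit into starting means, proportions and a common diagonal variance. The names mu_start, p_start and sigma_start are made up for this illustration.

```r
set.seed(1)
# Toy data: two groups of 40 observations in 3 dimensions.
y <- rbind(matrix(rnorm(120, 0, 1), ncol = 3),
           matrix(rnorm(120, 4, 1), ncol = 3))
K <- 2

# K-means with multiple random starts; nstart and iter.max correspond to
# what kms.nstart and kms.iter are described as controlling in parse().
km <- kmeans(y, centers = K, nstart = 100, iter.max = 100)

mu_start <- km$centers            # K x d matrix of starting means
p_start  <- km$size / nrow(y)     # starting cluster proportions

# Common diagonal variance: pooled within-cluster variance per variable.
resid <- y - km$centers[km$cluster, , drop = FALSE]
sigma_start <- colSums(resid^2) / nrow(y)
```

Using many random starts reduces the chance that the EM iterations begin from a poor local optimum of the K-means objective.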

## Value

This function returns the estimated parameters and some statistics of the optimal model among the given values of K and λ. The optimal model is selected by BIC when model.crit = 'bic' or by GIC when model.crit = 'gic'.

| Component | Description |
|---|---|
| mu.hat.best | The estimated cluster means in the optimal model |
| sigma.hat.best | The estimated covariance in the optimal model |
| p.hat.best | The estimated cluster proportions in the optimal model |
| s.hat.best | The clustering assignments using the optimal model |
| lambda.best | The value of λ that provides the optimal model |
| K.best | The value of K that provides the optimal model |
| llh.best | The log-likelihood of the optimal model |
| gic.best | The GIC of the optimal model |
| bic.best | The BIC of the optimal model |
| ct.mu.best | The degrees of freedom in the cluster means of the optimal model |

## References

Wang, L., Zhou, W. and Hoeting, J. (2016) Identification of Pairwise Informative Features for Clustering Data. preprint.

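## See Also

optim, nopenalty, apL1, apfp, foreach, doParallel

## Examples

    library(PARSE)
    # Two well-separated groups of 40 observations in 3 dimensions
    y <- rbind(matrix(rnorm(120, 0, 1), ncol = 3), matrix(rnorm(120, 4, 1), ncol = 3))
    # Search over K = 1, 2 and lambda = 0, 1 using 2 cores
    output <- parse(K = c(1:2), lambda = c(0, 1), y = y, cores = 2)
    output$mu.hat.best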