# EmSkew: The EM Algorithm and Skew Mixture Models (package EMMIXskew)

## Description

`EmSkew` is the main function of the package. It fits the specified mixture model to the data via the EM algorithm. The available distributions, univariate and multivariate, are the normal, t, skew normal, and skew t distributions.

## Usage

```r
EmSkew(dat, g, distr = "mvn", ncov = 3, clust = NULL, init = NULL,
       itmax = 1000, epsilon = 1e-6, nkmeans = 0, nrandom = 10,
       nhclust = FALSE, debug = TRUE, initloop = 20)
```

## Arguments

| Argument | Description |
|---|---|
| `dat` | The data set, an n by p numeric matrix, where n is the number of observations and p is the dimension of the data. |
| `g` | The number of components of the mixture model. |
| `distr` | A three-letter string indicating the type of distribution to be fitted; the default value is "mvn", the normal distribution. See Details. |
| `ncov` | A small integer indicating the type of covariance structure; the default value is 3. See Details. |
| `clust` | A vector of integers specifying the initial partition of the data; the default is NULL. |
| `init` | A list containing the initial parameters for the mixture model; the default value is NULL. See Details. |
| `itmax` | A large integer specifying the maximum number of iterations to apply; the default value is 1000. |
| `epsilon` | A small number used to stop the EM loop when the relative difference between the log-likelihoods of successive iterations becomes sufficiently small; the default value is 1e-6. |
| `nkmeans` | An integer specifying the number of k-means partitions to be used to find the best initial values; the default value is 0. |
| `nrandom` | An integer specifying the number of random partitions to be used to find the best initial values; the default value is 10. |
| `nhclust` | A logical value specifying whether to use hierarchical clustering; the default is FALSE. If TRUE, the complete linkage method is used. |
| `debug` | A logical value; if TRUE, the output is printed, and if FALSE the function runs silently. The default value is TRUE. |
| `initloop` | An integer specifying the number of initial loops used when searching for the best initial partitions; the default value is 20. |
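The `epsilon` argument describes a relative-change stopping rule. The sketch below shows one plausible form of such a rule in base R; `has_converged` is a hypothetical helper, and the exact expression used inside `EmSkew` may differ.

```r
# Hypothetical helper, not part of EMMIXskew: stop when the change in
# log-likelihood is small relative to the previous log-likelihood value.
has_converged <- function(lk, epsilon = 1e-6) {
  k <- length(lk)
  if (k < 2) return(FALSE)           # need at least two iterations to compare
  abs(lk[k] - lk[k - 1]) < epsilon * abs(lk[k - 1])
}

# Made-up log-likelihood trace, for illustration only
lk <- c(-3500, -3200, -3150.2, -3150.199)

has_converged(lk[1:3])  # FALSE: the last step still changed by about 49.8
has_converged(lk)       # TRUE: change of about 0.001 is below 1e-6 * 3150.2
```

A relative rule of this kind keeps the threshold meaningful regardless of the scale of the log-likelihood, which grows with the number of observations.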

## Details

The distribution type is determined by the `distr` parameter, which may take any one of the following values: "mvn" for a multivariate normal distribution, "mvt" for a multivariate t-distribution, "msn" for a multivariate skew normal distribution, and "mst" for a multivariate skew t-distribution.

The covariance matrix type is set by the `ncov` parameter, which may be any one of the following: `ncov`=1 for a common covariance matrix, `ncov`=2 for a common diagonal covariance matrix, `ncov`=3 for a general covariance matrix, `ncov`=4 for a diagonal covariance matrix, and `ncov`=5 for sigma(h)*I(p), a diagonal covariance matrix whose diagonal elements are all equal within component h.
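To make the five structures concrete, the following base R sketch builds a p by p by g array of each shape. The numeric values are made up for illustration, and the reading of each `ncov` code follows the descriptions above rather than the package source.

```r
p <- 2; g <- 3

# ncov = 1: one full covariance matrix shared by every component
common <- matrix(c(1, 0.2, 0.2, 1), p, p)
sigma1 <- array(rep(common, g), c(p, p, g))

# ncov = 2: one diagonal covariance matrix shared by every component
sigma2 <- array(rep(diag(c(1, 2)), g), c(p, p, g))

# ncov = 3: a separate full covariance matrix for each component
sigma3 <- array(0, c(p, p, g))
sigma3[, , 1] <- matrix(c(1,  0.2,  0.2, 1), p, p)
sigma3[, , 2] <- matrix(c(2, -0.5, -0.5, 1), p, p)
sigma3[, , 3] <- matrix(c(1,  0.7,  0.7, 3), p, p)

# ncov = 4: a separate diagonal covariance matrix for each component
sigma4 <- array(0, c(p, p, g))
for (h in 1:g) sigma4[, , h] <- diag(c(h, 2 * h))

# ncov = 5: sigma(h) * I(p), one scalar variance per component
sigma5 <- array(0, c(p, p, g))
for (h in 1:g) sigma5[, , h] <- h * diag(p)
```

Moving from `ncov`=3 towards `ncov`=5 or `ncov`=1 reduces the number of free covariance parameters, which can help when components are small or the fit runs into singularities.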

The parameter `init` requires the following elements: `pro`, a numeric vector of the mixing proportions of the components; `mu`, a p by g matrix whose columns are the component means; `sigma`, a three-dimensional p by p by g array whose jth slice (p,p,j) is the covariance matrix of the jth component of the mixture model; `dof`, a vector of degrees of freedom, one for each component; and `delta`, a p by g matrix whose columns are the skew parameter vectors.

Because the list of `pro`, `mu`, `sigma`, `dof`, and `delta` is treated as a common parameter structure for all mixture models, the initial parameter list `init` includes all of them by default, even where some are not meaningful; for example, `dof` and `delta` do not apply to a normal mixture model. In most cases, however, the user need only give the relevant parameters in the list.
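As an illustration of the shapes described above, this base R sketch builds an `init` list for p = 2 and g = 3 and checks its dimensions. `check_init` is a hypothetical helper written for this example; it is not part of EMMIXskew.

```r
p <- 2; g <- 3

sigma <- array(0, c(p, p, g))
for (h in 1:g) sigma[, , h] <- diag(p)

init <- list(
  pro   = c(0.3, 0.3, 0.4),                       # mixing proportions
  mu    = cbind(c(4, -4), c(3.5, 4), c(0, 0)),    # p x g matrix of means
  sigma = sigma,                                  # p x p x g covariance array
  dof   = c(3, 5, 5),                             # only meaningful for t-type models
  delta = cbind(c(3, 3), c(1, 5), c(-3, 1))       # only meaningful for skew models
)

# Hypothetical helper: verify that each element has the documented shape.
check_init <- function(init, p, g) {
  stopifnot(
    length(init$pro) == g,
    abs(sum(init$pro) - 1) < 1e-8,
    all(dim(init$mu) == c(p, g)),
    length(dim(init$sigma)) == 3,
    all(dim(init$sigma) == c(p, p, g)),
    length(init$dof) == g,
    all(dim(init$delta) == c(p, g))
  )
  invisible(TRUE)
}

check_init(init, p, g)
```

Checking the shapes before calling the fitting function tends to give clearer error messages than a dimension mismatch raised deep inside the fitting code.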

When the parameter list `init` is given, the program ignores both the initial partition `clust` and the automatic partition methods such as `nkmeans`. Only when neither `init` nor `clust` is available does the program use the automatic approaches, such as the k-means partition method, to find the best initial values. All three automatic approaches (`nkmeans`, `nrandom`, and `nhclust`) can be used together to find the best initial partition and initial values if required.
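The step from a partition to initial parameter values can be sketched in base R as follows. This is an illustration under simple assumptions (sample proportions, means, and covariances per cluster), not the package's internal procedure; `partition_to_init` is a hypothetical helper and the data are simulated.

```r
set.seed(1)

# Simulated two-cluster data, for illustration only
dat <- rbind(matrix(rnorm(200, mean = -4), ncol = 2),
             matrix(rnorm(200, mean =  4), ncol = 2))
g <- 2

# One automatic partition, here from stats::kmeans with several restarts
part <- kmeans(dat, centers = g, nstart = 10)$cluster

# Hypothetical helper: moment estimates of pro, mu and sigma from a partition.
# It assumes every cluster has at least two observations.
partition_to_init <- function(dat, clust, g) {
  p <- ncol(dat)
  pro <- as.numeric(table(factor(clust, levels = 1:g))) / nrow(dat)
  mu <- matrix(
    sapply(1:g, function(h) colMeans(dat[clust == h, , drop = FALSE])),
    nrow = p
  )
  sigma <- array(0, c(p, p, g))
  for (h in 1:g) sigma[, , h] <- cov(dat[clust == h, , drop = FALSE])
  list(pro = pro, mu = mu, sigma = sigma)
}

init0 <- partition_to_init(dat, part, g)
```

Running several candidate partitions and keeping the one whose starting values give the highest log-likelihood is the usual motivation for options such as `nkmeans` and `nrandom`.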

The return values include all potential parameters `pro`, `mu`, `sigma`, `dof`, and `delta`, but the user should not use or interpret values that are irrelevant to the fitted model, for example `dof` and `delta` for normal mixture models.

## Value

| Component | Description |
|---|---|
| `error` | Error code: 0 = normal exit; 1 = did not converge within `itmax` iterations; 2 = failed to get the initial values; 3 = singularity. |
| `aic` | Akaike Information Criterion (AIC). |
| `bic` | Bayes Information Criterion (BIC). |
| `ICL` | Integrated Completed Likelihood Criterion (ICL). |
| `pro` | A vector of mixing proportions. |
| `mu` | A numeric matrix with each column corresponding to a component mean. |
| `sigma` | An array of dimension (p,p,g) whose first two dimensions give the covariance matrix of each component. |
| `dof` | A vector of degrees of freedom for each component; see Details. |
| `delta` | A p by g matrix with each column corresponding to a skew parameter vector. |
| `clust` | A vector giving the final partition. |
| `loglik` | The log-likelihood at convergence. |
| `lk` | A vector of the log-likelihood at each EM iteration. |
| `tau` | An n by g matrix of posterior probabilities for each data point. |
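The relation between some of these components can be sketched in base R. The example assumes that the final partition assigns each observation to the component with the largest posterior probability and that AIC and BIC take their textbook forms; the parameter count `d` is computed here for a normal mixture with general covariance matrices and may not match the count used inside the package. All numbers are made up.

```r
# Made-up posterior probabilities for n = 4 points and g = 3 components
tau <- rbind(c(0.90, 0.05, 0.05),
             c(0.10, 0.70, 0.20),
             c(0.20, 0.30, 0.50),
             c(0.60, 0.30, 0.10))

# Hard assignment: the component with the highest posterior probability
clust <- apply(tau, 1, which.max)
clust  # 1 2 3 1

# Textbook AIC and BIC from a log-likelihood (values are illustrative)
loglik <- -1234.5
n <- 1000; p <- 2; g <- 3
d <- (g - 1) + g * p + g * p * (p + 1) / 2   # proportions + means + covariances

aic <- -2 * loglik + 2 * d
bic <- -2 * loglik + d * log(n)
```

When comparing fits with different `g`, `distr`, or `ncov`, smaller values of these criteria indicate the preferred model under this convention.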

## References

Biernacki C., Celeux G., and Govaert G. (2000). Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 22(7). 719-725.

McLachlan G.J. and Krishnan T. (2008). The EM Algorithm and Extensions (2nd ed.). New Jersey: Wiley.

McLachlan G.J. and Peel D. (2000). Finite Mixture Models. New York: Wiley.

## See Also

`initEmmix`, `rdemmix`.

## Examples
```r
library(EMMIXskew)

# define the dimensions of the data set
n1 <- 300; n2 <- 300; n3 <- 400
nn <- c(n1, n2, n3)

p  <- 2
ng <- 3

# define the parameters
sigma <- array(0, c(2, 2, 3))
for (h in 2:3) sigma[, , h] <- diag(2)
sigma[, , 1] <- cbind(c(1, 0.2), c(0.2, 1))

mu <- cbind(c(4, -4), c(3.5, 4), c(0, 0))

# and other parameters if required for "mvt", "msn", "mst"
delta <- cbind(c(3, 3), c(1, 5), c(-3, 1))
dof   <- c(3, 5, 5)

pro <- c(0.3, 0.3, 0.4)

distr <- "mvn"
ncov  <- 3

# generate a data set
set.seed(111)  # random seed is reset
dat <- rdemmix(nn, p, ng, distr, mu, sigma)

# the following code can be used to get singular data (commented out)
# dat[1:300, 2] <- -4
# dat[300 + 1:300, 1] <- 2
## dat[601:1000, 1] <- 0
## dat[601:1000, 2] <- 0

# fit the data using k-means to get the initial partitions (10 trials)
obj <- EmSkew(dat, ng, distr, ncov, itmax = 1000, epsilon = 1e-5, nkmeans = 10)

# alternatively, if we define initial values like
initobj <- list()
initobj$pro   <- pro
initobj$mu    <- mu
initobj$sigma <- sigma
initobj$dof   <- dof
initobj$delta <- delta

# then we can fit the data from the initial values
obj <- EmSkew(dat, ng, distr, ncov, init = initobj, itmax = 1000, epsilon = 1e-5)

# finally, if we know an initial partition such as
clust <- rep(1:ng, nn)

# then we can fit the data from the given initial partition
obj <- EmSkew(dat, ng, distr, ncov, clust = clust, itmax = 1000, epsilon = 1e-5)

# plot the 2D contours
colnames(dat) <- paste("x", 1:p, sep = '')
# dev.new()
EmSkew.flow(dat, obj)
```