getEMGPs: Expectation - Maximisation Algorithm for Gaussian Mixtures
In ornithos/nectr: Exploring Datasets via Clustering with TURN-RES.


View source: R/EM.R

Implementation of the Expectation-Maximisation (EM) algorithm for Gaussian mixtures. Many implementations already exist within R; this package includes another for reasons of scalability and to support some modifications. It is intended to be used in series with TURN-RES, and the analyst can lock the clusters found through TURN-RES in place.

getEMGPs(data, centers, cls.prob = NULL, move.restrict = NA, eps.target = 0.1,
         trunc = 0.5, max.iter = 100, silent = FALSE)

getEMClusters(data, mu = NA, sigma = NA, phi = NA, params = NA)

`data`	required. A numeric data frame or matrix where each column is a dimension to be clustered over.
`centers`	required. List of vectors, where each vector is an initialisation for a cluster center. Implicit in this argument is the number of clusters.
`cls.prob`	Optional initialisation of cluster prior probabilities. For k clusters, defaults to equal weight for each cluster (1/k).
`move.restrict`	Optional restriction on cluster center updates. During each update, a center is moved to the nearest permitted point within a distance of `move.restrict` from its initialisation.
`eps.target`	Optional convergence criterion. The algorithm has converged when the likelihood function changes by less than `eps.target`.
`trunc`	Optional parameter to remove outliers from covariance estimation. Datapoints with probabilities < trunc/n are removed from the calculation.
`max.iter`	Optional iteration limit. Maximum number of iterations allowed for convergence before the algorithm exits.
`silent`	Optional parameter. By default, the algorithm prints a console update after every iteration; set to TRUE to suppress this.
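To illustrate how `eps.target` and `max.iter` interact, here is a minimal one-dimensional EM loop in base R. It is a sketch only, not the code in R/EM.R: the function name `em_1d` is hypothetical, it uses `dnorm` rather than mvtnorm, and it assumes that "likelihood function" means the log-likelihood.

```r
# Hypothetical 1-D illustration of the stopping rule; not nectr's implementation.
em_1d <- function(x, centers, eps.target = 0.1, max.iter = 100) {
  k <- length(centers)
  mu <- centers                 # user-supplied initial centers
  sigma <- rep(sd(x), k)        # crude initial spread (assumption)
  phi <- rep(1 / k, k)          # uniform priors (assumption)
  loglik_old <- -Inf
  for (iter in seq_len(max.iter)) {
    # E-step: prior-weighted densities, one column per cluster
    dens <- sapply(seq_len(k), function(j) phi[j] * dnorm(x, mu[j], sigma[j]))
    resp <- dens / rowSums(dens)
    # Log-likelihood of the parameters entering this iteration
    loglik <- sum(log(rowSums(dens)))
    # M-step: update priors, means and standard deviations
    nk <- colSums(resp)
    phi <- nk / length(x)
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    # Stop once the log-likelihood changes by less than eps.target
    if (abs(loglik - loglik_old) < eps.target) break
    loglik_old <- loglik
  }
  list(phi = phi, mu = mu, sigma = sigma, iterations = iter, loglik = loglik)
}

set.seed(1)
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 10))
fit <- em_1d(x, centers = c(1, 8))
```

If the change never drops below `eps.target`, the loop simply ends after `max.iter` iterations, which mirrors the role the two arguments are described as playing above.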

No default initialisation of the Gaussian means is implemented, nor is the number of clusters estimated. The implementation is built to work alongside the exploratory clustering available through TURN-RES (clsTurnRes), so it is assumed that the user has already determined both the number of clusters and their centers.

The major modification to the algorithm is allowing the user to restrict how far the clusters can be moved from their initial centers. A value of move.restrict = 0 locks the centers in place, and a value of move.restrict = 5 prevents the algorithm from moving the cluster centers further than a (Euclidean) distance of 5 from their initial specification. Note that the convergence behaviour has not been analysed with this modification and the author makes no guarantees of convergence. In practice, all known attempts have converged, although the likelihood no longer increases monotonically.
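One way to picture the restriction is as a projection of each proposed center back onto a ball around its initial position. The helper below is a hypothetical sketch in base R, not the package's code; in particular, treating `NA` as "no restriction" is an assumption based on the default in the usage.

```r
# Hypothetical helper, not part of nectr: keep a proposed center within
# Euclidean distance `radius` of its initial position.
restrict_center <- function(proposed, initial, radius) {
  if (is.na(radius)) return(proposed)     # assumed meaning of NA: unrestricted
  shift <- proposed - initial
  dist <- sqrt(sum(shift^2))
  if (dist <= radius) return(proposed)    # already inside the allowed ball
  # Otherwise move to the nearest point on the boundary of the ball
  initial + shift * (radius / dist)
}

restrict_center(c(10, 0), initial = c(0, 0), radius = 5)  # c(5, 0)
restrict_center(c(10, 0), initial = c(0, 0), radius = 0)  # c(0, 0): locked in place
```

With a radius of 0 every proposal collapses back to the initial center, which matches the "locks the centers in place" behaviour described above.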

It has also been found experimentally that sufficient noise in the dataset can lead to pathological covariance matrices, whose volume may be so large that all noise is treated as part of that cluster and the intended cluster is lost. The trunc argument has been implemented to 'trim' the input to the covariance calculation so that outliers are removed. Precisely, any datapoint that has a probability of < trunc/n of belonging to any of the k clusters is removed from the covariance calculation for that iteration. Again, no guarantees of convergence are therefore made.
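The trimming rule can be sketched as a simple row filter. The helper below is hypothetical and rests on two assumptions that the text does not settle: that "probability" means the cluster density at the datapoint, and that a datapoint is dropped only when it falls below the threshold for every cluster.

```r
# Hypothetical helper, not nectr's code.
# dens: n x k matrix, dens[i, j] = density of cluster j at datapoint i (assumed reading).
trim_mask <- function(dens, trunc = 0.5) {
  n <- nrow(dens)
  best <- apply(dens, 1, max)   # highest density each datapoint reaches in any cluster
  best >= trunc / n             # TRUE = keep for the covariance calculation
}

dens <- matrix(c(0.9, 0.001, 0.3,     # column 1: densities under cluster 1
                 0.05, 0.002, 0.6),   # column 2: densities under cluster 2
               ncol = 2)
keep <- trim_mask(dens)   # TRUE FALSE TRUE, since 0.5 / 3 is about 0.167
```

The covariance for an iteration would then be estimated from `data[keep, , drop = FALSE]` only, so that isolated noise points cannot inflate it.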

The return value is a list with slots for the cluster parameters (phi, mu, sigma), the cluster assignment vector, and the cluster.prob vector. The cluster.prob vector holds the density of the assigned cluster at each datapoint; this can be used as a measure of the strength of cluster membership. The list is unclassed: it is a generic object to be used as desired by the end user, and no further functionality is provided.
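As a rough illustration of what the assignment and cluster.prob vectors contain, the base R sketch below scores points against known Gaussian parameters. The function names are hypothetical, the density is computed by hand rather than with mvtnorm, and assigning each point to the cluster with the largest phi-weighted density is an assumption rather than something the source states.

```r
# Hypothetical sketch, not nectr's code.
# x: matrix or data frame with one row per datapoint.
gauss_density <- function(x, mu, sigma) {
  x <- as.matrix(x)
  d <- ncol(x)
  md <- mahalanobis(x, center = mu, cov = sigma)   # squared Mahalanobis distance
  exp(-0.5 * md) / sqrt((2 * pi)^d * det(sigma))
}

score_points <- function(x, mu, sigma, phi) {
  n <- nrow(as.matrix(x))
  dens <- sapply(seq_along(mu), function(j) gauss_density(x, mu[[j]], sigma[[j]]))
  dens <- matrix(dens, nrow = n)            # keep an n x k matrix even when n = 1
  weighted <- sweep(dens, 2, phi, `*`)      # prior-weighted densities
  cluster <- max.col(weighted, ties.method = "first")
  # Density of the assigned cluster at each datapoint
  list(cluster = cluster, cluster.prob = dens[cbind(seq_len(n), cluster)])
}

x <- rbind(c(0, 0), c(10, 10))
res <- score_points(x,
                    mu = list(c(0, 0), c(10, 10)),
                    sigma = list(diag(2), diag(2)),
                    phi = c(0.5, 0.5))
```

Here both points sit exactly on a cluster mean with identity covariance, so each cluster.prob value equals the peak density 1 / (2 * pi); points further from their assigned center would score lower.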

This implementation is heavily indebted to the package mvtnorm for providing a fast calculation of the multivariate Gaussian density.

Alex Bird, alex.bird@boots.co.uk

See getEMClusters for scoring the data into clusters using the parameters estimated by this function, and formaliseClusters for guided use of these functions or when using input from clsMRes / clsTurnRes objects.