Implementation of the Expectation-Maximisation (EM) algorithm. Many implementations already exist within R; this package includes another for scalability and to allow some modifications. It is intended to be used in series with TURN-RES, and the analyst can lock the clusters found through TURN-RES in place.
data: required. A numeric data frame or matrix where each column is a dimension to be clustered over.
centers: required. List of vectors, where each vector is an initialisation for a cluster center. The number of clusters is implied by the length of this list.
cls.prob: Optional initialisation of cluster prior probabilities. For k clusters, defaults to n/k
move.restrict: Optional restriction of the cluster center update. Clusters are limited to the nearest possible point during the update process within a locus of
eps.target: Optional convergence criterion. The algorithm has converged when the likelihood function changes <
trunc: Optional parameter to remove outliers from covariance estimation. Datapoints with < trunc/n probabilities are removed from the calculation.
max.iter: Optional iteration limit. Maximum number of iterations allowed for convergence before the algorithm exits.
silent: Optional parameter. By default, the algorithm prints a console update after every iteration.
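As a minimal sketch of the two required inputs, the following builds a numeric matrix and a matching list of initial centers. The data are synthetic and the values are illustrative assumptions, not taken from the package; the call to the fitting function itself is left out because its usage line is not shown here.

```r
set.seed(1)

# Synthetic example data: two 2-D blobs of 50 points each (illustrative only)
data <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)
colnames(data) <- c("x", "y")

# One numeric vector per cluster; the length of the list fixes the number of clusters
centers <- list(c(0, 0), c(5, 5))
k <- length(centers)

# Each center needs one coordinate per column of the data
stopifnot(all(lengths(centers) == ncol(data)))
```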
No default initialisation of the Gaussian means is implemented, nor is the number of clusters estimated. The implementation is built to work alongside the exploratory clustering available through TURN-RES (clsTurnRes), so it is assumed that the user has already determined both the number of clusters and their centers.
The major modification to the algorithm is allowing the user to restrict how far the clusters can be moved from their initial center. A value of move.restrict = 0 locks the centers in place, and a value of move.restrict = 5 prevents the algorithm from moving the cluster centers further than a (Euclidean) distance of 5 from their initial specification. Note that the convergence behaviour has not been analysed with this modification and the author makes no guarantees of convergence. In practice, all known attempts have converged, although the monotonicity of the optimisation is violated.
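One way to picture the restriction is as a projection of each proposed center back onto a ball around its initial position. The sketch below is an assumption about how such a step could look, not the package's actual code; restrict_center is a hypothetical helper name.

```r
# Hypothetical helper (not part of the package): pull a proposed center back
# to within `radius` (Euclidean) of its initial position.
restrict_center <- function(proposed, initial, radius) {
  shift <- proposed - initial
  dist <- sqrt(sum(shift^2))
  if (dist <= radius) {
    return(proposed)  # the update is already inside the allowed region
  }
  # The nearest allowed point lies on the segment from initial to proposed
  initial + shift * (radius / dist)
}

restrict_center(c(3, 4), c(0, 0), radius = 5)  # distance is exactly 5: unchanged
restrict_center(c(6, 8), c(0, 0), radius = 5)  # distance 10: pulled back to (3, 4)
restrict_center(c(6, 8), c(0, 0), radius = 0)  # radius 0: stays at the initial center
```

Because the projected center is generally not the maximiser of the M-step, a step like this is consistent with the loss of monotonicity noted above.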
It has also been found experimentally that sufficient noise in the dataset can lead to pathological covariance matrices, which might have such a large volume that all noise is considered part of that cluster and the intended cluster is lost. The trunc argument has been implemented in order to 'trim' the covariance matrix input such that outliers are removed. The precise implementation is that any datapoint that has a probability of < trunc/n of belonging to any of the k clusters is removed from the covariance calculation of that iteration. Again, no guarantees of convergence are made.
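The trimming step can be sketched as follows. This is only one reading of the description: it assumes "probability" means the Gaussian density under a cluster, and that a point is dropped when its density falls below trunc/n under every cluster. gauss_density and trimmed_cov are hypothetical helper names, and base R's mahalanobis is used here in place of mvtnorm.

```r
# Multivariate Gaussian density for the rows of x, using base R only
gauss_density <- function(x, mu, sigma) {
  exp(-0.5 * mahalanobis(x, mu, sigma)) / sqrt((2 * pi)^ncol(x) * det(sigma))
}

# Hypothetical helper: weighted covariance of one cluster after trimming.
# dens is an n x k matrix of densities, resp the weights for this cluster.
trimmed_cov <- function(x, resp, mu, dens, trunc) {
  n <- nrow(x)
  # Assumed rule: keep a point if at least one cluster gives it density >= trunc / n
  keep <- apply(dens, 1, max) >= trunc / n
  centred <- sweep(x[keep, , drop = FALSE], 2, mu)
  w <- resp[keep]
  crossprod(centred * sqrt(w)) / sum(w)  # sum_i w_i c_i c_i^T / sum_i w_i
}

set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2), c(50, 50))  # 100 points plus one far outlier
mu <- c(0, 0)
dens <- cbind(gauss_density(x, mu, diag(2)))         # k = 1 in this toy case
resp <- rep(1, nrow(x))                              # single cluster: all weights are 1

sigma_all  <- trimmed_cov(x, resp, mu, dens, trunc = 0)     # nothing removed
sigma_trim <- trimmed_cov(x, resp, mu, dens, trunc = 1e-3)  # outlier removed
```

In this toy case the single outlier inflates every entry of sigma_all, while sigma_trim stays close to the identity matrix that generated the bulk of the points.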
A list with slots for the cluster parameters (phi, mu, sigma), the cluster assignment vector, and the cluster.prob vector. The cluster.prob vector is the density of the assigned cluster at the given datapoint; this can be used as a measure of strength of cluster membership. The list is unclassed; it is a generic object to be used as desired by the end user; no further functionality is given.
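To illustrate what the assignment and cluster.prob values represent, the sketch below computes them from a set of made-up parameters. The assignment rule (largest prior-weighted density) and the variable name assignment are assumptions; only phi, mu, sigma and cluster.prob are names given in the description above.

```r
# Multivariate Gaussian density for the rows of x, using base R only
gauss_density <- function(x, mu, sigma) {
  exp(-0.5 * mahalanobis(x, mu, sigma)) / sqrt((2 * pi)^ncol(x) * det(sigma))
}

# Hypothetical fitted parameters for two 2-D clusters
phi   <- c(0.5, 0.5)
mu    <- list(c(0, 0), c(5, 5))
sigma <- list(diag(2), diag(2))

x <- rbind(c(0.1, -0.2), c(4.8, 5.3), c(2.5, 2.5))

# n x k matrix of component densities
dens <- sapply(seq_along(mu), function(j) gauss_density(x, mu[[j]], sigma[[j]]))

# Assumed rule: assign each point to the cluster with the largest weighted density
assignment <- max.col(sweep(dens, 2, phi, `*`), ties.method = "first")

# Density of the assigned cluster at each point: low values mean weak membership
cluster.prob <- dens[cbind(seq_len(nrow(x)), assignment)]
```

The third point sits midway between the two centers, so its cluster.prob is far lower than that of the first two points, which is the sense in which the value measures strength of membership.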
This implementation is heavily indebted to the package mvtnorm for providing a fast calculation of multivariate Gaussian density.
Alex Bird, alex.bird@boots.co.uk
getEMClusters for scoring the data into clusters using the parameters estimated from this function.
formaliseClusters for guided use of these functions or if using input from clsMRes / clsTurnRes objects.