msClustering: Function to perform clustering using the mean shift algorithm


View source: R/meanshift.R

Description

This function implements the mean shift algorithm. The algorithm locates the modes of a kernel density estimator and associates each data point to exactly one of the modes, thus effectively clustering the data.
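To illustrate the idea (this is a self-contained sketch, not the package's implementation: the function mean_shift_point and its arguments are hypothetical names, and a Gaussian kernel is used for simplicity), each point is iteratively moved to a kernel-weighted mean of the data until it converges to a mode:

```r
## Illustrative sketch of the mean shift update for a single point.
## X is a p x n data matrix (samples as columns), x a starting point,
## h the bandwidth. Gaussian kernel weights are assumed here.
mean_shift_point <- function( x, X, h, tol=1e-6, max.iter=1000 ) {
  for ( i in seq_len( max.iter ) ) {
    d <- sqrt( colSums( ( X - x )^2 ) ) / h  # scaled distances to all samples
    w <- exp( -d^2 / 2 )                     # Gaussian kernel weights
    x.new <- as.vector( X %*% w / sum( w ) ) # kernel-weighted mean
    if ( sqrt( sum( ( x.new - x )^2 ) ) < tol ) break
    x <- x.new
  }
  x
}

## Two well-separated 1-D groups: points started in either group
## converge to that group's mode.
X <- matrix( c( 0, 0.1, -0.1, 5, 5.1, 4.9 ), nrow=1 )
m1 <- mean_shift_point( X[,1], X, h=1 )  # near 0
m2 <- mean_shift_point( X[,4], X, h=1 )  # near 5
```

Points whose trajectories end at the same mode are assigned to the same cluster.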

Usage

msClustering(X, h = NULL, kernel = "epanechnikovKernel",
             tol.stop = 1e-06, tol.epsilon = 0.001, multi.core = FALSE)

Arguments

X

a p \times n matrix containing n ≥ 1 p-dimensional numeric vectors stored as columns. Each column of X represents a sample point.

h

a strictly positive bandwidth parameter.

kernel

a kernel function (as a character string). The following kernels are supported:

  • Epanechnikov: K(x) = \frac{3}{2}(1-x^2)I_{[0,1]}(x); kernel="epanechnikovKernel"

  • cubic: K(x) = 4(1-x)^3I_{[0,1]}(x); kernel="cubicKernel"

  • Gaussian: K(x) = \sqrt{\frac{2}{\pi}}e^{-\frac{x^2}{2}}I_{[0,\infty)}(x); kernel="gaussianKernel"

  • exponential: K(x) = e^{-x}I_{[0,\infty)}(x); kernel="exponentialKernel".

tol.stop

a strictly positive tolerance parameter. The algorithm stops when all of the updates generate steps of length smaller than tol.stop. tol.stop should be considerably smaller than tol.epsilon.

tol.epsilon

a strictly positive tolerance parameter. Once the algorithm stops, points whose pairwise distance is less than tol.epsilon are grouped in the same cluster.

multi.core

logical. If TRUE, the mean shift algorithm is parallelized.

Details

It is generally recommended to standardize X so that each variable has unit variance before running the algorithm.

Roughly speaking, larger values of h produce a coarser clustering (i.e. few and large clusters). For sufficiently large values of h, the algorithm produces a unique cluster containing all the data points. Smaller values of h produce a finer clustering (i.e. many small clusters). For sufficiently small values of h, each cluster that is identified by the algorithm will contain exactly one data point.

If h is not specified in the function call, then h is by default set to the 30th percentile of the empirical distribution of distances between the columns of X, i.e. h=quantile( dist( t( X ) ), 0.3 ).
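The recommended standardization and the default bandwidth can both be reproduced with base R alone. The matrix X below is a toy example introduced for illustration, not part of the package:

```r
## A toy 2 x 50 data matrix (rows = variables, columns = sample points).
set.seed( 1 )
X <- matrix( rnorm( 100 ), nrow=2 )

## Standardize each variable (row of X) to unit variance, as recommended.
X.std <- t( scale( t( X ) ) )

## Default bandwidth used when h is not supplied: the 30th percentile of
## the pairwise distances between the columns of X.
h.default <- quantile( dist( t( X.std ) ), 0.3 )
```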

In their implementation, gaussianKernel and exponentialKernel are rescaled so that they assign a probability of at least 0.99 to the unit interval [0,1]. This ensures that all the kernels are roughly on the same scale.

To specify the number of cores when multi.core=TRUE, the option mc.cores needs to be set with options( mc.cores=n.cores ), where n.cores is the number of cores that the mean shift algorithm is allowed to use for parallel computation.

Value

The function invisibly returns a list with the following elements:

components

a matrix containing the modes/cluster representatives by column.

labels

an integer vector of cluster labels.

Author(s)

Mattia Ciollaro and Daren Wang

References

Carreira-Perpinan, M. A. (2015) A review of mean-shift algorithms for clustering. arXiv http://arxiv.org/abs/1503.00687

See Also

bmsClustering

Examples

## an example using the iris dataset
## help( iris )

## prepare data matrix (a subset of the iris dataset)
set.seed( 2 )
indices <- sample( 1:nrow( iris ), 80 )
iris.data <- t( iris[indices,c( "Sepal.Length", "Sepal.Width" )] )

## run mean shift algorithm
clustering <- msClustering( iris.data, h=0.8 )
print( clustering )

## plot the clusters
## Not run: 
plot( iris.data[1,], iris.data[2,], col=clustering$labels+2, cex=0.8,
      pch=16, xlab="Sepal.Length", ylab="Sepal.Width" )
points( clustering$components[1,], clustering$components[2,],
        col=2+( 1:ncol( clustering$components ) ), cex=1.8, pch=16 )
## End(Not run)

## using multiple cores (2)
## Not run: 
options( mc.cores=2 )
clustering.mc <- msClustering( iris.data, multi.core=TRUE )
## End(Not run)

mattiaciollaro/MeanShift documentation built on May 21, 2019, 1:23 p.m.