clsTurnRes: TURN-RES Clustering

View source: R/clsTurnRes.R

Description

Implementation of the TURN-RES clustering algorithm (Foss, 2002). TURN-RES is a density-based clustering algorithm that aims for better efficiency and usability than other methods such as DBSCAN. Neighbours are estimated by cyclically sorting the dataset over all of its dimensions. Note that each data point is given only one neighbour in each perpendicular direction, and that neighbour is not necessarily the closest one.
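The sorting idea can be sketched in base R. This is an illustration only, not the package's implementation: axis_neighbours is a hypothetical helper, and snapping points to a grid of width r with floor(x / r) is an assumption made for the sketch. For each axis, the rows are ordered with that axis as the last sort key, so consecutive rows that agree on every other axis are successors along it. The distance checks and density computation of the real algorithm are omitted.

```r
# Hypothetical helper (not part of the package): for each point, find its
# successor along each axis after snapping the data to a grid of width r.
axis_neighbours <- function(x, r) {
  x <- as.matrix(x)
  g <- floor(x / r)                    # grid coordinates (assumed quantisation)
  n <- nrow(g)
  d <- ncol(g)
  nb <- matrix(NA_integer_, n, d)      # nb[i, j]: row index following i along axis j
  if (n < 2) return(nb)
  for (j in seq_len(d)) {
    # Sort with every other axis first and axis j as the last key
    keys <- c(setdiff(seq_len(d), j), j)
    ord <- do.call(order, lapply(keys, function(k) g[, k]))
    a <- ord[-n]
    b <- ord[-1]
    # Consecutive rows are neighbours along j only if they agree on all other axes
    same <- rowSums(g[a, -j, drop = FALSE] != g[b, -j, drop = FALSE]) == 0
    nb[a[same], j] <- b[same]
  }
  nb
}

pts <- rbind(c(0.05, 0.05), c(0.15, 0.05), c(0.05, 0.95))
nb <- axis_neighbours(pts, r = 0.1)
nb
```

In this toy input the first point gets the second point as its successor along the first axis and the third point along the second axis, even though the third point is far away, which illustrates why the neighbour found is not necessarily the closest one.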

Usage

clsTurnRes(data, r, summarise = F, min.size = "Auto", base.cls = "None",
           phi = 0.8)

Arguments

data

Required. A numeric data frame or matrix in which each column is a dimension to be clustered over. Alternatively, a cTurn object, i.e. a previous output of this function.

r

Required. Resolution parameter for TURN-RES. Think of this like the adjustment wheel of a microscope: the smaller the value, the finer the granularity of the clustering, while a larger value quantizes the data to a coarser grid. The purpose of the clsMRes function is to inform the choice of this parameter.
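As a rough intuition for the effect of r, the sketch below counts how many grid cells are occupied when toy data are snapped to grids of different widths. The helper occupied_cells is hypothetical, and floor(x / r) is an assumed form of quantisation used only for illustration; it is not taken from the package source.

```r
set.seed(1)
x <- matrix(runif(200), 100, 2)

# Hypothetical helper: number of distinct grid cells occupied at width r
occupied_cells <- function(x, r) nrow(unique(floor(x / r)))

r_values <- c(0.5, 0.1, 0.02)
cells <- sapply(r_values, function(r) occupied_cells(x, r))
data.frame(r = r_values, cells = cells)
```

A large r collapses the points into a handful of coarse cells, while a small r leaves most points in cells of their own.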

summarise

The output can be summarised when the clustering is only needed for top-level metrics. For example, when clsMRes calls clsTurnRes it requires only the statistics of the run rather than the full output, so a summary is returned. There are two levels of summary: level 1 is the more condensed and returns only n and k; level 2 also returns the cluster vector.

min.size

The minimum size a group of points must reach to be classified as a cluster. Any cluster smaller than min.size is treated as noise. The default is n/100, so a cluster must represent at least 1% of the dataset. If a numeric value is supplied, it is interpreted as an absolute count rather than as a proportion.
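The size threshold can be sketched in base R as follows. The helper drop_small is hypothetical, and marking noise with NA is an assumption made for the illustration; the package may encode noise differently.

```r
# Hypothetical helper: relabel clusters smaller than min.size as noise.
# The default mirrors the documented n/100 rule; NA as the noise label is assumed.
drop_small <- function(cluster, min.size = length(cluster) / 100) {
  sizes <- table(cluster)
  small <- names(sizes)[sizes < min.size]
  cluster[as.character(cluster) %in% small] <- NA
  cluster
}

labels <- c(rep(1, 150), rep(2, 49), 3)      # n = 200, so the default threshold is 2
default_cut  <- drop_small(labels)                  # only the singleton becomes noise
absolute_cut <- drop_small(labels, min.size = 50)   # absolute count, not a proportion
table(absolute_cut, useNA = "ifany")
```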

base.cls

A somewhat experimental notion. A vector of known cluster memberships in which each element corresponds to the respective row of the dataset: an integer specifies the cluster number, and NA denotes unknown membership. This is intended to force the algorithm to separate or agglomerate clusters based on prior information, but the boundaries between forcibly separated clusters can have an unnatural shape.
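A vector of this shape might be built as below. The toy data and the choice of which rows are "known" are invented for illustration, and the final call is left as a comment because it depends on the package being installed.

```r
# Toy data; rows 1-10 and 11-20 are assumed, for illustration, to be known members
set.seed(42)
data <- matrix(runif(200), 100, 2)

base.cls <- rep(NA_integer_, nrow(data))   # NA = membership unknown
base.cls[1:10]  <- 1L                      # known members of cluster 1
base.cls[11:20] <- 2L                      # known members of cluster 2

table(base.cls, useNA = "ifany")

# The vector would then be passed alongside the data, e.g.
# cls <- clsTurnRes(data, r = 0.1, base.cls = base.cls)
```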

phi

Another parameter of the algorithm, which determines the density required to agglomerate points. In theory the choice of this parameter is arbitrary (see References), as it effectively 'scales' the resolution parameter. There is no formal proof of this, hence the option to tweak it.

Details

While not completely parameterless, the user is only required to specify r, the resolution of the clustering. However, the algorithm has much more power when paired with its parent function clsMRes, which iterates through a sequence of values to reveal the structure of the data and aid with parameter selection.

Desired clusters may be found across different values of the parameter r. While formaliseClusters is designed to take a clsMR object spanning multiple resolutions, there may be instances where the analyst wants to split open a giant cluster at a given resolution. The function clsSplit can be called to partition one or more specified clusters into k separate clusters.

In the TURN paper below, a second algorithm, TURN-CUT, was proposed to determine the choice of r automatically. This algorithm is in principle similar to the 'elbow method' of determining the number of clusters. It has been omitted here because of concerns about overfitting and the view that an exploratory approach would be preferred anyway. Philosophically, there may be no "best" choice of parameter, as even a given objective may yield a number of different "best" parameters on the same dataset.

Value

An object of class cTurn. The cluster membership vector can be found in the slot $cluster. cTurn objects have a number of generic functions available: print, summary and plot.

Note

To avoid copying the dataset into each cTurn object, only its name is saved, as item $dataset.name. The data are then retrieved in function calls via get(dataset.name, env = .GlobalEnv), which means that the user must ensure that the dataset variable is not renamed. This is obviously a suboptimal procedure, but since the package is intended for large datasets it is also inadvisable to make a copy for every object, particularly if dozens of cluster calls are to be made in quick succession.
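The consequence of storing only the name can be seen in a small base R sketch. The functions make_ref and fetch_data are hypothetical stand-ins for the behaviour described in this note, not functions from the package.

```r
# Hypothetical stand-ins: keep the variable name, look the data up on demand
make_ref <- function(data) {
  list(dataset.name = deparse(substitute(data)))   # store the name only
}
fetch_data <- function(obj) get(obj$dataset.name, envir = .GlobalEnv)

assign("toy_points", matrix(runif(20), 10, 2), envir = .GlobalEnv)
ref <- make_ref(toy_points)
dim(fetch_data(ref))          # the data are found by name in the global environment

# Once the variable no longer exists under that name, the lookup fails
rm("toy_points", envir = .GlobalEnv)
lookup_ok <- tryCatch({ fetch_data(ref); TRUE }, error = function(e) FALSE)
lookup_ok
```

Keeping the original variable in the global environment, under its original name, for as long as the cTurn object is in use avoids this failure.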

Author(s)

Alex Bird, alex.bird@boots.co.uk

References

Foss, A. (2002). A Parameterless Method for Efficiently Discovering Clusters of Arbitrary Shape in Large Datasets. University of Alberta, Canada.

See Also

clsMRes for determining the resolution; clsSplit for splitting a given cluster into k clusters

Examples

# Toy example: 100 uniform random points in two dimensions
data <- matrix(runif(200), 100, 2)
cls <- clsTurnRes(data, r = 0.1)

# Cluster summary
summary(cls)

# Parallel coordinate plot
plot(cls)

ornithos/nectr documentation built on May 24, 2019, 3:57 p.m.