clusterfinder: Heuristics to find subpopulations of outliers

ClusterFinder1    R Documentation

Heuristics to find subpopulations of outliers

Description

The ClusterFinder is a heuristic that finds subpopulations of outliers, essentially by looking for secondary modes in a density estimate.

Usage

ClusterFinder1(X,...)
## S3 method for class 'acomp'
ClusterFinder1(X,...,sigma=0.3,radius=1,asig=1,minGrp=3,
                                 robust=TRUE)
          

Arguments

X

the dataset to be clustered

...

further arguments passed to MahalanobisDist(X,...,robust=robust,pairwise=TRUE)

sigma

numeric: the bandwidth of the density estimation kernel in a robustly Mahalanobis-transformed space (i.e. in the transform where the main group has unit variance)

radius

the minimum size of a cluster in a robustly Mahalanobis-transformed space (i.e. in the transform where the main group has unit variance)

asig

a scaling factor for the geometry of the robustly Mahalanobis-transformed space, used when computing the likelihood of an observation belonging to a group (under a Gaussian assumption). Higher values

minGrp

the minimum size of a group to be used. Smaller groups are treated as single outliers.

robust

a robustness description for estimating the variance of the main group. FALSE is probably not a useful value. However, robustness techniques other than mcd might become useful later. TRUE just picks the default method of the package.

Details

See outliersInCompositions for a comprehensive introduction to outlier treatment in compositions.
The ClusterFinder is labeled with a number to make clear that this is just one implementation of a heuristic and is not based on some eternal truth. Other heuristics might give better cluster finders.
Unlike other clustering algorithms, the basic model of this algorithm assumes that there is one dominating subpopulation and an unknown number of smaller subpopulations with a similar covariance structure but a different mean. The algorithm thus first estimates the covariance structure of the main population with a robust location-scale estimator. Then it uses a simplified (Gaussian) kernel density estimator to find nonrandom secondary modes. Then it tries to assign the observations to the different modes according to a discriminant analysis model. Groups under a given size are considered single outliers forming a separate group. In this way the number of clusters is kept low even if there are many erratic measurements in the dataset.
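The steps described above can be illustrated with a minimal base-R sketch. This is not the code used by ClusterFinder1: it works on plain (non-compositional) simulated data, uses MASS::cov.rob as a stand-in robust estimator, an unnormalised Gaussian kernel, and a deliberately simplified assignment rule. The variable names mirror the arguments sigma, radius, asig and minGrp only for readability; how the package actually uses them may differ.

```r
# Illustrative sketch only; an assumption-laden stand-in, not the package code.
library(MASS)  # cov.rob() provides a robust (MCD) location/scatter estimate

set.seed(1)
# Simulated data: one dominant group plus a small group with a shifted mean.
main  <- matrix(rnorm(200 * 2), ncol = 2)
minor <- matrix(rnorm(15 * 2), ncol = 2) + matrix(c(6, 6), 15, 2, byrow = TRUE)
X <- rbind(main, minor)

# Step 1: robust location and covariance of the main population.
rob <- cov.rob(X, method = "mcd")

# Whiten the data so that the main group has (roughly) unit variance;
# Euclidean distances in Z are then robust Mahalanobis distances.
Z <- sweep(X, 2, rob$center) %*% solve(chol(rob$cov))
D <- as.matrix(dist(Z))

# Step 2: Gaussian kernel density (unnormalised) at each observation.
sigma  <- 0.4
radius <- 1
dens <- rowMeans(exp(-D^2 / (2 * sigma^2)))

# An observation counts as a local maximum here if no observation
# within `radius` has a higher density estimate.
isMax <- sapply(seq_len(nrow(D)),
                function(i) all(dens[D[i, ] <= radius] <= dens[i]))
modes <- which(isMax)

# Step 3: assign each observation to the mode with the highest
# Gaussian likelihood (a much simplified discriminant rule).
asig <- 1
lik  <- exp(-D[, modes, drop = FALSE]^2 / (2 * asig^2))
grp  <- factor(max.col(lik, ties.method = "first"))

# Step 4: pool groups smaller than minGrp into a single outlier group.
minGrp <- 3
small  <- names(which(table(grp) < minGrp))
types  <- as.character(grp)
types[types %in% small] <- "outlier"
types  <- factor(types)
table(types)
```

The number of modes found depends strongly on sigma and radius, so the resulting table should be read as an illustration of the mechanism rather than as a reference result.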
The main use of the clusters is descriptive plotting. The advantage of these clusters over other clustering techniques such as kmeans or hclust is that they do not tear apart the central mass of the data, as those methods do in order to make the clusters as compact as possible.
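The remark about tearing apart the central mass can be seen in a small experiment, sketched here with base R only. The data are simulated and contain a single homogeneous Gaussian cloud; the choice of two centers is an assumption made purely for illustration.

```r
# Illustration: kmeans splits a single homogeneous cloud when asked for 2 groups.
set.seed(42)
X  <- matrix(rnorm(300 * 2), ncol = 2)   # one Gaussian cloud, no subgroups
km <- kmeans(X, centers = 2, nstart = 5)

# Both clusters are populated although the data have no real substructure.
table(km$cluster)
```

Because kmeans must return as many groups as it is given centers, the central mass is cut into pieces even when no secondary population exists.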

Value

A list

types

a factor representing the group assignments, when the small groups are ignored

typesTbl

a table giving the number of members in each of these groups

groups

a factor representing the found group assignments

isMax

a logical vector indicating, for each observation, whether it represents a local maximum in the density estimate

prob

the inferred probability of belonging to the different groups, given as an acomp composition

nmembers

a table giving the number of members of each group

density

the density estimated at each observation location

likeli

the inferred likelihood of seeing this observation, for each of the groups
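As a brief sketch of how the returned list might be inspected, the following uses only the component names listed above and the dataset from the Examples section. It assumes the compositions package is installed and is skipped otherwise; no particular output is implied.

```r
# Runs only if the compositions package is available.
if (requireNamespace("compositions", quietly = TRUE)) {
  library(compositions)
  data(SimulatedAmounts)
  cl <- ClusterFinder1(sa.outliers5, sigma = 0.4, radius = 1)

  print(names(cl))          # components of the returned list
  print(cl$typesTbl)        # group sizes after small groups are set aside
  print(sum(cl$isMax))      # number of observations flagged as local maxima
  print(head(cl$density))   # density estimate at the first observations
}
```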

Author(s)

K.Gerald v.d. Boogaart http://www.stat.boogaart.de

See Also

hclust, kmeans

Examples

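library(compositions)
data(SimulatedAmounts)
# Find clusters of outliers in the simulated dataset
cl <- ClusterFinder1(sa.outliers5, sigma=0.4, radius=1)
# Colour and plotting symbol encode the cluster type of each observation
plot(sa.outliers5, col=as.numeric(cl$types), pch=as.numeric(cl$types))
legend(1, 1, legend=levels(cl$types), xjust=1,
       col=1:length(levels(cl$types)), pch=1:length(levels(cl$types)))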


compositions documentation built on June 22, 2024, 12:15 p.m.