pdfClassification: Classification of low density data

pdfClassification    R Documentation

Classification of low density data

Description

Allocates low-density data points in a multi-stage procedure, after the cluster cores have been detected by pdfCluster.

Usage

pdfClassification(obj, n.stage = 5, se = TRUE, hcores = FALSE)

Arguments

obj

An object of pdfCluster-class.

n.stage

Number of stages. Low-density data are allocated by a multi-stage procedure with n.stage stages. Default value is 5.

se

Logical. Should the standard error of the density estimates be taken into account when defining the order of allocation? Default value is TRUE. See details below.

hcores

Logical. Set this value to TRUE to build cluster density estimates by selecting the same bandwidths as the ones used to form the cluster cores. Otherwise, bandwidths specific for the clusters are selected. Default value is FALSE. See details below.

Details

The basic idea of the classification stage of the procedure is as follows: for an unallocated data point x_0, compute the estimated density \hat{f}_m(x_0) based on the data already assigned to group m, m = 1, 2, …, M, and assign x_0 to the group with the highest log ratio \log[\hat{f}_m(x_0)/\max_k \hat{f}_k(x_0)].

If \hat{f}_m(x_0) = 0 for all m = 1, 2, …, M, x_0 is treated as an outlier. The procedure issues a warning, the outlier remains unclassified, and its cluster label is set to zero.
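The allocation rule and the outlier case can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's internal code: the helper allocate_points and the matrix fhat of group density estimates are hypothetical names introduced here, and the density values are made up for the example.

```r
# Hypothetical helper (not part of pdfCluster): allocate each point to the
# group with the highest log ratio; points with zero density in every group
# are treated as outliers and receive label 0.
allocate_points <- function(fhat) {
  # fhat: numeric matrix, one row per unallocated point, one column per group
  labels <- integer(nrow(fhat))
  for (i in seq_len(nrow(fhat))) {
    f <- fhat[i, ]
    if (all(f == 0)) {
      warning("point ", i, " has zero density in every group; left unclassified")
      labels[i] <- 0L
    } else {
      # log(0) is -Inf in R, so a group with zero density is never selected
      log_ratio <- log(f) - log(max(f))
      labels[i] <- which.max(log_ratio)
    }
  }
  labels
}

# Made-up density estimates for three points and two groups
fhat <- rbind(c(0.20, 0.05),
              c(0.01, 0.30),
              c(0.00, 0.00))
labels <- suppressWarnings(allocate_points(fhat))
print(labels)
```

The third row has zero density in both groups, so it keeps label 0, mirroring the outlier handling described above.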

The current implementation of this idea proceeds in n.stage stages: it allocates a block of points, updates the estimates \hat{f}_m(\cdot) based on the new members of each group, and then allocates a new block of points. When se = TRUE, classification is performed by further weighting the log ratios inversely with their approximate standard error, so that points whose density estimate has the highest precision are allocated first.
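The staged scheme can be illustrated with a simplified base R sketch. Everything in it is an assumption made for illustration and does not reproduce the package's code: the data are one-dimensional, the kernel is Gaussian with a fixed bandwidth h, points are ordered by the log ratio between the best and second-best group density, each stage allocates an equal share of the remaining points, and the standard-error weighting (se = TRUE) and the outlier case are omitted. The names kde and staged_allocation are hypothetical.

```r
# Illustrative sketch only; names and design choices are hypothetical.
kde <- function(x0, sample, h) {
  # Gaussian kernel density estimate at each x0 from the points in `sample`
  sapply(x0, function(z) mean(dnorm((z - sample) / h)) / h)
}

staged_allocation <- function(x, labels, n.stage = 5, h = 0.5) {
  # labels: integer vector, 0 = not yet allocated, 1..M = cluster core (M >= 2)
  M <- max(labels)
  for (s in seq_len(n.stage)) {
    todo <- which(labels == 0)
    if (length(todo) == 0) break
    # re-estimate each group's density from its current members
    fhat <- vapply(seq_len(M),
                   function(m) kde(x[todo], x[labels == m], h),
                   numeric(length(todo)))
    fhat <- matrix(fhat, nrow = length(todo))
    best <- max.col(fhat, ties.method = "first")
    # assumed ordering criterion: log ratio of best to second-best density
    margin <- vapply(seq_along(todo), function(i) {
      f <- sort(fhat[i, ], decreasing = TRUE)
      log(f[1]) - log(f[2])
    }, numeric(1))
    # assumed block size: equal share of the points still unallocated
    block <- ceiling(length(todo) / (n.stage - s + 1))
    pick <- order(margin, decreasing = TRUE)[seq_len(block)]
    labels[todo[pick]] <- best[pick]
  }
  labels
}

# Two cores of three points each, plus four low-density points
x <- c(-2.2, -2.0, -1.8, 1.8, 2.0, 2.2, -1.0, -0.4, 0.5, 1.2)
labels <- c(1L, 1L, 1L, 2L, 2L, 2L, 0L, 0L, 0L, 0L)
result <- staged_allocation(x, labels, n.stage = 2)
print(result)
```

In this toy run the points nearest the cores are allocated in the first stage, and the two central points are allocated in the second stage using density estimates that already include the newly added members.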

Each of the \hat{f}_m(\cdot) is built by selecting either the same bandwidths h_0 as the ones used to form the cluster cores (when hcores = TRUE) or cluster-specific bandwidths, obtained as follows:

h_m^{*} = \exp [(1-a_m) \log(h_0) + a_m \log(h_m)],

where a_m is the proportion of data points in the m-th cluster core. The bandwidths h_m are asymptotically optimal for a normal distribution of the m-th cluster if the kernel estimator has fixed bandwidth, or are computed according to the Silverman (1986) approach if it has adaptive bandwidth.
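The formula is a weighted geometric mean of h_0 and h_m, with weight a_m on h_m. A small base R sketch makes this concrete; the helper cluster_bandwidth and the numeric values are hypothetical and serve only as an illustration.

```r
# Hypothetical helper (not part of pdfCluster): cluster-specific bandwidth
# as a weighted geometric mean of the core bandwidth h0 and the cluster
# bandwidth hm, with weight a on hm.
cluster_bandwidth <- function(h0, hm, a) {
  stopifnot(a >= 0, a <= 1)
  exp((1 - a) * log(h0) + a * log(hm))
}

# Made-up bandwidths for two variables; half of the cluster lies in its core
h_star <- cluster_bandwidth(h0 = c(1.0, 2.0), hm = c(0.25, 0.5), a = 0.5)
print(h_star)
```

With a_m = 0 the result equals h_0, and with a_m = 1 it equals h_m, so the larger the share of the cluster already in its core, the closer the bandwidth moves to the cluster-specific value.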

Value

An object of pdfCluster-class with slot stages of class "list" having length equal to n.stage. See pdfCluster-class for further details.

Note

The function pdfClassification is called internally by pdfCluster when the argument n.stage is set to a value greater than zero. Alternatively, it may be called directly by the user, with an object of pdfCluster-class as its argument.

When pdfClassification is called internally from pdfCluster and only one group is detected, the slot stages is a list with n.stage elements, each of which is a vector with length equal to the number of data points and all elements equal to 1.

References

Azzalini, A., Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17, 71-80.

Silverman, B. (1986). Density estimation for statistics and data analysis. Chapman and Hall, London.

See Also

pdfCluster, pdfCluster-class

Examples

# load the package and the data
library(pdfCluster)
data(wine)

# select a subset of variables
x <- wine[, c(2,5,8)]

# whole procedure, including the classification phase
cl <- pdfCluster(x)
summary(cl)
table(groups(cl))

# repeat the classification using the same bandwidths as for the cluster cores
cl1 <- pdfClassification(cl, hcores = TRUE)
table(groups(cl1))

pdfCluster documentation built on Dec. 2, 2022, 5:14 p.m.