wrsk: wrsk

Description

This function performs robust (weighted) and sparse k-means clustering for high-dimensional data (Brodinova et al., 2017). For a given number of clusters k and sparsity parameter s, the algorithm simultaneously detects clusters, outliers, and informative variables.

Usage

wrsk(data, k, s, iteration = 15, cutoff = 0.5)

Arguments

data

A data matrix with n observations and p variables.

k

The number of clusters.

s

The sparsity parameter, which bounds the L1 norm of the variable weights, i.e., a lasso-type penalty. The value should be larger than 1 and smaller than sqrt(p).

iteration

The maximum number of iterations allowed.

cutoff

A cutoff value used to determine outliers. An observation is declared an outlier if its weight is smaller than or equal to this cutoff. The default is 0.5.
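
The admissible range for s follows from a standard fact: a weight vector with L2 norm 1 has an L1 norm between 1 and sqrt(p). The sketch below builds a grid of candidate values strictly inside that range. The helper sparsity_grid is hypothetical and not part of the package; the even spacing is only one possible choice.

```r
# Hypothetical helper (not part of wrsk): candidate values of s in (1, sqrt(p)).
sparsity_grid <- function(p, n_values = 10) {
  stopifnot(p > 1, n_values >= 1)
  # Evenly spaced points from 1 to sqrt(p); drop both endpoints so that
  # every candidate satisfies 1 < s < sqrt(p).
  grid <- seq(1, sqrt(p), length.out = n_values + 2)
  grid[-c(1, n_values + 2)]
}

# Example: p = 800 variables, five candidate values.
s_grid <- sparsity_grid(p = 800, n_values = 5)
print(round(s_grid, 2))
```

Smaller values of s force more variable weights to zero, while values close to sqrt(p) leave the weights almost unconstrained.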

Details

The method is a three-step iterative procedure. First, a weighting function is applied during sparse k-means clustering, which is initialized with ROBIN. Second, the variable weights from sparse k-means are updated for the given sparsity parameter. These two steps are repeated until the variable weights stabilize. Finally, both clusters and outliers are detected. The approach is a robust version of sparse k-means (Witten and Tibshirani, 2010) and an alternative to robust (trimmed) and sparse k-means (Kondo et al., 2016).
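
To make the variable-weight update concrete, the following sketch implements the soft-thresholding update described by Witten and Tibshirani (2010) for sparse k-means: weights are proportional to the soft-thresholded per-variable criterion, scaled to L2 norm 1, with the threshold chosen by bisection so that the L1 norm does not exceed s. It is an assumption that wrsk uses an update of this form; the code is not taken from the package source, and the names soft_threshold and update_varweights are hypothetical.

```r
# Soft-thresholding operator: shrink each entry towards zero by delta.
soft_threshold <- function(x, delta) sign(x) * pmax(abs(x) - delta, 0)

# Hypothetical helper (not part of wrsk): variable weights w from per-variable
# criterion values a (e.g. between-cluster sums of squares), such that
# ||w||_2 = 1 and ||w||_1 <= s (up to the bisection tolerance).
update_varweights <- function(a, s, tol = 1e-8) {
  stopifnot(s >= 1, any(a > 0))
  a <- pmax(a, 0)                        # negative contributions are dropped
  unit <- function(x) x / sqrt(sum(x^2)) # rescale to L2 norm 1
  w <- unit(a)
  if (sum(abs(w)) <= s) return(w)        # constraint already met, no shrinkage
  lo <- 0
  hi <- max(a)                           # at delta = max(a) all weights vanish
  while (hi - lo > tol) {                # bisection on the threshold delta
    mid <- (lo + hi) / 2
    w <- unit(soft_threshold(a, mid))
    if (sum(abs(w)) > s) lo <- mid else hi <- mid
  }
  unit(soft_threshold(a, (lo + hi) / 2))
}

# Illustrative criterion values: two strong variables and four weak ones.
a <- c(10, 8, 0.5, 0.3, 0.2, 0.1)
w <- update_varweights(a, s = 1.3)
print(round(w, 3))
```

In this illustrative run the four weak variables receive a weight of exactly zero, which is the mechanism by which the lasso-type constraint selects informative variables.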

Value

clusters

An integer vector with values from 1 to k, indicating the resulting cluster membership.

obsweights

A numeric vector of observation weights ranging between 0 and 1.

outclusters

An integer vector with values from 0 to k, containing both cluster membership and identified outliers. A value of 0 marks an outlier.

varweights

A numeric vector of variable weights reflecting the contribution of each variable to cluster separation. A high weight suggests that a variable is informative.

WBCSS

The weighted between-cluster sum of squares at the local optimum. The value is calculated with respect to the final variable weights and adjusted by the final observation weights.

centers

The set of final cluster centers.
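
The relationship between the returned components can be illustrated with a small base R sketch. The values below are made up for illustration and do not come from an actual wrsk() call. The sketch assumes, following the description of cutoff, that outclusters equals clusters with outliers recoded as 0, and that uninformative variables receive a weight of exactly zero, as is typical under a lasso-type constraint.

```r
# Illustrative values only; in practice these would come from a wrsk() result.
clusters   <- c(1L, 1L, 2L, 3L, 2L, 3L)
obsweights <- c(0.95, 0.50, 0.80, 0.10, 1.00, 0.51)
varweights <- c(0, 0.62, 0, 0.05, 0.78, 0)
cutoff     <- 0.5

# An observation whose weight is <= cutoff is coded as 0 (outlier).
outclusters <- ifelse(obsweights <= cutoff, 0L, clusters)
print(outclusters)

# Variables with a non-zero weight, ordered from most to least informative.
informative <- which(varweights > 0)
ranked <- informative[order(varweights[informative], decreasing = TRUE)]
print(ranked)
```

Note that an observation with a weight exactly equal to the cutoff (the second one here) is flagged as an outlier, since the comparison is "smaller than or equal to".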

Author(s)

Sarka Brodinova <sarka.brodinova@tuwien.ac.at>

References

S. Brodinova, P. Filzmoser, T. Ortner, C. Breiteneder, M. Zaharieva. Robust and sparse k-means clustering for high-dimensional data. Submitted for publication, 2017. Available at http://arxiv.org/abs/1709.10012

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713-726, 2010.

Y. Kondo, M. Salibian-Barrera, R. H. Zamar. RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm. Journal of Statistical Software, 72(5), 1-26, 2016.

See Also

Gapwrsk, KMeansSparseCluster, RSKC

Examples

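# load the package (assumed to be installed)
library(wrsk)

# generate data
d <- SimData(size_grp = c(40, 40, 40), p_inf = 50,
             p_noise = 750, p_out_noise = 75)
dat <- scale(d$x)

# cluster into k = 3 groups with sparsity parameter s = 6
res <- wrsk(data = dat, k = 3, s = 6)

# cross-tabulate the labels in d$lb against the detected clusters (0 = outlier)
table(d$lb, res$outclusters)

# plot the variable weights
plot(res$varweights)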

brodsa/wrsk documentation built on April 7, 2020, 6:12 a.m.