Description Usage Arguments Details Value Author(s) References See Also Examples
wrskGap
estimates the optimal sparsity parameter controlling the sparsity in the variable
weight vector for wrsk
. The approach uses gap statistic which is additionally adjusted by observation weights.
The statistic can be employed to estimate either
the sparsity parameter or both the sparsity parameter and the number of clusters in case no information about
data is available.
1 |
data |
A data matrix with n observations and p variables. |
K |
The number of clusters. |
S |
A numeric vector containing the values of the sparsity parameter. The values have to be
in increasing order, greater than 1 and smaller than |
npermute |
The number of permutations to calculate gap statistic, the default is 10. |
cores |
The number of cores for parallel computing, see examples. |
The estimation of the optimal parameters is based on a data permutation. The gap statistic compares the
the adjusted weighted between-cluster sum of squares (WBCSS) obtained by wrsk
with the expected value
under a null distribution. The expected value is calculated with respect to WBCSS computed on the clustering solution
achieved on the permuted datasets.
The optimal sparsity parameter is selected using a 1-standard error rule. If the optimal number of clusters needs to
be estimated, the gap statistic is calculated with respect to various choices of S
and K.
Consequently, for each number of cluster, an optimal sparsity parameter is selected. The optimal number of clusters is chosen based on
the values of gap statistic corresponding the optimal sparsity parameter.
The largest value of these values indicates the optimal number of clusters.
gap |
A numeric vector containing the values of gap statistic for each choice of the sparsity parameter |
se |
A numeric vector containing the standard error of log(WCSSS) with respect to the permuted data. |
s |
A numeric vector of the investigated values of the sparsity parameter. |
resFinal |
A list of objects corresponding to the output of
|
Sarka Brodinova <sarka.brodinova@tuwien.ac.at>
S. Brodinova, P. Filzmoser, T. Ortner, C. Breiteneder, M. Zaharieva. Robust and sparse k-means clustering for high-dimensional data. Submitted for publication, 2017. Available at http://arxiv.org/abs/1709.10012
D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490) 713-726, 2010.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | ## Not run:
# set the number of cores
cores <- detectCores() - 2
# generate data
d <- SimData(size_grp=c(40,40,40),p_inf=50,
p_noise=750,p_out_noise=75)
dat <- scale(d$x)
lb <- d$lb
# selection of optimal par. from K=2,...5 and S=1.1,1.6,..
GAP_sk <- rep(-Inf,5)
sim_nS <- seq(1.1,sqrt(ncol(dat)),0.5)
# --- K=2 ------
k2 <- wrskGap(data=dat,K=2,S=sim_nS,npermut=10,cores)
id.max.k2 <- which.max(k2$gap)
id.opt.k2 <- which(k2$gap[1:id.max.k2]> (k2$gap[id.max.k2] - k2$se[id.max.k2]))[1]
GAP_sk[2] <- k2$gap[id.opt.k2]
# ---- K=3 ------------
k3 <- wrskGap(data=dat,K=3,S=sim_nS,npermut=10,cores)
id.max.k3 <- which.max(k3$gap)
id.opt.k3 <- which(k3$gap[1:id.max.k3]> (k3$gap[id.max.k3] - k3$se[id.max.k3]))[1]
GAP_sk[3] <- k3$gap[id.opt.k3]
# ---- K=4 ------------
k4 <- wrskGap(data=dat,K=4,S=sim_nS,npermut=10,cores)
id.max <- which.max(k4$gap)
id.opt.k4 <- which(k4$gap[1:id.max]> (k4$gap[id.max] - k4$se[id.max]))[1]
GAP_sk[4] <- k4$gap[id.opt.k4]
# --- K=5 -----------
k5 <- wrskGap(data=dat,K=5,S=sim_nS,npermut=10,cores)
id.max <- which.max(k5$gap)
id.opt.k5 <- which(k5$gap[1:id.max]> (k5$gap[id.max] - k5$se[id.max]))[1]
GAP_sk[5] <- k5$gap[id.opt.k5]
# THE OPTIMAL CLUSTERING SOLUTION
id.opt.k <- which.max(GAP_sk)
opt.k.solution <- list(0,k2,k3,k4,k5,k6)[[id.opt.k]]
id.opt.s <- c(0,id.opt.k2,id.opt.k3,id.opt.k4,id.opt.k5,id.opt.k6)[id.opt.k]
opt.res <- opt.k.solution$resFinal[[id.opt.s]]
obsweights <- opt.res$obsweights
varweights <- opt.res$varweights
clusters <- opt.res$clusters
outclusters <- opt.res$outclusters
table(d$lb,outclusters)
plot(varweights)
plot(obsweights,col=d$lb+1)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.