
This function performs sparse weighted k-means on a set of observations described by numerical and/or categorical variables. It generalizes the sparse clustering algorithm introduced in Witten & Tibshirani (2010) to any type of data (numerical, categorical or a mixture of both). The weight of each variable indicates its importance in the clustering process; discriminant variables are selected by setting the weights of the other variables to 0.


`X` |
a dataframe of dimension |

`centers` |
an integer representing the number of clusters. |

`lambda` |
a vector of numerical values (or a single value) providing
a grid of values for the regularization parameter. If NULL (by default), the function computes its
own lambda sequence of length |

`nlambda` |
an integer indicating the number of values for the regularization parameter.
By default, |

`nstart` |
an integer representing the number of random starts in the k-means algorithm.
By default, |

`itermaxw` |
an integer indicating the maximum number of iterations for the inside
loop over the weights |

`itermaxkm` |
an integer representing the maximum number of iterations in the k-means
algorithm. By default, |

`renamelevel` |
a boolean. If TRUE (default option), each level of a categorical variable
is renamed as |

`verbose` |
an integer value. If |

`epsilonw` |
a positive numerical value. It provides the precision of the stopping
criterion over |

Sparse weighted k-means performs clustering on mixed data (numerical and/or categorical), and automatically selects the most discriminant variables by setting to zero the weights of the non-discriminant ones.

The mixed data are first preprocessed: numerical variables are scaled to zero mean and unit variance; categorical variables are transformed into dummy variables, which are then scaled, in mean and variance, with respect to the relative frequency of each level.
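The preprocessing step can be sketched in base R. This is only an illustration on made-up data, not the package's internal code: the data frame `df` is hypothetical, and the scaling of the dummy variables (centering on the relative frequency of the level, then dividing by its square root) is one common convention that is assumed here rather than confirmed by this page.

```r
# Toy mixed data (made up for illustration)
df <- data.frame(
  age = c(23, 35, 41, 58, 62, 29),
  sex = factor(c("F", "M", "F", "M", "M", "F"))
)

# Numerical variable: zero mean, unit variance
num <- scale(df[, "age", drop = FALSE])

# Categorical variable: one indicator (dummy) column per level
G <- model.matrix(~ sex - 1, data = df)

# Assumed convention: center each indicator on the relative frequency p
# of its level, then divide by sqrt(p)
p <- colMeans(G)
G_scaled <- sweep(sweep(G, 2, p, "-"), 2, sqrt(p), "/")

# Transformed matrix: one column for age, one column per level of sex
X_transformed <- cbind(num, G_scaled)

# Group index: age is its own group, both levels of sex share group 2
index <- c(1, rep(2, ncol(G_scaled)))
```

The vector `index` plays the same role as the `index` element described in the returned value: it records which columns of the transformed matrix belong to the same original variable.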

The algorithm is based on the optimization of a cost function which is the weighted between-class variance penalized
by a group L1-norm. The groups are implicitly defined: each numerical variable constitutes its own group, and the levels
associated with one categorical variable constitute a group. The importance of the penalty term may be adjusted through
the regularization parameter `lambda`.
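To make the structure of this cost function concrete, here is a minimal base R sketch. The exact form used by the package is not given on this page, so the sketch assumes a weighted sum of per-column between-class sums of squares and an unweighted group norm (the sum of the Euclidean norms of the weights within each group). The helper names `bss_per_column`, `group_l1` and `penalized_cost` are hypothetical.

```r
# Between-class sum of squares of each column for a given partition
bss_per_column <- function(X, cluster) {
  X <- as.matrix(X)
  overall <- colMeans(X)
  out <- numeric(ncol(X))
  for (k in unique(cluster)) {
    rows <- cluster == k
    out <- out + sum(rows) * (colMeans(X[rows, , drop = FALSE]) - overall)^2
  }
  out
}

# Group L1-norm: sum over groups of the Euclidean norm of the group's weights
# (assumed unweighted; the package may scale each group differently)
group_l1 <- function(w, index) {
  sum(tapply(w, index, function(v) sqrt(sum(v^2))))
}

# Assumed cost: weighted between-class variance minus the penalty
penalized_cost <- function(X, cluster, w, index, lambda) {
  sum(w * bss_per_column(X, cluster)) - lambda * group_l1(w, index)
}

# Made-up example: column 1 separates the two clusters, column 2 is constant
X_toy <- cbind(c(0, 0, 2, 2), c(1, 1, 1, 1))
cluster_toy <- c(1, 1, 2, 2)
bss_per_column(X_toy, cluster_toy)
penalized_cost(X_toy, cluster_toy, w = c(1, 0), index = c(1, 2), lambda = 0.5)
```

In this toy case only the first column carries between-class variance, so giving it all the weight costs a penalty proportional to `lambda` while the constant column contributes nothing.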

The output of the algorithm is twofold: one gets a partition of the data set and a vector of weights associated
with the variables. Some of the weights are equal to 0, meaning that the associated variables do not participate in the
clustering process. If `lambda` is equal to zero, no penalty is applied to the weighted between-class variance in the
optimization procedure. The larger the value of `lambda`, the larger the penalty term and the number of variables with
null weights. Furthermore, the weights associated with each level of a categorical variable are also computed.

Since it is difficult to choose the regularization parameter `lambda` without prior knowledge,
the function automatically builds a grid of parameters and finds a partition and a vector of weights for each
value of the grid.
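One way to inspect how sparsity evolves along the grid is to count the zero weights for each value of `lambda`. The sketch below assumes that the weight matrix has one column per value of the grid, as suggested by the indexing `out$W[,k]` in the examples; the helper `count_zero_weights` and the matrix `W_toy` are hypothetical and use made-up numbers.

```r
# Hypothetical helper: number of zero weights for each value of lambda,
# assuming one column of W per value of the grid
count_zero_weights <- function(W) colSums(W == 0)

# Made-up weight matrix: 3 variables (rows) x 3 values of lambda (columns)
W_toy <- cbind(c(0.6, 0.5, 0.3), c(0.8, 0.4, 0), c(1, 0, 0))
count_zero_weights(W_toy)
# 0 1 2
```

With a fitted object, the same helper would be applied as `count_zero_weights(out$W)`.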

Note also that the columns of the data frame `X` must be of class factor for
categorical variables.
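As an illustration of this requirement, categorical columns stored as numbers or character strings can be converted with `factor()` before calling the function. The data frame and column names below are made up for the example.

```r
# Made-up data: sex is coded 0/1 and cp is stored as character strings
df <- data.frame(
  age = c(63, 41, 57),
  sex = c(1, 0, 1),
  cp  = c("typical", "asymptomatic", "typical"),
  stringsAsFactors = FALSE
)

# Convert the categorical columns to class factor
cat_cols <- c("sex", "cp")
df[cat_cols] <- lapply(df[cat_cols], factor)

# Check the class of each column
sapply(df, class)
```

Numerically coded categories are the easiest to miss: left as numeric, a 0/1 column would be treated as a numerical variable rather than as a categorical one.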

`lambda` |
a numerical vector containing the regularization parameters (a grid of values). |

`W` |
a |

`Wm` |
a |

`cluster` |
a |

`sel.init.feat` |
a numerical vector of the same length as |

`sel.trans.feat` |
a numerical vector of the same length as |

`X.transformed` |
a matrix of size |

`index` |
a numerical vector indexing the variables, used to group together the levels of a categorical variable. |

`bss.per.feature` |
a matrix of size |

Witten, D. M., & Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713-726.

Chavent, M., Lacaille, J., Mourer, A., & Olteanu, M. (2020). Sparse k-means for mixed data via group-sparse clustering. ESANN proceedings.

`plot.spwkm`, `info_clust`, `groupsparsewkm`, `recodmix`

```
# sparsewkm and HDdata are provided by the vimpclust package
library(vimpclust)
data(HDdata)
# column 14 is excluded from the clustering
out <- sparsewkm(X = HDdata[,-14], centers = 2)
# grid of automatically selected regularization parameters
out$lambda
k <- 10
# weights of the variables for the k-th regularization parameter
out$W[,k]
# weights of the numerical variables and of the levels
out$Wm[,k]
# partitioning obtained for the k-th regularization parameter
out$cluster[,k]
# number of selected variables
out$sel.init.feat
# between-class variance on each variable
out$bss.per.feature[,k]
# between-class variance
sum(out$bss.per.feature[,k])
```
