# kmeansvar: k-means clustering of variables

In ClustOfVar: Clustering of Variables

## Description

Iterative relocation algorithm of k-means type which partitions a set of variables. Variables can be quantitative, qualitative or a mixture of both. The center of a cluster of variables is a synthetic variable, not a 'mean' as in classical k-means. This synthetic variable is the first principal component calculated by PCAmix. PCAmix is defined for a mixture of qualitative and quantitative variables and includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases.

The homogeneity of a cluster of variables is defined as the sum of the correlation ratios (for qualitative variables) and the squared correlations (for quantitative variables) between the variables and the center of the cluster, which is in all cases a numerical variable. Missing values are replaced by means for quantitative variables and by zeros in the indicator matrix for qualitative variables.
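For the purely quantitative case, the cluster center and the homogeneity criterion can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: it assumes that PCAmix reduces to a standardized PCA when all variables are quantitative (as stated above), uses `prcomp` as a stand-in, and works on simulated data whose names (`X`, `z`, `center`) are purely illustrative.

```r
# Sketch for ONE cluster of quantitative variables (simulated, hypothetical data).
set.seed(1)
n <- 100
z <- rnorm(n)
X <- cbind(a =  z + rnorm(n, sd = 0.5),
           b =  z + rnorm(n, sd = 0.5),
           c = -z + rnorm(n, sd = 0.5))

# Cluster center: first principal component of the standardized variables
# (assumption: standardized PCA stands in for PCAmix here).
pca <- prcomp(X, center = TRUE, scale. = TRUE)
center <- pca$x[, 1]

# Homogeneity: sum of squared correlations between each variable and the center.
sq_cor <- cor(X, center)^2
homogeneity <- sum(sq_cor)

# For standardized PCA this sum equals the first eigenvalue of the correlation matrix.
lambda1 <- eigen(cor(X))$values[1]
all.equal(homogeneity, lambda1)
```

Under these assumptions the homogeneity of a cluster is bounded above by the number of variables it contains, which is reached only when all variables are perfectly correlated with the center.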

## Usage

```r
kmeansvar(X.quanti = NULL, X.quali = NULL, init, iter.max = 150,
          nstart = 1, matsim = FALSE)
```

## Arguments

| Argument | Description |
| --- | --- |
| `X.quanti` | a numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). |
| `X.quali` | a categorical matrix of data, or an object that can be coerced to such a matrix (such as a character vector, a factor or a data frame with all factor columns). |
| `init` | either the number of clusters or an initial partition (a vector of integers indicating the cluster to which each variable is allocated). If `init` is a number, a random set of (distinct) columns in `X.quali` and `X.quanti` is chosen as the initial cluster centers. |
| `iter.max` | the maximum number of iterations allowed. |
| `nstart` | if `init` is a number, `nstart` is the number of random sets used in the process. |
| `matsim` | boolean; if `TRUE`, the matrices of similarities between variables in the same cluster are calculated. |
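To make the roles of `init` and `iter.max` concrete, here is a rough base-R sketch of an iterative relocation loop for quantitative variables only. It is an illustration under stated assumptions, not the code of `kmeansvar`: the function name `kmeans_vars_sketch` is hypothetical, `prcomp` replaces PCAmix, `nstart` is not implemented, and an initial partition is assumed to contain no empty cluster.

```r
# Hypothetical helper: k-means-type clustering of quantitative variables.
kmeans_vars_sketch <- function(X, init, iter.max = 150) {
  X <- scale(X)                       # standardize the variables
  p <- ncol(X)
  # Center of a group of columns: its first principal component.
  first_pc <- function(cols) prcomp(X[, cols, drop = FALSE])$x[, 1]

  if (length(init) == 1) {
    # init is a number of clusters: random distinct columns as initial centers.
    k <- init
    centers <- X[, sample(p, k), drop = FALSE]
    cluster <- rep(0L, p)
  } else {
    # init is an initial partition: compute the centers it implies.
    cluster <- as.integer(init)
    k <- max(cluster)
    centers <- sapply(seq_len(k), function(j) first_pc(which(cluster == j)))
  }

  for (it in seq_len(iter.max)) {
    # Allocation step: each variable joins the center it is most correlated with.
    r2 <- cor(X, centers)^2
    new_cluster <- max.col(r2, ties.method = "first")
    if (identical(new_cluster, cluster)) break   # partition is stable
    cluster <- new_cluster
    # Representation step: recompute the center of each non-empty cluster.
    for (j in seq_len(k)) {
      members <- which(cluster == j)
      if (length(members) > 0) centers[, j] <- first_pc(members)
    }
  }
  cluster
}

# Simulated data with two blocks of variables (illustrative only).
set.seed(123)
n <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- cbind(a = z1 + rnorm(n, sd = 0.3), b = z1 + rnorm(n, sd = 0.3),
           c = z1 + rnorm(n, sd = 0.3), d = z2 + rnorm(n, sd = 0.3),
           e = z2 + rnorm(n, sd = 0.3), f = z2 + rnorm(n, sd = 0.3))

part <- kmeans_vars_sketch(X, init = c(1, 1, 1, 1, 2, 2))  # start from a partition
part_rand <- kmeans_vars_sketch(X, init = 2)               # start from random centers
```

A random start can converge to a poor local optimum, which is the reason a procedure of this kind is usually repeated from several random sets, as `nstart` allows.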

## Details

If the quantitative and qualitative data are in the same data frame, the function `splitmix` can be used to automatically extract the qualitative and the quantitative data into two separate data frames.
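The kind of split described above can be approximated in base R as follows. This is only a sketch of the idea, not the implementation of `splitmix`; the data frame `df` and its columns are made up for illustration, and columns are classified simply by whether they are numeric.

```r
# Hypothetical mixed data frame.
df <- data.frame(
  height = c(170, 182, 165, 177, 190, 158),
  weight = c(65, 80, 54, 72, 95, 50),
  sex    = factor(c("f", "m", "f", "m", "m", "f")),
  smoker = factor(c("no", "no", "yes", "no", "yes", "no"))
)

# Assumption: numeric columns are quantitative, all others are qualitative.
is_num <- vapply(df, is.numeric, logical(1))
X.quanti <- df[, is_num, drop = FALSE]
X.quali  <- df[, !is_num, drop = FALSE]
```

The two resulting data frames have the shapes expected by the `X.quanti` and `X.quali` arguments.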

## Value

| Component | Description |
| --- | --- |
| `var` | a list of matrices of squared loadings, i.e. for each cluster of variables, the squared loadings on the first principal component of PCAmix. For quantitative variables (resp. qualitative), squared loadings are the squared correlations (resp. the correlation ratios) with the first PC (the cluster center). |
| `sim` | a list of matrices of similarities, i.e. for each cluster, the similarities between its variables. The similarity between two variables is defined as a squared cosine: the square of the Pearson correlation when the two variables are quantitative; the correlation ratio when one variable is quantitative and the other one is qualitative; the square of the canonical correlation between two sets of dummy variables when the two variables are qualitative. `sim` is `NULL` if `matsim` is `FALSE`. |
| `cluster` | a vector of integers indicating the cluster to which each variable is allocated. |
| `wss` | the within-cluster sum of squares for each cluster: the sum of the correlation ratios (for qualitative variables) and the squared correlations (for quantitative variables) between the variables and the center of the cluster. |
| `E` | the percentage of homogeneity which is accounted for by the partition in k clusters. |
| `size` | the number of variables in each cluster. |
| `scores` | an n by k numerical matrix which contains the k cluster centers. The center of a cluster is a synthetic variable: the first principal component calculated by PCAmix. The k columns of `scores` contain the scores of the n observation units on the first PCs of the k clusters. |
| `coef` | a list of the coefficients of the linear combinations defining the synthetic variable of each cluster. |
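The three similarity definitions used for `sim` can be sketched with base R on simulated data. This is an illustration under assumptions, not the package's code: the helpers `eta2` and `dummies` are hypothetical, the canonical correlation is taken to be the first (largest) one, and one dummy column per factor is dropped to avoid redundancy.

```r
# Simulated variables (illustrative only).
set.seed(42)
n <- 60
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
h <- factor(sample(c("u", "v"), n, replace = TRUE))
x <- rnorm(n) + as.integer(g)   # quantitative, related to g
y <- 0.5 * x + rnorm(n)         # quantitative, related to x

# Quantitative / quantitative: squared Pearson correlation.
sim_xy <- cor(x, y)^2

# Quantitative / qualitative: correlation ratio
# (between-group sum of squares over total sum of squares).
eta2 <- function(num, fac) {
  group_means <- ave(num, fac)  # each value replaced by its group mean
  sum((group_means - mean(num))^2) / sum((num - mean(num))^2)
}
sim_xg <- eta2(x, g)

# Qualitative / qualitative: squared first canonical correlation
# between the two sets of dummy variables.
dummies <- function(fac) model.matrix(~ fac)[, -1, drop = FALSE]
sim_gh <- max(cancor(dummies(g), dummies(h))$cor)^2
```

All three quantities lie between 0 and 1, which is what allows them to be mixed in a single similarity matrix.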

## References

Chavent, M., Liquet, B., Kuentz, V., Saracco, J. (2012), ClustOfVar: An R Package for the Clustering of Variables. Journal of Statistical Software, Vol. 50, pp. 1-16.

## See Also

`splitmix`, `summary.clustvar`, `predict.clustvar`

## Examples

```r
library(ClustOfVar)
data(decathlon)

# choice of the number of clusters
tree <- hclustvar(X.quanti = decathlon[, 1:10])
stab <- stability(tree, B = 60)

# a random set of variables is chosen as the initial cluster centers, nstart = 10 times
part1 <- kmeansvar(X.quanti = decathlon[, 1:10], init = 5, nstart = 10)
summary(part1)

# the partition from the hierarchical clustering is chosen as the initial partition
part_init <- cutreevar(tree, 5)$cluster
part2 <- kmeansvar(X.quanti = decathlon[, 1:10], init = part_init, matsim = TRUE)
summary(part2)
part2$sim
```