R/sparsewkm.R


#' @title Sparse weighted k-means
#' @export
#'
#' @description This function performs sparse weighted k-means on a set 
#' of observations described by numerical and/or categorical variables.
#' It generalizes the sparse clustering algorithm introduced in
#' Witten & Tibshirani (2010) to any type of data (numerical, categorical,
#' or a mixture of both). The weights of the variables indicate their importance
#' in the clustering process: non-discriminant variables have their weights
#' set to 0, so that the discriminant ones are automatically selected.
#' 
#' @param X a dataframe of dimension \code{n} (observations) by \code{p} (variables) with
#'  numerical, categorical or mixed data. 
#' @param centers an integer representing the number of clusters. 
#' @param lambda a vector of numerical values (or a single value) providing 
#' a grid of values for the regularization parameter. If NULL (by default), the function computes its 
#' own lambda sequence of length \code{nlambda} (see details).
#' @param nlambda an integer indicating the number of values for the regularization parameter. 
#' By default, \code{nlambda=20}.
#' @param nstart an integer representing the number of random starts in the k-means algorithm.
#'  By default, \code{nstart=10}. 
#' @param itermaxw an integer indicating the maximum number of iterations for the inner 
#' loop over the weights \code{w}. By default, \code{itermaxw=20}.
#' @param itermaxkm an integer representing the maximum number of iterations in the k-means 
#' algorithm. By default, \code{itermaxkm=10}.
#' @param renamelevel a boolean. If TRUE (default option), each level of a categorical variable
#' is renamed as \code{'variable_name=level_name'}.
#' @param epsilonw a positive numerical value. It provides the precision of the stopping 
#' criterion over \code{w}. By default, \code{epsilonw=1e-04}. 
#' @param verbose an integer value. If \code{verbose=0}, the function stays silent; if \code{verbose=1} (default option), it prints
#' whether the stopping criterion over the weights \code{w} is satisfied.
#'
#' @return \item{lambda}{a numerical vector containing the regularization parameters (a grid of values).}
#' @return \item{W}{a \code{p} by \code{length(lambda)} matrix. It contains the weights associated with each variable.}
#' @return \item{Wm}{a \code{q} by \code{length(lambda)} matrix, where \code{q} is the 
#' number of numerical variables plus the number of levels of the categorical 
#' variables. It contains the weights associated with the numerical variables and with the levels of the categorical
#'  variables.}
#' @return \item{cluster}{a \code{n} by \code{length(lambda)} integer matrix. It contains the 
#' cluster memberships, for each value of the regularization parameter.}
#' @return \item{sel.init.feat}{a numerical vector of the same length as \code{lambda}, giving the 
#' number of selected variables for each value of the regularization parameter.}
#' @return \item{sel.trans.feat}{a numerical vector of the same length as \code{lambda}, giving the 
#' number of selected numerical variables and levels of categorical variables, for each value of the regularization parameter.}
#' @return \item{X.transformed}{a matrix of size \code{n} by \code{q}, containing the transformed data: numerical variables scaled to 
#' zero mean and unit variance, categorical variables transformed into dummy variables, scaled (in mean and variance)
#' with respect to the relative frequency of the levels.}
#' @return \item{index}{a numerical vector indexing the variables, used to group together the levels of a
#'  categorical variable.}
#' @return \item{bss.per.feature}{a matrix of size \code{q} by \code{length(lambda)}. 
#' It contains the between-class variance computed on the \code{q} transformed variables (numerical variables and 
#' levels of categorical variables).}
#'
#' @details 
#' Sparse weighted k-means performs clustering on mixed data (numerical and/or categorical), and automatically
#' selects the most discriminant variables by setting to zero the weights of the non-discriminant ones. 
#' 
#' The mixed data is first preprocessed: numerical variables are scaled to zero mean and unit variance;
#' categorical variables are transformed into dummy variables, and scaled -- in mean and variance -- with
#' respect to the relative frequency of each level. 
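#'
#' To fix ideas, the transformation of a single categorical variable may be sketched as
#' follows. This is an illustrative sketch only: \code{recodmix} performs the actual
#' preprocessing internally, and the scaling shown here (centering each dummy variable by
#' its level frequency and dividing by the square root of that frequency) is one plausible choice.
#' \preformatted{
#' x  <- factor(c("a", "a", "b", "c"))
#' Z  <- model.matrix(~ x - 1)        # dummy coding of the levels
#' p  <- colMeans(Z)                  # relative frequency of each level
#' Zc <- sweep(Z, 2, p, "-")          # center by the level frequency
#' Zs <- sweep(Zc, 2, sqrt(p), "/")   # scale by the square root of the frequency
#' }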
#' 
#' The algorithm is based on the optimization of a cost function, the weighted between-class variance penalized
#' by a group L1-norm. The groups are implicitly defined: each numerical variable constitutes its own group, while the levels 
#' associated with a categorical variable jointly constitute a group. The importance of the penalty term may be adjusted through
#' the regularization parameter \code{lambda}.
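#'
#' Schematically, and up to the normalization constraints on the weights detailed in
#' Chavent et al. (2020), the criterion maximized over \code{w} may be written as
#' \deqn{\sum_{j} w_j b_j - \lambda \sum_{g} \sqrt{p_g} \|w_g\|_2,}
#' where \eqn{b_j} denotes the between-class variance of the \eqn{j}-th transformed variable,
#' \eqn{w_g} the sub-vector of weights of group \eqn{g}, and \eqn{p_g} the size of that group.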
#'  
#' The output of the algorithm is twofold: one gets a partitioning of the data set and a vector of weights associated
#' with each variable. Some of the weights are equal to 0, meaning that the associated variables do not participate in the
#' clustering process. If \code{lambda} is equal to zero, no penalty is applied to the weighted between-class variance in the 
#' optimization procedure. The larger the value of \code{lambda}, the larger the penalty term and the number of variables with
#' null weights. Furthermore, the weights associated with each level of a categorical variable are also computed.
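#'
#' For instance, with \code{out} denoting the output of \code{sparsewkm}, the variables
#' retained for the \code{k}-th value of the grid may be listed as follows (a small sketch
#' based on the returned \code{W} matrix):
#' \preformatted{
#' rownames(out$W)[out$W[, k] != 0]   # variables with non-zero weights
#' }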
#' 
#' Since it is difficult to choose the regularization parameter \code{lambda} without prior knowledge,
#' the function automatically builds a grid of parameters and finds a partition and a vector of weights for each 
#' value of the grid.
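#'
#' A user-supplied grid may also be passed through \code{lambda}; the values below are
#' purely illustrative, as the appropriate range depends on the data at hand:
#' \preformatted{
#' out <- sparsewkm(X, centers = 3, lambda = seq(0, 0.5, length.out = 10))
#' }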
#' 
#' Note also that the columns of the data frame \code{X} must be of class factor for 
#' categorical variables.
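#'
#' A quick base-R sketch for converting character columns into factors before the call:
#' \preformatted{
#' X[] <- lapply(X, function(col) if (is.character(col)) factor(col) else col)
#' }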
#' 
#' @references Witten, D. M., & Tibshirani, R. (2010). A framework for feature 
#' selection in clustering. Journal of the American Statistical Association, 
#' 105(490), 713-726.
#' @references Chavent, M. & Lacaille, J. & Mourer, A. & Olteanu, M. (2020). 
#' Sparse k-means for mixed data via group-sparse clustering, ESANN proceedings.
#' 
#' @seealso \code{\link{plot.spwkm}}, \code{\link{info_clust}}, 
#' \code{\link{groupsparsewkm}}, \code{\link{recodmix}}
#'
#' @examples
#' data(HDdata)
#' \donttest{
#' out <- sparsewkm(X = HDdata[,-14], centers = 2)
#' # grid of automatically selected regularization parameters
#' out$lambda
#' k <- 10
#' # weights of the variables for the k-th regularization parameter
#' out$W[,k]
#' # weights of the numerical variables and of the levels 
#' out$Wm[,k]
#' # partitioning obtained for the k-th regularization parameter
#' out$cluster[,k]
#' # number of selected variables
#' out$sel.init.feat
#' # between-class variance on each variable
#' out$bss.per.feature[,k]
#' # between-class variance
#' sum(out$bss.per.feature[,k])
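#' # transformed data matrix and the index grouping levels by original variable
#' head(out$X.transformed)
#' out$index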
#' }

sparsewkm <- function(X, centers, lambda = NULL, nlambda = 20, nstart = 10, 
                      itermaxw = 20, itermaxkm = 10, renamelevel = TRUE, 
                      verbose = 1, epsilonw = 1e-04) 
{
    call <- match.call()
    
    check_X(X)
    check_renamelevel(renamelevel)
    Xrec <- recodmix(X, renamelevel)
    
    res.out <- groupsparsewkm(X = Xrec$Z, centers = centers, lambda = lambda, nlambda = nlambda,
                              index = Xrec$index, sizegroup = TRUE, nstart = nstart,
                              itermaxw = itermaxw, itermaxkm = itermaxkm, scaling = FALSE,
                              verbose = verbose, epsilonw = epsilonw)
    # scaling is set to FALSE because recodmix already scales the variables
    # label the group weights: numerical variables first, then categorical ones
    rownames(res.out$Wg) <- c(names(Xrec$X)[!sapply(Xrec$X, is.factor)],
                              names(Xrec$X)[sapply(Xrec$X, is.factor)])
    # return the matched call captured above, rather than the inner one
    res <- list(call = call, type = "MixedSparse", W = res.out$Wg,
                Wm = res.out$W, cluster = res.out$cluster, lambda = res.out$lambda, 
                sel.init.feat = res.out$sel.groups,
                sel.trans.feat = res.out$sel.feat,
                X.transformed = Xrec$Z, index = Xrec$index, bss.per.feature = res.out$bss.per.feature)
    class(res) <- "spwkm"
    return(res)
}
