#' @title Group-sparse weighted k-means
#' @export
#'
#' @description This function performs group-sparse weighted k-means on a set
#' of observations described by numerical variables organized in groups.
#' It generalizes the sparse clustering algorithm introduced in
#' Witten & Tibshirani (2010) to groups of variables. The algorithm clusters the observations, while the groups of variables are assumed to be known a priori.
#' It computes a set of weights associated with the groups of variables, each weight
#' reflecting the importance of the corresponding group in the clustering process.
#'
#' @param X a numerical matrix or a dataframe of dimension \code{n} (observations) by \code{p}
#' (variables).
#' @param centers an integer representing the number of clusters.
#' @param lambda a vector of numerical values (or a single value) providing
#' a grid of values for the regularization parameter. If \code{NULL} (the default), the function computes its
#' own sequence of \code{nlambda} values (see details).
#' @param nlambda an integer indicating the number of values for the regularization parameter.
#' By default, \code{nlambda=20}.
#' @param index a vector of integers of size \code{p} providing the group membership
#' for each variable. By default, \code{index=1:ncol(X)}, i.e. no group structure (each variable forms its own group of size 1).
#' @param sizegroup a boolean. If TRUE, the group sizes (number of variables in each group) are taken into account in the penalty term (see details).
#' By default, \code{sizegroup=TRUE}.
#' @param nstart an integer representing the number of random starts in the k-means algorithm.
#' By default, \code{nstart=10}.
#' @param itermaxw an integer indicating the maximum number of iterations for the inside
#' loop over the weights \code{w}. By default, \code{itermaxw=20}.
#' @param itermaxkm an integer representing the maximum number of iterations in the k-means
#' algorithm. By default, \code{itermaxkm=10}.
#' @param scaling a boolean. If TRUE, variables are scaled to zero mean and unit variance. By default, \code{scaling=TRUE}.
#' @param epsilonw a positive numerical value providing the precision of the stopping criterion over \code{w}. By default, \code{epsilonw=1e-04}.
#' @param verbose an integer value. If \code{verbose=0}, the function stays silent; if \code{verbose=1} (default), it prints
#' whether the stopping criterion over the weights \code{w} is satisfied.
#'
#' @return \item{lambda}{a numerical vector containing the regularization parameters (a grid of values).}
#' @return \item{W}{a \code{p} by \code{length(lambda)} numerical matrix. It contains the weights associated with each variable.}
#' @return \item{Wg}{an \code{L} by \code{length(lambda)} numerical matrix, where \code{L} is the number of groups. It contains the weights associated with each group.}
#' @return \item{cluster}{a \code{n} by \code{length(lambda)} integer matrix. It contains the cluster memberships, for each value of the regularization parameter.}
#' @return \item{sel.feat}{a numerical vector of the same length as \code{lambda}, giving the number of selected variables for each value of the regularization parameter.}
#' @return \item{sel.groups}{a numerical vector of the same length as \code{lambda}, giving the number of selected groups of variables for each value of the regularization parameter.}
#' @return \item{Z}{a matrix of size \code{n} by \code{p} containing the scaled data if \code{scaling=TRUE}, and a copy of \code{X} otherwise.}
#' @return \item{bss.per.feature}{a matrix of size \code{p} by \code{length(lambda)}. It contains the between-class variance computed for each variable.}
#'
#'
#' @details
#' Group-sparse weighted k-means performs clustering on data described by numerical variables partitioned a priori into groups, and automatically selects the most discriminant groups by
#' setting the weights of the non-discriminant ones to zero.
#'
#' The algorithm is based on the optimization of a cost function defined as the weighted between-class variance penalized by a group L1-norm. The groups must be defined beforehand, e.g. through
#' expert knowledge. If there is no group structure (each group contains exactly one variable), the algorithm reduces to the sparse weighted k-means introduced in Witten & Tibshirani (2010).
#' The penalty term may take into account the size of the groups by setting \code{sizegroup=TRUE} (see Chavent et al. (2020) for further details on the mathematical expression of the
#' optimized criterion). The importance of the penalty term may be adjusted through the regularization parameter \code{lambda}. If \code{lambda=0}, there is no penalty applied to the
#' weighted between-class variance. The larger \code{lambda}, the larger the penalty term and the number of groups with null weights.
#'
#' The output of the algorithm is threefold: one gets a partitioning of the data, a vector of weights associated with each group, and a vector of weights associated with each
#' variable. Weights equal to zero imply that the associated variables or groups do not participate in the clustering process.
#'
#' Since it is difficult to choose the regularization parameter \code{lambda} without prior knowledge, the function automatically builds a grid of parameters and finds the partition
#' and the vectors of weights associated with each value in the grid.
#'
#' Note that when the regularization parameter is equal to 0 (no penalty applied), the output is different from that of a regular k-means, since the optimized criterion is a weighted
#' between-class variance and not the between-class variance only.
#'
#' @references Witten, D. M., & Tibshirani, R. (2010). A framework for feature
#' selection in clustering. Journal of the American Statistical Association,
#' 105(490), p.713-726.
#' @references Chavent, M. & Lacaille, J. & Mourer, A. & Olteanu, M. (2020).
#' Sparse k-means for mixed data via group-sparse clustering, ESANN proceedings.
#'
#' @seealso \code{\link{plot.spwkm}}, \code{\link{info_clust}}
#'
#'@examples
#' data(iris)
#' # define two groups of variables:
#' # "Sepal.Length" and "Sepal.Width" in group 1
#' # "Petal.Length" and "Petal.Width" in group 2
#' index <- c(1, 1, 2, 2)
#' # group-sparse k-means
#' \donttest{
#' out <- groupsparsewkm(X = iris[,-5], centers = 3, index = index)
#' # grid of regularization parameters
#' out$lambda
#' k <- 10
#' # weights of the variables for the k-th regularization parameter
#' out$W[,k]
#' # weights of the groups for the k-th regularization parameter
#' out$Wg[,k]
#' # partition obtained for the k-th regularization parameter
#' out$cluster[,k]
#' # between-class variance on each variable
#' out$bss.per.feature[,k]
#' # between-class variance, averaged over the variables
#' sum(out$bss.per.feature[,k])/length(index)
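#' # number of selected variables and selected groups along the grid of lambda
#' out$sel.feat
#' out$sel.groups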
#'
#' # one variable per group (equivalent to sparse k-means)
#' index <- 1:4 # default option in groupsparsewkm
#' # sparse k-means
#' out <- groupsparsewkm(X = iris[,-5], centers = 3, index = index)
#' # or
#' out <- groupsparsewkm(X = iris[,-5], centers = 3)
#' # group weights and variable weights are identical in this case
#' out$Wg
#' out$W
#' }
groupsparsewkm <- function(X, centers, lambda = NULL, nlambda = 20,
index = 1:ncol(X), sizegroup = TRUE,
nstart = 10, itermaxw = 20, itermaxkm = 10,
scaling = TRUE, verbose = 1, epsilonw = 1e-04)
{
call <- match.call()
check_fun_groupsparsw(X, lambda, nlambda, index, sizegroup, itermaxw, scaling, verbose)
X <- as.matrix(scale(X, center = scaling, scale = scaling))
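# reorder the columns so that variables belonging to the same group are contiguous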
X <- X[,order(index)]
index <- sort(index)
km <- stats::kmeans(X, centers, nstart = nstart, iter.max = itermaxkm)
clusterini <- km$cluster
bss.per.featureini <- weightedss(X, clusterini)$bss.per.feature/nrow(X)
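# largest useful value for lambda: the maximum over the groups of ||b_g||/sqrt(p_g),
# where b_g contains the between-class variances of the variables in group g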
maxlam <- max(stats::aggregate(bss.per.featureini, by=list(index), function(x){norm.vect(x)/sqrt(length(x))})[,2])
check_maxlambda(lambda)
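# if no grid is supplied, build one from 0 (no penalty) to just below maxlam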
if (is.null(lambda))
lambda <- seq(from = 0, to = (maxlam - 0.001), length.out = nlambda)
# loop over lambda
ls <- sapply(lambda, FUN = function(i) {
# loop over w
cluster1 <- clusterini
bss.per.feature <- bss.per.featureini
w1 <- rep(1/ncol(X), ncol(X)) # w1 holds the weights at the current iteration, w0 those of the previous one
w0 <- abs(stats::rnorm(ncol(X))) # random initialization so that the while condition holds at the first iteration
W_cl <- list()
niter <- 1
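# alternate between updating the partition (k-means on the weighted data) and
# updating the weights (group soft-thresholding), until the relative change in w falls below epsilonw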
while (sum(abs(w1 - w0))/sum(abs(w0)) > epsilonw && niter < itermaxw) {
if (niter != 1) {
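# rescale each column by sqrt(w1) so that squared Euclidean distances are weighted by w1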
Xw <- t(t(X)*sqrt(w1))
km <- stats::kmeans(Xw, centers, nstart = nstart, iter.max = itermaxkm)
cluster1 <- km$cluster
bss.per.feature <- weightedss(X, cluster1)$bss.per.feature/nrow(X)
}
w0 <- w1
cluster0 <- cluster1
# Soft thresholding operator
w1 <- groupsoft(bss.per.feature, i, index, sizegroup)
if (norm.vect(w1) == 0)
niter <- itermaxw # all weights shrunk to zero: exit the loop and flag this case (niter ends at itermaxw+1)
niter <- niter + 1
} # end loop over w
if (verbose == 1)
{
if (niter == itermaxw + 1)
print(paste0("all the feature weights are set to 0 for lambda = ", i))
if (niter == itermaxw)
print(paste0("the stopping criterion over w is not satisfied for lambda = ", i))
if (niter < itermaxw)
print(paste0("the stopping criterion over w is satisfied for lambda = ", i))
}
W_cl[[1]] <- w1
if (niter == itermaxw+1)
{
W_cl[[2]] <- rep(NA, length(cluster0))
W_cl[[3]] <- rep(NA, length(bss.per.feature))
} else
{
W_cl[[2]] <- cluster0
W_cl[[3]] <- bss.per.feature
}
W_cl
}) # end loop over lambda
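# ls is a 3 x length(lambda) list-matrix: row 1 holds the feature weights,
# row 2 the cluster memberships, row 3 the per-feature between-class variances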
W <- do.call(cbind, (ls[1, ]))
cluster <- do.call(cbind, (ls[2, ]))
bss.per.feature <- do.call(cbind, (ls[3, ]))
rownames(W) <- colnames(X)
colnames(W) <- paste("Lambda", 1:length(lambda))
Wg <- stats::aggregate(W, by=list(sort(index)), norm.vect)[,-1]
Wg <- as.matrix(Wg)
rownames(Wg) <- paste("Group", unique(sort(index)))
colnames(Wg) <- paste("Lambda", 1:length(lambda))
selected.features <- apply(W,2,function(x){sum(x>0)})
selected.groups <- apply(Wg,2,function(x){sum(x>0)})
rownames(bss.per.feature) <- colnames(X)
colnames(bss.per.feature) <- paste("Lambda", 1:length(lambda))
rownames(cluster) <- rownames(X)
colnames(cluster) <- paste("Lambda", 1:length(lambda))
out <- list(call = call, type="NumGroupSparse", lambda = lambda, W = W, Wg = Wg,
cluster = cluster, sel.feat= selected.features,
sel.groups=selected.groups,
Z = X, index = index, bss.per.feature = bss.per.feature)
class(out) <- "spwkm"
return(out)
}
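
# The helpers `norm.vect` and `groupsoft` used above are internal functions of the
# package, defined elsewhere. Minimal sketches of their assumed behavior are given
# below for readability; they are illustrations, not the package's actual code.

# `norm.vect` is assumed to compute the Euclidean (L2) norm of a vector.
norm.vect_sketch <- function(x) sqrt(sum(x^2))

# `groupsoft` is assumed to implement the group soft-thresholding operator of
# Chavent et al. (2020): the weights of each group are shrunk towards zero by a
# threshold proportional to lambda (scaled by the square root of the group size
# when sizegroup = TRUE), and the resulting vector is normalized to unit L2 norm.
groupsoft_sketch <- function(b, lambda, index = 1:length(b), sizegroup = TRUE) {
  w <- numeric(length(b))
  for (g in unique(index)) {
    ig <- which(index == g)
    ng <- sqrt(sum(b[ig]^2)) # L2 norm of the group's between-class variances
    thresh <- if (sizegroup) lambda * sqrt(length(ig)) else lambda
    if (ng > 0)
      w[ig] <- b[ig] * max(0, 1 - thresh / ng) # group-wise soft-thresholding
  }
  if (sum(w^2) > 0)
    w <- w / sqrt(sum(w^2)) # normalize the weights to unit L2 norm
  w
}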