R/MIXclustering.R

Defines functions MIXclustering

Documented in MIXclustering

#' @title
#'     Bayesian Nonparametric Model for Clustering with Mixed Scale Variables
#'
#' @description
#'     \code{MIXclustering} is used to perform cluster analysis of individuals using a Bayesian nonparametric mixture model that jointly models mixed scale data and accommodates for different sampling probabilities. The model is described in Carmona, C., Nieto-Barajas, L. E., Canale, A. (2016).
#'
#' @param Y Matrix or data frame containing the data to be clustered.
#'
#' @param var_type Character vector that indicates the type of variable in each column of x. Three possible types:
#' \itemize{
#'   \item "\strong{c}" for continuous variables. It is assumed to be Gaussian-shaped.
#'   \item "\strong{o}" for ordinal variables (binary and ordered categorical).
#'   \item "\strong{m}" for nominal variables (non-ordered categorical).
#' }
#'
#' @param sampling_prob vector with the sampling probabilities \eqn{\pi_i} for each individual in case that the data come from a complex survey sample. By default \eqn{\pi_i=1}.
#' @param expansion_f vector with the expansion factors, the reciprocal of the sampling probabilities, \eqn{w_i = 1/\pi_i}. If both \code{sampling_prob} and \code{expansion_f} are specified, preference is given to \code{sampling_prob}.
#'
#' @param n_iter_out Number of effective iterations in the MCMC procedure for clustering.
#' @param n_burn Number of iterations discarded as part of the burn-in period at the beginning MCMC procedure.
#' @param n_thin Number of iterations discarded for thinning the chain (reducing the autocorrelation). We keep 1 of every n_thin iterations.
#'
#' @param a_fix A numeric value to set the parameter \eqn{a} in the model. If \code{NULL} (default), the parameter \eqn{a} is assigned a prior distribution. See \code{details}.
#' @param alpha Hyperparameter in the prior distribution of \eqn{a}. See \code{details}.
#' @param d_0_a Hyperparameter in the prior distribution of \eqn{a}. See \code{details}.
#' @param d_1_a Hyperparameter in the prior distribution of \eqn{a}. See \code{details}.
#'
#' @param b_fix A numeric value to set the parameter \eqn{b} in the model. If \code{NULL} (default), the parameter \eqn{b} is assigned a prior distribution. See \code{details}.
#' @param d_0_b Hyperparameter in the prior distribution of \eqn{b}. See \code{details}.
#' @param d_1_b Hyperparameter in the prior distribution of \eqn{b}. See \code{details}.
#' @param eta Tuning parameter controlling the proposal in the \emph{Metropolis-Hastings} step for \eqn{b}.
#'
#' @param d_0_z Hyperparameter in the prior distribution of the variance for the latent variables. See \code{details}.
#' @param d_1_z Hyperparameter in the prior distribution of the variance for the latent variables. See \code{details}.
#' @param kappa Tuning parameter controlling the proposal in the \emph{Metropolis-Hastings} step for the variance of latent variables.
#'
#' @param delta Tuning parameter controlling the proposal in the \emph{Metropolis-Hastings} step for the correlation of latent variables.
#'
#' @param d_0_mu Hyperparameter in the prior distribution of the variance of the location in each cluster. See \code{details}.
#' @param d_1_mu Hyperparameter in the prior distribution of the variance of the location in each cluster. See \code{details}.
#'
#' @param log_file Specifies a file to save the details with the execution time and the parameters used.
#' @param keep_param_chains Indicates if the simulations of parameters \code{a}, \code{b}, \code{lambda} and \code{omega} should be returned as output.
#'
#' @details
#'
#' The model consists on a Bayesian non-parametric approach for clustering that is capable to combine different types of variables through the usage of associated continuous latent variables. The clustering mechanism is based on a location mixture model with a Poisson-Dirichlet (\eqn{PD}) process prior on the location parameters \eqn{\mu_i ; i=1,...,n} of the associated latent variables.
#'
#' Computational inference about the cluster allocation and the posterior distribution of the parameters are performed using MCMC.
#'
#' A full description of the model is in the article Carmona et al. (2016) (\url{https://arxiv.org/abs/1612.00083}). See \code{Reference}.
#'
#' The model consider an individual \eqn{y_i} that is characterized by a multivariate response of dimension \eqn{p}, i.e., \eqn{y_i=(y_{i,1},...,y_{i,p})}. The total number of variables \eqn{p} is divided into \eqn{c} continuous variables, \eqn{o} ordinal variables, and \eqn{m} nominal variables such that \eqn{p=c+o+m}.
#'
#' For the continuous variables, it is convenient that the variables have a real support. The user may have transformed the original values before using the function \code{MIXclustering}.
#'
#' For each response \eqn{y_i=(y_{i,1},...,y_{i,p})} (of dimension \eqn{p}) a corresponding latent vector \eqn{z_i=(z_{i,1},...,z_{i,q})} (of dimension \eqn{q}) is created, according to the following:
#' \itemize{
#'    \item For each continuous variable \eqn{y_{i,j} ; j=1,...,c} the algorithm uses a latent with the same values \eqn{z_{i,j}=y_{i,j}}.
#'    \item For each ordinal variable \eqn{y_{i,j} , j = c+1,...,c+o}, with \eqn{K_j} different ordered values, the algorithm creates one latent \eqn{z_{i,j}}, that allows to map the categories into continuous values divided by thresholds. For example, for a binary \eqn{y_j}, we have \eqn{y_j=0} iff \eqn{z_j<0} and \eqn{y_j=1} iff \eqn{z_j>0}
#'    \item For each nominal variable \eqn{y_{i,j} , j = c+o+1,...,c+o+m}, with \eqn{L_j} categories, the algorithm require \eqn{L_j-1} latent variables, whose relative order is consistent with the observed category.
#'    }
#'
#' The data may come from a complex survey sample where each individual \eqn{y_i} has known sampling probability \eqn{\pi_i, i = 1,...,n}. The reciprocal of these sampling probabilities, \eqn{w_i=1/\pi_i}, are called expansion factors or sampling design weights.
#'
#' The joint model for the latent vector is therefore:
#'
#' \eqn{(z_i | \mu_i,\Sigma)} ~ \eqn{N_q(\mu_i, \pi_i \Sigma )}
#'
#' (Note: the final model in Carmona et al. (2016) has variance \eqn{\kappa \pi_i \Sigma}. This value of \eqn{\kappa} can be used in the package through a transformed sampling probability vector \eqn{\pi_i^*=\kappa\pi_i})
#'
#' The clustering model will be based in an appropriate choice of the prior distribution on the \eqn{\mu_i}'s. A clustering of the \eqn{\mu_i}'s will induce a clustering of the \eqn{y_i}'s. Our prior on the \eqn{\mu_i}'s will be:
#'
#' \eqn{\mu_i | G}~\eqn{G}, iid for \eqn{i=1,...,n}
#'
#' Where \eqn{G}~\eqn{PD(a,b,G_0)} is a Poisson-Dirichlet process with parameters \eqn{a \in [0,1)}, \eqn{b>-a} and centering measure \eqn{G_0}. The Dirichlet and the normalized stable processes arise when \eqn{a=0} and when \eqn{b=0}, respectively.
#'
#' In consequence, this choice of prior implies that the \eqn{\mu_i}'s are exchangeable with marginal distribution \eqn{\mu_i}~\eqn{G_0} for all \eqn{i=1,...,n}.
#'
#' In our case, \eqn{G(\mu)=N(0,\Sigma_\mu)}, where \eqn{\Sigma_\mu = diag( \sigma^2_{\mu 1} ,...,\sigma^2_{\mu q} )}.
#'
#' The parameters \eqn{a} and \eqn{b} in the model define the \eqn{PD} process and therefore control the number of groups. These parameters can be fixed, resulting in a larger/smaller number of groups if assigned a larger/smaller value, respectively.
#'
#' There are 9 hyperparameters in the function that also characterize the prior distributions in the model:
#' \itemize{
#'     \item f(a) = alpha * I(a=0) + (1-alpha) * dbeta( a | d_0_a , d_0_a )
#'     \item f(b | a) = dgamma( b + a | d_0_b , d_1_b )
#'     \item sigma^2 ~ inverse-gamma( d_0_z , d_1_z)
#'     \item sigma^2_mu ~ inverse-gamma( d_0_mu , d_1_mu )
#' }
#'
#' The definition of these values also affect the number of resulting clusters since they affect the variance implied in the model.
#'
#' For example, increasing the values of \code{d_1_a} and \code{d_1_b} reduce the number of groups.
#'
#' Finally, the function parameters \code{eta}, \code{kappa}, \code{delta} are tuning parameters that control the acceptance rate in the random-walk MH steps of the new proposed values for the parameters \eqn{b}, \eqn{\Lambda_{j,j}} (variance of latents) and \eqn{\Omega_{i,j}} (correlation of latents). These parameters are not recommended to be changed (used in the internal functions: \code{sampling_b} , \code{sampling_Lambda_jj} , \code{sampling_Omega_ij}).
#'
#' @return
#' \code{MIXclustering} returns a S3 object of class "\code{MIXcluster}".
#'
#' The generic methods \code{\link{summary}} and \code{\link{plot}} are defined for this class.
#'
#' An object of class "MIXcluster" is a list containing the following components:
#' \describe{
#'     \item{\code{cluster}}{vector with the cluster allocation for each row in the data. It corresponds to the iteration which is Closest-To-Average (CTA) arrangement.}
#'     \item{\code{cluster_heterogeneity}}{Heterogeneity Measure (HM) for the cluster in the previous point. The HM measure is discussed in section 4 of Carmona et al. (2017).}
#'     \item{\code{Y.cluster.summary}}{a summary of the data divided by the allocation in \code{$cluster}.}
#'     \item{\code{Y.var_type}}{vector with the variable types in the data.}
#'     \item{\code{Y.na}}{vector specifying the rows with missing values.}
#'     \item{\code{Y.n}}{number of rows in the data.}
#'     \item{\code{Y.p}}{number of variables in the data.}
#'     \item{\code{MC.clusters}}{matrix with the cluster allocation for each row in the data. Each column corresponds to an effective iteration in the MCMC simulation of the model (after discarding burn-in and thinning iterations).}
#'     \item{\code{MC.clusters_heterogeneity}}{Heterogeneity Measure (HM) for all the clusters returned in \code{MC.clusters}.}
#'     \item{\code{cluster.matrix.avg}}{average similarity matrix of size \eqn{n} by \eqn{n}.}
#'     \item{\code{MC.values}}{a list with the simulated values of the chains for the parameters \eqn{a},\eqn{b},\eqn{\Lambda},\eqn{\Omega}.}
#'     \item{\code{MC.accept.rate}}{a named vector with the acceptance rates for each parameter. It includes iterations that are discarded in the burn-in period and thinning.}
#'     \item{\code{call}}{the matched call.}
#' }
#'
#' @seealso
#' \code{\link{summary.MIXcluster}} for a summary of the clustering results, \code{\link{plot.MIXcluster}} for graphical representation of results.
#'
#' @examples
#' 
#' ##############################
#' #     Simulation study 1     #
#' #    Carmona et al. (2017)   #
#' ##############################
#' 
#' # Data and parameters are discussed in section 5.1 of Carmona et al. (2017) #
#' 
#' # Set seed for reproducibility #
#' set.seed(0) 
#' 
#' 
#' # Specification of data Y #
#' help(Y_ex_5_1)
#' 
#' # Observable data
#' # Choose scenario: 1, 2, or 3
#' ex_i <- 1
#' 
#' # Prior specification
#' # Choose "a", "b" or "c"
#' param_j <- "c"
#' 
#' # Specify the data type that is being provided to the method
#' var_type_Y_ex_5_1 <- list( c("c","c","c"),
#'                            c("o","o"),
#'                            c("o","o","o","c") )
#' \dontrun{
#' cluster_ex <- MIXclustering( Y = as.matrix(Y_ex_5_1[[ ex_i ]]),
#'                              var_type=var_type_Y_ex_5_1[[ ex_i ]],
#'                              
#'                              n_iter_out=1500,
#'                              n_burn=200,
#'                              n_thin=3,
#'                              
#'                              alpha = meta_param_ex[ param_j, "alpha" ],
#'                              d_0_a = meta_param_ex[ param_j, "d_0_a"],
#'                              d_1_a = meta_param_ex[ param_j, "d_1_a" ],
#'                              d_0_b = meta_param_ex[ param_j, "d_0_b" ],
#'                              d_1_b = meta_param_ex[ param_j, "d_1_b" ],
#'                              eta = meta_param_ex[ param_j, "eta" ],
#'                              kappa = meta_param_ex[ param_j, "kappa" ],
#'                              delta = meta_param_ex[ param_j, "delta" ],
#'                              
#'                              d_0_z = meta_param_ex[ param_j, "d_0_z" ],
#'                              d_1_z = meta_param_ex[ param_j, "d_1_z" ],
#'                              d_0_mu = meta_param_ex[ param_j, "d_0_mu" ],
#'                              d_1_mu = meta_param_ex[ param_j, "d_1_mu" ] )
#' # Summary of clustering results
#' summary(cluster_ex)
#'
#' # Visualizing clustering results
#' plot(cluster_ex,type="heatmap")
#' plot(cluster_ex,type="chain")
#' 
#' # Comparison of cluster configurations #
#' # 1) Minimum distance with average MCMC iterations
#' # 2) Minimum Heterogeneity Measure (HM)
#' plot( x=jitter(cluster_ex$cluster),y=jitter(cluster_ex$clusterHMmin), col="#FF000080", pch=20,
#'       main=paste("Comparison of two relevant cluster configurations"),
#'       xlab="minimizes distance to average MCMC grouping", ylab="minimizes Heterogeneity Measure" )
#' 
#' # Comparison with the original clusters in the simulated data
#' plot(x=jitter(Z_latent_ex_5_1$cluster),
#'      y=jitter(cluster_ex$cluster),
#'      main=paste("Comparison real configuration with the model results"),
#'      xlab="Real cluster",
#'      ylab="Model cluster",
#'      pch=19, col="#FF000080")
#' }
#'
#' ##############################
#' #      Households data       #
#' #    Carmona et al. (2017)   #
#' ##############################
#' 
#' # Testing "MIXclustering" function with poverty.data #
#' # Data and parameters are discussed in section 5.3 of Carmona et al. (2017) #
#' 
#' # Set seed for reproducibility #
#' set.seed(0) 
#' 
#'
#' \dontrun{
#' # relevant variables for clustering households #
#' Y_names <- c("ict_norm",
#'              "ic_ali","ic_asalud","ic_cv",
#'              "ic_rezedu","ic_sbv","ic_segsoc",
#'              "niv_ed","tam_loc")
#' Y_var_type <- c("c","o","o","o","o","o","o","o","m")
#' 
#' # using only data from state 15 (Edomex) #
#' aux_subset <- rep(T,nrow(poverty.data))
#' aux_subset <- aux_subset & is.element(substr(poverty.data$folioviv,1,2),"15")
#'
#' Y_data <- poverty.data[aux_subset,Y_names]
#'
#' ### Sampling probability dependin on the scenario ###
#' # Scenario description in section 5.3 of Carmona et al. (2017) #
#' # Choose 1, 2 or 3 #
#' poverty_sampling_spec <- 3
#' 
#' 
#' if (poverty_sampling_spec == 1) {
#'   k <- 1
#'   sampling_prob_pov <- rep(1,nrow(Y_data))
#' } else if (poverty_sampling_spec == 2) {
#'   k <- 2 * mean(poverty.data[aux_subset,"factor_hog"])
#'   sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"]
#' } else if (poverty_sampling_spec == 3) {
#'   k <- 4 * mean(poverty.data[aux_subset,"factor_hog"])
#'   sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"]
#' }
#' 
#' cluster_poverty <- MIXclustering( Y=Y_data,
#'                                   var_type=Y_var_type,
#'                                   n_iter_out=1500,
#'                                   n_burn=200,
#'                                   n_thin=3,
#'                                   
#'                                   alpha = 0.5,
#'                                   d_0_a = 1, d_1_a = 1,
#'                                   d_0_b = 1, d_1_b = 1,
#'                                   
#'                                   eta = 2,
#'                                   kappa = 5,
#'                                   delta = 4,
#'                                   
#'                                   d_0_z = 2.1, d_1_z = 30,
#'                                   d_0_mu = 2.1, d_1_mu = 30,
#'                                   
#'                                   sampling_prob = k * sampling_prob_pov )
#' 
#' summary(cluster_poverty)
#' plot(cluster_poverty,type="heatmap")
#' plot(cluster_poverty,type="chain")
#' }
#'
#' @references
#' Carmona, C., Nieto-Barajas, L. E. & Canale, A. (2017). \emph{Model based approach for household clustering with mixed scale variables.} (\url{https://arxiv.org/abs/1612.00083})
#'
#' @importFrom stats aggregate as.formula cor rgamma rmultinom sd
#' @import plyr
#'
#' @useDynLib BNPMIXcluster
#' @importFrom Rcpp sourceCpp
#' @importFrom stats weighted.mean 
#' @export

MIXclustering <- function( Y,
                           var_type,
                           
                           n_iter_out=2000,
                           n_burn=100,
                           n_thin=2,
                           
                           a_fix=NULL,
                           alpha=0.5,
                           d_0_a=1, d_1_a=1,
                           
                           b_fix=NULL,
                           d_0_b=1, d_1_b=1,
                           eta=2,
                           
                           d_0_z=2.1, d_1_z=30,
                           kappa=5, delta=4,
                           
                           d_0_mu=2.1, d_1_mu=30,
                           
                           sampling_prob=NULL,
                           expansion_f=NULL,
                           
                           log_file=NULL,
                           keep_param_chains=FALSE ) {
  
  if( !any( c(is.matrix(Y),is.data.frame(Y)) ) ) { stop("Y has to be a Matrix or data frame, with each column representing a variable, and each row representing an individual") }
  
  USING_CPP = TRUE
  max.time=Inf
  
  #on.exit(browser())
  
  log_clock <- Sys.time()
  
  if( !is.null(log_file) ) {
    cat("\n
        ***** MIXclustering *****\n\n",
        "----- characteristics -----\n",
        "effective iterations = ",n_iter_out,"\n\n",
        "alpha=",alpha,"\n",
        "d_0_a=",d_0_a,"\n",
        "d_1_a=",d_1_a,"\n",
        "eta=",eta,"\n",
        "d_0_b=",d_0_b,"\n",
        "d_1_b=",d_1_b,"\n",
        "kappa=",kappa,"\n",
        "delta=",delta,"\n",
        "d_0_z=",d_0_z,"\n",
        "d_1_z=",d_1_z,"\n",
        "d_0_mu=",d_0_mu,"\n",
        "d_1_mu=",d_1_mu,"\n",
        "---------------------------\n",
        "Starting time:\n",as.character(log_clock),"\n\n",
        "---------------------------\n\n",
        "Iteration times:\n\n",
        "iter_i , elapsed_minutes , Sys.time\n",
        file=log_file)
  }
  
  dev_verbose = FALSE
  
  cl <- match.call()
  
  ######
  #   total number of iterations
  #####
  n_iter <- n_burn + 1 + (n_iter_out-1)*(n_thin)
  
  # possible variable classes that are allowed
  var_type_all <- c("c","o","m")
  
  # checking input consistency
  if( length(var_type) != ncol(Y) ) {
    cat('\nError: The number of columns in "Y" have to be equal to the lenght of vector "var_type"\n')
    stop('The number of columns in "Y" have to be equal to the lenght of vector "var_type"')
  }
  if( any(!is.element(var_type,var_type_all)) ) {
    cat('\nError: Elements in "var_type" have to be one of ',paste(var_type_all,collapse = ","),'\n')
    stop('Elements in "var_type" have to be one of ',paste(var_type_all,collapse = ","))
  }
  
  # Sorting Y columns
  if(!is.null(colnames(Y))) {
    Y_colnames_orig <- colnames(Y)
  } else {
    colnames(Y) <- Y_colnames_orig <- paste("Y",1:ncol(as.matrix(Y)),sep=".")
  }
  Y_orig <- Y
  var_type_orig <- var_type
  
  Y_new_order <- c( which(is.element(var_type,var_type_all[1])),
                    which(is.element(var_type,var_type_all[2])),
                    which(is.element(var_type,var_type_all[3])))
  var_type <- var_type[Y_new_order]
  Y <- Y[,Y_new_order,drop=F]
  
  ##### Missing data #####
  # Only consider rows with complete information, i.e. no missing data allowed
  aux_na_Y <- apply(!is.na(Y),1,all)
  
  Y <- Y[aux_na_Y,,drop=F]
  if(sum(!aux_na_Y)>0){
    cat(sum(!aux_na_Y),' rows with missing data were removed\n')
  }
  
  # vector for returning Z to Y incorporating back the NAs
  map_z_to_orig <-cumsum(aux_na_Y)
  map_z_to_orig[duplicated(map_z_to_orig)]<-NA
  
  n <- nrow(Y)
  p <- ncol(Y)
  
  # continuous latent variables
  n_c <- sum( is.element( var_type, var_type_all[1] ) )
  # ordinal latent variables (categorical, ordered)
  n_o <- sum( is.element( var_type, var_type_all[2] ) )
  # nominal variables (categorical, unordered)
  n_m <- sum( is.element( var_type, var_type_all[3] ) )
  # nominal latent variables (categorical, unordered)
  n_m_l = rep(NA,n_m)
  if(n_m>0) {
    for (i in 1:n_m ) {
      # i=1
      aux_i = which( is.element( var_type, var_type_all[3] ) )[i]
      # k-1 latents are needed for a categorical variable with k categories
      n_m_l[i] = length(unique(Y[,aux_i]))-1
    }
    rm(i,aux_i)
  }
  # total number of latent variables in Z
  n_q <- n_c+n_o+sum(n_m_l)
  
  
  ### Transforms Y to numerical matrix ###
  Y_numerical <- matrix(NA,nrow(Y),ncol(Y))
  if( n_c>0 ) {
    for(j in which(is.element( var_type, var_type_all[1] ))) {
      Y_numerical[,j] <- as.numeric(Y[,j])
    }
  }
  if( n_o>0 ) {
    for(j in which(is.element( var_type, var_type_all[2] ))) {
      Y_numerical[,j] <- match(Y[,j],sort(unique(Y[,j])))
    }
  }
  if( n_m>0 ) {
    for(j in which(is.element( var_type, var_type_all[3] ))) {
      Y_numerical[,j] <- match(Y[,j],sort(unique(Y[,j])))
    }
  }
  Y <- Y_numerical
  rm(Y_numerical,j)
  
  # changes the colnames of Y for simplicity and standarization
  
  colnames(Y) <- paste("var_",
                       c(rep(var_type_all[1],n_c),rep(var_type_all[2],n_o),rep(var_type_all[3],n_m)),"_",
                       formatC( unlist(mapply(seq,1,c(n_c,n_o,n_m),length.out=c(n_c,n_o,n_m))) , width = 2, flag = '0'),
                       sep="")
  
  if(is.null(sampling_prob)) {
    
    if(is.null(expansion_f)) {
      expansion_f<-rep(1,n)
    } else {
      if(length(expansion_f)!=n) {
        cat('\nError: There is an inconsistency between "expansion_f" and the number of rows in "Y"\n')
        stop('There is an inconsistency between "expansion_f" and the number of rows in "Y"')
      }
    }
    
    sampling_prob <- 1/expansion_f # Sample design probabilities
    
    
    
    if(length(sampling_prob)!=n) {
      cat('\nError: There is an inconsistency between "sampling_prob" and the number of rows in "Y"\n')
      stop('There is an inconsistency between "sampling_prob" and the number of rows in "Y"')
    }
  } else {
    if(is.null(expansion_f)) {
    expansion_f <- 1/sampling_prob
    } else {
      stop('Only one of "sampling_prob" or "expansion_f" should be specified')
    }
  }
  
  #sampling_prob <- sampling_prob/sum(sampling_prob) # should add up 1
  
  if(n_c>0){
    ### scale continuos variables to have mean 0 and sd 1 ###
    Y[,1:n_c] <- scale(Y[,1:n_c],center=T,scale=T)
  }
  
  #####     Simulating latent variables 'Z' from 'Y'     #####
  if(FALSE) {
    # DANGER: This is not ready yet! #
    Z <- get_latents_cpp( Y=as.matrix(Y),
                          var_type = match(var_type,var_type_all),
                          mu_Z = matrix(0,nrow=n,ncol=n_q),
                          sigma_Z = diag(1,nrow=n_q,ncol=n_q),
                          Z_ini = matrix(0,1,1) )
  } else {
    Z <- get_latents( Y=Y,
                      var_type=var_type,
                      USING_CPP=USING_CPP )$Z
  }
  
  ##### Start: Initializing the chain #####
  
  # sigma_Z: variances in Lambda
  sigma_Z_diag <- apply(Z,2,sd)
  Lambda <- diag(sigma_Z_diag,nrow=length(sigma_Z_diag),ncol=length(sigma_Z_diag))
  rm(sigma_Z_diag)
  
  # unit variances for "sigma_Z"
  aux_var1_Z <- rep(F,n_q)
  if( n_o>0 ){
    # unit variance for binary ordinal variables
    # aux_var1_Z[n_c+which(K_o==2)] <- T
    # unit variance for ALL ordinal variables
    aux_var1_Z[(n_c+1):(n_c+n_o)] <- T
  }
  if( n_m>0 ){
    # unit variance for categorical (not ordinal) variables
    aux_var1_Z[(n_c+n_o+1):(n_c+n_o+sum(n_m_l))] <- T
  }
  diag(Lambda)[aux_var1_Z] <- 1
  
  #cat("aux_var1_Z: ",paste(aux_var1_Z,collapse=", "),"\n")
  
  # sigma_Z: correlations in Omega
  Omega <- cor(Z)
  
  # sigma_Z
  sigma_Z <- Lambda %*% Omega %*% Lambda
  
  # checking that sigma_Z[j,k] <= sigma_Z[j,j] * sigma_Z[k,k]
  aux_cond_sigma <- sigma_Z < outer(diag(sigma_Z),diag(sigma_Z))
  diag(aux_cond_sigma) <- T
  if ( !all(aux_cond_sigma) ) {
    cat('\nError: It must happen that: sigma_Z[j,k] < sigma_Z[j,j] * sigma_Z[k,k] \n')
    stop('It must happen that: sigma_Z[j,k] < sigma_Z[j,j] * sigma_Z[k,k] ')
  }
  
  # mu_Z <- Z
  # it is not necessary to define an object for mu_Z, since it is is fully characterized by
  # its unique values 'mu_star'
  # and its mapping 'mu_star_map'
  
  # matrix with unique values of mu_Z
  if(n_c>0) {
    mu_star <- Z[!duplicated(Z[,1]),,drop=F]
  } else {
    mu_star <- Z[!duplicated(Z),,drop=F]
  }
  
  # mapping each mu_Z to mu_star
  mu_star_map <- match(data.frame(t(Z)), data.frame(t(mu_star)))
  
  # how many repeated values of each mu_star
  mu_star_n_r <- as.numeric(table(mu_star_map))
  
  # 'sigma_mu'
  sigma_mu <- diag( diag(sigma_Z) )
  
  if( (nrow(mu_star)!=length(mu_star_n_r)) | (nrow(Z)!=length(mu_star_map)) ) {
    cat('Error: there is an inconsistency between "mu_star", "mu_star_n_r" and "mu_star_map"\n')
    stop('there is an inconsistency between "mu_star", "mu_star_n_r" and "mu_star_map"')
  }
  
  a <- ifelse( is.null(a_fix), 0, a_fix )
  
  b <- ifelse( is.null(b_fix), 0.1, b_fix )
  
  ##### End: Initializing the chain #####
  
  #####
  #mu_Z_sim <- array(NA,dim=c(n,n_q,n_iter))
  #sigma_Z_sim <- array(NA,dim=c(n_q,n_q,n_iter))
  mu_star_map_sim <- data.frame(iter_0=1:n)
  #colnames(mu_star_map_sim) <- paste("iter_",1:ncol(mu_star_map_sim),sep="")
  #rownames(mu_star_map_sim) <- NULL
  
  #mu_star_n_r_sim <- list()
  #mu_star_probs <- array(dim=c(n,n+1,n_iter))
  
  ##### Monitoring acceptance rate for MH #####
  
  if(keep_param_chains) {
    Lambda_sim <- matrix(as.numeric(NA),nrow=n_iter,ncol=ncol(sigma_Z) )
    Omega_sim <- array(as.numeric(NA),dim=c(nrow(sigma_Z),ncol(sigma_Z),n_iter))
    a_sim <- as.numeric(NULL)
    b_sim <- as.numeric(NULL)
    
    #Lambda_accept <- matrix(as.numeric(NA),nrow=n_iter,ncol=ncol(sigma_Z) )
    #Omega_accept <- array(as.numeric(NA),dim=c(nrow(sigma_Z),ncol(sigma_Z),n_iter))
    #a_accept <- as.numeric(NULL)
    #b_accept <- as.numeric(NULL)
  } else {
    Lambda_sim <- NULL
    Omega_sim <- NULL
    a_sim <- NULL
    b_sim <- NULL
    
    #Lambda_accept <- NULL
    #Omega_accept <- NULL
    #a_accept <- NULL
    #b_accept <- NULL
  }
  
  
  #if(dev_verbose) {
  cat("*** Clustering estimation started ***\n")
  #cat('     Sys.time() = ',as.character(Sys.time()),sep='')
  #}
  time_sim <- Sys.time()
  
  lsa_iter <- c(ls(),"iter_i","lsa_iter","pb")
  
  for( iter_i in 1:n_iter) {
    
    if(iter_i==1) {
      # initializing progress bar
      pb <- plyr::create_progress_bar("time")
      pb$init(n_iter)
      
      on.exit(pb$term())
    }
    
    # iter_i<-1
    clock_iter_i <- Sys.time()
    if(dev_verbose) {
      cat('\n**-- iter_i=',iter_i,' --**\n\n')
    }
    
    
    ### (a) Sampling mu_Z ###
    
    #mu_Z_new <- mu_Z ; mu_Z_new[] <- NA
    if(dev_verbose) {
      cat('Sampling "mu_Z": ',nrow(Z),' values...\n',sep='')
    }
    # Creating a copy to keep control of original values of mu_star
    mu_star_new <- mu_star
    mu_star_map_new <- mu_star_map
    mu_star_n_r_new <- mu_star_n_r
    for( i in 1:nrow(Z) ) {
      # i<-1
      if(dev_verbose) {
        if (i%%10==0) { cat( i , ", " ) }
      }
      # getting the probabilities #
      v_i <- solve( solve(sampling_prob[i] * sigma_Z) + solve(sigma_mu) )
      # u_i <- v_i %*% solve(sampling_prob[i] * sigma_Z) %*% matrix(Z[i,],nrow=n_q,ncol=1)
      u_i <- v_i %*% solve(sampling_prob[i] * sigma_Z) %*% matrix(mu_star[mu_star_map[i],],nrow=n_q,ncol=1)
      
      # eliminates the non-simmetry due to numerical precision
      if( max( abs(v_i-t(v_i)) )<1e-7 ) {v_i <- v_i - (v_i-t(v_i))/2} else {browser()}
      
      # number of diferent values of mu_star[-1]
      mu_star_n_r_temp <- mu_star_n_r
      mu_star_n_r_temp[mu_star_map[i]] <- mu_star_n_r_temp[mu_star_map[i]]-1
      r_i <- sum(mu_star_n_r_temp>0)
      
      D_0 <- ( b + a * r_i ) * dmvnrm_arma( x=Z[i,,drop=F] , mean=rep(0,n_q), sigma=sampling_prob[i]*sigma_Z+sigma_mu )
      D_values <- as.numeric(NULL)
      for(r in 1:length(mu_star_n_r_temp)) {
        D_values[r] <- ( mu_star_n_r_temp[r] - a ) * dmvnrm_arma( x=Z[i,,drop=F] , mean=mu_star[r,], sigma=sampling_prob[i]*sigma_Z )
      }
      D_values[D_values<0] <- 0
      
      if(sum( D_0, D_values )==0) {
        cat('\nError: There is a problem with "D_r" in the simulation of mu_Z \n: sum( D_0, D_values )==0\n')
        stop('There is a problem with "D_r" in the simulation of mu_Z \n: sum( D_0, D_values )==0')
      }
      
      # Calculating probabilities
      prob_0 <- D_0 / sum( D_0, D_values )
      prob_values <- D_values / sum( D_0, D_values )
      
      #mu_star_probs[i,1+0:length(prob_values),iter_i] <- c(prob_0,prob_values)
      
      # simulating mu #
      mu_class <- which( is.element( rmultinom(n=1,size=1,prob=c(prob_0,prob_values)), 1 ) )
      if(mu_class==1) {
        # generates a new value for mu_Z
        mu_star_new <- rbind( mu_star_new, mvtnorm::rmvnorm( n=1 , mean=u_i, sigma=v_i ) )
        mu_star_map_new[i] <- nrow(mu_star_new)
        mu_star_n_r_new[mu_star_map[i]] <- mu_star_n_r_new[mu_star_map[i]]-1
        mu_star_n_r_new <- c(mu_star_n_r_new,1)
        
      } else {
        # get an existing value
        #cat(i,mu_class-1,"\n")
        mu_star_n_r_new[mu_star_map[i]] <- mu_star_n_r_new[mu_star_map[i]]-1
        mu_star_n_r_new[mu_class-1] <- mu_star_n_r_new[mu_class-1]+1
        mu_star_map_new[i] <- mu_class-1
      }
      if(any(is.na(mu_star_new))){
        stop("There is a problem sampling from mu_Z (code 1)")
      }
    } #finished simulating from mu_Z
    
    # Identifies values of mu_star that have no Z elements pointing to them
    aux_no_map <- is.element(mu_star_n_r_new,0)
    
    # Replaces new simulated values of mu_star
    mu_star <- mu_star_new[!aux_no_map,,drop=F]
    mu_star_map <- mu_star_map_new - cumsum(aux_no_map)[mu_star_map_new]
    mu_star_n_r <- mu_star_n_r_new[!aux_no_map]
    
    if(dev_verbose) {
      cat('...Done! \n')
    }
    
    ### (b) Sampling mu_star ###
    
    # Creating a copy to keep control of original values of mu_star
    mu_star_new <- mu_star
    mu_star_new[] <- NA
    if(dev_verbose) {
      cat('Sampling "mu_star": ',nrow(mu_star),' values...\n',sep='')
    }
    sigma_mu_inv <- solve(sigma_mu)
    sigma_Z_inv <- solve(sigma_Z)
    for( j in 1:nrow(mu_star) ) {
      # j<-1
      if(dev_verbose) {
        if (j%%10==0) { cat( j , ", " ) }
      }
      I_j <- is.element(mu_star_map,j)
      
      if(T){
        v_i_star <- solve(sum(1/sampling_prob[I_j]) * sigma_Z_inv + sigma_mu_inv )
      } else {
        for( k in 1:sum(I_j) ) {
          if(k==1) {
            v_i_star <- (1/sampling_prob[I_j])[k] * sigma_Z_inv + sigma_mu_inv
          } else {
            v_i_star <- v_i_star + (1/sampling_prob[I_j])[k] * sigma_Z_inv + sigma_mu_inv
          }
        }
        v_i_star <- solve(v_i_star)
        rm(k)
      }
      weighted_z <- apply( matrix(1/sampling_prob[I_j],nrow=sum(I_j),ncol=n_q,byrow=F) * matrix(Z[I_j,],nrow=sum(I_j),ncol=n_q), 2, sum )
      weighted_z <- matrix(weighted_z,nrow=n_q,ncol=1)
      u_i_star <- v_i_star %*% solve(sigma_Z) %*% weighted_z
      
      mu_star_new[j,] <- mvtnorm::rmvnorm(1,mean=u_i_star,sigma=v_i_star)
    }
    if(dev_verbose) {
      cat('...Done! \n')
    }
    # Replaces new simulated values of mu_star
    mu_star <- mu_star_new
    
    ### (c) Sampling "sigma_mu" ###
    if(dev_verbose) {
      cat( 'Sampling "sigma_mu":\n' )
    }
    sigma_mu_new <- sigma_mu
    sigma_mu_new[] <- 0; diag(sigma_mu_new) <- NA
    for( j in 1:n_q ) {
      if(dev_verbose) {
        cat(j,", ")
      }
      shape_gamma <- d_0_mu+length(mu_star_n_r)/2
      rate_gamma <- d_1_mu+(1/2)*sum(mu_star[,j]^2)
      sigma_mu_new[j,j] <- 1/rgamma( n=1, shape=shape_gamma, rate=rate_gamma )
    }
    sigma_mu <- sigma_mu_new
    
    if(dev_verbose) {
      cat('...Done! \n')
    }
    
    ### Sampling sigma_Z ###
    
    if( !matrixcalc::is.positive.definite(sigma_Z) ) {
      cat("*****\nProcess finished because 'sigma_Z' is not positive definite!\n*****");
      return()
    }
    
    sigma_Z_new <- sigma_Z
    Lambda_new <- Lambda
    Omega_new <- Omega
    
    ### (d) Sampling sigma_Z, variances in Lambda ###
    if(dev_verbose) {
      cat( 'Sampling sigma_Z: variances in Lambda \n' )
    }
    
    # all elements of sigma_Z with variance different of 1
    for( j_sigma in which(!aux_var1_Z)) {
      # j_sigma <- 1
      #for( j_sigma in 1:n_q) {
      
      if(dev_verbose) {
        cat(j_sigma,", ")
      }
      
      aux_Lambda <-  sampling_Lambda_jj( n_sim_mh=1, sigma_jj_ini=Lambda_new[j_sigma,j_sigma]^2,j=j_sigma,
                                         d_0_z=d_0_z, d_1_z=d_1_z, kappa=kappa,
                                         Z=Z, mu_Z=mu_star[mu_star_map,,drop=F], sigma_Z=sigma_Z_new, sampling_prob=sampling_prob,
                                         max.time=max.time,n.burn=0,
                                         verbose=F,
                                         USING_CPP=USING_CPP )
      
      Lambda_new[j_sigma,j_sigma] <- sqrt(aux_Lambda[[1]])
      #Lambda_accept[iter_i,j_sigma] <- aux_Lambda[[2]]
      
      # Element with unitary variance in Lambda_new
      # diag(Lambda_new)[aux_var1_Z] <- 1
      
      # updates 'sigma_Z' after updating each element of 'Lambda'
      # because 'Lambda' sampling IS dependent of 'sigma_Z'
      sigma_Z_new <- Lambda_new %*% Omega_new %*% Lambda_new
      
      # eliminates the non-simmetry due to numerical precision
      if( max( abs(sigma_Z_new-t(sigma_Z_new)) ) < 1e-7 ) {sigma_Z_new <- sigma_Z_new - (sigma_Z_new-t(sigma_Z_new))/2} else {browser()}
      
      if( !matrixcalc::is.positive.definite(sigma_Z_new) ) {
        cat("*****\nProcess finished because 'sigma_Z_new' is not positive definite!\n*****");
        return()
      }
    }
    
    if(dev_verbose) {
      cat('...Done! \n')
    }
    
    
    ### (e) Sampling sigma_Z, correlations in Omega ###
    if ( all(dim(Omega)>c(1,1)) ) {
      if(dev_verbose) {
        cat( 'Sampling sigma_Z: correlations in Omega \n' )
      }
      
      if( !matrixcalc::is.positive.definite(Omega_new) ) {
        cat("*****\nProcess finished because 'Omega_new' is not positive definite!\n*****");
        return()
      }
      
      for(i_omega in 2:dim(Omega_new)[1]) {
        for(j_omega in 1:(i_omega-1) ) {
          if(dev_verbose) {
            cat('(',i_omega,',',j_omega,') ',sep='')
          }
          
          aux_omega_ij_new <- sampling_Omega_ij( n=1,
                                                 Omega.ini=Omega_new,
                                                 i=i_omega,
                                                 j=j_omega,
                                                 delta=delta,
                                                 Z=Z,
                                                 mu_Z=mu_star[mu_star_map,,drop=F],
                                                 Lambda=Lambda_new,
                                                 sampling_prob=sampling_prob,
                                                 n.burn=0,
                                                 n.thin=0,
                                                 max.time=max.time,
                                                 USING_CPP=USING_CPP )
          
          
          omega_ij_new <- aux_omega_ij_new[[1]]
          #Omega_accept[i_omega,j_omega,iter_i] <- Omega_accept[j_omega,i_omega,iter_i] <- aux_omega_ij_new[[2]]
          
          Omega_new[i_omega,j_omega] <- Omega_new[j_omega,i_omega] <- omega_ij_new
          if(dev_verbose) {
            if(j_omega==(i_omega-1)) {cat('\n')}
          }
          
          if( !matrixcalc::is.positive.definite(Omega_new) ) {
            cat("     Process finished because 'Omega_new' is not positive definite!\n");
            return()
          }
          
        }
      }
      
      # updates 'sigma_Z' after all element in 'Omega' are sampled
      # only once because 'Omega' sampling IS NOT dependent of 'sigma_Z'
      sigma_Z_new <- Lambda_new %*% Omega_new %*% Lambda_new
      
      if( max( abs(sigma_Z_new-t(sigma_Z_new)) ) < 1e-7 ) {
        sigma_Z_new <- sigma_Z_new - (sigma_Z_new-t(sigma_Z_new))/2
      } else {
        stop("There is a problem sampling from sigma_Z (code 1)")
      }
      
      if( !matrixcalc::is.positive.definite(sigma_Z_new) ) {
        cat("*****\nProcess finished because 'sigma_Z_new' is not positive definite!\n*****");
        return()
      }
      
      # eliminates the non-simmetry due to numerical precision
      if( max( abs(sigma_Z_new-t(sigma_Z_new)) ) < 1e-7 ) {
        sigma_Z_new <- sigma_Z_new - (sigma_Z_new-t(sigma_Z_new))/2
      } else {
        stop("There is a problem sampling from sigma_Z (code 2)")
      }
      
      #if( !matrixcalc::is.positive.definite(sigma_Z_new) ) {
      #  cat("*****\nProcess finished because 'sigma_Z' is not positive definite!\n*****");
      #  return()
      #}
      
      sigma_Z <- sigma_Z_new
      Lambda <- Lambda_new
      Omega <- Omega_new
      if(dev_verbose) {
        cat('...Done! \n')
      }
    }
    
    # (f) Sampling "a"
    
    if(is.null(a_fix)){
      
      if(dev_verbose) {
        cat( 'Sampling "a":\n' )
      }
      
      aux_a_new <- sampling_a( n=1, a.ini=a,
                               b=b, alpha=alpha, d_0_a=d_0_a, d_1_a=d_1_a,
                               mu_star_n_r=mu_star_n_r,
                               max.time=max.time, n.burn=0,
                               verbose=F,
                               USING_CPP=USING_CPP)
      
      a_new <- aux_a_new$a.chain
      #a_accept <- c( a_accept , aux_a_new$accept.indic )
      
    } else {
      a_new <- a_fix
      #a_accept <- NULL
    }
    
    a <- a_new
    
    if(dev_verbose) {
      cat('...Done! \n')
    }
    
    # (g) Sampling "b"
    
    if(is.null(b_fix)){
      if(dev_verbose) {
        cat( 'Sampling "b":\n' )
      }
      
      aux_b_new <- sampling_b( n_sim_mh=1, b_ini=b,
                               a=a, d_0_b=d_0_b, d_1_b=d_1_b,
                               mu_star_n_r=mu_star_n_r,
                               max.time=max.time, n.burn=0, eta=eta,
                               verbose=F,
                               USING_CPP=USING_CPP )
      
      b_new <- aux_b_new[[1]]
      #b_accept <- c(b_accept, aux_b_new[[2]])
      
      if(!all(b_new>-a)){stop('There is a problem sampling from "b", it should be >-a\nb=',b,"\n-a=",-a,sep="")}
      
    } else {
      b_new <- b_fix
      #b_accept <- NULL
    }
    b <- b_new
    if(dev_verbose) {
      cat('...Done! \n')
    }
    
    # (h) Sampling "Z_ij"
    
    if(dev_verbose) {
      cat( 'Sampling "Z_ij":\n' )
    }
    
    if( FALSE ) {
      # DANGER: This is not ready yet! #
      Z_new <- get_latents_cpp( Y = as.matrix(Y),
                                var_type = match(var_type, var_type_all),
                                mu_Z = mu_star[mu_star_map,,drop=F],
                                sigma_Z = sigma_Z,
                                Z_ini = Z )
    } else {
      Z_new <- get_latents( Y=Y,
                            var_type=var_type,
                            mu_Z=mu_star[mu_star_map,,drop=F],
                            sigma_Z=sigma_Z,
                            Z_old=Z,
                            USING_CPP=USING_CPP)$Z
    }
    if(dev_verbose) {
      cat('...Done! \n')
    }
    
    ### Storing simulation values for each iteration ###
    
    #mu_Z_sim[,,iter_i] <- mu_star[mu_star_map,]
    #sigma_Z_sim[,,iter_i] <- sigma_Z
    mu_star_map_sim[,paste("iter_",iter_i,sep="")] <- mu_star_map
    #mu_star_n_r_sim[[iter_i]] <- mu_star_n_r
    
    if(keep_param_chains) {
      Lambda_sim[iter_i,] <- diag(Lambda)
      Omega_sim[,,iter_i] <- Omega
      a_sim[iter_i] <- a
      b_sim[iter_i] <- b
    }
    
    time_sim[iter_i+1] <- Sys.time()
    
    if(dev_verbose) {
      cat('\n   Finished iter_i=',iter_i,', \n     elapsed time:\n          ',format(diff(time_sim)[iter_i]),' this iteration\n          ',format(diff(time_sim[c(1,length(time_sim))])), ' total\n**----**',sep='')
    }
    
    rm(list=setdiff(ls(),lsa_iter))
    pb$step()
    
    if( !is.null(log_file) & ( ((iter_i%%100)==0) || (iter_i==n_iter) ) ) {
      cat( as.character(iter_i)," , ",difftime(time1=Sys.time(),time2=log_clock,units="mins")," , ",as.character(Sys.time())," \n",
           file=log_file, append=TRUE )
    }
    
  }
  
  #####
  #   Clustering results summary   #
  #####
  
  sort.cluster <- function(x) {
    # function for sorting clustering numbers
    # cluster 1 will be the one with more elements
    plyr::mapvalues(x,
                    from=as.numeric(names(sort(table(x),decreasing=T))),
                    to=sort(unique(x)))
  }
  
  ### iterations that will be reported ###
  # after burn-in period and thinning
  iter_out <- seq(from=n_burn+1,to=n_iter,by=n_thin)
  
  ### reported clusters (after ordering) ###
  clusters <- sapply( mu_star_map_sim[,iter_out], sort.cluster )
  clusters <- data.frame(clusters)
  
  ### Number of clusters in each iteration ###
  n.clusters <- apply(clusters,2,max)
  
  ### Average cluster matrix ###
  cluster_matrix_sum <- matrix(0,nrow=dim(clusters)[1],ncol=dim(clusters)[1])
  for(iter_i in 1:dim(clusters)[2]) {
    # iter_i <- 1
    cluster_matrix_i <- matrix(0,nrow=dim(clusters)[1],ncol=dim(clusters)[1])
    aux_cluster_num <- clusters[,iter_i]
    
    # creates the clustering matrix for "iter_i"
    for(cluster_j in 1:max(aux_cluster_num)) {
      # cluster_j <- 1
      cluster_matrix_i[aux_cluster_num==cluster_j,aux_cluster_num==cluster_j] <- 1
    }
    
    # add clustering matrix with "cluster_matrix_sum"
    cluster_matrix_sum <- cluster_matrix_sum + cluster_matrix_i
  }
  cluster.matrix.avg <- cluster_matrix_sum/dim(clusters)[2]
  
  # Distance of each cluster to the average matrix #
  cluster_dist <- rep(0,dim(clusters)[2])
  for(iter_i in 1:dim(clusters)[2]) {
    #cat(iter_i,",")
    # iter_i <- 10
    cluster_matrix_i <- matrix(0,nrow=dim(clusters)[1],ncol=dim(clusters)[1])
    aux_cluster_num <- clusters[,iter_i]
    
    # creates the clustering matrix for "iter_i"
    for(cluster_j in 1:max(aux_cluster_num)) {
      # cluster_j <- 1
      cluster_matrix_i[aux_cluster_num==cluster_j,aux_cluster_num==cluster_j] <- 1
    }
    
    # calculates distance to the average matrix
    cluster_dist[iter_i] <- sum((cluster_matrix_i-cluster.matrix.avg)^2)
  }
  
  # Which cluster is closer to the average similarity matrix #
  cluster.cta <- clusters[,which.min(cluster_dist)]
  
  # Summary of original variables by cluster
  Y <- data.frame(cluster=cluster.cta,Y_orig) # restablishes original Y
  Y.cluster.summary <- data.frame(cluster=sort(unique(cluster.cta)),
                                  n.elements=as.numeric(table(cluster.cta)))
  
  for(i in 2:ncol(Y)) {
    if( is.element(var_type_orig[i-1],var_type_all[1]) ) {
      # continuous variables
      Y_summary_aux <- aggregate(as.formula(paste(colnames(Y)[i],"~cluster",sep="")),
                                 data=Y,
                                 FUN=mean,
                                 na.rm=T)
      Y.cluster.summary <- merge( Y.cluster.summary,Y_summary_aux,by="cluster",all.x=T)
    } else if( is.element(var_type_orig[i-1],var_type_all[-1]) ) {
      Y_summary_aux <- tapply(X=rep(1,nrow(Y)),INDEX=list(Y$cluster,Y[,i]),FUN=sum,na.rm=T)
      Y_summary_aux[is.na(Y_summary_aux)] <- 0
      Y_summary_aux <- Y_summary_aux / apply(Y_summary_aux,1,sum,na.rm=T)
      colnames(Y_summary_aux) <- paste(colnames(Y)[i],'value',colnames(Y_summary_aux),sep="_")
      Y_summary_aux <- data.frame(cluster=sort(unique(Y$cluster)),Y_summary_aux)
      Y.cluster.summary <- merge( Y.cluster.summary,Y_summary_aux,by="cluster",all.x=T)
    } else { stop('There is an unknown type in "var_type_orig"') }
  }
  
  ### Cluster Heterogeneity Measure ###
  q<-ncol(Y)
  hm <- vector("numeric")
  for( k in 1:dim(clusters)[2] ) {
    # k <- 1
    cluster<-clusters[,k]
    nc<-max(cluster)
    nr<-table(cluster)[1:nc]
    
    hmm<-matrix(NA,nc,q)
    
    for (i in 1:nc){
      # i <- 1
      for (j in 1:q){
        # j <- 1
        xbar<-weighted.mean(Y[cluster==i,j],expansion_f[cluster==i])
        x2bar<-weighted.mean(Y[cluster==i,j]^2,expansion_f[cluster==i])
        hmm[i,j]<-x2bar-xbar^2
      }
    }
    hm <- c( hm, sum(apply(hmm,1,sum)*nr) )
  }
  
  
  dpclust <- structure(
    list(
      cluster=cluster.cta[map_z_to_orig],
      clusterHMmin = clusters[map_z_to_orig,which.min(hm)],
      
      Y.cluster.summary=Y.cluster.summary,
      
      Y.var_type=var_type_orig,
      Y.na=which(!aux_na_Y),
      Y.n=n,
      Y.p=p,
      
      MC.clusters=clusters[map_z_to_orig,],
      
      # Heterogeneity Measure (HM)
      cluster.HM=hm[which.min(cluster_dist)],
      clusterHMmin.HM=min(hm),
      MC.clusters.HM=hm,
      
      cluster.matrix.avg=cluster.matrix.avg[map_z_to_orig,map_z_to_orig],
      
      #mu_star_n_r_sim=mu_star_n_r_sim,
      #mu_Z_sim=mu_Z_sim,
      #sigma_Z_sim=sigma_Z_sim,
      #mu_star_probs=mu_star_probs,
      
      MC.values =list(
        n.clusters=n.clusters,
        a=a_sim,
        b=b_sim,
        Lambda=Lambda_sim,
        Omega=Omega_sim
      ) ,
      MC.accept.rate=NULL,
      call=cl
    )
    ,class="MIXcluster")
  
  return(dpclust)
  
}
christianu7/BNPMIXcluster documentation built on Sept. 10, 2022, 11:40 p.m.