MIXclustering: Bayesian Nonparametric Model for Clustering with Mixed Scale...
In christianu7/BNPMIXcluster: Bayesian Nonparametric Model for Clustering with Mixed Scale Variables

MIXclustering

R Documentation

Bayesian Nonparametric Model for Clustering with Mixed Scale Variables

Description

MIXclustering is used to perform cluster analysis of individuals using a Bayesian nonparametric mixture model that jointly models mixed scale data and accommodates for different sampling probabilities. The model is described in Carmona, C., Nieto-Barajas, L. E., Canale, A. (2016).

Usage

MIXclustering(
  Y,
  var_type,
  n_iter_out = 2000,
  n_burn = 100,
  n_thin = 2,
  a_fix = NULL,
  alpha = 0.5,
  d_0_a = 1,
  d_1_a = 1,
  b_fix = NULL,
  d_0_b = 1,
  d_1_b = 1,
  eta = 2,
  d_0_z = 2.1,
  d_1_z = 30,
  kappa = 5,
  delta = 4,
  d_0_mu = 2.1,
  d_1_mu = 30,
  sampling_prob = NULL,
  expansion_f = NULL,
  log_file = NULL,
  keep_param_chains = FALSE
)

Arguments

`Y`	Matrix or data frame containing the data to be clustered.
`var_type`	Character vector that indicates the type of variable in each column of x. Three possible types: "c" for continuous variables. It is assumed to be Gaussian-shaped. "o" for ordinal variables (binary and ordered categorical). "m" for nominal variables (non-ordered categorical).
`n_iter_out`	Number of effective iterations in the MCMC procedure for clustering.
`n_burn`	Number of iterations discarded as part of the burn-in period at the beginning MCMC procedure.
`n_thin`	Number of iterations discarded for thinning the chain (reducing the autocorrelation). We keep 1 of every n_thin iterations.
`a_fix`	A numeric value to set the parameter a in the model. If `NULL` (default), the parameter a is assigned a prior distribution. See `details`.
`alpha`	Hyperparameter in the prior distribution of a. See `details`.
`d_0_a`	Hyperparameter in the prior distribution of a. See `details`.
`d_1_a`	Hyperparameter in the prior distribution of a. See `details`.
`b_fix`	A numeric value to set the parameter b in the model. If `NULL` (default), the parameter b is assigned a prior distribution. See `details`.
`d_0_b`	Hyperparameter in the prior distribution of b. See `details`.
`d_1_b`	Hyperparameter in the prior distribution of b. See `details`.
`eta`	Tuning parameter controlling the proposal in the Metropolis-Hastings step for b.
`d_0_z`	Hyperparameter in the prior distribution of the variance for the latent variables. See `details`.
`d_1_z`	Hyperparameter in the prior distribution of the variance for the latent variables. See `details`.
`kappa`	Tuning parameter controlling the proposal in the Metropolis-Hastings step for the variance of latent variables.
`delta`	Tuning parameter controlling the proposal in the Metropolis-Hastings step for the correlation of latent variables.
`d_0_mu`	Hyperparameter in the prior distribution of the variance of the location in each cluster. See `details`.
`d_1_mu`	Hyperparameter in the prior distribution of the variance of the location in each cluster. See `details`.
`sampling_prob`	vector with the sampling probabilities π_i for each individual in case that the data come from a complex survey sample. By default π_i=1.
`expansion_f`	vector with the expansion factors, the reciprocal of the sampling probabilities, w_i = 1/π_i. If both `sampling_prob` and `expansion_f` are specified, preference is given to `sampling_prob`.
`log_file`	Specifies a file to save the details with the execution time and the parameters used.
`keep_param_chains`	Indicates if the simulations of parameters `a`, `b`, `lambda` and `omega` should be returned as output.

Details

The model consists on a Bayesian non-parametric approach for clustering that is capable to combine different types of variables through the usage of associated continuous latent variables. The clustering mechanism is based on a location mixture model with a Poisson-Dirichlet (PD) process prior on the location parameters μ_i ; i=1,...,n of the associated latent variables.

Computational inference about the cluster allocation and the posterior distribution of the parameters are performed using MCMC.

A full description of the model is in the article Carmona et al. (2016) (https://arxiv.org/abs/1612.00083). See Reference.

The model consider an individual y_i that is characterized by a multivariate response of dimension p, i.e., y_i=(y_{i,1},...,y_{i,p}). The total number of variables p is divided into c continuous variables, o ordinal variables, and m nominal variables such that p=c+o+m.

For the continuous variables, it is convenient that the variables have a real support. The user may have transformed the original values before using the function MIXclustering.

For each response y_i=(y_{i,1},...,y_{i,p}) (of dimension p) a corresponding latent vector z_i=(z_{i,1},...,z_{i,q}) (of dimension q) is created, according to the following:

For each continuous variable y_{i,j} ; j=1,...,c the algorithm uses a latent with the same values z_{i,j}=y_{i,j}.
For each ordinal variable y_{i,j} , j = c+1,...,c+o, with K_j different ordered values, the algorithm creates one latent z_{i,j}, that allows to map the categories into continuous values divided by thresholds. For example, for a binary y_j, we have y_j=0 iff z_j<0 and y_j=1 iff z_j>0
For each nominal variable y_{i,j} , j = c+o+1,...,c+o+m, with L_j categories, the algorithm require L_j-1 latent variables, whose relative order is consistent with the observed category.

The data may come from a complex survey sample where each individual y_i has known sampling probability π_i, i = 1,...,n. The reciprocal of these sampling probabilities, w_i=1/π_i, are called expansion factors or sampling design weights.

The joint model for the latent vector is therefore:

(z_i | μ_i,Σ) ~ N_q(μ_i, π_i Σ )

(Note: the final model in Carmona et al. (2016) has variance κ π_i Σ. This value of κ can be used in the package through a transformed sampling probability vector π_i^*=κπ_i)

The clustering model will be based in an appropriate choice of the prior distribution on the μ_i's. A clustering of the μ_i's will induce a clustering of the y_i's. Our prior on the μ_i's will be:

μ_i | G~G, iid for i=1,...,n

Where G~PD(a,b,G_0) is a Poisson-Dirichlet process with parameters a \in [0,1), b>-a and centering measure G_0. The Dirichlet and the normalized stable processes arise when a=0 and when b=0, respectively.

In consequence, this choice of prior implies that the μ_i's are exchangeable with marginal distribution μ_i~G_0 for all i=1,...,n.

In our case, G(μ)=N(0,Σ_μ), where Σ_μ = diag( σ^2_{μ 1} ,...,σ^2_{μ q} ).

The parameters a and b in the model define the PD process and therefore control the number of groups. These parameters can be fixed, resulting in a larger/smaller number of groups if assigned a larger/smaller value, respectively.

There are 9 hyperparameters in the function that also characterize the prior distributions in the model:

f(a) = alpha * I(a=0) + (1-alpha) * dbeta( a | d_0_a , d_0_a )
f(b | a) = dgamma( b + a | d_0_b , d_1_b )
sigma^2 ~ inverse-gamma( d_0_z , d_1_z)
sigma^2_mu ~ inverse-gamma( d_0_mu , d_1_mu )

The definition of these values also affect the number of resulting clusters since they affect the variance implied in the model.

For example, increasing the values of d_1_a and d_1_b reduce the number of groups.

Finally, the function parameters eta, kappa, delta are tuning parameters that control the acceptance rate in the random-walk MH steps of the new proposed values for the parameters b, Λ_{j,j} (variance of latents) and Ω_{i,j} (correlation of latents). These parameters are not recommended to be changed (used in the internal functions: sampling_b , sampling_Lambda_jj , sampling_Omega_ij).

Value

MIXclustering returns a S3 object of class "MIXcluster".

The generic methods summary and plot are defined for this class.

An object of class "MIXcluster" is a list containing the following components:

cluster: vector with the cluster allocation for each row in the data. It corresponds to the iteration which is Closest-To-Average (CTA) arrangement.
cluster_heterogeneity: Heterogeneity Measure (HM) for the cluster in the previous point. The HM measure is discussed in section 4 of Carmona et al. (2017).
Y.cluster.summary: a summary of the data divided by the allocation in $cluster.
Y.var_type: vector with the variable types in the data.
Y.na: vector specifying the rows with missing values.
Y.n: number of rows in the data.
Y.p: number of variables in the data.
MC.clusters: matrix with the cluster allocation for each row in the data. Each column corresponds to an effective iteration in the MCMC simulation of the model (after discarding burn-in and thinning iterations).
MC.clusters_heterogeneity: Heterogeneity Measure (HM) for all the clusters returned in MC.clusters.
cluster.matrix.avg: average similarity matrix of size n by n.
MC.values: a list with the simulated values of the chains for the parameters a,b,Λ,Ω.
MC.accept.rate: a named vector with the acceptance rates for each parameter. It includes iterations that are discarded in the burn-in period and thinning.
call: the matched call.

References

Carmona, C., Nieto-Barajas, L. E. & Canale, A. (2017). Model based approach for household clustering with mixed scale variables. (https://arxiv.org/abs/1612.00083)

Examples


##############################
#     Simulation study 1     #
#    Carmona et al. (2017)   #
##############################

# Data and parameters are discussed in section 5.1 of Carmona et al. (2017) #

# Set seed for reproducibility #
set.seed(0) 


# Specification of data Y #
help(Y_ex_5_1)

# Observable data
# Choose scenario: 1, 2, or 3
ex_i <- 1

# Prior specification
# Choose "a", "b" or "c"
param_j <- "c"

# Specify the data type that is being provided to the method
var_type_Y_ex_5_1 <- list( c("c","c","c"),
                           c("o","o"),
                           c("o","o","o","c") )
## Not run: 
cluster_ex <- MIXclustering( Y = as.matrix(Y_ex_5_1[[ ex_i ]]),
                             var_type=var_type_Y_ex_5_1[[ ex_i ]],
                             
                             n_iter_out=1500,
                             n_burn=200,
                             n_thin=3,
                             
                             alpha = meta_param_ex[ param_j, "alpha" ],
                             d_0_a = meta_param_ex[ param_j, "d_0_a"],
                             d_1_a = meta_param_ex[ param_j, "d_1_a" ],
                             d_0_b = meta_param_ex[ param_j, "d_0_b" ],
                             d_1_b = meta_param_ex[ param_j, "d_1_b" ],
                             eta = meta_param_ex[ param_j, "eta" ],
                             kappa = meta_param_ex[ param_j, "kappa" ],
                             delta = meta_param_ex[ param_j, "delta" ],
                             
                             d_0_z = meta_param_ex[ param_j, "d_0_z" ],
                             d_1_z = meta_param_ex[ param_j, "d_1_z" ],
                             d_0_mu = meta_param_ex[ param_j, "d_0_mu" ],
                             d_1_mu = meta_param_ex[ param_j, "d_1_mu" ] )
# Summary of clustering results
summary(cluster_ex)

# Visualizing clustering results
plot(cluster_ex,type="heatmap")
plot(cluster_ex,type="chain")

# Comparison of cluster configurations #
# 1) Minimum distance with average MCMC iterations
# 2) Minimum Heterogeneity Measure (HM)
plot( x=jitter(cluster_ex$cluster),y=jitter(cluster_ex$clusterHMmin), col="#FF000080", pch=20,
      main=paste("Comparison of two relevant cluster configurations"),
      xlab="minimizes distance to average MCMC grouping", ylab="minimizes Heterogeneity Measure" )

# Comparison with the original clusters in the simulated data
plot(x=jitter(Z_latent_ex_5_1$cluster),
     y=jitter(cluster_ex$cluster),
     main=paste("Comparison real configuration with the model results"),
     xlab="Real cluster",
     ylab="Model cluster",
     pch=19, col="#FF000080")

## End(Not run)

##############################
#      Households data       #
#    Carmona et al. (2017)   #
##############################

# Testing "MIXclustering" function with poverty.data #
# Data and parameters are discussed in section 5.3 of Carmona et al. (2017) #

# Set seed for reproducibility #
set.seed(0) 


## Not run: 
# relevant variables for clustering households #
Y_names <- c("ict_norm",
             "ic_ali","ic_asalud","ic_cv",
             "ic_rezedu","ic_sbv","ic_segsoc",
             "niv_ed","tam_loc")
Y_var_type <- c("c","o","o","o","o","o","o","o","m")

# using only data from state 15 (Edomex) #
aux_subset <- rep(T,nrow(poverty.data))
aux_subset <- aux_subset & is.element(substr(poverty.data$folioviv,1,2),"15")

Y_data <- poverty.data[aux_subset,Y_names]

### Sampling probability dependin on the scenario ###
# Scenario description in section 5.3 of Carmona et al. (2017) #
# Choose 1, 2 or 3 #
poverty_sampling_spec <- 3


if (poverty_sampling_spec == 1) {
  k <- 1
  sampling_prob_pov <- rep(1,nrow(Y_data))
} else if (poverty_sampling_spec == 2) {
  k <- 2 * mean(poverty.data[aux_subset,"factor_hog"])
  sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"]
} else if (poverty_sampling_spec == 3) {
  k <- 4 * mean(poverty.data[aux_subset,"factor_hog"])
  sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"]
}

cluster_poverty <- MIXclustering( Y=Y_data,
                                  var_type=Y_var_type,
                                  n_iter_out=1500,
                                  n_burn=200,
                                  n_thin=3,
                                  
                                  alpha = 0.5,
                                  d_0_a = 1, d_1_a = 1,
                                  d_0_b = 1, d_1_b = 1,
                                  
                                  eta = 2,
                                  kappa = 5,
                                  delta = 4,
                                  
                                  d_0_z = 2.1, d_1_z = 30,
                                  d_0_mu = 2.1, d_1_mu = 30,
                                  
                                  sampling_prob = k * sampling_prob_pov )

summary(cluster_poverty)
plot(cluster_poverty,type="heatmap")
plot(cluster_poverty,type="chain")

## End(Not run)

christianu7/BNPMIXcluster documentation built on Sept. 10, 2022, 11:40 p.m.