# MIXclustering: Bayesian Nonparametric Model for Clustering with Mixed Scale... In BNPMIXcluster: Bayesian Nonparametric Model for Clustering with Mixed Scale Variables

## Description

MIXclustering is used to perform cluster analysis of individuals using a Bayesian nonparametric mixture model that jointly models mixed scale data and accommodates for different sampling probabilities. The model is described in Carmona, C., Nieto-Barajas, L. E., Canale, A. (2016).

## Usage

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 MIXclustering( Y, var_type, n_iter_out = 2000, n_burn = 100, n_thin = 2, a_fix = NULL, alpha = 0.5, d_0_a = 1, d_1_a = 1, b_fix = NULL, d_0_b = 1, d_1_b = 1, eta = 2, d_0_z = 2.1, d_1_z = 30, kappa = 5, delta = 4, d_0_mu = 2.1, d_1_mu = 30, sampling_prob = NULL, expansion_f = NULL, log_file = NULL, keep_param_chains = FALSE )

## Arguments

 Y Matrix or data frame containing the data to be clustered. var_type Character vector that indicates the type of variable in each column of x. Three possible types: "c" for continuous variables. It is assumed to be Gaussian-shaped. "o" for ordinal variables (binary and ordered categorical). "m" for nominal variables (non-ordered categorical). n_iter_out Number of effective iterations in the MCMC procedure for clustering. n_burn Number of iterations discarded as part of the burn-in period at the beginning MCMC procedure. n_thin Number of iterations discarded for thinning the chain (reducing the autocorrelation). We keep 1 of every n_thin iterations. a_fix A numeric value to set the parameter a in the model. If NULL (default), the parameter a is assigned a prior distribution. See details. alpha Hyperparameter in the prior distribution of a. See details. d_0_a Hyperparameter in the prior distribution of a. See details. d_1_a Hyperparameter in the prior distribution of a. See details. b_fix A numeric value to set the parameter b in the model. If NULL (default), the parameter b is assigned a prior distribution. See details. d_0_b Hyperparameter in the prior distribution of b. See details. d_1_b Hyperparameter in the prior distribution of b. See details. eta Tuning parameter controlling the proposal in the Metropolis-Hastings step for b. d_0_z Hyperparameter in the prior distribution of the variance for the latent variables. See details. d_1_z Hyperparameter in the prior distribution of the variance for the latent variables. See details. kappa Tuning parameter controlling the proposal in the Metropolis-Hastings step for the variance of latent variables. delta Tuning parameter controlling the proposal in the Metropolis-Hastings step for the correlation of latent variables. d_0_mu Hyperparameter in the prior distribution of the variance of the location in each cluster. See details. d_1_mu Hyperparameter in the prior distribution of the variance of the location in each cluster. See details. sampling_prob vector with the sampling probabilities π_i for each individual in case that the data come from a complex survey sample. By default π_i=1. expansion_f vector with the expansion factors, the reciprocal of the sampling probabilities, w_i = 1/π_i. If both sampling_prob and expansion_f are specified, preference is given to sampling_prob. log_file Specifies a file to save the details with the execution time and the parameters used. keep_param_chains Indicates if the simulations of parameters a, b, lambda and omega should be returned as output.

## Details

The model consists on a Bayesian non-parametric approach for clustering that is capable to combine different types of variables through the usage of associated continuous latent variables. The clustering mechanism is based on a location mixture model with a Poisson-Dirichlet (PD) process prior on the location parameters μ_i ; i=1,...,n of the associated latent variables.

Computational inference about the cluster allocation and the posterior distribution of the parameters are performed using MCMC.

A full description of the model is in the article Carmona et al. (2016) (https://arxiv.org/abs/1612.00083). See Reference.

The model consider an individual y_i that is characterized by a multivariate response of dimension p, i.e., y_i=(y_{i,1},...,y_{i,p}). The total number of variables p is divided into c continuous variables, o ordinal variables, and m nominal variables such that p=c+o+m.

For the continuous variables, it is convenient that the variables have a real support. The user may have transformed the original values before using the function MIXclustering.

For each response y_i=(y_{i,1},...,y_{i,p}) (of dimension p) a corresponding latent vector z_i=(z_{i,1},...,z_{i,q}) (of dimension q) is created, according to the following:

• For each continuous variable y_{i,j} ; j=1,...,c the algorithm uses a latent with the same values z_{i,j}=y_{i,j}.

• For each ordinal variable y_{i,j} , j = c+1,...,c+o, with K_j different ordered values, the algorithm creates one latent z_{i,j}, that allows to map the categories into continuous values divided by thresholds. For example, for a binary y_j, we have y_j=0 iff z_j<0 and y_j=1 iff z_j>0

• For each nominal variable y_{i,j} , j = c+o+1,...,c+o+m, with L_j categories, the algorithm require L_j-1 latent variables, whose relative order is consistent with the observed category.

The data may come from a complex survey sample where each individual y_i has known sampling probability π_i, i = 1,...,n. The reciprocal of these sampling probabilities, w_i=1/π_i, are called expansion factors or sampling design weights.

The joint model for the latent vector is therefore:

(z_i | μ_i,Σ) ~ N_q(μ_i, π_i Σ )

(Note: the final model in Carmona et al. (2016) has variance κ π_i Σ. This value of κ can be used in the package through a transformed sampling probability vector π_i^*=κπ_i)

The clustering model will be based in an appropriate choice of the prior distribution on the μ_i's. A clustering of the μ_i's will induce a clustering of the y_i's. Our prior on the μ_i's will be:

μ_i | G~G, iid for i=1,...,n

Where G~PD(a,b,G_0) is a Poisson-Dirichlet process with parameters a \in [0,1), b>-a and centering measure G_0. The Dirichlet and the normalized stable processes arise when a=0 and when b=0, respectively.

In consequence, this choice of prior implies that the μ_i's are exchangeable with marginal distribution μ_i~G_0 for all i=1,...,n.

In our case, G(μ)=N(0,Σ_μ), where Σ_μ = diag( σ^2_{μ 1} ,...,σ^2_{μ q} ).

The parameters a and b in the model define the PD process and therefore control the number of groups. These parameters can be fixed, resulting in a larger/smaller number of groups if assigned a larger/smaller value, respectively.

There are 9 hyperparameters in the function that also characterize the prior distributions in the model:

• f(a) = alpha * I(a=0) + (1-alpha) * dbeta( a | d_0_a , d_0_a )

• f(b | a) = dgamma( b + a | d_0_b , d_1_b )

• sigma^2 ~ inverse-gamma( d_0_z , d_1_z)

• sigma^2_mu ~ inverse-gamma( d_0_mu , d_1_mu )

The definition of these values also affect the number of resulting clusters since they affect the variance implied in the model.

For example, increasing the values of d_1_a and d_1_b reduce the number of groups.

Finally, the function parameters eta, kappa, delta are tuning parameters that control the acceptance rate in the random-walk MH steps of the new proposed values for the parameters b, Λ_{j,j} (variance of latents) and Ω_{i,j} (correlation of latents). These parameters are not recommended to be changed (used in the internal functions: sampling_b , sampling_Lambda_jj , sampling_Omega_ij).

## Value

MIXclustering returns a S3 object of class "MIXcluster".

The generic methods summary and plot are defined for this class.

An object of class "MIXcluster" is a list containing the following components:

cluster

vector with the cluster allocation for each row in the data. It corresponds to the iteration which is Closest-To-Average (CTA) arrangement.

cluster_heterogeneity

Heterogeneity Measure (HM) for the cluster in the previous point. The HM measure is discussed in section 4 of Carmona et al. (2017).

Y.cluster.summary

a summary of the data divided by the allocation in $cluster. Y.var_type vector with the variable types in the data. Y.na vector specifying the rows with missing values. Y.n number of rows in the data. Y.p number of variables in the data. MC.clusters matrix with the cluster allocation for each row in the data. Each column corresponds to an effective iteration in the MCMC simulation of the model (after discarding burn-in and thinning iterations). MC.clusters_heterogeneity Heterogeneity Measure (HM) for all the clusters returned in MC.clusters. cluster.matrix.avg average similarity matrix of size n by n. MC.values a list with the simulated values of the chains for the parameters a,b,Λ,Ω. MC.accept.rate a named vector with the acceptance rates for each parameter. It includes iterations that are discarded in the burn-in period and thinning. call the matched call. ## References Carmona, C., Nieto-Barajas, L. E. & Canale, A. (2017). Model based approach for household clustering with mixed scale variables. (https://arxiv.org/abs/1612.00083) ## See Also summary.MIXcluster for a summary of the clustering results, plot.MIXcluster for graphical representation of results. ## Examples  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 ############################## # Simulation study 1 # # Carmona et al. (2017) # ############################## # Data and parameters are discussed in section 5.1 of Carmona et al. (2017) # # Set seed for reproducibility # set.seed(0) # Specification of data Y # help(Y_ex_5_1) # Observable data # Choose scenario: 1, 2, or 3 ex_i <- 1 # Prior specification # Choose "a", "b" or "c" param_j <- "c" # Specify the data type that is being provided to the method var_type_Y_ex_5_1 <- list( c("c","c","c"), c("o","o"), c("o","o","o","c") ) ## Not run: cluster_ex <- MIXclustering( Y = as.matrix( ex_i ]]), var_type=var_type_Y_ex_5_1[[ ex_i ]], n_iter_out=1500, n_burn=200, n_thin=3, alpha = param_j, "alpha" ], d_0_a = param_j, "d_0_a"], d_1_a = param_j, "d_1_a" ], d_0_b = param_j, "d_0_b" ], d_1_b = param_j, "d_1_b" ], eta = param_j, "eta" ], kappa = param_j, "kappa" ], delta = param_j, "delta" ], d_0_z = param_j, "d_0_z" ], d_1_z = param_j, "d_1_z" ], d_0_mu = param_j, "d_0_mu" ], d_1_mu = param_j, "d_1_mu" ] ) # Summary of clustering results summary(cluster_ex) # Visualizing clustering results plot(cluster_ex,type="heatmap") plot(cluster_ex,type="chain") # Comparison of cluster configurations # # 1) Minimum distance with average MCMC iterations # 2) Minimum Heterogeneity Measure (HM) plot( x=jitter(cluster_ex$cluster),y=jitter(cluster_ex$clusterHMmin), col="#FF000080", pch=20, main=paste("Comparison of two relevant cluster configurations"), xlab="minimizes distance to average MCMC grouping", ylab="minimizes Heterogeneity Measure" ) # Comparison with the original clusters in the simulated data plot(x=jitter(Z_latent_ex_5_1$cluster), y=jitter(cluster_ex$cluster), main=paste("Comparison real configuration with the model results"), xlab="Real cluster", ylab="Model cluster", pch=19, col="#FF000080") ## End(Not run) ############################## # Households data # # Carmona et al. (2017) # ############################## # Testing "MIXclustering" function with poverty.data # # Data and parameters are discussed in section 5.3 of Carmona et al. (2017) # # Set seed for reproducibility # set.seed(0) ## Not run: # relevant variables for clustering households # Y_names <- c("ict_norm", "ic_ali","ic_asalud","ic_cv", "ic_rezedu","ic_sbv","ic_segsoc", "niv_ed","tam_loc") Y_var_type <- c("c","o","o","o","o","o","o","o","m") # using only data from state 15 (Edomex) # aux_subset <- rep(T,nrow(poverty.data)) aux_subset <- aux_subset & is.element(substr(poverty.data$folioviv,1,2),"15") Y_data <- poverty.data[aux_subset,Y_names] ### Sampling probability dependin on the scenario ### # Scenario description in section 5.3 of Carmona et al. (2017) # # Choose 1, 2 or 3 # poverty_sampling_spec <- 3 if (poverty_sampling_spec == 1) { k <- 1 sampling_prob_pov <- rep(1,nrow(Y_data)) } else if (poverty_sampling_spec == 2) { k <- 2 * mean(poverty.data[aux_subset,"factor_hog"]) sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"] } else if (poverty_sampling_spec == 3) { k <- 4 * mean(poverty.data[aux_subset,"factor_hog"]) sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"] } cluster_poverty <- MIXclustering( Y=Y_data, var_type=Y_var_type, n_iter_out=1500, n_burn=200, n_thin=3, alpha = 0.5, d_0_a = 1, d_1_a = 1, d_0_b = 1, d_1_b = 1, eta = 2, kappa = 5, delta = 4, d_0_z = 2.1, d_1_z = 30, d_0_mu = 2.1, d_1_mu = 30, sampling_prob = k * sampling_prob_pov ) summary(cluster_poverty) plot(cluster_poverty,type="heatmap") plot(cluster_poverty,type="chain") ## End(Not run)

BNPMIXcluster documentation built on Nov. 30, 2020, 5:07 p.m.