Description Usage Arguments Details Value References See Also Examples
View source: R/MIXclustering.R
MIXclustering
is used to perform cluster analysis of individuals using a Bayesian nonparametric mixture model that jointly models mixed scale data and accommodates for different sampling probabilities. The model is described in Carmona, C., NietoBarajas, L. E., Canale, A. (2016).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25  MIXclustering(
Y,
var_type,
n_iter_out = 2000,
n_burn = 100,
n_thin = 2,
a_fix = NULL,
alpha = 0.5,
d_0_a = 1,
d_1_a = 1,
b_fix = NULL,
d_0_b = 1,
d_1_b = 1,
eta = 2,
d_0_z = 2.1,
d_1_z = 30,
kappa = 5,
delta = 4,
d_0_mu = 2.1,
d_1_mu = 30,
sampling_prob = NULL,
expansion_f = NULL,
log_file = NULL,
keep_param_chains = FALSE
)

Y 
Matrix or data frame containing the data to be clustered. 
var_type 
Character vector that indicates the type of variable in each column of x. Three possible types:

n_iter_out 
Number of effective iterations in the MCMC procedure for clustering. 
n_burn 
Number of iterations discarded as part of the burnin period at the beginning MCMC procedure. 
n_thin 
Number of iterations discarded for thinning the chain (reducing the autocorrelation). We keep 1 of every n_thin iterations. 
a_fix 
A numeric value to set the parameter a in the model. If 
alpha 
Hyperparameter in the prior distribution of a. See 
d_0_a 
Hyperparameter in the prior distribution of a. See 
d_1_a 
Hyperparameter in the prior distribution of a. See 
b_fix 
A numeric value to set the parameter b in the model. If 
d_0_b 
Hyperparameter in the prior distribution of b. See 
d_1_b 
Hyperparameter in the prior distribution of b. See 
eta 
Tuning parameter controlling the proposal in the MetropolisHastings step for b. 
d_0_z 
Hyperparameter in the prior distribution of the variance for the latent variables. See 
d_1_z 
Hyperparameter in the prior distribution of the variance for the latent variables. See 
kappa 
Tuning parameter controlling the proposal in the MetropolisHastings step for the variance of latent variables. 
delta 
Tuning parameter controlling the proposal in the MetropolisHastings step for the correlation of latent variables. 
d_0_mu 
Hyperparameter in the prior distribution of the variance of the location in each cluster. See 
d_1_mu 
Hyperparameter in the prior distribution of the variance of the location in each cluster. See 
sampling_prob 
vector with the sampling probabilities π_i for each individual in case that the data come from a complex survey sample. By default π_i=1. 
expansion_f 
vector with the expansion factors, the reciprocal of the sampling probabilities, w_i = 1/π_i. If both 
log_file 
Specifies a file to save the details with the execution time and the parameters used. 
keep_param_chains 
Indicates if the simulations of parameters 
The model consists on a Bayesian nonparametric approach for clustering that is capable to combine different types of variables through the usage of associated continuous latent variables. The clustering mechanism is based on a location mixture model with a PoissonDirichlet (PD) process prior on the location parameters μ_i ; i=1,...,n of the associated latent variables.
Computational inference about the cluster allocation and the posterior distribution of the parameters are performed using MCMC.
A full description of the model is in the article Carmona et al. (2016) (https://arxiv.org/abs/1612.00083). See Reference
.
The model consider an individual y_i that is characterized by a multivariate response of dimension p, i.e., y_i=(y_{i,1},...,y_{i,p}). The total number of variables p is divided into c continuous variables, o ordinal variables, and m nominal variables such that p=c+o+m.
For the continuous variables, it is convenient that the variables have a real support. The user may have transformed the original values before using the function MIXclustering
.
For each response y_i=(y_{i,1},...,y_{i,p}) (of dimension p) a corresponding latent vector z_i=(z_{i,1},...,z_{i,q}) (of dimension q) is created, according to the following:
For each continuous variable y_{i,j} ; j=1,...,c the algorithm uses a latent with the same values z_{i,j}=y_{i,j}.
For each ordinal variable y_{i,j} , j = c+1,...,c+o, with K_j different ordered values, the algorithm creates one latent z_{i,j}, that allows to map the categories into continuous values divided by thresholds. For example, for a binary y_j, we have y_j=0 iff z_j<0 and y_j=1 iff z_j>0
For each nominal variable y_{i,j} , j = c+o+1,...,c+o+m, with L_j categories, the algorithm require L_j1 latent variables, whose relative order is consistent with the observed category.
The data may come from a complex survey sample where each individual y_i has known sampling probability π_i, i = 1,...,n. The reciprocal of these sampling probabilities, w_i=1/π_i, are called expansion factors or sampling design weights.
The joint model for the latent vector is therefore:
(z_i  μ_i,Σ) ~ N_q(μ_i, π_i Σ )
(Note: the final model in Carmona et al. (2016) has variance κ π_i Σ. This value of κ can be used in the package through a transformed sampling probability vector π_i^*=κπ_i)
The clustering model will be based in an appropriate choice of the prior distribution on the μ_i's. A clustering of the μ_i's will induce a clustering of the y_i's. Our prior on the μ_i's will be:
μ_i  G~G, iid for i=1,...,n
Where G~PD(a,b,G_0) is a PoissonDirichlet process with parameters a \in [0,1), b>a and centering measure G_0. The Dirichlet and the normalized stable processes arise when a=0 and when b=0, respectively.
In consequence, this choice of prior implies that the μ_i's are exchangeable with marginal distribution μ_i~G_0 for all i=1,...,n.
In our case, G(μ)=N(0,Σ_μ), where Σ_μ = diag( σ^2_{μ 1} ,...,σ^2_{μ q} ).
The parameters a and b in the model define the PD process and therefore control the number of groups. These parameters can be fixed, resulting in a larger/smaller number of groups if assigned a larger/smaller value, respectively.
There are 9 hyperparameters in the function that also characterize the prior distributions in the model:
f(a) = alpha * I(a=0) + (1alpha) * dbeta( a  d_0_a , d_0_a )
f(b  a) = dgamma( b + a  d_0_b , d_1_b )
sigma^2 ~ inversegamma( d_0_z , d_1_z)
sigma^2_mu ~ inversegamma( d_0_mu , d_1_mu )
The definition of these values also affect the number of resulting clusters since they affect the variance implied in the model.
For example, increasing the values of d_1_a
and d_1_b
reduce the number of groups.
Finally, the function parameters eta
, kappa
, delta
are tuning parameters that control the acceptance rate in the randomwalk MH steps of the new proposed values for the parameters b, Λ_{j,j} (variance of latents) and Ω_{i,j} (correlation of latents). These parameters are not recommended to be changed (used in the internal functions: sampling_b
, sampling_Lambda_jj
, sampling_Omega_ij
).
MIXclustering
returns a S3 object of class "MIXcluster
".
The generic methods summary
and plot
are defined for this class.
An object of class "MIXcluster" is a list containing the following components:
cluster
vector with the cluster allocation for each row in the data. It corresponds to the iteration which is ClosestToAverage (CTA) arrangement.
cluster_heterogeneity
Heterogeneity Measure (HM) for the cluster in the previous point. The HM measure is discussed in section 4 of Carmona et al. (2017).
Y.cluster.summary
a summary of the data divided by the allocation in $cluster
.
Y.var_type
vector with the variable types in the data.
Y.na
vector specifying the rows with missing values.
Y.n
number of rows in the data.
Y.p
number of variables in the data.
MC.clusters
matrix with the cluster allocation for each row in the data. Each column corresponds to an effective iteration in the MCMC simulation of the model (after discarding burnin and thinning iterations).
MC.clusters_heterogeneity
Heterogeneity Measure (HM) for all the clusters returned in MC.clusters
.
cluster.matrix.avg
average similarity matrix of size n by n.
MC.values
a list with the simulated values of the chains for the parameters a,b,Λ,Ω.
MC.accept.rate
a named vector with the acceptance rates for each parameter. It includes iterations that are discarded in the burnin period and thinning.
call
the matched call.
Carmona, C., NietoBarajas, L. E. & Canale, A. (2017). Model based approach for household clustering with mixed scale variables. (https://arxiv.org/abs/1612.00083)
summary.MIXcluster
for a summary of the clustering results, plot.MIXcluster
for graphical representation of results.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138  ##############################
# Simulation study 1 #
# Carmona et al. (2017) #
##############################
# Data and parameters are discussed in section 5.1 of Carmona et al. (2017) #
# Set seed for reproducibility #
set.seed(0)
# Specification of data Y #
help(Y_ex_5_1)
# Observable data
# Choose scenario: 1, 2, or 3
ex_i < 1
# Prior specification
# Choose "a", "b" or "c"
param_j < "c"
# Specify the data type that is being provided to the method
var_type_Y_ex_5_1 < list( c("c","c","c"),
c("o","o"),
c("o","o","o","c") )
## Not run:
cluster_ex < MIXclustering( Y = as.matrix(Y_ex_5_1[[ ex_i ]]),
var_type=var_type_Y_ex_5_1[[ ex_i ]],
n_iter_out=1500,
n_burn=200,
n_thin=3,
alpha = meta_param_ex[ param_j, "alpha" ],
d_0_a = meta_param_ex[ param_j, "d_0_a"],
d_1_a = meta_param_ex[ param_j, "d_1_a" ],
d_0_b = meta_param_ex[ param_j, "d_0_b" ],
d_1_b = meta_param_ex[ param_j, "d_1_b" ],
eta = meta_param_ex[ param_j, "eta" ],
kappa = meta_param_ex[ param_j, "kappa" ],
delta = meta_param_ex[ param_j, "delta" ],
d_0_z = meta_param_ex[ param_j, "d_0_z" ],
d_1_z = meta_param_ex[ param_j, "d_1_z" ],
d_0_mu = meta_param_ex[ param_j, "d_0_mu" ],
d_1_mu = meta_param_ex[ param_j, "d_1_mu" ] )
# Summary of clustering results
summary(cluster_ex)
# Visualizing clustering results
plot(cluster_ex,type="heatmap")
plot(cluster_ex,type="chain")
# Comparison of cluster configurations #
# 1) Minimum distance with average MCMC iterations
# 2) Minimum Heterogeneity Measure (HM)
plot( x=jitter(cluster_ex$cluster),y=jitter(cluster_ex$clusterHMmin), col="#FF000080", pch=20,
main=paste("Comparison of two relevant cluster configurations"),
xlab="minimizes distance to average MCMC grouping", ylab="minimizes Heterogeneity Measure" )
# Comparison with the original clusters in the simulated data
plot(x=jitter(Z_latent_ex_5_1$cluster),
y=jitter(cluster_ex$cluster),
main=paste("Comparison real configuration with the model results"),
xlab="Real cluster",
ylab="Model cluster",
pch=19, col="#FF000080")
## End(Not run)
##############################
# Households data #
# Carmona et al. (2017) #
##############################
# Testing "MIXclustering" function with poverty.data #
# Data and parameters are discussed in section 5.3 of Carmona et al. (2017) #
# Set seed for reproducibility #
set.seed(0)
## Not run:
# relevant variables for clustering households #
Y_names < c("ict_norm",
"ic_ali","ic_asalud","ic_cv",
"ic_rezedu","ic_sbv","ic_segsoc",
"niv_ed","tam_loc")
Y_var_type < c("c","o","o","o","o","o","o","o","m")
# using only data from state 15 (Edomex) #
aux_subset < rep(T,nrow(poverty.data))
aux_subset < aux_subset & is.element(substr(poverty.data$folioviv,1,2),"15")
Y_data < poverty.data[aux_subset,Y_names]
### Sampling probability dependin on the scenario ###
# Scenario description in section 5.3 of Carmona et al. (2017) #
# Choose 1, 2 or 3 #
poverty_sampling_spec < 3
if (poverty_sampling_spec == 1) {
k < 1
sampling_prob_pov < rep(1,nrow(Y_data))
} else if (poverty_sampling_spec == 2) {
k < 2 * mean(poverty.data[aux_subset,"factor_hog"])
sampling_prob_pov < 1/poverty.data[aux_subset,"factor_hog"]
} else if (poverty_sampling_spec == 3) {
k < 4 * mean(poverty.data[aux_subset,"factor_hog"])
sampling_prob_pov < 1/poverty.data[aux_subset,"factor_hog"]
}
cluster_poverty < MIXclustering( Y=Y_data,
var_type=Y_var_type,
n_iter_out=1500,
n_burn=200,
n_thin=3,
alpha = 0.5,
d_0_a = 1, d_1_a = 1,
d_0_b = 1, d_1_b = 1,
eta = 2,
kappa = 5,
delta = 4,
d_0_z = 2.1, d_1_z = 30,
d_0_mu = 2.1, d_1_mu = 30,
sampling_prob = k * sampling_prob_pov )
summary(cluster_poverty)
plot(cluster_poverty,type="heatmap")
plot(cluster_poverty,type="chain")
## End(Not run)

Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.