MIXclustering: Bayesian Nonparametric Model for Clustering with Mixed Scale...
In BNPMIXcluster: Bayesian Nonparametric Model for Clustering with Mixed Scale Variables

Description Usage Arguments Details Value References See Also Examples

MIXclustering performs cluster analysis of individuals using a Bayesian nonparametric mixture model that jointly models mixed scale data and accommodates different sampling probabilities. The model is described in Carmona, C., Nieto-Barajas, L. E., Canale, A. (2016).

MIXclustering(
  Y,
  var_type,
  n_iter_out = 2000,
  n_burn = 100,
  n_thin = 2,
  a_fix = NULL,
  alpha = 0.5,
  d_0_a = 1,
  d_1_a = 1,
  b_fix = NULL,
  d_0_b = 1,
  d_1_b = 1,
  eta = 2,
  d_0_z = 2.1,
  d_1_z = 30,
  kappa = 5,
  delta = 4,
  d_0_mu = 2.1,
  d_1_mu = 30,
  sampling_prob = NULL,
  expansion_f = NULL,
  log_file = NULL,
  keep_param_chains = FALSE
)

`Y`	Matrix or data frame containing the data to be clustered.
`var_type`	Character vector that indicates the type of variable in each column of Y. Three types are possible: "c" for continuous variables, which are assumed to be Gaussian-shaped; "o" for ordinal variables (binary and ordered categorical); "m" for nominal variables (non-ordered categorical).
`n_iter_out`	Number of effective iterations in the MCMC procedure for clustering.
`n_burn`	Number of iterations discarded as part of the burn-in period at the beginning of the MCMC procedure.
`n_thin`	Number of iterations discarded for thinning the chain (reducing the autocorrelation). We keep 1 of every n_thin iterations.
`a_fix`	A numeric value to set the parameter a in the model. If `NULL` (default), the parameter a is assigned a prior distribution. See `details`.
`alpha`	Hyperparameter in the prior distribution of a. See `details`.
`d_0_a`	Hyperparameter in the prior distribution of a. See `details`.
`d_1_a`	Hyperparameter in the prior distribution of a. See `details`.
`b_fix`	A numeric value to set the parameter b in the model. If `NULL` (default), the parameter b is assigned a prior distribution. See `details`.
`d_0_b`	Hyperparameter in the prior distribution of b. See `details`.
`d_1_b`	Hyperparameter in the prior distribution of b. See `details`.
`eta`	Tuning parameter controlling the proposal in the Metropolis-Hastings step for b.
`d_0_z`	Hyperparameter in the prior distribution of the variance for the latent variables. See `details`.
`d_1_z`	Hyperparameter in the prior distribution of the variance for the latent variables. See `details`.
`kappa`	Tuning parameter controlling the proposal in the Metropolis-Hastings step for the variance of latent variables.
`delta`	Tuning parameter controlling the proposal in the Metropolis-Hastings step for the correlation of latent variables.
`d_0_mu`	Hyperparameter in the prior distribution of the variance of the location in each cluster. See `details`.
`d_1_mu`	Hyperparameter in the prior distribution of the variance of the location in each cluster. See `details`.
`sampling_prob`	Vector with the sampling probabilities π_i for each individual, for data that come from a complex survey sample. By default π_i=1.
`expansion_f`	Vector with the expansion factors, the reciprocals of the sampling probabilities, w_i = 1/π_i. If both `sampling_prob` and `expansion_f` are specified, preference is given to `sampling_prob`.
`log_file`	Specifies a file to save the details with the execution time and the parameters used.
`keep_param_chains`	Indicates if the simulations of parameters `a`, `b`, `lambda` and `omega` should be returned as output.
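
The interplay of `n_iter_out`, `n_burn` and `n_thin` can be illustrated with a short sketch. It assumes that the sampler runs `n_burn + n_thin * n_iter_out` iterations in total and keeps every `n_thin`-th iteration after burn-in; this bookkeeping is an assumption based on the argument descriptions above, not something confirmed from the package source. The names `total_iter` and `kept` are illustrative only.

```r
# Default values from the Usage section
n_iter_out <- 2000
n_burn <- 100
n_thin <- 2

# Assumed total number of MCMC iterations (burn-in plus thinned draws)
total_iter <- n_burn + n_thin * n_iter_out

# Indices of the iterations that would be kept under this assumption
kept <- seq(from = n_burn + n_thin, to = total_iter, by = n_thin)

total_iter
length(kept)
```

Under this assumption, raising `n_thin` increases the run time roughly proportionally while the number of returned draws stays at `n_iter_out`.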

The model is a Bayesian nonparametric approach to clustering that can combine different types of variables through associated continuous latent variables. The clustering mechanism is based on a location mixture model with a Poisson-Dirichlet (PD) process prior on the location parameters μ_i ; i=1,...,n of the associated latent variables.

Computational inference about the cluster allocation and the posterior distribution of the parameters is performed using MCMC.

A full description of the model is given in the article by Carmona et al. (2016) (https://arxiv.org/abs/1612.00083). See References.

The model considers an individual y_i that is characterized by a multivariate response of dimension p, i.e., y_i=(y_{i,1},...,y_{i,p}). The total number of variables p is divided into c continuous variables, o ordinal variables, and m nominal variables such that p=c+o+m.

For the continuous variables, it is convenient that the variables have real support. The user may need to transform the original values before calling MIXclustering.

For each response y_i=(y_{i,1},...,y_{i,p}) (of dimension p) a corresponding latent vector z_i=(z_{i,1},...,z_{i,q}) (of dimension q) is created, according to the following:

For each continuous variable y_{i,j} , j = 1,...,c, the algorithm uses a latent variable with the same value, z_{i,j}=y_{i,j}.
For each ordinal variable y_{i,j} , j = c+1,...,c+o, with K_j different ordered values, the algorithm creates one latent variable z_{i,j} that maps the categories into continuous values divided by thresholds. For example, for a binary y_j, we have y_j=0 iff z_j<0 and y_j=1 iff z_j>0.
For each nominal variable y_{i,j} , j = c+o+1,...,c+o+m, with L_j categories, the algorithm requires L_j-1 latent variables, whose relative order is consistent with the observed category.
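
As a rough illustration of the three rules above, the following sketch counts the latent dimension q implied by a data set and its `var_type` vector. The helper `latent_dim` and the toy data are hypothetical and not part of BNPMIXcluster; the sketch also assumes that L_j is the number of distinct observed categories in a nominal column.

```r
# Hypothetical helper (not a package function): latent dimension per variable type
latent_dim <- function(Y, var_type) {
  Y <- as.data.frame(Y)
  stopifnot(length(var_type) == ncol(Y),
            all(var_type %in% c("c", "o", "m")))
  # Continuous and ordinal columns contribute one latent variable each
  q_c <- sum(var_type == "c")
  q_o <- sum(var_type == "o")
  # A nominal column with L_j observed categories contributes L_j - 1 latents
  nominal_cols <- which(var_type == "m")
  q_m <- sum(vapply(nominal_cols,
                    function(j) length(unique(na.omit(Y[[j]]))) - 1L,
                    integer(1)))
  c(continuous = q_c, ordinal = q_o, nominal = q_m, q = q_c + q_o + q_m)
}

# Toy data: one continuous, one binary (ordinal) and one nominal column
toy <- data.frame(income = c(-0.3, 1.2, 0.4, 2.1),
                  poor   = c(0, 1, 0, 1),
                  region = c("north", "south", "east", "south"))
latent_dim(toy, c("c", "o", "m"))

# Threshold rule for a binary variable: y = 1 iff z > 0
z_binary <- c(-1.4, 0.2, -0.1, 2.3)
as.integer(z_binary > 0)
```

Here the nominal column has three categories and therefore needs two latent variables, so q exceeds p whenever a nominal variable has more than two categories.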

The data may come from a complex survey sample where each individual y_i has a known sampling probability π_i, i = 1,...,n. The reciprocals of these sampling probabilities, w_i=1/π_i, are called expansion factors or sampling design weights.

The joint model for the latent vector is therefore:

(z_i | μ_i,Σ) ~ N_q(μ_i, π_i Σ )

(Note: the final model in Carmona et al. (2016) has variance κ π_i Σ. This value of κ can be used in the package through a transformed sampling probability vector π_i^*=κπ_i)
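
A minimal sketch of this transformation, using made-up expansion factors: the vector `expansion_f_toy` and the scaling constant `kappa_scale` are hypothetical values chosen for illustration. Note that the scaling constant κ in the note above is distinct from the `kappa` argument of MIXclustering, which is a Metropolis-Hastings tuning parameter.

```r
# Hypothetical expansion factors for five sampled households
expansion_f_toy <- c(120, 80, 200, 150, 95)

# Sampling probabilities are the reciprocals of the expansion factors
sampling_prob_toy <- 1 / expansion_f_toy

# Illustrative scaling constant, here a multiple of the mean expansion factor
kappa_scale <- 2 * mean(expansion_f_toy)

# Transformed sampling probabilities pi_i^* = kappa * pi_i
sampling_prob_star <- kappa_scale * sampling_prob_toy
sampling_prob_star
```

The transformed vector would then be passed as `sampling_prob`; the households example in this page follows the same pattern with `sampling_prob = k * sampling_prob_pov`.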

The clustering model is based on an appropriate choice of the prior distribution on the μ_i's. A clustering of the μ_i's induces a clustering of the y_i's. Our prior on the μ_i's is:

μ_i | G~G, iid for i=1,...,n

Where G~PD(a,b,G_0) is a Poisson-Dirichlet process with parameters a \in [0,1), b>-a and centering measure G_0. The Dirichlet and the normalized stable processes arise when a=0 and when b=0, respectively.

In consequence, this choice of prior implies that the μ_i's are exchangeable with marginal distribution μ_i~G_0 for all i=1,...,n.

In our case, G_0(μ)=N(0,Σ_μ), where Σ_μ = diag( σ^2_{μ 1} ,...,σ^2_{μ q} ).

The parameters a and b in the model define the PD process and therefore control the number of groups. These parameters can be fixed, resulting in a larger/smaller number of groups if assigned a larger/smaller value, respectively.
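
The effect of a and b on the number of groups can be explored with the standard urn (generalized Chinese restaurant) representation of the PD process: with k existing groups of sizes n_1,...,n_k among the first i-1 individuals, individual i joins group j with probability proportional to n_j - a and opens a new group with probability proportional to b + a*k. The sketch below simulates prior partitions only; it is not the sampler used by the package, and the function names are hypothetical.

```r
# Simulate a random partition of n items from the PD(a, b) urn scheme
simulate_pd_partition <- function(n, a, b) {
  stopifnot(n >= 1, a >= 0, a < 1, b > -a)
  cluster <- integer(n)
  cluster[1] <- 1L
  sizes <- 1L
  if (n > 1) {
    for (i in 2:n) {
      k <- length(sizes)
      # Existing group j has weight n_j - a; a new group has weight b + a * k
      weights <- c(sizes - a, b + a * k)
      pick <- sample.int(k + 1L, size = 1L, prob = weights)
      if (pick > k) {
        sizes[pick] <- 1L                  # open a new group
      } else {
        sizes[pick] <- sizes[pick] + 1L    # join an existing group
      }
      cluster[i] <- pick
    }
  }
  cluster
}

# Average number of groups over repeated prior simulations
mean_n_groups <- function(a, b, n = 200, reps = 50) {
  mean(replicate(reps, length(unique(simulate_pd_partition(n, a, b)))))
}

set.seed(1)
mean_n_groups(a = 0,   b = 0.5)
mean_n_groups(a = 0.5, b = 0.5)
mean_n_groups(a = 0,   b = 5)
```

Comparing the three calls gives a feel for how strongly each parameter drives the prior number of groups before any data are seen.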

There are 9 hyperparameters in the function that also characterize the prior distributions in the model:

f(a) = alpha * I(a=0) + (1-alpha) * dbeta( a | d_0_a , d_1_a )
f(b | a) = dgamma( b + a | d_0_b , d_1_b )
sigma^2 ~ inverse-gamma( d_0_z , d_1_z)
sigma^2_mu ~ inverse-gamma( d_0_mu , d_1_mu )
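
To get a feel for these priors, the following sketch draws from them with base R. Several points are assumptions made for illustration: the Beta component is taken to use the pair (d_0_a, d_1_a), as the argument descriptions suggest; the gamma and inverse-gamma distributions are taken to be parametrized by shape and rate (the package may use a different convention); and `draw_prior` is a hypothetical helper, not a package function.

```r
# One joint draw from the priors of a, b, sigma^2 and sigma^2_mu
# Assumption: (d_0, d_1) are read as (shape, rate) for gamma-type priors
draw_prior <- function(alpha = 0.5, d_0_a = 1, d_1_a = 1,
                       d_0_b = 1, d_1_b = 1,
                       d_0_z = 2.1, d_1_z = 30,
                       d_0_mu = 2.1, d_1_mu = 30) {
  # a equals 0 with probability alpha, otherwise a ~ Beta(d_0_a, d_1_a)
  a <- if (runif(1) < alpha) 0 else rbeta(1, d_0_a, d_1_a)
  # b + a ~ Gamma(d_0_b, d_1_b), which guarantees b > -a
  b <- rgamma(1, shape = d_0_b, rate = d_1_b) - a
  # Inverse-gamma draws obtained as reciprocals of gamma draws
  sigma2    <- 1 / rgamma(1, shape = d_0_z,  rate = d_1_z)
  sigma2_mu <- 1 / rgamma(1, shape = d_0_mu, rate = d_1_mu)
  c(a = a, b = b, sigma2 = sigma2, sigma2_mu = sigma2_mu)
}

set.seed(42)
prior_draws <- t(replicate(1000, draw_prior()))

# Share of draws with a = 0 (Dirichlet process case) and spread of b
mean(prior_draws[, "a"] == 0)
summary(prior_draws[, "b"])
```

Rerunning the sketch with different hyperparameter values shows how the prior mass on a and b shifts before committing to a full MCMC run.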

The choice of these values also affects the number of resulting clusters, since they affect the variance implied in the model.

For example, increasing the values of d_1_a and d_1_b reduces the number of groups.

Finally, the function parameters eta, kappa and delta are tuning parameters that control the acceptance rate in the random-walk MH steps for the parameters b, Λ_{j,j} (variance of latents) and Ω_{i,j} (correlation of latents), respectively. Changing these parameters is not recommended (they are used in the internal functions sampling_b , sampling_Lambda_jj , sampling_Omega_ij).

MIXclustering returns an S3 object of class "MIXcluster".

The generic methods summary and plot are defined for this class.

An object of class "MIXcluster" is a list containing the following components:

cluster: vector with the cluster allocation for each row in the data. It corresponds to the MCMC iteration whose arrangement is Closest-To-Average (CTA).
cluster_heterogeneity: Heterogeneity Measure (HM) for the cluster in the previous point. The HM measure is discussed in section 4 of Carmona et al. (2017).
Y.cluster.summary: a summary of the data divided by the allocation in $cluster.
Y.var_type: vector with the variable types in the data.
Y.na: vector specifying the rows with missing values.
Y.n: number of rows in the data.
Y.p: number of variables in the data.
MC.clusters: matrix with the cluster allocation for each row in the data. Each column corresponds to an effective iteration in the MCMC simulation of the model (after discarding burn-in and thinning iterations).
MC.clusters_heterogeneity: Heterogeneity Measure (HM) for all the clusters returned in MC.clusters.
cluster.matrix.avg: average similarity matrix of size n by n.
MC.values: a list with the simulated values of the chains for the parameters a,b,Λ,Ω.
MC.accept.rate: a named vector with the acceptance rates for each parameter. It includes iterations that are discarded in the burn-in period and thinning.
call: the matched call.
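
The relation between MC.clusters, cluster.matrix.avg and the CTA arrangement can be sketched on a toy allocation matrix. The matrix `mc_clusters` below is made up, and the use of the squared Frobenius distance to choose the closest draw is an assumption for illustration; the package may measure closeness differently.

```r
# Hypothetical allocations: rows are individuals, columns are kept MCMC draws
mc_clusters <- cbind(c(1, 1, 2, 2, 3),
                     c(1, 1, 1, 2, 2),
                     c(1, 1, 2, 2, 2),
                     c(1, 2, 2, 3, 3))

# n x n similarity matrix of one draw: 1 if two rows share a cluster, else 0
co_cluster <- function(z) outer(z, z, "==") * 1

sim_per_draw <- lapply(seq_len(ncol(mc_clusters)),
                       function(s) co_cluster(mc_clusters[, s]))

# Average similarity matrix across draws
sim_avg <- Reduce(`+`, sim_per_draw) / length(sim_per_draw)

# Assumed closeness criterion: squared Frobenius distance to the average
dist_to_avg <- vapply(sim_per_draw,
                      function(S) sum((S - sim_avg)^2),
                      numeric(1))
cta_draw <- which.min(dist_to_avg)

round(sim_avg, 2)
mc_clusters[, cta_draw]
```

Entries of the average matrix close to 1 indicate pairs of individuals that are grouped together in most draws, which is what the heatmap plot of a "MIXcluster" object is meant to convey.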

Carmona, C., Nieto-Barajas, L. E. & Canale, A. (2017). Model based approach for household clustering with mixed scale variables. (https://arxiv.org/abs/1612.00083)

summary.MIXcluster for a summary of the clustering results, plot.MIXcluster for graphical representation of results.

##############################
#     Simulation study 1     #
#    Carmona et al. (2017)   #
##############################

# Data and parameters are discussed in section 5.1 of Carmona et al. (2017) #

# Set seed for reproducibility #
set.seed(0) 


# Specification of data Y #
help(Y_ex_5_1)

# Observable data
# Choose scenario: 1, 2, or 3
ex_i <- 1

# Prior specification
# Choose "a", "b" or "c"
param_j <- "c"

# Specify the data type that is being provided to the method
var_type_Y_ex_5_1 <- list( c("c","c","c"),
                           c("o","o"),
                           c("o","o","o","c") )
## Not run: 
cluster_ex <- MIXclustering( Y = as.matrix(Y_ex_5_1[[ ex_i ]]),
                             var_type=var_type_Y_ex_5_1[[ ex_i ]],
                             
                             n_iter_out=1500,
                             n_burn=200,
                             n_thin=3,
                             
                             alpha = meta_param_ex[ param_j, "alpha" ],
                             d_0_a = meta_param_ex[ param_j, "d_0_a"],
                             d_1_a = meta_param_ex[ param_j, "d_1_a" ],
                             d_0_b = meta_param_ex[ param_j, "d_0_b" ],
                             d_1_b = meta_param_ex[ param_j, "d_1_b" ],
                             eta = meta_param_ex[ param_j, "eta" ],
                             kappa = meta_param_ex[ param_j, "kappa" ],
                             delta = meta_param_ex[ param_j, "delta" ],
                             
                             d_0_z = meta_param_ex[ param_j, "d_0_z" ],
                             d_1_z = meta_param_ex[ param_j, "d_1_z" ],
                             d_0_mu = meta_param_ex[ param_j, "d_0_mu" ],
                             d_1_mu = meta_param_ex[ param_j, "d_1_mu" ] )
# Summary of clustering results
summary(cluster_ex)

# Visualizing clustering results
plot(cluster_ex,type="heatmap")
plot(cluster_ex,type="chain")

# Comparison of cluster configurations #
# 1) Minimum distance with average MCMC iterations
# 2) Minimum Heterogeneity Measure (HM)
plot( x=jitter(cluster_ex$cluster),y=jitter(cluster_ex$clusterHMmin), col="#FF000080", pch=20,
      main=paste("Comparison of two relevant cluster configurations"),
      xlab="minimizes distance to average MCMC grouping", ylab="minimizes Heterogeneity Measure" )

# Comparison with the original clusters in the simulated data
plot(x=jitter(Z_latent_ex_5_1$cluster),
     y=jitter(cluster_ex$cluster),
     main=paste("Comparison real configuration with the model results"),
     xlab="Real cluster",
     ylab="Model cluster",
     pch=19, col="#FF000080")

## End(Not run)

##############################
#      Households data       #
#    Carmona et al. (2017)   #
##############################

# Testing "MIXclustering" function with poverty.data #
# Data and parameters are discussed in section 5.3 of Carmona et al. (2017) #

# Set seed for reproducibility #
set.seed(0) 


## Not run: 
# relevant variables for clustering households #
Y_names <- c("ict_norm",
             "ic_ali","ic_asalud","ic_cv",
             "ic_rezedu","ic_sbv","ic_segsoc",
             "niv_ed","tam_loc")
Y_var_type <- c("c","o","o","o","o","o","o","o","m")

# using only data from state 15 (Edomex) #
aux_subset <- rep(TRUE,nrow(poverty.data))
aux_subset <- aux_subset & is.element(substr(poverty.data$folioviv,1,2),"15")

Y_data <- poverty.data[aux_subset,Y_names]

### Sampling probability depending on the scenario ###
# Scenario description in section 5.3 of Carmona et al. (2017) #
# Choose 1, 2 or 3 #
poverty_sampling_spec <- 3


if (poverty_sampling_spec == 1) {
  k <- 1
  sampling_prob_pov <- rep(1,nrow(Y_data))
} else if (poverty_sampling_spec == 2) {
  k <- 2 * mean(poverty.data[aux_subset,"factor_hog"])
  sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"]
} else if (poverty_sampling_spec == 3) {
  k <- 4 * mean(poverty.data[aux_subset,"factor_hog"])
  sampling_prob_pov <- 1/poverty.data[aux_subset,"factor_hog"]
}

cluster_poverty <- MIXclustering( Y=Y_data,
                                  var_type=Y_var_type,
                                  n_iter_out=1500,
                                  n_burn=200,
                                  n_thin=3,
                                  
                                  alpha = 0.5,
                                  d_0_a = 1, d_1_a = 1,
                                  d_0_b = 1, d_1_b = 1,
                                  
                                  eta = 2,
                                  kappa = 5,
                                  delta = 4,
                                  
                                  d_0_z = 2.1, d_1_z = 30,
                                  d_0_mu = 2.1, d_1_mu = 30,
                                  
                                  sampling_prob = k * sampling_prob_pov )

summary(cluster_poverty)
plot(cluster_poverty,type="heatmap")
plot(cluster_poverty,type="chain")

## End(Not run)

BNPMIXcluster documentation built on Nov. 30, 2020, 5:07 p.m.
