fabMix: Main function
In fabMix: Overfitting Bayesian Mixtures of Factor Analyzers with Parsimonious Covariance and Unknown Number of Components

fabMix

R Documentation

Main function

Description

This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.

Usage

fabMix(model, nChains, dirPriorAlphas, rawData, outDir, Kmax, mCycles, 
	burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize,  
	thinning, zStart, nIterPerCycle, gibbs_z, 
	warm_up_overfitting, warm_up, overfittingInitialization, 
	progressGraphs, gwar, rmDir, parallelModels, lowerTriangular)

Arguments

`model`	Any subset of "UUU" "CUU" "UCU" "CCU" "UCC" "UUC" "CUC", "CCC" indicating the fitted models. By default, all models are fitted.
`nChains`	The number of parallel heated chains. When 'dirPriorAlphas' is supplied, 'nChains' can be ignored.
`dirPriorAlphas`	vector of length `nChains` in the form of an increasing sequence of positive scalars. Each entry contains the (common) prior Dirichlet parameter for the corresponding chain. Default: `dirPriorAlphas = c(1, 1 + dN*(2:nChains - 1))/Kmax`, where `dN = 1`, for `nChains > 1`. Otherwise, `dirPriorAlphas = 1/Kmax`.
`rawData`	The observed data as an `n\times p` matrix. Clustering is performed on the rows of the matrix.
`outDir`	Name of the output folder. An error is thrown if the directory already exists inside the current working directory. Note: it should NOT correspond to an absolute path, e.g.: outDir = `fabMix_example` is acceptable, but outDir = `C:\Username\Documents\fabMix_example` is not.
`Kmax`	Number of components in the overfitted mixture. Default: 20.
`mCycles`	Number of MCMC cycles. Each cycle consists of `nIterPerCycle` MCMC iterations. At the end of each cycle a swap of the state of two randomly chosen adjacent chains is attempted.
`burnCycles`	Number of cycles that will be discarded as burn-in period.
`g`	Prior parameter `g`. Default value: `g = 0.5`.
`h`	Prior parameter `h`. Default value: `h = 0.5`.
`alpha_sigma`	Prior parameter `\alpha`. Default value: `\alpha = 0.5`.
`beta_sigma`	Prior parameter `\beta`. Default value: `\beta = 0.5`.
`q`	A vector containing the number of factors to be fitted.
`normalize`	Should the observed data be normalized? Default value: TRUE. (Recommended)
`thinning`	Optional integer denoting the thinning of the keeped MCMC cycles.
`zStart`	Optional starting value for the allocation vector.
`nIterPerCycle`	Number of iteration per MCMC cycle. Default value: 10.
`gibbs_z`	Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1.
`warm_up_overfitting`	Number of iterations for the overfitting initialization scheme. Default value: 500.
`warm_up`	Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 5000.
`overfittingInitialization`	Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE.
`progressGraphs`	Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE.
`gwar`	Initialization parameter. Default: 0.05.
`rmDir`	Logical value indicating whether to delete the `outDir` directory. Default: TRUE.
`parallelModels`	Model-level parallelization: An optional integer specifying the number of cores that will be used in order to fit in parallel each member of `model`. Default: NULL (no model-level parallelization).
`lowerTriangular`	logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE.

Details

Let \boldsymbol{X}_i; i =1,\ldots,n denote independent p-dimensional random vectors. Let Y_i\in R^q with q < p denote the latent factor for observation i = 1,\ldots,n. In the typical factor analysis model, each observation is modelled as \boldsymbol{X}_i = \boldsymbol{\mu} + \boldsymbol{\Lambda} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i , with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{\Sigma}), where \boldsymbol{\varepsilon}_i and Y_i are assumed independent, i =1,\ldots,n. The p\times q matrix \Lambda consists of the factor loadings. Assume that there are K clusters and let Z_i denotes the latent allocation of observation i to one amongs the k clusters, with prior probability P(Z_i = k) = w_k, k = 1,\ldots,K, independent for i=1,\ldots,n. Following McNicholas et al (2008), the following parameterizations are used:

UUU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{\Sigma}_k)

UCU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{\Sigma})

UUC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\sigma_k \boldsymbol{I}_p)

UCC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\sigma \boldsymbol{I}_p)

CUU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{\Sigma}_k)

CCU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{\Sigma})

CUC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\sigma_k \boldsymbol{I}_p)

CCC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{\mu}_k + \boldsymbol{\Lambda} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\sigma \boldsymbol{I}_p)

In all cases, \boldsymbol{\varepsilon}_i and \boldsymbol{Y}_i are assumed independent, i =1,\ldots,n. Note that \boldsymbol{\Sigma}_k and \boldsymbol{\Sigma} denote positive definite matrices, \boldsymbol{I}_p denotes the p\times p identity matrix and \sigma_k, \sigma denote positive scalars.

Value

An object of class fabMix.object, that is, a list consisting of the following entries:

`bic`	Bayesian Information Criterion per model and number of factors.
`class`	The estimated single best clustering of the observations according to the selected model.
`n_Clusters_per_model`	The most probable number of clusters (number of non-empty components of the overfitted mixture) per model and number of factors.
`posterior_probability`	The posterior probability of the estimated allocations according to the selected model.
`covariance_matrix`	The estimated posterior mean of the covariance matrix per cluster according to the selected model.
`mu`	The estimated posterior mean of the mean per cluster according to the selected model.
`weights`	The estimated posterior mean of the mixing proportions according to the selected model.
`selected_model`	Data frame containing the parameterization, number of clusters and factors of the selected model.
`mcmc`	A list containing the MCMC draws for the parameters of the selected model. Each entry is returned as an `mcmc` object, a class imported from the `coda` package (Plummer et al, 2006). All component-specific parameters have been reordered according to the ECR algorithm in order to undo the label switching problem. However, the output corresponding to factor scores and factor loadings is not identifiable due to sign-switching across the MCMC trace.
`data`	The observed data.
`regularizedExpression`	The regularized expressions of variable scores to each factor per cluster (see Papastamoulis 2018, CSDA).
`Kmap_prob`	The posterior probability of the Maximum A Posteriori number of alive clusters for each parameterization and factor level.

Note

It is recommended to use: normalize = TRUE (default). Tuning of dirPriorAlphas may be necessary to achieve reasonable acceptance rates of chain swaps. Note that the output is reordered in order to deal with the label switching problem, according to the ECR algorithm applied by dealWithLabelSwitching function.

Parallelization is enabled in both the chain-level as well as the model-level. By default all heated chains (specified by the nchains argument) run in parallel using (at most) the same number of threads (if available). If parallelModels = NULL (default), then the selected parameterizations will run (serially) on the same thread. Otherwise, if parallelModels = x (where x denotes a positive integer), the algorithm will first use x threads to fit the specified model parameterizations in parallel, and furthermore will also parallelize the heated chains (according to the remaining free cores on the user's system). The user should combine parallelModels with nChains efficiently, for example: if the number of available threads equals 12 and the user wishes to run 3 heated chains per model (recall that there are 8 parameterizations in total), then, an ideal allocation would be parallelModels = 4 and nChains = 3 because all available threads will be constantly busy. If the user wishes to run nChains = 4 heated chains per model using 12 threads, then an ideal allocation would be parallelModels = 3 models running in parallel. In the case where parallelModels*nChains > m, with m denoting the available number of threads, the algorithm will first allocate min(parallelModels, m) threads to run the same number of parameterizations in parallel, and then the remaining threads (if any) will be used to process the parallel heated chains. If no other threads are available, the heated chains will be allocated to single threads.

Author(s)

Panagiotis Papastamoulis

References

Martyn Plummer, Nicky Best, Kate Cowles and Karen Vines (2006). CODA: Convergence Diagnosis and Output Analysis for MCMC, R News, vol 6, 7-11.

McNicholas, P.D. and Murphy, T.B. Stat Comput (2008) 18: 285. https://doi.org/10.1007/s11222-008-9056-0.

Papastamoulis, P. (2018). Overfitting Bayesian mixtures of factor analyzers with an unknown number of components. Computational Statistics and Data Analysis, 124: 220-234. DOI: 10.1016/j.csda.2018.03.007.

Papastamoulis, P (2019). Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components. Statistics and Computing, doi: 10.1007/s11222-019-09891-z.

Examples

# TOY EXAMPLE (very small numbers... only for CRAN check purposes)

#################################################################
# (a) using 2 cores in parallel, each one running 2 heated chains.
#################################################################
library('fabMix')

n = 8                # sample size
p = 5                # number of variables
q = 2                # number of factors
K = 2		     # true number of clusters

sINV_diag = 1/((1:p))	 # diagonal of inverse variance of errors
set.seed(100)
syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, 
			sINV_values = sINV_diag)
colnames(syntheticDataset$data) <- paste0("x_",1:p)

# Run `fabMix` for a _small_ number of iterations 
#	but considering only the `UUU` (maximal model),
# 	using the default prior parallel heating parameters `dirPriorAlphas`.
#	NOTE: `dirPriorAlphas` may require some tuning in general.


qRange <- 2	# values for the number of factors (only the true number 
#                                                    is considered here)
Kmax <- 2	# number of components for the overfitted mixture model
nChains <- 2	# number of parallel heated chains

set.seed(1)
fm <- fabMix( model = "UUU", nChains = nChains, 
	rawData = syntheticDataset$data, outDir = "toyExample",
        Kmax = Kmax, mCycles = 4, burnCycles = 1, q = qRange,
        g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, 
        warm_up_overfitting = 2, warm_up = 5) 

# WARNING: the following parameters: 
#  Kmax, nChains, mCycles, burnCycles, warm_up_overfitting, warm_up 
#	 should take (much) _larger_ values. E.g. a typical implementation consists of:
#        Kmax = 20, nChains >= 3, mCycles = 1100, burnCycles = 100, 
#        warm_up_overfitting = 500, warm_up = 5000. 

# You may also print and plot
# print(fm)
# plot(fm)

#################################################################
# (b) using 12 cores_____________________________________________
#_______4 models with 3 heated chains running in parallel________
#_______considering all 8 model parameterizations________________
#################################################################
## Not run: 
library('fabMix')
set.seed(99)
n = 200                # sample size
p = 30                # number of variables
q = 2                # number of factors
K = 5		     # number of clusters
sINV_diag = rep(1/20,p) 	# diagonal of inverse variance of errors
syntheticDataset <- simData(sameLambda=FALSE,K.true = K, n = n, q = q, p = p, 
			sINV_values = sINV_diag)
colnames(syntheticDataset$data) <- paste0("x_",1:p)
qRange <- 1:3	# range of values for the number of factors
Kmax <- 20	# number of components for the overfitted mixture model
nChains <- 3	# number of parallel heated chains

# the next command takes ~ 2 hours in a Linux machine with 12 threads.

fm <- fabMix( parallelModels = 4, 
	nChains = nChains, 
	model = c("UUU","CUU","UCU","CCU","UCC","UUC","CUC","CCC"), 
	rawData = syntheticDataset$data, outDir = "toyExample_b",
        Kmax = Kmax, mCycles = 1100, burnCycles = 100, q = qRange) 

print(fm)
plot(fm, what = "BIC")
# see also
# plot(fm); summary(fm)


## End(Not run)

fabMix documentation built on May 29, 2024, 2:53 a.m.