# fabMix: Main function In fabMix: Overfitting Bayesian Mixtures of Factor Analyzers with Parsimonious Covariance and Unknown Number of Components

## Description

This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.

## Usage

 1 2 3 4 5 fabMix(model, nChains, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, rmDir, parallelModels, lowerTriangular) 

## Arguments

 model Any subset of "UUU" "CUU" "UCU" "CCU" "UCC" "UUC" "CUC", "CCC" indicating the fitted models. By default, all models are fitted. nChains The number of parallel heated chains. When 'dirPriorAlphas' is supplied, 'nChains' can be ignored. dirPriorAlphas vector of length nChains in the form of an increasing sequence of positive scalars. Each entry contains the (common) prior Dirichlet parameter for the corresponding chain. Default: dirPriorAlphas = c(1, 1 + dN*(2:nChains - 1))/Kmax, where dN = 1, for nChains > 1. Otherwise, dirPriorAlphas = 1/Kmax. rawData The observed data as an n\times p matrix. Clustering is performed on the rows of the matrix. outDir Name of the output folder. An error is thrown if the directory already exists inside the current working directory. Note: it should NOT correspond to an absolute path, e.g.: outDir = fabMix_example is acceptable, but outDir = C:\Username\Documents\fabMix_example is not. Kmax Number of components in the overfitted mixture. Default: 20. mCycles Number of MCMC cycles. Each cycle consists of nIterPerCycle MCMC iterations. At the end of each cycle a swap of the state of two randomly chosen adjacent chains is attempted. burnCycles Number of cycles that will be discarded as burn-in period. g Prior parameter g. Default value: g = 0.5. h Prior parameter h. Default value: h = 0.5. alpha_sigma Prior parameter α. Default value: α = 0.5. beta_sigma Prior parameter β. Default value: β = 0.5. q A vector containing the number of factors to be fitted. normalize Should the observed data be normalized? Default value: TRUE. (Recommended) thinning Optional integer denoting the thinning of the keeped MCMC cycles. zStart Optional starting value for the allocation vector. nIterPerCycle Number of iteration per MCMC cycle. Default value: 10. gibbs_z Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. warm_up_overfitting Number of iterations for the overfitting initialization scheme. Default value: 500. warm_up Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 5000. overfittingInitialization Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. progressGraphs Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. gwar Initialization parameter. Default: 0.05. rmDir Logical value indicating whether to delete the outDir directory. Default: TRUE. parallelModels Model-level parallelization: An optional integer specifying the number of cores that will be used in order to fit in parallel each member of model. Default: NULL (no model-level parallelization). lowerTriangular logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE.

## Details

Let \boldsymbol{X}_i; i =1,…,n denote independent p-dimensional random vectors. Let Y_i\in R^q with q < p denote the latent factor for observation i = 1,…,n. In the typical factor analysis model, each observation is modelled as \boldsymbol{X}_i = \boldsymbol{μ} + \boldsymbol{Λ} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i , with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{Σ}), where \boldsymbol{\varepsilon}_i and Y_i are assumed independent, i =1,…,n. The p\times q matrix Λ consists of the factor loadings. Assume that there are K clusters and let Z_i denotes the latent allocation of observation i to one amongs the k clusters, with prior probability P(Z_i = k) = w_k, k = 1,…,K, independent for i=1,…,n. Following McNicholas et al (2008), the following parameterizations are used:

UUU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{Σ}_k)

UCU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{Σ})

UUC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,σ_k \boldsymbol{I}_p)

UCC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ}_k \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,σ \boldsymbol{I}_p)

CUU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{Σ}_k)

CCU model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,\boldsymbol{Σ})

CUC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,σ_k \boldsymbol{I}_p)

CCC model: (\boldsymbol{X}_i|Z_i = k) = \boldsymbol{μ}_k + \boldsymbol{Λ} \boldsymbol{Y}_i + \boldsymbol{\varepsilon}_i, with \boldsymbol{\varepsilon}_i \sim \mathcal N(0,σ \boldsymbol{I}_p)

In all cases, \boldsymbol{\varepsilon}_i and \boldsymbol{Y}_i are assumed independent, i =1,…,n. Note that \boldsymbol{Σ}_k and \boldsymbol{Σ} denote positive definite matrices, \boldsymbol{I}_p denotes the p\times p identity matrix and σ_k, σ denote positive scalars.

## Value

An object of class fabMix.object, that is, a list consisting of the following entries:

 bic  Bayesian Information Criterion per model and number of factors. class  The estimated single best clustering of the observations according to the selected model. n_Clusters_per_model  The most probable number of clusters (number of non-empty components of the overfitted mixture) per model and number of factors. posterior_probability  The posterior probability of the estimated allocations according to the selected model. covariance_matrix  The estimated posterior mean of the covariance matrix per cluster according to the selected model. mu  The estimated posterior mean of the mean per cluster according to the selected model. weights  The estimated posterior mean of the mixing proportions according to the selected model. selected_model  Data frame containing the parameterization, number of clusters and factors of the selected model. mcmc  A list containing the MCMC draws for the parameters of the selected model. Each entry is returned as an mcmc object, a class imported from the coda package (Plummer et al, 2006). All component-specific parameters have been reordered according to the ECR algorithm in order to undo the label switching problem. However, the output corresponding to factor scores and factor loadings is not identifiable due to sign-switching across the MCMC trace. data  The observed data. regularizedExpression The regularized expressions of variable scores to each factor per cluster (see Papastamoulis 2018, CSDA). Kmap_prob The posterior probability of the Maximum A Posteriori number of alive clusters for each parameterization and factor level.

## Note

It is recommended to use: normalize = TRUE (default). Tuning of dirPriorAlphas may be necessary to achieve reasonable acceptance rates of chain swaps. Note that the output is reordered in order to deal with the label switching problem, according to the ECR algorithm applied by dealWithLabelSwitching function.

Parallelization is enabled in both the chain-level as well as the model-level. By default all heated chains (specified by the nchains argument) run in parallel using (at most) the same number of threads (if available). If parallelModels = NULL (default), then the selected parameterizations will run (serially) on the same thread. Otherwise, if parallelModels = x (where x denotes a positive integer), the algorithm will first use x threads to fit the specified model parameterizations in parallel, and furthermore will also parallelize the heated chains (according to the remaining free cores on the user's system). The user should combine parallelModels with nChains efficiently, for example: if the number of available threads equals 12 and the user wishes to run 3 heated chains per model (recall that there are 8 parameterizations in total), then, an ideal allocation would be parallelModels = 4 and nChains = 3 because all available threads will be constantly busy. If the user wishes to run nChains = 4 heated chains per model using 12 threads, then an ideal allocation would be parallelModels = 3 models running in parallel. In the case where parallelModels*nChains > m, with m denoting the available number of threads, the algorithm will first allocate min(parallelModels, m) threads to run the same number of parameterizations in parallel, and then the remaining threads (if any) will be used to process the parallel heated chains. If no other threads are available, the heated chains will be allocated to single threads.

## Author(s)

Panagiotis Papastamoulis

## References

Martyn Plummer, Nicky Best, Kate Cowles and Karen Vines (2006). CODA: Convergence Diagnosis and Output Analysis for MCMC, R News, vol 6, 7-11.

McNicholas, P.D. and Murphy, T.B. Stat Comput (2008) 18: 285. https://doi.org/10.1007/s11222-008-9056-0.

Papastamoulis, P. (2018). Overfitting Bayesian mixtures of factor analyzers with an unknown number of components. Computational Statistics and Data Analysis, 124: 220-234. DOI: 10.1016/j.csda.2018.03.007.

Papastamoulis, P (2019). Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components. Statistics and Computing, doi: 10.1007/s11222-019-09891-z.

plot.fabMix.object

## Examples

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 # TOY EXAMPLE (very small numbers... only for CRAN check purposes) ################################################################# # (a) using 2 cores in parallel, each one running 2 heated chains. ################################################################# library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) # Run fabMix for a _small_ number of iterations # but considering only the UUU (maximal model), # using the default prior parallel heating parameters dirPriorAlphas. # NOTE: dirPriorAlphas may require some tuning in general. qRange <- 2 # values for the number of factors (only the true number # is considered here) Kmax <- 2 # number of components for the overfitted mixture model nChains <- 2 # number of parallel heated chains set.seed(1) fm <- fabMix( model = "UUU", nChains = nChains, rawData = syntheticDataset$data, outDir = "toyExample", Kmax = Kmax, mCycles = 4, burnCycles = 1, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 2, warm_up = 5) # WARNING: the following parameters: # Kmax, nChains, mCycles, burnCycles, warm_up_overfitting, warm_up # should take (much) _larger_ values. E.g. a typical implementation consists of: # Kmax = 20, nChains >= 3, mCycles = 1100, burnCycles = 100, # warm_up_overfitting = 500, warm_up = 5000. # You may also print and plot # print(fm) # plot(fm) ################################################################# # (b) using 12 cores_____________________________________________ #_______4 models with 3 heated chains running in parallel________ #_______considering all 8 model parameterizations________________ ################################################################# ## Not run: library('fabMix') set.seed(99) n = 200 # sample size p = 30 # number of variables q = 2 # number of factors K = 5 # number of clusters sINV_diag = rep(1/20,p) # diagonal of inverse variance of errors syntheticDataset <- simData(sameLambda=FALSE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) qRange <- 1:3 # range of values for the number of factors Kmax <- 20 # number of components for the overfitted mixture model nChains <- 3 # number of parallel heated chains # the next command takes ~ 2 hours in a Linux machine with 12 threads. fm <- fabMix( parallelModels = 4, nChains = nChains, model = c("UUU","CUU","UCU","CCU","UCC","UUC","CUC","CCC"), rawData = syntheticDataset$data, outDir = "toyExample_b", Kmax = Kmax, mCycles = 1100, burnCycles = 100, q = qRange) print(fm) plot(fm, what = "BIC") # see also # plot(fm); summary(fm) ## End(Not run) 

fabMix documentation built on Feb. 20, 2020, 1:09 a.m.