# befa: Bayesian Exploratory Factor Analysis In BayesFM: Bayesian Inference for Factor Modeling

## Description

This function implements the Bayesian Exploratory Factor Analysis (`befa`) approach developed in Conti et al. (CFSHP, 2014). It runs a MCMC sampler for a factor model with dedicated factors, where each manifest variable is allowed to load on at most one latent factor. The allocation of the manifest variables to the latent factors is not fixed a priori but determined stochastically during sampling. The minimum number of variables dedicated to each factor can be controlled by the user to achieve the desired level of identification. The manifest variables can be continuous or dichotomous, and control variables can be introduced as covariates.

## Usage

 ```1 2 3 4 5 6``` ```befa(model, data, burnin = 1000, iter = 10000, Nid = 3, Kmax, A0 = 10, B0 = 10, c0 = 2, C0 = 1, HW.prior = TRUE, nu0 = Kmax + 1, S0 = 1, kappa0 = 2, xi0 = 1, kappa = 1/Kmax, indp.tau0 = TRUE, rnd.step = TRUE, n.step = 5, search.delay = min(burnin, 10), R.delay = min(burnin, 100), dedic.start, alpha.start, sigma.start, beta.start, R.start, verbose = TRUE) ```

## Arguments

 `model` This argument specifies the manifest variables and the covariates used in the model (if any). It can be specified in two different ways: A numeric matrix or a data frame containing the manifest variables. This corresponds to a model without covariates, and the argument `data` is not required. A list of model formulas. Each element of the list specifies a manifest variable and its corresponding control variables (e.g., '`Y1 ~ X1 + X2`' to use `X1` and `X2` as control variables for `Y1`). If a formula has no left-hand side variable, covariates on the right-hand side are included in all equations (e.g., '`~ X3`' means that `X3` is used as a control variable for all the manifest variables). Argument `data` can be passed to the function in that case, otherwise parent data frame is used. Binary manifest variables should be specified as logical vectors in the data frame to be treated as dichotomous. `NA` values are accepted in manifest variables only. `data` Data frame. If missing, parent data frame if used. `burnin` Burn-in period of the MCMC sampler. `iter` Number of MCMC iterations saved for posterior inference (after burn-in). `Nid` Minimum number of manifest variables dedicated to each latent factor for identification. `Kmax` Maximum number of latent factors. If missing, the maximum number of factors that satisfies the identification condition determined by `Nid` and the Ledermann bound is specified (see CFSHP, section 2.2). `A0` Scaling parameters of the variance of the Normal prior on the nonzero factor loadings. Either a scalar or a numeric vector of length equal to the number of manifest variables. `B0` Variances of the Normal prior on the regression coefficients. Either a scalar or a numeric vector of length equal to the number of manifest variables. `c0` Shape parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables. `C0` Scale parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables. `HW.prior` If `TRUE`, implement Huang-Wand (2013) prior on the covariance matrix of the factors in the expanded model, otherwise use an Inverse-Wishart prior if `FALSE`, see CFSHP section 2.3.5. `nu0` Degrees of freedom of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model. `S0` Scale parameters of the Inverse-Wishart prior on the covariance matrix of latent factors in the expanded model: if `HW.prior = TRUE`, scale parameter of the Gamma hyperprior distribution on the individual scales of the Inverse-Wishart prior. if `HW.prior = FALSE`, diagonal elements of the scale matrix of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model. Either a scalar or a numeric vector of length equal to `Kmax`. `kappa0` First shape parameter of the Beta prior distribution on the probability τ_0 that a manifest variable does not load on any factor. `xi0` Second shape parameter of the Beta prior distribution on the probability τ_0 that a manifest variable does not load on any factor. `kappa` Concentration parameters of the Dirichlet prior distribution on the indicators. Either a scalar or a numeric vector of length equal to `Kmax`. `indp.tau0` If `TRUE`, specify the alternative prior specification with independent parameters τ_0m across manifest variables m = 1, ..., M, otherwise use a common parameter τ_0 if `FALSE`. `rnd.step` If `TRUE`, select randomly the number of intermediate steps in non-identified models at each MCMC iteration, otherwise use a fixed number of steps if `FALSE`. `n.step` Controls the number of intermediate steps in non-identified models: if `rnd.step = TRUE`, average number of steps. The number of steps is sampled at each MCMC iteration from 1+Poisson(`n.step`-1). if `rnd.step = FALSE`, fixed number of steps. `search.delay` Number of MCMC iterations run with fixed indicator matrix (specified in `dedic.start`) at beginning of MCMC sampling. `R.delay` Number of MCMC iterations run with fixed correlation matrix (specified in `dedic.start`) at beginning of MCMC sampling. `dedic.start` Starting values for the indicators. Vector of integers of length equal to the number of manifest variables. Each element takes a value among 0, 1, ..., `Kmax`. If missing, random allocation of the manifest variables to the maximum number of factors `Kmax`, with a minimum of `Nid` manifest variables loading on each factor. `alpha.start` Starting values for the factor loadings. Numeric vector of length equal to the number of manifest variables. If missing, sampled from a Normal distribution with zero mean and variance `A0`. `sigma.start` Starting values for the idiosyncratic variances. Numeric vector of length equal to the number of manifest variables. Sampled from prior if missing. `beta.start` Starting values for the regression coefficients. Numeric vector of length equal to the total number of regression coefficients, concatenated for all the manifest variables. Sampled from prior if missing. `R.start` Starting values for the correlation matrix of the latent factors. Numeric matrix with `Kmax` rows and columns, and unit diagonal elements. If missing, identity matrix is used. `verbose` If `TRUE`, display information on the progress of the function.

## Details

Model specification. The model is specified as follows, for each observation i = 1, ..., N:

Y*_i = β X_i + α θ_i + ε_i

θ_i ~ N(0, R)

ε_i ~ N(0, Σ)

Σ = diag(σ^2_1, ..., σ^2_M)

where Y*_i is the M-vector containing the latent variables underlying the corresponding M manifest variables Y_i, which can be continuous such that Y_im = Y*_im, or binary with Y_im = 1[Y*_im > 0], for m = 1, ..., M. The K-vector θ_i contains the latent factors, and α is the (M*K)-matrix of factor loadings. The M-vector ε_i is the vector of error terms. Covariates can be included in the Q-vector X_i and are related to the manifest variables through the (M*Q)-matrix of regression coefficients β. Intercept terms are automatically included, but can be omitted in some or all equations using the usual syntax for R formulae (e.g., 'Y1 ~ X1 - 1' specifies that that Y1 is regressed on X1 and no intercept is included in the corresponding equation).

The number of latent factors K is specified as `Kmax`. However, during MCMC sampling the stochastic search process on the matrix α may produce zero columns, thereby reducing the number of active factors.

The covariance matrix R of the latent factors is assumed to be a correlation matrix for identification.

Each row of the factor loading matrix α contains at most one nonzero element (dedicated factor model). The allocation of the manifest variables to the latent factors is indicated by the binary matrix Δ with same dimensions as α, such that each row Δ_m indicates which factor loading is different from zero, e.g.:

Δ_m = (0, .., 0, 1, 0, ..., 0) = e_k

indicates that variable m loads on the kth factor, where e_k is a K-vector that contains only zeros, besides its kth element that equals 1.

Identification. The function verifies that the maximum number of latent factors `Kmax` does not exceed the Ledermann bound. It also checks that `Kmax` is consistent with the identification restriction specified with `Nid` (enough variables should be available to load on the factors when `Kmax` is reached). The default value for `Kmax` is the minimum between the Ledermann bound and the maximum number of factors that can be loaded by `Nid` variables. The user is free to select the level of identification, see CFSHP section 2.2 (non-identified models are allowed with `Nid = 1`).

Note that identification is achieved only with respect to the scale of the latent factors. Non-identifiability problems may affect the posterior sample because of column switching and sign switching of the factor loadings. These issues can be addressed a posteriori with the functions `post.column.switch` and `post.sign.switch`.

Prior specification. The indicators are assumed to have the following probabilities, for k = 1, ..., K:

Prob(Δ_m = e_k | τ_k) = τ_k

τ = (τ_0, τ_1, ..., τ_K)

If `indp.tau0 = FALSE`, the probabilities are specified as:

τ = [τ_0, (1-τ_0)τ*_1, ..., (1-τ_0)τ*_K]

τ_0 ~ Beta(κ_0, ξ_0)

τ* = (τ*_1, ..., τ*_K) ~ Dir(κ)

with κ_0 = `kappa0`, ξ_0 = `xi0` and κ = `kappa`. Alternatively, if `indp.tau0 = TRUE`, the probabilities are specified as:

τ_m = [τ_0m, (1-τ_0m)τ*_1, ..., (1-τ_0m)τ*_K]

τ_0m ~ Beta(κ_0, ξ_0)

for each manifest variable m = 1, ..., M.

A normal-inverse-Gamma prior distribution is assumed on the nonzero factor loadings and on the idiosyncratic variances:

σ^2_m ~ Inv-Gamma(c0_m, C0_m)

α_m^Δ | σ^2_m ~ N(0, A0_m * σ^2_m)

where α_m^Δ denotes the only nonzero loading in the mth row of α.

For the regression coefficients, a multivariate Normal prior distribution is assumed on each row m = 1, ..., M of β:

β_m ~ N(0, B_0 I_Q)

The covariates can be different across manifest variables, implying zero restrictions on the matrix β. To specify covariates, use a list of formulas as `model` (see example below). Intercept terms can be introduced using

To sample the correlation matrix R of the latent factors, marginal data augmentation is implemented (van Dyk and Meng, 2001), see CFSHP section 2.2. Using the transformation Ω = Λ^{1/2} R Λ^{1/2}, the parameters Λ = diag(λ_1, ..., λ_K) are used as working parameters. These parameters correspond to the variances of the latent factors in an expanded version of the model where the factors do not have unit variances. Two prior distributions can be specified on the covariance matrix Ω in the expanded model:

• If `HW.prior = FALSE`, inverse-Wishart distribution:

Ω ~ Inv-Wishart(ν_0, diag(S0))

with ν_0 = `nu0` and S_0 = `S0`.

• If `HW.prior = TRUE`, Huang-Wand (2013) prior:

Ω ~ Inv-Wishart(nu0, W), W = diag(w_1, ..., w_K)

w_k ~ Gamma(1/2, 1/(2ν*S0_k))

with ν* = `nu0` - `Kmax` + 1, and the shape and rate parameters are specified such that the mean of the gamma distribution is equal to ν* S0_k, for each k = 1, ..., K.

Missing values. Missing values (`NA`) are allowed in the manifest variables. They are drawn from their corresponding conditional distributions during MCMC sampling. Control variables with missing values can be passed to the function. However, all the observations with at least one missing value in the covariates are discarded from the sample (a warning message is issued in that case).

## Value

The function returns an object of class '`befa`' containing the MCMC draws of the model parameters saved in the following matrices (each matrix has '`iter`' rows):

• `alpha`: Factor loadings.

• `sigma`: Idiosyncratic variances.

• `R`: Correlation matrix of the latent factors (off-diagonal elements only).

• `beta`: regression coefficients (if any).

• `dedic`: indicators (integers indicating on which factors the manifest variable load).

The returned object also contains:

• `nfac`: Vector of number of 'active' factors across MCMC iterations (i.e., factors loaded by at least `Nid` manifest variables).

• `MHacc`: Logical vector indicating accepted proposals of Metropolis-Hastings algorithm.

The parameters `Kmax` and `Nid` are saved as object attributes, as well as the function call and the number of mcmc iterations (`burnin` and `iter`), and two logical variables indicating if the returned object has been post processed to address the column switching problem (`post.column.switch`) and the sign switching problem (`post.sign.switch`).

## Author(s)

Rémi Piatek remi.piatek@gmail.com

## References

G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, http://dx.doi.org/10.1016/j.jeconom.2014.06.008.

A. Huang, M.P. Wand (2013): “Simple Marginally Noninformative Prior Distributions for Covariance Matrices”, Bayesian Analysis, 8(2), pages 439-452, http://dx.doi.org/10.1214/13-BA815.

D.A. van Dyk, X.-L. Meng (2001): “The Art of Data Augmentation”, Journal of Computational and Graphical Statistics, 10(1), pages 1-50, http://dx.doi.org/10.1198/10618600152418584.

`post.column.switch` and `post.sign.switch` for column switching and sign switching of the factor loading matrix and of the correlation matrix of the latent factors to restore identification a posteriori.

`summary.befa` and `plot.befa` to summarize and plot the posterior results.

`simul.R.prior` and `simul.nfac.prior` to simulate the prior distribution of the correlation matrix of the factors and the prior distribution of the indicator matrix, respectively. This is useful to perform prior sensitivity analysis and to understand the role of the corresponding parameters in the factor search.

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51``` ```#### model without covariates set.seed(6) # generate fake data with 15 manifest variables and 3 factors N <- 100 # number of observations Y <- simul.dedic.facmod(N, dedic = rep(1:3, each = 5)) # run MCMC sampler # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) # post process MCMC draws to restore identification mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) summary(mcmc) # summarize posterior results plot(mcmc) # plot posterior results # summarize highest posterior probability (HPP) model summary(mcmc, what = 'hppm') #### model with covariates # generate covariates and regression coefficients Xcov <- cbind(1, matrix(rnorm(4*N), ncol = 4)) colnames(Xcov) <- c('(Intercept)', paste0('X', 1:4)) beta <- rbind(rnorm(15), rnorm(15), diag(3) %x% t(rnorm(5))) # add covariates to previous model Y <- Y + Xcov %*% beta # specify model model <- c('~ X1', # X1 covariate in all equations paste0('Y', 1:5, ' ~ X2'), # X2 covariate for Y1-Y5 only paste0('Y', 6:10, ' ~ X3'), # X3 covariate for Y6-Y10 only paste0('Y', 11:15, ' ~ X4')) # X4 covariate for Y11-Y15 only model <- lapply(model, as.formula) # make list of formulas # run MCMC sampler, post process and summarize mcmc <- befa(model, data = data.frame(Y, Xcov), Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) mcmc.sum <- summary(mcmc) mcmc.sum # compare posterior mean of regression coefficients to true values beta.comp <- cbind(beta[beta != 0], mcmc.sum\$beta[, 'mean']) colnames(beta.comp) <- c('true', 'mcmc') print(beta.comp, digits = 3) ```

BayesFM documentation built on Oct. 23, 2020, 5:14 p.m.