knitr::opts_chunk$set(echo = TRUE)

In latent class analysis (LCA), the joint distribution of $r$ *items* $Y_1 \ldots Y_r$ is modelled in terms of $i$ latent *classes*. The items are conditionally independent given the unobserved class values.

Here we consider binary LCA models, in which the items can take one of two states, labelled "1" and "2" (Note that the `sBIC`

package is capable of handling LCA models with more than two states).

In order to simulate data from an LCA model, the following parameters must be set:

`alpha`

, a vector determining the prior probabilities of membership in each class, i.e. $\alpha_h$ is the probability of being in class $h$ for $h=1 \ldots i$.`P`

, an $r \times i$ matrix such that $P_{lh}$ gives the probability that $Y_l = 2$ given membership of class $h$.

We simulate from a model with $r=8$ items and $i=4$ classes. The class sizes are equal (i.e. `alpha`

is uniform) and the model is "simple" in the sense that only one of the conditional probabilities $P_{lh}$ is large (0.85) for each item and the others are small (0.1, or 0.2 depending on the item).

alpha = rep(0.25, 4) ## equal p1 = 0.85 p2 = 0.1 p3 = 0.2 P = matrix(c( p1,p2,p2,p2, p1,p3,p3,p3, p2,p1,p2,p2, p3,p1,p3,p3, p2,p2,p1,p2, p3,p3,p1,p3, p2,p2,p2,p1, p3,p3,p3,p1), nrow=8, ncol=4, byrow = TRUE)

The `simLCA()`

function from the `poLCA`

package (Drew and Linzer, 2011) simulates data from an LCA model. It requires the specification of the conditional probabilities in a different form (specifically, as a list of probability matrices) so we use a wrapper function `simLCA()`

. The `simLCA()`

function is designed to facilitate simulation studies, so it allows multiple simulated data sets of a given sample size to be generated. The simulated data sets are returned as a list of matrices length `nsim`

.

simLCA <- function(alpha, ProbMat, sampleSize, nsim) { ## Reformat the probability Matrix to a form required by poLCA.simdata probs = vector("list", nrow(ProbMat)) for (i in 1:nrow(ProbMat)) { probs[[i]] = cbind(ProbMat[i,], 1 - ProbMat[i,]) } X = poLCA.simdata(N = sampleSize*nsim, probs = probs, P = alpha)$dat split(X, rep(1:nsim, each=sampleSize)) }

The code below generates a single data set `X`

it takes the form of a $50 \times 8$ matrix with one column per item and one row per observation.

library(poLCA) set.seed(37421) X = simLCA(alpha, P, sampleSize=50, nsim=1)[[1]] head(X)

LCA models are fitted with the `poLCA::poLCA()`

function. The `poLCA()`

function uses a formula interface to determine which items are included in the model. The number of latent classes is determined by the `nclass`

argument. The code below fits a 3-class model to all 8 items.

f = cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7, Y8) ~ 1 poLCA(f, X, nclass = 3, verbose=FALSE)

As with other mixture models, the learning coefficients for LCA are not known exactly, but bounds can be derived and used to obtain an approximate singular BIC. In the case of LCA models, the bounds are extremely sensitive to the prior distribution on the class probabilities $\alpha$ as this determines the asymptotic behaviour on the boundary $\alpha_h = 0$. The argument is developed in some detail in Drton and Plummer (2017, Section 6.3).

In the case of a Dirichlet prior with shape parameter $\phi$, the bound for the learning coefficient $\lambda_{ij}$ of a model with $j$ classes embedded in a model with $i > j$ classes is given by Drton and Plummer (2017, equation 6.19) $$\lambda_{ij} \leq \min (\lambda^{-\phi}*{ij}, \lambda^{-r/2}*{ij})$$ where $\lambda^{-\phi}*{ij}$ is defined by Drton and Plummer (2017, equation 6.15) $$\lambda^{-\phi}*{ij} = \left{jr + j - 1 + (i-j)\phi\right}$$

These results suggest that $\phi \approx r/2$ should have good asymptotic properties for selecting the number of latent classes $i$. Tables 3 and 4 of Drton and Plummer (2017) show the results of simulation studies in support of this claim. The simulations are too extensive to reproduce in this vignette, but we step through a single example showing how the singular BIC can be recalculated for different values of $\phi$.

The `sBIC::LCAs()`

function creates an object of class "LCAs" representing latent class analysis models for a given number of items (`numVariables`

) taking a given number of states (`numStatesForVariables`

) up to a maximum number of latent states `maxNumClasses`

. The resulting object of class "LCAs" is combined with the data in the `sBIC()`

function, which fits the various latent class models and calculates the approximate singular BIC.

library(sBIC) lcas = LCAs(maxNumClasses=6, numVariables=8, numStatesForVariables=2) results = sBIC(X, lcas)

With $r=8$ items, we choose $\phi$ on the order of $r/2 = 4$. The default behaviour of `sBIC()`

is to use $\phi=(r+1)/2$, i.e. $\phi=4.5$ in this example. We investigate the behaviour of the singular BIC for values around $\phi=4$ by setting up a vector `phivec`

of alternative values for $\phi$.

phivec = (6:10)/2

The singular BIC can be recalculated by calling the `setPhi`

member function of `lcas`

to change the penalty and then calling `sBIC()`

with `NULL`

as the first argument. This recalculates the sBIC penalties without refitting the model to the data and so is much faster than the original call to `sBIC()`

. Note that this is possible because the "LCAs" class uses call by reference semantics. The object `lcas`

is altered by the initial call to `sBIC()`

and on exit contains information about the fitted models.

Figure 1 shows the effect of the different values of $\phi$ on $\overline{BIC}_\phi$.

plot(1:6, results$sBIC - max(results$sBIC), type="n", xlab="number of latent classes", ylab=expression(bar(SBIC)[phi])) for (i in seq_along(phivec)) { setPhi(lcas, phivec[i]) results = sBIC(NULL, lcas) lines(1:6, results$sBIC - max(results$sBIC), pch=i, lty=i, type="b") } legend("topleft", pch=1:6, lty=1:6, legend=paste("phi=", phivec))

Further insight into the properties of $\overline{sBIC}*\phi$ can be obtained by simulation studies. The fitBIC() function is a convenience wrapper that calculates $\overline{sBIC}*\phi$ for various values of $\phi$ and chooses the optimal model according to each criterion. The return value is a vector of chosen models, indexed by the number of classes. The first element is the model chosen by BIC and the other elements are the model chosen by $\overline{sBIC}_\phi$ for the values of $\phi$ in the vector

`phis`

.fitBIC <- function(X, maxNumClasses, phis) { lcas = LCAs(maxNumClasses, numVariables=ncol(X), numStatesForVariables=2) results = sBIC(X, lcas) nclass.bic <- which.max(results$BIC) nclass.sbic = numeric(length(phis)) for (k in seq_along(phis)) { lcas$setPhi(phis[k]) results = sBIC(NULL, lcas) nclass.sbic[k] = which.max(results$sBIC) } c(nclass.bic, nclass.sbic) }

The `createTable`

function calls `fitBIC`

on each element of `Xlist`

, a list of simulated LCA data sets and then produces a frequency table of the optimal model for BIC and for the various $\overline{sBIC}_\phi$ criteria. The function `parallel::parSapply()`

is used to improve the speed of the simulations on a multi-core processor. Some code is required to set up and shutdown the cluster used for the parallel calculations, but the core of the calculations is contained in two lines of code.

createTable = function(Xlist, ...) { ## Set up cluster cl <- makeCluster(detectCores() - 1) clusterEvalQ(cl, library(sBIC)) clusterEvalQ(cl, library(poLCA)) clusterSetRNGStream(cl) ## Fit LCA models to simulated data and tabulate bicResults = parSapply(cl, Xlist, fitBIC, ...) bicTab = apply(bicResults, 1, tabulate, nbin=maxNumClasses) dimnames(bicTab) = list(1:maxNumClasses, c("BIC", paste0("sBIC", phis))) ## End cluster stopCluster(cl) bicTab }

The code below simulates 100 data sets, each with a sample size of 50, and produces frequency tables of the optimal model for the different criteria.

library(parallel) set.seed(1234) Xlist = simLCA(alpha, Pr, sampleSize=50, nsim=100) bictab = createTable(Xlist, maxNumClasses=6, phis=phivec) knitr::kable(bictab, row.names=TRUE)

| | BIC| sBIC3| sBIC3.5| sBIC4| sBIC4.5| sBIC5| |:--|---:|-----:|-------:|-----:|-------:|-----:| |1 | 38| 0| 0| 0| 0| 0| |2 | 42| 0| 0| 0| 1| 3| |3 | 20| 3| 11| 16| 26| 34| |4 | 0| 55| 72| 75| 70| 61| |5 | 0| 38| 16| 9| 3| 2| |6 | 0| 4| 1| 0| 0| 0|

Table: Frequency distribution of the number of classes chosen by BIC and $\overline{sBIC}_\phi$ for values of $\phi$ around 4.

Table 1 shows the resulting frequency table. Regular BIC clearly underfits the model in these simulations. Similarly $\overline{sBIC}_5$ shows signs of underfitting while $\overline{sBIC}_3$ shows overfitting.

- Drew A. Linzer, Jeffrey B. Lewis (2011). poLCA: An R Package for Polytomous Variable Latent Class Analysis. Journal of Statistical Software, 42(10), 1-29. URL http://www.jstatsoft.org/v42/i10/.
- Drton M. and Plummer M. (2017), A Bayesian information criterion for singular models. J. R. Statist. Soc. B; 79: 1-38.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.