# Model description for glca

In glca: An R Package for Multiple-Group Latent Class Analysis

Suppose that there are $G$ groups, and the $g$th group consists of $n_g$ observations for $g=1,\ldots,G$, and there are $M$ categorical manifest items, where the $m$th item has $r_m$ categories for $m=1,\ldots,M$. Let $\mathbf{Y}_{ig}=(Y_{ig1}, \dots, Y_{igM})^\top$ and ${\bf y}_{ig} = (y_{ig1}, \ldots, y_{igM})^\top$ denote a set of item variables and their responses given by the $i$th individual within the $g$th group, respectively. The number of possible response patterns of $\mathbf{Y}_{ig}$ is $\prod_{m = 1}^{M}r_m$, and it is likely that most of these response patterns are sparse. The multiple-group LCA assumes that associations among manifest items can be explained by the latent classifier $L_{ig}$, where $L_{ig}$ is the latent class variable having $C$ categories for the $i$th individual within the $g$th group. To reflect multiple-group data structures, we discuss two different LCA approaches, namely fixed-effect and random-effect LCA.

## Models

### Fixed-effect latent class analysis

The fixed-effect LCA can reflect group differences in latent structure by specifying a latent class regression (LCR) model for a given subgroup. We extend the simultaneous LCA (Clogg & Goodman, 1985) by incorporating logistic regression in the class prevalence and refer to it as multiple-group latent class regression (mgLCR). Let ${\bf x}_{ig}=(x_{ig1}, \ldots, x_{igp})^\top$ be a subject-specific $p\times 1$ vector of covariates for the $i$th individual within the $g$th group, either discrete or continuous. Then, the observed-data likelihood of mgLCR for the $i$th individual within the $g$th group can be specified as
\begin{eqnarray}
\mathcal{L}_{ig} &=& \sum_{c=1}^C P(\mathbf{Y}_{ig} = \mathbf{y}_{ig}, L_{ig} = c \mid {\bf x}_{ig}) \nonumber \\
&=& \sum_{c=1}^C \left[ P(L_{ig} = c \mid {\bf x}_{ig}) \prod_{m = 1}^{M} P(Y_{igm} = y_{igm} \mid L_{ig} = c) \right] \nonumber \\
&=& \sum_{c=1}^C \left[ \gamma_{c \mid g}({\bf x}_{ig}) \prod_{m = 1}^{M}\prod_{k = 1}^{r_m} \rho_{mk \mid cg}^{I(y_{igm} = k)} \right], (\#eq:likeli)
\end{eqnarray}
where $I(y_{igm}=k)$ is an indicator function that is equal to 1 when the response to the $m$th item from the $i$th individual within the $g$th group is $k$ and is otherwise equal to 0. The likelihood given in \@ref(eq:likeli) contains two types of parameters:

1. $\rho_{mk \mid cg}$ represents the probability of an individual within the $g$th group responding $k$ to the $m$th item given his or her latent class as $c$.
2. $\gamma_{c \mid g}({\bf x}_{ig})$ is the probability of the $i$th individual belonging to the latent class $c$ within the $g$th group, which could be influenced by the subject-specific covariates ${\bf x}_{ig}$.
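As a minimal sketch of how the two parameter types combine in the likelihood above, the following Python snippet evaluates $\mathcal{L}_{ig}$ for a single individual. It is an illustration only and not code from the glca package: the function name `individual_likelihood`, the 0-based response coding, and all numeric values are assumptions made for this example, and the class prevalences are passed in as already-evaluated numbers.

```python
from math import prod


def individual_likelihood(y, gamma, rho):
    """Observed-data likelihood L_ig for one individual (illustrative helper).

    y     : list of M responses, each coded 0..r_m-1 (0-based category index)
    gamma : list of C class prevalences gamma_{c|g}(x_ig), summing to 1
    rho   : rho[c][m][k] = P(item m = k | class c) in the individual's group
    """
    return sum(
        gamma[c] * prod(rho[c][m][y[m]] for m in range(len(y)))
        for c in range(len(gamma))
    )


# Toy example: C = 2 classes, M = 2 binary items (all numbers are made up).
gamma = [0.6, 0.4]
rho = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 1: item 1, item 2
    [[0.3, 0.7], [0.2, 0.8]],  # class 2: item 1, item 2
]
lik = individual_likelihood([0, 1], gamma, rho)
```

Because the indicator exponent selects exactly one category per item, the double product over $m$ and $k$ reduces to a single product over items, which is what the inner `prod` computes.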

The $\rho$-parameter is the measurement parameter in mgLCR (i.e., item-response probability), describing the tendency of individuals in latent class $c$ to respond to the $m$th item for $m = 1,\ldots,M$. Comparing estimated item-response probabilities across groups is a valuable strategy for assessing measurement invariance because these probabilities solely determine the meaning of the latent classes. By comparing the fit of a model with the parameters held constant across groups (i.e., $\rho_{mk \mid c} = \rho_{mk \mid c1} = \dots = \rho_{mk \mid cG}$ for $k=1,\ldots,r_m$, $m=1,\ldots,M$, and $c=1,\ldots,C$) against an alternative model with freely varying parameters, we obtain evidence on whether measurement invariance across groups can be assumed. As given in \@ref(eq:likeli), the subject-specific covariates ${\bf x}_{ig}$ may influence the probability of the individual belonging to a specific class in the form of logistic regression as
\begin{eqnarray}
\gamma_{c \mid g}(\mathbf{x}_{ig}) = P(L_{ig} = c \mid {\bf x}_{ig}) = \frac{\exp(\alpha_{c \mid g}+{\bf x}_{ig}^\top \boldsymbol{\beta}_{c \mid g})}{\sum_{c'=1}^C\exp(\alpha_{c' \mid g}+{\bf x}_{ig}^\top \boldsymbol{\beta}_{c' \mid g})}, (\#eq:mgLCA-reg)
\end{eqnarray}
where the coefficient vector $\boldsymbol{\beta}_{c \mid g} = (\beta_{1c \mid g}, \ldots, \beta_{pc \mid g})^\top$ can be interpreted as the expected change, per unit increase in the corresponding covariate, in the log odds of belonging to class $c$ versus belonging to the referent class $C$ (i.e., $\alpha_{C \mid g}=0$ and $\boldsymbol{\beta}_{C \mid g} = {\bf 0}$ for $g=1, \ldots, G$). Then, the observed log-likelihood function for the mgLCR model can be specified as
\begin{eqnarray}
\ell_{mgLCR} = \sum_{g = 1}^{G}\sum_{i = 1}^{n_{g}} \log \mathcal{L}_{ig}. (\#eq:loglik-mgLCA)
\end{eqnarray}
It should be noted that, similar to item-response probabilities, coefficients of logistic regression can be constrained to be equal across subgroups (i.e., $\boldsymbol{\beta}_{c} = \boldsymbol{\beta}_{c \mid 1}= \cdots = \boldsymbol{\beta}_{c \mid G}$ for $c=1, \ldots,C$) to test whether the effects of covariates are identical across groups.
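The logistic form of the class prevalence can be sketched in a few lines of Python. This is an illustrative reading of the formula for a single group, not the package's implementation: the function name `class_prevalence`, the argument layout, and the coefficient values are assumptions, and the last class is treated as the referent with its intercept and coefficients fixed at zero.

```python
from math import exp, log


def class_prevalence(x, alpha, beta):
    """gamma_{c|g}(x) for c = 1..C, with class C as the referent (illustrative).

    x     : list of p covariate values for one individual
    alpha : list of C-1 intercepts (the referent intercept is fixed at 0)
    beta  : list of C-1 coefficient vectors, each of length p
    """
    linear = [a + sum(xj * bj for xj, bj in zip(x, b)) for a, b in zip(alpha, beta)]
    linear.append(0.0)  # referent class C: alpha = 0 and beta = 0
    shift = max(linear)  # subtract the maximum for numerical stability
    weights = [exp(v - shift) for v in linear]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: C = 3 classes, p = 1 covariate (all numbers are made up).
gam = class_prevalence([1.0], alpha=[0.5, -0.2], beta=[[1.0], [0.3]])

# Log odds of class 1 versus the referent class equal alpha_1 + x * beta_1.
log_odds = log(gam[0] / gam[2])
```

The last line illustrates the interpretation given above: the log odds of a class against the referent class recover the linear predictor of that class.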

### Random-effect latent class analysis

The random-effect LCA accounts for group variation in the latent class prevalence using random coefficients, for example,
\[
P(L_{ig} = c) = \frac{\exp(\alpha_{c} + \sigma_{c} \lambda_{g})}{\sum_{c' = 1}^{C}\exp(\alpha_{c'} + \sigma_{c'} \lambda_{g})},
\]
where $\boldsymbol{\lambda}=(\lambda_1, \ldots, \lambda_G)^\top$ represents group variation in the class prevalence. In the parametric random-effect LCA, the random coefficients are assumed to follow a parametric distribution such as the standard normal distribution. The nonparametric approach, in contrast, assumes no specific distribution; it only assumes that the random coefficients follow a probability mass function with a finite number of mass points. In other words, the nonparametric approach employs a categorical level-2 latent variable (i.e., latent cluster) $U_{g}$ whose probability mass function is $P(U_g=w) = \delta_w$ for $w=1,\ldots,W$. Using the classification mechanics of LCA, the latent cluster membership of level-2 units can be identified by a small number of representative patterns of class prevalences in multiple groups. Therefore, the meaning of the $w$th level of the latent cluster variable is determined by the prevalence of latent classes $P(L_{ig} = c \mid U_{g} = w)$ for $c=1,\ldots,C$. Treating the latent cluster variable as a group variable, the nonparametric approach provides more meaningful interpretations in group comparison than the parametric approach; we can examine whether the latent class structure differs across latent cluster memberships. Therefore, we focus on the nonparametric random-effect LCA, hereafter referred to as nonparametric latent class regression (npLCR).

The observed-data likelihood of npLCR for the $g$th group can be expressed by
\begin{eqnarray}
\mathcal{L}_{g} &=& \sum_{w = 1}^{W} P(U_g = w) \prod_{i=1}^{n_g} \left\{\sum_{c = 1}^{C} P(\mathbf{Y}_{ig}=\mathbf{y}_{ig}, L_{ig} = c \mid U_{g} = w, \mathbf{x}_{ig}, \mathbf{z}_{g})\right\} \nonumber \\
&=& \sum_{w = 1}^{W} P(U_g = w) \prod_{i=1}^{n_g} \left\{\sum_{c = 1}^{C} P(L_{ig} = c \mid U_{g} = w, \mathbf{x}_{ig}, \mathbf{z}_{g}) \prod_{m=1}^M P(Y_{igm}=y_{igm} \mid L_{ig} = c ) \right\} \nonumber \\
&=& \sum_{w = 1}^{W} \delta_{w} \prod_{i=1}^{n_g} \left\{ \sum_{c = 1}^{C} \gamma_{c \mid w}(\mathbf{x}_{ig}, \mathbf{z}_{g}) \prod_{m = 1}^{M}\prod_{k = 1}^{r_m}\rho_{mk \mid c}^{I(y_{igm} = k)}\right\}, (\#eq:grouploglik-mLCA)
\end{eqnarray}
where ${\bf x}_{ig} = (x_{ig1}, \ldots, x_{igp})^\top$ and ${\bf z}_g=(z_{g1}, \ldots, z_{gq})^\top$ denote vectors of subject-specific (i.e., level-1) and group-specific (i.e., level-2) covariates for $i=1, \ldots, n_g$ and $g=1,\ldots,G$, respectively. The likelihood given in \@ref(eq:grouploglik-mLCA) contains three types of parameters:

1. $\rho_{mk \mid c}$ represents the probability of an individual responding $k$ to the $m$th item given his or her latent class as $c$.
2. $\gamma_{c \mid w}({\bf x}_{ig}, {\bf z}_g)$ is the probability of the $i$th individual within the $g$th group belonging to the latent class $c$ given the latent cluster $w$, which could be influenced by the subject-specific covariates ${\bf x}_{ig}$ and the group-specific covariates ${\bf z}_g$.
3. $\delta_w$ represents the latent cluster prevalence for $w=1,\ldots,W$.
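To make the nesting of sums and products in the npLCR likelihood concrete, here is a minimal Python sketch for one group. It assumes a model without covariates, so that $\gamma_{c \mid w}$ is a fixed table; the function name `group_likelihood`, the 0-based response coding, and all numeric values are hypothetical and chosen only for illustration.

```python
from math import prod


def group_likelihood(ys, delta, gamma, rho):
    """Observed-data likelihood L_g of one group under npLCR (illustrative).

    ys    : list of response vectors, one per individual in the group (0-based)
    delta : delta[w] = latent cluster prevalence P(U_g = w)
    gamma : gamma[w][c] = class prevalence given cluster w (no covariates)
    rho   : rho[c][m][k] = item-response probability, shared across clusters
    """
    total = 0.0
    for w, d in enumerate(delta):
        # Likelihood of each individual given that the group is in cluster w.
        per_person = [
            sum(
                gamma[w][c] * prod(rho[c][m][y[m]] for m in range(len(y)))
                for c in range(len(rho))
            )
            for y in ys
        ]
        # Individuals are independent given the cluster, so multiply them.
        total += d * prod(per_person)
    return total


# Toy example: W = 2 clusters, C = 2 classes, M = 2 binary items, n_g = 2.
delta = [0.5, 0.5]
gamma = [[0.6, 0.4], [0.2, 0.8]]
rho = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 1
    [[0.3, 0.7], [0.2, 0.8]],  # class 2
]
L_g = group_likelihood([[0, 1], [1, 1]], delta, gamma, rho)
```

The sum over clusters sits outside the product over individuals because all members of a group share the same latent cluster, which is what distinguishes this likelihood from the individual-level mgLCR likelihood.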

The class prevalence can be modeled using the logistic regression as
\begin{eqnarray}
(\#eq:MLCA-reg)
\gamma_{c \mid w}(\mathbf{x}_{ig}, \mathbf{z}_{g}) = P(L_{ig} = c \mid U_{g} = w, \mathbf{x}_{ig}, \mathbf{z}_{g}) = \frac{\exp(\alpha_{c \mid w} + \mathbf{x}^{\top}_{ig}\boldsymbol{\beta}_{1c \mid w} + \mathbf{z}^{\top}_{g}\boldsymbol{\beta}_{2c})} {\sum_{c' = 1}^{C} \exp(\alpha_{c' \mid w} + \mathbf{x}^{\top}_{ig}\boldsymbol{\beta}_{1c' \mid w} + \mathbf{z}^{\top}_{g}\boldsymbol{\beta}_{2c'})},
\end{eqnarray}
where vectors $\boldsymbol{\beta}_{1c \mid w} = (\beta_{11c \mid w}, \ldots, \beta_{1pc \mid w})^\top$ and $\boldsymbol{\beta}_{2c} = (\beta_{21c}, \ldots, \beta_{2qc})^\top$ are logistic regression coefficients for level-1 and level-2 covariates, respectively. Then, the observed log-likelihood of npLCR is specified as
\begin{eqnarray}
(\#eq:loglik-mLCA)
\ell_{npLCR} = \sum_{g = 1}^{G} \log \mathcal{L}_{g}.
\end{eqnarray}

Note that coefficients for level-1 covariates depend on both latent classes and clusters, while coefficients for level-2 covariates depend only on latent class membership. We may refer to the model \@ref(eq:MLCA-reg) as the random slope model because coefficients for level-1 covariates differ across latent clusters. The coefficients for level-1 covariates can be constrained to be equal across clusters (i.e., $\boldsymbol{\beta}_{1c} = \boldsymbol{\beta}_{1c \mid 1} = \cdots = \boldsymbol{\beta}_{1c \mid W}$ for $c=1, \ldots,C$) to test whether the effects of level-1 covariates are identical across all latent cluster memberships. It should also be noted that measurement invariance is assumed across latent cluster memberships in npLCR (i.e., $\rho_{mk \mid c} = \rho_{mk \mid c1} = \dots = \rho_{mk \mid cW}$ for $k=1,\ldots,r_m$, $m=1,\ldots,M$, and $c=1,\ldots,C$). Without this assumption, the item-response probabilities could vary across latent cluster memberships, meaning that the latent class structure itself would differ between latent clusters. It would then no longer make sense to use latent class prevalences as identifiers for the latent cluster membership.

## Estimation

### Fixed-effect latent class analysis

The package glca finds the maximum-likelihood (ML) estimates for mgLCR and npLCR using the expectation-maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm iterates between two steps, the expectation step (E-step) and the maximization step (M-step), to find the solution maximizing the log-likelihood functions given in \@ref(eq:loglik-mgLCA) and \@ref(eq:loglik-mLCA).

For mgLCR, the E-step computes the posterior probabilities
\begin{eqnarray}
(\#eq:post)
\theta_{ig(c)} = P(L_{ig} = c \mid \mathbf{Y}_{ig} = \mathbf{y}_{ig}, \mathbf{x}_{ig}) = \frac{\gamma_{c \mid g}(\mathbf{x}_{ig}) \prod_{m = 1}^{M}\prod_{k = 1}^{r_m} \rho_{mk \mid cg}^{I(y_{igm} = k)}}{\sum_{c' = 1}^{C} \gamma_{c' \mid g}(\mathbf{x}_{ig}) \prod_{m = 1}^{M}\prod_{k = 1}^{r_m} \rho_{mk \mid c'g}^{I(y_{igm} = k)}}
\end{eqnarray}
with current estimates for $i=1, \ldots, n_g$, $g=1,\ldots,G$, and $c=1,\ldots,C$. The M-step maximizes the complete-data likelihood (i.e., the likelihood for the cross-classification by $L_{ig}$ and $\mathbf{y}_{ig}$) with respect to $\beta$- and $\rho$-parameters. In particular, when all values of $\theta_{ig(c)}$ are known, updated estimates for $\beta$-parameters can be calculated by the Newton-Raphson algorithm for multinomial logistic regression given in \@ref(eq:mgLCA-reg), provided that the computational routine allows fractional responses rather than integer counts (Bandeen-Roche et al., 1997). Therefore, the package glca conducts one cycle of the Newton-Raphson algorithm to update $\beta$-parameters at every iteration in the M-step. If there is no covariate in the model, the class prevalence can be updated directly without estimating $\beta$-parameters as $\hat{\gamma}_{c \mid g} = P(L_{ig}=c) = \sum_{i=1}^{n_g} \theta_{ig(c)}/n_g$ for $c=1,\ldots,C$ and $g=1,\ldots,G$. The item-response probabilities $\rho_{mk \mid cg}$ can be interpreted as parameters in a multinomial distribution when $\theta_{ig(c)}$ is known, so we have
\[
\hat{\rho}_{mk \mid cg} = \frac{\sum_{i=1}^{n_g} \theta_{ig(c)}I(y_{igm} = k)}{\sum_{i=1}^{n_g} \theta_{ig(c)}}
\]
for $k=1, \ldots, r_m$, $m=1,\ldots,M$, $c=1,\ldots,C$, and $g=1,\ldots,G$.
Under the measurement invariance assumption (i.e., $\rho_{mk \mid c} = \rho_{mk \mid c1} = \dots = \rho_{mk \mid cG}$), the $\rho$-parameter will be updated as
\[
\hat{\rho}_{mk \mid c} = \frac{\sum_{g = 1}^{G} \sum_{i = 1}^{n_g} \theta_{ig(c)}I(y_{igm} = k)}{\sum_{g = 1}^{G} \sum_{i = 1}^{n_g} \theta_{ig(c)}}
\]
for $k=1, \ldots, r_m$, $m=1,\ldots,M$, and $c=1,\ldots,C$.
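The E-step and the closed-form M-step updates can be sketched as follows for the simplest case: mgLCR without covariates and with group-specific item-response probabilities. This is a minimal illustration in plain Python, not the routine used by glca; the function name `em_step`, the nested-list data layout, the starting values, and the toy data are all assumptions made for this example, and the Newton-Raphson update for $\beta$-parameters is not covered.

```python
from math import log, prod


def em_step(data, gamma, rho):
    """One EM iteration for mgLCR without covariates (illustrative sketch).

    data[g][i][m]   : response (0-based) of individual i in group g to item m
    gamma[g][c]     : class prevalence in group g
    rho[g][c][m][k] : item-response probability in group g
    Returns (new_gamma, new_rho, log-likelihood at the input parameters).
    """
    new_gamma, new_rho, loglik = [], [], 0.0
    for g, ys in enumerate(data):
        C, M = len(gamma[g]), len(ys[0])
        # E-step: posterior probabilities theta[i][c] under current estimates.
        theta = []
        for y in ys:
            joint = [
                gamma[g][c] * prod(rho[g][c][m][y[m]] for m in range(M))
                for c in range(C)
            ]
            lik = sum(joint)
            loglik += log(lik)
            theta.append([j / lik for j in joint])
        # M-step: weighted proportions, with theta as fractional class counts.
        n_c = [sum(t[c] for t in theta) for c in range(C)]
        new_gamma.append([n / len(ys) for n in n_c])
        new_rho.append([
            [
                [
                    sum(t[c] for t, y in zip(theta, ys) if y[m] == k) / n_c[c]
                    for k in range(len(rho[g][c][m]))
                ]
                for m in range(M)
            ]
            for c in range(C)
        ])
    return new_gamma, new_rho, loglik


# Toy data: G = 2 groups, M = 3 binary items (all values are made up).
data = [
    [[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]],
    [[0, 1, 0], [1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1]],
]
gamma = [[0.5, 0.5], [0.5, 0.5]]
start = [[[0.7, 0.3]] * 3, [[0.4, 0.6]] * 3]  # rho[c][m][k] for C = 2 classes
rho = [start, start]

history = []
for _ in range(20):
    gamma, rho, ll = em_step(data, gamma, rho)
    history.append(ll)
```

Each entry of `history` is the log-likelihood evaluated before the corresponding update, so the sequence should be non-decreasing, which is the defining property of EM. Pooling the numerators and denominators of the $\rho$ update over groups before dividing would give the measurement-invariant variant.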


glca documentation built on Nov. 3, 2021, 1:09 a.m.