| poLCA | R Documentation |
Latent class analysis, also known as latent structure analysis, is a
technique for the analysis of clustering among observations in multi-way
tables of qualitative/categorical variables. The central idea is to fit a
model in which any confounding between the manifest variables can be
explained by a single unobserved "latent" categorical variable. poLCA uses
the assumption of local independence to estimate a mixture model of latent
multi-way tables, the number of which (nclass) is specified by the user.
Estimated parameters include the class-conditional response probabilities for
each manifest variable, the "mixing" proportions denoting population share of
observations corresponding to each latent multi-way table, and coefficients
on any class-predictor covariates, if specified in the model.
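The local independence assumption described above can be made concrete with a small base-R sketch. The parameter values, the variable names A, B and C, and the helper pattern_prob() are invented for illustration and are not part of the package; the sketch only shows how the probability of one response pattern is a mixture, over classes, of products of class-conditional response probabilities.

```r
# Hypothetical parameters: 2 latent classes, 3 manifest variables.
p <- c(0.6, 0.4)  # mixing proportions, one per latent class

# One matrix per manifest variable: rows = classes, columns = outcomes.
probs <- list(
  A = rbind(c(0.9, 0.1), c(0.2, 0.8)),
  B = rbind(c(0.7, 0.3), c(0.3, 0.7)),
  C = rbind(c(0.8, 0.1, 0.1), c(0.1, 0.3, 0.6))
)

# Probability of observing response pattern y (outcomes coded 1, 2, ...).
pattern_prob <- function(y, p, probs) {
  # Within a class, responses are independent, so their probabilities multiply.
  cond <- vapply(seq_along(p), function(r) {
    prod(mapply(function(m, k) m[r, k], probs, y))
  }, numeric(1))
  # Mix the class-conditional probabilities by the class shares.
  sum(p * cond)
}

pattern_prob(c(1, 2, 3), p, probs)
```

Summing pattern_prob() over every possible response pattern gives 1, which is a quick check that a set of parameters is coherent.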
poLCA(
formula,
data,
nclass = 2,
maxiter = 1000,
graphs = FALSE,
tol = 1e-10,
na.rm = TRUE,
probs.start = NULL,
nrep = 1,
verbose = TRUE,
calc.se = TRUE,
calc.chisq = TRUE,
n.thread = parallel::detectCores(),
se.smooth = FALSE
)
formula |
A formula expression of the form |
data |
A data frame containing variables in |
nclass |
The number of latent classes to assume in the model. Setting
|
maxiter |
The maximum number of iterations through which the estimation algorithm will cycle. |
graphs |
Logical, for whether |
tol |
A tolerance value for judging when convergence has been reached.
When the one-iteration change in the estimated log-likelihood is less than
|
na.rm |
Logical, for how |
probs.start |
A list of matrices of class-conditional response
probabilities to be used as the starting values for the estimation algorithm.
Each matrix in the list corresponds to one manifest variable, with one row
for each latent class, and one column for each outcome. The default is
|
nrep |
Number of times to estimate the model, using different values of
|
verbose |
Logical, indicating whether |
calc.se |
Logical, indicating whether |
calc.chisq |
Logical, indicating whether to calculate the goodness of
fit statistics, the chi squared statistics and the log likelihood ratio. The
default is |
n.thread |
Integer, the number of threads to use. Each thread processes a repetition. By default, all detectable threads are used. |
se.smooth |
Logical, experimental. When calculating the standard errors, whether to smooth the outcome probabilities to produce more numerically stable results at the cost of bias. |
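The structure expected by probs.start (a list with one matrix per manifest variable, one row per latent class and one column per outcome) might be built as in the following sketch. The variable names, the outcome counts and the object name probs_start are assumptions made for illustration; only base R is used.

```r
# Hypothetical setup: 3 latent classes and four manifest variables
# with 2, 2, 3 and 4 possible outcomes respectively.
nclass <- 3
n_outcomes <- c(A = 2, B = 2, C = 3, D = 4)

set.seed(42)
probs_start <- lapply(n_outcomes, function(k) {
  m <- matrix(runif(nclass * k), nrow = nclass, ncol = k)
  m / rowSums(m)  # normalise so each row is a probability distribution
})

# Each element is an nclass-by-outcomes matrix whose rows sum to one.
str(probs_start)
```

A list of this shape would then be supplied as the probs.start argument, with the manifest variables in the same order as in the formula.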
Model specification: Latent class models have more than one manifest
variable, so the response variables are cbind(dv1,dv2,dv3...) where dv#
refer to variable names in the data frame. For models with no covariates, the
formula is cbind(dv1,dv2,dv3)~1. For models with covariates, replace the
~1 with the desired function of predictors iv1,iv2,iv3... as, for
example, cbind(dv1,dv2,dv3)~iv1+iv2*iv3.
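As a quick illustration of the two formula shapes above, the following base-R sketch builds both and inspects them without any data. The names dv1, dv2, dv3, iv1, iv2 and iv3 are the placeholders from the text, not real variables.

```r
# Basic latent class model: manifest variables only, no covariates.
f_basic <- cbind(dv1, dv2, dv3) ~ 1

# Latent class regression: covariates predict class membership.
f_cov <- cbind(dv1, dv2, dv3) ~ iv1 + iv2 * iv3

# Manifest variables are taken from the left-hand side of the formula.
manifest <- all.vars(f_cov[[2]])

# The right-hand side expands iv2 * iv3 into main effects plus interaction.
covariate_terms <- attr(terms(f_cov), "term.labels")

manifest
covariate_terms
```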
poLCA treats all manifest variables as qualitative/categorical/nominal
– NOT as ordinal.
The implementation of this function in the package poLCAParallel is rewritten in C++. Multiple threads are used, with each thread processing one set of initial values (one repetition).
Notes:
poLCA uses EM and Newton-Raphson algorithms to maximize the latent class
model log-likelihood function. Depending on the starting parameters, this
algorithm may only locate a local, rather than global, maximum. This becomes
more and more of a problem as nclass increases. It is therefore highly
advisable to run poLCA multiple times until you are relatively certain that
you have located the global maximum log-likelihood. As long as
probs.start=NULL, each function call will use different (random) initial
starting parameters. Alternatively, setting nrep to a value greater than
one enables the user to estimate the latent class model multiple times with a
single call to poLCA, thus conducting the search for the global maximizer
automatically.
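The reason for repeated estimation can be illustrated with a toy example. The sketch below is not the algorithm used by poLCA or poLCAParallel: it is a minimal EM for binary items only, written in base R, run on simulated data, with the hypothetical helper lca_em() and items coded 0/1 for brevity. It fits the same model from several random starts, collects the maximum log-likelihood of each attempt, and keeps the best one, which is the idea behind nrep.

```r
# Toy EM for a latent class model with binary (0/1) items. Illustration only.
lca_em <- function(y, nclass, maxiter = 500, tol = 1e-8) {
  J <- ncol(y)
  p <- rep(1 / nclass, nclass)                          # mixing proportions
  theta <- matrix(runif(nclass * J, 0.1, 0.9), nclass)  # P(item = 1 | class)
  llik_old <- -Inf
  for (iter in seq_len(maxiter)) {
    # E-step: log P(case, class) under local independence.
    logjoint <- y %*% t(log(theta)) + (1 - y) %*% t(log(1 - theta))
    logjoint <- sweep(logjoint, 2, log(p), "+")
    m <- apply(logjoint, 1, max)          # row maxima, for numerical stability
    lik <- exp(logjoint - m)
    llik <- sum(m + log(rowSums(lik)))    # observed-data log-likelihood
    post <- lik / rowSums(lik)            # posterior class probabilities
    if (llik - llik_old < tol) break
    llik_old <- llik
    # M-step: update class shares and response probabilities.
    p <- colMeans(post)
    theta <- t(post) %*% y / colSums(post)
    theta <- pmin(pmax(theta, 1e-6), 1 - 1e-6)  # keep away from 0 and 1
  }
  list(llik = llik, p = p, theta = theta, numiter = iter)
}

# Simulated data: two true classes, four binary items.
set.seed(1)
n <- 500
cls <- rbinom(n, 1, 0.4) + 1
true_theta <- rbind(c(0.9, 0.8, 0.85, 0.9), c(0.2, 0.1, 0.3, 0.15))
y <- matrix(rbinom(n * 4, 1, true_theta[cls, ]), n, 4)

# Ten random starts of a three-class model; keep the best attempt.
fits <- lapply(1:10, function(i) lca_em(y, nclass = 3))
attempts <- vapply(fits, function(f) f$llik, numeric(1))
best <- fits[[which.max(attempts)]]
round(attempts, 3)
```

If the values in attempts differ, some starts stopped at a local maximum; the more often the largest value recurs, the more plausible it is that it is the global maximum.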
The term "Latent class regression" (LCR) can have two meanings. In this
package, LCR models refer to latent class models in which the probability of
class membership is predicted by one or more covariates. However, in other
contexts, LCR is also used to refer to regression models in which the
manifest variable is partitioned into some specified number of latent classes
as part of estimating the regression model. It is a way to simultaneously fit
more than one regression to the data when the latent data partition is
unknown. The flexmix function in package flexmix will estimate this other
type of LCR model. Because of these terminology issues, the LCR models this
package estimates are sometimes termed "latent class models with covariates"
or "concomitant-variable latent class analysis," both of which are accurate
descriptions of this model.
The package poLCAParallel reimplements the poLCA fitting, standard error calculations, goodness of fit tests and the bootstrap log-likelihood ratio test in C++. This was done using Rcpp and RcppArmadillo, which allow R to call fast C++ code. Additional notes include:
The API remains the same as the original poLCA with a few additions
It tries to reproduce results from the original poLCA
The code uses Armadillo for linear algebra
Multiple repetitions are done in parallel using std::jthread for multi-thread programming and std::mutex to prevent data races
Direct inversion of matrices is avoided to improve numerical stability and performance
Response probabilities are reordered to increase cache efficiency
Use of std::map for the chi-squared calculations to improve performance
Further reading is available on the QMUL ITS Research Blog.
References:
Agresti, Alan. 2002. Categorical Data Analysis, second edition. Hoboken: John Wiley & Sons.
Bandeen-Roche, Karen, Diana L. Miglioretti, Scott L. Zeger, and Paul J. Rathouz. 1997. "Latent Variable Regression for Multiple Discrete Outcomes." Journal of the American Statistical Association. 92(440): 1375-1386.
Hagenaars, Jacques A. and Allan L. McCutcheon, eds. 2002. Applied Latent Class Analysis. Cambridge: Cambridge University Press.
McLachlan, Geoffrey J. and Thriyambakam Krishnan. 1997. The EM Algorithm and Extensions. New York: John Wiley & Sons.
an object of class poLCA; a list containing the following elements:
y: data frame of manifest variables.
x: data frame of covariates, if specified.
N: number of cases used in model.
Nobs: number of fully observed cases (less than or equal to N).
probs: estimated class-conditional response probabilities.
probs.se: standard errors of estimated class-conditional response
probabilities, in the same format as probs.
P: sizes of each latent class; equal to the mixing proportions in the
basic latent class model, or the mean of the priors in the latent
class regression model.
P.se: the standard errors of the estimated P.
prior: matrix of prior class membership probabilities.
posterior: matrix of posterior class membership probabilities; also see
function poLCA.posterior.
predclass: vector of predicted class memberships, by modal assignment.
predcell: table of observed versus predicted cell counts for cases with
no missing values; also see functions poLCA.table and poLCA.predcell.
llik: maximum value of the log-likelihood.
numiter: number of iterations until reaching convergence.
maxiter: maximum number of iterations through which the estimation
algorithm was set to run.
coeff: multinomial logit coefficient estimates on covariates (when
estimated). coeff is a matrix with nclass-1 columns, and one row for
each covariate. All logit coefficients are calculated for classes with
respect to class 1.
coeff.se: standard errors of coefficient estimates on covariates (when
estimated), in the same format as coeff.
coeff.V: covariance matrix of coefficient estimates on covariates (when
estimated).
aic: Akaike Information Criterion.
bic: Bayesian Information Criterion.
Gsq: Likelihood ratio/deviance statistic.
Chisq: Pearson Chi-square goodness of fit statistic for fitted vs.
observed multiway tables.
time: length of time it took to run the model.
npar: number of degrees of freedom used by the model (estimated
parameters).
resid.df: number of residual degrees of freedom.
attempts: a vector containing the maximum log-likelihood values found in
each of the nrep attempts to fit the model.
eflag: Logical, error flag. TRUE if estimation algorithm needed to
automatically restart with new initial parameters. A restart is caused in
the event of computational/rounding errors that result in nonsensical
parameter estimates.
probs.start: A list of matrices containing the class-conditional response
probabilities used as starting values in the estimation algorithm. If the
algorithm needed to restart (see eflag), then this contains the starting
values used for the final, successful, run.
probs.start.ok: Logical. FALSE if probs.start was incorrectly
specified by the user, otherwise TRUE.
call: function call to poLCA.
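The relationship between llik, npar, N, aic and bic can be sketched in base R as follows. This assumes the usual definitions of the two criteria and the standard parameter count for a model without covariates; the helpers lca_npar() and lca_ic() are hypothetical and not part of the package. The usage lines take the two-class log-likelihood quoted in the examples for the values data and assume that data set has four dichotomous items and 216 cases.

```r
# Number of estimated parameters in a basic model (no covariates):
# (outcomes - 1) free response probabilities per variable and class,
# plus (nclass - 1) free mixing proportions.
lca_npar <- function(nclass, n_outcomes) {
  nclass * sum(n_outcomes - 1) + (nclass - 1)
}

# Information criteria from log-likelihood, parameter count and sample size.
lca_ic <- function(llik, npar, N) {
  c(aic = -2 * llik + 2 * npar,
    bic = -2 * llik + npar * log(N))
}

# Assumed: four dichotomous manifest variables, N = 216, two classes.
npar <- lca_npar(nclass = 2, n_outcomes = c(2, 2, 2, 2))
ic <- lca_ic(llik = -504.4677, npar = npar, N = 216)
ic
```

Lower values of either criterion indicate a better trade-off between fit and the number of parameters when comparing models with different nclass.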
##
## Three models without covariates:
## M0: Loglinear independence model.
## M1: Two-class latent class model.
## M2: Three-class latent class model.
##
data(values)
f <- cbind(A, B, C, D)~1
M0 <- poLCA(f, values, nclass = 1) # log-likelihood: -543.6498
M1 <- poLCA(f, values, nclass = 2) # log-likelihood: -504.4677
# log-likelihood: -503.3011
M2 <- poLCA(f, values, nclass = 3, maxiter = 8000)
##
## Three-class model with a single covariate.
##
data(election)
f2a <- cbind(
  MORALG, CARESG, KNOWG, LEADG, DISHONG, INTELG,
  MORALB, CARESB, KNOWB, LEADB, DISHONB, INTELB
) ~ PARTY
# log-likelihood: -16222.32
nes2a <- poLCA(f2a, election, nclass = 3, nrep = 5)
# Design matrix: intercept plus party ID values 1 to 7
pidmat <- cbind(1, 1:7)
# Exponentiated linear predictors for classes 2 and 3 (class 1 is the reference)
exb <- exp(pidmat %*% nes2a$coeff)
matplot(1:7, cbind(1, exb) / (1 + rowSums(exb)),
  ylim = c(0, 1), type = "l",
  main = "Party ID as a predictor of candidate affinity class",
  xlab = "Party ID: strong Democratic (1) to strong Republican (7)",
  ylab = "Probability of latent class membership", lwd = 2, col = 1
)
text(5.9, 0.35, "Other")
text(5.4, 0.7, "Bush affinity")
text(1.8, 0.6, "Gore affinity")