Sherman E. Lo, Queen Mary, University of London
A reimplementation of poLCA [CRAN, GitHub] in C++. It attempts to reproduce results and be as similar as possible to the original code, while running faster, especially with multiple repetitions, by utilising multiple threads.
The package poLCAParallel reimplements the poLCA fitting, standard error calculations, goodness of fit tests and the bootstrap log-likelihood ratio test in C++. This was done using Rcpp and RcppArmadillo which allows R to run fast C++ code. Additional notes include:
std::jthread
for multi-thread programming and
std::mutex to
prevent data racesstd::map for
the chi-squared calculations to improve performanceFurther reading is available on the QMUL ITS Research Blog.
poLCA is a software package for the estimation of latent class models and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment.
Latent class analysis (also known as latent structure analysis) can be used to identify clusters of similar "types" of individuals or observations from multivariate categorical data, estimating the characteristics of these latent groups, and returning the probability that each observation belongs to each group. These models are also helpful in investigating sources of confounding and nonindependence among a set of categorical variables, as well as for density estimation in cross-classification tables. Typical applications include the analysis of opinion surveys; rater agreement; lifestyle and consumer choice; and other social and behavioral phenomena.
The basic latent class model is a finite mixture model in which the component distributions are assumed to be multi-way cross-classification tables with all variables mutually independent. The model stratifies the observed data by a theoretical latent categorical variable, attempting to eliminate any spurious relationships between the observed variables. The latent class regression model makes it possible for the researcher to further estimate the effects of covariates (or "concomitant" variables) on predicting latent class membership.
poLCA uses expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the parameters of the latent class and latent class regression models.
The easiest way to install poLCAParallel is to use R with remotes.
Run the following in R to install the latest version
remotes::install_github("QMUL/poLCAParallel@package")
or for a previous version, for example,
remotes::install_github("QMUL/poLCAParallel@v1.2.4")
Download the .zip or .tar.gz file from the releases. Install it in R using
remotes::install_local(<PATH TO .zip OR .tar.gz FILE>)
Please consider citing the corresponding QMUL ITS Research Blog
and the publication below which this software was originally created for
model <- poLCAParallel::poLCA(), set the parameters
calc.se=FALSE and calc.chisq=FALSE to avoid doing standard error and
goodness of fit calculations respectively. This will save time if you do not
require those results. You can always calculate them afterwards using
model <- poLCAParallel::poLCAParallel.se(model) and
model <- poLCAParallel::poLCAParallel.goodnessfit(model).poLCAParallel::poLCA(), set nrep=1 to do a test run and gauge how long it
takes. Afterwards, set nrep to a bigger number to try different initial
values in parallel.poLCAParallel::poLCA(), set n.thread to set the number of
threads to be used by the computer. By default, it uses all detectable
threads.poLCAParallel::poLCA(), set se.smooth=TRUEpoLCAParallel::poLCAParallel.se(), set is_smooth=TRUEset.seed() before using poLCAParallel::poLCA() to set the seed for
random number generation. This ensures reproducibility when reporting what
seed you have used.R scripts which compare poLCAParallel with poLCA are provided in exec/.
An example use of a bootstrap likelihood ratio test is shown in exec/3_blrt.R.
poLCAParallel::poLCA(), the following arguments have been added:n.thread is provided to specify the number of threads to use.calc.chisq is provided to specify if you want to conduct goodness of fit
tests or not.se.smooth is provided if you wish to use Laplace smoothing on the response
probabilities in the standard error calculations.$prior.probs.start are the initial probabilities used to achieve the
maximum log-likelihood from any repetition rather than from the first
repetition.eflag is set to TRUE if any repetition has to be restarted,
rather than the repetition which achieves maximum log-likelihood.calc.se is set to FALSE even in
poLCA regression. Previously, the standard error was calculated regardless of
calc.se in poLCA regression.stop() rather than return a NULL.predcell.The following installation instructions are useful if you wish to develop the code and install a locally modified version of the package.
Requires the R packages for compiling and testing:
Requires the dependent R packages:
Git clone this repository
git clone https://github.com/QMUL/poLCAParallel.git
and change directory into it
cd poLCAParallel
From there, in the repository root, run the following to generate additional code and documentation so that the package can be compiled correctly
R -e "usethis::use_namespace()"
R -e "Rcpp::compileAttributes()"
R -e "roxygen2::roxygenize()"
Install the package using
R CMD INSTALL --preclean --no-multiarch .
The testing of the C++ code is done using
Catch2 and the R code using
testthat. All test codes are in tests/.
The tests for the C++ code are done by compiling the test code, isolated from any R ecosystem, and running a compiled executable. It requires cmake, Catch2 and armadillo. To compile the code, from the repository root, make a new directory and use cmake inside it
mkdir build
cd build
cmake ..
cmake --build .
This will compile an executable called test_polca_parallel. Execute it to run
the tests. Pass names or tags to run specific tests, see tests/*.cc.
To test the R code, run the following at the repository root
R -e "testthat::test_local()"
The package renv is used to record and manage R dependencies, with versions
pinned, for use during development, maintenance and testing. The file
renv.lock contains these dependencies. It shall be regularly updated during
maintenance. The lock file is also used in the Apptainer definition file
poLCAParallel-dev.def below to further reproduce the environment in a
container.
From the repository root, run the following commands to set up an R environment and install the dependencies, with the specified versions, used for development and testing
R -e "renv::init(bare=TRUE)"
R -e "renv::restore()"
Run R commands from the repository root to use these dependencies.
The lock file may need to be updated during maintenance. This can be done by
starting a fresh R environment, after ensuring the renv artifacts are deleted:
.Rprofilerenv.lockrenv/Then take a snapshot of the latest dependencies
R -e "renv::init()"
R -e "renv::snapshot(dev=TRUE)"
This will overwrite the file renv.lock specifying dependencies with the latest
versions.
Apptainer definition files are provided, which can be used to install the package inside a container. These may be useful for further troubleshooting or development.
poLCAParallel.def installs R and the package only. No
version pinningpoLCAParallel-dev.def installs the R package as well as
generating documentation and running tests within the container. Versions of
dependencies are pinned to the ones used during development or maintenanceTo build the container, use the command (or similar)
apptainer build poLCAParallel-dev.sif poLCAParallel-dev.def
Within the container, the package is located in /usr/src/poLCAParallel. When
using the definition file poLCAParallel-dev.def, the C++ doxygen documentation
is located in /usr/src/poLCAParallel/html.
All generated documents and codes, eg from
R -e "Rcpp::compileAttributes()"
and
R -e "roxygen2::roxygenize()"
shall not be included in the master branch. Instead, they shall be in the
package branch so that this package can be installed using
remotes::install_github("QMUL/poLCAParallel@package"). This is to avoid having
duplicate documentation and generated code on the master branch. The
exception to this rule is renv.lock which is produced by
renv::snapshot(dev=TRUE).
Semantic versioning is used and tagged. Tags on the master branch shall have
v prepended and -master appended, eg. v1.1.0-master. The corresponding
tag on the package branch shall only have v prepended, eg. v1.1.0.
85ee419, the multiplication starts from
DOUBLE_XMAX to avoid underflows but was reverted. Consider investigating
further in future releases.n.thread should be 1 instead of
parallel::detectCores()The R code should follow the Tidyverse style guide. In particular, variables, functions and parameters should be in snake case. This will result in
poLCA. and poLCAParallel. prefix in function and file namesna.rm should be called na_rmThe following R functions, many of which are internal, are marked as deprecated and should be deleted
poLCA.se() and poLCA.dLL2dBeta.C() - no longer needed because the standard
error calculations are reimplemented in poLCAParallel.se()poLCA.probHat.C - no longer needed because the goodness of fit test is
reimplemented in goodness_fit.ccpoLCA.postClass.C() and poLCA.ylik.C() - no longer needed and
reimplemented in polca_rcpp.ccpoLCA.vectorize() and poLCA.unvectorize() - no longer needed and
reimplemented in poLCAParallel.vectorize() and poLCAParallel.unvectorize()
respectivelyAll C code in poLCA.C is deprecated because they are reimplemented in C++.
The parameters:
results in poLCAParallel.goodnessfit()polca in poLCAParallel.se()should be renamed to lc to be consistent with other functions with a parameter
also named lc.
Similarly, the parameters model_null and model_alt in blrt() should be
renamed to lc_null and lc_alt respectively.
There was an attempt to use the Google C++ style guide.
The C++ code documentation can be created with Doxygen by running
doxygen
and viewed at html/index.html.
The software is under the GNU GPL 2.0 license, as with the original poLCA code, stated in their documentation.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.