REMLA Tutorial
In REMLA: Robust Expectation-Maximization Estimation for Latent Variable Models

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=6, fig.height=4
)

library(REMLA)

Introduction

This vignette introduces the use of the robust expectation maximization (REM) algorithm; a statistical method for estimating the parameters of a latent variable model in the presence of population heterogeneity as suggested by Nieser & Cochran (2023). The REM algorithm is based on the expectation-maximization (EM) algorithm, but it allows for the case when not all the data are generated by the assumed data generating model.

Data Description

To illustrate the practical application of the REM (Robust Expectation Maximization) package for both exploratory and confirmatory Factor Analysis, we have chosen to employ it in the context of the Holzinger and Swineford dataset publicly available in the lavaan package.

The Holzinger and Swineford dataset consist of mental ability test scores of seventh and eighth grade children from two different schools. Originally, this dataset consisted of 26 test scores but this modified version in the lavaan package consist of 15 variables, 9 of which are test scores.

Covariates
- id: student's Identification number.
- sex: Gender (1 = male, 2 = female).
- ageyr: Age year part.
- agemo: Age month part.
- school: School (Pasteur or Grant-White).
- grade: Grade (7 = 7th grade, 8 = 8th grade).
- x1: Visual perception.
- x2: Cubes.
- x3: Lozenges.
- x4: Paragraph comprehension.
- x5: Sentence completion.
- x6: Word meaning.
- x7: Speeded addition.
- x8: Speeded counting of dots.
- x9: Speeded discrimination straight and curved capitals.

library(lavaan)

df <- HolzingerSwineford1939
head(df)

We limit the data to the columns that will be used in the analysis. Data can be either in a matrix or data frame and must be numeric.

data = df[,-c(1:6)]

Modeling using REM for Exploratory Factor Analysis

The REM_EFA function, specifically designed for conducting exploratory factor analysis, requires specification of three parameters: X, k_range and delta. The X parameter corresponds to the dataframe or matrix to be analyzed. The k_range parameter comprises a vector of numbers of factors to be modeled. In the example provided, the algorithm conducts exploratory analysis with 1 to 3 factors. Moreover, the delta parameter serves as a hyperparameter, ranging from 0 to 1, and it reflects the researcher's tolerance level for potentially down-weighting data inaccurately within the model. The default for the delta parameter is 0.05.

The output contains a separate list of estimates for each number of factors under consideration.

model_EFA = REM_EFA(X = data, k_range = 1:3)

The summary function

The summary function displays the estimates of the optimal model which is selected using the lowest score of Bayesian Information Criterion (BIC) among the number of factors considered. In other words, it selects the optimal model after considering the BIC in the models with 1, 2, and 3 factors.

summary(model_EFA)

From the REM and EM output above, note that $x_4$, $x_5$, and $x_6$ strongly load on Factor 1. Similarly, $x_1$, $x_2$, and $x_3$ moderately load on Factor 2 and that $x_7$, $x_8$ and $x_9$ moderately load on Factor 3. Additionally, the gamma or average REM weight was 0.849, indicating that about 15 percent of the sample differed in the factor model parameters from the majority of the sample.

Plot Distribution of the weights

In the following plot we show the distribution of the weights to know how much people are getting down-weighted.

hist(model_EFA[[2]]$REM_output$weights,
     main="REM Weight Distribution",
     xlab = "Weights", 
     ylab = "Frequency")

Confirmatory Factor Analysis

To conduct a confirmatory factor analysis, we need to specify a model which consists of structural equations relating the measured variables with the latent variables. We consider the following three factor model:

$$\text{Visual} = x_1 + x_2 + x_3 $$ $$\text{Textual} = x_4 + x_5 + x_6 $$ $$\text{Speed} = x_7 + x_8 + x_9$$ To create the model parameter, first create a string variable that contains each equation in a new line and separate equalities using the $=\sim$ symbol as shown below.

# Define your model as a string
model <- " Visual =~ x1  +  x2  +  x3 
           Textual =~ x4 + x5 + x6
           Speed =~ x7 + x8 + x9
"

This model is used to investigate the relationships between the latent variables "Visual","Textual", and "Speed" and their respective sets of observed variables by examining how well the observed variables predict the underlying constructs. The model will estimate coefficients for these relationships, providing information about the strength and direction of these associations.

Finally, use the REM_CFA() function to perform the confirmatory factor analysis. For this include X (the numeric data frame under consideration), delta (a hyperparameter), and the model parameter.

# CFA model with delta = 0.05
model_CFA = REM_CFA(X = data, model = model)

summary(model_CFA)

The output above depicts estimates for the confirmatory analysis. Note that again $x_4$, $x_5$, and $x_6$ are strongly correlated with Factor 1. Similarly, $x_1$,$x_2$, and $x_3$ are moderately correlated with factor 2 and that $x_7$, $x_8$ and $x_9$ are moderately correlated with Factor 3. However, the gamma or average REM weight was 0.833. Note that the difference between the EM and REM intercepts are negative indicating that this down-weighted group has lower average item scores.