library(blavaan, quietly=TRUE)
library(lavaan, quietly=TRUE)

Introduction

The traditional method for model comparison in frequentist SEM (fSEM) is the $\chi^2$ likelihood ratio test and its variations. For BSEM, we instead take general Bayesian model comparison methods and apply them to SEM.

Specifically, we will focus on two information criteria: (1) the Widely Applicable Information Criterion (WAIC) and (2) Leave-One-Out cross-validation (LOO).

These methods evaluate the models' out-of-sample predictive accuracy, that is, their ability to predict data points that were not used to fit the model, and allow us to compare that performance across models [@mcelreath_statistical_2020].

For this example we will use the Industrialization and Political Democracy data [@bollen_structural_1989].
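The data set ships with lavaan, so we can load it and take a quick look before fitting any models (a minimal sketch; the lavaan data set contains the 11 observed indicators x1-x3 and y1-y8):

# Load the Industrialization and Political Democracy data from lavaan
data("PoliticalDemocracy", package = "lavaan")

# Quick look at the observed indicators
dim(PoliticalDemocracy)
head(PoliticalDemocracy)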

model <- '
  # latent variable definitions
     ind60 =~ x1 + x2 + x3
     dem60 =~ a*y1 + b*y2 + c*y3 + d*y4
     dem65 =~ a*y5 + b*y6 + c*y7 + d*y8

  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60

  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit1 <- bsem(model, data=PoliticalDemocracy,
            std.lv=T, meanstructure=T, n.chains=3,
            burnin=500, sample=1000)

Widely Applicable Information Criterion

WAIC [@watanabeAsymptoticEquivalenceBayesa] can be seen as a fully Bayesian generalization of the Akaike Information Criterion (AIC). Its central ingredient is the log-pointwise-predictive-density (lppd), which summarizes the model's predictive density for each row of the data across all posterior draws. The WAIC is defined as

\begin{equation} WAIC = -2 lppd + 2 efp_{WAIC}, \end{equation}

where the first term involves the log-likelihoods of the observed data (marginal over the latent variables) and the second term is the effective number of parameters. The $lppd$ is estimated as:

\begin{equation} \widehat{lppd} = \sum^{n}_{i=1} \log \Bigg(\frac{1}{S}\sum^{S}_{s=1}f(y_{i}|\theta^{s}) \Bigg) \end{equation}

where $S$ is the number of posterior draws and $f(y_{i}|\theta^{s})$ is the density of observation $i$ with respect to the parameters sampled at iteration $s$.

The effective number of parameters ($efp_{WAIC}$) is calculated as:

\begin{equation}\label{eq:efpWAIC} efp_{WAIC} = \sum^n_{i=1} var_{s}\big(\log f(y_{i}|\theta^{s})\big) \end{equation}

A separate variance is estimated for each observation $i$ across the $S$ posterior draws.
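To make these definitions concrete, the sketch below computes lppd, $efp_{WAIC}$, and WAIC by hand from a hypothetical $S \times n$ matrix of pointwise log-likelihoods (rows are posterior draws, columns are observations). The matrix log_lik is simulated here for illustration rather than extracted from a fitted blavaan object:

# Hypothetical S x n matrix of log f(y_i | theta^s): 1000 draws, 75 observations
set.seed(1)
log_lik <- matrix(rnorm(1000 * 75, mean = -10), nrow = 1000, ncol = 75)

# lppd: log of the average (over draws) density, summed over observations
lppd <- sum(log(colMeans(exp(log_lik))))

# effective number of parameters: per-observation variance of the log density
efp_waic <- sum(apply(log_lik, 2, var))

# WAIC on the deviance scale
waic_hand <- -2 * lppd + 2 * efp_waic
waic_hand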

Leave-One-Out cross-validation

LOO measures the predictive density of each observation, holding out one observation at a time and using the rest of the observations to update the prior. It is estimated via [@vehtari_practical_2017]:

\begin{equation} LOO = -2\sum_{i=1}^{n} \log \Bigg(\frac{\sum^{S}_{s=1} w^{s}_{i} f(y_{i}|\theta^{s})}{\sum^{S}_{s=1} w^{s}_{i}}\Bigg) \end{equation}

where the $w^{s}_{i}$ are Pareto-smoothed sampling weights based on the relative magnitude of observation $i$'s density function across the $S$ posterior samples.

The LOO effective number of parameters involves the $lppd$ term from WAIC:

\begin{equation} efp_{LOO} = lppd + LOO/2 \end{equation}
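In practice, the Pareto-smoothed importance weights are computed by the loo package, which blavaan relies on internally. As a rough sketch, the same kind of pointwise log-likelihood matrix used above can be passed directly to loo::loo() and loo::waic() (loo() will warn that relative effective sample sizes were not supplied, which is fine for this illustration):

library(loo)

# Reusing the hypothetical log-likelihood matrix from the WAIC sketch
loo_hand <- loo(log_lik)   # PSIS-LOO estimates of elpd_loo, p_loo, looic
waic_pkg <- waic(log_lik)  # WAIC for comparison

print(loo_hand)
print(waic_pkg)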

Model comparison

Because both WAIC and LOO are built from pointwise contributions across posterior draws, we can calculate a standard error for each criterion and for model comparisons involving them.

Model comparisons are based on the difference in expected log-pointwise-predictive-density (elpd) between two models, together with the standard error of that difference.

There are no clear cutoff rules for interpreting and presenting these comparisons, and researchers need to use their expert knowledge as part of the decision process. The best recommendation is to present the difference in elpd ($\Delta elpd$), its standard error, and the ratio between them. A ratio of at least 2 can be considered evidence of a difference between the models, and a ratio of 4 or more can be considered stronger evidence.
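Conceptually, the difference and its standard error come from the pointwise elpd contributions of the two models, which is what loo::loo_compare() computes internally. The sketch below illustrates the calculation with the hypothetical log-likelihood matrix from the earlier WAIC sketch standing in for model 1, and a perturbed copy standing in for model 2:

# Hypothetical pointwise elpd contributions for two models
elpd_i_m1 <- log(colMeans(exp(log_lik)))
elpd_i_m2 <- elpd_i_m1 - rnorm(75, mean = 0.1, sd = 0.3)

# Difference in elpd and its standard error (sqrt(n) * sd of pointwise diffs)
diff_i    <- elpd_i_m1 - elpd_i_m2
elpd_diff <- sum(diff_i)
se_diff   <- sqrt(length(diff_i)) * sd(diff_i)

c(elpd_diff = elpd_diff, se_diff = se_diff, ratio = abs(elpd_diff / se_diff))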

For the first example, we will compare the standard political democracy model with a model where all factor regressions are fixed to 0.

model <- '
  # latent variable definitions
     ind60 =~ x1 + x2 + x3
     dem60 =~ a*y1 + b*y2 + c*y3 + d*y4
     dem65 =~ a*y5 + b*y6 + c*y7 + d*y8

  # regressions
    dem60 ~ 0*ind60
    dem65 ~ 0*ind60 + 0*dem60

  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit2 <- bsem(model, data=PoliticalDemocracy,
            std.lv=T, meanstructure=T, n.chains=3,
            burnin=500, sample=1000)

Once we have the two models, we can compare them with blavCompare():

bc12 <- blavCompare(fit1, fit2)

By looking at this comparison object, you can see the WAIC and LOO estimates, as well as the respective differences between models. As these are information criteria, the best model is the one with the lowest value:

bc12

In this case we can see that model 1 has the lower LOOIC, and the ratio shows that the elpd difference is about 5 standard errors in magnitude. This indicates that the model with the estimated regressions is preferred:

abs(bc12$diff_loo[,"elpd_diff"] / bc12$diff_loo[,"se_diff"])
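The same ratio can be computed from the WAIC-based comparison. The diff_waic element below is assumed to mirror the structure of the diff_loo element used above; check names(bc12) if your blavaan version differs:

# WAIC-based version of the same ratio ("diff_waic" element name assumed)
abs(bc12$diff_waic[,"elpd_diff"] / bc12$diff_waic[,"se_diff"])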

Now, let's look at an example with a smaller difference between models, where only the smallest regression (dem65~ind60) is fixed to 0.

model <- '
  # latent variable definitions
     ind60 =~ x1 + x2 + x3
     dem60 =~ a*y1 + b*y2 + c*y3 + d*y4
     dem65 =~ a*y5 + b*y6 + c*y7 + d*y8

  # regressions
    dem60 ~ ind60
    dem65 ~ 0*ind60 + dem60

  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit3 <- bsem(model, data=PoliticalDemocracy,
            std.lv=T, meanstructure=T, n.chains=3,
            burnin=500, sample=1000)
bc13 <- blavCompare(fit1, fit3)

Looking at the LOOIC, we see that the difference between the two models is minimal, and the ratio is 0.21. This indicates that the models are functionally equivalent. In a case like this, it is up to the researchers to decide which model is a better representation and theoretically stronger.

bc13

abs(bc13$diff_loo[,"elpd_diff"] / bc13$diff_loo[,"se_diff"])

Let's fit one last model, where only the largest regression (dem65~dem60) is fixed to 0.

model <- '
  # latent variable definitions
     ind60 =~ x1 + x2 + x3
     dem60 =~ a*y1 + b*y2 + c*y3 + d*y4
     dem65 =~ a*y5 + b*y6 + c*y7 + d*y8

  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + 0*dem60

  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit4 <- bsem(model, data=PoliticalDemocracy,
            std.lv=T, meanstructure=T, n.chains=3,
            burnin=500, sample=1000)
bc14 <- blavCompare(fit1, fit4)

In this case, looking at the LOOIC, we see that model 1 is better (lower value), and the ratio shows that the difference is about 5 standard errors in magnitude, indicating evidence of a difference in predictive performance:

bc14
abs(bc14$diff_loo[,"elpd_diff"] / bc14$diff_loo[,"se_diff"])

Bayes factor

In the Bayesian literature, the Bayes factor (BF) is often used to compare models. There are a number of criticisms of the BF in BSEM: (1) the BF is unstable for large models (like most SEMs), (2) it is highly sensitive to the model priors, (3) it requires strong priors for stable estimation, (4) it can require a large number of posterior draws, and (5) estimation via the marginal likelihood ignores much of the information in the posterior distributions. For more details on this discussion, please see @tendeiro_review_2019 and @schad_workflow_2022. These criticisms lead us to recommend against use of the BF in everyday BSEM estimation. For researchers who commit to their prior distributions and to exploring the noise in their computations, the BF can be used to describe the relative odds of one model over another, which is more intuitive than some other model comparison metrics.
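If you nevertheless want to inspect it, blavCompare() also reports a Laplace approximation to the marginal log-likelihoods and the resulting (log) Bayes factor. The element names bf and mll below are assumptions about the structure of the returned list and may differ across blavaan versions (check names(bc12)):

# Approximate Bayes factor output from blavCompare(); element names assumed
bc12$bf    # log-Bayes factor based on the Laplace-approximated marginal likelihoods
bc12$mll   # marginal log-likelihoods of the two models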

Summary

We recommend LOO or WAIC as general model comparison metrics for BSEM. They estimate the models' out-of-sample predictive accuracy and the differences between models across posterior draws, and they provide uncertainty estimates for the comparison.

In most cases LOO and WAIC will lead to similar results, and LOO is recommended as the more stable metric [@vehtari_practical_2017]. In general, a $\Delta elpd$ of at least 2 standard errors (and preferably 4) can be interpreted as evidence of differential predictive accuracy.
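As a convenience, the three comparisons from this tutorial can be summarized side by side with the same ratio used throughout; this is just a wrapper around the expressions shown earlier:

# Collect the elpd-difference-to-SE ratios for the three comparisons
sapply(list(all_zero = bc12, small_zero = bc13, large_zero = bc14),
       function(bc) abs(bc$diff_loo[,"elpd_diff"] / bc$diff_loo[,"se_diff"]))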

References


