knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette explains the computational shortcut implemented in semfindr to approximate the casewise influence. The approximation is most beneficial when the sample size N is large, where the approximation works better and the computational cost for refitting models N times is high.
library(semfindr) dat <- pa_dat
library(lavaan) mod <- " m1 ~ iv1 + iv2 dv ~ m1 " fit <- sem(mod, dat)
if (file.exists("semfindr_fit_rerun.RDS")) { fit_rerun <- readRDS("semfindr_fit_rerun.RDS") } else { fit_rerun <- lavaan_rerun(fit) saveRDS(fit_rerun, "semfindr_fit_rerun.RDS") }
lavaan provides the handy lavScores() function to evaluate
$$s_i(\theta_m) = \frac{\partial \ell_i(\theta_m)}{\partial \theta_m}$$
for observation $i$, where $\ell_i(\theta)$ denotes the casewise loglikelihood function and $\theta_m$ is the $m$th model parameter.
For example,
head(lavScores(fit)[ , 1, drop = FALSE])
indicates the partial derivative of the casewise loglikelihod with respect to the parameter m1~iv1. Because the sum of the partial derivatives across all observations is zero at the maximum likelihood estimate with the full sample ($\hat \theta_m$; i.e., the derivative of loglikelihood of the full data is 0), $- s_i(\theta_m)$ can be used as an estimate of the partial derivative of the loglikelihood at $\hat \theta_m$ for the sample without observation $i$. This information can be used to approximate the maximum likelihood estimate for $\theta_m$ when case $i$ is dropped, denoted as $\hat \theta_{m(-i)}$
The second-order Taylor series expansion can be used to approximate the parameter vector estimate with an observation deleted, $\hat \theta_{(-i)}$, as in the iterative Newton's method. Specifically,
$$\hat \theta_{(i)} \approx \hat \theta - \frac{N}{N - 1}V(\hat \theta) \nabla \ell(\hat \theta)$$ $$\hat \theta - \hat \theta_{(i)} \approx \frac{N}{N - 1}V(\hat \theta) \nabla \ell_i(\hat \theta),$$ where $\nabla \ell_i(\hat \theta)$ is the gradient vector of the casewise loglikelihood with respect to the parameters (i.e., score). The $N / (N - 1)$ term is used to adjust for the decrease in sample size (this adjustment is trivial in large samples). This procedure should be the same as equation (4) of Tanaka et al. (1991) (p. 3807) and is related to the one-step approximation described by Cook and Weisberg (1982) (p. 182).
The approximation is implemented in the est_change_raw_approx() function:
fit_est_change_approx <- est_change_raw_approx(fit) fit_est_change_approx
Here is a comparison between the approximation using semfindr::est_change_raw_approx() and semfindr::est_change_raw()
# From semfindr fit_est_change_raw <- est_change_raw(fit_rerun) # Plot the differences library(ggplot2) tmp1 <- as.vector(t(as.matrix(fit_est_change_raw))) tmp2 <- as.vector(t(as.matrix(fit_est_change_approx))) est_change_df <- data.frame(param = rep(colnames(fit_est_change_raw), nrow(fit_est_change_raw)), est_change = tmp1, est_change_approx = tmp2) ggplot(est_change_df, aes(x = est_change, y = est_change_approx)) + geom_abline(intercept = 0, slope = 1) + geom_point(size = 0.8, alpha = 0.5) + facet_wrap(~ param) + coord_fixed()
The results are pretty similar.
We can use the approximate parameter changes to approximate the gCD (see also Tanaka et al., 1991, equation 13, p. 3811):
# Information matrix (Hessian) information_fit <- lavInspect(fit, what = "information") # Short cut for computing quadratic form (https://stackoverflow.com/questions/27157127/efficient-way-of-calculating-quadratic-forms-avoid-for-loops) gcd_approx <- (nobs(fit) - 1) * rowSums( (fit_est_change_approx %*% information_fit) * fit_est_change_approx )
This is implemented in the est_change_approx() function:
fit_est_change_approx <- est_change_approx(fit) fit_est_change_approx
# Compare to exact computation fit_est_change <- est_change(fit_rerun) # Plot gcd_df <- data.frame( gcd_exact = fit_est_change[ , "gcd"], gcd_approx = fit_est_change_approx[ , "gcd_approx"] ) ggplot(gcd_df, aes(x = gcd_exact, y = gcd_approx)) + geom_abline(intercept = 0, slope = 1) + geom_point() + coord_fixed()
The approximation tend to underestimate the actual gCD but the rank ordering is almost identical. This is discussed also in Tanaka et al. (1991), who proposed a correction by applying a one-step approximation after the correction (currently not implemented due to the need to recompute scores with updated parameter values).
cor(gcd_df, method = "spearman")
The casewise loglikelihood---the contribution to the likelihood function by an observation---can be computed in lavaan, which approximates the change in loglikelihood when an observation is deleted:
lli <- lavInspect(fit, what = "loglik.casewise") head(lli)
Here, $\ell(\hat \theta)$ will drop r abs(round(lli[1], 2)) when observation 1 is deleted. This should approximate $\ell(\hat \theta_{(-i)})$ as long as $\hat \theta_{(-i)}$ is not too different from $\hat \theta$. Here's a comparison:
# Predicted ll without observation 1 fit@loglik$loglik - lli[1] # Actual ll without observation 1 fit_no1 <- sem(mod, dat[-1, ]) fit_no1@loglik$loglik
They are pretty close. To approximate the change in $\chi^2$, as well as other $\chi^2$-based fit indices, we can use the fit_measures_change_approx() function:
chisq_i_approx <- fit_measures_change_approx(fit) # Compare to the actual chisq when dropping observation 1 c(predict = chisq_i_approx[1, "chisq"] + fitmeasures(fit, "chisq"), actual = fitmeasures(fit_no1, "chisq"))
Change in $\chi^2$
# Exact measure from semfindr out <- fit_measures_change(fit_rerun) # Plot chisq_change_df <- data.frame( chisq_change = out[ , "chisq"], chisq_change_approx = chisq_i_approx[ , "chisq"] ) ggplot(chisq_change_df, aes(x = chisq_change, y = chisq_change_approx)) + geom_abline(intercept = 0, slope = 1) + geom_point() + coord_fixed()
Change in RMSEA
# Plot rmsea_change_df <- data.frame( rmsea_change = out[ , "rmsea"], rmsea_change_approx = chisq_i_approx[ , "rmsea"] ) ggplot(rmsea_change_df, aes(x = rmsea_change, y = rmsea_change_approx)) + geom_abline(intercept = 0, slope = 1) + geom_point() + coord_fixed()
The values aligned reasonably well along the 45-degree line.
The approximate approach is tested only for models fitted by maximum likelihood (ML) with normal theory standard errors (the default).
The approximate approach does not yet support multilevel models.
The lavaan object will be checked by approx_check() to see if it is
supported. If not, an error will be raised.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall. https://conservancy.umn.edu/handle/11299/37076
Tanaka, Y., Watadani, S., & Ho Moon, S. (1991). Influence in covariance structure analysis: With an application to confirmatory factor analysis. Communications in Statistics - Theory and Methods, 20(12), 3805–3821. https://doi.org/10.1080/03610929108830742
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.