# Influence Diagnostics In HLMdiag: Diagnostic Tools for Hierarchical (Multilevel) Linear Models

is_check <- ("CheckExEnv" %in% search()) || any(c("_R_CHECK_TIMINGS_",

knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.show = "hide", eval = !is_check ) library(dplyr) library(ggplot2)  library(HLMdiag)  # Overview The HLMdiag package provides functionality to examine diagnostics for hierarchical linear models, including residuals values and influence diagnostics. This vignette aims to: • inform users how to use the new function hlm_influence(), which provides any easy way to obtain multiple influence diagnostics in one tibble • inform users how to use the new function hlm_augment(), which combines hlm_influence() and hlm_resid() to return influence diagnostics and residuals • introduce new functionality for lme created from the nlme package • compare the differences between Cook's distances values returned by HLMdiag and those returned by the lme4 and car packages ## hlm_influence() function ### Introduction The hlm_influence() function provides a wrapper that returns influence diagnostics appended to the original model frame. It is useful for assessing the influence of individual observations, or groups of observations, and can also be used with dotplot_diag() to visually explore the influence diagnostics. The diagnostics returned by hlm_influence() include Cook's distance, MDFFITS, covariance trace (covtrace), covariance ratio (covratio), leverage, and relative variance change (RVC). Cook's distance and MDFFITS both measure the distance between fixed effects estimated from the full data set and those obtained from the reduced data set. For Cook's distance, the change in parameter estimates is scaled by the estimated covariance matrix of the original parameter estimates, while MDFFITS scales this change by the estimated covariance matrix of the deletion estimates. The covariance trace and covariance ratio both measure how precision is affected by the deletion of a particular observation or group of observations i. Covariance trace is a measure of the ratio between the covariance matrices with and without unit i to the identity matrix, while covariance ratio is a comparison of the two covariance matrices with and without unit i using their determinants. Relative variance change (RVC) is a measurement of the ratio of estimates of the l th variance component with and without unit i. The final influence diagnostic returned by hlm_influence(), leverage, is the rate of change in the predicted response with respect to the observed response. For a full explanation of these influence diagnostics, including formulas, please refer to Loy and Hofmann (2014). To explore the functionality of hlm_influence(), we will use the data set classroom (West et. al, 2014). The data set consists of 1,190 observations of students. Students are grouped within classes, which are nested within schools. There are 312 distinct classes nested within 107 schools. data("classroom", package = "WWGbook")  print(as_tibble(classroom), width = Inf)  We will fit a simple three-level hierarchical linear model using the lme4 package: class.lmer <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classroom)  ### Obtain influence diagnostics at different levels with level hlm_influence() calculates influence diagnostics for individual cases, or larger groups. For example, to obtain a tibble with influence diagnostics for all 1,190 students we use the following: infl <- hlm_influence(class.lmer, level = 1)  print(infl, width = Inf)  Note that the parameter used to select individual observations is now called level, not group. Previous versions of HLMdiag used the group parameter for influence diagnostics, where group = NULL was the default, and users could choose to delete groups instead by setting group equal to level names. As of HLMdiag version 0.4.0, group has been replaced by level. level defaults to level = 1 which deletes individual observations, exactly how group = NULL used to work. The resulting tibble can be used along with dotplot_diag() to identify potentially influential observations using either an internal cutoff of 3*IQR or a user-specified cutoff. For example, the plot shown below reveals and labels all observations above the internally calculated cutoff for Cook's distance. dotplot_diag(infl$cooksd, name = "cooks.distance", cutoff = "internal")


The plot illustrates that there are many observations with a Cook's distance value above the internally calculated cutoff. dotplot_diag() labels the top 5, which is difficult to see in this case. Rather than investigate all observations above the cutoff, we may be interested in the observations with the highest Cook's distance values. The tibble returned by hlm_influence() provides an easy way to do this when used with the arrange() function from dplyr.

tb1 <- infl %>%
arrange(desc(cooksd))

print(tb1, width = Inf)


There does not appear to be a clear pattern among students who have relatively higher Cook's distance values; they do not tend to be from a particular school or class. In order to investigate these observations further, one could look to summary statistics for the different explanatory variables. Additionally, a similar analysis could be done with the other influence diagnostics provided in the tibble from hlm_influence().

In addition to identifying influential observations, it may also be useful to identify influential groups. The level parameter in hlm_influence() can be used for this purpose. For example, we can obtain influence diagnostics for each class:

infl.classes <- hlm_influence(class.lmer, level = "classid:schoolid")

print(infl.classes, width = Inf)


Note that the level parameter is set as classid:schoolid, not classid. The level parameter should be specified as found in model@flist for lmerMod objects. For three level models, this usually takes the form of "level2:level3", where level2 is the name of the second level variable, and level3 is the name of the third level variable.

Using dotplot_diag() again with one of the columns from the resulting tibble of hlm_influence reveals that class 218 has a relatively high Cook's distance value.

dotplot_diag(infl.classes$cooksd, name = "cooks.distance", cutoff = "internal")  We can repeat a similar analysis to flag influential schools. infl.schools <- hlm_influence(class.lmer, level = "schoolid")  print(infl.schools, width = Inf)  Similarly, we use dotplot_diag to visually represent the differences. In this case, there are only four observations above the cutoff, so we set modify = "dotplot" in order to better visualize the influential observations. modify = "dotplot" works well when there are a relatively small number of observations above the cutoff. dotplot_diag(infl.schools$cooksd, name = "cooks.distance", cutoff = "internal", modify = "dotplot")


The plot reveals that schools 68, 75, 70, and 27 have relatively high Cook's distance values.

### Investigate deletion of specific observations or groups with delete

The delete parameter allows the user to calculate influence diagnostics for a specified group of observations, or group factor level. For example, influence diagnostics for the influential schools flagged above (68, 75, 70, and 27) can be calculated as shown below, deleting all four schools at once:

hlm_influence(class.lmer, level = "schoolid", delete = c("27", "70", "75", "68"))


Note that in this case, delete is specified as a character vector consisting of the school ID's of interest. delete can also be set as a numeric vector of indices; in this case, setting delete to the row numbers of all students in the data frame who attend schools 68, 75, 70, or 27 would be equivalent to the line above.

### Select different types of leverage with leverage argument

In the previous examples, hlm_influence() only returned overall leverage. However, hlm_influence() also allows for other types of leverage, including the leverage corresponding to the fixed effects, the leverage corresponding to the random effects as proposed by Demidenko and Stukel (2005), and the unconfounded leverage corresponding to the random effects proposed by Nobre and Singer (2011). These types of leverage are referred to as fixef, ranef, and ranef.uc, respectively.

None of our observations are flagged for high leverage when looking only at overall leverage:

dotplot_diag(infl$leverage.overall, name = "leverage", cutoff = "internal")  However, we can obtain the other types of leverage as follows: infl2 <- hlm_influence(class.lmer, level = 1, leverage = c("overall", "fixef", "ranef", "ranef.uc"))  This example illustrates how to select all four types of leverage, but any one or more may also be selected. We can then use dotplot_diag() with one of the new leverage columns to investigate outlier for that type of leverage. dotplot_diag(infl2$leverage.fixef, name = "leverage", cutoff = "internal", modify = "dotplot")


However, further analysis reveals that several observations have high leverage when considering only the leverage corresponding to the fixed effects.

### Choose between approximations or full refits with approx

hlm_influence() defaults to calculating influence diagnostics based off a one step approximation (Christensen et.al 1992; Shi and Chen 2008; Zewotir 2008). However, the approx parameter allows the user to calculate influence diagnostics based off a full refit of the data using hlm_influence(). For example, if we wished to calculate influence diagnostics for each school by fully refitting the model each time, we could use:

infl3 <- hlm_influence(class.lmer, level = "schoolid", approx = FALSE)

print(infl3, width = Inf)


In most cases, using the default of approx = TRUE is sufficient for influence analysis. Setting approx = FALSE also takes much longer than the default setting since the model must be refit each time with each group or observation deleted. However, the full refit method also allows for relative variance change (RVC) to be returned. If this is desired, approx = FALSE should be used.

### na.action and the data argument

hlm_influence() was written to respect the na.action parameter from the lme4 package. This argument defaults to na.omit, which means all rows in the data sets with NAs present are simply deleted from the model. However, there is also an na.exclude option, which pads the resulting tibbles with NAs in the the rows that contained NAs in the original data set in place of deleting them altogether. In order for hlm_influence() to do this, the original data set must be passed into hlm_influence() via the data argument whenever the na.action was set to na.exclude in the model fitting process.

For example, while the class data set does not have any NAs in the data set, we can introduce a couple for the purposes of an example.

classNA <- classroom
classNA[2,3] <- NA
classNA[3,4] <- NA


We can then fit the same model using the lme4 package as before. Below, we fit two models, one with na.action = "na.exclude" and one with the default na.action = "na.omit".

class.lmer.exclude <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.exclude")
class.lmer.omit <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.omit")


We then run hlm_influence() on the model where na.action = "na.omit".

infl4 <- hlm_influence(class.lmer.omit, level = 1)

print(infl4, width = Inf)


Note than there are only 1,188 rows in the returned tibble, although there were 1,190 observations in the original data set. The two rows with NAs were deleted from the returned tibble.

We repeat this with the model where na.action = "na.exclude".

 infl5 <-hlm_influence(class.lmer.exclude, level = 1, data = classNA)

print(infl5, width = Inf)


In this tibble, there are 1,190 rows. Furthermore, the two rows with NAs display NAs for the influence diagnostics, instead of being entirely absent as in the above example. It is important to note that the data argument is necessary. Failing to provide the data set through the data argument in this situation will result in an error.

## hlm_augment function

The hlm_augment() function combines the hlm_resid() and hlm_influence() functions to return a tibble containing information about the residuals and the influence diagnostics appended to the data. For example, we can obtain residuals and influence diagnostics at once for all students in the class data set with the following:

aug <- hlm_augment(class.lmer, level = 1)

print(aug, width = Inf)


This is useful for inspecting residuals values and influence diagnostics values at the same time. However, hlm_augment() lacks some of the functionality that hlm_influence() and hlm_resid() have. The delete and approx parameters available for hlm_influence() are not available in hlm_augment(), so the function will always use a one step approximation and delete all observations or groups instead of a selected few. The standardize parameter from hlm_resid() is also not included, so hlm_influence() or hlm_resid() should be used instead if this functionality is desired. For more information about available functionality in hlm_resid(), see the hlm_resid vignette.

hlm_augment() is especially useful for inspecting residual values of observations with relatively high influence diagnostics values, or vice versa.

aug2 <- aug %>%
arrange(desc(cooksd))

print(aug2, width = Inf)


We can sort by a particular column in order to inspect the values of other influence diagnostics and the residuals of influential observations.

## lme objects from nlme package

Previously, only the individual one step approximation influence functions worked on lme models fit using the nlme package. However, hlm_influence() can also be used on lme objects, as can hlm_augment(). This allows the user to calculate influence diagnostics for lme models by fulling refitting the model using the approx = FALSE argument.

### Important differences for lme objects

In most cases, using a lme object for hlm_influence() or hlm_augment() is identical to their usage with lmerMod objects from lme4. However, there are a few notable exceptions.

For both hlm_influence() and hlm_augment(), levels should be specified by names that appear in model@flist. For the second level of a three level lmerMod model, this usually looks like "level2:level3" where level2 and level3 are the names of the second and third level variables, respectively. However, for a lme model, levels should be specified by names that appear in model$groups. For example, we can obtain influence diagnostics for each classroom from the class data set in the following way: class.lme <- nlme::lme(mathgain ~ mathkind + sex + minority + ses + housepov, random = ~ 1|schoolid/classid, data = classroom) hlm_influence(class.lme, level = "classid")  For the lmerMod model, we specified level = classid:schoolid. However, for lme models, the name of the second level alone is sufficient. However, specifying names of specific groups to delete for lme models is also slightly different. For example, we can obtain influence diagnostics for a model created when classes 160 and 217 are deleted for a lmerMod model in the following way: hlm_influence(class.lmer, level = "classid:schoolid", delete = c("160:1", "217:1"))  Obtaining influence diagnostics for a model created with the deletion of classes 160 and 217 from a lme model is a bit different: hlm_influence(class.lme, level = "classid", delete = c("1/160", "1/217"))  Note that both examples specify units for deletion in the same way they are specified in model@flist (lmerMod models) or model$groups (lme models).

Additionally, lmerMod models require that the data argument is passed into hlm_influence() and hlm_augment() when na.action = "na.exclude" during the model fitting process. However, this is unnecessary for lme models.

## Cook's distance values comparison to other packages

The HLMdiag package has two different ways to calculate Cook's distance. One of them, which is more exact, refits the model with each observation or group deleted from the model and calculates Cook's distance from the resulting coefficient estimates and variance components estimates. The second is a one-step approximation. For more information about how the one step approximation works, we refer the reader to Christensen et.al (1992); Shi and Chen (2008); and Zewotir (2008). Other R packages also have functions that can calculate Cook's distance. In this section, we highlight the differences between the ways of calculating Cook's distance in HLMdiag versus other methods in the car and lme4 packages.

### The car package

The cooks_distance() function from HLMdiag calculates Cook's distance through a full refit of the model using the following formula:

$C_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta})}^{-1}(\hat{\beta} - \hat{\beta}{(i)})$

Notice that the difference in the change in the parameter estimates is scaled by the estimated covariance matrix of the original parameter estimates. We can calculate the Cook's distance values in the HLMdiag package by first using the case_delete() function, followed by the cooks.distance() function.

hlm_case <- HLMdiag:::case_delete.lmerMod(class.lmer)
hlm_cd_full <- HLMdiag:::cooks.distance.case_delete(hlm_case)


Conversely, the mdffits() function from HLMdiag calculates MDFFITS, which is a multivariate version of the DFFITS statistic. This is calculated as follows:

$MDFFITS_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$

Instead of scaling by the estimated covariance matrix of the original parameter estimates, calculations for MDFFITS are scaled by the estimated covariance matrix of the deletion estimates. For each deleted observation or group, the model is refit and the covariance structure re-estimated without unit i. We can calculate this similarly to how we calculated Cook's distance, using the same case_delete object.

hlm_mdffits <- mdffits(hlm_case)


These estimates are pretty similar to those produced by cooks_distance(); the difference is due to the use of the covariance matrix of the deletion estimates, rather than the original estimates.

The car package calculates Cook's distance similarly to how the HLMdiag package calculates MDFFITS. Instead of using the covariance matrix of the original parameter estimates like HLMdiag's cooks_distance() function, the cooks.distance() function from car uses the estimated covariance matrix of the deletion estimates. However, the cooks_distance() function from car is not identical to the mdffits function from HLMdiag; the car::cooks.distance() function also scaled each observation, i, by dividing it by the estimated variance of the model calculated without observation i. Therefore, the formula used to calculate Cook's distance in the car package is as follows:

$C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$

We can calculate this by first using the influence() function followed by the cooks.distance() function.

library(car)
car_case <- influence(class.lmer)
car_cd <- cooks.distance(car_case)


These values initially seem fairly different from the MDFFITS and Cook's distance values produced by HLMdiag, due to the additional scaling by the inverse of the variance of each refitted model. We can adjust for this by multiplying the vector of Cook's distance values from car by the estimated variance from each refitted model.

sig.sq <- car_case[[4]][,1]


Once the values from car have been adjusted by sigma squared, they now appear very similar to the MDFFITS values from HLMdiag. The plot below shows the difference between the MDFFITS estimates from HLMdiag and the variance-adjusted Cook's distance estimates from car for all 1,190 observations in the classroom data set.

data.frame(car_adjusted = car_cd_adjusted, mdffits = hlm_mdffits, obs = 1:1190) %>%
mutate(diff = mdffits - car_adjusted) %>%
ggplot() +
geom_point(aes(y = diff, x = obs))


While the difference between the estimates varies slightly due to differences in how the fixed effects and variance components are calculated for refit models in both packages, the difference in values tends to be very small.

### The lme4 package

Similar to HLMdiag, the lme4 package has two methods that calculate Cook's distance. One of them, similar to the case_delete method, conducts a full refit of models without each observation or group. The second is a quicker approximation.

#### lme4 approximation

In order to calculate the approximation of Cook's distance values from lme4, we use the cooks.distance() function.

lme_cd_approx <- lme4:::cooks.distance.merMod(class.lmer)


However, this approximation is fairly different from the one produced from HLMdiag.

hlm_cd_approx <- HLMdiag:::cooks.distance.lmerMod(class.lmer)


The approximation from HLMdiag is also closer to the full refit calculated by HLMdiag than the lme4 approximation is. The plots below show the difference between the full refit from HLMdiag and the HLMdiag approximation (left) and the lme4 approximation (right).

library(gridExtra)
p1 <- data.frame(hlm_full = hlm_cd_full, hlm_approx = hlm_cd_approx, obs = 1:1190) %>%
mutate(diff = hlm_full - hlm_approx) %>%
ggplot() +
geom_point(aes(x = diff, y = obs)) +
labs(x = "HLMdiag full refit - HLMdiag approximation")

p2 <- data.frame(hlm_full = hlm_cd_full, lme_approx = lme_cd_approx, obs = 1:1190) %>%
mutate(diff = hlm_full - lme_approx) %>%
ggplot() +
geom_point(aes(x = diff, y = obs)) +
labs(x = "HLMdiag full refit - lme4 approximation")

grid.arrange(p1, p2, nrow = 2)


The estimates produced by the lme4 approximation are never smaller than the values from the HLMdiag full refit, but they tend to be further off the HLMdiag full refit values than the HLMdiag approximation values are. The difference between the full refit and the approximation from HLMdiag tends to be very small (< 0.0005), while the difference between the full refit and the lme4 approximation values tends to be much larger.

#### lme4 full refit

lme4 also has a function that performs a full refit of the models without each observation or group and calculates Cook's distance values based off those refits. We can calculate this by using the influence function from lme4.

lme_case <- lme4:::influence.merMod(class.lmer, data = classroom)
cd_lme_full <- cooks.distance(lme_case)


lme4's cook's distance function for objects created from the influence function has a bug that prevents us from obtaining Cook's distance values from the influence object using lme4; however, we can use the cooks.distance function from the stats package on the influence object from lme4.

The values obtained from lme4 match those from car, meaning that lme4 is also scaling the estimates by dividing by the estimated variance of each refit model, in addition to using the estimated covariance matrix from the deletion estimates, rather than the original estimates. Therefore, the lme4 package is also using this formula to calculate Cook's distance:

$C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$

The full refits from the lme4 and car packages both choose to calculate what is MDFFITS in the HLMdiag package, and additionally choose to scale by dividing by the variance from each refit model. Those choices explain almost all of the differences between Cook's distance values from the three different packages; additional variation is due to differences in how the new models are fit and coefficients estimated.

## References

Bates D, Maechler M, Bolker B, Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48.

Christensen R, Pearson L, Johnson W (1992). “Case-Deletion Diagnostics for Mixed Models.” Technometrics, 34(1), 38–45.

Fox J, Weisburg S (2019). A {R} Companion to Applied Regression, Third Addition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/

Loy A, Hofmann H (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models in R. Journal of Statistical Software, 56(5), 1-28.

Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team (2020). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-148, .

Shi L, Chen G (2008). “Case Deletion Diagnostics in Multilevel Models.” Journal of Multivariate Analysis, 99(9), 1860–1877.

West, B., Welch, K. & Galecki, A. (2006) Linear Mixed Models: A Practical Guide Using Statistical Software. First Edition. Chapman Hall / CRC Press. ISBN 1584884800

Zewotir T (2008). “Multiple Cases Deletion Diagnostics for Linear Mixed Models.” Communications in Statistics – Theory and Methods, 37(7), 1071–1084.

## Try the HLMdiag package in your browser

Any scripts or data that you put into this service are public.

HLMdiag documentation built on May 2, 2021, 9:06 a.m.