Loglinear Models"
In vcdExtra: 'vcd' Extensions and Additions

knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  fig.height = 6,
  fig.width = 7,
  fig.path = "fig/tut03-",
  dev = "png",
  comment = "##"
)

# save some typing
knitr::set_alias(w = "fig.width",
                 h = "fig.height",
                 cap = "fig.cap")

# preload datasets ???
set.seed(1071)
library(vcd)
library(vcdExtra)
library(ggplot2)
data(HairEyeColor)
data(PreSex)
data(Arthritis, package="vcd")
art <- xtabs(~Treatment + Improved, data = Arthritis)
if(!file.exists("fig")) dir.create("fig")

You can use the loglm() function in the MASS package to fit log-linear models. Equivalent models can also be fit (from a different perspective) as generalized linear models with the glm() function using the family='poisson' argument, and the gnm package provides a wider range of generalized nonlinear models, particularly for testing structured associations.

The visualization methods for these models were originally developed for models fit using loglm(), so this approach is emphasized here. Some extensions of these methods for models fit using glm() and gnm() are contained in the vcdExtra package and illustrated in @ref(sec:glm).

Assume we have a 3-way contingency table based on variables A, B, and C. The possible different forms of loglinear models for a 3-way table are shown in the table below. \@(tab:loglin-3way) The Model formula column shows how to express each model for loglm() in R. ^[For glm(), or gnm(), with the data in the form of a frequency data.frame, the same model is specified in the form glm(Freq $\sim$ ..., family="poisson"), where Freq is the name of the cell frequency variable and ... specifies the Model formula.] In the Interpretation column, the symbol "$\perp$" is to be read as "is independent of," and "$\;|\;$" means "conditional on," or "adjusting for," or just "given".

| Model | Model formula | Symbol | Interpretation | |:-------------------------|:-------------------|:---------------|:-----------------------| | Mutual independence | ~A + B + C | $[A][B][C]$ | $A \perp B \perp C$ | | Joint independence | ~A*B + C | $[AB][C]$ | $(A \: B) \perp C$ | | Conditional independence | ~(A+B)*C | $[AC][BC]$ | $(A \perp B) \;|\; C$ | | All two-way associations | ~A*B + A*C + B*C | $[AB][AC][BC]$ | homogeneous association| | Saturated model | ~A*B*C | $[ABC]$ | 3-way association |

For example, the formula ~A + B + C specifies the model of mutual independence with no associations among the three factors. In standard notation for the expected frequencies $m_{ijk}$, this corresponds to

$$ \log ( m_{ijk} ) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C \equiv A + B + C $$

The parameters $\lambda_i^A , \lambda_j^B$ and $\lambda_k^C$ pertain to the differences among the one-way marginal frequencies for the factors A, B and C.

Similarly, the model of joint independence, $(A \: B) \perp C$, allows an association between A and B, but specifies that C is independent of both of these and their combinations,

$$ \log ( m_{ijk} ) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} \equiv A * B + C $$

where the parameters $\lambda_{ij}^{AB}$ pertain to the overall association between A and B (collapsing over C).

In the literature or text books, you will often find these models expressed in shorthand symbolic notation, using brackets, [ ] to enclose the high-order terms in the model. Thus, the joint independence model can be denoted [AB][C], as shown in the Symbol column in the table. \@(tab:loglin-3way).

Models of conditional independence allow (and fit) two of the three possible two-way associations. There are three such models, depending on which variable is conditioned upon. For a given conditional independence model, e.g., [AB][AC], the given variable is the one common to all terms, so this example has the interpretation $(B \perp C) \;|\; A$.

Fitting with `loglm()` {#sec:loglm}

For example, we can fit the model of mutual independence among hair color, eye color and sex in HairEyeColor as

library(MASS)
## Independence model of hair and eye color and sex.  
hec.1 <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.1

Similarly, the models of conditional independence and joint independence are specified as

## Conditional independence
hec.2 <- loglm(~(Hair + Eye) * Sex, data=HairEyeColor)
hec.2

## Joint independence model.  
hec.3 <- loglm(~Hair*Eye + Sex, data=HairEyeColor)
hec.3

Note that printing the model gives a brief summary of the goodness of fit. A set of models can be compared using the anova() function.

anova(hec.1, hec.2, hec.3)

Fitting with `glm()` and `gnm()` {#sec:glm}

The glm() approach, and extensions of this in the gnm package allows a much wider class of models for frequency data to be fit than can be handled by loglm(). Of particular importance are models for ordinal factors and for square tables, where we can test more structured hypotheses about the patterns of association than are provided in the tests of general association under loglm(). These are similar in spirit to the non-parametric CMH tests described in \@ref(sec:CMH).

Example: The data Mental in the vcdExtra package gives a two-way table in frequency form classifying young people by their mental health status and parents' socioeconomic status (SES), where both of these variables are ordered factors.

data(Mental, package = "vcdExtra")
str(Mental)
xtabs(Freq ~ mental + ses, data=Mental)   # display the frequency table

Simple ways of handling ordinal variables involve assigning scores to the table categories, and the simplest cases are to use integer scores, either for the row variable (column effects'' model), the column variable (row effects'' model), or both (``uniform association'' model).

indep <- glm(Freq ~ mental + ses, family = poisson, data = Mental)  # independence model

To fit more parsimonious models than general association, we can define numeric scores for the row and column categories

# Use integer scores for rows/cols 
Cscore <- as.numeric(Mental$ses)
Rscore <- as.numeric(Mental$mental)

Then, the row effects model, the column effects model, and the uniform association model can be fit as follows. The essential idea is to replace a factor variable with its numeric equivalent in the model formula for the association term.

# column effects model (ses)
coleff <- glm(Freq ~ mental + ses + Rscore:ses, family = poisson, data = Mental)

# row effects model (mental)
roweff <- glm(Freq ~ mental + ses + mental:Cscore, family = poisson, data = Mental)

# linear x linear association
linlin <- glm(Freq ~ mental + ses + Rscore:Cscore, family = poisson, data = Mental)

The LRstats() function in vcdExtra provides a nice, compact summary of the fit statistics for a set of models, collected into a glmlist object. Smaller is better for AIC and BIC.

# compare models using AIC, BIC, etc
vcdExtra::LRstats(glmlist(indep, roweff, coleff, linlin))

For specific model comparisons, we can also carry out tests of nested models with anova() when those models are listed from smallest to largest. Here, there are two separate paths from the most restrictive (independence) model through the model of uniform association, to those that allow only one of row effects or column effects.

anova(indep, linlin, coleff, test="Chisq")  
anova(indep, linlin, roweff, test="Chisq")

The model of linear by linear association seems best on all accounts. For comparison, one might try the CMH tests on these data:

CMHtest(xtabs(Freq~ses+mental, data=Mental))

Non-linear terms

The strength of the gnm package is that it handles a wide variety of models that handle non-linear terms, where the parameters enter the model beyond a simple linear function. The simplest example is the Goodman RC(1) model [@Goodman:79], which allows a multiplicative term to account for the association of the table variables. In the notation of generalized linear models with a log link, this can be expressed as

$$ \log \mu_{ij} = \alpha_i + \beta_j + \gamma_{i} \delta_{j} ,$$

where the row-multiplicative effect parameters $\gamma_i$ and corresponding column parameters $\delta_j$ are estimated from the data.% ^[This is similar in spirit to a correspondence analysis with a single dimension, but as a statistical model.]

Similarly, the RC(2) model adds two multiplicative terms to the independence model,

$$ \log \mu_{ij} = \alpha_i + \beta_j + \gamma_{i1} \delta_{j1} + \gamma_{i2} \delta_{j2} . $$

In the gnm package, these models may be fit using the Mult() to specify the multiplicative term, and instances() to specify several such terms.

Example: For the Mental data, we fit the RC(1) and RC(2) models, and compare these with the independence model.

RC1 <- gnm(Freq ~ mental + ses + Mult(mental,ses), data=Mental, 
             family=poisson, verbose=FALSE)
RC2 <- gnm(Freq ~ mental+ses + instances(Mult(mental,ses),2), data=Mental, 
             family=poisson, verbose=FALSE)
anova(indep, RC1, RC2, test="Chisq")