knitr::opts_chunk$set( collapse = TRUE, message = FALSE, warning = FALSE, fig.height = 6, fig.width = 7, fig.path = "fig/demo-housing-", dev = "png", comment = "##" ) # save some typing knitr::set_alias(w = "fig.width", h = "fig.height", cap = "fig.cap") # colorize text colorize <- function(x, color) { if (knitr::is_latex_output()) { sprintf("\\textcolor{%s}{%s}", color, x) } else if (knitr::is_html_output()) { sprintf("<span style='color: %s;'>%s</span>", color, x) } else x }
This vignette was one of a series of demo()
files in the package. It is still there
as demo("housing")
, but is now presented here with additional commentary and analysis,
designed to highlight some aspects of analysis of categorical data and graphical display.
I'll use the following packages in this vignette.
library(vcdExtra) library(MASS) library(effects)
The content here is the dataset MASS::housing
, giving a 4-way,
$3 \times 3 \times 4 \times 2$
frequency table of 1681 individuals from the Copenhagen Housing Conditions
Survey, classified by their:
Sat
) with their housing circumstances (low, medium or high),Type
of rental dwelling (Tower, Apartment, Atrium or Terrace) Infl
) on management of the property (low, medium, high), and Cont
) with other residents (low or high)Load the data:
data(housing, package="MASS") str(housing)
Satisfaction (Sat
) of these householders with their present
housing circumstances is the outcome variable here.
For purposes of analysis,
note that Sat
is an ordered factor with levels "Low" < "Medium" < "High"
.
Note also that Influence, with the same levels is just a "Factor", not an ordered one.
I consider here just models using glm(..., family=poisson)
or
the equivalent in MASS::loglm()
.
The ordering of factor levels is important in graphical displays. We
don't want to see them ordered alphabetically, "High", "Low", "Medium".
The housing
data.frame was constructed so that the levels of
Sat
and Infl
appear in the dataset in their appropriate order.
levels(housing$Sat) levels(housing$Infl)
Other models, e.g., the proportional odds model, fit using
MASS:polr()
can take the ordinal nature of satisfaction into
account. In glm()
one could re-assign Infl
as an ordered
factor and examine linear vs. non-linear associations for this factor.
But I don't do this here.
The most ignorant model asserts that all the table factors are mutually independent.
In symbolic notation, this is [S] [I] [T] [C]
where all terms in separate [ ]
are
supposed to be independent. This is Freq ~ Sat + Infl + Type + Cont
as a formula for
glm()
.
house.null <- glm(Freq ~ Sat + Infl + Type + Cont, family = poisson, data = housing)
When Sat
is the outcome variable, a minimal baseline model should allow
for all associations among the predictors, symbolized as [S] [I T C]
.
That is,
Influence, Type and Contact may be associated in arbitrary ways, just
as multiple predictors can be correlated in regression models.
In this framework, what remains to be explained is whether/how Sat
depends on the combinations of the other variables. The baseline
model therefore includes the full three-way term for the predictors.
house.glm0 <- glm(Freq ~ Sat + Infl*Type*Cont, family = poisson, data = housing)
Both of these models fit terribly, but we can always use anova(mod1, mod2,...)
to
compare the relative fits of nested models.
anova(house.null, house.glm0, test = "Chisq")
The baseline model is shown in the mosaic plot below. Note that
this is applied not to the housing
data, but rather to the
house.glm0
object (of class glm
)
resulting to a call to vcdExtra::mosaic.glm()
.
With four variables in the mosaic, labeling of the variable names
and factor levels is a bit tricky, because labels must appear on all four
sides of the plot.
The labeling_args
argument can be used to set more informative variable
names and abbreviate factor levels where necessary.
#| label= mosaic-glm0a, #| warning = TRUE # labeling_args for mosaic() largs <- list(set_varnames = c( Infl="Influence on management", Cont="Contact among residents", Type="Type of dwelling", Sat="Satisfaction"), abbreviate=c(Type=3)) mosaic(house.glm0, labeling_args=largs, main='Baseline model: [ITC][Sat]')
In this plot we can see largish
r colorize("positive residuals", "blue")
in the blocks corresponding
to (low satisfaction, low influence) and (high satisfaction, high influence)
and clusters of largish
r colorize("negative residuals", "red")
in the opposite corners.
By default, variables are used in the mosaic display in their order in the data
table or frequency data.frame. The r colorize("warning", "red")
reminds us that the
order of conditioning used is ~Sat + Infl + Type + Cont
.
For mosaic.glm()
, the conditioning order of variables in the mosaic can be set using
the formula
argument. Here, I rearrange the variables to put Sat
as the last variable in the splitting / conditioning sequence.
I also use vcdExtra::modFit()
to add the LR $G^2$ fit statistic to
the plot title.
mosaic(house.glm0, formula = ~ Type + Infl + Cont + Sat, labeling_args=largs, main=paste('Baseline model: [ITC][Sat],', modFit(house.glm0)) )
Clearly, satisfaction depends on one or more of the predictors, Infl
, Type
and Cont
and possibly their interactions.
As a first step it is useful to consider sequentially adding the association terms
Infl:Sat
, Type:Sat
, Cont:Sat
one at a time.
This analysis is carried out using MASS::addterm()
.
MASS::addterm(house.glm0, ~ . + Sat:(Infl + Type + Cont), test = "Chisq")
Based on this, it is useful to consider a "main-effects" model for satisfaction, adding all three two-way terms involving satisfaction.
The update()
method provides an easy way to add (or subtract) terms from a fitted
model object. In the model formula, .
stands for whatever was on the left
side (Freq
) or on the right side (Sat + Infl*Type*Cont
) of the model
(house.glm0
) that is being updated.
house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl + Type + Cont))
For comparison, we note that the same model can be fit using the
iterative proportional scaling algorithm of MASS::loglm()
.
(house.loglm1 <- MASS::loglm(Freq ~ Infl * Type * Cont + Sat*(Infl + Type + Cont), data = housing))
As before, anova()
tests the added contribution of
each more complex model over the one before.
The residual deviance $G^2$ has been reduced from
$G^2 (46) = 217.46$ for the baseline model house.glm0
to
$G^2 (34) = 38.66$ for the revised model house.glm1
.
The difference, $G^2(M1 | M0) = G^2 (12) = 178.79$ tests the collective additional fit
provided by the two-way association of satisfaction with the predictors.
anova(house.glm0, house.glm1, test="Chisq")
glm1
The model house.glm1
fits reasonably well, r modFit(house.glm1)
, so most residuals
are small. In the mosaic below, I use gp=shading_Friendly
to shade the tiles
so that positive and negative residuals are distinguished by color,
and they are filled when the absolute value of the residual is outside $\pm 2, 4$.
mosaic(house.glm1, labeling_args=largs, main=paste('Model [IS][TS][CS],', modFit(house.glm1) ), gp=shading_Friendly)
One cell is highlighted here: The combination of medium influence, low contact and tower type, is more likely to give low satisfaction than the model predicts. Is this just an outlier, or is there something that can be interpreted and perhaps improve the model fit? It is hard tell, but the virtues of mosaic displays are that they help to:
When we add terms using MASS::addterm()
, they are added sequentially.
It might be the case that once some term is added, a previously added term
is no longer important. Running MASS::dropterm()
on the housel.glm1
model
checks for this.
MASS::dropterm(house.glm1, test = "Chisq")
Note that the three-way term Infl:Type:Cont
is not significant. However,
with Sat
as the response, the associations of all predictors must be included
in the model.
The model so far says that each of influence, type and control have separate, additive effects on the level of satisfaction, what I called a "main-effects" model. It might be the case that some of the predictors have interaction effects, e.g., that the effect of influence on satisfaction might vary with the type of dwelling or the level of control.
An easy way to test for these is to update the main-effects model, adding
all possible two-way interactions for Sat
, one at a time, with addterm()
.
MASS::addterm(house.glm1, ~. + Sat:(Infl + Type + Cont)^2, test = "Chisq")
The result shows that adding the term Infl:Type:Sat
reduces the deviance $G^2$ from 38.66 to 16.11.
The difference, $G^2(M1 + ITS | M1) = G^2 (12) = 22.55$
reflects a substantial improvement.
The remaining two-way interaction terms reduce the deviance
by smaller and non-significant amounts, relative to house.glm1
.
Model fitting should be guided by substance, not just statistical
machinery. Nonetheless, it seems arguably sensible to add one
two-way term to the model, giving house.glm2
.
house.glm2 <- update(house.glm1, . ~ . + Sat:Infl:Type)
Adding more association terms to a model will always improve it. The question is, whether that is "worth it"?
"Worth it" concerns the trade-off between model fit and parsimony. Sometimes we might prefer a model with fewer parameters to one that has a slightly better fit, but requires more model terms and parameters.
The AIC and BIC statistics are designed to adjust our assessment of model fit by penalizing it for using more parameters. Equivalently, they deduct from the likelihood ratio $G^2$ a term proportional to the residual $\text{df}$ of the model. In any case -- smaller is better for both AIC and BIC.
$$AIC = G^2 - 2 \: \text{df}$$ $$BIC = G^2 - \log(n) \: \text{df}$$
These measures are provided
by AIC()
, BIC()
, and can be used to compare models using
vcdExtra::LRstats()
.
LRstats(house.glm0, house.glm1, house.glm2)
By these metrics, model house.glm1
is best on both AIC and BIC.
The increased goodness-of-fit (smaller $G^2$) of model house.glm2
is not worth the extra cost of parameters in the house.glm2
model.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.