title: "Enriching `glm`

objects"
author: "Ioannis Kosmidis"
date: "`r Sys.Date()`

"
output: rmarkdown::html_vignette
bibliography: enrichwith.bib
vignette: >
%\VignetteIndexEntry{enriching glm objects}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}

The **enrichwith** R
package provides the `enrich`

method to enrich list-like R objects
with new, relevant components. The resulting objects preserve their
class, so all methods associated with them still apply.

This vignette is a demo of the available enrichment options for `glm`

objects.

The following data set is provided in @mccullagh:89 [Section 8.4] and
consists of observations on $n = 18$ mean clotting times of blood in
seconds (`time`

) for each combination of nine percentage
concentrations of normal plasma (`conc`

) and two lots of clotting
agent (`lot`

).

clotting <- data.frame(conc = c(5,10,15,20,30,40,60,80,100, 5,10,15,20,30,40,60,80,100), time = c(118, 58, 42, 35, 27, 25, 21, 19, 18, 69, 35, 26, 21, 18, 16, 13, 12, 12), lot = factor(c(rep(1, 9), rep(2, 9))))

@mccullagh:89 [Section 8.4] fitted a series of nested generalized
linear models assuming that the times are realisations of independent
Gamma random
variables whose mean varies appropriately with concentration and
lot. In particular, @mccullagh:89 linked the inverse mean of the gamma
random variables to the linear predictors `~ 1`

, `~ log(conc)`

, ```
~
log(conc) + lot
```

, `~ log(conc) * lot`

and carried out an analysis of
deviance to conclude that the `~ log(u) * lot`

provides the best
explanation of clotting times. The really close fit of that model to
the data can be seen in the figure below.

library("ggplot2") clottingML <- glm(time ~ log(conc) * lot, family = Gamma, data = clotting) alpha <- 0.01 pr_out <- predict(clottingML, type = "response", se.fit = TRUE) new_data <- clotting new_data$time <- pr_out$fit new_data$type <- "fitted" clotting$type <- "observed" all_data <- rbind(clotting, new_data) new_data <- within(new_data, { low <- pr_out$fit - qnorm(1 - alpha/2) * pr_out$se.fit upp <- pr_out$fit + qnorm(1 - alpha/2) * pr_out$se.fit }) ggplot(all_data) + geom_point(aes(conc, time, col = type), alpha = 0.8) + geom_segment(data = new_data, aes(x = conc, y = low, xend = conc, yend = upp, col = type)) + facet_grid(. ~ lot) + theme_bw() + theme(legend.position = "top")

The score function is the gradient of the log-likelihood and is a key object for likelihood inference.

The **enrichwith** R package provides methods for the enrichment of
`glm`

objects with the corresponding score function. This can either
be done by enriching the `glm`

object with `auxiliary_functions`

and
then extracting the score function

library("enrichwith") enriched_clottingML <- enrich(clottingML, with = "auxiliary functions") scores_clottingML <- enriched_clottingML$auxiliary_functions$score

or directly using the `get_score_function`

convenience method

```
scores_clottingML <- get_score_function(clottingML)
```

By definition, the score function has to have zero components when evaluated at the maximum likelihood estimates (only numerically zero here).

scores_clottingML()

Another key quantity in likelihood inference is the expected information.

The `auxiliary_functions`

enrichment option of the `enrich`

method
enriches a `glm`

object with a function for the evaluation of the
expected information.

info_clottingML <- enriched_clottingML$auxiliary_functions$information

and can also be computed directly using the `get_information_function`

method

```
info_clottingML <- get_information_function(clottingML)
```

One of the uses of the expected information is the calculation of
standard errors for the model parameters. The `stats::summary.glm`

function already does that for `glm`

objects, estimating the
dispersion parameter (if any) using Pearson residuals.

summary_clottingML <- summary(clottingML)

Duly, `info_clottingML`

returns (numerically) the same standard errors
as the `summary`

method does.

summary_std_errors <- coef(summary_clottingML)[, "Std. Error"] einfo <- info_clottingML(dispersion = summary_clottingML$dispersion) all.equal(sqrt(diag(solve(einfo)))[1:4], summary_std_errors, tolerance = 1e-05)

Another estimate of the standard errors results by the observed information, which is the negative Hessian matrix of the log-likelihood.

At least at the time of writting the current vignette, there appears
to be no general implementation of the observed information for `glm`

objects. I guess the reason for that is the dependence of the observed
information on higher-order derivatives of the inverse link and
variance functions, which are not readily available in base R.

**enrichwith** provides options for the enrichment of `link-glm`

and
`family`

objects with such derivatives (see `?enrich.link-glm`

and
`?enrich.family`

for details), and, based on those allows the
enrichment of `glm`

objects with a function to compute the observed
information.

The observed and the expected information for the regression
parameters coincide for GLMs with canonical link, like `clottingML`

oinfo <- info_clottingML(dispersion = summary_clottingML$dispersion, type = "observed") all.equal(oinfo[1:4, 1:4], einfo[1:4, 1:4])

which is not generally true for a `glm`

with a non-canonical link, as
seen below.

clottingML_log <- update(clottingML, family = Gamma("log")) summary_clottingML_log <- summary(clottingML_log) info_clottingML_log <- get_information_function(clottingML_log) einfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "expected") oinfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "observed") round(einfo_log, 3) round(oinfo_log, 3)

We can now use `scores_clottingML`

and `info_clottingML`

to carry out
a score test to compare
nested models.

If $l_\psi(\psi, \lambda)$ is the gradient of the log-likelihood with respect to $\psi$ evaluated at $\psi$ and $\lambda$, $i^{\psi\psi}(\psi, \lambda)$ is the $(\psi, \psi)$ block of the inverse of the expected information matrix, and $\hat\lambda_\psi$ is the maximum likelihood estimator of $\lambda$ for fixed $\psi$, then, assuming that the model is adequate [ l_\psi(\psi, \hat\lambda_\psi)^\top i^{\psi\psi}(\psi,\hat\lambda_\psi) l_\psi(\psi, \hat\lambda_\psi) ] has an asymptotic $\chi^2_{dim(\psi)}$ distribution.

The following code chunk computes the maximum likelihood estimates
when the parameters for `lot`

and `log(conc):lot`

are fixed to zero
($\hat\lambda\psi$ for $\psi = 0$ above), along with the score and
expected information matrix evaluated at them
($l(\psi,\hat\lambda_\psi) and $i(\psi,\hat\lambda_\psi)$, above).

clottingML_nested <- update(clottingML, . ~ log(conc)) enriched_clottingML_nested <- enrich(clottingML_nested, with = "mle of dispersion") coef_full <- coef(clottingML) coef_hypothesis <- structure(rep(0, length(coef_full)), names = names(coef_full)) coef_hypothesis_short <- coef(enriched_clottingML_nested, model = "mean") coef_hypothesis[names(coef_hypothesis_short)] <- coef_hypothesis_short disp_hypothesis <- coef(enriched_clottingML_nested, model = "dispersion") scores <- scores_clottingML(coef_hypothesis, disp_hypothesis) info <- info_clottingML(coef_hypothesis, disp_hypothesis)

The object `enriched_clottingML_nested`

inherits from `enriched_glm`

,
`glm`

and `lm`

, and, as illustrated above, **enrichwith** provides a
corresponding `coef`

method to extract the estimates for the mean
regression parameters and/or the estimate for the dispersion
parameter.

The score statiistic is then

(score_statistic <- drop(scores%*%solve(info)%*%scores))

which gives a p-value of

pchisq(score_statistic, 2, lower.tail = FALSE)

For comparison, the Wald statistic for the same hypothesis is

coef_full[3:4]%*%solve(solve(info)[3:4, 3:4])%*%coef_full[3:4]

and the log-likelihood ratio statistic is

(deviance(clottingML_nested) - deviance(clottingML))/disp_hypothesis

which is close to the score statistic

`glm`

objects at parameter values**enrichwith** also provides the `get_simulate_function`

method for
`glm`

or `lm`

objects. The `get_simulate_function`

computes a function
to simulate response vectors at /arbitrary/ values of the model
parameters, which can be useful when setting up simulation experiments
and for various inferential procedures (e.g. indirect inference).

For example, the following code chunk simulates three response vectors
at the maximum likelihood estimates of the parameters for `clottingML`

.

simulate_clottingML <- get_simulate_function(clottingML) simulate_clottingML(nsim = 3, seed = 123)

The result is the same to what the `simulate`

method returns

simulate(clottingML, nsim = 3, seed = 123)

but `simulate_clottingML`

can also be used to simulate at any given
parameter value

coefficients <- c(0, 0.01, 0, 0.01) dispersion <- 0.001 samples <- simulate_clottingML(coefficients = coefficients, dispersion = dispersion, nsim = 500000, seed = 123)

The empirical means and variances based on `samples`

agree with the exact means and variances at `coefficients`

and `dispersion`

means <- 1/(model.matrix(clottingML) %*% coefficients) variances <- dispersion * means^2 max(abs(rowMeans(samples) - means)) max(abs(apply(samples, 1, var) - variances))

**enrichwith** also provides the `get_dmodel_function`

, `get_pmodel_function`

and `get_qmodel_function`

methods, which can be used to evaluate densities or probability mass functions, distribution functions, and quantile functions, respectively at arbitrary parameter values and data settings.

For example, the following code chunk computes the density at the observations in the `clotting`

data set under then maximum likelihood fit

```
cML_dmodel <- get_dmodel_function(clottingML)
cML_dmodel()
```

We can also compute the densities at user-supplied parameter values

cML_dmodel(coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)

or even at different data points

new_data <- data.frame(conc = 5:10, time = 50:45, lot = factor(c(1, 1, 1, 2, 2, 2))) cML_dmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)

We can also compute cumulative probabilities and quantile functions. So

cML_qmodel <- get_qmodel_function(clottingML) cML_pmodel <- get_pmodel_function(clottingML) probs <- cML_pmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1) cML_qmodel(probs, data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)

returns `new_data$time`

.

For example the fitted densities when `(conc, lot)`

is `c(15, 1)`

, `c(15, 2)`

, `c(40, 1)`

or `c(40, 2)`

are

new_data <- expand.grid(conc = c(15, 40), lot = factor(1:2), time = seq(0, 50, length = 100)) new_data$density <- cML_dmodel(new_data) ggplot(data = new_data) + geom_line(aes(time, density)) + facet_grid(conc ~ lot) + theme_bw() + geom_vline(data = clotting[c(3, 6, 12, 15), ], aes(xintercept = time), lty = 2)

The dashed vertical lines are the observed values of time for the `(conc, lot)`

combinations in the figure.

`enriched_glm`

**enrichwith** provides a wrapper to the `glm`

interface that results directly in `enriched_glm`

objects that carry all the components described above. For example,

enriched_clottingML <- enriched_glm(time ~ log(conc) * lot, family = Gamma, data = clotting) names(enriched_clottingML$auxiliary_functions) enriched_clottingML$score_mle enriched_clottingML$expected_information_mle enriched_clottingML$observed_information_mle enriched_clottingML$bias_mle enriched_clottingML$dispersion_mle

The vignette
bias
of **enrichwith** describes how to use the `auxiliary_functions`

to
reduce the bias of the maximum likelihood estimator for generalized
linear models.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.