title: "Enriching glm objects"
author: "Ioannis Kosmidis"
date: "r Sys.Date()"
output: rmarkdown::html_vignette
bibliography: enrichwith.bib
vignette: >
%\VignetteIndexEntry{enriching glm objects}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
The enrichwith R
package provides the enrich method to enrich list-like R objects
with new, relevant components. The resulting objects preserve their
class, so all methods associated with them still apply.
This vignette is a demo of the available enrichment options for glm
objects.
The following data set is provided in @mccullagh:89 [Section 8.4] and
consists of observations on $n = 18$ mean clotting times of blood in
seconds (time) for each combination of nine percentage
concentrations of normal plasma (conc) and two lots of clotting
agent (lot).
clotting <- data.frame(conc = c(5,10,15,20,30,40,60,80,100, 5,10,15,20,30,40,60,80,100), time = c(118, 58, 42, 35, 27, 25, 21, 19, 18, 69, 35, 26, 21, 18, 16, 13, 12, 12), lot = factor(c(rep(1, 9), rep(2, 9))))
@mccullagh:89 [Section 8.4] fitted a series of nested generalized
linear models assuming that the times are realizations of independent
Gamma random
variables whose mean varies appropriately with concentration and
lot. In particular, @mccullagh:89 linked the inverse mean of the gamma
random variables to the linear predictors ~ 1, ~ log(conc), ~
log(conc) + lot, ~ log(conc) * lot and carried out an analysis of
deviance to conclude that the ~ log(u) * lot provides the best
explanation of clotting times. The really close fit of that model to
the data can be seen in the figure below.
library("ggplot2") clottingML <- glm(time ~ log(conc) * lot, family = Gamma, data = clotting) alpha <- 0.01 pr_out <- predict(clottingML, type = "response", se.fit = TRUE) new_data <- clotting new_data$time <- pr_out$fit new_data$type <- "fitted" clotting$type <- "observed" all_data <- rbind(clotting, new_data) new_data <- within(new_data, { low <- pr_out$fit - qnorm(1 - alpha/2) * pr_out$se.fit upp <- pr_out$fit + qnorm(1 - alpha/2) * pr_out$se.fit }) |> merge(all_data, all = TRUE) ggplot(all_data) + geom_point(aes(conc, time, col = type), alpha = 0.8) + geom_segment(data = new_data, aes(x = conc, y = low, xend = conc, yend = upp, col = type)) + facet_grid(. ~ lot) + theme_bw() + theme(legend.position = "top")
The score function is the gradient of the log-likelihood and is a key object for likelihood inference.
The enrichwith R package provides methods for the enrichment of
glm objects with the corresponding score function. This can either
be done by enriching the glm object with auxiliary_functions and
then extracting the score function
library("enrichwith") enriched_clottingML <- enrich(clottingML, with = "auxiliary functions") scores_clottingML <- enriched_clottingML$auxiliary_functions$score
or directly using the get_score_function convenience method
scores_clottingML <- get_score_function(clottingML)
By definition, the score function has to have zero components when evaluated at the maximum likelihood estimates (only numerically zero here).
scores_clottingML()
Another key quantity in likelihood inference is the expected information.
The auxiliary_functions enrichment option of the enrich method
enriches a glm object with a function for the evaluation of the
expected information.
info_clottingML <- enriched_clottingML$auxiliary_functions$information
and can also be computed directly using the get_information_function method
info_clottingML <- get_information_function(clottingML)
One of the uses of the expected information is the calculation of
standard errors for the model parameters. The stats::summary.glm
function already does that for glm objects, estimating the
dispersion parameter (if any) using Pearson residuals.
summary_clottingML <- summary(clottingML)
Duly, info_clottingML returns (numerically) the same standard errors
as the summary method does.
summary_std_errors <- coef(summary_clottingML)[, "Std. Error"] einfo <- info_clottingML(dispersion = summary_clottingML$dispersion) all.equal(sqrt(diag(solve(einfo)))[1:4], summary_std_errors, tolerance = 1e-05)
Another estimate of the standard errors results by the observed information, which is the negative Hessian matrix of the log-likelihood.
At least at the time of writing the current vignette, there appears
to be no general implementation of the observed information for glm
objects. I guess the reason for that is the dependence of the observed
information on higher-order derivatives of the inverse link and
variance functions, which are not readily available in base R.
enrichwith provides options for the enrichment of link-glm and
family objects with such derivatives (see ?enrich.link-glm and
?enrich.family for details), and, based on those allows the
enrichment of glm objects with a function to compute the observed
information.
The observed and the expected information for the regression
parameters coincide for GLMs with canonical link, like clottingML
oinfo <- info_clottingML(dispersion = summary_clottingML$dispersion, type = "observed") all.equal(oinfo[1:4, 1:4], einfo[1:4, 1:4])
which is not generally true for a glm with a non-canonical link, as
seen below.
clottingML_log <- update(clottingML, family = Gamma("log")) summary_clottingML_log <- summary(clottingML_log) info_clottingML_log <- get_information_function(clottingML_log) einfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "expected") oinfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "observed") round(einfo_log, 3) round(oinfo_log, 3)
We can now use scores_clottingML and info_clottingML to carry out
a score test to compare
nested models.
If $l_\psi(\psi, \lambda)$ is the gradient of the log-likelihood with respect to $\psi$ evaluated at $\psi$ and $\lambda$, $i^{\psi\psi}(\psi, \lambda)$ is the $(\psi, \psi)$ block of the inverse of the expected information matrix, and $\hat\lambda_\psi$ is the maximum likelihood estimator of $\lambda$ for fixed $\psi$, then, assuming that the model is adequate [ l_\psi(\psi, \hat\lambda_\psi)^\top i^{\psi\psi}(\psi,\hat\lambda_\psi) l_\psi(\psi, \hat\lambda_\psi) ] has an asymptotic $\chi^2_{dim(\psi)}$ distribution.
The following code chunk computes the maximum likelihood estimates
when the parameters for lot and log(conc):lot are fixed to zero
($\hat\lambda\psi$ for $\psi = 0$ above), along with the score and
expected information matrix evaluated at them
($l(\psi,\hat\lambda_\psi) and $i(\psi,\hat\lambda_\psi)$, above).
clottingML_nested <- update(clottingML, . ~ log(conc)) enriched_clottingML_nested <- enrich(clottingML_nested, with = "mle of dispersion") coef_full <- coef(clottingML) coef_hypothesis <- structure(rep(0, length(coef_full)), names = names(coef_full)) coef_hypothesis_short <- coef(enriched_clottingML_nested, model = "mean") coef_hypothesis[names(coef_hypothesis_short)] <- coef_hypothesis_short disp_hypothesis <- coef(enriched_clottingML_nested, model = "dispersion") scores <- scores_clottingML(coef_hypothesis, disp_hypothesis) info <- info_clottingML(coef_hypothesis, disp_hypothesis)
The object enriched_clottingML_nested inherits from enriched_glm,
glm and lm, and, as illustrated above, enrichwith provides a
corresponding coef method to extract the estimates for the mean
regression parameters and/or the estimate for the dispersion
parameter.
The score statistic is then
(score_statistic <- drop(scores%*%solve(info)%*%scores))
which gives a p-value of
pchisq(score_statistic, 2, lower.tail = FALSE)
For comparison, the Wald statistic for the same hypothesis is
coef_full[3:4]%*%solve(solve(info)[3:4, 3:4])%*%coef_full[3:4]
and the log-likelihood ratio statistic is
(deviance(clottingML_nested) - deviance(clottingML))/disp_hypothesis
which is close to the score statistic
glm objects at parameter valuesenrichwith also provides the get_simulate_function method for
glm or lm objects. The get_simulate_function computes a function
to simulate response vectors at /arbitrary/ values of the model
parameters, which can be useful when setting up simulation experiments
and for various inferential procedures (e.g. indirect inference).
For example, the following code chunk simulates three response vectors
at the maximum likelihood estimates of the parameters for clottingML.
simulate_clottingML <- get_simulate_function(clottingML) simulate_clottingML(nsim = 3, seed = 123)
The result is the same to what the simulate method returns
simulate(clottingML, nsim = 3, seed = 123)
but simulate_clottingML can also be used to simulate at any given
parameter value
coefficients <- c(0, 0.01, 0, 0.01) dispersion <- 0.001 samples <- simulate_clottingML(coefficients = coefficients, dispersion = dispersion, nsim = 500000, seed = 123)
The empirical means and variances based on samples agree with the exact means and variances at coefficients and dispersion
means <- 1/(model.matrix(clottingML) %*% coefficients) variances <- dispersion * means^2 max(abs(rowMeans(samples) - means)) max(abs(apply(samples, 1, var) - variances))
enrichwith also provides the get_dmodel_function, get_pmodel_function and get_qmodel_function methods, which can be used to evaluate densities or probability mass functions, distribution functions, and quantile functions, respectively at arbitrary parameter values and data settings.
For example, the following code chunk computes the density at the observations in the clotting data set under then maximum likelihood fit
cML_dmodel <- get_dmodel_function(clottingML) cML_dmodel()
We can also compute the densities at user-supplied parameter values
cML_dmodel(coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
or even at different data points
new_data <- data.frame(conc = 5:10, time = 50:45, lot = factor(c(1, 1, 1, 2, 2, 2))) cML_dmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
We can also compute cumulative probabilities and quantile functions. So
cML_qmodel <- get_qmodel_function(clottingML) cML_pmodel <- get_pmodel_function(clottingML) probs <- cML_pmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1) cML_qmodel(probs, data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
returns new_data$time.
For example the fitted densities when (conc, lot) is c(15, 1), c(15, 2), c(40, 1) or c(40, 2) are
new_data <- expand.grid(conc = c(15, 40), lot = factor(1:2), time = seq(0, 50, length = 500)) new_data$density <- cML_dmodel(new_data) ggplot(data = new_data) + geom_line(aes(time, density)) + facet_grid(conc ~ lot) + theme_bw() + geom_vline(data = clotting[c(3, 6, 12, 15), ], aes(xintercept = time), lty = 2)
The dashed vertical lines are the observed values of time for the (conc, lot) combinations in the figure.
enriched_glmenrichwith provides a wrapper to the glm interface that results directly in enriched_glm objects that carry all the components described above. For example,
enriched_clottingML <- enriched_glm(time ~ log(conc) * lot, family = Gamma, data = clotting) names(enriched_clottingML$auxiliary_functions) enriched_clottingML$score_mle enriched_clottingML$expected_information_mle enriched_clottingML$observed_information_mle enriched_clottingML$bias_mle enriched_clottingML$dispersion_mle
The vignette
bias
of enrichwith describes how to use the auxiliary_functions to
reduce the bias of the maximum likelihood estimator for generalized
linear models.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.