title: "Enriching glm
objects"
author: "Ioannis Kosmidis"
date: "r Sys.Date()
"
output: rmarkdown::html_vignette
bibliography: enrichwith.bib
vignette: >
%\VignetteIndexEntry{enriching glm objects}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
The enrichwith R
package provides the enrich
method to enrich list-like R objects
with new, relevant components. The resulting objects preserve their
class, so all methods associated with them still apply.
This vignette is a demo of the available enrichment options for glm
objects.
The following data set is provided in @mccullagh:89 [Section 8.4] and
consists of observations on $n = 18$ mean clotting times of blood in
seconds (time
) for each combination of nine percentage
concentrations of normal plasma (conc
) and two lots of clotting
agent (lot
).
clotting <- data.frame(conc = c(5,10,15,20,30,40,60,80,100, 5,10,15,20,30,40,60,80,100), time = c(118, 58, 42, 35, 27, 25, 21, 19, 18, 69, 35, 26, 21, 18, 16, 13, 12, 12), lot = factor(c(rep(1, 9), rep(2, 9))))
@mccullagh:89 [Section 8.4] fitted a series of nested generalized
linear models assuming that the times are realisations of independent
Gamma random
variables whose mean varies appropriately with concentration and
lot. In particular, @mccullagh:89 linked the inverse mean of the gamma
random variables to the linear predictors ~ 1
, ~ log(conc)
, ~
log(conc) + lot
, ~ log(conc) * lot
and carried out an analysis of
deviance to conclude that the ~ log(u) * lot
provides the best
explanation of clotting times. The really close fit of that model to
the data can be seen in the figure below.
library("ggplot2") clottingML <- glm(time ~ log(conc) * lot, family = Gamma, data = clotting) alpha <- 0.01 pr_out <- predict(clottingML, type = "response", se.fit = TRUE) new_data <- clotting new_data$time <- pr_out$fit new_data$type <- "fitted" clotting$type <- "observed" all_data <- rbind(clotting, new_data) new_data <- within(new_data, { low <- pr_out$fit - qnorm(1 - alpha/2) * pr_out$se.fit upp <- pr_out$fit + qnorm(1 - alpha/2) * pr_out$se.fit }) ggplot(all_data) + geom_point(aes(conc, time, col = type), alpha = 0.8) + geom_segment(data = new_data, aes(x = conc, y = low, xend = conc, yend = upp, col = type)) + facet_grid(. ~ lot) + theme_bw() + theme(legend.position = "top")
The score function is the gradient of the log-likelihood and is a key object for likelihood inference.
The enrichwith R package provides methods for the enrichment of
glm
objects with the corresponding score function. This can either
be done by enriching the glm
object with auxiliary_functions
and
then extracting the score function
library("enrichwith") enriched_clottingML <- enrich(clottingML, with = "auxiliary functions") scores_clottingML <- enriched_clottingML$auxiliary_functions$score
or directly using the get_score_function
convenience method
scores_clottingML <- get_score_function(clottingML)
By definition, the score function has to have zero components when evaluated at the maximum likelihood estimates (only numerically zero here).
scores_clottingML()
Another key quantity in likelihood inference is the expected information.
The auxiliary_functions
enrichment option of the enrich
method
enriches a glm
object with a function for the evaluation of the
expected information.
info_clottingML <- enriched_clottingML$auxiliary_functions$information
and can also be computed directly using the get_information_function
method
info_clottingML <- get_information_function(clottingML)
One of the uses of the expected information is the calculation of
standard errors for the model parameters. The stats::summary.glm
function already does that for glm
objects, estimating the
dispersion parameter (if any) using Pearson residuals.
summary_clottingML <- summary(clottingML)
Duly, info_clottingML
returns (numerically) the same standard errors
as the summary
method does.
summary_std_errors <- coef(summary_clottingML)[, "Std. Error"] einfo <- info_clottingML(dispersion = summary_clottingML$dispersion) all.equal(sqrt(diag(solve(einfo)))[1:4], summary_std_errors, tolerance = 1e-05)
Another estimate of the standard errors results by the observed information, which is the negative Hessian matrix of the log-likelihood.
At least at the time of writing the current vignette, there appears
to be no general implementation of the observed information for glm
objects. I guess the reason for that is the dependence of the observed
information on higher-order derivatives of the inverse link and
variance functions, which are not readily available in base R.
enrichwith provides options for the enrichment of link-glm
and
family
objects with such derivatives (see ?enrich.link-glm
and
?enrich.family
for details), and, based on those allows the
enrichment of glm
objects with a function to compute the observed
information.
The observed and the expected information for the regression
parameters coincide for GLMs with canonical link, like clottingML
oinfo <- info_clottingML(dispersion = summary_clottingML$dispersion, type = "observed") all.equal(oinfo[1:4, 1:4], einfo[1:4, 1:4])
which is not generally true for a glm
with a non-canonical link, as
seen below.
clottingML_log <- update(clottingML, family = Gamma("log")) summary_clottingML_log <- summary(clottingML_log) info_clottingML_log <- get_information_function(clottingML_log) einfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "expected") oinfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "observed") round(einfo_log, 3) round(oinfo_log, 3)
We can now use scores_clottingML
and info_clottingML
to carry out
a score test to compare
nested models.
If $l_\psi(\psi, \lambda)$ is the gradient of the log-likelihood with respect to $\psi$ evaluated at $\psi$ and $\lambda$, $i^{\psi\psi}(\psi, \lambda)$ is the $(\psi, \psi)$ block of the inverse of the expected information matrix, and $\hat\lambda_\psi$ is the maximum likelihood estimator of $\lambda$ for fixed $\psi$, then, assuming that the model is adequate [ l_\psi(\psi, \hat\lambda_\psi)^\top i^{\psi\psi}(\psi,\hat\lambda_\psi) l_\psi(\psi, \hat\lambda_\psi) ] has an asymptotic $\chi^2_{dim(\psi)}$ distribution.
The following code chunk computes the maximum likelihood estimates
when the parameters for lot
and log(conc):lot
are fixed to zero
($\hat\lambda\psi$ for $\psi = 0$ above), along with the score and
expected information matrix evaluated at them
($l(\psi,\hat\lambda_\psi) and $i(\psi,\hat\lambda_\psi)$, above).
clottingML_nested <- update(clottingML, . ~ log(conc)) enriched_clottingML_nested <- enrich(clottingML_nested, with = "mle of dispersion") coef_full <- coef(clottingML) coef_hypothesis <- structure(rep(0, length(coef_full)), names = names(coef_full)) coef_hypothesis_short <- coef(enriched_clottingML_nested, model = "mean") coef_hypothesis[names(coef_hypothesis_short)] <- coef_hypothesis_short disp_hypothesis <- coef(enriched_clottingML_nested, model = "dispersion") scores <- scores_clottingML(coef_hypothesis, disp_hypothesis) info <- info_clottingML(coef_hypothesis, disp_hypothesis)
The object enriched_clottingML_nested
inherits from enriched_glm
,
glm
and lm
, and, as illustrated above, enrichwith provides a
corresponding coef
method to extract the estimates for the mean
regression parameters and/or the estimate for the dispersion
parameter.
The score statistic is then
(score_statistic <- drop(scores%*%solve(info)%*%scores))
which gives a p-value of
pchisq(score_statistic, 2, lower.tail = FALSE)
For comparison, the Wald statistic for the same hypothesis is
coef_full[3:4]%*%solve(solve(info)[3:4, 3:4])%*%coef_full[3:4]
and the log-likelihood ratio statistic is
(deviance(clottingML_nested) - deviance(clottingML))/disp_hypothesis
which is close to the score statistic
glm
objects at parameter valuesenrichwith also provides the get_simulate_function
method for
glm
or lm
objects. The get_simulate_function
computes a function
to simulate response vectors at /arbitrary/ values of the model
parameters, which can be useful when setting up simulation experiments
and for various inferential procedures (e.g. indirect inference).
For example, the following code chunk simulates three response vectors
at the maximum likelihood estimates of the parameters for clottingML
.
simulate_clottingML <- get_simulate_function(clottingML) simulate_clottingML(nsim = 3, seed = 123)
The result is the same to what the simulate
method returns
simulate(clottingML, nsim = 3, seed = 123)
but simulate_clottingML
can also be used to simulate at any given
parameter value
coefficients <- c(0, 0.01, 0, 0.01) dispersion <- 0.001 samples <- simulate_clottingML(coefficients = coefficients, dispersion = dispersion, nsim = 500000, seed = 123)
The empirical means and variances based on samples
agree with the exact means and variances at coefficients
and dispersion
means <- 1/(model.matrix(clottingML) %*% coefficients) variances <- dispersion * means^2 max(abs(rowMeans(samples) - means)) max(abs(apply(samples, 1, var) - variances))
enrichwith also provides the get_dmodel_function
, get_pmodel_function
and get_qmodel_function
methods, which can be used to evaluate densities or probability mass functions, distribution functions, and quantile functions, respectively at arbitrary parameter values and data settings.
For example, the following code chunk computes the density at the observations in the clotting
data set under then maximum likelihood fit
cML_dmodel <- get_dmodel_function(clottingML) cML_dmodel()
We can also compute the densities at user-supplied parameter values
cML_dmodel(coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
or even at different data points
new_data <- data.frame(conc = 5:10, time = 50:45, lot = factor(c(1, 1, 1, 2, 2, 2))) cML_dmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
We can also compute cumulative probabilities and quantile functions. So
cML_qmodel <- get_qmodel_function(clottingML) cML_pmodel <- get_pmodel_function(clottingML) probs <- cML_pmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1) cML_qmodel(probs, data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
returns new_data$time
.
For example the fitted densities when (conc, lot)
is c(15, 1)
, c(15, 2)
, c(40, 1)
or c(40, 2)
are
new_data <- expand.grid(conc = c(15, 40), lot = factor(1:2), time = seq(0, 50, length = 100)) new_data$density <- cML_dmodel(new_data) ggplot(data = new_data) + geom_line(aes(time, density)) + facet_grid(conc ~ lot) + theme_bw() + geom_vline(data = clotting[c(3, 6, 12, 15), ], aes(xintercept = time), lty = 2)
The dashed vertical lines are the observed values of time for the (conc, lot)
combinations in the figure.
enriched_glm
enrichwith provides a wrapper to the glm
interface that results directly in enriched_glm
objects that carry all the components described above. For example,
enriched_clottingML <- enriched_glm(time ~ log(conc) * lot, family = Gamma, data = clotting) names(enriched_clottingML$auxiliary_functions) enriched_clottingML$score_mle enriched_clottingML$expected_information_mle enriched_clottingML$observed_information_mle enriched_clottingML$bias_mle enriched_clottingML$dispersion_mle
The vignette
bias
of enrichwith describes how to use the auxiliary_functions
to
reduce the bias of the maximum likelihood estimator for generalized
linear models.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.