```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(nabla)
```
The `D` operator composes to arbitrary order: `D(f, x, order = k)` returns
the $k$-th derivative tensor. Gradients ($k = 1$) and Hessians ($k = 2$)
are standard MLE tools. Third-order derivatives unlock additional diagnostic
information: in particular, the asymptotic skewness of the MLE.

This vignette demonstrates `D(f, x, order = 3)` on a Gamma MLE problem,
computing the observed information, the third-order derivative tensor, and
the theoretical skewness of the MLE, validated against simulation.
Given $x_1, \ldots, x_n \sim \text{Gamma}(\alpha, \beta)$ with shape $\alpha > 0$ and rate $\beta > 0$, the log-likelihood in terms of sufficient statistics $S_1 = \sum \log x_i$ and $S_2 = \sum x_i$ is:
$$\ell(\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha - 1)S_1 - \beta S_2$$
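Differentiating once gives the score components:

$$\frac{\partial \ell}{\partial \alpha} = n\log\beta - n\,\psi(\alpha) + S_1, \qquad \frac{\partial \ell}{\partial \beta} = \frac{n\alpha}{\beta} - S_2,$$

where $\psi$ is the digamma function; setting both to zero gives the likelihood equations solved numerically below.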
This model is ideal for third-order AD because:

- the log-likelihood involves `lgamma`, exercising nabla's special function support
- its third derivative requires `psigamma(alpha, deriv = 2)` (tetragamma)

```r
# Simulate data
set.seed(42)
n <- 200
alpha_true <- 3
beta_true <- 2
x_data <- rgamma(n, shape = alpha_true, rate = beta_true)

# Sufficient statistics
S1 <- sum(log(x_data))
S2 <- sum(x_data)

# Log-likelihood using sufficient statistics
ll_gamma <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]
  n * alpha * log(beta) - n * lgamma(alpha) + (alpha - 1) * S1 - beta * S2
}
```
`optim` with AD gradients finds the MLE:
```r
fit <- optim(
  par = c(2, 1),
  fn = function(theta) -ll_gamma(theta),
  gr = function(theta) -gradient(ll_gamma, theta),
  method = "L-BFGS-B",
  lower = c(1e-4, 1e-4)
)
theta_hat <- fit$par
names(theta_hat) <- c("alpha", "beta")
theta_hat
```
At the MLE, the score (gradient of the log-likelihood) should be approximately zero. The observed information matrix $J$ is the negative Hessian.
```r
# Score at MLE: should be near zero
score <- gradient(ll_gamma, theta_hat)
score

# Observed information
H <- hessian(ll_gamma, theta_hat)
J <- -H
cat("Observed information matrix:\n")
J
```
Verify against analytical formulas. For the Gamma model:
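In matrix form, the expected pattern (obtained by differentiating the score components a second time) is

$$J(\hat\alpha, \hat\beta) = \begin{pmatrix} n\,\psi'(\hat\alpha) & -n/\hat\beta \\ -n/\hat\beta & n\hat\alpha/\hat\beta^2 \end{pmatrix},$$

with $\psi'$ the trigamma function.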
```r
alpha_hat <- theta_hat[1]; beta_hat <- theta_hat[2]
J_analytical <- matrix(c(
  n * trigamma(alpha_hat), -n / beta_hat,
  -n / beta_hat,           n * alpha_hat / beta_hat^2
), nrow = 2, byrow = TRUE)
cat("Max |AD - analytical| in J:", max(abs(J - J_analytical)), "\n")
```
Standard errors from the inverse information:
```r
J_inv <- solve(J)
se <- sqrt(diag(J_inv))
cat("SE(alpha_hat) =", se[1], "\n")
cat("SE(beta_hat)  =", se[2], "\n")
```
`D(f, x, order = 3)` returns the third-order derivative tensor
$T_{rst} = \partial^3 \ell / \partial\theta_r\,\partial\theta_s\,\partial\theta_t$.
For a two-parameter model, this is a $2 \times 2 \times 2$ array with 8
entries, of which symmetry leaves only 4 distinct values.
```r
T3 <- D(ll_gamma, theta_hat, order = 3)
T3
```
Verify each entry analytically. For the Gamma model:
| Entry | Formula | Note |
|-------|---------|------|
| $T_{111}$ | $-n\,\psi''(\alpha)$ | tetragamma |
| $T_{112} = T_{121} = T_{211}$ | $0$ | |
| $T_{122} = T_{212} = T_{221}$ | $-n / \beta^2$ | |
| $T_{222}$ | $2n\alpha / \beta^3$ | |

```r
T3_analytical <- array(0, dim = c(2, 2, 2))
T3_analytical[1, 1, 1] <- -n * psigamma(alpha_hat, deriv = 2)
# T3[1,1,2] and permutations = 0 (already zero)
T3_analytical[1, 2, 2] <- -n / beta_hat^2
T3_analytical[2, 1, 2] <- -n / beta_hat^2
T3_analytical[2, 2, 1] <- -n / beta_hat^2
T3_analytical[2, 2, 2] <- 2 * n * alpha_hat / beta_hat^3
cat("Max |AD - analytical| in T3:", max(abs(T3 - T3_analytical)), "\n")
```
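As an independent cross-check that needs only base R, a central finite difference of the log-likelihood recovers the mixed entry $T_{122} = \partial^3\ell/\partial\alpha\,\partial\beta^2 = -n/\beta^2$. The values of $n$, $S_1$, and $S_2$ below are hypothetical stand-ins (not the simulated statistics above): since $S_1$ and $S_2$ enter the log-likelihood linearly, the third derivatives do not depend on them.

```r
# Hypothetical stand-in values; third derivatives are data-free here
n_fd <- 200; S1_fd <- 150; S2_fd <- 300
ll_fd <- function(alpha, beta) {
  n_fd * alpha * log(beta) - n_fd * lgamma(alpha) +
    (alpha - 1) * S1_fd - beta * S2_fd
}
a0 <- 3; b0 <- 2; h <- 1e-3

# second central difference in beta at fixed alpha
d2_beta <- function(alpha) {
  (ll_fd(alpha, b0 + h) - 2 * ll_fd(alpha, b0) + ll_fd(alpha, b0 - h)) / h^2
}

# first central difference in alpha of the second difference: T_122
T122_fd <- (d2_beta(a0 + h) - d2_beta(a0 - h)) / (2 * h)
c(finite_diff = T122_fd, analytical = -n_fd / b0^2)
```

The finite-difference value agrees with $-n/\beta^2$ to a few decimal places, which is as much as the $O(h^2)$ scheme can deliver.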
★ Insight ─────────────────────────────────────
The `D(f, x, order = 3)` call applies the `D` operator three times, running
$p = 2$ forward passes at each level for $2^3 = 8$ total passes. Each pass
propagates triply-nested dual numbers through `lgamma`, which internally
chains through `digamma`, `trigamma`, and `psigamma(alpha, deriv = 2)`.
─────────────────────────────────────────────────
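The derivative chain in the box can be verified with base R alone, since `psigamma(x, deriv = k)` exposes these derivatives directly; central differences confirm that each function is the derivative of the previous one:

```r
z <- 3; h <- 1e-5
# digamma is the derivative of lgamma
fd_digamma  <- (lgamma(z + h)   - lgamma(z - h))   / (2 * h)
# trigamma is the derivative of digamma
fd_trigamma <- (digamma(z + h)  - digamma(z - h))  / (2 * h)
# psigamma(z, deriv = 2) (tetragamma) is the derivative of trigamma
fd_tetra    <- (trigamma(z + h) - trigamma(z - h)) / (2 * h)

c(fd_digamma  - digamma(z),
  fd_trigamma - trigamma(z),
  fd_tetra    - psigamma(z, deriv = 2))
```

All three differences are at round-off level, which is exactly the chain nabla exploits when nesting dual numbers through `lgamma`.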
For a regular model whose Hessian is non-random (as in the Gamma model), the asymptotic skewness of the MLE component $\hat\theta_a$ is:

$$\gamma_1(\hat\theta_a) = \frac{2 \sum_{r,s,t} J^{-1}_{ar}\, J^{-1}_{as}\, J^{-1}_{at}\, T_{rst}}{\left[ J^{-1}_{aa} \right]^{3/2}}$$

where $J$ is the full-sample observed information and $T$ the full-sample third-derivative tensor. The factor $2$ combines two contributions: by a Bartlett identity, the third cumulant of the score equals $-T_{rst}$ when the Hessian is non-random, and the quadratic term in the expansion of the score equation contributes $+3T_{rst}$. Because $J^{-1} = O(1/n)$ and $T = O(n)$, the skewness decays at the usual $O(n^{-1/2})$ rate.

```r
mle_skewness <- function(J_inv, T3) {
  p <- nrow(J_inv)
  skew <- numeric(p)
  for (a in seq_len(p)) {
    s <- 0
    for (r in seq_len(p)) {
      for (s_ in seq_len(p)) {
        for (t in seq_len(p)) {
          s <- s + J_inv[a, r] * J_inv[a, s_] * J_inv[a, t] * 2 * T3[r, s_, t]
        }
      }
    }
    skew[a] <- s / J_inv[a, a]^(3/2)
  }
  skew
}

theoretical_skew <- mle_skewness(J_inv, T3)
cat("Theoretical skewness of alpha_hat:", theoretical_skew[1], "\n")
cat("Theoretical skewness of beta_hat: ", theoretical_skew[2], "\n")
```
A positive skewness for $\hat\alpha$ and $\hat\beta$ means their sampling distributions are right-skewed — there is a longer tail of overestimates than underestimates. This is typical for shape and rate parameters, which are bounded below by zero.
Validating the theoretical skewness by simulating many datasets and computing the MLE for each:
```r
skewness <- function(x) {
  m <- mean(x); s <- sd(x)
  mean(((x - m) / s)^3)
}

set.seed(42)
B <- 5000
alpha_hats <- numeric(B)
beta_hats <- numeric(B)

for (b in seq_len(B)) {
  x_b <- rgamma(n, shape = alpha_true, rate = beta_true)
  S1_b <- sum(log(x_b))
  S2_b <- sum(x_b)
  ll_b <- function(theta) {
    a <- theta[1]; be <- theta[2]
    n * a * log(be) - n * lgamma(a) + (a - 1) * S1_b - be * S2_b
  }
  fit_b <- optim(
    par = c(2, 1),
    fn = function(theta) -ll_b(theta),
    method = "L-BFGS-B",
    lower = c(1e-4, 1e-4)
  )
  alpha_hats[b] <- fit_b$par[1]
  beta_hats[b] <- fit_b$par[2]
}

empirical_skew <- c(skewness(alpha_hats), skewness(beta_hats))
```
Compare theoretical vs empirical skewness:
```r
comparison <- data.frame(
  parameter = c("alpha", "beta"),
  theoretical = round(theoretical_skew, 4),
  empirical = round(empirical_skew, 4)
)
comparison
```
The agreement between the theoretical prediction (computed from third-order AD) and the Monte Carlo estimate validates both the AD computation and the asymptotic skewness formula.
The sampling distributions of the MLEs show the asymmetric shape predicted by the skewness analysis:
```r
oldpar <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# alpha_hat
hist(alpha_hats, breaks = 50, freq = FALSE, col = "grey85", border = "grey60",
     main = expression(hat(alpha)), xlab = expression(hat(alpha)))
curve(dnorm(x, mean = mean(alpha_hats), sd = sd(alpha_hats)),
      col = "steelblue", lwd = 2, add = TRUE)
abline(v = alpha_true, col = "firebrick", lty = 2, lwd = 2)
legend("topright", legend = c("Normal approx", expression(alpha[true])),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", cex = 0.8)

# beta_hat
hist(beta_hats, breaks = 50, freq = FALSE, col = "grey85", border = "grey60",
     main = expression(hat(beta)), xlab = expression(hat(beta)))
curve(dnorm(x, mean = mean(beta_hats), sd = sd(beta_hats)),
      col = "steelblue", lwd = 2, add = TRUE)
abline(v = beta_true, col = "firebrick", lty = 2, lwd = 2)
legend("topright", legend = c("Normal approx", expression(beta[true])),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", cex = 0.8)

par(oldpar)
```
The right tail is visibly heavier than the left, consistent with the positive skewness values. The normal approximation (blue curve) misses this asymmetry — which is precisely what third-order derivative analysis captures.
Third-order derivatives computed by `D(f, x, order = 3)` provide
information beyond the standard gradient and Hessian:

- nabla propagates through `lgamma` → `digamma` → `trigamma` →
  `psigamma(deriv = 2)` automatically.
- The `D` operator makes this analysis composable: `D(f, x, order = 3)`
  is just three applications of the same operator that gives gradients and
  Hessians. No separate code paths, no symbolic algebra, no finite-difference
  tuning.