```r
library(ggplot2)
library(patchwork)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "cairo_pdf",
  fig.align = "center",
  fig.height = 3.5,
  out.width = "80%"
)

theme_set(theme_bw(base_size = 11))

theme_update(
  legend.position = "none",
  axis.text = element_text(color = "black", size = 11),
  axis.title.x.bottom = element_text(margin = margin(t = 16.5)),
  axis.title.y.left = element_text(margin = margin(r = 16.5))
)

set.seed(1337)
```

\newcommand{\qvec}{\bm{q}} \newcommand{\rvec}{\bm{r}} \newcommand{\svec}{\bm{s}} \newcommand{\xvec}{\bm{x}} \newcommand{\yvec}{\bm{y}} \newcommand{\zvec}{\bm{z}}

\newcommand{\betavec}{\bm{\beta}} \newcommand{\gammavec}{\bm{\gamma}}

\newcommand{\wmat}{\mathbf{W}} \newcommand{\xmat}{\mathbf{X}} \newcommand{\zmat}{\mathbf{Z}}

\newcommand{\zeromat}{\mathbf{0}}

\newcommand{\cov}{\operatorname{Cov}} \newcommand{\gammadist}{\operatorname{Gamma}} \newcommand{\var}{\operatorname{Var}}

\newcommand{\E}{\mathbb{E}} \newcommand{\I}{\mathcal{I}} \newcommand{\N}{\mathcal{N}}

\newcommand{\betahat}{\hat{\betavec}} \newcommand{\betaols}{\betahat^{\text{(OLS)}}} \newcommand{\betawls}{\betahat^{\text{(WLS)}}}

\newcommand{\gammahat}{\hat{\gammavec}}

\newcommand{\shat}{\hat{s}} \newcommand{\svechat}{\hat{\svec}}

\newcommand{\what}{\hat{\wmat}}

\renewcommand{\epsilon}{\varepsilon}

\newcommand{\iid}{\overset{\text{i.i.d.}}{\sim}} \newcommand{\ind}{\overset{\text{ind.}}{\sim}}

# Motivation and model

The lmls package comes with the abdom dataset (which it borrows from the gamlss.data package). The dataset consists of only two variables: the size of 610 fetuses (as measurements of their abdominal circumference taken from ultrasound scans) and their gestational age ranging from 12 to 42 weeks.

(ref:abdom-plot) The abdom dataset containing the gestational age and abdominal circumference of 610 fetuses.

```r
library(lmls)

ggplot(abdom, aes(x, y)) +
  geom_point(color = "darkgray", size = 1) +
  xlab("Age [weeks]") +
  ylab("Size [mm]")
```

As expected, Figure \@ref(fig:abdom-plot) indicates that the size of the babies increases with their age. We can use a simple linear regression model to quantify this relationship:

```r
m1 <- lm(y ~ x, data = abdom)
summary(m1)
```

The simple linear regression model with normally distributed errors is defined as $$y_i = \beta_0 + x_i \beta_1 + \epsilon_i, \text{ where } \epsilon_i \iid \N(0, \sigma^2).$$ Based on the scatterplot of the data in Figure \@ref(fig:abdom-plot), the homoscedasticity (= constant variance) assumption of the simple linear regression model seems implausible. In fact, the variance of the babies' size seems to increase with their age. We can confirm this by plotting the regression residuals against the babies' age:

(ref:abdom-resid) The residuals of the simple linear regression model for the abdom dataset show clear heteroscedasticity and some non-linearity.

```r
abdom$resid <- resid(m1)

ggplot(abdom, aes(x, resid)) +
  geom_point(color = "darkgray", size = 1) +
  geom_hline(yintercept = 0, linewidth = 0.5) +
  xlab("Age [weeks]") +
  ylab("Residuals")
```

It follows from the definition of the simple linear regression model that the response variable $y_i$ is normally distributed with mean $\mu_i = \beta_0 + x_i \beta_1$ and standard deviation $\sigma$, yielding $$y_i \ind \N(\beta_0 + x_i \beta_1, \sigma^2).$$ From this representation, we can extend the simple linear regression model and use the explanatory variable $x_i$ to predict not only the mean $\mu_i$ but also the standard deviation $\sigma_i$ of the response variable $y_i$. We translate this idea into the model \begin{equation} y_i \ind \N(\beta_0 + x_i \beta_1, (\exp(\gamma_0 + x_i \gamma_1))^2), (#eq:lmls) \end{equation} which is a minimal linear model for location and scale (LMLS). The regression coefficients $\gamma_0$ and $\gamma_1$ are the intercept and the slope parameter for the log-standard deviation, and the exponential function is the inverse link (or response) function. It ensures that the predictions for the standard deviation are always valid (= positive). This step is necessary, because the linear predictor $\gamma_0 + x_i \gamma_1$ can become zero or negative for some choices of $\gamma_0$ and $\gamma_1$.
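To make the data-generating process of Model \@ref(eq:lmls) concrete, here is a minimal simulation sketch in base R. The coefficient values are arbitrary and only chosen to roughly mimic the scale of the abdom data:

```r
# simulate from a minimal LMLS: location and log-standard deviation are both linear in x
n <- 200
x <- runif(n, 12, 42)                  # "age" on a scale similar to the abdom data
beta <- c(-60, 10)                     # arbitrary location coefficients
gamma <- c(0.5, 0.06)                  # arbitrary log-standard deviation coefficients

mu <- beta[1] + beta[2] * x            # mean of y
sigma <- exp(gamma[1] + gamma[2] * x)  # standard deviation of y, always positive

y_sim <- rnorm(n, mean = mu, sd = sigma)
```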

Using the lmls package, we can estimate Model \@ref(eq:lmls) for the abdom dataset with a maximum likelihood algorithm running the following line of code:

(ref:abdom-lmls) The LMLS for the abdom dataset with a linear effect of the babies' age on their average size and a linear effect on the log-standard deviation.

```r
m2 <- lmls(y ~ x, ~ x, data = abdom)

abdom$mu <- predict(m2, type = "response", predictor = "location")
abdom$sigma <- predict(m2, type = "response", predictor = "scale")
abdom$upper <- abdom$mu + 1.96 * abdom$sigma
abdom$lower <- abdom$mu - 1.96 * abdom$sigma

ggplot(abdom, aes(x, y)) +
  geom_point(color = "darkgray", size = 1) +
  geom_line(aes(y = mu), linewidth = 0.7) +
  geom_line(aes(y = upper), linewidth = 0.3) +
  geom_line(aes(y = lower), linewidth = 0.3) +
  xlab("Age [weeks]") +
  ylab("Size [mm]")
```

The general LMLS can include multiple explanatory variables, transformations of the explanatory variables, and it does not require the explanatory variables for the mean and the standard deviation to be identical. We define the general LMLS as $$y_i \ind \N(\xvec_i'\betavec, (\exp(\zvec_i'\gammavec))^2),$$ where $\xvec_i$ and $\betavec$ are the covariate vector and the vector of regression coefficients for the mean, and $\zvec_i$ and $\gammavec$ are the covariate vector and the vector of regression coefficients for the standard deviation.

Polynomials are a straightforward way to include transformations of explanatory variables in a model. For example, we can improve Model \@ref(eq:lmls) for the abdom dataset and add a quadratic effect of the babies' age on their average size with this command:

```r
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom)
```

Compared to Model \@ref(eq:lmls), the AIC drops from `r round(AIC(m2), 3)` to `r round(AIC(m3), 3)` for this model. Figure \@ref(fig:abdom-poly-2) illustrates how the quadratic effect improves the fit of the model.

(ref:abdom-poly-2) The LMLS for the abdom dataset with a quadratic effect of the babies' age on their average size and a linear effect on the log-standard deviation.

```r
abdom$mu <- predict(m3, type = "response", predictor = "location")
abdom$sigma <- predict(m3, type = "response", predictor = "scale")
abdom$upper <- abdom$mu + 1.96 * abdom$sigma
abdom$lower <- abdom$mu - 1.96 * abdom$sigma

ggplot(abdom, aes(x, y)) +
  geom_point(color = "darkgray", size = 1) +
  geom_line(aes(y = mu), linewidth = 0.7) +
  geom_line(aes(y = upper), linewidth = 0.3) +
  geom_line(aes(y = lower), linewidth = 0.3) +
  xlab("Age [weeks]") +
  ylab("Size [mm]")
```

# Statistical inference {#sec:inference}

The lmls package implements two inference algorithms: a frequentist maximum likelihood (ML) algorithm, which it uses by default, and a Bayesian Markov chain Monte Carlo (MCMC) algorithm. The derivatives that are required for these inference algorithms are derived in the Appendix in Section \@ref(sec:appendix).

To simplify the notation in this and the next section, we first define the response vector $\yvec = [y_1, \dots, y_n]'$, the design matrices $\xmat = [\xvec_1', \dots, \xvec_n']'$ and $\zmat = [\zvec_1', \dots, \zvec_n']'$, and the standard deviation of the $i$-th observation $\sigma_i = \exp(\zvec_i'\gammavec)$. Finally, let $\wmat$ be the weight matrix \begin{equation} \wmat = \begin{bmatrix} 1 / \sigma^2_1 & 0 & \dots & 0 \\ 0 & 1 / \sigma^2_2 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 1 / \sigma^2_n \end{bmatrix}. (#eq:wmat) \end{equation}
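In R, the standard deviations and the weight matrix from Equation \@ref(eq:wmat) can be computed directly from the design matrix $\zmat$ and the coefficient vector $\gammavec$. A minimal sketch, assuming objects Z and gamma are available (in practice, storing only the diagonal as a vector of weights is more efficient than the full matrix):

```r
sigma <- drop(exp(Z %*% gamma))  # per-observation standard deviations
W <- diag(1 / sigma^2)           # weight matrix from Equation (eq:wmat)
```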

## Maximum likelihood

The ML algorithm combines weighted least squares (WLS) updates for $\betahat$ and Fisher scoring updates for $\gammahat$. We first discuss this update strategy and then take a look at the initialization strategy.

### Update strategy

If we know the true $\gammavec$ and hence the true weight matrix $\wmat$, we can estimate $\betavec$ with WLS: $$\betawls = (\xmat'\wmat\xmat)^{-1} \xmat'\wmat\yvec.$$
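Translated into base R, the WLS estimator is a single line of linear algebra. This sketch assumes a design matrix X, a response vector y, and the weight matrix W from above; solve() with two arguments avoids forming the explicit inverse:

```r
beta_wls <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% y))  # WLS estimate of beta
```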

On the other hand, if we know the true $\betavec$, we can estimate $\gammavec$ with the Fisher scoring algorithm. Fisher scoring is an iterative algorithm for ML estimation, similar to Newton's method. The $k$-th update of the Fisher scoring algorithm is defined as $$\gammahat^{(k)} = \gammahat^{(k - 1)} + (\I(\gammahat^{(k - 1)}))^{-1} s(\gammahat^{(k - 1)}),$$ where $\I$ is the Fisher information and $s$ is the score of $\gammavec$.
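Using the score $s(\gammavec) = \zmat'\rvec$ and the Fisher information $\I(\gammavec) = 2 \zmat'\zmat$ derived in the appendix (Section \@ref(sec:appendix)), a single Fisher scoring update for $\gammahat$ can be sketched in a few lines of base R, again assuming X, Z, y, beta, and gamma are available in the workspace:

```r
mu <- drop(X %*% beta)           # location predictor, kept fixed during this update
sigma <- drop(exp(Z %*% gamma))  # current standard deviations
r <- ((y - mu) / sigma)^2 - 1    # elements of the score vector Z'r
gamma <- gamma + drop(solve(2 * crossprod(Z), crossprod(Z, r)))
```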

In most cases, we know neither the true $\betavec$ nor the true $\gammavec$, but we can estimate them with an iterative algorithm alternating between a Fisher scoring update for $\gammahat$ and a WLS update for $\betahat$. The other vector of regression coefficients is kept fixed at each step.

### Initialization strategy

For the iterative algorithm to work well, we need to find good starting values. To initialize $\betahat$, the ordinary least squares (OLS) estimator $\betaols = (\xmat'\xmat)^{-1} \xmat'\yvec$ is a natural choice. Note that $\betaols$ is unbiased for the LMLS, because $$ \E(\betaols) = (\xmat'\xmat)^{-1} \xmat'\E(\yvec) = (\xmat'\xmat)^{-1} \xmat'\xmat\betavec = \betavec. $$ Under mild regularity conditions, $\betaols$ is also consistent. However, it is not the best linear unbiased estimator (BLUE), because the homoscedasticity requirement of the Gauss-Markov theorem does not hold.

To initialize $\gammahat$, we first estimate $\log{\sigma_i}$ with $\shat_i = \log{|y_i - \xvec_i'\betaols|} + 0.635$ and then regress $\svechat = [\shat_1, \dots, \shat_n]'$ on the design matrix $\zmat$ with OLS to obtain $\gammahat^{(0)} = (\zmat'\zmat)^{-1} \zmat'\svechat$.
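As a rough sketch in base R, this initialization could look as follows (assuming design matrices X and Z and a response vector y):

```r
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))     # OLS starting values for beta
s_hat <- log(abs(y - drop(X %*% beta_ols))) + 0.635        # estimates of log(sigma_i)
gamma_0 <- drop(solve(crossprod(Z), crossprod(Z, s_hat)))  # OLS regression of s_hat on Z
```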

The motivation for $\shat_i$ as an estimator for $\log{\sigma_i}$ is as follows: Consider \begin{equation} \log{\biggl\lvert\frac{y_i - \mu_i}{\sigma_i}\biggr\rvert} = \log{|y_i - \mu_i|} - \log{\sigma_i} (#eq:bias1) \end{equation} and \begin{equation} \log{\biggl\lvert\frac{y_i - \mu_i}{\sigma_i}\biggr\rvert} = \log{\sqrt{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2}} = \frac{1}{2} \cdot \log{\underbrace{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2}_{\sim \chi^2(1)}}. (#eq:bias2) \end{equation} Setting the RHS of Equation \@ref(eq:bias1) equal to the RHS of Equation \@ref(eq:bias2) and taking expectations yields $$\E(\log{|y_i - \mu_i|}) - \log{\sigma_i} = \underbrace{1/2 \cdot (\psi(1/2) + \log 2)}_{\approx -0.635},$$ where $\psi$ is the digamma function. The term $\psi(1/2) + \log 2$ follows from the fact that a $\chi^2(\nu)$ distribution is identical to a $\gammadist(k = \nu/2, \theta = 2)$ distribution, and that for $X \sim \gammadist(k, \theta)$, we have $\E(\log X) = \psi(k) + \log\theta$. Rearranging the equation to $$\E\underbrace{(\log{|y_i - \mu_i|} + 0.635)}_{= s_i} = \log{\sigma_i}$$ shows that $s_i$ is an unbiased estimator for $\log{\sigma_i}$. Unfortunately, we do not know the true $\mu_i$ in practice and have to use the unbiased estimator $\xvec_i'\betaols$ as an approximation instead.
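The bias-correction constant can be verified numerically in R, either exactly via the digamma function or by simulation:

```r
0.5 * (digamma(0.5) + log(2))  # exact value of the bias term
#> [1] -0.6351814

mean(log(abs(rnorm(1e6))))     # Monte Carlo check: E(log|Z|) for Z ~ N(0, 1)
```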

### Complete algorithm

  1. Estimate $\betaols = (\xmat'\xmat)^{-1} \xmat'\yvec$.
  2. Initialize $\gammahat^{(0)} = (\zmat'\zmat)^{-1} \zmat'\svechat$, where $\svechat = [\shat_1, \dots, \shat_n]'$ with $\shat_i = \log{|y_i - \xvec_i'\betaols|} + 0.635$.
  3. Initialize $\betahat^{(0)} = (\xmat'\what^{(0)}\xmat)^{-1} \xmat'\what^{(0)}\yvec$, where $\what^{(0)}$ is the weight matrix for $\gammahat^{(0)}$.
  4. Set $k = 1$.
  5. Repeat the following steps until convergence (a minimal sketch of the complete loop follows this list):
  6. Update $\gammahat^{(k)} = \gammahat^{(k - 1)} + (\I(\gammahat^{(k - 1)}))^{-1} s(\gammahat^{(k - 1)})$ keeping $\betahat^{(k - 1)}$ fixed.
  7. Update $\betahat^{(k)} = (\xmat'\what^{(k)}\xmat)^{-1} \xmat'\what^{(k)}\yvec$, where $\what^{(k)}$ is the weight matrix for $\gammahat^{(k)}$.
  8. Increase $k$ by 1.
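Putting these pieces together, a bare-bones version of the complete algorithm might look as follows. This is only an illustrative sketch under the assumptions above (design matrices X and Z, response vector y); the actual implementation in lmls differs in details such as the convergence criterion and numerical safeguards.

```r
lmls_ml_sketch <- function(y, X, Z, maxit = 100, tol = 1e-8) {
  # steps 1 and 2: OLS for beta, bias-corrected log-residual regression for gamma
  beta <- drop(solve(crossprod(X), crossprod(X, y)))
  s_hat <- log(abs(y - drop(X %*% beta))) + 0.635
  gamma <- drop(solve(crossprod(Z), crossprod(Z, s_hat)))

  # step 3: WLS update of beta with the initial weights 1 / sigma^2
  w <- drop(exp(Z %*% gamma))^-2
  beta <- drop(solve(crossprod(X * w, X), crossprod(X * w, y)))

  # steps 4 to 8: alternate Fisher scoring for gamma and WLS for beta
  for (k in seq_len(maxit)) {
    coef_old <- c(beta, gamma)

    # step 6: Fisher scoring update for gamma with beta kept fixed
    mu <- drop(X %*% beta)
    sigma <- drop(exp(Z %*% gamma))
    r <- ((y - mu) / sigma)^2 - 1
    gamma <- gamma + drop(solve(2 * crossprod(Z), crossprod(Z, r)))

    # step 7: WLS update for beta with the new weights
    w <- drop(exp(Z %*% gamma))^-2
    beta <- drop(solve(crossprod(X * w, X), crossprod(X * w, y)))

    # crude convergence check on the change of the coefficients
    if (max(abs(c(beta, gamma) - coef_old)) < tol) break
  }

  list(beta = beta, gamma = gamma, iterations = k)
}
```

Since this loop maximizes the same likelihood as lmls(), it should converge to essentially the same point estimates for well-behaved data.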

## Markov chain Monte Carlo

To estimate an LMLS with MCMC, we can call the mcmc() function on an existing model object. The sampler requires a model object that contains the design matrices, so we need to make sure the lmls() function was called with the argument light = FALSE. Finally, we can use the summary() function with the argument type = "mcmc" to output some summary statistics of the posterior samples.

```r
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom, light = FALSE)
m3 <- mcmc(m3)

summary(m3, type = "mcmc")
```

The posterior samples for one regression coefficient, $\gamma_1$ in this case, can be extracted and plotted as follows:

```r
theme_update(axis.title.y.left = element_text(margin = margin(r = 5.5)))
```

(ref:abdom-mcmc-3) Trace plot and histogram of the posterior samples for $\gamma_1$.

```r
samples <- as.data.frame(m3$mcmc$scale)
samples$iteration <- 1:nrow(samples)

p1 <- ggplot(samples, aes(iteration, x)) +
  geom_line() +
  xlab("Iteration") +
  ylab(expression(gamma[1]))

p2 <- ggplot(samples, aes(x, after_stat(density))) +
  geom_histogram(bins = 15) +
  xlab(expression(gamma[1])) +
  ylab("Density")

p1 + p2
theme_update(axis.title.y.left = element_text(margin = margin(r = 16.5)))
```

### MCMC algorithm

The MCMC algorithm assumes flat priors for $\betavec$ and $\gammavec$ and works as follows:

  1. Initialize $\betavec^{(0)}$ and $\gammavec^{(0)}$ with the ML estimates.
  2. Set $k = 1$.
  3. Repeat the following steps num_samples times:
  4. Sample $\betavec^{(k)}$ from the full conditional in a Gibbs update step (see Section \@ref(sec:fullcond)).
  5. Sample $\gammavec^{(k)}$ with the Riemann manifold Metropolis-adjusted Langevin algorithm (MMALA) from @Girolami2011 using the Fisher-Rao metric tensor.
  6. Increase $k$ by 1.

### Full conditional of $\betavec$ {#sec:fullcond}

The full conditional of $\betavec$ used in the Gibbs update step is given by $$p(\betavec \mid \cdot\,) \propto \exp(-1/2 \cdot (\yvec - \xmat\betavec)' \wmat (\yvec - \xmat\betavec)) \cdot p(\betavec) \cdot p(\gammavec),$$ where the weight matrix $\wmat$ is defined as in Equation \@ref(eq:wmat). The priors $p(\betavec)$ and $p(\gammavec)$ can be ignored, because we assume them to be flat. It can be shown with some tedious but simple linear algebra that \begin{align} &(\yvec - \xmat\betavec)' \wmat (\yvec - \xmat\betavec) \\ =\ &(\betavec - \betawls)' \xmat'\wmat\xmat (\betavec - \betawls) + \yvec' (\wmat - \wmat\xmat (\xmat'\wmat\xmat)^{-1} \xmat'\wmat) \yvec, \end{align} where $\betawls$ is the WLS estimator for $\betavec$ using the weight matrix $\wmat$. As the second addend in the last equation is independent of $\betavec$, the full conditional reduces to $$p(\betavec \mid \cdot\,) \propto \exp(-1/2 \cdot (\betavec - \betawls)' \xmat'\wmat\xmat (\betavec - \betawls)),$$ which implies the following multivariate normal distribution: $$\betavec \mid \cdot\, \sim \N(\betawls, (\xmat'\wmat\xmat)^{-1}).$$
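In base R, a single Gibbs update for $\betavec$ can be sketched directly from this result; the draw from $\N(\betawls, (\xmat'\wmat\xmat)^{-1})$ uses the Cholesky factor of $\xmat'\wmat\xmat$. This is only a sketch assuming X, Z, y, and the current gamma are available, not the exact implementation in the package:

```r
sigma <- drop(exp(Z %*% gamma))                    # standard deviations for the current gamma
w <- 1 / sigma^2                                   # diagonal of the weight matrix W
XWX <- crossprod(X * w, X)                         # X'WX
beta_wls <- drop(solve(XWX, crossprod(X * w, y)))  # WLS estimate = full conditional mean

R <- chol(XWX)                                     # upper triangular factor with X'WX = R'R
beta <- beta_wls + drop(backsolve(R, rnorm(ncol(X))))  # draw from N(beta_wls, (X'WX)^(-1))
```

Repeating this draw together with the MMALA update for $\gammavec$ yields the sampler described above.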

# Comparison with other R packages

There are a number of R packages with similar capabilities as lmls. The popular mgcv package [@Wood2017], which is typically used to estimate generalized additive models (GAMs), added support for multi-predictor models including the LMLS with a normally distributed response variable in version 1.8. The gamlss package implements generalized additive models for location, scale and shape [GAMLSS, @Rigby2005] in a very general way, and the bamlss package [@Umlauf2018] is a Bayesian alternative to the gamlss package.

Compared to these packages, the scope of the lmls package is much narrower. Its functionality is limited to the LMLS with a normally distributed response variable. Other response distributions or the regularization of covariate effects are not supported. Instead, lmls aims to be...

Finally, we compare the performance of the lmls package on the abdom dataset with mgcv and gamlss using the microbenchmark package. The results of the benchmark are shown in Figure \@ref(fig:abdom-bench-2).

```r
library(gamlss)
library(mgcv)

bench <- microbenchmark::microbenchmark(
  lmls = lmls(y ~ poly(x, 2), ~ x, data = abdom),
  mgcv = gam(list(y ~ poly(x, 2), ~ x), family = gaulss(), data = abdom),
  gamlss = gamlss(y ~ poly(x, 2), ~ x, data = abdom)
)
```

(ref:abdom-bench-2) The lmls package is about 3.5 to 4 times faster than mgcv and gamlss on the abdom dataset.

load("abdom-bench.RData")

ggplot(bench, aes(time / 1e6, expr, color = expr, fill = expr)) +
  geom_boxplot(alpha = 0.4) +
  geom_jitter(alpha = 0.4) +
  coord_cartesian(xlim = c(0, NA)) +
  scale_y_discrete(limits = rev(levels(bench$expr))) +
  xlab("Execution time [ms]") +
  ylab("Package")

# Appendix: Score and Fisher information {#sec:appendix}

The inference algorithms from Section \@ref(sec:inference) require the score (= the gradient of the log-likelihood) and the Fisher information (= the covariance of the score) with respect to $\betavec$ and $\gammavec$.

## Location coefficients $\betavec$

The score of the location coefficients $\betavec$ is $$ s(\betavec) = \frac{\partial\ell}{\partial\betavec} = \xmat'\qvec, $$ where $\qvec$ is an $n$-dimensional column vector with the elements $q_i = (y_i - \mu_i) / \sigma^2_i$. The corresponding Fisher information is given by $$ \I(\betavec) = \cov(s(\betavec)) = \cov(\xmat'\qvec) = \xmat' \cov(\qvec) \xmat, $$ where the diagonal elements of the covariance matrix $\cov(\qvec)$ are $$ \var(q_i) = \var\biggl(\frac{y_i - \mu_i}{\sigma^2_i}\biggr) = \frac{1}{\sigma^2_i} \cdot \var\underbrace{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)}_{\sim \N(0, 1)} = \frac{1}{\sigma^2_i}, $$ and the off-diagonal elements are $\cov(q_i, q_j) = 0$ for $i \ne j$, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of $\betavec$:

```r
crossprod(X / scale)
```

Here, scale is the vector $[\sigma_1, \dots, \sigma_n]'$. This code works, because R stores matrices in column-major order and recycles shorter vectors for operations like element-wise division, so X / scale divides each row of $\xmat$ by the corresponding standard deviation.
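A quick numerical check with a toy design matrix illustrates that this expression indeed equals $\xmat'\wmat\xmat$:

```r
X <- cbind(1, rnorm(5))  # toy design matrix with 5 observations
scale <- exp(rnorm(5))   # toy standard deviations

all.equal(crossprod(X / scale), t(X) %*% diag(1 / scale^2) %*% X)
#> [1] TRUE
```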

## Scale coefficients $\gammavec$

The score of the scale coefficients $\gammavec$ is $$ s(\gammavec) = \frac{\partial\ell}{\partial\gammavec} = \zmat'\rvec, $$ where $\rvec$ is an $n$-dimensional column vector with the elements $r_i = ((y_i - \mu_i) / \sigma_i)^2 - 1$. The corresponding Fisher information is given by $$ \I(\gammavec) = \cov(s(\gammavec)) = \cov(\zmat'\rvec) = \zmat' \cov(\rvec) \zmat, $$ where the diagonal elements of the covariance matrix $\cov(\rvec)$ are $$ \var(r_i) = \var\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) = \var\underbrace{\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr)}_{\sim \chi^2(1)} = 2, $$ and the off-diagonal elements are $\cov(r_i, r_j) = 0$ for $i \ne j$, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of $\gammavec$:

```r
2 * crossprod(Z)
```

## Mixed Fisher information of $\betavec$ and $\gammavec$

The inference algorithms from Section \@ref(sec:inference) update the location coefficients $\betavec$ and the scale coefficients $\gammavec$ separately. Would a joint update maybe work better? Let us take a look at the Fisher information of the stacked regression coefficients $$ \I\biggl( \begin{bmatrix} \betavec \\ \gammavec \end{bmatrix} \biggr) = \begin{bmatrix} \I(\betavec) & \cov(s(\betavec), s(\gammavec)) \\ \cov(s(\gammavec), s(\betavec)) & \I(\gammavec) \end{bmatrix}. $$ We are still missing the off-diagonal elements $$ \cov(s(\betavec), s(\gammavec)) = \cov(s(\gammavec), s(\betavec))' = \cov(\xmat'\qvec, \zmat'\rvec) = \xmat' \cov(\qvec, \rvec) \zmat, $$ where the diagonal elements of the cross-covariance matrix $\cov(\qvec, \rvec)$ are \begin{align} \cov(q_i, r_i) &= \cov\biggl(\frac{y_i - \mu_i}{\sigma^2_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) \\ &= \frac{1}{\sigma_i} \cdot \cov\biggl(\frac{y_i - \mu_i}{\sigma_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr) \\ &= 0. \end{align} The third equality holds because $(y_i - \mu_i) / \sigma_i$ is a standard normal random variable and hence uncorrelated with its square. The off-diagonal elements of $\cov(\qvec, \rvec)$ are $\cov(q_i, r_j) = 0$ for $i \ne j$, due to the independence assumption of the LMLS. It follows that $\cov(s(\betavec), s(\gammavec)) = \cov(s(\gammavec), s(\betavec)) = \zeromat$, which means there is no additional information we can make use of for a joint update of $\betavec$ and $\gammavec$ (at least not in terms of the Fisher information).
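The key step, that a standard normal random variable is uncorrelated with its square, is easy to confirm empirically in R:

```r
z <- rnorm(1e6)  # standard normal draws
cov(z, z^2)      # close to zero (exactly zero in expectation)
```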

# References


