library(ggplot2)
library(patchwork)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "cairo_pdf",
  fig.align = "center",
  fig.height = 3.5,
  out.width = "80%"
)

theme_set(theme_bw(base_size = 11))

theme_update(
  legend.position = "none",
  axis.text = element_text(color = "black", size = 11),
  axis.title.x.bottom = element_text(margin = margin(t = 16.5)),
  axis.title.y.left = element_text(margin = margin(r = 16.5))
)

set.seed(1337)
\newcommand{\qvec}{\bm{q}} \newcommand{\rvec}{\bm{r}} \newcommand{\svec}{\bm{s}} \newcommand{\xvec}{\bm{x}} \newcommand{\yvec}{\bm{y}} \newcommand{\zvec}{\bm{z}}
\newcommand{\betavec}{\bm{\beta}} \newcommand{\gammavec}{\bm{\gamma}}
\newcommand{\wmat}{\mathbf{W}} \newcommand{\xmat}{\mathbf{X}} \newcommand{\zmat}{\mathbf{Z}}
\newcommand{\zeromat}{\mathbf{0}}
\newcommand{\cov}{\operatorname{Cov}} \newcommand{\gammadist}{\operatorname{Gamma}} \newcommand{\var}{\operatorname{Var}}
\newcommand{\E}{\mathbb{E}} \newcommand{\I}{\mathcal{I}} \newcommand{\N}{\mathcal{N}}
\newcommand{\betahat}{\hat{\betavec}} \newcommand{\betaols}{\betahat^{\text{(OLS)}}} \newcommand{\betawls}{\betahat^{\text{(WLS)}}}
\newcommand{\gammahat}{\hat{\gammavec}}
\newcommand{\shat}{\hat{s}} \newcommand{\svechat}{\hat{\svec}}
\newcommand{\what}{\hat{\wmat}}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\iid}{\overset{\text{i.i.d.}}{\sim}} \newcommand{\ind}{\overset{\text{ind.}}{\sim}}
The lmls package comes with the abdom dataset (which it borrows from the gamlss.data package). The dataset consists of only two variables: the size of 610 fetuses (as measurements of their abdominal circumference taken from ultrasound scans) and their gestational age ranging from 12 to 42 weeks.
(ref:abdom-plot) The abdom dataset containing the gestational age and abdominal circumference of 610 fetuses.
library(lmls)

ggplot(abdom, aes(x, y)) +
  geom_point(color = "darkgray", size = 1) +
  xlab("Age [weeks]") +
  ylab("Size [mm]")
As expected, Figure \@ref(fig:abdom-plot) indicates that the size of the babies increases with their age. We can use a simple linear regression model to quantify this relationship:
m1 <- lm(y ~ x, data = abdom)
summary(m1)
The simple linear regression model with normally distributed errors is defined as $$y_i = \beta_0 + x_i \beta_1 + \epsilon_i, \text{ where } \epsilon_i \iid \N(0, \sigma^2).$$ Based on the scatterplot of the data in Figure \@ref(fig:abdom-plot), the homoscedasticity (= constant variance) assumption of the simple linear regression model seems implausible. In fact, the variance of the babies' size seems to increase with their age. We can confirm this by plotting the regression residuals against the babies' age:
(ref:abdom-resid) The residuals of the simple linear regression model for the abdom dataset show clear heteroscedasticity and some non-linearity.
abdom$resid <- resid(m1)

ggplot(abdom, aes(x, resid)) +
  geom_point(color = "darkgray", size = 1) +
  geom_hline(yintercept = 0, linewidth = 0.5) +
  xlab("Age [weeks]") +
  ylab("Residuals")
It follows from the definition of the simple linear regression model that the response variable $y_i$ is normally distributed with mean $\mu_i = \beta_0 + x_i \beta_1$ and standard deviation $\sigma$, yielding $$y_i \ind \N(\beta_0 + x_i \beta_1, \sigma^2).$$ From this representation, we can extend the simple linear regression model and use the explanatory variable $x_i$ to predict not only the mean $\mu_i$ but also the standard deviation $\sigma_i$ of the response variable $y_i$. We translate this idea into the model \begin{equation} y_i \ind \N(\beta_0 + x_i \beta_1, (\exp(\gamma_0 + x_i \gamma_1))^2), (#eq:lmls) \end{equation} which is a minimal linear model for location and scale (LMLS). The regression coefficients $\gamma_0$ and $\gamma_1$ are the intercept and the slope parameter for the log-standard deviation, and the exponential function is the inverse link (or response) function. It ensures that the predictions for the standard deviation are always valid (= positive). This step is necessary because the linear predictor $\gamma_0 + x_i \gamma_1$ can become zero or negative for some choices of $\gamma_0$ and $\gamma_1$.
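A quick check in R illustrates why the inverse link is needed (the coefficients here are arbitrary illustrative values, not estimates):

```r
gamma <- c(-2, 0.05)          # hypothetical coefficients for the log-standard deviation
x <- 12                       # gestational age in weeks
gamma[1] + gamma[2] * x       # linear predictor: -1.4, not a valid standard deviation
exp(gamma[1] + gamma[2] * x)  # after the inverse link: 0.247 > 0, a valid standard deviation
```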
Using the lmls package, we can estimate Model \@ref(eq:lmls) for the abdom dataset with a maximum likelihood algorithm by running the following line of code:
(ref:abdom-lmls) The LMLS for the abdom dataset with a linear effect of the babies' age on their average size and a linear effect on the log-standard deviation.
m2 <- lmls(y ~ x, ~ x, data = abdom)

abdom$mu <- predict(m2, type = "response", predictor = "location")
abdom$sigma <- predict(m2, type = "response", predictor = "scale")

abdom$upper <- abdom$mu + 1.96 * abdom$sigma
abdom$lower <- abdom$mu - 1.96 * abdom$sigma

ggplot(abdom, aes(x, y)) +
  geom_point(color = "darkgray", size = 1) +
  geom_line(aes(y = mu), linewidth = 0.7) +
  geom_line(aes(y = upper), linewidth = 0.3) +
  geom_line(aes(y = lower), linewidth = 0.3) +
  xlab("Age [weeks]") +
  ylab("Size [mm]")
The general LMLS can include multiple explanatory variables, transformations of the explanatory variables, and it does not require the explanatory variables for the mean and the standard deviation to be identical. We define the general LMLS as $$y_i \ind \N(\xvec_i'\betavec, (\exp(\zvec_i'\gammavec))^2),$$ where $\xvec_i$ and $\betavec$ are the covariate vector and the vector of regression coefficients for the mean, and $\zvec_i$ and $\gammavec$ are the covariate vector and the vector of regression coefficients for the standard deviation.
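As an illustration, a hypothetical call with different covariates for the two predictors could look as follows (`x1`, `x2`, and the data frame `d` are placeholders, not part of the abdom dataset):

```r
# Mean depends on x1 and a transformation of x2,
# log-standard deviation depends on x2 only:
m <- lmls(y ~ x1 + log(x2), ~ x2, data = d)
```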
Polynomials are a straightforward way to include transformations of explanatory variables in a model. For example, we can improve Model \@ref(eq:lmls) for the abdom dataset and add a quadratic effect of the babies' age on their average size with this command:
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom)
The AIC drops from `r round(AIC(m2), 3)` to `r round(AIC(m3), 3)` for this model compared to Model \@ref(eq:lmls). Figure \@ref(fig:abdom-poly-2) illustrates how the quadratic effect improves the fit of the model.
(ref:abdom-poly-2) The LMLS for the abdom dataset with a quadratic effect of the babies' age on their average size and a linear effect on the log-standard deviation.
abdom$mu <- predict(m3, type = "response", predictor = "location")
abdom$sigma <- predict(m3, type = "response", predictor = "scale")

abdom$upper <- abdom$mu + 1.96 * abdom$sigma
abdom$lower <- abdom$mu - 1.96 * abdom$sigma

ggplot(abdom, aes(x, y)) +
  geom_point(color = "darkgray", size = 1) +
  geom_line(aes(y = mu), linewidth = 0.7) +
  geom_line(aes(y = upper), linewidth = 0.3) +
  geom_line(aes(y = lower), linewidth = 0.3) +
  xlab("Age [weeks]") +
  ylab("Size [mm]")
The lmls package implements two inference algorithms: a frequentist maximum likelihood (ML) algorithm, which it uses by default, and a Bayesian Markov chain Monte Carlo (MCMC) algorithm. The derivatives required for these inference algorithms are given in the Appendix in Section \@ref(sec:appendix).
To simplify the notation in this and the next section, we first define the response vector $\yvec = [y_1, \dots, y_n]'$, the design matrices $\xmat = [\xvec_1', \dots, \xvec_n']'$ and $\zmat = [\zvec_1', \dots, \zvec_n']'$, and the standard deviation of the $i$-th observation $\sigma_i = \exp(\zvec_i'\gammavec)$. Finally, let $\wmat$ be the weight matrix \begin{equation} \wmat = \begin{bmatrix} 1 / \sigma^2_1 & 0 & \dots & 0 \\ 0 & 1 / \sigma^2_2 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 1 / \sigma^2_n \end{bmatrix}. (#eq:wmat) \end{equation}
The ML algorithm combines weighted least squares (WLS) updates for $\betahat$ and Fisher scoring updates for $\gammahat$. We first discuss this update strategy and then take a look at the initialization strategy.
If we know the true $\gammavec$ and hence the true weight matrix $\wmat$, we can estimate $\betavec$ with WLS: $$\betawls = (\xmat'\wmat\xmat)^{-1} \xmat'\wmat\yvec.$$
On the other hand, if we know the true $\betavec$, we can estimate $\gammavec$ with the Fisher scoring algorithm. Fisher scoring is an iterative algorithm for ML estimation, similar to Newton's method. The $k$-th update of the Fisher scoring algorithm is defined as $$\gammahat^{(k)} = \gammahat^{(k - 1)} + (\I(\gammahat^{(k - 1)}))^{-1} s(\gammahat^{(k - 1)}),$$ where $\I$ is the Fisher information and $s$ is the score of $\gammavec$.
In most cases, we know neither the true $\betavec$ nor the true $\gammavec$, but we can estimate them with an iterative algorithm alternating between a Fisher scoring update for $\gammahat$ and a WLS update for $\betahat$. The other vector of regression coefficients is kept fixed at each step.
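As a rough sketch (not the exact implementation of the lmls package), one iteration of this alternating scheme can be written in a few lines of R, assuming the design matrices `X` and `Z`, the response `y`, and current estimates `beta` and `gamma` are available; the score $\zmat'\rvec$ and the Fisher information $2\zmat'\zmat$ of $\gammavec$ are derived in the Appendix in Section \@ref(sec:appendix):

```r
sigma <- exp(drop(Z %*% gamma))        # current standard deviations

# (1) WLS update for beta with W = diag(1 / sigma^2), keeping gamma fixed:
beta <- solve(crossprod(X / sigma), crossprod(X / sigma, y / sigma))

# (2) Fisher scoring update for gamma, keeping beta fixed:
r <- drop(((y - X %*% beta) / sigma)^2 - 1)
score <- crossprod(Z, r)               # s(gamma) = Z'r
info <- 2 * crossprod(Z)               # I(gamma) = 2 * Z'Z
gamma <- drop(gamma + solve(info, score))
```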
For the iterative algorithm to work well, we need to find good starting values. To initialize $\betahat$, the ordinary least squares (OLS) estimator $\betaols = (\xmat'\xmat)^{-1} \xmat'\yvec$ is a natural choice. Note that $\betaols$ is unbiased for the LMLS, because $$ \E(\betaols) = (\xmat'\xmat)^{-1} \xmat'\E(\yvec) = (\xmat'\xmat)^{-1} \xmat'\xmat\betavec = \betavec. $$ Under mild regularity conditions, $\betaols$ is also consistent. However, it is not the best linear unbiased estimator (BLUE), because the homoscedasticity requirement of the Gauss-Markov theorem does not hold.
To initialize $\gammahat$, we first estimate $\log{\sigma_i}$ with $\shat_i = \log{|y_i - \xvec_i'\betaols|} + 0.635$ and then regress $\svechat = [\shat_1, \dots, \shat_n]'$ on the design matrix $\zmat$ with OLS to obtain $\gammahat^{(0)} = (\zmat'\zmat)^{-1} \zmat'\svechat$.
The motivation for $\shat_i$ as an estimator for $\log{\sigma_i}$ is as follows: Consider \begin{equation} \log{\biggl\lvert\frac{y_i - \mu_i}{\sigma_i}\biggr\rvert} = \log{|y_i - \mu_i|} - \log{\sigma_i} (#eq:bias1) \end{equation} and \begin{equation} \log{\biggl\lvert\frac{y_i - \mu_i}{\sigma_i}\biggr\rvert} = \log{\sqrt{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2}} = \frac{1}{2} \cdot \log{\underbrace{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2}_{\sim \chi^2(1)}}. (#eq:bias2) \end{equation} Setting the RHS of Equation \@ref(eq:bias1) equal to the RHS of Equation \@ref(eq:bias2) and taking expectations yields $$\E(\log{|y_i - \mu_i|}) - \log{\sigma_i} = \underbrace{1/2 \cdot (\psi(1/2) + \log 2)}_{\approx -0.635},$$ where $\psi$ is the digamma function. The term $\psi(1/2) + \log 2$ follows from the fact that a $\chi^2(\nu)$ distribution is identical to a $\gammadist(k = \nu/2, \theta = 2)$ distribution, and that for $X \sim \gammadist(k, \theta)$, we have $\E(\log X) = \psi(k) + \log\theta$. Rearranging the equation to $$\E\underbrace{(\log{|y_i - \mu_i|} + 0.635)}_{= s_i} = \log{\sigma_i}$$ shows that $s_i$ is an unbiased estimator for $\log{\sigma_i}$. Unfortunately, we do not know the true $\mu_i$ in practice and have to use the unbiased estimator $\xvec_i'\betaols$ as an approximation instead.
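A compact sketch of this initialization in R, assuming `X`, `Z`, and `y` are given (note that `-(digamma(1 / 2) + log(2)) / 2` evaluates to approximately 0.635):

```r
beta_ols <- solve(crossprod(X), crossprod(X, y))     # OLS estimate of beta
s_hat <- log(abs(y - X %*% beta_ols)) + 0.635        # approx. unbiased estimates of log(sigma_i)
gamma_0 <- solve(crossprod(Z), crossprod(Z, s_hat))  # regress s_hat on Z with OLS
```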
To estimate an LMLS with MCMC, we can call the `mcmc()` function on an existing model object. The sampler requires a model object that contains the design matrices, so we need to make sure the `lmls()` function was called with the argument `light = FALSE`. Finally, we can use the `summary()` function with the argument `type = "mcmc"` to output some summary statistics of the posterior samples.
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom, light = FALSE)
m3 <- mcmc(m3)
summary(m3, type = "mcmc")
The posterior samples for one regression coefficient, $\gamma_1$ in this case, can be extracted and plotted as follows:
theme_update(axis.title.y.left = element_text(margin = margin(r = 5.5)))
(ref:abdom-mcmc-3) Trace plot and histogram of the posterior samples for $\gamma_1$.
samples <- as.data.frame(m3$mcmc$scale)
samples$iteration <- 1:nrow(samples)

p1 <- ggplot(samples, aes(iteration, x)) +
  geom_line() +
  xlab("Iteration") +
  ylab(expression(gamma[1]))

p2 <- ggplot(samples, aes(x, after_stat(density))) +
  geom_histogram(bins = 15) +
  xlab(expression(gamma[1])) +
  ylab("Density")

p1 + p2
theme_update(axis.title.y.left = element_text(margin = margin(r = 16.5)))
The MCMC algorithm assumes flat priors for $\betavec$ and $\gammavec$ and repeats the following steps `num_samples` times: a Gibbs update of $\betavec$, drawing from its full conditional, followed by a Metropolis-Hastings update of $\gammavec$.

The full conditional of $\betavec$ used in the Gibbs update step is given by $$p(\betavec \mid \cdot\,) \propto \exp(-1/2 \cdot (\yvec - \xmat\betavec)' \wmat (\yvec - \xmat\betavec)) \cdot p(\betavec) \cdot p(\gammavec),$$ where the weight matrix $\wmat$ is defined as in Equation \@ref(eq:wmat). The priors $p(\betavec)$ and $p(\gammavec)$ can be ignored, because we assume them to be flat. It can be shown with some tedious but simple linear algebra that \begin{align} &(\yvec - \xmat\betavec)' \wmat (\yvec - \xmat\betavec) \\ =\ &(\betavec - \betawls)' \xmat'\wmat\xmat (\betavec - \betawls) + \yvec' (\wmat - \wmat\xmat (\xmat'\wmat\xmat)^{-1} \xmat'\wmat) \yvec, \end{align} where $\betawls$ is the WLS estimator for $\betavec$ using the weight matrix $\wmat$. As the second addend in the last equation is independent of $\betavec$, the full conditional reduces to $$p(\betavec \mid \cdot\,) \propto \exp(-1/2 \cdot (\betavec - \betawls)' \xmat'\wmat\xmat (\betavec - \betawls)),$$ which implies the following multivariate normal distribution: $$\betavec \mid \cdot\, \sim \N(\betawls, (\xmat'\wmat\xmat)^{-1}).$$
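A sketch of this Gibbs update in R, assuming `X`, `y`, and the current vector `sigma` of standard deviations are available:

```r
# Draw beta from N(beta_wls, (X'WX)^{-1}) with W = diag(1 / sigma^2):
info <- crossprod(X / sigma)                              # X'WX
beta_wls <- solve(info, crossprod(X / sigma, y / sigma))  # WLS estimate
R <- chol(info)                                           # info = R'R, R upper triangular
beta <- beta_wls + backsolve(R, rnorm(ncol(X)))           # R^{-1} z has covariance (X'WX)^{-1}
```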
There are a number of R packages with capabilities similar to lmls. The popular mgcv package [@Wood2017], which is typically used to estimate generalized additive models (GAMs), added support for multi-predictor models including the LMLS with a normally distributed response variable in version 1.8. The gamlss package implements generalized additive models for location, scale and shape [GAMLSS, @Rigby2005] in a very general way, and the bamlss package [@Umlauf2018] is a Bayesian alternative to the gamlss package.
Compared to these packages, the scope of the lmls package is much narrower. Its functionality is limited to the LMLS with a normally distributed response variable. Other response distributions or the regularization of covariate effects are not supported. Instead, lmls aims to be...
Finally, we compare the performance of the lmls package on the abdom dataset with mgcv and gamlss using the microbenchmark package. The results of the benchmark are shown in Figure \@ref(fig:abdom-bench-2).
library(gamlss)
library(mgcv)

bench <- microbenchmark::microbenchmark(
  lmls = lmls(y ~ poly(x, 2), ~ x, data = abdom),
  mgcv = gam(list(y ~ poly(x, 2), ~ x), family = gaulss(), data = abdom),
  gamlss = gamlss(y ~ poly(x, 2), ~ x, data = abdom)
)
(ref:abdom-bench-2) The lmls package is about 3.5 to 4 times faster than mgcv and gamlss on the abdom dataset.
load("abdom-bench.RData")

ggplot(bench, aes(time / 1e6, expr, color = expr, fill = expr)) +
  geom_boxplot(alpha = 0.4) +
  geom_jitter(alpha = 0.4) +
  coord_cartesian(xlim = c(0, NA)) +
  scale_y_discrete(limits = rev(levels(bench$expr))) +
  xlab("Execution time [ms]") +
  ylab("Package")
The inference algorithms from Section \@ref(sec:inference) require the score (= the gradient of the log-likelihood) and the Fisher information (= the covariance of the score) with respect to $\betavec$ and $\gammavec$.
The score of the location coefficients $\betavec$ is $$ s(\betavec) = \frac{\partial\ell}{\partial\betavec} = \xmat'\qvec, $$ where $\qvec$ is an $n$-dimensional column vector with the elements $q_i = (y_i - \mu_i) / \sigma^2_i$. The corresponding Fisher information is given by $$ \I(\betavec) = \cov(s(\betavec)) = \cov(\xmat'\qvec) = \xmat' \cov(\qvec) \xmat, $$ where the diagonal elements of the covariance matrix $\cov(\qvec)$ are $$ \var(q_i) = \var\biggl(\frac{y_i - \mu_i}{\sigma^2_i}\biggr) = \frac{1}{\sigma^2_i} \cdot \var\underbrace{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)}_{\sim \N(0, 1)} = \frac{1}{\sigma^2_i}, $$ and the off-diagonal elements are $\cov(q_i, q_j) = 0$ for $i \ne j$, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of $\betavec$:
crossprod(X / scale)
Here, `scale` is the vector $[\sigma_1, \dots, \sigma_n]$. This code works because R stores matrices in column-major order and recycles shorter vectors for operations like element-wise division: each column of `X` is divided element-wise by `scale`.
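As a quick sanity check (illustrative only, assuming `X` and `scale` are defined), the one-liner agrees with the explicit formula $\xmat'\wmat\xmat$:

```r
W <- diag(1 / scale^2)                             # explicit weight matrix
all.equal(crossprod(X / scale), t(X) %*% W %*% X)  # TRUE
```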
The score of the scale coefficients $\gammavec$ is $$ s(\gammavec) = \frac{\partial\ell}{\partial\gammavec} = \zmat'\rvec, $$ where $\rvec$ is an $n$-dimensional column vector with the elements $r_i = ((y_i - \mu_i) / \sigma_i)^2 - 1$. The corresponding Fisher information is given by $$ \I(\gammavec) = \cov(s(\gammavec)) = \cov(\zmat'\rvec) = \zmat' \cov(\rvec) \zmat, $$ where the diagonal elements of the covariance matrix $\cov(\rvec)$ are $$ \var(r_i) = \var\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) = \var\underbrace{\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr)}_{\sim \chi^2(1)} = 2, $$ and the off-diagonal elements are $\cov(r_i, r_j) = 0$ for $i \ne j$, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of $\gammavec$:
2 * crossprod(Z)
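For completeness, the score vectors from this section can be computed just as compactly (a sketch, assuming `X`, `Z`, `y`, `mu`, and `sigma` are given):

```r
q <- (y - mu) / sigma^2          # elements q_i = (y_i - mu_i) / sigma_i^2
r <- ((y - mu) / sigma)^2 - 1    # elements r_i = ((y_i - mu_i) / sigma_i)^2 - 1
score_beta <- crossprod(X, q)    # s(beta) = X'q
score_gamma <- crossprod(Z, r)   # s(gamma) = Z'r
```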
The inference algorithms from Section \@ref(sec:inference) update the location coefficients $\betavec$ and the scale coefficients $\gammavec$ separately. Would a joint update maybe work better? Let us take a look at the Fisher information of the stacked regression coefficients $$ \I\biggl( \begin{bmatrix} \betavec \\ \gammavec \end{bmatrix} \biggr) = \begin{bmatrix} \I(\betavec) & \cov(s(\betavec), s(\gammavec)) \\ \cov(s(\gammavec), s(\betavec)) & \I(\gammavec) \end{bmatrix}. $$ We are still missing the off-diagonal elements $$ \cov(s(\betavec), s(\gammavec)) = \cov(s(\gammavec), s(\betavec))' = \cov(\xmat'\qvec, \zmat'\rvec) = \xmat' \cov(\qvec, \rvec) \zmat, $$ where the diagonal elements of the cross-covariance matrix $\cov(\qvec, \rvec)$ are \begin{align} \cov(q_i, r_i) &= \cov\biggl(\frac{y_i - \mu_i}{\sigma^2_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) \\ &= \frac{1}{\sigma_i} \cdot \cov\biggl(\frac{y_i - \mu_i}{\sigma_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr) \\ &= 0. \end{align} The third equality holds because $(y_i - \mu_i) / \sigma_i$ is a standard normal random variable and hence uncorrelated with its square. The off-diagonal elements of $\cov(\qvec, \rvec)$ are $\cov(q_i, r_j) = 0$ for $i \ne j$, due to the independence assumption of the LMLS. It follows that $\cov(s(\betavec), s(\gammavec)) = \cov(s(\gammavec), s(\betavec)) = \zeromat$, which means there is no additional information we can make use of for a joint update of $\betavec$ and $\gammavec$ (at least not in terms of the Fisher information).
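This block-diagonal structure can also be spot-checked by simulation (a toy sketch with arbitrary true coefficients, not part of the lmls package):

```r
set.seed(1)
n <- 1000
x <- runif(n)
X <- Z <- cbind(1, x)                  # same covariates for mean and sd
beta <- c(1, 2)                        # arbitrary true location coefficients
gamma <- c(-1, 0.5)                    # arbitrary true scale coefficients
mu <- drop(X %*% beta)
sigma <- exp(drop(Z %*% gamma))

scores <- replicate(5000, {
  y <- rnorm(n, mu, sigma)
  q <- (y - mu) / sigma^2
  r <- ((y - mu) / sigma)^2 - 1
  c(crossprod(X, q), crossprod(Z, r))  # stacked score vector
})

# Correlations between s(beta) and s(gamma) are zero up to Monte Carlo noise:
round(cor(t(scores))[1:2, 3:4], 2)
```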