slim: Singular Linear Models
In slim: Singular Linear Models for Longitudinal Data

Introduction

The slim package fits singular linear models to longitudinal data. In particular, for given sets ${x}$ ($m \times p$ design matrices, where $m$ is the number of observations for a particular subject), ${V_{0}}$ ($m \times m$ possibly singular covariance matrices), ${V_{}}$ ($m \times m$ positive-definite covariance matrices), and ${y}$ (length $m$ vectors containing the longitudinal data for each subject), slim computes $\hat{\beta} = \lim_{\omega \to 0} \hat{\beta}(\omega)$ where $\hat{\beta}(\omega)$ satisfies $$ \sum x^{\top} V(\omega)^{-1} {y - x \hat{\beta}(\omega)} = 0 $$ and $V(\omega) = V_{} \omega + V_{0} (1 - \omega)$.

For a few more details you can consult @farewell2010. But for now let's load up the package and see how it works.

library(slim)

You'll see that data.table is a required package. That's because the main slim modelling function expects to receive the data in the form of a keyed data.table, not just an ordinary data.frame. The dialysis data, included with the package, has two keys: id (the individual or subject identifier) and month (the time variable).

data(dialysis)
str(dialysis)

The slim function makes use of these two keys in a couple different ways. First, by having these rather fundamental aspects of longitudinal data specified within the data, we can avoid having to ask for them each time we want to fit a model. Second, it ensures that the data are sorted in a consistent way (by individual, then time), so that associated quantities derived from calls to other modelling packages can be reliably re-aligned with the right subject and time.

Let's look at the data.

library(ggplot2)

print(p <- ggplot(dialysis, aes(x = month, y = renalfn, group = id)) +
        geom_line(alpha = 0.1) + facet_grid(~ group))

Most individuals have a generally downward trajectory, towards poorer renal function. However, if we overlay the average renal function at each measurement occasion, we see the opposite pattern:

naive_means <- dialysis[, list(id = group, renalfn = mean(renalfn)),
        by = list(group, month)]

print(p <- p + geom_point(data = naive_means))

A strong possibility, then, is that the individuals with poorest renal function are dropping out of the study earlier. This is the kind of effect slim models are designed to adjust for.

slim_basic_fit <- slim(renalfn ~ group + month, dialysis)

summary(slim_basic_fit)

Notice that the sign of the estimated time trend (the coefficient associated with month) now matches the predominant individual pattern. Let's overlay this simple fit on the data:

# identify one fully observed individual in each group
dialysis[, fully_observed := (length(month) == 5), by = id]
group_reps <- dialysis[(fully_observed), list(id = min(id)), by = group]  
setkey(group_reps, id) 

dialysis[, slim_basic := fitted(slim_basic_fit)]

print(p <- p + geom_line(aes(y = slim_basic), data = dialysis[group_reps]))

The slim function uses two "working" covariance matrices for each individual: the initial, positive definite, matrix $V_{*}$ and the limiting, possibly singular, matrix $V_{0}$. The default value of $V_{0}$ is the singular matrix of ones, but we can change this by passing a formula to the limit argument:

slim_intercept_slope_fit <- slim(renalfn ~ group + month, dialysis,
        limit = ~ 1 + month) 

dialysis[, slim_intercept_slope := fitted(slim_intercept_slope_fit)]

print(p <- p + geom_line(aes(y = slim_intercept_slope),
                data = dialysis[group_reps], linetype = "dashed"))

For each individual, the matrix $V_{0}$ is constructed as the product of the columns of the model matrix generated by the limit formula with the transpose of the same model matrix. For example, for an individual with observations at months 0, 3, and 18,

small_example <- data.table(month = c(0, 3, 18))
z <- model.matrix(~ 1, small_example)
z %*% t(z)
z <- model.matrix(~ 1 + month, small_example)
z %*% t(z)

There are several ways to specify $V_{}$. The default (covariance = "randomwalk") is to use the covariance matrix of a random walk with unit-variance increments, that is $\mathrm{cov}(y_{j}, y_{k}) = \min(j, k)$ , but but other options include the identity matrix (covariance = "identity"), the covariance of Brownian motion ($\mathrm{cov}(y_{j}, y_{k}) = \min(t_{j}, t_{k})$, covariance = "brownian") or the symmetric Pascal matrix given by $$\mathrm{cov}(y_{j}, y_{k}) = \left(\begin{array}{c}j + k - 2 \ j - 1\end{array}\right)$$and specified by covariance = "pascal". All these matrices are employed for their structure* and make no reference to the actual variance of the data. When using such working covariance structures, therefore, it is important to use empirical standard errors to obtain reasonable estimates of parameter uncertainty:

vcov(slim_basic_fit)
vcov(slim_basic_fit, empirical = FALSE) # a bad idea for unmodelled covariances!
summary(slim_basic_fit)
summary(slim_basic_fit, empirical = FALSE) # still a bad idea

Use of empirical = FALSE is intended for cases when the covariance $V_{*}$ has itself been modelled from the same data. This can be accomplished by passing another model fit to the covariance argument. Methods exist to handle fits from lmer (package lme4, class lmerMod) and jmcm (package jmcm, class jmcmMod). For example, using a mixed effects model fit by lmer with a random intercept and slope, we can pass the estimated covariance matrices to slim via the covariance argument:

library(lme4)

lmer_fit <- lmer(renalfn ~ group + month + (1 + month | id), dialysis)
slim_lmer_fit <- slim(renalfn ~ group + month, dialysis, covariance = lmer_fit)

summary(slim_lmer_fit)
summary(slim_lmer_fit, empirical = FALSE) # this now makes sense

The jmcm function uses a more comprehensive formula object to specify the response, subject identifier, time variable and mean and covariance structures. These models provide even more flexible ways to estimate covariances:

library(jmcm)

jmcm_fit <- jmcm(renalfn | id | month ~ group | 1, data = dialysis,
        triple = rep(2L, 3), cov.method = "mcd")
slim_jmcm_fit <- slim(renalfn ~ group + month, dialysis, covariance = jmcm_fit)

summary(slim_jmcm_fit)
summary(slim_jmcm_fit, empirical = FALSE)

We can add these fits to our plots:

dialysis[, slim_lmer := fitted(slim_lmer_fit)]
dialysis[, slim_jmcm := fitted(slim_jmcm_fit)]

print(p <- p + geom_line(aes(y = slim_lmer), data = dialysis[group_reps],
                colour = "darkblue") +
        geom_line(aes(y = slim_jmcm), data = dialysis[group_reps],
                colour = "darkred"))

There is reasonable consistency across all these slim model fits. It is possible to demonstrate sensitivity to choice of covariance using more extreme structures (the Pascal matrix is very extreme):

slim_pascal_fit <- slim(renalfn ~ group + month, dialysis,
        covariance = "pascal")

dialysis[, slim_pascal := fitted(slim_pascal_fit)]

print(p <- p + geom_line(aes(y = slim_pascal), data = dialysis[group_reps],
                colour = "cyan"))

Notice the greatly increased standard error associated with the time trend in this fit:

extract_se <- function(fit) sqrt(diag(vcov(fit)))
sapply(list(se_basic = slim_basic_fit, se_pascal = slim_pascal_fit), extract_se)

This sensitivity is indicative of the bias-variance trade-off associated with choice of covariance structure. Roughly, the Pascal structure allows the timings of observations to depend on a random intercept, random slope, random curvature and so on to higher derivatives as far as the number of observations on an individual allow. This consistency across a wider class of models comes at the price of greatly increased variance, although observe that the 'baseline' parameters of the three groups are basically unaffected, relating as they do to the time when all individuals remain in the study.

We end on this cautionary note: it does matter what covariance matrices you choose! Our (limited) experience suggests that random walk or well-modelled covariance structures lead to "sensible" answers, but we would greatly value hearing your insights or experience, too.