In rpsychologist/powerlmm: Power Analysis for Longitudinal Multilevel Models

This vignette details how the power calculations are implemented in powerlmm. We will focus on the fully nested three-level model, since the two- and partially nested three-level model are just reduced forms of the three-level model. Thus, in standard multilevel notation the fully nested three-level model is \begin{align} \text{Level 1}& \notag \ Y_{ijk} &= \beta_{0jk} + \beta_{1jk}t_{ijk} + R_{ijk} \ \text{Level 2}& \notag \ \beta_{0jk} &= \gamma_{00k} + U_{0jk} \ \beta_{1jk} &= \gamma_{10k} + U_{1jk} \ \text{Level 3}& \notag \ \gamma_{00k} &= \delta_{000} + \delta_{001} TX_k + V_{0k} \ \gamma_{10k} &= \delta_{100} + \delta_{101} TX_k + V_{1k} \ \end{align} where we have $i = 1, \ldots, n_{1j}$ equally spaced time points for subject $j = 1, \ldots, N_2$, where $N_2$ is the total number of subjects in the treatment arm. Furthermore, the subjects are nested within $k = 1, \ldots, n_3$ clusters, where $n_3$ is the total number of clusters in the treatment arm. To allow for varying cluster sizes we let each cluster have $j = 1, \ldots, n_{2[k]}$ subjects, where $n_{2[k]}$ is the total number of subjects in cluster $k$.

The parameter of interest is $\delta_{101}$, i.e. the mean difference in slopes between the two treatment groups. However, in powerlmm the calculations are simplified by calculating the variance of the slope-coefficient separately for each treatment group. Since the slopes in the treatment and control group are independent, the variance of the interaction-term is simply

$$\text{V}(\delta_{101}) = \text{V}(\delta_{100[tx]} - \delta_{100[c]}) = \text{V}(\delta_{100[tx]}) + \text{V}(\delta_{100[c]}),$$ where $\delta_{100[tx]}$ and $\delta_{100[c]}$ are the fixed time effects in the treatment and control group respectively. In order to calculate the variances we begin by formulating the three-level model for the complete data vector $\textbf{Y}$ from a single treatment arm,

\begin{equation} \label{eq-3lvl-def} \textbf{Y} = \textbf{X} \textbf{Z} \textbf{W} \bm{\beta} + \textbf{X}\textbf{u} + \textbf{X}\textbf{Z}\textbf{v} + \bm{\epsilon}, \end{equation}

where \textbf{Y} is the $N_1 \times 1$ outcome vector containing all the observations from all the subjects in the treatment arm, \textbf{X} is a $N_1 \times 2N_2$ matrix containing co-variate information for all $N_2$ subjects in the treatment arm, \textbf{X} is also used as the design matrix for the second-level random effects. $\textbf{Z}$ is a $2n_3 \times 2N_2$ matrix containing the level-three random effects design matrices for each $k$th cluster in the treatment arm. $\textbf{W}$ is a $2n_3 \times 2$ matrix relating the third-level to the overall effects $\beta$, and here $\beta$ is simply a $2 \times 1$ vector with the population values for the fixed intercept and slope effects. Lastly, $\textbf{u}$ is a $2N_2 \times 1$ vector with the level two random effects, $\textbf{v}$ is a $2n_3 \times 1$ vector with the third-level random effects, and $\bm{\epsilon}$ a $N_1 \times 1$ vector with the level one residuals.

The random effects and residuals are distributed as follows, \begin{align} \textbf{u} \sim& \mathcal{N}(\textbf{0}, \bm{\Psi}2), \ \textbf{v} \sim& \mathcal{N}(\textbf{0}, \bm{\Psi}_3), \ \bm{\epsilon} \sim& \mathcal{N}(\textbf{0}, \sigma^2 \textbf{I}{N_1}). \end{align}

With the second and third level variance components being

$$ \bm{\Psi}{2} = \textbf{I}{N_2} \otimes \begin{pmatrix} u_0^2 & u_{01} \ u_{01} & u_{1}^2 \end{pmatrix}, \bm{\Psi}{3} = \textbf{I}{n_3} \otimes \begin{pmatrix} v_0^2 & v_{01} \ v_{01} & v_{1}^2 \end{pmatrix}, $$

with $\otimes$ denoting the Kronecker product. The co-variate matrix $\textbf{X}$ is block-diagonal containing a sub-matrix $\textbf{X}{jk}$ for each subject (level-two unit), thus $$ \textbf{X} = \begin{pmatrix} \textbf{X}_1 & 0 & \cdots & 0 \ 0 & \textbf{X}_2 & \cdots & 0 \ \vdots & \vdots & \ddots & \vdots \ 0 & 0 & \cdots & \textbf{X}{N_2} \end{pmatrix}. $$ Since each subject can have a different number of observations due to dropout, each $\textbf{X}{jk}$ will have dimension $n{1[j]} \times 2$, where $n_{1[j]}$ is the total number of observations for subject $j$ in cluster $k$,

\begin{equation} \label{eq-ind-X} \textbf{X}{jk} = \begin{pmatrix} 1 & T_0 \ 1 & T_1 \ \vdots & \vdots \ 1 & T{n1[j]} \end{pmatrix}. \end{equation}

\textbf{Z} is a block-diagonal matrix containing the level-three design matrices for each cluster $k$, $$ \textbf{Z} = \begin{pmatrix} \textbf{Z}1 & 0 & \cdots & 0 \ 0 & \textbf{Z}_2 & \cdots & 0 \ \vdots & \vdots & \ddots & \vdots \ 0 & 0 & \cdots & \textbf{Z}{n3} \end{pmatrix}. $$

With the sub-matrices $\textbf{Z}_k$ being stacks of $2 \times 2$ matrices for each subject in cluster $k$,

$$ \textbf{Z}k = \textbf{1}{n2[k]} \otimes \begin{pmatrix} 1 & 0 \ 0 & 1 \end{pmatrix}, $$ thus the dimension of $\textbf{Z}k$ will be $n{2[k]} \times 2$, where $n_{2[k]}$ is the number of subjects in cluster $k$. This enables power calculations for designs with varying number of subjects per cluster.

The matrix $\textbf{W}$, relates the cluster-level effects to the overall effects $\bm\beta$, $$ \textbf{W} = \textbf{1}_{n3} \otimes \begin{pmatrix} 1 & 0 \ 0 & 1 \end{pmatrix}, $$

and thus $\textbf{X} \textbf{Z} \textbf{W}$ simply stacks all the $N_2$ sub-matrices, $\textbf{X}_{jk}$, into a $N_1 \times 2$ matrix.

Then we can calculate the marginal variance-covariance matrix for $\textbf{Y}$ as $$ \text{V}(\textbf{Y}) = \textbf{X} \bm{\Psi}2 \textbf{X}^\top + \textbf{X} \textbf{Z} \bm{\Psi}_3 \textbf{Z}^\top \textbf{X}^\top + \bm{\epsilon}^2\textbf{I}{N1}, $$

and the variance of the population parameters in $\bm{\beta}$ as

\begin{equation} \label{eq-var-beta} \text{V}(\bm{\beta}) = [(\textbf{X}\textbf{Z}\textbf{W})^\top \text{V}(\textbf{Y}) ^{-1}\textbf{X}\textbf{Z}\textbf{W}]^{-1}. \end{equation}

The lower right corner of $\text{V}(\bm{\beta})$ corresponds to the variance of the time-coefficient. As we noted earlier we can use the slope variances to calculate the variance of the time$\times$treatment-interaction.

Accounting for dropout

Dropout is accounted for by defining a dropout vector $\textbf{D} = (p_1, \ldots, p_{n_1})^{\top}$, where $p_i$ is the proportion of participants that have dropped out at time point $i$, for the $i, \ldots, n_1$ scheduled time points, and $p_0 = 0$ and $p_i \leq p_{i+1}$. The default in powerlmm is to treat the values in $\textbf{D}$ as known, i.e. exactly $p_i$ subjects will have dropped out at time $i$. This is done by randomly sampling which $p_i N_2$ participants should drop out a time $i$, then adjusting their design matrices $\textbf{X}_{ij}$ to be of size $(i-1) \times 2$, thus their last time point will be $i-1$. Since, it is random which subjects will dropout, the power calculations will differ slightly each time. It is also possible to treat $\textbf{D}$ as random (using the option deterministic_dropout = FALSE), then dropout will be sampled from a multinomial distribution, by converting the elements of $\textbf{D}$ to the probability $p_i$ that a subject will have exactly $i$ measurements. This approach is similar to @Galbraith2002, and @Verbeke1999 who presents power calculations for two-level models with missing data.

Speeding up the computation of $V(Y)^{-1}$

Doing the matrix inversion of $V(\textbf{Y})$, which is of dimension $N_1 \times N_1$, can be extremely slow for some designs. @DeLeeuw1986hd (where they credit @swamy1971statistical) noted a more computationally efficient formulation, adopting it to the three-level formulation in \autoref{eq-3lvl-def}, lets us write

$$ V(\textbf{Y})^{-1} = \sigma^{-2}[\textbf{I}{N_1} - \textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}]+\textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{A}^{-1} (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}, $$ where, $$ \textbf{A}^{-1}=[\sigma^2(\textbf{X}^{\top}\textbf{X})^{-1}+ \bm{\Psi}_2 + \textbf{Z}\bm{\Psi}_3\textbf{Z}^{\top}].$$ Here \textbf{A} of size $2N_2 \times 2N_2$. However, since \textbf{A} is block-diagonal, with each block for cluster $k$ being of size $n{2[k]}$, the computation done in powerlmm, takes advantage of the sparse matrix functions from the Matrix-package. By using sparse matrix algebra the speed of computing $\textbf{A}^{-1}$ will depend greatly on the number of subjects per cluster. In most cases this solution is dramatically faster then directly solving $V(\textbf{Y})^{-1}$. For instance, calculating $\text{V}(\beta)$ for a study with $n_1=10$, $n_2=30$, $n_3=20$ is approximately 50 times faster using this method.

Changes in powerlmm 0.2

As of version 0.2, $\text{V}(\bm{\beta})$ is now computed using the sparse Cholesky factorization used in lme4, and the implementation specifically borrows from lme4pureR. Thus,

$$ \text{V}(\bm{\beta}) = \sigma^2\textbf{R}_X^{-1}(\textbf{R}^{\top}_X)^{-1} $$ where $\textbf{R}_X$ is the Cholesky factor of the fixed effects, see Eq. 54 in @bates2014fitting.

Power

To make the power calculations accurate for small samples sizes, power is calculated using the t distribution. Thus, we can define the power function as, $$ 1 - \beta = P(t_{\nu,\lambda} > t_{\nu,1-\alpha/2}) + P(t_{\nu,\lambda} < t_{\nu,\alpha/2}),$$ where $\lambda$ is the non-centrality parameter, $$\lambda = \delta_{101} / \sqrt{\text{V}(\delta_{101})},$$ and $\nu$ is the appropriate degrees of freedom of the t distribution. For the balanced fully nested three-level model, the degrees of freedom are $N3 - 2$, where $N3$ is the total number of clusters in both treatment arms.

Satterthwaite's degrees of freedom approximation

For small samples, the choice of degrees of freedom will potentially influence the accuracy of the power analysis a lot. In powerlmm it is therefore possible to use Satterthwaite's DF approximation in the power analysis. The degrees of freedom of the t distribution is approximated as, $$ \nu = \frac{2(\textbf{L}^{\top} \text{V}(\bm{\beta}) \textbf{L})^2}{ V(\textbf{L}^{\top}\text{V}(\bm{\beta})\textbf{L})},$$ and \textbf{L} specifies the linear contrast we are testing. Moreover, $V(\textbf{L}^{\top} \text{V}(\bm{\beta}) \textbf{L})$ is approximated using the delta method

$$ V( \textbf{L}^{\top} \text{V}(\bm{\beta}) \textbf{L}) \cong [\Delta_{f( \bm{ \theta}) }( \bm{\theta})]^{\top} \text{V}(\bm{\theta}) [\Delta_{f(\bm{\theta})}(\bm{\theta})], $$

$\Delta_{f(\bm{\theta})}(\bm{\theta})$ is the gradient of $\textbf{L} V(\bm{\beta}) \textbf{L}$ with respect to $\bm{\theta}$, where $\bm{\theta}$ is the vector of variance components, and thus $\text{V}(\bm{\theta})$ is the asymptotic covariance matrix of the random effects (including $\sigma^2$), which is approximated as $\text{V}(\bm{\theta}) = 2 \bm{\mathcal{I}}_E^{-1}$, where $\bm{\mathcal{I}}_E$ is the expected information matrix for the variance parameters. The calculation of $\bm{\mathcal{I}}_E$ is described in Equation 25 in @halekoh2014kenward. However, this implementation involves manipulating $V(\textbf{Y})$, i.e. the full variance-covariance matrix including all $N$ observations. For large sample sizes this will be very computationally intensive, and the computation time will depend mostly on $n_1$ and $n_2$. For instance, for a fully nested model with $n_1 = 10$, $n_2 = 100$, $n_3 = 4$, computations will likely take 30-60 seconds, and be very RAM intensive.

Partially nested designs

For the partially nested designs $\text{V}(\delta_{100[tx]})$ is calculated as above, and $\text{V}(\delta_{100[c]})$ by setting the cluster-level random effects to zero. Degrees of freedom for this model is trickier, and I recommend always using Satterthwaite DFs whenever possible. If balanced DFs are requested, then currently $n_3 - 1$ i used, where $n_3$ is the number of clusters in the treatment group only.

Two-level designs

For the two-level designs, $\text{V}(\delta_{101})$ can be calculated using the three-level formulas with the cluster-level random effects set to zero. Deleting these terms reduces the model to the classical two-level formulation. Degrees of freedom for the balanced model is $N_2 - 2$, where $N_2$ is the total number of subjects in both treatment arms.

Standardized formulation

If there's no missing data and the clusters sizes are balanced, the variance of the slope can be calculated more simply as $$ \text{V}(\delta_{100})= \frac{\sigma^2 + n_1 \sigma_{u_1}^2 V(\textbf{t}) + n_1 n_2 \sigma_{v_1}^2 V(\textbf{t})}{n_1 n_2 n_3 V(\textbf{t})}, $$ with,

$$ \text{V(\textbf{t})} = \Sigma_{i=1}^{n_1}(t_i - \bar{t})^2.$$

By defining the amount of slope variance at the cluster-level as $\rho_s = \sigma_{v_1}^2 / (\sigma_{v_1}^2 + \sigma_{u_1}^2)$, and ICC_pre_subjects = $\rho_1 = (\sigma_{u_0}^2 + \sigma_{v_0}^2) / (\sigma_{v_0}^2 + \sigma_{u_0}^2 + \sigma_{e}^2)$, and the variance ratio as $r_{\tau} = ({\sigma_{v_1}^2 + \sigma_{u_1}^2}) / \sigma_{e}^2$ we can then rewrite the formula using the relative parameters $\rho_1$, $\rho_s$ and $r_{\tau}$,

$$ \text{V}(\delta_1^*) = \frac{(1-\rho_1)+n_1 \text{Var}(\textbf{t})(1-\rho_1) [n_2 \rho_s r + (1-\rho_s)r]}{n_1 n_2 n_3 \text{Var}(\textbf{t}) }, $$ which will yield the same non-centrality parameters as long as the interaction-coefficient corresponds to the same standardized value, e.g. Cohen's d. Thus, we see that power depends on $n1$, $n2$, $n3$, the duration of the study, the proportion of intercept variance at baseline, the amount of slope variance at the third level, and the variance ratio.