library(galamm)
The purpose of this vignette is to describe the optimization procedure used by galamm
, and what kind of tools one can use in the case of convergence issues.
The optimization procedure used by galamm
is described in Section 3 of @sorensenLongitudinalModelingAgeDependent2023. It consists of two steps:
In the inner loop, the marginal likelihood is evaluated at a given set of parameters. The marginal likelihood is what you obtain by integrating out the random effects, and this integration is done with the Laplace approximation. The Laplace approximation yields a large system of equations that needs to be solved iteratively, except in the case with conditionally Gaussian responses and unit link function, for which a single step is sufficient to solve the system. When written in matrix-vector form, this system of equations will in most cases have an overwhelming majority of zeros, and to avoid wasting memory and time on storing and multiplying zero, we use sparse matrix methods.
In the outer loop, we try to find the parameters that maximize the marginal likelihood. For each new set of parameters, the whole procedure in the inner loop has to be repeated. By default, we use the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints [@byrdLimitedMemoryAlgorithm1995], abbreviated L-BFGS-B. In particular, we use the implementation in R's optim()
function, which is obtained by setting method = "L-BFGS-B"
. L-BFGS-B requires first derivatives, and these are obtained by automatic differentiation [@skaugAutomaticDifferentiationFacilitate2002]. In most use cases of galamm
, we also use constraints on some of the parameters, e.g., to ensure that variances are non-negative. As an alternative, the Nelder-Mead algorithm with box constraints [@batesFittingLinearMixedEffects2015;@nelderSimplexMethodFunction1965] from lme4
is also available. Since the Nelder-Mead algorithm is derivative free, automatic differentiation is not used in this case, except for computing the Hessian matrix at the final step.
At convergence, the Hessian matrix of second derivatives is computed exactly, again using automatic differentiation. The inverse of this matrix is the covariance matrix of the parameter estimates, and is used to compute Wald type confidence intervals.
We will illustrate some ways of modifying the optimization procedure with the covariate measurement model example shown in the vignette on models with mixed response types. Here we start by simply setting up what we need to fit the model.
loading_matrix <- matrix(c(1, 1, NA), ncol = 1) families <- c(gaussian, binomial) family_mapping <- ifelse(diet$item == "chd", 2, 1) formula <- y ~ 0 + chd + (age * bus):chd + fiber + (age * bus):fiber + fiber2 + (0 + loading | id)
Fitting the model with default arguments yields a warning when we look at the summary object.
mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix ) summary(mod) #> Warning in vcov.galamm(object, parm = "lambda"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> Warning in vcov.galamm(object, "beta"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> #> AIC BIC logLik deviance df.resid #> 2837.6 2892.9 -1406.8 12529.3 730 #> #> Lambda: #> loading SE #> lambda1 1.000 . #> lambda2 1.000 . #> lambda3 -2.026 . #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 0 0 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.78692 NA NA NA #> fiber 17.96184 NA NA NA #> fiber2 -0.64927 NA NA NA #> chd:age 0.06682 NA NA NA #> chd:bus -0.06882 NA NA NA #> fiber:age -0.20480 NA NA NA #> fiber:bus -1.69601 NA NA NA #> chd:age:bus -0.04934 NA NA NA #> fiber:age:bus 0.16097 NA NA NA
In this case, we can increase the amount of information provided by optim
, with the trace
argument. To avoid getting too much output, we also reduce the number of iterations. We set the control
argument as follows:
control <- galamm_control(optim_control = list(maxit = 5, trace = 3, REPORT = 1))
Here, maxit = 5
means that we take at most 5 iterations, trace = 3
means that we want more information from L-BFGS-B, and REPORT= = 1
means that we want L-BFGS-B to report information at each step it takes. We provide this object to the control
argument in galamm
, and rerun the model:
mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, control = control ) #> N = 11, M = 20 machine precision = 2.22045e-16 #> At X0, 0 variables are exactly at the bounds #> At iterate 0 f= 2148 |proj g|= 122.68 #> At iterate 1 f = 2132.1 |proj g|= 275.51 #> At iterate 2 f = 2100.1 |proj g|= 193.7 #> At iterate 3 f = 1975.6 |proj g|= 177.11 #> At iterate 4 f = 1923.2 |proj g|= 165.7 #> At iterate 5 f = 1898.8 |proj g|= 83.839 #> At iterate 6 f = 1887.9 |proj g|= 49.147 #> final value 1887.871646 #> stopped after 6 iterations vcov(mod) #> Warning in vcov.galamm(mod): Rank deficient Hessian matrix.Could not compute covariance matrix. #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] #> [1,] NA NA NA NA NA NA NA NA NA #> [2,] NA NA NA NA NA NA NA NA NA #> [3,] NA NA NA NA NA NA NA NA NA #> [4,] NA NA NA NA NA NA NA NA NA #> [5,] NA NA NA NA NA NA NA NA NA #> [6,] NA NA NA NA NA NA NA NA NA #> [7,] NA NA NA NA NA NA NA NA NA #> [8,] NA NA NA NA NA NA NA NA NA #> [9,] NA NA NA NA NA NA NA NA NA
Since what we did was simply to turn in more reporting, it is no surprise that the Hessian is still rank deficient, but from the output, it is also clear that there are no obvious errors, like values that diverge to infinity. The latter may also happen from time to time.
By default, L-BFGS-B uses the last 5 evaluations of the gradient to approximate the Hessian that is used during optimization (not to be confused with the exact Hessian compute with automatic differentiation after convergence). We try to increase this to 25, and see if that makes a difference. This is done with the lmm
argument. We also reduce the amount of reporting to be every 10th step, and avoid setting the maximum number of iterations, which means that optim()
's default option is used.
control <- galamm_control(optim_control = list(trace = 3, REPORT = 10, lmm = 25))
It is clear that neither this solved the issue.
mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, control = control ) #> N = 11, M = 25 machine precision = 2.22045e-16 #> At X0, 0 variables are exactly at the bounds #> At iterate 0 f= 2148 |proj g|= 122.68 #> At iterate 10 f = 1770.3 |proj g|= 30.656 #> At iterate 20 f = 1467.2 |proj g|= 11.286 #> At iterate 30 f = 1413 |proj g|= 4.3102 #> #> iterations 38 #> function evaluations 44 #> segments explored during Cauchy searches 39 #> BFGS updates skipped 0 #> active bounds at final generalized Cauchy point 1 #> norm of the final projected gradient 0.001689 #> final function value 1406.8 #> #> F = 1406.8 #> final value 1406.801104 #> converged vcov(mod) #> Warning in vcov.galamm(mod): Rank deficient Hessian matrix.Could not compute covariance matrix. #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] #> [1,] NA NA NA NA NA NA NA NA NA #> [2,] NA NA NA NA NA NA NA NA NA #> [3,] NA NA NA NA NA NA NA NA NA #> [4,] NA NA NA NA NA NA NA NA NA #> [5,] NA NA NA NA NA NA NA NA NA #> [6,] NA NA NA NA NA NA NA NA NA #> [7,] NA NA NA NA NA NA NA NA NA #> [8,] NA NA NA NA NA NA NA NA NA #> [9,] NA NA NA NA NA NA NA NA NA
Looking at the model output again, we see that the random effect variance is estimated to be exactly zero.
summary(mod) #> Warning in vcov.galamm(object, parm = "lambda"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> Warning in vcov.galamm(object, "beta"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> Control: control #> #> AIC BIC logLik deviance df.resid #> 2837.6 2892.9 -1406.8 12529.3 730 #> #> Lambda: #> loading SE #> lambda1 1.000 . #> lambda2 1.000 . #> lambda3 -1.922 . #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 0 0 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.78673 NA NA NA #> fiber 17.96179 NA NA NA #> fiber2 -0.64904 NA NA NA #> chd:age 0.06685 NA NA NA #> chd:bus -0.06894 NA NA NA #> fiber:age -0.20479 NA NA NA #> fiber:bus -1.69628 NA NA NA #> chd:age:bus -0.04937 NA NA NA #> fiber:age:bus 0.16092 NA NA NA
These types of obviously wrong zero variance estimates are well-known for users of mixed models [@hodgesRichlyParameterizedLinear2013]. We see if increasing the initial value for the variance parameter solves the issue. This is done with the start
argument to galamm
. The start argument requires a named list, with optional arguments theta
, beta
, lambda
, and weights
, giving initial values for each of these groups of parameters. In this case theta
is the standard deviation of the random effect, and we increase it to 10 to see what happens. By default, the initial value equals 1.
mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, start = list(theta = 10), control = control ) #> N = 11, M = 25 machine precision = 2.22045e-16 #> At X0, 0 variables are exactly at the bounds #> At iterate 0 f= 2827.6 |proj g|= 123.31 #> At iterate 10 f = 1764.5 |proj g|= 103.86 #> At iterate 20 f = 1623.6 |proj g|= 131.81 #> At iterate 30 f = 1447.9 |proj g|= 75.096 #> At iterate 40 f = 1400.5 |proj g|= 35.591 #> At iterate 50 f = 1373.1 |proj g|= 3.3359 #> At iterate 60 f = 1372.2 |proj g|= 0.0016541 #> #> iterations 60 #> function evaluations 72 #> segments explored during Cauchy searches 61 #> BFGS updates skipped 0 #> active bounds at final generalized Cauchy point 0 #> norm of the final projected gradient 0.00165415 #> final function value 1372.16 #> #> F = 1372.16 #> final value 1372.160386 #> converged
Now we see that the model converged and that the Hessian is no longer rank deficient.
summary(mod) #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> Control: control #> #> AIC BIC logLik deviance df.resid #> 2768.3 2823.6 -1372.2 2002.9 730 #> #> Lambda: #> loading SE #> lambda1 1.0000 . #> lambda2 1.0000 . #> lambda3 -0.1339 0.05121 #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 23.64 4.862 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.91525 0.27229 -7.03373 2.011e-12 #> fiber 17.94849 0.48686 36.86601 1.620e-297 #> fiber2 0.22402 0.41783 0.53614 5.919e-01 #> chd:age 0.06615 0.05931 1.11531 2.647e-01 #> chd:bus -0.02895 0.34355 -0.08427 9.328e-01 #> fiber:age -0.21204 0.10090 -2.10135 3.561e-02 #> fiber:bus -1.68303 0.63721 -2.64123 8.261e-03 #> chd:age:bus -0.04999 0.06507 -0.76822 4.424e-01 #> fiber:age:bus 0.16818 0.11223 1.49854 1.340e-01
The Nelder-Mead algorithm is turned on by setting method = "Nelder-Mead"
when calling galamm_control()
. We also turn on reporting every 20th function evaluation by setting verbose = 1
:
control <- galamm_control( optim_control = list(verbose = 1), method = "Nelder-Mead" )
We provide the estimates obtained with the L-BFGS-B algorithm as initial values. For this we can use the convenience function extract_optim_parameters
:
start <- extract_optim_parameters(mod)
We now fit the model, providing the initial values to the start
argument.
mod_nm <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, control = control, start = start ) #> (NM) 20: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 40: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 60: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 80: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 100: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 120: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 140: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 160: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 180: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 200: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 220: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 240: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 260: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 280: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 300: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 320: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 340: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 360: f = 1372.16 at 1.84247 -1.91525 17.9485 0.223968 0.0661412 -0.028921 -0.21203 -1.68308 -0.0499804 0.168172 -0.133908 #> (NM) 380: f = 1372.16 at 1.84247 -1.91525 17.9485 0.22398 0.0661421 -0.0289289 -0.212034 -1.68304 -0.0499816 0.168174 -0.133909 #> (NM) 400: f = 1372.16 at 1.84247 -1.91525 17.9485 0.223986 0.0661415 -0.0289274 -0.212031 -1.68305 -0.0499811 0.168171 -0.133909 #> (NM) 420: f = 1372.16 at 1.84247 -1.91525 17.9485 0.223985 0.0661434 -0.0289358 -0.212032 -1.68304 -0.0499824 0.168172 -0.133908
The summary output shows that Nelder-Mead found exactly the same optimum in this particular case, which is not surprising given the intial values that we provided.
summary(mod_nm) #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> Control: control #> #> AIC BIC logLik deviance df.resid #> 2768.3 2823.6 -1372.2 2002.9 730 #> #> Lambda: #> loading SE #> lambda1 1.0000 . #> lambda2 1.0000 . #> lambda3 -0.1339 0.05121 #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 23.64 4.862 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.91525 0.27229 -7.0337 2.011e-12 #> fiber 17.94851 0.48686 36.8660 1.618e-297 #> fiber2 0.22398 0.41783 0.5360 5.919e-01 #> chd:age 0.06614 0.05931 1.1153 2.647e-01 #> chd:bus -0.02893 0.34355 -0.0842 9.329e-01 #> fiber:age -0.21203 0.10090 -2.1013 3.561e-02 #> fiber:bus -1.68305 0.63721 -2.6413 8.260e-03 #> chd:age:bus -0.04998 0.06507 -0.7682 4.424e-01 #> fiber:age:bus 0.16817 0.11223 1.4985 1.340e-01
At a given set of parameters, the marginal likelihood is evaluated completely in C++. For solving the penalized iteratively reweighted least squares problem arising due to the Laplace approximation, we use sparse matrix methods from the Eigen C++ template library through the RcppEigen
package [@batesFastElegantNumerical2013]. In order to keep track of the derivatives throughout this iterative process, we use the autodiff library [@lealAutodiffModernFast2018]. However, since autodiff
natively only supports dense matrix operations with Eigen
, we have extended this library so that it also supports sparse matrix operations. This modified version of the autodiff
library can be found at inst/include/autodiff/
.
In order to maximize the marginal likelihood, we currently rely on the optim()
function in R. To make use of the fact that both the marginal likelihood value itself and first derivatives are returned from the C++ function, we use memoisation, provided by the memoise
package [@wickhamMemoiseMemoisationFunctions2021]. However, the optimization process still involves copying all model data between R and C++ for each new set of parameters. This is potentially an efficiency bottleneck with large datasets, although with the limited profiling that has been done so far, it seems like the vast majority of the computation time is spent actually solving the penalized iteratively reweighted least squares problem in C++.
We aim to perform also the outer optimization loop in C++, to avoid copying data back and forth between R and C++ during optimization. This requires finding an off-the-shelf optimization routine which is as good as the L-BFGS-B implementation provided by optim()
, and which plays well with autodiff
.
In addition, the current implementation uses only forward mode automatic differentiation. In the future, we aim to add backward mode as an option, as this might turn out to be more efficient for problems with a large number of variables.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.