In gllvm: Generalized Linear Latent Variable Models

knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  warning = FALSE,
  fig.width=5, fig.height=5,
  fig.align = "center",
  dev = "png",
  fig.pos = 'H'
  )

Introduction to gllvm

R package gllvm

R package gllvm fits Generalized linear latent variable models (GLLVM) for multivariate data^[Niku, J., F.K.C. Hui, S. Taskinen, and D.I. Warton. 2019. Gllvm - Fast Analysis of Multivariate Abundance Data with Generalized Linear Latent Variable Models in R. 10. Methods in Ecology and Evolution: 2173–82].
Developed by J. Niku, W.Brooks, R. Herliansyah, F.K.C. Hui, S. Taskinen, D.I. Warton, B. van der Veen.
The package available in
GitHub: https://github.com/JenniNiku/gllvm
CRAN: https://CRAN.R-project.org/package=gllvm
Package installation:

# From CRAN
install.packages(gllvm)
# OR
# From GitHub using devtools package's function install_github
devtools::install_github("JenniNiku/gllvm")

Problems?

gllvm package depends on R packages TMB and mvabund, try to install these first.

GLLVMs are computationally intensive to fit due the integral in log-likelihood.
gllvm package overcomes computational problems by applying closed form approximations to log-likelihood and using automatic differentiation in C++ to accelerate computation times (TMB^[Kasper Kristensen, Anders Nielsen, Casper W. Berg, Hans Skaug, Bradley M. Bell (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21]).
Estimation is performed using either variational approximation (VA^[Hui, F. K. C., Warton, D., Ormerod, J., Haapaniemi, V., and Taskinen, S. (2017). Variational approximations for generalized linear latent variable models. Journal of Computational and Graphical Statistics. Journal of Computational and Graphical Statistics, 26:35-43]), extended variational approximation method (EVA^[Korhonen, P., Hui, F. K. C., Niku, J., and Taskinen, S. (2021). Fast, universal estimation of latent variable models using extended variational approximations, arXiv:2107.02627 .]) or Laplace approximation (LA^[Niku, J., Warton, D. I., Hui, F. K. C., and Taskinen, S. (2017). Generalized linear latent variable models for multivariate count and biomass data in ecology. Journal of Agricultural, Biological, and Environmental Statistics, 22:498-522.]) method implemented via R package TMB.
VA method is faster and more accurate than LA, but not applicable for all distributions and link functions.
Using gllvm we can fit
GLLVM without covariates gives model-based ordination and biplots
GLLVM with environmental covariates for studying factors explaining species abundance
Fourth corner models with latent variables for studying environmental-trait interactions
GLLVM without latent variables fits basic multivariate GLMs
Additional tools: model checking, model selection, inference, visualization.

Distributions

| Response | Distribution | Method | Link | | ----------- |:------------:|:------- |:------- | |Counts | Poisson | VA/LA |log | | | NB | VA/LA |log | | | ZIP | LA |log | |Binary | Bernoulli | VA/LA |probit | | | | EVA/LA |logit | |Ordinal | Ordinal | VA |probit | |Normal | Gaussian | VA/LA |identity| |Positive continuous| Gamma | VA/LA |log| |Non-negative continuous| Exponential | VA/LA |log| |Biomass | Tweedie | LA/EVA |log | |Percent cover| beta | LA/EVA |probit/logit |

Data input

Main function of the gllvm package is gllvm(), which can be used to fit GLLVMs for multivariate data with the most important arguments listed in the following:

gllvm(y = NULL, X = NULL, TR = NULL, family, num.lv = 2, 
 formula = NULL, method = "VA", row.eff = FALSE, n.init=1, starting.val ="res", ...)

y: matrix of abundances
X: matrix or data.frame of environmental variables
TR: matrix or data.frame of trait variables
family: distribution for responses
num.lv: number of latent variables
method: approximation used "VA" or "LA"
row.eff: type of community level row effects
n.init: number of random starting points for latent variables
starting.val: starting value method

library(gllvm)

Example: Spiders

Abundances of 12 hunting spider species measured as a count at 28 sites^[van der Aart, P. J. M., and Smeenk-Enserink, N. (1975) Correlations between distributions of hunting spiders (Lycosidae, Ctenidae) and environmental characteristics in a dune area. Netherlands Journal of Zoology 25, 1-45.].
Six environmental variables measured at each site.
soil.dry: Soil dry mass
bare.sand: cover of bare sand
fallen.leaves: cover of fallen leaves/twigs
moss: cover of moss
herb.layer: cover of herb layer
reflection: reflection of the soil surface with a cloudless sky

Data fitting

Fitting basic GLLVM $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$ with gllvm:

data("spider", package = "mvabund")
library(gllvm)
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
fitnb

Residual analysis

Residual analysis can be used to assess the appropriateness of the fitted model (eg. in terms of mean-variance relationship).
Randomized quantile/Dunn-Smyth residuals^[Dunn, P. K., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5, 236-244.] are used in the package, as they provide standard normal distributed residuals, even for discrete responses, in the case of a proper model.

par(mfrow = c(1,2))
plot(fitnb, which = 1:2)

Model selection

Information criteria can be used for model selection.
For example, compare distributions or choose suitable number of latent variables.

fitp <- gllvm(y = spider$abund, family = poisson(), num.lv = 2)
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
AIC(fitp)
AIC(fitnb)

Exercises

Try to do these exercises for the next 10 minutes, as many as time is enough for.

E1. Load spider data from mvabund package and take a look at the dataset.

library(gllvm)
data("spider", package = "mvabund")
# more info: 
# ?spider

Show the answers.

1. Print the data and covariates and draw a boxplot of the data.

# response matrix:
spider$abund
# Environmental variables
spider$x
# Plot data using boxplot:
boxplot(spider$abund)

E2. Fit GLLVM to spider data with a suitable distribution. Data consists of counts of spider species.

# Take a look at the function documentation for help: 
?gllvm

Show the answers.

2. Response variables in spider data are counts, so Poisson, negative binomial and zero inflated Poisson are possible. However, ZIP is implemented only with Laplace method, so it need to be noticed, that if models are fitted with different methods they can not be compared with information criteria. Let's try just with a Poisson and NB. NOTE THAT the results may not be exactly the same as below, as the initial values for each model fit are slightly different, so the results may also differ slightly.

# Fit a GLLVM to data
fitp <- gllvm(y = spider$abund, family = poisson(), num.lv = 2)
fitp
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
fitnb

Based on AIC, NB distribution suits better. How about residual analysis: NOTE THAT The package uses randomized quantile residuals so each time you plot the residuals, they look a little different.

# Fit a GLLVM to data
plot(fitp)
plot(fitnb)

You could do these comparisons with Laplace method as well, using the code below, and it would give the same conclusion that NB distribution suits best:

fitLAp <- gllvm(y = spider$abund, family = poisson(), method = "LA", num.lv = 2)
fitLAnb <- gllvm(y = spider$abund, family = "negative.binomial", method = "LA", num.lv = 2)
fitLAzip <- gllvm(y = spider$abund, family = "ZIP", method = "LA", num.lv = 2)
AIC(fitLAp)
AIC(fitLAnb)
AIC(fitLAzip)

E3. Explore the fitted model. Where are the estimates for parameters? What about predicted latent variables? Standard errors?

Show the answers.

3. Lets explore the fitted model:

# Parameters:
coef(fitnb)
# Where are the predicted latent variable values? just fitp$lvs or
getLV(fitnb)
# Standard errors for parameters:
fitnb$sd

E4. Fit model with different numbers of latent variables.

Show the answers.

4. Default number of latent variables is 2. Let's try 1 and 3 latent variables as well:

# In exercise 2, we fitted GLLVM with two latent variables 
fitnb
# How about 1 or 3 LVs
fitnb1 <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 1)
fitnb1
getLV(fitnb1)
fitnb3 <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 3)
fitnb3
getLV(fitnb3)

E5. Include environmental variables to the GLLVM and explore the model fit.

Show the answers.

5. Environmental variables can be included with an argument X:

fitnbx <- gllvm(y = spider$abund, X = spider$x, family = "negative.binomial", seed = 123, num.lv = 2)
fitnbx
coef(fitnbx)
# confidence intervals for parameters:
confint(fitnbx)

Problems? See hints:

I have problems in model fitting. My model converges to infinity or local maxima: GLLVMs are complex models where starting values have a big role. Choosing a different starting value method (see argument starting.val) or use multiple runs and pick up the one giving highest log-likelihood value using argument n.init. More variation to the starting points can be added with jitter.var.

My results does not look the same as in answers: The results may not be exactly the same as in the answers, as the initial values for each model fit are slightly different, so the results may also differ slightly.

Ordination

GLLVM as a model based ordination method

GLLVMs can be used as a model-based approach to unconstrained ordination by including two latent variables in the model: $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$
The latent variable term try to capture the underlying factors driving species abundances at sites.
Predictions for the two latent variables, $\boldsymbol{\hat u}i=(\hat u{i1}, \hat u_{i2})$, then provide coordinates for sites in the ordination plot and then provides a graphical representation of which sites are similar in terms of their species composition.

Ordination plot

ordiplot() produces ordination plots based on fitted GLLVMs.
Uncertainty of the ordination points in model based ordination can be assessed with prediction regions based on the prediction errors of latent variables.
Prediction regions may also help interpreting which differences between ordination points are really a sign of the difference between species composition at those sites.

par(mfrow=c(1,1))
ordiplot(fitnb, predict.region = TRUE, ylim=c(-2.5,2.5), xlim=c(-2,3))

Biplot

Between species correlations can be visualized with biplot^[Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58, 453-467] by adding latent variable loadings $\boldsymbol{\theta}_j$ to the ordination of sites, by producing a biplot, (argument biplot = TRUE in ordiplot()).
In a biplot latent variables and their loadings are rotated so that the LV loadings of the species are in the same direction with the sites where they are most abundant.
Biplot can be used for finding groups of correlated species or finding indicator species common at specific sites.
For example, species named Pardlugu is common in the group of sites located on the bottom (eg sites 8, 19, 20 and 21).
x and y axes may help the interpretation.

ordiplot(fitnb, biplot = TRUE)
abline(h = 0, v = 0, lty=2)

Environmental gradients

The potential impact of environmental variables on species communities can be viewed by coloring ordination points according to the variables.
For example, species named Pardlugu seems to prefer sites with lot of dry soil mass and low reflection of the soil surface with a cloudless sky.

# Arbitrary color palette, a vector length of 20. Can use, for example, colorRampPalette from package grDevices
rbPal <- c("#00FA9A", "#00EC9F", "#00DFA4", "#00D2A9", "#00C5AF", "#00B8B4", "#00ABB9", "#009DBF", "#0090C4", "#0083C9", "#0076CF", "#0069D4", "#005CD9", "#004EDF", "#0041E4", "#0034E9", "#0027EF", "#001AF4", "#000DF9", "#0000FF")
X <- spider$x
par(mfrow = c(3,2), mar=c(4,4,2,2))
for(i in 1:ncol(X)){
Col <- rbPal[as.numeric(cut(X[,i], breaks = 20))]
ordiplot(fitnb, symbols = T, s.colors = Col, main = colnames(X)[i], 
         biplot = TRUE)
}

Here environmental gradients stand out quite clearly, indicating that, at least, some of the differences in species compositions at sites can be explained by the differences in environmental conditions.
The next step would be to include covariates to the model to study more precisely the effects of environmental variables: $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{x}i'\boldsymbol{\beta}{j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$

Any scripts or data that you put into this service are public.

gllvm documentation built on April 11, 2025, 6:12 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

gllvm
Generalized Linear Latent Variable Models

In gllvm: Generalized Linear Latent Variable Models

Introduction to gllvm

R package gllvm

Distributions

Data input

Example: Spiders

Data fitting

Residual analysis

Model selection

Exercises

Ordination

GLLVM as a model based ordination method

Ordination plot

Biplot

Environmental gradients

Try the gllvm package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

gllvm Generalized Linear Latent Variable Models

In gllvm: Generalized Linear Latent Variable Models

Introduction to gllvm

R package gllvm

Distributions

Data input

Example: Spiders

Data fitting

Residual analysis

Model selection

Exercises

Ordination

GLLVM as a model based ordination method

Ordination plot

Biplot

Environmental gradients

Try the gllvm package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

gllvm
Generalized Linear Latent Variable Models