knitr::opts_chunk$set( echo = TRUE, collapse = TRUE, warning = FALSE, fig.width=5, fig.height=5, fig.align = "center", dev = "png", fig.pos = 'H' )
The R package gllvm fits generalized linear latent variable models (GLLVMs) for multivariate data^[Niku, J., F.K.C. Hui, S. Taskinen, and D.I. Warton. 2019. Gllvm - Fast Analysis of Multivariate Abundance Data with Generalized Linear Latent Variable Models in R. Methods in Ecology and Evolution, 10: 2173–82].
Developed by J. Niku, W. Brooks, R. Herliansyah, F.K.C. Hui, S. Taskinen, D.I. Warton, B. van der Veen.
The package is available on CRAN and GitHub.
Package installation:
# From CRAN
install.packages("gllvm")
# OR
# From GitHub using devtools package's function install_github
devtools::install_github("JenniNiku/gllvm")
Problems?
The gllvm package depends on the R packages TMB and mvabund, so try installing these first.
GLLVMs are computationally intensive to fit due to the integral in the log-likelihood.
The gllvm package overcomes these computational problems by applying closed-form approximations to the log-likelihood and by using automatic differentiation in C++ to reduce computation times (TMB^[Kasper Kristensen, Anders Nielsen, Casper W. Berg, Hans Skaug, Bradley M. Bell (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21]).
Estimation is performed using either the variational approximation (VA^[Hui, F. K. C., Warton, D., Ormerod, J., Haapaniemi, V., and Taskinen, S. (2017). Variational approximations for generalized linear latent variable models. Journal of Computational and Graphical Statistics, 26:35-43]), the extended variational approximation (EVA^[Korhonen, P., Hui, F. K. C., Niku, J., and Taskinen, S. (2021). Fast, universal estimation of latent variable models using extended variational approximations, arXiv:2107.02627.]) or the Laplace approximation (LA^[Niku, J., Warton, D. I., Hui, F. K. C., and Taskinen, S. (2017). Generalized linear latent variable models for multivariate count and biomass data in ecology. Journal of Agricultural, Biological, and Environmental Statistics, 22:498-522.]) method, all implemented via the R package TMB.
The VA method is faster and more accurate than LA, but it is not applicable to all distributions and link functions.
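To make the source of the computational cost concrete, the marginal log-likelihood that VA, EVA and LA approximate can be written as below. The notation is an assumption added here for illustration: $n$ sites, $m$ species, $\boldsymbol{\Psi}$ for all model parameters, $f$ for the response density given the latent variables, and $h$ for the density of the latent variables $\boldsymbol{u}_i$ (typically multivariate standard normal).

$$\ell(\boldsymbol{\Psi}) = \sum_{i=1}^{n} \log \int \prod_{j=1}^{m} f(y_{ij} \mid \boldsymbol{u}_i, \boldsymbol{\Psi}) \, h(\boldsymbol{u}_i) \, d\boldsymbol{u}_i$$

The integral over $\boldsymbol{u}_i$ generally has no closed form for non-Gaussian responses, which is why an approximation is needed.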
Using gllvm we can fit
A GLLVM without latent variables reduces to a basic multivariate GLM.
Additional tools: model checking, model selection, inference, visualization.
| Response | Distribution | Method | Link |
| ----------- |:------------:|:------- |:------- |
| Counts | Poisson | VA/LA | log |
| | NB | VA/LA | log |
| | ZIP | LA | log |
| Binary | Bernoulli | VA/LA | probit |
| | | EVA/LA | logit |
| Ordinal | Ordinal | VA | probit |
| Normal | Gaussian | VA/LA | identity |
| Positive continuous | Gamma | VA/LA | log |
| Non-negative continuous | Exponential | VA/LA | log |
| Biomass | Tweedie | LA/EVA | log |
| Percent cover | beta | LA/EVA | probit/logit |
The main function of the gllvm package is `gllvm()`, which can be used to fit GLLVMs for multivariate data. Its most important arguments are listed below:
gllvm(y = NULL, X = NULL, TR = NULL, family, num.lv = 2, formula = NULL, method = "VA", row.eff = FALSE, n.init=1, starting.val ="res", ...)
library(gllvm)
- `soil.dry`: soil dry mass
- `bare.sand`: cover of bare sand
- `fallen.leaves`: cover of fallen leaves/twigs
- `moss`: cover of moss
- `herb.layer`: cover of herb layer
- `reflection`: reflection of the soil surface with a cloudless sky

Fitting a basic GLLVM $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$ with gllvm:
library(mvabund)
data("spider")
library(gllvm)
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
fitnb
Residual analysis can be used to assess the appropriateness of the fitted model (e.g. in terms of the mean-variance relationship).
Randomized quantile (Dunn-Smyth) residuals^[Dunn, P. K., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5, 236-244.] are used in the package because, when the model is correct, they follow a standard normal distribution even for discrete responses.
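The idea behind these residuals can be sketched in a few lines of base R. This is a minimal illustration for Poisson counts, not the package's own implementation; the means `mu` and the simulated counts `y` are hypothetical stand-ins for fitted values and observed data.

```r
# Minimal sketch of Dunn-Smyth (randomized quantile) residuals for Poisson counts.
# 'mu' plays the role of fitted means; here it is simulated (an assumption).
set.seed(1)
n  <- 500
mu <- exp(0.5 + rnorm(n, sd = 0.3))
y  <- rpois(n, lambda = mu)   # counts generated from the same model

# For a discrete response, draw a uniform value between F(y - 1) and F(y),
# then map it to the standard normal scale.
a <- ppois(y - 1, lambda = mu)  # lower limit; equals 0 when y = 0
b <- ppois(y, lambda = mu)      # upper limit
u <- runif(n, min = a, max = b)
ds_res <- qnorm(u)

# If the model is correct, the residuals should look standard normal.
c(mean = mean(ds_res), sd = sd(ds_res))
```

Because of the uniform draw, rerunning without a fixed seed gives slightly different residuals each time, which is also why residual plots from the package vary between calls.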
par(mfrow = c(1,2))
plot(fitnb, which = 1:2)
fitp <- gllvm(y = spider$abund, family = poisson(), num.lv = 2)
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
AIC(fitp)
AIC(fitnb)
Spend the next 10 minutes on these exercises and complete as many as time allows.
E1. Load spider data from mvabund package and take a look at the dataset.
library(gllvm)
# Package mvabund is loaded with gllvm, so the data can be loaded directly with data().
data("spider")
# more info:
# ?spider
Show the answers.
1. Print the data and covariates and draw a boxplot of the data.
# Response matrix:
spider$abund
# Environmental variables:
spider$x
# Plot data using boxplot:
boxplot(spider$abund)
E2. Fit GLLVM to spider data with a suitable distribution. Data consists of counts of spider species.
# Take a look at the function documentation for help:
?gllvm
Show the answers.
2. The response variables in the spider data are counts, so Poisson, negative binomial (NB) and zero-inflated Poisson (ZIP) are possible. However, ZIP is implemented only with the Laplace method, and models fitted with different methods cannot be compared using information criteria. Let's try just Poisson and NB. Note that your results may not be exactly the same as below, because the initial values for each model fit are slightly different.

# Fit a GLLVM to data
fitp <- gllvm(y=spider$abund, family = poisson(), num.lv = 2)
fitp
fitnb <- gllvm(y=spider$abund, family = "negative.binomial", num.lv = 2)
fitnb

Based on AIC, the NB distribution fits better. How about residual analysis? Note that the package uses randomized quantile residuals, so the residuals look a little different each time you plot them.

# Residual plots for both models
plot(fitp)
plot(fitnb)

You could do these comparisons with the Laplace method as well, using the code below, and it would give the same conclusion that the NB distribution fits best:

fitLAp <- gllvm(y=spider$abund, family = poisson(), method = "LA", num.lv = 2)
fitLAnb <- gllvm(y=spider$abund, family = "negative.binomial", method = "LA", num.lv = 2)
fitLAzip <- gllvm(y=spider$abund, family = "ZIP", method = "LA", num.lv = 2)
AIC(fitLAp)
AIC(fitLAnb)
AIC(fitLAzip)
E3. Explore the fitted model. Where are the estimates for parameters? What about predicted latent variables? Standard errors?
Show the answers.
3. Let's explore the fitted model:

# Parameters:
coef(fitnb)
# Where are the predicted latent variable values? Just fitnb$lvs or
getLV(fitnb)
# Standard errors for parameters:
fitnb$sd
E4. Fit model with different numbers of latent variables.
Show the answers.
4. Default number of latent variables is 2. Let's try 1 and 3 latent variables as well:
# In exercise 2, we fitted a GLLVM with two latent variables
fitnb
# How about 1 or 3 LVs?
fitnb1 <- gllvm(y=spider$abund, family = "negative.binomial", num.lv = 1)
fitnb1
getLV(fitnb1)
fitnb3 <- gllvm(y=spider$abund, family = "negative.binomial", num.lv = 3)
fitnb3
getLV(fitnb3)
E5. Include environmental variables in the GLLVM and explore the model fit.
Show the answers.
5. Environmental variables can be included with the argument `X`:

fitnbx <- gllvm(y = spider$abund, X = spider$x, family = "negative.binomial", seed = 123, num.lv = 2)
fitnbx
coef(fitnbx)
# Confidence intervals for parameters:
confint(fitnbx)
Problems? See hints:
I have problems with model fitting. My model diverges to infinity or converges to a local maximum:
GLLVMs are complex models in which starting values play a big role. Try a different starting value method (see argument `starting.val`), or use multiple runs and pick the one giving the highest log-likelihood value using argument `n.init`. More variation can be added to the starting points with `jitter.var`.
My results do not look the same as in the answers: The results may not be exactly the same as in the answers, because the initial values for each model fit are slightly different.
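The multiple-runs idea can be illustrated with a generic base R sketch: optimize from several starting points and keep the run with the highest objective value. The toy function `toy_loglik` and the starting points are hypothetical and chosen only because the function has two local maxima; the way gllvm generates and uses its starting values internally may differ.

```r
# Toy "log-likelihood" with two local maxima (hypothetical, for illustration only).
toy_loglik <- function(par) -(par^2 - 4)^2 / 10 + par / 2

# Run the optimizer from several starting points.
starts <- c(-3, -1, 0.5, 3)
fits <- lapply(starts, function(s) {
  # fnscale = -1 makes optim() maximize instead of minimize
  optim(par = s, fn = toy_loglik, method = "BFGS",
        control = list(fnscale = -1))
})

# Keep the run that reached the highest objective value.
values <- vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.max(values)]]

data.frame(start = starts,
           estimate = vapply(fits, function(f) f$par, numeric(1)),
           value = values)
best$par
```

Runs that start on the left typically end in the lower of the two maxima, so relying on a single start could report a worse solution than the best of several.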
GLLVMs can be used as a model-based approach to unconstrained ordination by including two latent variables in the model: $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$
The latent variable term tries to capture the underlying factors driving species abundances at sites.
Predictions for the two latent variables, $\boldsymbol{\hat u}_i=(\hat u_{i1}, \hat u_{i2})$, then provide coordinates for sites in the ordination plot, which gives a graphical representation of which sites are similar in terms of their species composition.
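To see what the model equation implies for data, counts can be simulated from it directly. The following base R sketch assumes a Poisson response with a log link; the sizes `n` and `m` and all parameter values are arbitrary choices for illustration and are not estimates from any dataset.

```r
# Simulate counts from log(E(y_ij)) = beta0_j + u_i' theta_j with two latent variables.
set.seed(42)
n <- 28        # number of sites (illustrative)
m <- 12        # number of species (illustrative)
num_lv <- 2    # number of latent variables

beta0 <- rnorm(m, mean = 1, sd = 0.5)                # species-specific intercepts
theta <- matrix(rnorm(m * num_lv, sd = 0.7), m, num_lv)  # species loadings
u     <- matrix(rnorm(n * num_lv), n, num_lv)        # site scores (latent variables)

# Linear predictor: an n x m matrix with one row per site and one column per species
eta <- matrix(beta0, n, m, byrow = TRUE) + u %*% t(theta)

# Poisson counts with a log link
y <- matrix(rpois(n * m, lambda = exp(eta)), n, m)
dim(y)

# The rows of u are the coordinates an ordination plot aims to recover
plot(u, xlab = "Latent variable 1", ylab = "Latent variable 2")
```

Sites with similar rows in `u` have similar expected abundances across all species, which is what makes the predicted latent variables usable as ordination coordinates.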
`ordiplot()` produces ordination plots based on fitted GLLVMs.

par(mfrow=c(1,1))
ordiplot(fitnb, predict.region = TRUE, ylim=c(-2.5,2.5), xlim=c(-2,3))

A biplot is produced by setting `biplot = TRUE` in `ordiplot()`.

ordiplot(fitnb, biplot = TRUE)
abline(h = 0, v = 0, lty=2)

# Arbitrary color palette, a vector of length 20. Can use, for example, colorRampPalette from package grDevices
rbPal <- c("#00FA9A", "#00EC9F", "#00DFA4", "#00D2A9", "#00C5AF", "#00B8B4", "#00ABB9", "#009DBF", "#0090C4", "#0083C9", "#0076CF", "#0069D4", "#005CD9", "#004EDF", "#0041E4", "#0034E9", "#0027EF", "#001AF4", "#000DF9", "#0000FF")
X <- spider$x
par(mfrow = c(3,2), mar=c(4,4,2,2))
for(i in 1:ncol(X)){
  # Bin each covariate into 20 intervals and pick the matching color
  Col <- rbPal[as.numeric(cut(X[,i], breaks = 20))]
  ordiplot(fitnb, symbols = TRUE, s.colors = Col, main = colnames(X)[i], biplot = TRUE)
}
Here environmental gradients stand out quite clearly, indicating that, at least, some of the differences in species compositions at sites can be explained by the differences in environmental conditions.
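The hard-coded 20-color palette used for these plots could also be generated with `colorRampPalette()` from grDevices, as the code comment suggests. The sketch below interpolates between the first and last colors of that palette and maps a numeric covariate to colors by binning; the covariate `x` is a hypothetical vector used only to keep the example self-contained.

```r
# Generate a 20-step palette between the two end colors of the original vector.
pal <- colorRampPalette(c("#00FA9A", "#0000FF"))(20)

# Hypothetical covariate values (stand-in for one column of an environmental matrix).
x <- c(0.2, 1.5, 3.1, 0.9, 2.4, 4.0, 3.6, 0.1)

# cut() splits the range of x into 20 equal-width bins; the bin index selects the color.
bin  <- as.integer(cut(x, breaks = 20))
cols <- pal[bin]

# Small values get colors from the green end, large values from the blue end.
data.frame(x = x, bin = bin, color = cols)
```

The resulting `cols` vector has one color per observation and could be passed to a plotting argument that accepts per-point colors.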
The next step would be to include covariates in the model to study more precisely the effects of environmental variables: $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{x}_i'\boldsymbol{\beta}_{j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$