In ghislainv/jSDM: Joint Species Distribution Models

knitr::opts_chunk$set(
fig.align = "center",
fig.retina=1,
fig.width = 6, fig.height = 6,
cache = FALSE,
collapse = TRUE,
comment = "#>",
highlight = TRUE
)

Model definition

Referring to the models used in the articles @Warton2015 and @Albert1993, we define the following model :

$$ \mathrm{probit}(\theta_{ij}) =\alpha_i + X_i.\beta_j + W_i.\lambda_j $$

Link function probit: $\mathrm{probit}: q \rightarrow \Phi^{-1}(q)$ where $\Phi$ correspond to the distribution function of the reduced centered normal distribution.
Response variable: $Y=(y_{ij})^{i=1,\ldots,nsite}_{j=1,\ldots,nsp}$ with:

$$y_{ij}=\begin{cases} 0 & \text{ if species $j$ is absent on the site $i$}\ 1 & \text{ if species $j$ is present on the site $i$}. \end{cases}$$

Latent variable $z_{ij} = \alpha_i + X_i.\beta_j + W_i.\lambda_j + \epsilon_{i,j}$, with $\forall (i,j) \ \epsilon_{ij} \sim \mathcal{N}(0,1)$ and such that:

$$y_{ij}=\begin{cases} 1 & \text{if} \ z_{ij} > 0 \ 0 & \text{otherwise.} \end{cases}$$

It can be easily shown that: $y_{ij} \sim \mathcal{B}ernoulli(\theta_{ij})$.

Latent variables: $W_i=(W_i^1,\ldots,W_i^q)$ where $q$ is the number of latent variables considered, which has to be fixed by the user (by default $q=2$). We assume that $W_i \sim \mathcal{N}(0,I_q)$ and we define the associated coefficients: $\lambda_j=(\lambda_j^1,\ldots, \lambda_j^q)'$. We use a prior distribution $\mathcal{N}(0,10)$ for all lambdas not concerned by constraints to $0$ on upper diagonal and to strictly positive values on diagonal.
Explanatory variables:
bioclimatic data about each site. $X=(X_i){i=1,\ldots,nsite}$ with $X_i=(x{i}^0,x_{i}^1,\ldots,x_{i}^k,\ldots,x_{i}^p)\in \mathbb{R}^{p+1}$ where $p$ is the number of bioclimatic variables considered and $x_{i0}=1,\forall i$.
traits data about each species. $T=(T_j){j=1,\ldots,nspecies}$ with $T_j=(t{j}^0,t_{j}^1,\ldots,t_{j}^q,\ldots,t_{j}^{nt})\in \mathbb{R}^{nt+1}$ where $nt$ is the number of species specific traits considered and $t_j^0=1,\forall j$.
The corresponding regression coefficients for each species $j$ are noted : $\beta_j=(\beta_j^0, \beta_j^1,\ldots, \beta_j^k, \ldots, \beta_j^p)'$ where $\beta_j^0$ correspond to the intercept for species $j$. We use a prior distribution $\beta_j \sim \mathcal{N}(\mu_j,V_\beta)$ such as $\mu_{jk} = \sum\limits_{q=0}^{nt} t_{jq}.\gamma_{qk}$ takes different values for each species. We assume that $\gamma_{qk} \sim \mathcal{N}(\mu_{\gamma_{qk}},V_{\gamma_{qk}})$ as prior distribution.
$\alpha_i$ represents the random effect of site $i$ such as $\alpha_i \sim \mathcal{N}(0,V_{\alpha})$ and we assume that $V_{\alpha} \sim \mathcal {IG}(\text{shape}=0.5, \text{rate}=0.005)$ as prior distribution by default.

Dataset

Presence-absence of alpine plants

(ref:cap-alpine-plant) Alpine plants [@Choler2005].

knitr::include_graphics("figures/alpine_plants.png")

We consider alpine plants in Aravo (Valloire), south east France [@Choler2005]. The data are available from the R package ade4 [@Dray2007]. The original dataset includes abundance data for 82 species in 75 sites. This data-set is also available in the jSDM-package R package. It can be loaded with the data() command. The aravo data-set is a list containing a data.frame with the abundance values of 82 species (columns) in 75 sites (rows), a data.frame with the measurements of 6 environmental variables for the sites and data.frame with the measurements of 8 traits for the species.

library(jSDM)
data(aravo)
aravo$spe[1:5, 1:5]
head(aravo$env)

We transform abundance into presence-absence data and remove species with less than 5 presences. We also look at the number of observations per site.

# Transform abundance into presence-absence
PA_aravo <- aravo$spe
# colnames(PA_aravo) <- aravo$spe.names
PA_aravo[PA_aravo > 0] <- 1
# Remove species with less than 5 presences
rare_sp <- which(apply(PA_aravo, 2, sum) < 5)
PA_aravo <- PA_aravo[, -rare_sp]
# Number of sites and species
nsite <- dim(PA_aravo)[1]
nsite
nsp <- dim(PA_aravo)[2]
nsp
# Number of observations per site
nobs_site <- apply(PA_aravo, 1, sum)
nobs_site
# Number of observations per species
nobs_sp <- apply(PA_aravo, 2, sum)
nobs_sp

Environmental variables

The environmental variables are:

Aspect: Relative south aspect (opposite of the sine of aspect with flat coded 0).
Slope: Slope inclination (degrees).
Form: Microtopographic landform index: 1 (convexity); 2 (convex slope); 3 (right slope); 4 (concave slope); 5 (concavity).
Snow: Mean snowmelt date (Julian day) averaged over 1997-1999.
PhysD: Physical disturbance, i.e., percentage of unvegetated soil due to physical processes.
ZoogD: Zoogenic disturbance, i.e., quantity of unvegetated soil due to marmot activity: no; some; high.

As a first approach, we just select the "Snow" variable considering a quadratic orthogonal polynomial.

p <- poly(aravo$env$Snow, 2)
Env_aravo <- data.frame(cbind(1, p))
names(Env_aravo) <- c("int", "snow", "snow2")
head(Env_aravo)
# Number of environmental variables plus intercept
np <- ncol(Env_aravo)

Species traits

The species traits available for the alpine plants are:

Height: Vegetative height (cm)
Spread: Maximum lateral spread of clonal plants (cm)
Angle: Leaf elevation angle estimated at the middle of the lamina
Area: Area of a single leaf
Thick: Maximum thickness of a leaf cross section (avoiding the midrib)
SLA Specific leaf area
Nmass: Mass-based leaf nitrogen content
Seed: Seed mass

We want to analyze the response of alpine plants to snowmelt date according to their SLA.

As a first approach, we just integer the interaction between the mean snowmelt date Snow and the specific leaf area SLA as an explanatory factor of the model. We also normalize the continuous species traits to facilitate MCMC convergence.

head(aravo$traits)
Trait_aravo <- scale(aravo$traits[-rare_sp,])

Parameter inference

We use the jSDM_binomial_probit() function to fit the JSDM (increase the number of iterations to achieve convergence).

mod <- jSDM_binomial_probit(
  # Chains
  burnin=5000, mcmc=5000, thin=5,
  # Response variable 
  presence_data = PA_aravo, 
  # Explanatory variables 
  site_formula = ~ snow + snow2,   
  site_data = Env_aravo,
  trait_formula = ~ snow:SLA,
  trait_data = Trait_aravo,
  # Model specification 
  n_latent=2, site_effect="random",
  # Starting values
  alpha_start=0, beta_start=0,
  lambda_start=0, W_start=0,
  V_alpha=1, 
  # Priors
  shape_Valpha=0.1,
  rate_Valpha=0.1,
  mu_beta=0, V_beta=c(10,rep(1,np-1)),
  mu_lambda=0, V_lambda=1,
  # Various 
  seed=1234, verbose=1)

Analysis of the results

np <- nrow(mod$model_spec$beta_start)
## gamma corresponding to each covariable 
par(mfrow=c(2,2), oma=c(0,0,2,0))
for (p in 1:np){
plot(mod$mcmc.gamma[[p]])
   title(outer=TRUE, main=paste0("Covariable : ",
                                  names(mod$mcmc.gamma)[p]), cex.main=1.5)
}
## beta_j of the first two species
par(mfrow=c(np,2), oma=c(0,0,2,0))
for (j in 1:2) {
    plot(mod$mcmc.sp[[j]][,1:np])
    title(outer=TRUE, main=paste0( "species ", j ," : ",
                                   colnames(PA_aravo)[j]), cex.main=1.5)
}

## lambda_j of the first two species
n_latent <- mod$model_spec$n_latent
par(mfrow=c(n_latent,2), oma=c(0,0,2,0))
for (j in 1:2) {
    plot(mod$mcmc.sp[[j]][,(np+1):(np+n_latent)])
    title(outer=TRUE, main=paste0( "species ", j ," : ",
                                   colnames(PA_aravo)[j]), cex.main=1.5)
}

## species effects for all species
# par(mfrow=c(2,2), oma=c(0,0,2,0))
# plot(mcmc.list(mod$mcmc.sp))
# title(outer=TRUE, main="All species effects")

## Latent variables W_i for the first two sites
par(mfrow=c(2,2))
for (l in 1:n_latent) {
  for (i in 1:2) {
  coda::traceplot(mod$mcmc.latent[[paste0("lv_",l)]][,i],
                  main = paste0("Latent variable W_", l, ", site ", rownames(PA_aravo)[i]))
  coda::densplot(mod$mcmc.latent[[paste0("lv_",l)]][,i],
                 main = paste0("Latent variable W_", l, ", site ", rownames(PA_aravo)[i]))
  }
}

## alpha_i of the first two sites
plot(mod$mcmc.alpha[,1:2])

## V_alpha
plot(mod$mcmc.V_alpha)
## Deviance
plot(mod$mcmc.Deviance)

## probit_theta
par (mfrow=c(2,1))
hist(mod$probit_theta_latent,
     main = "Predicted probit theta", xlab ="predicted probit theta")
hist(mod$theta_latent,
     main = "Predicted theta", xlab ="predicted theta")

Matrice of correlations

After fitting the jSDM with latent variables, the full species residual correlation matrix $R=(R_{ij})^{i=1,\ldots, nspecies}{j=1,\ldots, nspecies}$ can be derived from the covariance in the latent variables such as : $$\Sigma{ij} = \lambda_i^T .\lambda_j $$, then we compute correlations from covariances : $$R_{i,j} = \frac{\Sigma_{ij}}{\sqrt{\Sigma {ii}\Sigma {jj}}}$$.

We use the plot_residual_cor() function to compute and display the residual correlation matrix :

plot_residual_cor(mod, tl.srt = 10)

Predictions

We use the predict.jSDM() S3 method on the mod object of class jSDM to compute the mean (or expectation) of the posterior distributions obtained and get the expected values of model's parameters.

# Sites and species concerned by predictions :
## 35 sites among the 75
nsite_pred <- 35
## 25 species among the 65
nsp_pred <- 25
Id_species  <- sample(colnames(PA_aravo), nsp_pred)
Id_sites  <- sample(rownames(PA_aravo), nsite_pred)
# Simulate new observations of covariates on those sites 
snow <- runif(nsite_pred,min(aravo$env[Id_sites,"Snow"])-15, max(aravo$env[Id_sites,"Snow"])+15)
p2 <- poly(snow, 2)
simdata <- data.frame(snow=p2[,1],snow2=p2[,2])
# Predictions 
theta_pred <- predict(mod, newdata=simdata,
                      Id_species=Id_species,   Id_sites=Id_sites, type="mean")
hist(theta_pred, main="Predicted theta with simulated data", xlab="predicted theta")