Introduction to hSDM
In hSDM: Hierarchical Bayesian Species Distribution Models

knitr::opts_chunk$set(
    fig.align = "center",
    fig.width = 6, fig.height = 6,
    cache = FALSE,
    collapse = TRUE,
    comment = "#>",
    highlight = TRUE
)

Species distribution models (SDM) are useful tools to explain or predict species range and abundance from various environmental factors. SDM are thus widely used in conservation biology. When using field observations (occurence or abundance data) to fit SDMs, two major problems can arise leading to potential bias in models' results: imperfect detection [@Lahoz-Monfort2014] and spatial dependence of the observations [@Lichstein2002]. In this vignette, we illustrate the use of the hSDM R package wich aims at providing user-friendly statistical functions to account for both imperfect detection and spatial dependence. Package's functions are developped in a hierarchical Bayesian framework. Functions call a Metropolis-within-Gibbs algorithm coded in C to estimate model's parameters. Using compiled C code for the Gibbs sampler reduce drastically the computation time. By making these new statistical tools available to the scientific community, we hope to democratize the use of more complex, but more realistic, statistical models for increasing knowledge in ecology and conserving biodiversity.

Below, we show an example of the use of hSDM for fitting a species distribution model to abundance data for a bird species. Model types available in hSDM are not limited to those described in this example. hSDM includes various model types (for various contexts: imperfect detection, spatial dependence, zero-inflation, land transformation) for both occurrence and abundance data:

simple Binomial and Poisson models [@Clark2007]
site-occupancy models [@MacKenzie2002]
N-mixture models [@Royle2004]
Zero-Inflated Binomial (ZIB) models [@Wilson2016]
Zero-Inflated Poisson (ZIP) models [@Flores2009]
less common mixture models to account for land degradation [@Latimer2006]

All the above models can include an additional intrinsic conditional autoregressive (iCAR) process to account for the spatial dependence of the observations. It should be easy for the user to fit other model types from the example below as function arguments are rather similar from one model to the other.

Data-set

The data-set from @Kery2010b includes repeated count data for the Willow tit (Poecile montanus, a pesserine bird, Fig. \@ref(fig:Willow-tit)) in Switzerland on the period 1999-2003. Data come from the Swiss national breeding bird survey MHB (Monitoring Haüfige Brutvögel). MHB is based on 264 1-km$^2$ sampling units (quadrats) laid out as a grid (Fig. \@ref(fig:sites)). Since 1999, every quadrat has been surveyed two to three times during most breeding seasons (15 April to 15 July). The Willow tit is a widespread but moderately rare bird species. It has a weak song and elusive behaviour and can be rather difficult to detect.

(ref:cap-Willow-tit) Willow tit (Poecile montanus).

knitr::include_graphics("figures/Poecile_montanus.jpg")

We first load some necessary packages. The sp and raster packages will be used for manipulating spatial data.

# Load libraries
library(sp)
library(raster)
library(hSDM)

This data-set from @Kery2010b is available in the hSDM R package. It can be loaded with the data command and formated to be used with hSDM functions. The data is in "wide" format: each line is a site and the repeated count data (from count1 to count3) are in columns. A site is characterized by its x-y geographical coordinates, elevation (in m) and forest cover (in %). The date of each observation (julian date) has been recorded.

# Load Kéry et al. 2010 data
data(data.Kery2010, package="hSDM")
head(data.Kery2010)

We first normalize the data to facilitate MCMC convergence. We discard the forest cover in our example.

# Normalized variables
elev.mean <- mean(data.Kery2010$elevation)
elev.sd <- sd(data.Kery2010$elevation)
juldate.mean <- mean(c(data.Kery2010$juldate1,
                     data.Kery2010$juldate2,
                     data.Kery2010$juldate3),na.rm=TRUE)
juldate.sd <- sd(c(data.Kery2010$juldate1,
                 data.Kery2010$juldate2,
                 data.Kery2010$juldate3),na.rm=TRUE)
data.Kery2010$elevation <- (data.Kery2010$elevation-elev.mean)/elev.sd
data.Kery2010$juldate1 <- (data.Kery2010$juldate1-juldate.mean)/juldate.sd
data.Kery2010$juldate2 <- (data.Kery2010$juldate2-juldate.mean)/juldate.sd
data.Kery2010$juldate3 <- (data.Kery2010$juldate3-juldate.mean)/juldate.sd

We plot the observation sites over a grid of 10 $\times$ 10 km$^2$ cells.

(ref:cap-sites) Location of the 264 10 km$^2$ quadrats of the Swiss national breeding bird survey. Points are located on a grid of 10 km$^2$ cells. The grid is covering the geographical extent of the observation points.

# Landscape and observation sites
sites.sp <- SpatialPointsDataFrame(coords=data.Kery2010[c("coordx","coordy")],
                                   data=data.Kery2010[,-c(1,2)])
xmin <- min(data.Kery2010$coordx)
xmax <- max(data.Kery2010$coordx)
ymin <- min(data.Kery2010$coordy)
ymax <- max(data.Kery2010$coordy)
res_spatial_cell <- 10
ncol <- ceiling((xmax-xmin)/res_spatial_cell)
nrow <- ceiling((ymax-ymin)/res_spatial_cell)
Xmax <- xmin+ncol*res_spatial_cell
Ymax <- ymin+nrow*res_spatial_cell
ext <- extent(c(xmin, Xmax,
                ymin, Ymax))
landscape <- raster(ncols=ncol,nrows=nrow,ext)
values(landscape) <- runif(ncell(landscape),0,1)
landscape.po <- rasterToPolygons(landscape)
par(mar=c(0.1,0.1,0.1,0.1))
plot(landscape.po)
points(sites.sp, pch=16, cex=1, col="black")

One important step if we want to account for spatial dependence is to compute the neighborhood for each cell. In our case, the neighborhood is define by the "king move" in chess, in the height directions around the target cell. To compute the neighbordhood for each cell, we need the number of neighbors (which can vary) and the adjacent cell identifiers. We can use the function adjacent from the raster package to do this.

# Neighborhood
# Rasters must be projected to correctly compute the neighborhood
crs(landscape) <- '+proj=utm +zone=1'
# Cell for each site
cells <- extract(landscape,sites.sp,cell=TRUE)[,1]
# Neighborhood matrix
ncells <- ncell(landscape)
neighbors.mat <- adjacent(landscape, cells=c(1:ncells), directions=8,
                          pairs=TRUE, sorted=TRUE)
# Number of neighbors by cell
n.neighbors <- as.data.frame(table(as.factor(neighbors.mat[,1])))[,2]
# Adjacent cells
adj <- neighbors.mat[,2]

We rearrange the data in two data-sets, one for the observation (or detection) process and one for the suitability (or ecological) process. The data for the observation process is in "long" format. Each row is an observation which is characterized by the site number and the julian date of the observation. The data for the suitability process (at the site level) includes information on sites (coordinates and elevation) by row.

# Arranging data
# data.obs
nsite <- length(data.Kery2010$coordx)
count <- c(data.Kery2010$count1,data.Kery2010$count2,data.Kery2010$count3)
juldate <- c(data.Kery2010$juldate1,data.Kery2010$juldate2,
             data.Kery2010$juldate3)
site <- rep(1:nsite,3)
data.obs <- data.frame(count,juldate,site)
data.obs <- data.obs[!is.na(data.obs$juldate),]
# data.suit
data.suit <- data.Kery2010[c("coordx","coordy","elevation")]
data.suit$cells <- cells
data.suit <- data.suit[-139,] # Removing site 139 with no juldate

We will compare three SDMs with increasing complexity: a first simple Poisson model, a second N-mixture model with imperfect detection, and a third N-mixture iCAR model with imperfect detection and spatial dependence.

Simple Poisson model

To fit a simple Poisson model, we use the function hSDM.poisson. For this model, we slightly rearrange the observation data-set to associate the site altitude to each observation in a "long" format. The statistical model can be written as follow:

\begin{equation} y_i \sim \mathcal{P}oisson(\lambda_i) \ \log(\lambda_i) = X_i \beta \ \end{equation}

# hSDM.poisson
data.pois <- data.obs
data.pois$elevation <- data.suit$elevation[as.numeric(as.factor(data.obs$site))]
mod.Kery2010.pois <- hSDM.poisson(counts=data.pois$count,
                                  suitability=~elevation+I(elevation^2),
                                  data=data.pois, beta.start=0)

# Outputs
summary(mod.Kery2010.pois$mcmc)

# Predictions
npred <- 100
nsamp <- dim(mod.Kery2010.pois$mcmc)[1]
# Abundance-elevation
elev.seq <- seq(500,3000,length.out=npred)
elev.seq.n <- (elev.seq-elev.mean)/elev.sd
beta <- as.matrix(mod.Kery2010.pois$mcmc[,1:3])
tbeta <- t(beta)
X <- matrix(c(rep(1,npred),elev.seq.n,elev.seq.n^2),ncol=3)
N <- matrix(NA,nrow=nsamp,ncol=npred)
for (i in 1:npred) {
    N[,i] <- exp(X[i,] %*% tbeta)
}
N.est.pois <- apply(N,2,mean)
N.q1.pois <- apply(N,2,quantile,0.025)
N.q2.pois <- apply(N,2,quantile,0.975)

N-mixture model for imperfect detection

The model integrates two processes, an ecological process associated to the abundance of the species due to habitat suitability and an observation process that takes into account the fact that the probability of detection of the species is inferior to one.

\begin{equation} \textbf{Ecological process:} \ N_i \sim \mathcal{P}oisson(\lambda_i) \ \log(\lambda_i) = X_i \beta \ ~ \ \textbf{Observation process:} \ y_{it} \sim \mathcal{B}inomial(N_i, \delta_{it}) \ \text{logit}(\delta_{it}) = W_{it} \gamma \ \end{equation}

# hSDM.Nmixture
mod.Kery2010.Nmix <- hSDM.Nmixture(# Observations
                     counts=data.obs$count,
                     observability=~juldate+I(juldate^2),
                     site=data.obs$site,
                     data.observability=data.obs,
                     # Habitat
                     suitability=~elevation+I(elevation^2),
                     data.suitability=data.suit,
                     # Predictions
                     suitability.pred=NULL,
                     # Chains
                     burnin=10000, mcmc=5000, thin=5,
                     # Starting values
                     beta.start=0,
                     gamma.start=0,
                     # Priors
                     mubeta=0, Vbeta=1.0E6,
                     mugamma=0, Vgamma=1.0E6,
                     # Various
                     seed=1234, verbose=1,
                     save.p=0, save.N=0)

# Outputs
summary(mod.Kery2010.Nmix$mcmc)

# Predictions
nsamp <- dim(mod.Kery2010.Nmix$mcmc)[1]
# Abundance-elevation
beta <- as.matrix(mod.Kery2010.Nmix$mcmc[,1:3])
tbeta <- t(beta)
N <- matrix(NA,nrow=nsamp,ncol=npred)
for (i in 1:npred) {
    N[,i] <- exp(X[i,] %*% tbeta)
}
N.est.Nmix <- apply(N,2,mean)
N.q1.Nmix <- apply(N,2,quantile,0.025)
N.q2.Nmix <- apply(N,2,quantile,0.975)
# Detection-Julian date
juldate.seq <- seq(100,200,length.out=npred)
juldate.seq.n <- (juldate.seq-juldate.mean)/juldate.sd
gamma <- as.matrix(mod.Kery2010.Nmix$mcmc[,4:6])
tgamma <- t(gamma)
W <- matrix(c(rep(1,npred),juldate.seq.n,juldate.seq.n^2),ncol=3)
delta <- matrix(NA,nrow=nsamp,ncol=npred)
for (i in 1:npred) {
    delta[,i] <- inv.logit(X[i,] %*% tgamma)
}
delta.est.Nmix <- apply(delta,2,mean)
delta.q1.Nmix <- apply(delta,2,quantile,0.025)
delta.q2.Nmix <- apply(delta,2,quantile,0.975)

N-mixture model with iCAR process

For this model, we assume that the abundance of the species at one site depends on the abundance of the species on neighboring sites. In this model, the ecological process includes an intrinsic conditional autoregressive (iCAR) subprocess to model the spatial dependence between observations.

\begin{equation} \textbf{Ecological process:} \ N_i \sim \mathcal{P}oisson(\lambda_i) \ \log(\lambda_i) = X_i \beta + \rho_{j(i)} \ \text{$\rho_{j(i)}$: spatial random effect for spatial entity $j$ including site $i$} \ ~ \ \textbf{Spatial dependence:} \ \text{An intrinsic conditional autoregressive model (iCAR) is assumed:} \ \rho_j \sim \mathcal{N}ormal(\mu_j, V_{\rho}/n_j) \ \text{$\mu_j$: mean of $\rho_{j^{\prime}}$ in the neighborhood of $j$.} \ \text{$V_{\rho}$: variance of the spatial random effects.} \ \text{$n_j$: number of neighbors for spatial entity $j$.} \ ~ \ \textbf{Observation process:} \ y_{it} \sim \mathcal{B}inomial(N_i, \delta_{it}) \ \text{logit}(\delta_{it}) = W_{it} \gamma \ \end{equation}

# hSDM.Nmixture.iCAR
mod.Kery2010.Nmix.iCAR <- hSDM.Nmixture.iCAR(# Observations
                            counts=data.obs$count,
                            observability=~juldate+I(juldate^2),
                            site=data.obs$site,
                            data.observability=data.obs,
                            # Habitat
                            suitability=~elevation+I(elevation^2),
                            data.suitability=data.suit,
                            # Spatial structure
                            spatial.entity=data.suit$cells,
                            n.neighbors=n.neighbors, neighbors=adj,
                            # Chains
                            burnin=20000, mcmc=10000, thin=10,
                            # Starting values
                            beta.start=0,
                            gamma.start=0,
                            Vrho.start=1,
                            # Priors
                            mubeta=0, Vbeta=1.0E6,
                            mugamma=0, Vgamma=1.0E6,
                            priorVrho="1/Gamma",
                            shape=1, rate=1,
                            # Various
                            seed=1234, verbose=1,
                            save.rho=0, save.p=0, save.N=0)

summary(mod.Kery2010.Nmix.iCAR$mcmc)

(ref:cap-spatial) Estimated spatial random effects. Locations of observation quadrats are represented by dots. The mean abundance on each quadrat is represented by a circle of size proportional to abundance.

# Spatial random effects
rho.pred <- mod.Kery2010.Nmix.iCAR$rho.pred
r.rho.pred <- rasterFromXYZ(cbind(coordinates(landscape),rho.pred))
plot(r.rho.pred)
# Mean abundance by site
ma <- apply(sites.sp@data[,3:5],1,mean,na.rm=TRUE)
points(sites.sp,pch=16,cex=0.5)
points(sites.sp,pch=1,cex=ma/2)

# Predictions
nsamp <- dim(mod.Kery2010.Nmix.iCAR$mcmc)[1]
# Abundance-elevation
beta <- as.matrix(mod.Kery2010.Nmix.iCAR$mcmc[,1:3])
tbeta <- t(beta)
N <- matrix(NA,nrow=nsamp,ncol=npred)
# Simplified way of obtaining samples for rho
rho.samp <- sample(rho.pred,nsamp,replace=TRUE)
for (i in 1:npred) {
    N[,i] <- exp(X[i,] %*% tbeta + rho.samp)
}
N.est.Nmix.iCAR <- apply(N,2,mean)
N.q1.Nmix.iCAR <- apply(N,2,quantile,0.025)
N.q2.Nmix.iCAR <- apply(N,2,quantile,0.975)

# Detection-Julian date
gamma <- as.matrix(mod.Kery2010.Nmix.iCAR$mcmc[,4:6])
tgamma <- t(gamma)
delta <- matrix(NA,nrow=nsamp,ncol=npred)
for (i in 1:npred) {
    delta[,i] <- inv.logit(X[i,] %*% tgamma)
}
delta.est.Nmix.iCAR <- apply(delta,2,mean)
delta.q1.Nmix.iCAR <- apply(delta,2,quantile,0.025)
delta.q2.Nmix.iCAR <- apply(delta,2,quantile,0.975)

Comparing predictions from the three different models

(ref:cap-abundance) Relationship between abundance and altitude. Black: Poisson model, Red: N-mixture model, Green: N-mixture model with iCAR.

# Expected abundance - Elevation
par(mar=c(4,4,1,1),cex=1.4,tcl=+0.5)
plot(elev.seq,N.est.pois,type="l",
     xlim=c(500,3000),
     ylim=c(0,10),
     lwd=2,
     xlab="Elevation (m a.s.l.)",
     ylab="Expected abundance",
     axes=FALSE)
#lines(elev.seq,N.q1.pois,lty=3,lwd=1)
#lines(elev.seq,N.q2.pois,lty=3,lwd=1)
axis(1,at=seq(500,3000,by=500),labels=seq(500,3000,by=500))
axis(2,at=seq(0,10,by=2),labels=seq(0,10,by=2))
# Nmix
lines(elev.seq,N.est.Nmix,lwd=2,col="red")
#lines(elev.seq,N.q1.Nmix,lty=3,lwd=1,col="red")
#lines(elev.seq,N.q2.Nmix,lty=3,lwd=1,col="red")
# Nmix.iCAR
lines(elev.seq,N.est.Nmix.iCAR,lwd=2,col="dark green")
#lines(elev.seq,N.q1.Nmix.iCAR,lty=3,lwd=1,col="dark green")
#lines(elev.seq,N.q2.Nmix.iCAR,lty=3,lwd=1,col="dark green")

(ref:cap-detection) Relationship between detection probability and observation date. Red: N-mixture model, Green: N-mixture model with iCAR.

# Detection probability - Julian date
par(mar=c(4,4,1,1),cex=1.4,tcl=+0.5)
plot(juldate.seq,delta.est.Nmix,type="l",
     xlim=c(100,200),
     ylim=c(0,1),
     lwd=2,
     col="red",
     xlab="Julian date",
     ylab="Detection probability",
     axes=FALSE)
lines(juldate.seq,delta.q1.Nmix,lty=3,lwd=1,col="red")
lines(juldate.seq,delta.q2.Nmix,lty=3,lwd=1,col="red")
axis(1,at=seq(100,200,by=20),labels=seq(100,200,by=20))
axis(2,at=seq(0,1,by=0.2),labels=seq(0,1,by=0.2))
# Nmix.iCAR
lines(juldate.seq,delta.est.Nmix.iCAR,lwd=2,col="dark green")
lines(juldate.seq,delta.q1.Nmix.iCAR,lty=3,lwd=1,col="dark green")
lines(juldate.seq,delta.q2.Nmix.iCAR,lty=3,lwd=1,col="dark green")