In doserjef/spOccupancy: Single-Species, Multi-Species, and Integrated Spatial Occupancy Models

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  comment = "", eval = FALSE
)

\newcommand{\bm}{\boldsymbol} \newcommand{\pg}{\text{P{\'o}lya-Gamma}}

Introduction

This vignette provides statistical details on the MCMC algorithms used to fit the core occupancy models in spOccupancy. Specifically, this vignette will walk through the MCMC algorithms for the following models:

Occupancy model using PGOcc().
Spatial occupancy model using spPGOcc().
Multi-species occupancy model using msPGOcc().
Spatial multi-species occupancy model using spMsPGOcc().
Integrated occupancy model using intPGOcc().
Spatial integrated occupancy model using spIntPGOcc().

We provide detailed descriptions of the joint posterior distributions for each model, how each parameter is updated in the model fitting process, and provide relevant citations to more specific documentation of the approaches where necessary. We also provide information on the composition sampling algorithms used for each model to predict at out-of-sample locations. Details on models in spOccupancy that account for species interactions are provided in a separate vignette.

Single-species occupancy model (`PGOcc`)

Model description

Let $z_j$ be the true presence (1) or absence (0) of a species at site $j$, with $j = 1, \dots, J$. We assume this latent occupancy process can be represented by a Bernoulli process following

\begin{equation} \begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \ &\text{logit}(\psi_j) = \bm{x}^{\top}_j\bm{\beta}, \end{split} (#eq:zPGOcc) \end{equation}

where $\psi_j$ is the probability of occurrence at site $j$, which is a function of site-specific covariates $\bm{X}$ and a vector of regression coefficients ($\bm{\beta}$).

We do not directly observe $z_j$, but rather we observe an imperfect representation of the latent occurrence process. Let $y_{j, k}$ be the observed detection (1) or nondetection (0) of a species of interest at site $j$ during replicate $k$ for each of $k = 1, \dots, K_j$ replicates at each site $j$. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:

\begin{equation} \begin{split} &y_{j, k} \sim \text{Bernoulli}(p_{j, k}z_j), \ &\text{logit}(p_{j, k}) = \bm{v}^{\top}_{j, k}\bm{\alpha}, \end{split} (#eq:yPGOcc) \end{equation}

where $p_{j, k}$ is the probability of detecting a species at site $j$ during replicate $k$ (given it is present at site $j$), which is a function of site and replicate specific covariates $\bm{V}$ and a vector of regression coefficients ($\bm{\alpha}$).

We assume multivariate normal priors for the occurrence ($\bm{\beta}$) and detection ($\bm{\alpha}$) regression coefficients to complete the Bayesian specification of a single-species occupancy model. Traditionally, when estimation occurs in a Bayesian framework, the regression coefficients for occurrence ($\bm{\beta}$) and detection ($\bm{\alpha}$) must be updated using Metropolis updates, which can lead to slow convergence and bad mixing of MCMC chains [@clark2019]. Instead, we introduce $\pg$ latent variables [@polson2013] for both the occurrence and detection portions of the model, which induces Gibbs updates for all parameters in the single-species occupancy model.

More specifically, let $\omega_{j, \beta}$ follow a $\pg$ distribution with parameters 1 and 0, i.e., $\omega_{j, \beta} \sim \text{PG}(1, 0)$. Given this latent variable, we can express the Bernoulli process of $z_j$ as

\begin{equation} \begin{split} \psi_j^{z_j} (1 - \psi_j)^{1 - z_j} &= \frac{\text{exp}(\bm{x}^{\top}j\bm{\beta})^{z_i}}{1 + \text{exp}(\bm{x}^{\top}_j\bm{\beta})} \ &= \text{exp}(\kappa_j\bm{x}^{\top}_j\bm{\beta})\int \text{exp}(-\frac{\omega{j, \beta}}{2}(\bm{x}j^{\top}\bm{\beta})^2)p(\omega{j, \beta} \mid 1, 0) d\omega_{j, \beta}, \end{split} (#eq:polyaGamma) \end{equation}

where $\kappa_j = z_j - 0.5$ and $p(\omega_{j, \beta})$ is the probability density function of a $\pg$ distribution with parameters 1 and 0 [@polson2013]. Similarly, we define $\omega_{j, k, \alpha} \sim \text{PG}(1, 0)$ as a $\pg$ latent variable for the detection portion of the occupancy model, which results in a similar re-expression of the Bernoulli likelihood for $y_{j, k}$ as for $z_j$. These re-expressions of the Bernoulli processes result in Gibbs updates for both the occurrence ($\bm{\beta}$) and detection ($\bm{\alpha}$) regression coefficients when they are assigned normal priors [@polson2013; @clark2019].

Our full joint posterior for a single-species occupancy model thus takes the following form:

$$ \begin{aligned} \lbrack \bm{\alpha}, \bm{\beta}, \bm{z}, \bm{\omega}{\beta}, \bm{\omega}{\alpha} \mid \bm{Y} \rbrack \propto \prod_{j = 1}^J \prod_{k = 1}^{K_j} &\text{Bernoulli}(y_{j, k} \mid p_{j, k}z_j) \times \ & \text{Bernoulli}(z_j \mid \psi_j) \times \ & \text{PG}(\omega_{j, \beta} \mid 1, 0) \times \ & \text{PG}(\omega_{j, k, \alpha} \mid 1, 0) \times \ & \text{Normal}(\bm{\beta} \mid \bm{\mu0}{\beta}, \bm{\Sigma}{\beta}) \times \ & \text{Normal}(\bm{\alpha} \mid \bm{\mu0}{\alpha}, \bm{\Sigma}{\alpha}) \ \end{aligned} $$

MCMC sampler

The $\pg$ data augmentation induces a Gibbs update for all parameters in the single-species occupancy model. We first sample the occurrence and detection auxiliary variables from

\begin{equation} \begin{split} \omega_{j, \beta} \mid \cdot &\sim \text{PG}(1, \bm{x}j^{\top}\bm{\beta}), \ \omega{j, k, \alpha} \mid \cdot &\sim \text{PG}(1, \bm{v}_{j, k}^{\top}\bm{\alpha}), \end{split} (#eq:omegaPostPGOcc) \end{equation}

respectively. We next sample the occurrence regression coefficients $\bm{\beta}$ from

\begin{equation} \bm{\beta} \mid \cdot \sim \text{Normal}\Big([\bm{\Sigma}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}[\bm{X}^{\top}(\bm{z} - 0.5 \bm{1}J) + \bm{\Sigma}{\beta}^{-1}\bm{\mu0}{\beta}], [\bm{\Sigma}{\beta}^{-1} + \bm{X}^{\top}\bm{S}_{\beta}\bm{X}]^{-1}\Big), (#eq:betaPostPGOcc) \end{equation}

where $\bm{S}{\beta}$ is a diagonal $J \times J$ matrix with diagonal entries equal to the latent PG variable values ($\omega{1, \beta}, \dots, \omega_{J, \beta})$.

Similarly, we sample the detection regression coefficients $\bm{\alpha}$ from

\begin{equation} \bm{\alpha} \mid \cdot \sim \text{Normal}\Big([\bm{\Sigma}{\alpha}^{-1} + \bm{\tilde{V}}^{\top}\bm{S}{\alpha}\bm{\tilde{V}}]^{-1}[\bm{\tilde{V}}^{\top}(\bm{\tilde{y}} - 0.5 \bm{1}{J^*}) + \bm{\Sigma}{\alpha}^{-1}\bm{\mu0}{\alpha}], [\bm{\Sigma}{\alpha}^{-1} + \bm{\tilde{V}}^{\top}\bm{S}_{\alpha}\bm{\tilde{V}}]^{-1}\Big). (#eq:alphaPostPGOcc) \end{equation}

The detection regression coefficients $\bm{\alpha}$ are only informed by the locations where $z_j = 1$, since we assume no false positive detections in the standard occupancy model. We define $J^$ as the total number of sites at the current iteration of the MCMC with $z_j = 1$. $\bm{S}{\alpha}$ is a diagonal matrix with diagonal entries equal to the latent $\pg$ variable values ($\omega{1, 1, \alpha}, \dots, \omega_{J^, K_{J^*}, \alpha}$). The matrix $\bm{\tilde{V}}$ is the matrix of detection covariates associated with the sites where $z_j = 1$. Similarly, $\bm{\tilde{y}}$ is a vector of stacked detection-nondetection data values at the entries associated with $z_j = 1$.

Finally, $z_j$ is set to 1 for all sites where there is at least one detection, and thus we only need to sample $z_j$ at sites where there are no detections. Thus, for all locations with no detections, we sample $z_j$ according to

\begin{equation} z_j \mid \cdot \sim \text{Bernoulli}\Bigg(\frac{\psi_j \prod_{k = 1}^{K_j}(1 - p_{j, k})}{1 - \psi_j + \psi_j \prod_{k = 1}^{K_j}(1 - p_{j, k})}\Bigg). (#eq:zPostPGOcc) \end{equation}

Prediction {#singleSpeciesPred}

Prediction for a nonspatial single-species occupancy model is a simple composition sampling problem [@banerjee2003]. Given a set of occurrence covariates at a set of non-sampled locations ($\bm{X}_0$), we can derive the latent occurrence probability and the latent occurrence state at each non-sampled site $j = 1, \dots, J_0$ for each posterior sample $q$ of the MCMC sampler following

\begin{equation} \begin{split} &\text{logit}(\psi_{j}^{(q)}) = \bm{x}_{0, j}^{\top} \bm{\beta}^{(q)}, \ &z^{(q)}_j \sim \text{Bernoulli}(\psi_j^{(q)}). \end{split} (#eq:zPredPGOcc) \end{equation}

Single-species spatial occupancy model (`spPGOcc`) {#spPGOcc}

Gaussian Process formulation

Model description {#modelGP}

We extend the previous single-species occupancy model to incorporate a spatial Gaussian process that accounts for unexplained spatial variation in species occurrence across a region of interest. Let $\bm{s}_j$ denote the geographical coordinates of site $j$ for $j = 1, \dots, J$. The species-specific occurrence probability at site $j$ with coordinates $\bm{s}_j$, $\psi(\bm{s}_j)$, now takes the form

\begin{equation} \text{logit}(\psi(\bm{s}_j)) = \bm{x}(\bm{s}_j)^{\top}\bm{\beta} + \text{w}(\bm{s}_j), \end{equation}

where w($\bm{s}_j$) is a realization from a zero-mean spatial Gaussian Process, i.e.,

\begin{equation} \text{\textbf{w}}(\bm{s}) \sim N(\bm{0}, \bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})). (#eq:fullGP) \end{equation}

We define $\bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})$ as a $J \times J$ covariance matrix that is a function of the distances between any pair of site coordinates $\bm{s}$ and $\bm{s}'$ and a set of parameters $(\bm{\theta})$ that govern the spatial process. The vector $\bm{\theta}$ is equal to $\bm{\theta} = {\sigma^2, \phi, \nu}$, where $\sigma^2$ is a spatial variance parameter, $\phi$ is a spatial decay parameter, and $\nu$ is a spatial smoothness parameter. $\nu$ is only specified when using a Matern correlation function.

The detection portion of the occupancy model remains unchanged from the non-spatial occupancy model and follows Equation \@ref(eq:yPGOcc). Formulation of $\pg$ latent variables is also exactly analogous to the nonspatial model (Equation \@ref(eq:polyaGamma)), with all references to $\psi_j$ now including the latent spatial random effects in addition to the site-level covariates.

Following standard recommendations for point-referenced spatial data [@banerjee2003], we assign an inverse-Gamma prior to the spatial variance parameter and uniform priors to the spatial decay and spatial smoothness parameters. We also allow users to specify a uniform prior on the spatial variance parameter $\sigma^2$ instead of an inverse-Gamma prior. This can be useful in certain situations when working with binary data, as there is a confounding between the spatial variance parameter $\sigma^2$ and the occurrence intercept $\beta_0$ as a result of the logit transformation and Jensen's Inequality [@bolker2015]. Generally, we have found this confounding to be inconsequential, as the spatial structure of the random effects helps to separate $\sigma^2$ from $\beta_0$. However, there may be certain circumstances when $\sigma^2$ is estimated to be extremely large, and the estimate of $\beta_0$ is a very large magnitude negative number. It can be helpful in these situations to use a uniform distribution on $\sigma^2$ to restrict it to taking more reasonable values. In the following, we present the full joint distribution of all models using an inverse-Gamma prior on $\sigma^2$, as we nearly always use this prior in our own analyses.

Our full joint posterior distribution takes the following form, where IG stands for the inverse-Gamma distribution:

$$ \begin{aligned} \lbrack \bm{\alpha}, \bm{\beta}, \bm{z}, \bm{\omega}{\beta}, \bm{\omega}{\alpha}, \text{\textbf{w}}(\bm{s}), \bm{\theta} \mid \bm{Y} \rbrack \propto \prod_{j = 1}^J \prod_{k = 1}^{K_j} &\text{Bernoulli}(y_{j, k} \mid p_{j, k} z_j) \times \ & \text{Bernoulli}(z_j \mid \psi_j) \times \ & \text{Normal}(\text{\textbf{w}}(\bm{s}) \mid \bm{0}, \bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})) \times \ & \text{PG}(\omega_{j, \beta} \mid 1, 0) \times \ & \text{PG}(\omega_{j, k, \alpha} \mid 1, 0) \times \ & \text{Normal}(\bm{\beta} \mid \bm{\mu0}{\beta}, \bm{\Sigma}{\beta}) \times \ & \text{Normal}(\bm{\alpha} \mid \bm{\mu0}{\alpha}, \bm{\Sigma}{\alpha}) \times \ & \text{IG}(\sigma^2 \mid a_{\sigma^2}, b_{\sigma^2}) \times \ & \text{Uniform}(\phi \mid a_{\phi}, b_{\phi}) \times \ & \text{Uniform}(\nu \mid a_{\nu}, b_{\nu}) \ \end{aligned} $$

MCMC sampler {#mcmcGP}

We first sample the occurrence and detection auxiliary variables from

\begin{equation} \begin{split} \omega_{j, \beta} \mid \cdot &\sim \text{PG}(1, \bm{x}j^{\top}\bm{\beta} + \text{w}_j), \ \omega{j, k, \alpha} \mid \cdot &\sim \text{PG}(1, \bm{v}_{j, k}^{\top}\bm{\alpha}), \end{split} (#eq:omegaPostspPGOcc) \end{equation}

The $\pg$ scheme induces a Gibbs update for the occurrence regression coefficients, which are updated at each iteration according to

\begin{equation} \bm{\beta} \mid \cdot \sim \text{Normal}\Big([\bm{\Sigma}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}[\bm{X}^{\top}(\bm{z} - 0.5 \bm{1}J - \bm{S}{\beta}\text{\textbf{w}}) + \bm{\Sigma}{\beta}^{-1}\bm{\mu0}{\beta}], [\bm{\Sigma}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}\Big), (#eq:betaPostspPGOcc) \end{equation}

where $\bm{S}{\beta}$ is a diagonal $J \times J$ matrix with diagonal entries equal to the latent PG variable values ($\omega{1, \beta}, \dots, \omega_{J, \beta})$.

The full conditional for the detection regression coefficients is the same as in the non-spatial model shown in Equation \@ref(eq:alphaPostPGOcc).

When using an inverse-Gamma prior, the spatial variance parameter, $\sigma^2$, is sampled via a Gibbs update of the form

\begin{equation} \sigma^2 \mid \cdot \sim \text{IG}\Big(\frac{J}{2} + a_{\sigma^2}, \frac{\text{\textbf{w}}^{\top}\bm{R}^{-1}\text{\textbf{w}}}{2} + b_{\sigma^2}\Big), (#eq:sigmaSqPostspPGOcc) \end{equation}

where $\bm{R}$ is a $J \times J$ spatial correlation matrix.

The full conditional distributions for the spatial range parameter, $\phi$, and spatial smoothness parameter, $\nu$, are not available in closed form, and thus we use random walk Metropolis updates (e.g., @robert2013monte) to update these parameters. We use a random-walk Metroplis step with a multivariate normal proposal distribution (either of dimension 1 or of dimension 2 if Matern covariance function is used). To use the normal distribution as a proposal distribution, we transform the parameters to have a support spanning the entire real line, including a Jacobian adjustment for the Metropolis step. Tuning parameters are adaptively updated using Adaptive MCMC following @roberts2009examples. If using a uniform prior for the spatial variance parameter $\sigma^2$, this parameter is also updated in this step using the same random-walk Metropolis step.

The $\pg$ data augmentation scheme also enables a Gibbs update for the latent spatial Gaussian process (\textbf{w}($\bm{s}$)), as opposed to a traditional spatial occupancy model that requires a Metropolis update for the latent spatial process. The spatial process is updated according to

\begin{equation} \text{\textbf{w}}(\bm{s}) \mid \cdot \sim \text{Normal}\Big([\bm{S}{\beta} + \bm{\Sigma}^{-1}]^{-1} [\bm{z} - 0.5\bm{1}_J - \bm{S}{\beta}\bm{X}\bm{\beta}], [\bm{S}_{\beta} + \bm{\Sigma}^{-1}]^{-1}\Big). (#eq:wPostspPGOcc) \end{equation}

Finally, for all sites with no detections, the latent occurrence values $z_j$ are updated following Equation \@ref(eq:zPostPGOcc).

Prediction {#predictionSpPGOcc}

Prediction for spatial occupancy models requires use of standard results for conditional multivariate normal distributions [@banerjee2003]. To predict latent occurrence and occurrence probability at non-sampled sites, we first need to predict the spatial process at the unobserved locations. Let \textbf{w}$(\bm{s}_0)$ denote the spatial process at the $J_0$ non-sampled locations. We assume that \textbf{w}$(\bm{s}_0)$ and \textbf{w}$(\bm{s})$ (the spatial process at observed locations) arise from a multivariate normal distribution following

\begin{equation} \begin{bmatrix} \text{\textbf{w}}(\bm{s}) \ \text{\textbf{w}}(\bm{s}0) \end{bmatrix} \mid \cdot \sim \text{Normal}\Bigg(\begin{bmatrix} \bm{0} \ \bm{0} \end{bmatrix}, \begin{bmatrix} \bm{\Sigma}{11} & \bm{\Sigma}{12} \ \bm{\Sigma}{12}^{\top} & \bm{\Sigma}_{22} \end{bmatrix}\Bigg), (#eq:mvn) \end{equation}

where $\bm{\Sigma}{11} = \bm{\Sigma}$, $\bm{\Sigma}{12}$ is the $J \times J_0$ cross-covariance matrix between \textbf{w}$(\bm{s})$ and \textbf{w}$(\bm{s}0)$, and $\bm{\Sigma}{22}$ is the variance-covariance matrix for \textbf{w}$(\bm{s}_0)$. Using conditional multivariate normal theory, this results in the following posterior predictive distribution for the spatial process at nonsampled locations

\begin{equation} \text{\textbf{w}}(\bm{s}0) \sim \text{Normal}(\bm{\Sigma}{12}^{\top}\bm{\Sigma}{11}^{-1}\text{\textbf{w}}(\bm{s}), \bm{\Sigma}{22} - \bm{\Sigma}{12}^{\top}\bm{\Sigma}{11}^{-1}\bm{\Sigma}_{12}). (#eq:wPredspPGOcc) \end{equation}

We can use composition sampling to sample from this posterior predictive distribution by using the values for \textbf{w} at each sample $q$ of the posterior distribution. This will generate a full predictive posterior sample which we can summarize with full uncertainty quantification.

Predicting all $J_0$ locations jointly can be expensive when $J_0$ is large. Thus, we perform independent individual predictions of the spatial process at each non-sampled location $j = 1, \dots, J_0$. Finally, to predict the latent occurrence and latent occurrence probability at each non-sampled site $j$, we perform the following steps for each posterior sample $q$.

Sample w$^{(q)}(\bm{s}_{0, j})$ from Equation \@ref(eq:wPredspPGOcc), substituting in the current values at sample $q$ of the spatial parameters and latent spatial process at the observed locations.
Compute the latent occurrence probability $\psi^{(q)}(\bm{s}j)$ as $\text{logit}^{-1}(\bm{x}{0, j}^{\top}\bm{\beta}^{(q)} + \text{w}^{(q)}(\bm{s}_{0, j}))$.
Sample the latent occurrence from $z_{j}^{(q)} \sim \text{Bernoulli}(\psi^{(q)}(\bm{s}_j))$.

Nearest Neighbor Gaussian Process formulation

When the number of sites is moderately large, say 1000, the above described spatial Gaussian process model can be drastically slow as a result of the need to take the inverse of the spatial covariance matrix $\bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})$ at each MCMC iteration. Numerous approximation methods exist to reduce this computational cost [@heaton2019case]. One attractive approach is the Nearest Neighbor Gaussian Process (NNGP; @datta2016hierarchical). Instead of modeling the spatial process using a full GP as shown in Equation \@ref(eq:fullGP), we replace the GP prior specification with a NNGP, which leads to drastic decreases in run time with nearly identicial inference and prediction as the full GP specification. See @datta2016hierarchical for theoretical details on the NNGP and its relationship to the full GP.

Model description {#modelNNGP}

The joint posterior distribution for the NNGP model is exactly the same as that of the full GP spatial occupancy model except the Gausian Process specification assigned to the latent spatial random effects \textbf{w}($\bm{s}$) is replaced with an NNGP prior [@datta2016hierarchical].

MCMC sampler {#samplerspPGOccNNGP}

Full conditionals for the $\pg$ latent variables, occurrence regression coefficients, detection regression coefficients, spatial range (and smoothness if applicable) parameter, and the latent occurrence values are sampled in the same manner as done for the full GP spatial occupancy model. When using an inverse-Gamma prior, the full conditional for the spatial variance parameter $\sigma^2$ similarly takes the form of an inverse-Gamma distribution, or when using a uniform prior, the update is the same as the full GP spatial occupancy model. The $\pg$ data augmentation scheme induces the following full conditional for the latent spatial process when using an NNGP prior:

\begin{equation} \text{\textbf{w}}(\bm{s}) \mid \cdot \sim \text{Normal}(\text{\textbf{B}}[\bm{S}_{\beta}^{-1}(\bm{\tilde{z}} - \bm{X}\bm{\beta})], \text{\textbf{B}}) \end{equation}

where $\bm{S}{\beta}$ is a diagonal $J \times J$ matrix with diagonal entries equal to the latent $\pg$ variable values, $\tilde{z}_j = \frac{z_j - 0.5}{\omega{j, \beta}}$, and \textbf{B} $= \bm{\tilde{\Sigma}}(\bm{s}, \bm{s}', \bm{\theta})^{-1} + \bm{S}_{\beta}$. $\bm{\tilde{\Sigma}}(\bm{s}, \bm{s}', \bm{\theta})$ is the NNGP covariance matrix. For details of the NNGP covariance matrix, see @datta2016hierarchical and @finley2019efficient.

As described by @finley2020spnngp and @datta2016hierarchical, the above block update of \textbf{w}($\bm{s}$) is not computationally practical. Instead, we sequentially update the full conditionals individual for each $j = 1, \dots, J$ element of \textbf{w}($\bm{s}$) following the algorithm in @datta2016hierarchical. This ensures each update of the full latent spatial random efffects vector occurs in $O(J)$ floating point operations (FLOPs).

Prediction {#predictionspPGOccNNGP}

Prediction for the NNGP occupancy model follows a similar algorithm to that of the full GP spatial occupancy model. We first sample the observed spatial random effects when fitting the model, use these random effects to generate the spatial random effects at new locations, and subsequently use the predicted spatial random effects to generate predictions of latent occurrence and occurrence probability. More specifically, our approach for prediction follows exactly Algorithm 2 of @finley2019efficient, which we reproduce in the context of occupancy models in the following.

The posterior predictive distribution for the spatial process at unobserved locations (\textbf{w}$_0$) is

\begin{equation} \text{\textbf{w}}(\bm{s}0) \mid \cdot \sim \text{Normal}(\text{\textbf{B}}^{-1}[\bm{S}{\beta}^{-1}(\bm{\tilde{z}} - \bm{X}\bm{\beta})], \text{\textbf{B}}^{-1}). (#eq:nngpPred) \end{equation}

Our predictive algorithm thus takes the following steps for each posterior sample $q$:

Sample w$^{(q)}_{0, j}$ from Equation \@ref(eq:nngpPred), substituting in the current values at sample $q$ of the spatal parameters and latent spatial process at the observed locations.
Compute the latent occurrence probability $\psi^{(q)}(\bm{s}j)$ as $\text{logit}^{-1}(\bm{x}{0, j}^{\top}\bm{\beta}^{(q)} + \text{w}^{(q)}(\bm{s}_{0, j}))$.
Sample the latent occurrence from $z_{j}^{(q)} \sim \text{Bernoulli}(\psi^{(q)}(\bm{s}_j))$.

Multi-species occupancy model (`msPGOcc`)

Model description

Let $z_{i, j}$ be the true presence (1) or absence (0) of a species $i$ at site $j$, with $j = 1, \dots, J$ and $i = 1, \dots, N$. We assume the latent occurrence process arises from a Bernoulli process following

\begin{equation} \begin{split} &z_{i, j} \sim \text{Bernoulli}(\psi_{i, j}), \ &\text{logit}(\psi_{i, j}) = \bm{x}^{\top}_{j} \bm{\beta}_i, \end{split} (#eq:zmsPGOcc) \end{equation}

where $\psi_{i, j}$ is the probability of occurrence of species $i$ at site $j$, which is a function of site-specific covariates $\bm{X}$ and a vector of species-specific regression coefficients ($\bm{\beta}_i$). The regression coefficients in multi-species occupancy models are envisioned as random effects arising from a common community-level distribution:

\begin{equation} \bm{\beta}i \sim \text{Normal}(\bm{\mu{\beta}}, \bm{T}_{\beta}), (#eq:msBeta) \end{equation}

where $\bm{\mu_{\beta}}$ is a vector of community-level mean effects for each occurrence covariate effect (including the intercept) and $\bm{T}{\beta}$ is a diagonal matrix with diagonal elements $\bm{\tau}^2{\beta}$ that represent the variability of each occurrence covariate effect among species in the community.

We do not directly observe $z_{i, j}$, but rather we observe an imperfect representation of the latent occurrence process. Let $y_{i, j, k}$ be the observed detection (1) or nondetection (0) of a species $i$ of interest at site $j$ during replicate $k$ for each of $k = 1, \dots, K_j$ replicates at each site $j$. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:

\begin{equation} \begin{split} &y_{i, j, k} \sim \text{Bernoulli}(p_{i, j, k}z_{i, j}), \ &\text{logit}(p_{i, j, k}) = \bm{v}^{\top}_{i, j, k}\bm{\alpha}_i, \end{split} (#eq:ymsPGOcc) \end{equation}

where $p_{i, j, k}$ is the probability of detecting species $i$ at site $j$ during replicate $k$ (given it is present at site $j$), which is a function of site and replicate-specific covariates $\bm{V}$ and a vector of species-specific regression coefficients ($\bm{\alpha}_i$). Similarly to the occurrence regression coefficients, the species-specific detection coefficients are envisioned as random effects arising from a common community-level distribution:

\begin{equation} \bm{\alpha}i \sim \text{Normal}(\bm{\mu{\alpha}}, \bm{T}_{\alpha}), (#eq:msAlpha) \end{equation}

where $\bm{\mu_{\alpha}}$ is a vector of community-level mean effects for each detection covariate effect (including the intercept) and $\bm{T}{\alpha}$ is a diagonal matrix with diagonal elements $\bm{\tau}^2{\alpha}$ that represent the variability of each detection covariate effect among species in the community.

We assign multivariate normal priors for the community-level occurrence ($\bm{\mu_{\beta}}$) and detection ($\bm{\mu_{\alpha}}$) means, and assign indepdent inverse-Gamma priors on the community-level occurrence ($\tau^2_{\beta}$) and detection ($\tau^2_{\alpha}$) variance parameters. Analogous to the single-species occupancy model, we implement the model using $\pg$ data augmentation which induces fully Gibbs updates for all parameters. We specify $\pg$ data augmented variables for each species ($\omega_{i, j, \beta}$, $\omega_{i, j, k, \alpha})$, which follow the same scheme as that for the single-species model in Equation \@ref(eq:polyaGamma), except all parameters in the equation are now indexed by $i$ for each species. The full joint posterior distribution thus takes the following form:

$$ \begin{aligned} \lbrack \bm{\alpha}, \bm{\beta}, \bm{z}, \bm{\omega}{\beta}, \bm{\omega}{\alpha}, \bm{\mu}{\beta}, \bm{\mu}{\alpha}, \bm{\tau}^2_{\beta}, \bm{\tau}^2_{\alpha} \mid \bm{y} \rbrack \propto \prod_{i = 1}^N \prod_{j = 1}^J \prod_{k = 1}^{K_j} &\text{Bernoulli}(y_{i, j, k} \mid p_{i, j, k}z_{i, j}) \times \ & \text{Bernoulli}(z_{i, j} \mid \psi_{i, j}) \times \ & \text{PG}(\omega_{i, j, \beta} \mid 1, 0) \times \ & \text{PG}(\omega_{i, j, k, \alpha} \mid 1, 0) \times \ & \text{Normal}(\bm{\beta} \mid \bm{\mu_{\beta}}, \bm{T}{\beta}) \times \ & \text{Normal}(\bm{\alpha} \mid \bm{\mu{\alpha}}, \bm{T}{\alpha}) \ & \text{Normal}(\bm{\mu{\beta}} \mid \bm{\mu0}{\beta}, \bm{\Sigma}{\beta}) \ & \text{Normal}(\bm{\mu_{\alpha}} \mid \bm{\mu0}{\alpha}, \bm{\Sigma}{\alpha}) \ & \prod_{r = 1}^{n_{\beta}} \text{IG}(\tau^2_{r, \beta} \mid a_{r, \beta}, b_{r, \beta}) \ & \prod_{t = 1}^{n_{\alpha}} \text{IG}(\tau^2_{t, \alpha} \mid a_{t, \alpha}, b_{t, \alpha}), \end{aligned} $$

where $r$ and $t$ index across the number of occurrence and detection regression parameters, respectively.

MCMC sampler

The $\pg$ data augmentation induces a Gibbs update for all parameters in the multi-species occupancy model. For each iteration, we first sample all community-level parameters followed by species level parameters. We first sample the community-level regression coefficients $\bm{\mu_{\beta}}$ from

\begin{equation} \bm{\mu_{\beta}} \mid \cdot \sim \text{Normal}([\bm{\Sigma}{\beta}^{-1} + N\bm{T}{\beta}^{-1}]^{-1}\Big[\sum_{i = 1}^N(\bm{T}{\beta}^{-1}\bm{\beta}_i) + \bm{\Sigma}{\beta}^{-1}\bm{\mu0}{\beta}\Big], [\bm{\Sigma}{\beta}^{-1} + N\bm{T}_{\beta}^{-1}]^{-1}). (#eq:betaBar) \end{equation}

Similarly, we next sample the community-level regression coefficients $\bm{\mu_{\alpha}}$ from

\begin{equation} \bm{\mu_{\alpha}} \mid \cdot \sim \text{Normal}([\bm{\Sigma}{\alpha}^{-1} + N\bm{T}{\alpha}^{-1}]^{-1}\Big[\sum_{i = 1}^N(\bm{T}{\alpha}^{-1}\bm{\alpha}_i) + \bm{\Sigma}{\alpha}^{-1}\bm{\mu0}{\alpha}\Big], [\bm{\Sigma}{\alpha}^{-1} + N\bm{T}_{\alpha}^{-1}]^{-1}). (#eq:alphaBar) \end{equation}

Next, we sample the community-level occurrence variance parameter for each regression coefficient, $\tau^2_{r, \beta}$, from the following inverse-Gamma full conditional:

\begin{equation} \tau^2_{r, \beta} \mid \cdot \sim \text{IG}(a_{r, \beta} + \frac{N}{2}, b_{r, \beta} + \frac{\sum_{i = 1}^N(\beta_{i, r} - \mu_{\beta_r})^2}{2}). (#eq:tauBeta) \end{equation}

Similarly, we next sample the community-level detection variance parameter for each regression coefficient:

\begin{equation} \tau^2_{t, \alpha} \mid \cdot \sim \text{IG}(a_{t, \alpha} + \frac{N}{2}, a_{t, \alpha} + \frac{\sum_{i = 1}^N(\alpha_{i, t} - \mu_{\alpha_t})^2}{2}). (#eq:tauAlpha) \end{equation}

We now sample all species level coefficients. The coefficients are sampled one at a time for each species. First, we sample the occurrence and detection auxiliary variables for species $i$ from

\begin{equation} \begin{split} \omega_{i, j, \beta} \mid \cdot &\sim \text{PG}(1, \bm{x}j^{\top}\bm{\beta}_i), \ \omega{i, j, k, \alpha} \mid \cdot &\sim \text{PG}(1, \bm{v}_{j, k}^{\top}\bm{\alpha}_i). \end{split} (#eq:omegaPostmsPGOcc) \end{equation}

The occurrence regression coefficients for species $i$ are subsequently drawn from the following multivariate Normal full conditional distribution

\begin{equation} \bm{\beta}i \mid \cdot \sim \text{Normal}\Big([\bm{T}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}[\bm{X}^{\top}(\bm{z}_i - 0.5 \bm{1}_J) + \bm{T}{\beta}^{-1}\bm{\mu_{\beta}}], [\bm{T}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}\Big), (#eq:betaMsPGOcc) \end{equation}

where $\bm{S}_{\beta}$ is a diagonal $J \times J$ matrix with diagonal entries equal to the latent $\pg$ variable values for species $i$. Similarly, we sample the detection regression coefficients for species $i$ from

\begin{equation} \bm{\alpha}i \mid \cdot \sim \text{Normal}\Big([\bm{T}{\alpha}^{-1} + \bm{\tilde{V}}^{\top}\bm{S}{\alpha}\bm{\tilde{V}}]^{-1}[\bm{\tilde{V}}^{\top}(\bm{\tilde{y}}_i - 0.5 \bm{1}{J_i^*}) + \bm{T}{\alpha}^{-1}\bm{\mu{\alpha}}], [\bm{T}{\alpha}^{-1} + \bm{\tilde{V}}^{\top}\bm{S}{\alpha}\bm{\tilde{V}}]^{-1}\Big). (#eq:alphaMsPGOcc) \end{equation}

The species-level detection regression coefficients $\bm{\alpha}i$ are only informed by the locations where $z{i, j} = 1$, since we assume no false positive detections in the standard occupancy model. We define $J_i^$ as the total number of sites at the current iteration of the MCMC with $z_{i, j} = 1$. $\bm{S}{\alpha}$ is a diagonal matrix with diagonal entries equal to the latent $\pg$ variable values ($\omega{i, 1, 1, \alpha}, \dots, \omega_{i, J_i^, K_{J_i^*}, \alpha}$). The matrix $\bm{\tilde{V}}$ is the matrix of detection covariates associated with the sites where $z_{i, j} = 1$. Similarly, $\bm{\tilde{y}}i$ is a vector of stacked detection-nondetection data values at the entries associated with $z{i, j} = 1$.

Finally, we sample the latent occurrence states for each species. $z_{i, j}$ is set to 1 for all sites where there is at least one detection of species $i$, and so we only need to sample $z_{i, j}$ at sites where there are no detections. Thus, for all locations with no detections of the species $i$, we sample $z_{i, j}$ according to

\begin{equation} z_{i, j} \mid \cdot \sim \text{Bernoulli}\Bigg(\frac{\psi_{i, j} \prod_{k = 1}^{K_j}(1 - p_{i, j, k})}{1 - \psi_{i, j} + \psi_{i, j} \prod_{k = 1}^{K_j}(1 - p_{i, j, k})}\Bigg). (#eq:zPostmsPGOcc) \end{equation}

Note the full conditional for $z_{i, j}$ is exactly the same as that for the single-species occupancy model in Equation \@ref(eq:zPostPGOcc), except all values are now additionally indexed by species ($i$).

Prediction

Prediction for a nonspatial multi-species occupancy model is a simple composition sampling problem exactly analogous to the single-species model. Given a set of occurrence covariates at a set of non-sampled locations ($\bm{X}_0$), we can derive the latent occurrence probability and the latent occurrence state at each non-sampled site $j = 1, \dots, J_0$ for each species $i$ for each posterior sample $q$ of the MCMC sampler following

\begin{equation} \begin{split} &\text{logit}(\psi_{i, j}^{(q)}) = \bm{x}{0, j}^{\top} \bm{\beta}_i^{(q)}, \ &z^{(q)}{i, j} \sim \text{Bernoulli}(\psi_{i, j}^{(q)}). \end{split} (#eq:zPredmsPGOcc) \end{equation}

Multi-species spatial occupancy model (`spMsPGOcc`)

Gaussian Process formulation

Model description

We extend the previous multi-species occupancy model to incorporate a distinct spatial Gaussian Process (GP) for each species that accounts for unexplained spatial variation in each individual species occurrence across a spatial region. Occurrence probability for species $i$ at site $j$ with coordinates $\bm{s}j$, $\psi{i}(\bm{s}_j)$, now takes the form

\begin{equation} \text{logit}(\psi_{i}(\bm{s}j)) = \bm{x}_j^{\top}\bm{\beta}_i + \text{w}{i}(\bm{s}_j), \end{equation}

where the species-specific regression coefficients $\bm{\beta}i$ follow the community-level distribution in Equation \@ref(eq:msBeta), and \text{w}${i}(\bm{s}_j)$ is a realization from a zero-mean spatial GP, i.e.,

\begin{equation} \text{\textbf{w}}_{i}(\bm{s}) \sim \text{Normal}(\bm{0}, \bm{\Sigma}_i(\bm{s}, \bm{s}', \bm{\theta}_i)). (#eq:fullGPMs) \end{equation}

We define $\bm{\Sigma}_i(\bm{s}, \bm{s}', \bm{\theta}_i)$ as a $J \times J$ covariance matrix that is a function of the distances between any pair of site coordinates $\bm{s}$ and $\bm{s}'$ and a set of parameters $(\bm{\theta}_i)$ that govern the spatial process. The vector $\bm{\theta}_i$ is equal to $\bm{\theta}_i = {\sigma^2_i, \phi_i, \nu_i}$, where $\sigma^2_i$ is a spatial variance parameter for species $i$, $\phi_i$ is a spatial decay parameter for species $i$, and $\nu_i$ is a spatial smoothness parameter for species $i$. $\nu_i$ is only specified when using a Matern correlation function.

Note that we estimate a distinct parameter vector $\bm{\theta}_i$ for each species and assume the spatial processes are independent of each other. This is a naive approach for incorporating spatial processes in a multi-species occupancy model, as we do not leverage the potential correlation in spatial processes among species in a linear model of coregionalization approach [@gelfand2004nonstationary]. Despite the simplicity of the approach, such models have been shown to yield improved insight in species distributions across broad locations [@wright2021spatial], and the Bayesian shrinkage component of the multi-species model (Equation \@ref(eq:msBeta)) will make estimates in a multi-species spatial occupancy model more precise than a single-species spatial occupancy model, in particular for rare species. In future implementations of spOccupancy we plan to implement multi-species occupancy models in a more rich inferential framework that leverages between species correlations in spatial processes and non-spatial components, similar to the models of @tobler2019joint and @taylor2019spatial.

The detection portion of the multi-species spatial occupancy model remains unchanged from the non-spatial multi-species occupancy model and follows Equations \@ref(eq:ymsPGOcc) and \@ref(eq:msAlpha). Formulation of $\pg$ latent variables is also exactly analogous to the nonspatial model (Equation \@ref(eq:polyaGamma)), with all parameters including an index for species ($i$) and all references to $\psi_{i}(\bm{s}_j)$ now including the latent spatial random effects in addition to the site-level covariates.

Following standard recommendations for point-referenced spatial data [@banerjee2003], we assign an inverse-Gamma prior to the spatial variance parameter for each species and uniform priors to the spatial decay and spatial smoothness parameters for each species. Our full joint posterior distribution takes the following form, where IG stands for the inverse-Gamma distribution:

$$ \begin{aligned} \lbrack \bm{\alpha}, \bm{\beta}, \bm{z}, \bm{\omega}{\beta}, \bm{\omega}{\alpha}, \bm{\mu}{\beta}, \bm{\mu}{\alpha}, \bm{\tau}^2_{\beta}, \bm{\tau}^2_{\alpha} \mid \bm{y} \rbrack \propto \prod_{i = 1}^N \prod_{j = 1}^J \prod_{k = 1}^{K_j} &\text{Bernoulli}(y_{i, j, k} \mid p_{i, j, k}z_{i, j}) \times \ & \text{Bernoulli}(z_{i, j} \mid \psi_{i, j}) \times \ & \text{PG}(\omega_{i, j, \beta} \mid 1, 0) \times \ & \text{PG}(\omega_{i, j, k, \alpha} \mid 1, 0) \times \ & \text{Normal}(\text{\textbf{w}} \mid \bm{0}, \bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})) \times \ & \text{Normal}(\bm{\beta} \mid \bm{\mu_{\beta}}, \bm{T}{\beta}) \times \ & \text{Normal}(\bm{\alpha} \mid \bm{\mu{\alpha}}, \bm{T}{\alpha}) \ & \text{Normal}(\bm{\mu{\beta}} \mid \bm{\mu0}{\beta}, \bm{\Sigma}{\beta}) \ & \text{Normal}(\bm{\mu_{\alpha}} \mid \bm{\mu0}{\alpha}, \bm{\Sigma}{\alpha}) \ & \prod_{r = 1}^{n_{\beta}} \text{IG}(\tau^2_{r, \beta} \mid a_{r, \beta}, b_{r, \beta}) \ & \prod_{t = 1}^{n_{\alpha}} \text{IG}(\tau^2_{t, \alpha} \mid a_{t, \alpha}, b_{t, \alpha}) \ & \text{IG}(\sigma^2_i \mid a_{\sigma^2, i}, b_{\sigma^2, i}) \times \ & \text{Uniform}(\phi_i \mid a_{\phi, i}, b_{\phi, i}) \times \ & \text{Uniform}(\nu_i \mid a_{\nu, i}, b_{\nu, i}) \ \end{aligned} $$

MCMC sampler

The $\pg$ data augmentation induces a Gibbs update for all parameters in the multi-species spatial occupancy model except the spatial range parameters ($\phi_i$) and the spatial smoothness parameters $\nu_i$ if specified. For each iteration, we first sample all community-level parameters followed by species level parameters. Full conditional distributions for all community-level parameters are exactly the same as those for the nonspatial multi-species model. See Equations \@ref(eq:betaBar)-\@ref(eq:tauAlpha).

The species-level coefficients are sampled one at a time for each species. First, we sample the occurrence and detection auxiliary variables for species $i$ from

\begin{equation} \begin{split} \omega_{i, j, \beta} \mid \cdot &\sim \text{PG}(1, \bm{x}j^{\top}\bm{\beta}_i + \text{w}{i}(\bm{s}j)), \ \omega{i, j, k, \alpha} \mid \cdot &\sim \text{PG}(1, \bm{v}_{j, k}^{\top}\bm{\alpha}_i). \end{split} (#eq:omegaPostspMsPGOcc) \end{equation}

The occurrence regression coefficients for species $i$ are subsequently drawn from the following multivariate normal full conditional distribution

\begin{equation} \bm{\beta}i \mid \cdot \sim \text{Normal}\Big([\bm{T}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}[\bm{X}^{\top}(\bm{z}_i - 0.5 \bm{1}_J - \bm{S}{\beta}\text{\textbf{w}}i(\bm{s}_j)) + \bm{T}{\beta}^{-1}\bm{\mu_{\beta}}], [\bm{T}{\beta}^{-1} + \bm{X}^{\top}\bm{S}{\beta}\bm{X}]^{-1}\Big), (#eq:betaspMsPGOcc) \end{equation}

where $\bm{S}_{\beta}$ is a diagonal $J \times J$ matrix with diagonal entries equal to the latent $\pg$ variable values for species $i$. The full conditional for the species-level detection regression coefficients is the same as in the non-spatial model shown in Equation \@ref(eq:alphaMsPGOcc).

When specifying an inverse-Gamma prior, the spatial variance parameter for species $i$, $\sigma^2_i$, is sampled via a Gibbs update of the form

\begin{equation} \sigma^2_i \mid \cdot \sim \text{IG}\Big(\frac{J}{2} + a_{\sigma^2, i}, \frac{\text{\textbf{w}}i(\bm{s}_j)^{\top}\bm{R}_i^{-1}\text{\textbf{w}}_i(\bm{s}_j)}{2} + b{\sigma^2, i}\Big), (#eq:sigmaSqPostspMsPGOcc) \end{equation}

where $\bm{R}$ is a $J \times J$ spatial correlation matrix. The full conditional distributions for the species-specific spatial range parameters, $\phi_i$, and spatial smoothness parameters, $\nu_i$ are not available in closed form, and so they are updated using random walk Metropolis updates following the same procedure as described for the single-species spatial occupancy models. When a uniform prior is used for the species-specific spatial variance parameters, $\sigma^2_i$, these are also updated using random walk metropolis updates as before.

The $\pg$ data augmentation scheme induces a Gibbs update for the latent spatial Gaussian process for each species, which is updated according to

\begin{equation} \text{\textbf{w}}i(\bm{s}) \mid \cdot \sim \text{Normal}\Big([\bm{S}{\beta} + \bm{\Sigma}i^{-1}]^{-1} [\bm{z}_i - 0.5\bm{1}_J - \bm{S}{\beta}\bm{X}\bm{\beta}i], [\bm{S}{\beta} + \bm{\Sigma}_i^{-1}]^{-1}\Big). (#eq:wPostspMsPGOcc) \end{equation}

Finally, for all sites with no detections for a given species, the latent occurrence values $z_{i, j}$ are updated following Equation \@ref(eq:zPostmsPGOcc).

Prediction

Because we assume independence between the spatial processes of the different species, prediction for the multi-species spatial occupancy model is exactly analogous to prediction for the single-species spatial occupancy model described in Section \@ref(predictionSpPGOcc), except prediction is done for each species $i$ using the species-specific values for that species. See Section \@ref(predictionSpPGOcc) for the algorithm, noting that for the multi-species model all values are additionally indexed by species ($i$).

Nearest Neighbor Gaussian Process formulation

As with the single-species model, we also implement spMsPGOcc with an NNGP. Use of the NNGP leads to even larger computational benefits for the multi-species occupancy models, as we now replace each of the independent GPs for each of the $N$ species with an independent NNGP.

Model description

The joint posterior distribution for the multi-species NNGP occupancy model is exactly the same as that of the full GP multi-species model except the GP prior assigned to the latent spatial random effects \textbf{w}$_i(\bm{s})$ for each species is replaced with an NNGP prior [@datta2016hierarchical].

MCMC sampler and prediction

The full conditionals for all variables except the spatial variance parameters $\sigma^2_i$ and the latent spatial process \textbf{w}$_i(\bm{s})$ follow the same full conditionals as described in the full GP multi-species spatial occupancy model. The full conditionals for $\sigma^2_i$ and \textbf{w}$_i(\bm{s})$ follow exactly those described for the single-species NNGP model in Section \@ref(samplerspPGOccNNGP), where all parameters are now indexed by each species $i$. Prediction is also exactly analogous to that described for the single-species NNGP model in Section \@ref(predictionspPGOccNNGP).

Single-species integrated occupancy model (`intPGOcc`)

Data integration is a model-based approach that leverages multiple data sources to provide inference and prediction on some latent process of interest [@miller2019recent]. Data integration is particularly relevant in ecology as many data sources are often collected to study a single ecological phenomenon, with each data source having advantages and disadvantages. Often, multiple detection-nondetection data sources are available to study the occurrence and distribution of some species of interest. For example, both human point count surveys and autonomous recording units could be used to monitor a bird species of conservation concern [@doser2021integrating]. Different types of data have different sources of observation error, which should be explicitly incorporated into a model to avoid attributing any variation in detection probability to the true ecological process. Here we describe single-species integrated occupancy models, which combine multiple sources of detection-nondetection data (which may or may not be replicated) in a single hierarchical modeling framework.

Model description {#integratedModel}

The biological process model is exactly the same as single-species occupancy models, which we now describe again for clarity. Let $z_j$ be the presence or absence of a species at site $j$, with $j = 1, \dots, J$. We assume this latent occurrence variables arises from a Bernoulli process following

\begin{equation} \begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \ &\text{logit}(\psi_j) = \bm{x}^{\top}_j\bm{\beta}, \end{split} (#eq:zintPGOcc) \end{equation}

where $\psi_j$ is the probability of occurrence at site $j$, which is a function of site-specific covariates $\bm{X}$ and a vector of regression coefficients ($\bm{\beta}$).

We do not directly observe $z_j$, but rather we observe an imperfect representation of the latent occurrence process. In integrated models, we have $r = 1, \dots, R$ distinct sources of data that are all imperfect representations of a single, shared occurrence process. Let $y_{r, a, k}$ be the observed detection (1) or nondetection (0) of a species of interest in data set $r$ at site $a$ during replicate $k$. Because different data sources have different variables influencing the observation process, we envision a separate detection model for each data source that is conditional on a single, shared ecological process described by Equation \@ref(eq:zintPGOcc). We envision the detection-nondetection data from source $r$ as arising from a Bernoulli process conditional on the true latent occurrence process:

\begin{equation} \begin{split} &y_{r, a, k} \sim \text{Bernoulli}(p_{r, a, k}z_{j[a]}), \ &\text{logit}(p_{r, a, k}) = \bm{v}^{\top}_{r, a, k}\bm{\alpha}_r, \end{split} (#eq:yintPGOcc) \end{equation}

where $p_{r, a, k}$ is the probability of detecting a species at site $a$ during replicate $k$ (given it is present at site $a$) for data source $r$, which is a function of site, replicate, and data source specific covariates $\bm{V}r$ and a vector of regression coefficients specific to each data source ($\bm{\alpha}_r$). Note that $z{j[a]}$ is the true occurrence status at site $j$ corresponding to the $a$th data source site in the given data set $r$. Each data source may be available at all $J$ sites in the region of interest or at a subset of the $J$ sites. Additionally, data sources can overlap in the sites they sample, or they can be obtained at distinct sites within all $J$ sites of interest in the overall region.

We assume multivariate normal priors for the occurrence ($\bm{\beta}$) and data-set specific detection ($\bm{\alpha}$) regression coefficients to complete the Bayesian specification of a single-species occupancy model. $\pg$ data augmentation is implemented analgoous to previous models, where there is a single set of occurrence auxiliary variables ($\bm{\omega}{\beta}$) and a distinct set of detection auxiliary variables for each data source ($\bm{\omega}{r, \alpha}$).

In short, the integrated occupancy model has an identical process model to the single-species occupancy model, and has a distinct detection model for each data source that are all conditional on the same shared ecological process (species occurrence). Our full joint posterior takes the same form as that of a single-species occupancy model, except a separate conditional likelihood is specified for each data source which is dependent on its own unique set of detection regression cofficients and $\pg$ auxiliary variables.

MCMC sampler {#mcmcIntegrated}

The $\pg$ data augmentation induces a Gibbs update for all parameters in the single-species integrated occupancy model. We first sample the occurrence and detection auxiliary variables from

\begin{equation} \begin{split} \omega_{j, \beta} \mid \cdot &\sim \text{PG}(1, \bm{x}j^{\top}\bm{\beta}), \ \omega{r, j, k, \alpha} \mid \cdot &\sim \text{PG}(1, \bm{v}_{r, j, k}^{\top}\bm{\alpha}_r), \end{split} (#eq:omegaPostIntPGOcc) \end{equation}

The occurrence regression coefficients are sampled from the same full conditional as that in the single-species occupancy model in Equation \@ref(eq:betaPostPGOcc).

The detection regression coefficients for a given data source $r$ follows Equation \@ref(eq:alphaPostPGOcc), with all parameters now indexed by $r$.

Finally, $z_j$ is set to 1 for all sites where there is at least one detection from one more more data sources, and thus we only need to sample $z_j$ at sites where there are no detections. Thus, for all locations with no detections, we sample $z_j$ according to

\begin{equation} z_j \mid \cdot \sim \text{Bernoulli}\Bigg(\frac{\psi_j \prod_{\forall{a = j}}(1 - p_{r, a, j, k})}{1 - \psi_j + \psi_j \prod_{\forall{a = j}}(1 - p_{r, a, j, k})}\Bigg), (#eq:zPostIntPGOcc) \end{equation}

where the product occurs over all the sites in the $R$ data sources that correspond the $j$th location.

Prediction

Integrated occupancy models have an identical ecological process model to single-species occupancy models, and so out-of-sample prediction follows the same approach. See Section \@ref(singleSpeciesPred) for details.

Single-species integrated spatial occupancy model (`spIntPGOcc`)

Single-species integrated spatial occupancy models are identical to integrated occupancy models except the ecological process model now incorporates a spatially-structured random effect following the discussion in Section \@ref(spPGOcc). All details for the single-species integrated spatial occupancy model have already been presented. Here we present the sections to consult for necessary details for each portion of the single-species integrated spatial occupancy model.

Gaussian Process formulation

Model description

For the ecological process model, see Section \@ref(modelGP). For the observation model for each data source, see Section \@ref(integratedModel).

MCMC sampler

For the ecological process parameters, see Section \@ref(mcmcGP). For the observation process parameters, see Section \@ref(mcmcIntegrated).

Prediction

See Section \@ref(predictionSpPGOcc).

Nearest Neighbor Gaussian Process formulation

Model description

For the ecological process model, see Section \@ref(modelNNGP). For the observation model for each data source, see Section \@ref(integratedModel).

MCMC sampler

For the ecological process parameters, see Section \@ref(samplerspPGOccNNGP). For the observation process parameters, see Section \@ref(mcmcIntegrated).

Prediction

See Section \@ref(predictionspPGOccNNGP).

References {-}

doserjef/spOccupancy documentation built on June 2, 2025, 5:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

doserjef/spOccupancy Single-Species, Multi-Species, and Integrated Spatial Occupancy Models

In doserjef/spOccupancy: Single-Species, Multi-Species, and Integrated Spatial Occupancy Models

Introduction

Single-species occupancy model (PGOcc)

Model description

MCMC sampler

Prediction {#singleSpeciesPred}

Single-species spatial occupancy model (spPGOcc) {#spPGOcc}

Gaussian Process formulation

Model description {#modelGP}

MCMC sampler {#mcmcGP}

Prediction {#predictionSpPGOcc}

Nearest Neighbor Gaussian Process formulation

Model description {#modelNNGP}

MCMC sampler {#samplerspPGOccNNGP}

Prediction {#predictionspPGOccNNGP}

Multi-species occupancy model (msPGOcc)

Model description

MCMC sampler

Prediction

Multi-species spatial occupancy model (spMsPGOcc)

Gaussian Process formulation

Model description

MCMC sampler

Prediction

Nearest Neighbor Gaussian Process formulation

Model description

MCMC sampler and prediction

Single-species integrated occupancy model (intPGOcc)

Model description {#integratedModel}

MCMC sampler {#mcmcIntegrated}

Prediction

Single-species integrated spatial occupancy model (spIntPGOcc)

Gaussian Process formulation

Model description

MCMC sampler

Prediction

Nearest Neighbor Gaussian Process formulation

Model description

MCMC sampler

Prediction

References {-}

R Package Documentation

Browse R Packages

We want your feedback!

doserjef/spOccupancy
Single-Species, Multi-Species, and Integrated Spatial Occupancy Models

Single-species occupancy model (`PGOcc`)

Single-species spatial occupancy model (`spPGOcc`) {#spPGOcc}

Multi-species occupancy model (`msPGOcc`)

Multi-species spatial occupancy model (`spMsPGOcc`)

Single-species integrated occupancy model (`intPGOcc`)

Single-species integrated spatial occupancy model (`spIntPGOcc`)