options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(comment = "", cache = TRUE)
\newcommand{\bm}{\boldsymbol}
This vignette provides worked examples and explanations for fitting single-species, multi-species, and integrated occupancy models in the spOccupancy
R package. We provide step-by-step examples of how to fit the following models:
1. PGOcc()
2. spPGOcc()
3. msPGOcc()
4. spMsPGOcc()
5. intPGOcc()
6. spIntPGOcc()

We fit all occupancy models using Pólya-Gamma data augmentation [@polson2013] for computational efficiency (Pólya-Gamma is where the PG
comes from in all spOccupancy
model fitting function names). In this vignette, we will provide a brief description of each model, with full statistical details provided in a separate MCMC sampler vignette. We will also show how spOccupancy
provides functions for posterior predictive checks as a Goodness of Fit assessment, model comparison and assessment using the Widely Applicable Information Criterion (WAIC) and k-fold cross-validation, and out-of-sample prediction using standard R helper functions (e.g., predict()
).
Note that this vignette does not detail all spOccupancy
functionality; rather, it covers only the core functionality presented in the original implementation of the package. Additional functionality is described in a set of separate vignettes.
To get started, we load the spOccupancy
package, as well as the coda
package, which we will use for some MCMC summary and diagnostics. We will also use the stars
and ggplot2
packages to create some basic plots of our results. We then set a seed so you can reproduce the same results as we do.
library(spOccupancy)
library(coda)
library(stars)
library(ggplot2)
set.seed(102)
As an example data set throughout this vignette, we will use data from twelve foliage-gleaning bird species collected from point count surveys at Hubbard Brook Experimental Forest (HBEF) in New Hampshire, USA. Specific details on the data set, which is just a small subset from a long-term survey, are available on the Hubbard Brook website and @doser2022integrated. The data are provided as part of the spOccupancy
package and are loaded with data(hbef2015)
. Point count surveys were conducted at 373 sites over three replicates, each of 10 minutes in length and with a detection radius of 100m. In the data set provided here, we converted these data to detection-nondetection data. Some sites were not visited for all three replicates. Additional information on the data set (including individual species in the data set) can be viewed in the man page using help(hbef2015)
.
data(hbef2015) # Load the data set.
str(hbef2015)  # Get an overview of what's in the data.
Thus, the object hbef2015
is a list comprised of the detection-nondetection data (y
), covariates on the occurrence portion of the model (occ.covs
), covariates on the detection portion of the model (det.covs
), and the spatial coordinates of each site (coords
) for use in spatial occupancy models and in plotting. This list is in the exact format required for input to spOccupancy
model functions. hbef2015
contains data on 12 species in the three-dimensional array y
, where the dimensions of y
correspond to species (12), sites (373), and replicates (3). For single-species occupancy models in Section 2 and 3, we will only use data on the charming Ovenbird (OVEN; Seiurus aurocapilla), so we next subset the hbef2015
list to only include data from OVEN in a new object ovenHBEF
.
sp.names <- dimnames(hbef2015$y)[[1]]
ovenHBEF <- hbef2015
ovenHBEF$y <- ovenHBEF$y[sp.names == "OVEN", , ]
table(ovenHBEF$y) # Quick summary.
We see that OVEN is detected at a little over half of all site-replicate combinations.
Let $z_j$ be the true presence (1) or absence (0) of a species at site $j$, with $j = 1, \dots, J$. For our OVEN example, $J = 373$. Following the basic occupancy model [@mackenzie2002; @tyre2003improving], we assume this latent occurrence variable arises from a Bernoulli process following
\begin{equation}
\begin{split}
&z_j \sim \text{Bernoulli}(\psi_j), \\
&\text{logit}(\psi_j) = \bm{x}^{\top}_j\bm{\beta},
\end{split}
\end{equation}
where $\psi_j$ is the probability of occurrence at site $j$, which is a function of site-specific covariates $\bm{X}$ and a vector of regression coefficients ($\bm{\beta}$).
We do not directly observe $z_j$, but rather we observe an imperfect representation of the latent occurrence process as a result of imperfect detection (i.e., the failure to detect a species at a site when it is truly present). Let $y_{j, k}$ be the observed detection (1) or nondetection (0) of a species of interest at site $j$ during replicate $k$ for each of $k = 1, \dots, K_j$ replicates. Note that the number of replicates, $K_j$, can vary by site and in practical applications will often be equal to 1 for a subset of sites (i.e., some sites will have no replicate surveys). For our OVEN example, the maximum value of $K_j$ is three. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:
\begin{equation}
\begin{split}
&y_{j, k} \sim \text{Bernoulli}(p_{j, k}z_j), \\
&\text{logit}(p_{j, k}) = \bm{v}^{\top}_{j, k}\bm{\alpha},
\end{split}
\end{equation}
where $p_{j, k}$ is the probability of detecting a species at site $j$ during replicate $k$ (given it is present at site $j$), which is a function of site and replicate specific covariates $\bm{V}$ and a vector of regression coefficients ($\bm{\alpha}$).
To complete the Bayesian specification of the model, we assign multivariate normal priors for the occurrence ($\bm{\beta}$) and detection ($\bm{\alpha}$) regression coefficients. To yield an efficient implementation of the occupancy model using a logit link function, we use Pólya-Gamma data augmentation [@polson2013], which is described in depth in a separate MCMC sampler vignette.
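To make the data-generating process concrete, here is a small simulation sketch of the model just described. This is purely illustrative and independent of spOccupancy; all parameter values are assumptions, not estimates from the HBEF data.

J <- 100                               # number of sites
K <- 3                                 # replicate surveys per site
x <- rnorm(J)                          # a site-level occurrence covariate
beta <- c(0.5, 1)                      # occurrence intercept and slope
psi <- plogis(beta[1] + beta[2] * x)   # occurrence probability
z <- rbinom(J, 1, psi)                 # true presence/absence at each site
p <- plogis(0.2)                       # constant detection probability
y <- matrix(rbinom(J * K, 1, p * rep(z, times = K)), nrow = J, ncol = K)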
PGOcc()
The PGOcc()
function fits single-species occupancy models using Pólya-Gamma latent variables, which makes it more efficient than standard Bayesian implementations of occupancy models using a logit link function [@polson2013; @clark2019]. PGOcc()
has the following arguments:
PGOcc(occ.formula, det.formula, data, inits, priors, n.samples,
      n.omp.threads = 1, verbose = TRUE, n.report = 100,
      n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
      k.fold, k.fold.threads = 1, k.fold.seed, k.fold.only = FALSE, ...)
The first two arguments, occ.formula
and det.formula
, use standard R model syntax to denote the covariates to be included in the occurrence and detection portions of the model, respectively. Only the right-hand side of the formulas is included. Random intercepts can be included in both the occurrence and detection portions of the single-species occupancy model using lme4
syntax [@bates2015]. For example, to include a random intercept for different observers in the detection portion of the model, we would include (1 | observer)
in the det.formula
, where observer
indicates the specific observer for each data point. The names of variables given in the formulas should correspond to those found in data
, which is a list consisting of the following tags: y
(detection-nondetection data), occ.covs
(occurrence covariates), det.covs
(detection covariates). y
should be stored as a sites x replicate matrix, occ.covs
as a matrix or data frame with site-specific covariate values, and det.covs
as a list with each list element corresponding to a covariate to include in the detection portion of the model. Covariates on detection can vary by site and/or survey, and so these covariates may be specified as a site-by-survey matrix for survey-level covariates or as a one-dimensional vector (of length equal to the number of sites) for site-level covariates. The ovenHBEF
list is already in the required format. Here we will model OVEN occurrence as a function of linear and quadratic elevation and will include three observational covariates (linear and quadratic day of survey, time of day of survey) on the detection portion of the model. We standardize all covariates by using the scale()
function in our model specification, and use the I()
function to specify quadratic effects:
oven.occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
oven.det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
# Check out the format of ovenHBEF
str(ovenHBEF)
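As an aside, if each survey had a recorded observer identity, a random observer intercept on detection could be added with lme4 syntax as sketched below. The observer covariate is hypothetical and is not part of hbef2015.

# Hypothetical sketch: requires an 'observer' element in det.covs.
oven.det.formula.re <- ~ scale(day) + scale(tod) + I(scale(day)^2) +
  (1 | observer)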
Next, we specify the initial values for the MCMC sampler in inits
. PGOcc()
(and all other spOccupancy
model fitting functions) will set initial values by default, but here we will do this explicitly, since in more complicated cases setting initial values close to the presumed solutions can be vital for the success of an MCMC-based analysis. For all models described in this vignette (in particular the non-spatial models), the choice of initial values is largely inconsequential, except that specifying initial values close to the presumed solutions can decrease the number of samples you need to run to reach convergence of the MCMC chains. Thus, when first running a model in spOccupancy
, we recommend fitting the model using the default initial values that spOccupancy
provides, which are based on the prior distributions. After running the model for a reasonable period, if you find the chains are taking a long time to reach convergence, you then may wish to set the initial values to the mean estimates of the parameters from the initial model fit, as this will likely help reduce the amount of time you need to run the model.
The default initial values for occurrence and detection regression coefficients (including the intercepts) are random values from the prior distributions adopted in the model, while the default initial values for the latent occurrence effects are set to 1 if the species was observed at a site and 0 otherwise. Initial values are specified in a list with the following tags: z
(latent occurrence values), alpha
(detection intercept and regression coefficients), and beta
(occurrence intercept and regression coefficients). Below we set all initial values of the regression coefficients to 0, and set initial values for z
based on the detection-nondetection data matrix. For the occurrence (beta
) and detection (alpha
) regression coefficients, the initial values are passed either as a vector of length equal to the number of estimated parameters (including an intercept, and in the order specified in the model formula), or as a single value if setting the same initial value for all parameters (including the intercept). Below we take the latter approach. To specify the initial values for the latent occurrence at each site (z
), we must ensure we set the value to 1 at all sites where OVEN was detected at least once, because if we detect OVEN at a site then the value of z
is 1 with complete certainty (under the assumption of the model that there are no false positives). If the initial value for z
is set to 0 at one or more sites when the species was detected, the occupancy model will fail. spOccupancy
will provide a clear error message if the supplied initial values for z
are invalid. Below we use the raw detection-nondetection data and the apply()
function to set the initial values to 1 if OVEN was detected at that site and 0 otherwise.
# Format with explicit specification of inits for alpha and beta
# with four detection parameters and three occurrence parameters
# (including the intercepts).
oven.inits <- list(alpha = c(0, 0, 0, 0),
                   beta = c(0, 0, 0),
                   z = apply(ovenHBEF$y, 1, max, na.rm = TRUE))
# Format with abbreviated specification of inits for alpha and beta.
oven.inits <- list(alpha = 0, beta = 0,
                   z = apply(ovenHBEF$y, 1, max, na.rm = TRUE))
We next specify the priors for the occurrence and detection regression coefficients. The Pólya-Gamma data augmentation algorithm employed by spOccupancy
assumes normal priors for both the detection and occurrence regression coefficients. These priors are specified in a list with tags beta.normal
for occurrence and alpha.normal
for detection parameters (including intercepts). Each list element is then itself a list, with the first element of the list consisting of the hypermeans for each coefficient and the second element of the list consisting of the hypervariances for each coefficient. Alternatively, the hypermean and hypervariances can be specified as a single value if the same prior is used for all regression coefficients. By default, spOccupancy
will set the hypermeans to 0 and the hypervariances to 2.72, which corresponds to a relatively flat prior on the probability scale (0, 1; @lunn2013bugs). @broms2016model, @northrup2018comment, and others show such priors are an adequate choice when the goal is to specify relatively non-informative priors. We will use these default priors here, but we specify them explicitly below for clarity.
oven.priors <- list(alpha.normal = list(mean = 0, var = 2.72),
                    beta.normal = list(mean = 0, var = 2.72))
Our last step is to specify the number of samples to produce with the MCMC algorithm (n.samples
), the length of burn-in (n.burn
), the rate at which we want to thin the posterior samples (n.thin
), and the number of MCMC chains to run (n.chains
). Note that currently spOccupancy
runs multiple chains sequentially and does not allow chains to be run simultaneously in parallel across multiple threads. Instead, we allow for within-chain parallelization using the n.omp.threads
argument. We can set n.omp.threads
to a number greater than 1 and smaller than the number of threads on the computer you are using. Generally, setting n.omp.threads > 1
will not result in decreased run times for non-spatial models in spOccupancy
, but can substantially decrease run time when fitting spatial models [@finley2020spnngp]. Here we set n.omp.threads = 1
.
For a simple single-species occupancy model, we shouldn't need too many samples and will only need a small amount of burn-in and thinning. We will run the model using three chains to assess convergence using the Gelman-Rubin diagnostic (Rhat; @brooks1998).
n.samples <- 5000
n.burn <- 3000
n.thin <- 2
n.chains <- 3
We are now nearly set to run the occupancy model. The verbose
argument is a logical value indicating whether or not MCMC sampler progress is reported to the screen. If verbose = TRUE
, sampler progress is reported after every multiple of the specified number of iterations in the n.report
argument. We set verbose = TRUE
and n.report = 1000
to report progress after every 1000th MCMC iteration. The last four arguments to PGOcc()
(k.fold
, k.fold.threads
, k.fold.seed
, k.fold.only
) are used for performing k-fold cross-validation for model assessment, which we will illustrate in a subsequent section. For now, we won't specify these arguments, which tells PGOcc()
not to perform k-fold cross-validation.
out <- PGOcc(occ.formula = oven.occ.formula,
             det.formula = oven.det.formula,
             data = ovenHBEF,
             inits = oven.inits,
             n.samples = n.samples,
             priors = oven.priors,
             n.omp.threads = 1,
             verbose = TRUE,
             n.report = 1000,
             n.burn = n.burn,
             n.thin = n.thin,
             n.chains = n.chains)
names(out) # Look at the contents of the resulting object.
PGOcc()
returns a list of class PGOcc
with a suite of different objects, many of them being coda::mcmc
objects of posterior samples. Note that the "Preparing to run the model" section of the printed output doesn't show any information. spOccupancy model fitting functions present messages in this section when preparing the data for the model, or print out the default priors or initial values used when they are not specified in the function call. Here we specified everything explicitly, so no information was reported.
For a concise yet informative summary of the regression parameters and convergence of the MCMC chains, we can use summary()
on the resulting PGOcc()
object.
summary(out)
Note that all coefficients are printed on the logit scale. We see OVEN is fairly widespread (at the scale of the survey) in the forest given the large intercept value, which translates to a value of 0.89 on the probability scale (run plogis(2.13)
). In addition, the negative linear and quadratic terms for Elevation
suggest occurrence probability peaks at mid-elevations. On average, detection probability is about 0.69.
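As a quick sketch of the back-transformation, the logit-scale intercept reported above can be converted to the probability scale (the value 2.13 comes from our run and will differ slightly across runs due to Monte Carlo error):

plogis(2.13) # Occurrence probability at average covariate values, ~0.89.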
In principle, before we even look at the parameter summaries, we ought to ensure that the MCMC chains have reached convergence. The summary
function also returns the Gelman-Rubin diagnostic [@brooks1998] and the effective sample size (ESS) of the posterior samples, which we can use to assess convergence. Here we see all Rhat values are less than 1.1 and the ESS values indicate adequate mixing of the MCMC chains. Additionally, we can use the plot()
function to plot traceplots of the individual model parameters that are contained in the resulting PGOcc
object. The plot()
function takes three arguments: x
(the model object), param
(a character string denoting the parameter name), and density
(logical value indicating whether to also plot the density of MCMC samples along with the traceplot). See ?plot.PGOcc
for more details (similar functions exist for all spOccupancy
model objects).
plot(out, 'beta', density = FALSE) # Occupancy parameters.
plot(out, 'alpha', density = FALSE) # Detection parameters.
The function ppcOcc()
performs a posterior predictive check on all spOccupancy
model objects as a Goodness of Fit (GoF) assessment. The fundamental idea of GoF testing is that a good model should generate data that closely align with the observed data. If the model-generated data differ drastically from the true data, our model is likely not very useful [@hobbs2015]. GoF assessments are more complicated for binary data, like the detection-nondetection data used in occupancy models, as standard approaches are not valid for the raw data in binary response models [@broms2016model; @mccullagh2019generalized]. Thus, any approach to assess model fit for detection-nondetection data must bin, or aggregate, the raw values in some manner, and then perform a model fit assessment on the binned values. There are numerous ways we could envision binning the raw detection-nondetection values [@mackenzie2004assessing; @kery2015applied]. In spOccupancy
, a posterior predictive check broadly takes the following steps:

1. Fit the model using one of the spOccupancy model-fitting functions (e.g., PGOcc()), which generates replicated values for all detection-nondetection data points.
2. Bin, or group, both the observed and replicated detection-nondetection values in some manner (e.g., by site or by replicate).
3. Compute a fit statistic on both the binned observed data and the binned replicate data, and compare the two to assess fit.

To perform a posterior predictive check, we send the resulting PGOcc
model object as input to the ppcOcc()
function, along with a fit statistic (fit.stat
) and a numeric value indicating how to group, or bin, the data (group
). Currently supported fit statistics include the Freeman-Tukey statistic and the Chi-Squared statistic (freeman-tukey
or chi-squared
, respectively, @kery2015applied). Currently, ppcOcc()
allows the user to group the data by row (site; group = 1
) or column (replicate; group = 2
). ppcOcc()
will then return a set of posterior samples for the fit statistic (or discrepancy measure) using the actual data (fit.y
) and model generated replicate data set (fit.y.rep
), summed across all data points in the chosen manner. We generally recommend performing a posterior predictive check when grouping data both across sites (group = 1
) as well as across replicates (group = 2
), as they may reveal (or fail to reveal) different inadequacies of the model for the specific data set at hand [@kery2015applied]. In particular, binning the data across sites (group = 1
) may help reveal whether the model fails to adequately represent variation in occurrence and detection probability across space, while binning the data across replicates (group = 2
) may help reveal whether the model fails to adequately represent variation in detection probability across the different replicate surveys. Similarly, we suggest exploring posterior predictive checks using both the Freeman-Tukey Statistic as well as the Chi-Squared statistic. Generally, the more ways we explore the fit of our model to the data, the more confidence we have that our model is adequately representing the data and we can draw reasonable conclusions from it. By performing posterior predictive checks with the two fit statistics and two ways of grouping the data, spOccupancy
provides us with four different measures that we can use to assess the validity of our model. Throughout this vignette, we will display different types of posterior predictive checks using different combinations of the fit statistic and grouping approach. In a more complete analysis, we would run all four types of posterior predictive checks for each model we fit as a more complete GoF assessment.
The resulting values from a call to ppcOcc()
can be used with the summary()
function to generate a Bayesian p-value, which is the probability, under the fitted model, of obtaining a value of the fit statistic that is more extreme (i.e., larger) than the one observed for the actual data. Bayesian p-values are sensitive to individual values, so we should also explore the discrepancy measures for each "grouped" data point. ppcOcc()
returns a matrix of posterior quantiles for the fit statistic for both the observed (fit.y.group.quants
) and model generated, replicate data (fit.y.rep.group.quants
) for each "grouped" data point.
We next perform a posterior predictive check using the Freeman-Tukey statistic grouping the data by sites. We summarize the posterior predictive check with the summary()
function, which reports a Bayesian p-value. A Bayesian p-value that hovers around 0.5 indicates adequate model fit, while values less than 0.1 or greater than 0.9 suggest our model does not fit the data well [@hobbs2015]. As always with a simulation-based analysis using MCMC, you will get numerically slightly different values.
ppc.out <- ppcOcc(out, fit.stat = 'freeman-tukey', group = 1)
summary(ppc.out)
The Bayesian p-value is the proportion of posterior samples of the fit statistic of the model generated data that are greater than the corresponding fit statistic of the true data, summed across all "grouped" data points. We can create a visual representation of the Bayesian p-value as follows [@kery2015applied]:
ppc.df <- data.frame(fit = ppc.out$fit.y,
                     fit.rep = ppc.out$fit.y.rep,
                     color = 'lightskyblue1')
ppc.df$color[ppc.df$fit.rep > ppc.df$fit] <- 'lightsalmon'
plot(ppc.df$fit, ppc.df$fit.rep, bg = ppc.df$color, pch = 21,
     ylab = 'Fit', xlab = 'True')
lines(ppc.df$fit, ppc.df$fit, col = 'black')
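The Bayesian p-value can also be computed directly from these posterior samples; the following is a sketch of the same quantity reported by summary(ppc.out):

# Proportion of samples where the replicate-data fit statistic exceeds
# the observed-data fit statistic (i.e., the Bayesian p-value).
mean(ppc.out$fit.y.rep > ppc.out$fit.y)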
Our Bayesian p-value is greater than 0.1, indicating no lack of fit, although the above plot shows most of the fit statistics are smaller for the replicate data than for the actual data set. Relying solely on the Bayesian p-value as an assessment of model fit is not always a great option, as individual data points can have an overbearing influence on the resulting summary value. Instead of summing across all data points for a single discrepancy measure, ppcOcc()
also allows us to explore discrepancy measures on a "grouped" point by point basis. The resulting ppcOcc
object will contain the objects fit.y.group.quants
and fit.y.rep.group.quants
, which contain quantiles of the posterior distributions for the discrepancy measures of each grouped data point. Below we plot the difference in the discrepancy measure between the replicate and actual data across each of the sites.
diff.fit <- ppc.out$fit.y.rep.group.quants[3, ] -
  ppc.out$fit.y.group.quants[3, ]
plot(diff.fit, pch = 19, xlab = 'Site ID',
     ylab = 'Replicate - True Discrepancy')
We see there are four sites that contribute disproportionately to the overall GoF measure, i.e., whose discrepancy measure for the actual data is much larger than the corresponding discrepancy for the replicate data. Here we will ignore this, but in a real analysis it might be very insightful to further explore these sites to see what could explain this pattern (e.g., are the sites close together in space? Could the data be the result of erroneous recording, or of extraordinary local habitat that is not adequately described by the occurrence covariates?). Although rarely done, closer investigation of outlying values in statistical analyses may sometimes teach us more than mere acceptance of a well-fitting model.
Posterior predictive checks allow us to assess how well our model fits the data, but they are not very useful if we want to compare multiple competing models and ultimately select a final model based on some criterion. Bayesian model selection is very much a constantly changing field. See @hooten2015guide for an accessible overview of Bayesian model selection for ecologists.
For Bayesian hierarchical models like occupancy models, the most common Bayesian model selection criterion, the deviance information criterion or DIC, is not applicable [@hooten2015guide]. Instead, the Widely Applicable Information Criterion [@watanabe2010] is recommended to compare a set of models and select the best-performing model for final analysis.
The WAIC is calculated for all spOccupancy
model objects using the function waicOcc()
. We calculate the WAIC as
$$ \text{WAIC} = -2 \times (\text{elpd} - \text{pD}), $$
where elpd is the expected log point-wise predictive density and pD is the effective number of parameters. We calculate elpd by computing the likelihood for each posterior sample, taking the mean of these likelihood values, taking the log of that mean, and summing the resulting values across all sites. We calculate the effective number of parameters by computing the variance of the log likelihood for each site taken over all posterior samples, and then summing these values across all sites. See Appendix S1 from @broms2016model for more details.
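As a minimal sketch of this calculation, suppose like.samples is a hypothetical matrix of likelihood values with one row per posterior sample and one column per site (spOccupancy computes these quantities internally):

# elpd: log of the posterior mean likelihood, summed across sites.
elpd <- sum(log(apply(like.samples, 2, mean)))
# pD: variance of the log likelihood across samples, summed across sites.
pD <- sum(apply(log(like.samples), 2, var))
-2 * (elpd - pD) # WAIC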
We calculate the WAIC using waicOcc()
for our OVEN model below (as always, note some slight differences with your solutions due to Monte Carlo error).
waicOcc(out)
For illustration, let's next rerun the OVEN model, but this time assuming occurrence is constant across the HBEF, and subsequently compare its WAIC value to that of the full model:
out.small <- PGOcc(occ.formula = ~ 1,
                   det.formula = oven.det.formula,
                   data = ovenHBEF,
                   inits = oven.inits,
                   n.samples = n.samples,
                   priors = oven.priors,
                   n.omp.threads = 1,
                   verbose = FALSE,
                   n.burn = n.burn,
                   n.thin = n.thin,
                   n.chains = n.chains)
waicOcc(out.small)
Smaller values of WAIC indicate models with better performance. We see the WAIC for the model with elevation is smaller than the intercept-only model, indicating elevation is an important predictor for OVEN occurrence in HBEF.
When focusing primarily on predictive performance, a k-fold cross-validation (CV) approach is another attractive, though computationally more intensive, alternative to compare a series of models, especially since WAIC may not always be reliable for occupancy models [@broms2016model]. In spOccupancy
, k-fold cross-validation is accomplished using the arguments k.fold
, k.fold.threads
, and k.fold.seed
in the model fitting function. A k-fold cross-validation approach requires fitting a model $k$ times, where each time the model is fit using $J - J/k$ data points, where $J$ is the total number of sites surveyed at least once in the data set. Each time the model is fit, it uses a different portion of the data and then predicts the remaining $J / k$ hold-out values. Because the hold-out data are not used to fit the model, CV yields true samples from the posterior predictive distribution that we can use to assess the predictive capacity of the model.
As a measure of out-of-sample predictive performance, we use the deviance as a scoring rule following @hooten2015guide. For K-fold cross-validation, it is computed as
\begin{equation} -2 \sum_{k = 1}^K \text{log}\Bigg(\frac{\sum_{q = 1}^Q \text{Bernoulli}(\bm{y}_k \mid \bm{p}^{(q)}\bm{z}_k^{(q)})}{Q}\Bigg), \end{equation}
where $\bm{p}^{(q)}$ and $\bm{z}_k^{(q)}$ are MCMC samples of detection probability and latent occurrence, respectively, arising from a model that is fit without the observations $\bm{y}_k$, and $Q$ is the total number of posterior samples produced from the MCMC sampler. The -2 is used so that smaller values indicate better model fit, which aligns with most information criteria used for model assessment (like the WAIC implemented using waicOcc()
).
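A minimal sketch of this scoring rule for a single hold-out fold, assuming hypothetical objects y.k (the vector of hold-out detection-nondetection values) and p.pred and z.pred (matrices of posterior predictive samples of detection probability and latent occurrence, with one row per posterior sample and one column per hold-out observation):

# Joint Bernoulli likelihood of the hold-out data for each posterior sample.
lik.q <- apply(p.pred * z.pred, 1,
               function(prob) prod(dbinom(y.k, size = 1, prob = prob)))
-2 * log(mean(lik.q)) # Fold-level deviance; spOccupancy sums across folds.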
The three arguments in PGOcc()
(k.fold
, k.fold.threads
, k.fold.seed
) control whether or not k-fold cross validation is performed following the complete fit of the model using the entire data set. The k.fold
argument indicates the number of $k$ folds to use for cross-validation. If k.fold
is not specified, cross-validation is not performed and k.fold.threads
and k.fold.seed
are ignored. The k.fold.threads
argument indicates the number of threads to use for running the $k$ models in parallel across multiple threads. Parallel processing is accomplished using the R packages foreach
and doParallel
. Specifying k.fold.threads > 1
can substantially decrease run time since it allows for models to be fit simultaneously on different threads rather than sequentially. The k.fold.seed
indicates the seed used to randomly split the data into $k$ groups. This is by default set to 100. Lastly, the k.fold.only
argument controls whether the full model is run prior to performing cross-validation (k.fold.only = FALSE
, the default). If set to TRUE
, only k-fold cross-validation will be performed. This can be useful when performing cross-validation after doing some initial exploration with fitting different models with the full data set so that you don't have to rerun the full model again.
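For example, here is a sketch of a cross-validation-only run (with a single chain to keep run time down):

out.cv.only <- PGOcc(occ.formula = oven.occ.formula,
                     det.formula = oven.det.formula,
                     data = ovenHBEF, inits = oven.inits,
                     priors = oven.priors, n.samples = n.samples,
                     n.burn = n.burn, n.thin = n.thin, n.chains = 1,
                     k.fold = 4, k.fold.only = TRUE)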
Below we refit the occupancy model with elevation (linear and quadratic) as an occurrence predictor, this time performing 4-fold cross-validation. We set k.fold = 4
to perform 4-fold cross-validation and k.fold.threads = 1
to run the model using 1 thread. Normally we would set k.fold.threads = 4
, but using multiple threads leads to complications when compiling this vignette, so we leave that to you to explore the computational improvements of performing cross-validation across multiple cores.
out.k.fold <- PGOcc(occ.formula = oven.occ.formula,
                    det.formula = oven.det.formula,
                    data = ovenHBEF,
                    inits = oven.inits,
                    n.samples = n.samples,
                    priors = oven.priors,
                    n.omp.threads = 1,
                    verbose = TRUE,
                    n.report = 1000,
                    n.burn = n.burn,
                    n.thin = n.thin,
                    n.chains = n.chains,
                    k.fold = 4,
                    k.fold.threads = 1)
We subsequently refit the intercept only occupancy model, and compare the deviance metrics from the 4-fold cross-validation.
# Model fitting information is suppressed for space.
out.int.k.fold <- PGOcc(occ.formula = ~ 1,
                        det.formula = oven.det.formula,
                        data = ovenHBEF,
                        inits = oven.inits,
                        n.samples = n.samples,
                        priors = oven.priors,
                        n.omp.threads = 1,
                        verbose = FALSE,
                        n.report = 1000,
                        n.burn = n.burn,
                        n.thin = n.thin,
                        n.chains = n.chains,
                        k.fold = 4,
                        k.fold.threads = 1)
The cross-validation metric (model deviance) is stored in the k.fold.deviance
tag of the resulting model object.
out.k.fold$k.fold.deviance     # Larger model.
out.int.k.fold$k.fold.deviance # Intercept-only model.
Similar to the results from the WAIC, CV also suggests that the model including elevation as a predictor outperforms the intercept-only model.
All model objects resulting from a call to spOccupancy
model-fitting functions can be used with predict()
to generate a series of posterior predictive samples at new locations, given the values of all covariates used in the model fitting process. The object hbefElev
(which comes as part of the spOccupancy
package) contains elevation data at a 30 x 30 m resolution from the National Elevation Dataset across the entire HBEF. We load the data below:
data(hbefElev)
str(hbefElev)
The column val
contains the elevation values, while Easting
and Northing
contain the spatial coordinates that we will use for plotting. We can obtain posterior predictive samples for the occurrence probabilities at these sites by using the predict()
function and our PGOcc
fitted model object. Given that we standardized the elevation values when we fit the model, we need to standardize the elevation values for prediction using the exact same mean and standard deviation used when fitting the model.
elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) /
  sd(ovenHBEF$occ.covs[, 1])
# These are the new intercept and covariate data.
X.0 <- cbind(1, elev.pred, elev.pred^2)
out.pred <- predict(out, X.0)
For PGOcc
objects, the predict
function takes two arguments: (1) the PGOcc
fitted model object; and (2) a matrix or data frame consisting of the design matrix for the prediction locations (which must include an intercept if our model contained one). The resulting object consists of posterior predictive samples for the latent occurrence probabilities (psi.0.samples
) and latent occurrence values (z.0.samples
). The beauty of the Bayesian paradigm, and the MCMC computing machinery, is that these predictions all have fully propagated uncertainty. We can use these values to create plots of the predicted mean occurrence values, as well as of their standard deviation as a measure of the uncertainty in these predictions akin to a standard error associated with maximum likelihood estimates. We could also produce a map for the Bayesian credible interval length as an alternative measure of prediction uncertainty or two maps, one showing the lower and the other the upper limit of, say, a 95% credible interval.
plot.dat <- data.frame(x = hbefElev$Easting,
                       y = hbefElev$Northing,
                       mean.psi = apply(out.pred$psi.0.samples, 2, mean),
                       sd.psi = apply(out.pred$psi.0.samples, 2, sd),
                       stringsAsFactors = FALSE)
# Make a species distribution map showing the point estimates,
# or predictions (posterior means)
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = 'Mean OVEN occurrence probability') +
  theme_bw()
# Map of the associated uncertainty of these predictions
# (i.e., posterior sds)
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = 'SD OVEN occurrence probability') +
  theme_bw()
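As a sketch of the alternative uncertainty summary mentioned above, we can map the width of a 95% credible interval for occurrence probability at each prediction location:

# Width of the 95% credible interval of occurrence probability.
plot.dat$ci.width <- apply(out.pred$psi.0.samples, 2,
                           function(x) diff(quantile(x, c(0.025, 0.975))))
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = ci.width)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = '95% CI width for OVEN occurrence probability') +
  theme_bw()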
When working across large spatial domains, accounting for residual spatial autocorrelation in species distributions can often improve the predictive performance of a model, leading to more accurate species distribution maps [@guelat2018; @lany2020]. Here we extend the basic single-species occupancy model to incorporate a spatial Gaussian Process that accounts for unexplained spatial variation in species occurrence across a region of interest. Let $\bm{s}_j$ denote the geographical coordinates of site $j$ for $j = 1, \dots, J$. In all spatially-explicit models, we include $\bm{s}_j$ directly in the notation of spatially-indexed variables to indicate the model is spatially-explicit. More specifically, the occurrence probability at site $j$ with coordinates $\bm{s}_j$, $\psi(\bm{s}_j)$, now takes the form
\begin{equation}
\text{logit}(\psi(\bm{s}_j)) = \bm{x}(\bm{s}_j)^{\top}\bm{\beta} + \omega(\bm{s}_j),
\end{equation}
where $\omega(\bm{s}_j)$ is a realization from a zero-mean spatial Gaussian Process, i.e.,
\begin{equation}
\bm{\omega}(\bm{s}) \sim N(\bm{0}, \bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})).
\end{equation}
We define $\bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})$ as a $J \times J$ covariance matrix that is a function of the distances between any pair of site coordinates $\bm{s}$ and $\bm{s}'$ and a set of parameters ($\bm{\theta}$) that govern the spatial process. The vector $\bm{\theta}$ is equal to $\bm{\theta} = \{\sigma^2, \phi, \nu\}$, where $\sigma^2$ is a spatial variance parameter, $\phi$ is a spatial decay parameter, and $\nu$ is a spatial smoothness parameter. $\nu$ is only specified when using a Matern correlation function.
The detection portion of the occupancy model remains unchanged from the non-spatial occupancy model. Single-species spatial occupancy models, like all models in spOccupancy
, are fit using Pólya-Gamma data augmentation (see the MCMC sampler vignette for details).
When the number of sites is moderately large (say, 1000 or more), the spatial Gaussian process model described above can become prohibitively slow as a result of the need to invert the spatial covariance matrix $\bm{\Sigma}(\bm{s}, \bm{s}', \bm{\theta})$ at each MCMC iteration. Numerous approximation methods exist to reduce this computational cost (see @heaton2019case for an overview and comparison of multiple methods). One attractive approach is the Nearest Neighbor Gaussian Process (NNGP; @datta2016hierarchical). Instead of modeling the spatial process using a full Gaussian Process, we replace the Gaussian Process prior specification with an NNGP, which leads to drastic improvements in run time with nearly identical inference and prediction as the full Gaussian Process specification. See @datta2016hierarchical, @finley2019efficient, and the MCMC sampler vignette for additional statistical details on NNGPs and their implementation in spatial occupancy models.
spPGOcc()
The function spPGOcc()
fits single-species spatial occupancy models using Pólya-Gamma latent variables, where spatial autocorrelation is accounted for using a spatial Gaussian Process. spPGOcc()
fits spatial occupancy models using either a full Gaussian process or an NNGP. See @finley2020spnngp for details on using NNGPs with Pólya-Gamma latent variables.
We will fit the same occupancy model for OVEN that we fit previously using PGOcc()
, but we will now make the model spatially explicit by incorporating a spatial process with spPGOcc()
. First, let's take a look at the arguments for spPGOcc()
:
spPGOcc(occ.formula, det.formula, data, inits, n.batch, batch.length,
        accept.rate = 0.43, priors, cov.model = "exponential", tuning,
        n.omp.threads = 1, verbose = TRUE, NNGP = FALSE, n.neighbors = 15,
        search.type = "cb", n.report = 100,
        n.burn = round(.10 * n.batch * batch.length), n.thin = 1,
        n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed = 100,
        k.fold.only = FALSE, ...)
We will walk through each of the arguments to spPGOcc()
in the context of our Ovenbird example. The occurrence (occ.formula
) and detection (det.formula
) formulas, as well as the list of data (data
), take the same form as we saw in PGOcc
, with random intercepts allowed in both the occurrence and detection models. Notice the coords
matrix in the ovenHBEF
list of data. We did not use this for PGOcc()
but specifying the spatial coordinates in data
is required for all spatially explicit models in spOccupancy
.
oven.occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
oven.det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
str(ovenHBEF) # coords is required for spPGOcc.
The initial values (inits
) are again specified in a list. Valid tags for initial values now additionally include the parameters associated with the spatial random effects. These include: sigma.sq
(spatial variance parameter), phi
(spatial range parameter), w
(the latent spatial random effects at each site), and nu
(spatial smoothness parameter), where the latter is only specified if adopting a Matern covariance function (i.e., cov.model = 'matern'
). spOccupancy
supports four spatial covariance models (exponential
, spherical
, gaussian
, and matern
), which are specified in the cov.model
argument. Throughout this vignette, we will use an exponential covariance model, which we often use as our default covariance model when fitting spatially-explicit models and is commonly used throughout ecology. To determine which covariance function to use, we can fit models with the different covariance functions and compare them using WAIC or k-fold cross-validation to select the best performing function. We will note that the Matern covariance function has the additional spatial smoothness parameter $\nu$ and thus can often be more flexible than the other functions. However, because we need to estimate an additional parameter, this also tends to require more data (i.e., a larger number of sites) than the other covariance functions, and so we encourage use of the three simpler functions if your data set is sparse. We note that model estimates are generally fairly robust to the different covariance functions, although certain functions may provide substantially better estimates depending on the specific form of the underlying spatial autocorrelation in the data. For example, the Gaussian covariance function is often useful for accounting for spatial autocorrelation that is very smooth (i.e., long range spatial dependence). See Chapter 2 in @banerjee2003 for a more thorough discussion of these functions and their mathematical properties.
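To build some intuition for the exponential covariance model, the following sketch plots the implied correlation, exp(-phi * d), as a function of distance d; the value of the decay parameter used here is an arbitrary assumption for illustration:

phi.example <- 3 / 2000        # hypothetical decay parameter (1 / meters)
d <- seq(0, 5000, by = 10)     # distances in meters
plot(d, exp(-phi.example * d), type = 'l',
     xlab = 'Distance (m)', ylab = 'Correlation')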
The default initial values for phi
, sigma.sq
, and nu
are all set to random values from the prior distribution, which generally will be sufficient for the models discussed in this vignette. In all spatially-explicit models described in this vignette, the spatial range parameter phi
is the most sensitive to initial values. In general, the spatial range parameter will often have poor mixing and take longer to converge than the rest of the parameters in the model, so specifying an initial value that is reasonably close to the resulting value can really help decrease run times for complicated models. As an initial value for the spatial range parameter phi
, we compute the mean distance between points in HBEF and then set it equal to 3 divided by this mean distance. When using an exponential covariance function, $\frac{3}{\phi}$ is the effective range, or the distance at which the residual spatial correlation between two sites drops to 0.05 [@banerjee2003]. Thus our initial guess for this effective range is the average distance between sites across HBEF. As with all other parameters, we generally recommend using the default initial values for an initial model run, and if the model is taking a very long time to converge you can rerun the model with initial values based on the posterior means of estimated parameters from the initial model fit. For the spatial variance parameter sigma.sq
, we set the initial value to 2. This corresponds to a fairly small spatial variance, which we expect based on our previous work with this data set. Further, we set the initial values of the latent spatial random effects at each site to 0. The initial values for these random effects have an extremely small influence on the model results, so we generally recommend setting them to 0 as we have done here (this is also the default). However, if you are running your model for a very long time and are seeing very slow convergence of the MCMC chains, setting the initial values of the spatial random effects to the mean estimates from a previous run of the model could help reach convergence faster.
# Pair-wise distances between all sites
dist.hbef <- dist(ovenHBEF$coords)
# Exponential covariance model
cov.model <- "exponential"
# Specify list of inits
oven.inits <- list(alpha = 0, beta = 0,
                   z = apply(ovenHBEF$y, 1, max, na.rm = TRUE),
                   sigma.sq = 2,
                   phi = 3 / mean(dist.hbef),
                   w = rep(0, nrow(ovenHBEF$y)))
The next three arguments (n.batch
, batch.length
, and accept.rate
) are all related to the specific type of MCMC sampler we use when we fit the model. The spatial range parameter (and the spatial smoothness parameter if cov.model = 'matern'
) are the two hardest parameters to estimate in spatially-explicit models in spOccupancy
. In other words, you will often see slow mixing and convergence of the MCMC chains for these parameters. To try to speed up this slow mixing and convergence of these parameters, we use an algorithm called an adaptive Metropolis-Hastings algorithm for all spatially-explicit models in spOccupancy
(see @roberts2009examples for more details on this algorithm). In this approach, we break up the total number of MCMC samples into a set of "batches", where each batch has a specific number of MCMC samples. When we fit a spatially-explicit model in spOccupancy
, instead of specifying the total number of MCMC samples in the n.samples
argument like we did in PGOcc()
, we must specify the total number of batches (n.batch
) as well as the number of MCMC samples each batch contains. Thus, the total number of MCMC samples is n.batch * batch.length
. Typically, we set batch.length = 25
and then play around with n.batch
until convergence of all model parameters is reached. We recommend setting batch.length = 25
unless you have a specific reason to change it. Here we set n.batch = 400
for a total of 10,000 MCMC samples in each of 3 chains. We additionally specify a burn-in period of length 2000 and a thinning rate of 20.
batch.length <- 25
n.batch <- 400
n.burn <- 2000
n.thin <- 20
n.chains <- 3
Importantly, we also need to specify an acceptance rate and a tuning parameter for the spatial range parameter (and spatial smoothness parameter if cov.model = 'matern'
), which are both features of the adaptive algorithm we use to sample these parameters. In this algorithm, we propose new values for phi
(and nu
), compare them to our previous values, and use a statistical algorithm to determine if we should accept the new proposed value or keep the old one. The accept.rate
argument specifies the ideal proportion of times we will accept the newly proposed values for these parameters. @roberts2009examples show that accepting new values around 43% of the time leads to optimal mixing and convergence of the MCMC chains. Thus, we recommend setting accept.rate = 0.43
unless you have a specific reason not to (this is the default value). The values specified in the tuning
argument help control the initial values we will propose for both phi
and nu
. These values are supplied as input in the form of a list with tags phi
and nu
. The initial tuning value can be any value greater than 0, but we generally recommend starting the value out around 0.5. This initial tuning value is closely related to the ideal (or target) acceptance rate we specified in accept.rate
. In short, the algorithm we use is "adaptive" because the algorithm will change the tuning values after each batch of the MCMC to yield acceptance rates that are close to our target acceptance rate that we specified in the accept.rate
argument. Information on the acceptance rates for phi
and nu
in your model will be displayed when setting verbose = TRUE
. After some initial runs of the model, if you notice the final acceptance rate is much larger or smaller than the target acceptance rate (accept.rate
), you can then change the initial tuning value to get closer to the target rate. While use of this algorithm requires us to specify more arguments in spatially-explicit models, this leads to much shorter run times compared to a more simple approach where we do not have an "adaptive" sampling approach, and it should thus save you time in the long haul when waiting for these models to run. For our example here, we set the initial tuning value to 1 after some initial exploratory runs of the model.
oven.tuning <- list(phi = 1)
# accept.rate = 0.43 by default, so we do not specify it.
Priors are again specified in a list in the argument priors
. We follow standard recommendations for prior distributions from the spatial statistics literature [@banerjee2003]. We assume an inverse gamma prior for the spatial variance parameter sigma.sq
(the tag of which is sigma.sq.ig
), and uniform priors for the spatial decay parameter phi
and smoothness parameter nu
(if using the Matern correlation function), with the associated tags phi.unif
and nu.unif
. The hyperparameters of the inverse-Gamma prior are passed as a vector of length two, with the first and second elements corresponding to the shape and scale, respectively. The lower and upper bounds of each uniform prior are likewise passed as a two-element vector.
As of v0.3.2, we allow users to specify a uniform prior on the spatial variance parameter sigma.sq
(using the tag sigma.sq.unif
) instead of an inverse-Gamma prior. This can be useful in certain situations when working with a binary response variable (which we inherently do in occupancy models), as there is a confounding between the spatial variance parameter sigma.sq
and the occurrence intercept. This occurs as a result of the logit transformation and a mathematical statement known as Jensen's Inequality [@bolker2015]. Briefly, Jensen's Inequality tells us that while the spatial random effects don't influence the mean on the logit scale (since we give them a prior with mean 0), they do have an influence on the mean on the actual probability scale (after back-transforming), which is what leads to potential confounding between the spatial variance parameter and the occurrence intercept. Generally, we have found this confounding to be inconsequential, as the spatial structure of the random effects helps to separate the spatial variance from the occurrence intercept. However, there may be certain circumstances when $\sigma^2$ is estimated to be extremely large, and the estimate of the occurrence intercept is a very large magnitude negative number in order to compensate. It can be helpful in these situations to use a uniform distribution on sigma.sq
to restrict it to taking more reasonable values. We have rarely encountered such situations, and so throughout this vignette (and in the vast majority of our analyses) we will use an inverse-Gamma prior.
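For reference, here is a sketch of specifying such a prior using the sigma.sq.unif tag; the bounds shown are arbitrary assumptions for illustration:

# Hypothetical alternative prior list with a uniform prior on sigma.sq.
priors.unif <- list(beta.normal = list(mean = 0, var = 2.72),
                    alpha.normal = list(mean = 0, var = 2.72),
                    sigma.sq.unif = c(0.001, 10),
                    phi.unif = c(3 / max(dist.hbef), 3 / min(dist.hbef)))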
Note that the priors for the spatial parameters in a spatially-explicit model must be at least weakly informative for the model to converge [@banerjee2003]. For the inverse-Gamma prior on the spatial variance, we typically set the shape parameter to 2 and the scale parameter equal to our best guess of the spatial variance. The default prior hyperparameter values for the spatial variance $\sigma^2$ are a shape parameter of 2 and a scale parameter of 1. This weakly informative prior suggests a prior mean of 1 for the spatial variance, which is a moderately small amount of spatial variation. Based on our previous work with these data, we expected the residual spatial variation to be relatively small, and so we use this default value and set the scale parameter below to 1. For the spatial decay parameter, our default approach is to set the lower and upper bounds of the uniform prior based on the minimum and maximum distances between sites in the data. More specifically, by default we set the lower bound to 3 / max
and the upper bound to 3 / min
, where min
and max
are the minimum and maximum distances between sites in the data set, respectively. This equates to a vague prior that states the spatial autocorrelation in the data could only exist between sites that are very close together, or could span across the entire observed study area. If additional information is known on the extent of the spatial autocorrelation in the data, you may place more restrictive bounds on the uniform prior, which would reduce the amount of time needed for adequate mixing and convergence of the MCMC chains. Here we use this default approach, but will explicitly set the values for transparency.
min.dist <- min(dist.hbef)
max.dist <- max(dist.hbef)
oven.priors <- list(beta.normal = list(mean = 0, var = 2.72),
                    alpha.normal = list(mean = 0, var = 2.72),
                    sigma.sq.ig = c(2, 1),
                    phi.unif = c(3 / max.dist, 3 / min.dist))
The argument n.omp.threads
specifies the number of threads to use for within-chain parallelization, while verbose
specifies whether or not to print the progress of the sampler. We highly recommend setting verbose = TRUE
for all spatial models to ensure the adaptive MCMC is working as you want (which is why this is the default for this argument). The argument n.report
specifies the interval to report the Metropolis-Hastings sampler acceptance rate. Note that n.report
is specified in terms of batches, not the overall number of samples. Below we set n.report = 100
, which will result in information on the acceptance rate and tuning parameters every 100th batch (not sample).
n.omp.threads <- 1
verbose <- TRUE
n.report <- 100 # Report progress at every 100th batch.
The parameters NNGP
, n.neighbors
, and search.type
relate to whether you want to fit the model with a full Gaussian Process or with an NNGP, which is a much more computationally efficient approximation. The argument NNGP
is a logical value indicating whether to fit the model with an NNGP (TRUE
) or a regular Gaussian Process (FALSE
). For data sets that have more than 1000 locations, using an NNGP will yield substantial improvements in run time. Even for more modest-sized data sets, like the HBEF data set, using an NNGP will be quite a bit faster (especially for multi-species models, as shown in subsequent sections). Unless your data set is particularly small (e.g., 100 points) and you are concerned about the NNGP approximation, we recommend setting NNGP = TRUE
, which is the default. The arguments n.neighbors
and search.type
specify the number of neighbors used in the NNGP and the nearest neighbor search algorithm, respectively, to use for the NNGP model. Generally, the default values of these arguments will be adequate. @datta2016hierarchical showed that setting n.neighbors = 15
is usually sufficient, although for certain data sets a good approximation can be achieved with as few as five neighbors, which could substantially decrease run time for the model. We generally recommend leaving search.type = "cb"
, as this results in a fast code book nearest neighbor search algorithm. However, details on when you may want to change this are described in @finley2020spnngp. We will run an NNGP model using the default value for search.type
and setting n.neighbors = 5
, which we have found in exploratory analysis to closely approximate a full Gaussian Process.
We now fit the model (without k-fold cross-validation) and summarize the results using summary()
.
# Approx. run time: < 1 minute
out.sp <- spPGOcc(occ.formula = oven.occ.formula,
                  det.formula = oven.det.formula,
                  data = ovenHBEF,
                  inits = oven.inits,
                  n.batch = n.batch,
                  batch.length = batch.length,
                  priors = oven.priors,
                  cov.model = cov.model,
                  NNGP = TRUE,
                  n.neighbors = 5,
                  tuning = oven.tuning,
                  n.report = n.report,
                  n.burn = n.burn,
                  n.thin = n.thin,
                  n.chains = n.chains)
class(out.sp) # Look at what spOccupancy produced.
names(out.sp)
summary(out.sp)
Most Rhat values are less than 1.1, although for a full analysis we should run the model longer to ensure the spatial variance parameter (sigma.sq
) has converged. The effective sample sizes of the spatial covariance parameters are somewhat low, which indicates limited mixing of these parameters. This is common when fitting spatial models. We note that there is substantial MC error in the spatial covariance parameters, and so your estimates may be slightly different from what we have shown here. We see spPGOcc()
returns a list of class spPGOcc
and consists of posterior samples for all parameters. Note that posterior samples for spatial parameters are stored in the list element theta.samples
.
For our posterior predictive check, we send the spPGOcc
model object to the ppcOcc()
function, this time grouping by replicate, or survey occasion, (group = 2
) instead of by site (group = 1
).
ppc.sp.out <- ppcOcc(out.sp, fit.stat = 'freeman-tukey', group = 2)
summary(ppc.sp.out)
The Bayesian p-value indicates adequate fit of the spatial occupancy model.
We next use the waicOcc()
function to compute the WAIC, which we can compare to the non-spatial model to assess the benefit of incorporating the spatial random effects in the occurrence model.
waicOcc(out.sp)
# Compare to non-spatial model
waicOcc(out)
We see the WAIC value for the spatial model is substantially reduced compared to the non-spatial model, indicating that incorporating the spatial random effects yields an improvement in predictive performance.
k-fold cross-validation is accomplished by specifying the k.fold
argument in spPGOcc
just as we saw in PGOcc
.
Finally, we can perform out-of-sample prediction using the predict
function just as before. Prediction for spatial models is more computationally intensive than for non-spatial models, and so the predict
function for spPGOcc
class objects also has options for parallelization (n.omp.threads
) and reporting sampler progress (verbose
and n.report
). Note that for spPGOcc()
, you also need to supply the coordinates of the out-of-sample prediction locations in addition to the covariate values.
# Do prediction.
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# Approx. run time: 6 min
out.sp.pred <- predict(out.sp, X.0, coords.0, verbose = FALSE)
# Produce a species distribution map (posterior predictive means of occupancy)
plot.dat <- data.frame(x = hbefElev$Easting,
                       y = hbefElev$Northing,
                       mean.psi = apply(out.sp.pred$psi.0.samples, 2, mean),
                       sd.psi = apply(out.sp.pred$psi.0.samples, 2, sd))
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = 'Mean OVEN occurrence probability') +
  theme_bw()
# Produce an uncertainty map of the SDM (posterior predictive SDs of occupancy)
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = 'SD OVEN occurrence probability') +
  theme_bw()
Comparing this to the non-spatial occupancy model, the spatial model appears to identify areas in HBEF with low OVEN occurrence that are not captured in the non-spatial model. We will resist trying to hypothesize what environmental factors could lead to these patterns.
Let $z_{i, j}$ be the true presence (1) or absence (0) of a species $i$ at site $j$, with $j = 1, \dots, J$ and $i = 1, \dots, N$. We assume the latent occurrence variable arises from a Bernoulli process following
\begin{equation} \begin{split} &z_{i, j} \sim \text{Bernoulli}(\psi_{i, j}), \\ &\text{logit}(\psi_{i, j}) = \bm{x}^{\top}_{j}\bm{\beta}_i, \end{split} \end{equation}
where $\psi_{i, j}$ is the probability of occurrence of species $i$ at site $j$, which is a function of site-specific covariates $\bm{X}$ and a vector of species-specific regression coefficients ($\bm{\beta}_i$). The regression coefficients in multi-species occupancy models are envisioned as random effects arising from a common community level distribution:
\begin{equation} \bm{\beta}_i \sim \text{Normal}(\bm{\mu}_{\beta}, \bm{T}_{\beta}), \end{equation}
where $\bm{\mu}_{\beta}$ is a vector of community-level mean effects for each occurrence covariate effect (including the intercept) and $\bm{T}_{\beta}$ is a diagonal matrix with diagonal elements $\bm{\tau}^2_{\beta}$ that represent the variability of each occurrence covariate effect among species in the community.
We do not directly observe $z_{i, j}$, but rather we observe an imperfect representation of the latent occurrence process. Let $y_{i, j, k}$ be the observed detection (1) or nondetection (0) of a species $i$ at site $j$ during replicate/survey $k$ for each of $k = 1, \dots, K_j$ replicates at each site $j$. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:
\begin{equation} \begin{split} &y_{i, j, k} \sim \text{Bernoulli}(p_{i, j, k}z_{i, j}), \\ &\text{logit}(p_{i, j, k}) = \bm{v}^{\top}_{i, j, k}\bm{\alpha}_i, \end{split} \end{equation}
where $p_{i, j, k}$ is the probability of detecting species $i$ at site $j$ during replicate $k$ (given it is present at site $j$), which is a function of site- and replicate-specific covariates $\bm{V}$ and a vector of species-specific regression coefficients ($\bm{\alpha}_i$). Similar to the occurrence regression coefficients, the species-specific detection coefficients are envisioned as random effects arising from a common community-level distribution:
\begin{equation} \bm{\alpha}_i \sim \text{Normal}(\bm{\mu}_{\alpha}, \bm{T}_{\alpha}), \end{equation}
where $\bm{\mu}_{\alpha}$ is a vector of community-level mean effects for each detection covariate effect (including the intercept) and $\bm{T}_{\alpha}$ is a diagonal matrix with diagonal elements $\bm{\tau}^2_{\alpha}$ that represent the variability of each detection covariate effect among species in the community.
To complete the Bayesian specification of the model, we assign multivariate normal priors for the occurrence ($\bm{\mu}_{\beta}$) and detection ($\bm{\mu}_{\alpha}$) community-level regression coefficient means and independent inverse-Gamma priors for each element of $\bm{\tau}^2_{\beta}$ and $\bm{\tau}^2_{\alpha}$. We again use Pólya-Gamma data augmentation to yield an efficient implementation of the multi-species occupancy model, which is described in depth in the MCMC sampler vignette.
msPGOcc()
spOccupancy uses nearly identical syntax for fitting multi-species occupancy models as it does for single-species models, and provides the same functionality for posterior predictive checks, model assessment and selection using WAIC and k-fold cross-validation, and out of sample prediction. The msPGOcc() function fits nonspatial multi-species occupancy models using Pólya-Gamma latent variables, which results in substantial decreases in run time compared to standard implementations of logit link multi-species occupancy models. msPGOcc() has exactly the same arguments as PGOcc():
msPGOcc(occ.formula, det.formula, data, inits, n.samples, priors,
        n.omp.threads = 1, verbose = TRUE, n.report = 100,
        n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
        k.fold, k.fold.threads = 1, k.fold.seed, k.fold.only = FALSE, ...)
For illustration, we will again use the Hubbard Brook data in hbef2015, but we will now model occurrence for all 12 species in the community. Below we reload the hbef2015 data set to get a fresh copy.
data(hbef2015)
# str(hbef2015) # Remind ourselves of what's in it (not shown).
We will model occurrence for all species as a function of linear and quadratic terms for elevation, and detection as a function of linear and quadratic terms for day of survey as well as the time of day during which the survey was conducted. These models are specified in occ.formula and det.formula as before, which reference variables stored in the data list. Random intercepts can be included in both the occurrence and detection portions of the occupancy model using lme4 syntax [@bates2015]. For multi-species models, the detection-nondetection data y are now supplied as a three-dimensional array with dimensions corresponding to species, sites, and replicates. This is how the data are provided in the hbef2015 object, so we don't need to do any additional preparation.
occ.ms.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.ms.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
str(hbef2015)
Next we specify the initial values in inits. For multi-species occupancy models, we supply initial values for both community-level and species-level parameters. In msPGOcc(), we supply initial values for the following parameters: alpha.comm (community-level detection coefficients), beta.comm (community-level occurrence coefficients), alpha (species-level detection coefficients), beta (species-level occurrence coefficients), tau.sq.beta (community-level occurrence variance parameters), tau.sq.alpha (community-level detection variance parameters), and z (latent occurrence variables for all species). These are all specified in a single list. Initial values for community-level parameters are either vectors of length corresponding to the number of community-level detection or occurrence parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial value. Initial values for species-level parameters are either matrices with the number of rows indicating the number of species and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. The initial values for the latent occurrence matrix are specified as a matrix with $N$ rows corresponding to the number of species and $J$ columns corresponding to the number of sites.
N <- dim(hbef2015$y)[1]
ms.inits <- list(alpha.comm = 0, beta.comm = 0, beta = 0, alpha = 0,
                 tau.sq.beta = 1, tau.sq.alpha = 1,
                 z = apply(hbef2015$y, c(1, 2), max, na.rm = TRUE))
In multi-species models, we specify priors on the community-level coefficients (or hyperparameters) rather than on the species-level effects. For nonspatial models, these priors are specified with the following tags: beta.comm.normal (normal prior on the community-level occurrence mean effects), alpha.comm.normal (normal prior on the community-level detection mean effects), tau.sq.beta.ig (inverse-Gamma prior on the community-level occurrence variance parameters), and tau.sq.alpha.ig (inverse-Gamma prior on the community-level detection variance parameters). Each tag consists of a list with elements corresponding to the mean and variance for normal priors and the scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type.
By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 2.72, which results in a vague prior on the probability scale as we discussed for the single-species spatial occupancy model. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of @lunn2013bugs, which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances, such that all species have fairly similar values for a given species-specific covariate effect [@gelman2006prior], although we have found multi-species occupancy models to be relatively robust to this prior specification. If you want to explore the impact of this prior on species-specific covariate effects, we recommend running single-species occupancy models for a select number of species in the data set and assessing how large the differences in estimated species-specific effects are between the single-species models and the multi-species model. Below we explicitly set these default prior values for our HBEF example.
ms.priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
                  alpha.comm.normal = list(mean = 0, var = 2.72),
                  tau.sq.beta.ig = list(a = 0.1, b = 0.1),
                  tau.sq.alpha.ig = list(a = 0.1, b = 0.1))
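As noted above, prior values can also be specified individually for each parameter. Below is a hedged sketch (hypothetical values) for our three occurrence terms; the vectors are ordered as intercept, linear elevation, and quadratic elevation.

# Per-parameter prior specification (hypothetical values, not used below).
ms.priors.alt <- list(beta.comm.normal = list(mean = c(0, 0, 0),
                                              var = c(2.72, 2.72, 2.72)),
                      alpha.comm.normal = list(mean = 0, var = 2.72),
                      tau.sq.beta.ig = list(a = c(0.1, 0.1, 0.1),
                                            b = c(0.1, 0.1, 0.1)),
                      tau.sq.alpha.ig = list(a = 0.1, b = 0.1))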
All that's now left to do is specify the number of threads to use (n.omp.threads), the number of MCMC samples (n.samples), the number of samples to discard as burn-in (n.burn), the thinning rate (n.thin), and the arguments controlling the display of sampler progress (verbose, n.report).
# Approx. run time: 6 min
out.ms <- msPGOcc(occ.formula = occ.ms.formula,
                  det.formula = det.ms.formula,
                  data = hbef2015, inits = ms.inits,
                  n.samples = 30000, priors = ms.priors,
                  n.omp.threads = 1, verbose = TRUE, n.report = 6000,
                  n.burn = 10000, n.thin = 50, n.chains = 3)
The resulting object out.ms is a list of class msPGOcc consisting primarily of posterior samples of all community- and species-level parameters, as well as some additional objects that are used for summaries, prediction, and model fit evaluation. We can display a concise summary of these results using the summary() function as before. For multi-species objects, we need to specify the level of parameters we want to summarize. We do this using the argument level, which takes the values community, species, or both to print results for community-level parameters, species-level parameters, or all parameters, respectively. By default, level is set to both to display both species- and community-level parameters.
summary(out.ms)
# Or more explicitly
# summary(out.ms, level = 'both')
Looking at the community-level variance parameters, we see substantial variability in the average occurrence (the intercept) for the twelve species, as well as considerable variability in the effect of elevation across the community. There appears to be less variability across species in the detection portion of the model. We can look directly at the species-specific effects to confirm this. All community-level Rhat values are less than 1.1 and most species-level Rhat values are less than 1.1, with the exception of a few of the parameters for rare species. Mixing of the MCMC chains appears adequate when looking at the ESS values, although we can clearly see the ESS is lower for rare species (e.g., AMRE, BAWW) compared to more common species in the forest (e.g., OVEN, REVI).
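To dig into convergence beyond the printed summary, we can inspect the diagnostics stored on the fitted object. This is a sketch; we assume here that the rhat and ESS values reported by summary() are stored as list elements on the object.

# Convergence diagnostics stored on the fitted object (assumed elements).
str(out.ms$rhat) # Rhat values by parameter type.
str(out.ms$ESS)  # Effective sample sizes by parameter type.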
We can use the ppcOcc() function to perform a posterior predictive check, and summarize the check with a Bayesian p-value using the summary() function, all as shown before. The summary() function again takes the level argument to specify whether you want an overall Bayesian p-value for the entire community (level = 'community'), one for each individual species (level = 'species'), or both (level = 'both'). By default, level is set to 'both'.
# Approx. run time: 20 sec
ppc.ms.out <- ppcOcc(out.ms, 'chi-squared', group = 1)
summary(ppc.ms.out)
The Bayesian p-value for the overall community suggests an adequate model fit, but we see clear variation in the values for each individual species. We should explore this further in a complete analysis.
We can compute the WAIC for comparison with alternative models using the waicOcc() function.
waicOcc(out.ms)
As of v0.7.0, the waicOcc() function also contains the argument by.sp for all multi-species models supported by spOccupancy, which allows us to calculate the WAIC individually for each species.
waicOcc(out.ms, by.sp = TRUE)
Note that the sum of the WAIC values for each individual species is equal to the overall WAIC value for the model.
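We can verify this additive relationship directly. Below is a small sketch, assuming the object returned by waicOcc() contains a WAIC element.

# Check that species-level WAIC values sum to the overall WAIC
# (a sketch; assumes a WAIC element in the returned object).
waic.by.sp <- waicOcc(out.ms, by.sp = TRUE)
sum(waic.by.sp$WAIC)
waicOcc(out.ms)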
k-fold cross-validation is again accomplished using the k.fold argument as we have seen previously. For multi-species occupancy models, using multiple threads can greatly reduce the time needed for k-fold cross-validation, so we encourage the use of multiple threads if such computing power is readily available; using up to $k$ threads will generally involve substantial decreases in run time. For multi-species models, a separate deviance measure is reported for each species as a measure of predictive capacity, allowing for comparisons across multiple models for individual species, as well as for the entire community (by summing all species-specific values). A sketch follows below.
Prediction with msPGOcc objects is exactly analogous to what we saw with PGOcc. We can use the predict function along with a data frame of covariates at new locations. We can predict across the entire HBEF for all twelve species using the elevation data stored in hbefElev.
elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) / sd(ovenHBEF$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
# Approx. run time: 100 sec
out.ms.pred <- predict(out.ms, X.0)
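The resulting psi.0.samples object holds posterior predictive occurrence samples for every species at every prediction location. A sketch of how to summarize it, assuming the array dimensions are ordered as MCMC samples, species, and prediction sites:

# Posterior mean occurrence for each species at each prediction location
# (a sketch; assumes dimensions ordered as samples x species x sites).
str(out.ms.pred$psi.0.samples)
mean.psi.ms <- apply(out.ms.pred$psi.0.samples, c(2, 3), mean)
dim(mean.psi.ms) # Species x prediction sites.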
Here we describe a basic multi-species spatial occupancy model that was part of the original release of spOccupancy. Since then, we have implemented a more computationally efficient approach for fitting spatial multi-species occupancy models. We call this alternative approach a "spatial factor multi-species occupancy model", and we describe it in depth in @doser2023joint. This newer approach also accounts for residual species correlations (i.e., it is a joint species distribution model with imperfect detection). Our simulation results in @doser2023joint show that this alternative approach outperforms, or performs equally as well as, the model described in the following section, while being substantially faster. Thus, we encourage you to explore the spOccupancy functionality to fit the spatial factor multi-species occupancy model with the function sfMsPGOcc() that is described in depth in this vignette. The model and function we describe below will generally work well for communities comprised of a moderate number of species (e.g., between 5 and 10) that are not extremely rare.
Residual spatial autocorrelation may be more prominent in multi-species occupancy models than in single-species models, as a single set of covariates is used to explain occurrence probability across a region of interest for every species in the modeled community. Given the wide variety of habitat requirements among individual species, important drivers of occurrence may be missing for certain species, resulting in many species having high residual spatial autocorrelation. We extend the previous multi-species occupancy model to incorporate a distinct spatial Gaussian Process (GP) for each species that accounts for unexplained spatial variation in each individual species' occurrence across a spatial region. Occurrence probability for species $i$ at site $j$ with spatial coordinates $\bm{s}_j$, $\psi_{i}(\bm{s}_j)$, now takes the form
\begin{equation} \text{logit}(\psi_{i}(\bm{s}_j)) = \bm{x}^{\top}(\bm{s}_j)\bm{\beta}_i + \omega_{i}(\bm{s}_j), \end{equation}
where the species-specific regression coefficients $\bm{\beta}_i$ follow the community-level distribution described previously for nonspatial multi-species occupancy models, and $\omega_{i}(\bm{s}_j)$ is a realization from a zero-mean spatial GP, i.e.,
\begin{equation} \bm{\omega}_{i}(\bm{s}) \sim \text{Normal}(\bm{0}, \bm{\Sigma}_i(\bm{s}, \bm{s}', \bm{\theta}_i)). \end{equation}
We define $\bm{\Sigma}_i(\bm{s}, \bm{s}', \bm{\theta}_i)$ as a $J \times J$ covariance matrix that is a function of the distances between any pair of site coordinates $\bm{s}$ and $\bm{s}'$ and a set of parameters ($\bm{\theta}_i$) that govern the spatial process. The vector $\bm{\theta}_i$ is equal to $\bm{\theta}_i = \{\sigma^2_i, \phi_i, \nu_i\}$, where $\sigma^2_i$ is a spatial variance parameter, $\phi_i$ is a spatial decay parameter, and $\nu_i$ is a spatial smoothness parameter for species $i$; as before, $\nu_i$ is only specified when using a Matern correlation function.
The detection portion of the multi-species spatial occupancy model remains unchanged from the non-spatial multi-species occupancy model. We fit the model again using Pólya-Gamma data augmentation to enable an efficient Gibbs sampler (see MCMC sampler vignette for details). Similar to our discussion on the single-species spatial occupancy model, we also allow for specification of the spatial process using an NNGP instead of a full GP. This leads to even larger computational gains over the full GP than in the case of a single-species model, given that a separate covariance matrix is specified for each species in the model. See @datta2016hierarchical, @finley2020spnngp, and the MCMC sampler vignette for additional details on NNGPs and their implementation in multi-species spatial occupancy models.
spMsPGOcc()
The function spMsPGOcc() fits spatially explicit multi-species occupancy models. Similar to single-species models using spPGOcc(), models can be fit using either a full Gaussian Process (GP) or a Nearest Neighbor Gaussian Process (NNGP). spMsPGOcc() fits a separate spatial process for each species. The syntax for spMsPGOcc() is analogous to the syntax for single-species spatially-explicit models using spPGOcc().
spMsPGOcc(occ.formula, det.formula, data, inits, n.batch, batch.length,
          accept.rate = 0.43, priors, cov.model = "exponential", tuning,
          n.omp.threads = 1, verbose = TRUE, NNGP = TRUE, n.neighbors = 15,
          search.type = "cb", n.report = 100,
          n.burn = round(.10 * n.batch * batch.length), n.thin = 1,
          n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed,
          k.fold.only = FALSE, ...)
We will again showcase the model using the HBEF foliage-gleaning bird data set, with the same predictors in our occurrence and detection models.
occ.ms.sp.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.ms.sp.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
Our initial values in the inits argument look analogous to what we specified for the nonspatial multi-species occupancy model using msPGOcc(), but we now also include initial values for the parameters controlling the spatial processes: sigma.sq is the species-specific spatial variance parameter, phi is the species-specific spatial decay parameter, and w is the latent spatial process for each species at each site. We will use an exponential covariance model; note that when using a Matern covariance model we must also specify initial values for nu, the species-specific spatial smoothness parameter. All species-specific spatial parameters are independent of each other. We currently do not leverage any correlation between the spatial processes of different species, although this is something we plan to incorporate in future spOccupancy development. Initial values for phi, sigma.sq, and nu (if applicable) are specified as vectors with $N$ elements (the number of species being modeled) or as a single value that is used for all species, while initial values for the latent spatial processes are specified as a matrix with $N$ rows (i.e., species) and $J$ columns (i.e., sites). Here we set the initial value of the spatial variance equal to 2 for all species and set the initial values of the spatial decay parameter to yield an effective range equal to the average distance between sites across HBEF.
# Number of species
N <- dim(hbef2015$y)[1]
# Distances between sites
dist.hbef <- dist(hbef2015$coords)
# Exponential covariance model
cov.model <- "exponential"
ms.inits <- list(alpha.comm = 0, beta.comm = 0, beta = 0, alpha = 0,
                 tau.sq.beta = 1, tau.sq.alpha = 1,
                 z = apply(hbef2015$y, c(1, 2), max, na.rm = TRUE),
                 sigma.sq = 2, phi = 3 / mean(dist.hbef),
                 w = matrix(0, N, dim(hbef2015$y)[2]))
We next specify the priors in the priors argument. The priors are the same as those we specified for the non-spatial multi-species model, with the addition of priors for the parameters controlling the species-specific spatial processes. We assume independent priors for all spatial parameters across the different species. Priors (and their default values in spOccupancy) for the spatial parameters are exactly analogous to those we saw for the single-species spatially-explicit occupancy model. For each species, we assign an inverse-Gamma prior for the spatial variance parameter sigma.sq (tag is sigma.sq.ig). As of v0.3.2, we can also specify a uniform prior for the spatial variance parameter with the tag sigma.sq.unif (see the relevant discussion in the single-species spatial occupancy models section). We assign uniform priors for the spatial decay parameter phi and the smoothness parameter nu (if cov.model = 'matern'), with the associated tags phi.unif and nu.unif. All priors are specified as lists with two elements. For the inverse-Gamma prior, the first element is a length $N$ vector of shape parameters for each species, and the second element is a length $N$ vector of scale parameters for each species. As we have seen many times before, if the same prior is used for all species, both elements can be specified as single values. For the uniform priors, the first element is a length $N$ vector of lower bounds for each species, and the second element is a length $N$ vector of upper bounds; again, if the same prior is used for all species, both bounds can be specified as single values. For the inverse-Gamma prior on the spatial variances, here we set both the shape and scale parameters to 2. For a more formal analysis, we may want to do some exploratory data analysis to obtain a better guess of the spatial variance for each species, and then replace the scale parameter with this estimated guess. For the spatial decay parameter, we determine the bounds of the uniform distribution by computing the smallest and largest distances between sites. We then set the lower bound of the uniform to 3/max and the upper bound to 3/min, where min and max correspond to these predetermined distances between sites. Again, these are the default hyperparameter values that spOccupancy assigns for the prior distribution of phi.
# Minimum value is 0, so need to grab second element.
min.dist <- sort(unique(dist.hbef))[2]
max.dist <- max(dist.hbef)
ms.priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
                  alpha.comm.normal = list(mean = 0, var = 2.72),
                  tau.sq.beta.ig = list(a = 0.1, b = 0.1),
                  tau.sq.alpha.ig = list(a = 0.1, b = 0.1),
                  sigma.sq.ig = list(a = 2, b = 2),
                  phi.unif = list(a = 3 / max.dist, b = 3 / min.dist))
We next set the parameters controlling the adaptive MCMC algorithm (see the spPGOcc() section for details). Notice our specification of the initial tuning values is exactly the same as for spPGOcc(). We assume the same initial tuning value for all species; however, the adaptive algorithm allows for species-specific tuning parameters, so these will be adjusted within the algorithm as needed (and reported to the R console if verbose = TRUE).
batch.length <- 25
n.batch <- 400
n.burn <- 2000
n.thin <- 20
n.chains <- 3
ms.tuning <- list(phi = 0.5)
n.omp.threads <- 1
# Values for reporting
verbose <- TRUE
n.report <- 100
Spatially explicit multi-species occupancy models are currently the most computationally intensive models fit by spOccupancy. Even for modest sized data sets, we encourage the use of NNGPs instead of full GPs when fitting spatially-explicit models to ease the computational burden. We fit the model with an NNGP below using 5 neighbors and summarize it using the summary() function, where we specify that we want to summarize both species- and community-level parameters.
# Approx. run time: 10 min
out.sp.ms <- spMsPGOcc(occ.formula = occ.ms.sp.formula,
                       det.formula = det.ms.sp.formula,
                       data = hbef2015, inits = ms.inits,
                       n.batch = n.batch, batch.length = batch.length,
                       accept.rate = 0.43, priors = ms.priors,
                       cov.model = cov.model, tuning = ms.tuning,
                       n.omp.threads = n.omp.threads, verbose = TRUE,
                       NNGP = TRUE, n.neighbors = 5, n.report = n.report,
                       n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)
summary(out.sp.ms, level = 'both')
The resulting object out.sp.ms is a list of class spMsPGOcc consisting primarily of posterior samples of all community- and species-level parameters, as well as some additional objects that are used for summaries, prediction, and model fit evaluation. The Rhat values indicate convergence of most community- and species-level parameters, but some parameters for rare species have not yet converged, and the ESS of many species-level parameters is quite small. This is a clear reminder that spatially-explicit models often mix slowly, and so you will often need to run these models for a large number of samples in order to achieve convergence and adequate ESS.
We perform posterior predictive checks to assess Goodness of Fit using ppcOcc() just as we have previously seen.
# Takes a few seconds to run.
ppc.sp.ms.out <- ppcOcc(out.sp.ms, 'freeman-tukey', group = 2)
summary(ppc.sp.ms.out, level = 'both')
Below we compute the WAIC using waicOcc() and compare it to the WAIC for the non-spatial multi-species occupancy model.
waicOcc(out.sp.ms)
waicOcc(out.ms)
As always, remember there is MC error in these numbers, and so the values you receive will be slightly different. The WAIC for the spatial model is considerably smaller than that for the nonspatial model, indicating the species-specific spatial processes improve prediction across the entire community. However, in a complete analysis we should ensure the models fully converge (and we have adequate ESS) before performing any model selection or comparison.
We can also compare the WAIC values individually for each species using the by.sp argument.
waicOcc(out.sp.ms, by.sp = TRUE)
waicOcc(out.ms, by.sp = TRUE)
k-fold cross-validation proceeds using the k.fold argument as we have seen previously, returning a separate scoring rule (deviance) for each species.
Prediction with spMsPGOcc objects again uses the predict() function, given a set of covariates and the spatial coordinates of the prediction locations. Results are very similar to the nonspatial multi-species model, so we do not execute the following code; prediction in spatial models takes some time, and the run time here is multiplied by the number of species.
elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) / sd(ovenHBEF$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
out.sp.ms.pred <- predict(out.sp.ms, X.0, coords.0)
Data integration is a model-based approach that leverages multiple data sources to provide inference and prediction on some latent process of interest [@miller2019recent]. Data integration is particularly relevant in ecology as many data sources are often collected to better understand a single ecological phenomenon, with each data source having advantages and disadvantages, say in terms of sample size, spatial extent, or data reliability. Often, multiple detection-nondetection data sources are available to study the occurrence and distribution of some species of interest. For example, both point count surveys performed by human observers and autonomous recording units could be used to monitor a bird species [@doser2021integrating]. Different types of data have different sources of observation error, which we should explicitly incorporate into a model to avoid attributing any variation in detection probability to the true ecological process. Here we describe single-species integrated occupancy models, which combine multiple sources of detection-nondetection data in a single hierarchical modeling framework. These data sources may or may not be replicated, but we note that the combination of two or more sources always results in a replicated data set.
The integrated occupancy model has an identical process model to that of the single-species occupancy model, but has a distinct detection model for each data source, each of which is conditional on the same, shared ecological process (i.e., species occurrence).
Let $z_j$ be the presence or absence of a species at site $j$, with $j = 1, \dots, J$. We assume this latent occurrence variable arises from a Bernoulli process following
\begin{equation} \begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \\ &\text{logit}(\psi_j) = \bm{x}^{\top}_j\bm{\beta}, \end{split} \end{equation}
where $\psi_j$ is the probability of occurrence at site $j$, which is a function of site-specific covariates $\bm{X}$ and a vector of regression coefficients ($\bm{\beta}$).
We do not directly observe $z_j$, but rather we observe an imperfect representation of the latent occurrence process. In integrated models, we have $r = 1, \dots, R$ distinct sources of data that are all imperfect representations of a single, shared occurrence process. Let $y_{r, a, k}$ be the observed detection (1) or nondetection (0) of a species of interest in data set $r$ at site $a$ during replicate $k$. Because different data sources have different variables influencing the observation process, we envision a separate detection model for each data source that is defined conditional on a single, shared ecological process described above. We envision the detection-nondetection data from source $r$ as arising from a Bernoulli process conditional on the true latent occurrence process:
\begin{equation} \begin{split} &y_{r, a, k} \sim \text{Bernoulli}(p_{r, a, k}z_{j[a]}), \\ &\text{logit}(p_{r, a, k}) = \bm{v}^{\top}_{r, a, k}\bm{\alpha}_r, \end{split} \end{equation}
where $p_{r, a, k}$ is the probability of detecting a species at site $a$ during replicate $k$ (given it is present at site $a$) for data source $r$, which is a function of site, replicate, and data source specific covariates $\bm{V}_r$ and a vector of regression coefficients specific to each data source ($\bm{\alpha}_r$). Note that $z_{j[a]}$ is the true occurrence status at site $j$ corresponding to the $a$th data source site in the given data set $r$. Each data source may be available at all $J$ sites in the region of interest or at a subset of the $J$ sites. Additionally, data sources can overlap in the sites they sample, or they can be obtained at distinct sites within all $J$ sites of interest in the overall region.
We assume multivariate normal priors for the occurrence ($\bm{\beta}$) and data-source-specific detection ($\bm{\alpha}_r$) regression coefficients to complete the Bayesian specification of the integrated occupancy model. Pólya-Gamma data augmentation is implemented in an analogous manner to the previous models to yield an efficient implementation of integrated occupancy models.
To illustrate an integrated occupancy model, we will use two data sets that come from the White Mountain National Forest (WMNF) in New Hampshire and Maine, USA. Our goal is to model the occurrence of OVEN in the WMNF in 2015. Our first data source is the HBEF data set we have used to illustrate all single data source models. Our second data source comes from the National Ecological Observatory Network (NEON) at Bartlett Experimental Forest [@barnett2019terrestrial; @neonData]. Bartlett Forest and HBEF both lie within the larger WMNF. Suppose we are interested in OVEN occurrence across the entire WMNF. By leveraging both data sources in a single integrated model, we expand the range of covariates across which we can make reliable predictions, and may obtain results that are more representative of the entire region of interest rather than of a single data source location [@doser2022integrated]. In this particular case, there is no overlap between the two data sources (i.e., Bartlett Forest and HBEF do not overlap spatially). However, the integrated occupancy models fit by spOccupancy can integrate data sources with no overlap, partial overlap, or complete overlap.
The NEON data are provided along with spOccupancy in the neon2015 list. We load the NEON data along with the HBEF data below.
data(hbef2015)
data(neon2015)
str(neon2015) # Look at what's in the new data set.
Details on the NEON data set are provided in the package documentation as well as in @doser2022integrated. The NEON data were collected at 80 point count sites in Bartlett Forest using a removal protocol with three time periods, resulting in replicated detection-nondetection data that can be used in an occupancy modeling framework (after reducing the counts to binary detection indicators). The neon2015 list, like the hbef2015 object, contains the detection-nondetection data for 12 foliage-gleaning bird species (y), occurrence covariates stored in occ.covs, detection covariates stored in det.covs, and the coordinates of the 80 point count locations stored in coords. Below we subset the detection-nondetection data in both data sources to solely work with OVEN.
sp.names <- dimnames(hbef2015$y)[[1]]
ovenHBEF <- hbef2015
ovenHBEF$y <- ovenHBEF$y[sp.names == "OVEN", , ]
ovenNEON <- neon2015
ovenNEON$y <- ovenNEON$y[sp.names == "OVEN", , ]
table(ovenHBEF$y)
table(ovenNEON$y)
Thus, the ovenbird is observed in a little over half of the possible site/replicate combinations in both of the data sources.
intPGOcc()
The function intPGOcc() fits single-species integrated occupancy models in spOccupancy. The syntax is very similar to that of single data source models, and specifically takes the following form:
intPGOcc(occ.formula, det.formula, data, inits, n.samples, priors,
         n.omp.threads = 1, verbose = TRUE, n.report = 1000,
         n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
         k.fold, k.fold.threads = 1, k.fold.seed, k.fold.data,
         k.fold.only = FALSE, ...)
The data argument contains the list of data elements necessary for fitting an integrated occupancy model. For nonspatial integrated occupancy models, data should be a list comprised of the following objects: y (a list of detection-nondetection data matrices, one per data source), occ.covs (a data frame or matrix of covariates for the occurrence model), det.covs (a list of lists, where each element corresponds to the detection covariates for the given data source), and sites (a list where each element consists of the site indices for the given data source).
The ovenHBEF and ovenNEON lists are currently formatted for use in single data source models, so we need to combine these data sources together. Perhaps the trickiest part of data integration is ensuring that each point count location in each data source lines up with the correct geographical location where you want to determine the true presence/absence of the species of interest. In spOccupancy, most of this bookkeeping is done under the hood, but we will need to combine the two data sources into a single list in which we are consistent about how the data sources are sorted. To accomplish this, we recommend first creating the occurrence covariates matrix for all data sources. Because our two data sources do not overlap spatially, this is relatively simple here, as we can just use rbind().
occ.covs.int <- rbind(ovenHBEF$occ.covs, ovenNEON$occ.covs)
str(occ.covs.int)
Notice the order in which we placed these covariates: all covariate values for HBEF come first, followed by all covariates for NEON. We need to use this identical ordering for all objects in the data list. Next, we create the site indices stored in sites. sites should be a list with two elements (one for each data source), where each element consists of a vector that indicates the rows in occ.covs that correspond with the rows of the detection-nondetection data for that data source. When the data sources stem from distinct points (as in our current case), this is again relatively straightforward, as the indices simply correspond to how we ordered the points in occ.covs.
sites.int <- list(hbef = 1:nrow(ovenHBEF$occ.covs),
                  neon = 1:nrow(ovenNEON$occ.covs) + nrow(ovenHBEF$occ.covs))
str(sites.int)
Next we create the combined detection-nondetection data y. For integrated models in spOccupancy, y is a list of matrices, with each element containing the detection-nondetection matrix for the specific data source. Again, we must ensure that we place the data sources in the correct order.
y.int <- list(hbef = ovenHBEF$y, neon = ovenNEON$y)
str(y.int)
Lastly, we create the detection covariates det.covs. This should be a list of the detection covariates from each individual data source. Because individual data source detection covariates are themselves stored as lists for single data source models in spOccupancy, det.covs is now a list of lists for integrated occupancy models.
det.covs.int <- list(hbef = ovenHBEF$det.covs, neon = ovenNEON$det.covs)
Finally, we package everything together into a single list, which we call data.int.
data.int <- list(y = y.int, occ.covs = occ.covs.int,
                 det.covs = det.covs.int, sites = sites.int)
str(data.int)
We specify the occurrence and detection model formulas using the occ.formula and det.formula arguments. The occ.formula remains unchanged from previous models, and we will specify occurrence of the ovenbird as a function of linear and quadratic terms for elevation.
occ.formula.int <- ~ scale(Elevation) + I(scale(Elevation)^2)
For the detection models, we need to specify a different detection model for each data source. We do this by supplying a list to the det.formula argument, where each element of the list is the model formula for the given data source. Here we specify the detection model for HBEF as a function of linear and quadratic terms for day of survey as well as a linear term for time of survey. In this case, we include the same covariates for the NEON model (though we note that different coefficients are estimated for the two data sources). However, there is no requirement for the data sources to be a function of the same covariates.
det.formula.int <- list(hbef = ~ scale(day) + scale(tod) + I(scale(day)^2),
                        neon = ~ scale(day) + scale(tod) + I(scale(day)^2))
Next we specify the initial values. Initial values are specified in a list with the following tags: z (latent occurrence values), alpha (detection regression coefficients), and beta (occurrence regression coefficients). This aligns with fitting single-species occupancy models using PGOcc(). However, since we now have multiple detection models with different coefficients for each data source, initial values for alpha are passed to intPGOcc() as a list, with each element corresponding to the initial detection parameter values for a given data source. These are specified either as a vector with a value for each parameter or as a single value for all parameters.
# Total number of sites
J <- nrow(data.int$occ.covs)
inits.list <- list(alpha = list(0, 0), beta = 0, z = rep(1, J))
We next specify the priors for all parameters in the integrated occupancy model in a list that is passed to the priors argument. We specify normal priors for both the occurrence and detection regression coefficients, using the tags beta.normal and alpha.normal, respectively.
prior.list <- list(beta.normal = list(mean = 0, var = 2.72),
                   alpha.normal = list(mean = list(0, 0),
                                       var = list(2.72, 2.72)))
Priors for the occurrence regression coefficients are specified as we have seen in previous models. Because we have multiple detection-nondetection data sets, each with distinct detection parameters, we specify the hypermeans and hypervariances in individual lists, where each element of the list corresponds to a specific data source. Again, the ordering of the data sources in these lists must align with the order in which the data sources are saved in the detection-nondetection data supplied to the data argument.
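For instance (hypothetical values), if we wanted a tighter prior on the NEON detection coefficients than on the HBEF ones, the list-of-lists structure would look like this sketch:

# Data-source-specific priors (hypothetical values, not used below); the first
# list element corresponds to HBEF and the second to NEON, matching data.int.
prior.list.alt <- list(beta.normal = list(mean = 0, var = 2.72),
                       alpha.normal = list(mean = list(0, 0),
                                           var = list(2.72, 1)))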
Finally, we specify the number of samples, burn-in, and thinning rate using the same approach we have used for previous models.
n.samples <- 8000
n.burn <- 3000
n.thin <- 25
We can now run the integrated occupancy model. Below we set the number of threads used to 1 and print out sampler progress after every 2000th iteration.
# Approx. run time: < 15 sec
out.int <- intPGOcc(occ.formula = occ.formula.int,
                    det.formula = det.formula.int,
                    data = data.int, inits = inits.list,
                    n.samples = n.samples, priors = prior.list,
                    n.omp.threads = 1, verbose = TRUE, n.report = 2000,
                    n.burn = n.burn, n.thin = n.thin, n.chains = 3)
We again make a call to the summary function for a concise description of the model results.
summary(out.int)
The summary function for integrated models returns the detection parameters separately for each data source. The Rhat and ESS values clearly indicate adequate convergence and mixing of the MCMC chains. Looking at the occurrence parameters, we see fairly similar estimates to those from the single data source model using the HBEF data only. This is only half a surprise given the unequal sizes of the two data sets, as the HBEF data can be expected to dominate estimation of the occupancy parameters. An alternative explanation for the similarity, however, is that the occurrence process is indeed very similar in the two areas within the greater WMNF region.
We perform posterior predictive checks using ppcOcc() as before. GoF assessment for integrated models is an active area of research, and no consensus has yet emerged on the best approach. In spOccupancy, we compute posterior predictive checks separately for each data set in the integrated model.
ppc.int.out <- ppcOcc(out.int, 'freeman-tukey', group = 2)
summary(ppc.int.out)
The low Bayesian p-value for the NEON data suggests a potential lack of fit, which we would explore further in a complete analysis.
We use waicOcc() to compute the WAIC for integrated occupancy models. Similar to the posterior predictive check, individual WAIC values are reported for each data set. These can be summed across all data sources for an overall WAIC value if desired.
waicOcc(out.int)
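If a single overall value is desired, we can sum across the data sources. Below is a sketch, assuming the returned object contains a WAIC element with one value per data source.

# Overall WAIC across both data sources
# (a sketch; assumes a WAIC element with one value per data source).
waic.int <- waicOcc(out.int)
sum(waic.int$WAIC)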
k-fold cross-validation is implemented using the k.fold argument as we have seen in previous spOccupancy model fitting functions. Cross-validation for models with multiple data sources is not as straightforward as for single data source models; we could envision splitting the data in multiple different ways to assess predictive performance, depending on the purpose of our comparison. In spOccupancy, we implement two approaches for cross-validation of integrated occupancy models. In the first approach, we hold out locations irrespective of which data source they came from, which results in a scoring rule (deviance) for each individual data source based on the hold-out sites where that data source is sampled.
This form of k-fold cross-validation is applicable for model selection between different integrated occupancy models. In other words, this approach can be used to compare models that integrate the same data sources but include different covariates in the occurrence and/or detection portion of the occupancy model. Using our example, we implement 4-fold cross-validation to compare the full integrated model with covariates to an intercept-only integrated occupancy model. We do this using the k.fold, k.fold.threads, and k.fold.seed arguments as with previous spOccupancy models. Below we use the default values for k.fold.threads and k.fold.seed.
out.int.k.fold <- intPGOcc(occ.formula = occ.formula.int,
                           det.formula = det.formula.int,
                           data = data.int, inits = inits.list,
                           n.samples = n.samples, priors = prior.list,
                           n.omp.threads = 1, verbose = FALSE, n.report = 2000,
                           n.burn = n.burn, n.thin = n.thin, n.chains = 1,
                           k.fold = 4)
out.int.k.fold.small <- intPGOcc(occ.formula = ~ 1,
                                 det.formula = list(hbef = ~ 1, neon = ~ 1),
                                 data = data.int, inits = inits.list,
                                 n.samples = n.samples, priors = prior.list,
                                 n.omp.threads = 1, verbose = FALSE,
                                 n.report = 2000, n.burn = n.burn,
                                 n.thin = n.thin, n.chains = 1, k.fold = 4)
# Summarize the CV results
out.int.k.fold$k.fold.deviance
out.int.k.fold.small$k.fold.deviance
We see the deviance for the full model is lower for both data sources compared to the intercept only model.
Alternatively, we may not wish to compare different integrated occupancy models, but rather to assess whether data integration is necessary at all compared to using a single data source occupancy model. To accomplish this, we can perform cross-validation with the integrated occupancy model using only a single data source as the hold-out set, and then compare the deviance scoring rule to that obtained from cross-validation with a single data source occupancy model. We accomplish this using the argument k.fold.data. If k.fold.data is specified as an integer between 1 and the number of data sources integrated (in this case 2), only the data source corresponding to that integer will be used in the hold-out and k-fold cross-validation process. If k.fold.data is not specified, k-fold cross-validation holds out data irrespective of the data source at a given location. Here, we set k.fold.data = 1 and compare the cross-validation results of the integrated model to those of the occupancy model using only the HBEF data that we fit with PGOcc() (which is stored in the out.k.fold object).
out.int.k.fold.hbef <- intPGOcc(occ.formula = occ.formula.int,
                                det.formula = det.formula.int,
                                data = data.int, inits = inits.list,
                                n.samples = n.samples, priors = prior.list,
                                n.omp.threads = 1, verbose = TRUE,
                                n.report = 2000, n.burn = n.burn,
                                n.thin = n.thin, k.fold = 4, k.fold.data = 1)
# Look at CV results again
# Single data source model
out.k.fold$k.fold.deviance
# Integrated model
out.int.k.fold.hbef$k.fold.deviance
Here we see that integration of the two data sources does not improve predictive performance at HBEF alone; we should also run the same check with the NEON data (see the sketch below). We close this section by emphasizing that there are potentially numerous benefits to data integration beyond predictive performance that must be carefully considered when determining whether data integration is warranted. See @simmonds2020more, the discussion in @doser2022integrated, and Chapter 10 in AHM2 for more on this topic.
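As noted above, the analogous check for the NEON data (not run) simply sets k.fold.data = 2, since NEON is the second data source in data.int.

# Not run: hold out only the NEON data (data source 2) during cross-validation.
out.int.k.fold.neon <- intPGOcc(occ.formula = occ.formula.int,
                                det.formula = det.formula.int,
                                data = data.int, inits = inits.list,
                                n.samples = n.samples, priors = prior.list,
                                n.omp.threads = 1, verbose = FALSE,
                                n.burn = n.burn, n.thin = n.thin,
                                k.fold = 4, k.fold.data = 2)
out.int.k.fold.neon$k.fold.deviance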
Prediction for integrated occupancy models proceeds exactly as before using predict(). Here we predict occurrence across HBEF for comparison with the single data source models (remember to load the stars and ggplot2 packages).
# Make sure to standardize using mean and sd from fitted model
elev.pred <- (hbefElev$val - mean(data.int$occ.covs[, 1])) / sd(data.int$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
out.int.pred <- predict(out.int, X.0)
# Producing an SDM for HBEF alone (posterior mean)
plot.dat <- data.frame(x = hbefElev$Easting,
                       y = hbefElev$Northing,
                       mean.psi = apply(out.int.pred$psi.0.samples, 2, mean),
                       sd.psi = apply(out.int.pred$psi.0.samples, 2, sd))
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = 'Mean OVEN occurrence probability using intPGOcc') +
  theme_bw()
ggplot() +
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '',
       title = 'SD OVEN occurrence probability using intPGOcc') +
  theme_bw()
Single-species spatial integrated occupancy models are identical to integrated occupancy models except the ecological process model now incorporates a spatially-structured random effect following the discussion with single-species spatial occupancy models. All details for the single-species integrated spatial occupancy model have already been presented in previous model descriptions.
spIntPGOcc()
The function spIntPGOcc() fits single-species spatial integrated occupancy models in spOccupancy. The syntax is very similar to that of the single data source models and specifically takes the following form:
spIntPGOcc(occ.formula, det.formula, data, inits, n.batch, batch.length,
           accept.rate = 0.43, priors, cov.model = "exponential", tuning,
           n.omp.threads = 1, verbose = TRUE, NNGP = TRUE, n.neighbors = 15,
           search.type = 'cb', n.report = 100,
           n.burn = round(.10 * n.batch * batch.length), n.thin = 1,
           n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed,
           k.fold.data, k.fold.only = FALSE, ...)
The occ.formula, det.formula, and data arguments are analogous to what we saw with the nonspatial integrated occupancy model. However, as for all spatial models in spOccupancy, the data list must also contain the spatial coordinates in the coords tag, which we add below.
data.int$coords <- rbind(hbef2015$coords, neon2015$coords)
occ.formula.int <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.formula.int <- list(hbef = ~ scale(day) + scale(tod) + I(scale(day)^2),
                        neon = ~ scale(day) + scale(tod) + I(scale(day)^2))
Initial values specified in inits and priors specified in priors take the same form as for intPGOcc(), with additional values for the spatial parameters. Analogous to all other spatial models in spOccupancy, the spatial variance parameter takes an inverse-Gamma prior, and the spatial range parameter (as well as the spatial smoothness parameter if cov.model = 'matern') takes a uniform prior (see the discussion in the single-species spatial occupancy models section on how to specify a uniform prior for the spatial variance parameter instead of an inverse-Gamma).
dist.int <- dist(data.int$coords)
min.dist <- min(dist.int)
max.dist <- max(dist.int)
J <- nrow(data.int$occ.covs)
# Exponential covariance model
cov.model <- "exponential"
inits.list <- list(alpha = list(0, 0), beta = 0, z = rep(1, J),
                   sigma.sq = 2, phi = 3 / mean(dist.int),
                   w = rep(0, J))
prior.list <- list(beta.normal = list(mean = 0, var = 2.72),
                   alpha.normal = list(mean = list(0, 0),
                                       var = list(2.72, 2.72)),
                   sigma.sq.ig = c(2, 1),
                   phi.unif = c(3 / max.dist, 3 / min.dist))
Finally, we specify the remaining parameters regarding the NNGP specification, the tuning parameters, and the length of the MCMC sampler. We are then all set to fit the model. Remember that spatially-explicit models in spOccupancy are implemented using an efficient adaptive MCMC sampler that requires us to specify the number of MCMC batches (n.batch) and the length of each MCMC batch (batch.length), which together determine the total number of MCMC samples (i.e., n.samples = n.batch * batch.length). We run the model and summarize the results with summary().
batch.length <- 25
n.batch <- 800
n.burn <- 10000
n.thin <- 40
tuning <- list(phi = .2)
# Approx. run time: 2 min
out.sp.int <- spIntPGOcc(occ.formula = occ.formula.int,
                         det.formula = det.formula.int,
                         data = data.int, inits = inits.list,
                         priors = prior.list, tuning = tuning,
                         cov.model = cov.model, NNGP = TRUE, n.neighbors = 5,
                         n.batch = n.batch, n.burn = n.burn, n.chains = 3,
                         batch.length = batch.length, n.report = 200)
summary(out.sp.int)
We see the occurrence and detection parameters have converged. We should run the chains for a bit longer to ensure convergence of the spatial covariance parameters, but for this example we will continue on.
Below we perform a posterior predictive check for each of the data sets included in the occupancy model using ppcOcc().
ppc.sp.int.out <- ppcOcc(out.sp.int, 'freeman-tukey', group = 2)
summary(ppc.sp.int.out)
According to the Bayesian p-values, there is no lack of fit for either the HBEF or NEON data.
We can perform model selection using WAIC with the waicOcc() function as we have seen previously. Here we compare the WAIC from the spatial integrated model to that of the non-spatial integrated model.
waicOcc(out.int)
waicOcc(out.sp.int)
We see the spatial model performs better for both the HBEF data and the NEON data.
Two forms of k-fold cross-validation are implemented for spIntPGOcc(), analogous to those discussed for non-spatial integrated occupancy models. We first use cross-validation to compare the predictive performance of the spatial integrated model to the nonspatial integrated model across all sites.
out.sp.int.k.fold <- spIntPGOcc(occ.formula = occ.formula.int,
                                det.formula = det.formula.int,
                                data = data.int, inits = inits.list,
                                priors = prior.list, tuning = tuning,
                                cov.model = cov.model, NNGP = TRUE,
                                n.neighbors = 5, n.batch = n.batch,
                                n.burn = n.burn, batch.length = batch.length,
                                verbose = FALSE, k.fold = 4)
# Non-spatial model
out.int.k.fold$k.fold.deviance
# Spatial model
out.sp.int.k.fold$k.fold.deviance
Further, we perform cross-validation using only the HBEF data as the hold-out data source, to compare with single data source models and assess the benefit of integration.
out.sp.int.k.fold.hbef <- spIntPGOcc(occ.formula = occ.formula.int,
                                     det.formula = det.formula.int,
                                     data = data.int, inits = inits.list,
                                     priors = prior.list, tuning = tuning,
                                     cov.model = cov.model, NNGP = TRUE,
                                     n.neighbors = 5, n.batch = n.batch,
                                     n.burn = n.burn,
                                     batch.length = batch.length,
                                     verbose = FALSE, k.fold = 4,
                                     k.fold.data = 1)
out.sp.int.k.fold.hbef$k.fold.deviance
# Non-spatial single data source model
out.k.fold$k.fold.deviance
Prediction for spatial integrated occupancy models is exactly analogous to our approach for nonspatial integrated occupancy models. The only difference is that we must now also provide the coordinates of the nonsampled locations. The following code is not run, as it produces results similar to the nonspatial integrated occupancy model.
# Make sure to standardize using mean and sd from fitted model
elev.pred <- (hbefElev$val - mean(data.int$occ.covs[, 1])) / sd(data.int$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
coords.0 <- as.matrix(hbefElev[, c(2, 3)])
out.sp.int.pred <- predict(out.sp.int, X.0, coords.0)