In jsocolar/flocker: Flexible Occupancy Estimation with Stan

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Multiseason occupancy models, also called dynamic occupancy models, are models that assume closure over units that are linked by explicit colonization and extinction processes.^[As we will see, this is true even of multiseason models with autologistic occupancy dynamics] That is, sites are surveyed over a series of more than one timestep,^[It's fine if some sites are surveyed for just one timestep, but at least some sites are surveyed for multiple timesteps.] closure is assumed to hold within (but not across) timesteps, and the latent occupancy state is linked across timesteps within a site via models of local colonization and extinction processes. Following the terminology introduced here, we will refer to the units over which closure is assumed as closure-units or just plain units. Groups of units that are linked by colonization-extinction dynamics (i.e. units representing different timesteps within a site) are series.

Multiseason models as HMMs

In flocker, we implement these models using a hidden markov model (HMM) approach to dealing with the unobserved occupancy state in each separate series. Calling our approach a HMM doesn't fundamentally change the model; it serves mainly to highlight the connection between multiseason occupancy models and well-developed techniques for efficiently computing the likelihood for a series.^[For the curious, we use the forward algorithm to compute the likelihood, and the forward-backward algorithm to compute unit-wise posterior distributions for the latent occupancy state, and the forward-filtering-backward-sampling algorithm to draw valid posterior samples for the sequence of latent states] The HMM conceptualizes the latent occupancy state as a Markov-like process (the process need not be strictly Markovian, because the transition probabilities can vary by timestep), where the transition probabilities between the two possible states (occupied or unoccupied) are a colonization probability $\mathcal{C}$, an extinction probability $\mathcal{E}$, and their complements $1 - \mathcal{C}$ and $1 - \mathcal{E}$. The complement of the extinction probability turns out to be particularly important; we call it the persistence probability $\mathcal{P} = 1 - \mathcal{E}$.

To specify the HMM, we need a model for the $\mathcal{C}$, a model for $\mathcal{E}$, and a model for the initial occupancy probability $\mathcal{O}$ in the first timestep. We also need a model for the probability of observing the data in any particular timestep conditional on the true occupancy state in that timestep (in the HMM literature, this probability is referred to as an emission probability).

Emission probabilities

Emission probabilities are just fancy HMM terminology for the most familiar part of multiseason occupancy model. They are the probability of observing the data within a unit given the latent occupancy state, and they depend only on the detection sub-model, which is no different from a single-season detection sub-model. In flocker, we assume that the logit-scale detection probabilities vary according to a suite of covariates that can vary across all levels of organization, including down to the visit level. This is exactly the same way that flocker treats the detection sub-model in a single-season model.

Colonization and extinction

In "colonization-extinction" models, logit colonization probabilities $\mathcal{C}$ and logit extinction probabilities $\mathcal{E}$ are both assumed to vary according to separate predictors based on covariates that may vary across timesteps within series but never across visits within closure-units.

In "autologistic" models, the occupancy probability in timestep $t + 1$ is assumed to depend on a suite of site characteristics plus a term that depends on the latent true occupancy state at time $t$. Some treatments write the autologistic model in terms of an "occupancy probability" that depends on covariates plus the "autologistic term" that gets added on the logit scale when the site was previously occupied. However, it is easier to understand this model in terms of its colonization and extinction probabilities. In fact, the so-called "occupancy probability" mentioned above is actually a colonization probability, because it reflects the probability of occupancy conditional on non-occupancy during the previous timestep. The logit-scale sum of the colonization probability and the autologistic term gives the probability of persistence $\mathcal{P}$.

Thus, the autologistic model differs from the colonization-extinction model in that it assumes that colonization and extinction probabilities are related to one another via an offset that relates colonization to persistence. If the offset is modeled as constant, then we can think of this model as ascribing a "suitability" to each site that defines the position of a pair of probabilities $\mathcal{C}$ and $\mathcal{P}$ that differ by a constant on the logit scale. In practice, $\mathcal{P}$ is almost always larger than $\mathcal{C}$, and so the offset is positive.

In addition to the standard autologistic model that treats the offset as a constant, flocker also affords the flexibility to model the offset via its own covariate-based predictor, which reintroduces flexibility in the relationship between $\mathcal{P}$ and $\mathcal{C}$, but tends to retain the idea that these probabilities ought to be related to one another and ought to tend to move in tandem.

Initial occupancy

Most examples of multiseason occupancy models in the literature include an explicit model for the initial occupancy probability $\mathcal{O}$ via its own covariate-based predictor, which is analogous to the occupancy predictor in a single-season model. Thus, a colonization-extinction model would contain predictors for detection, colonization, extinction, and initial occupancy; an autologistic model would contain linear predictors for detection, colonization, the colonization-persistence offset (often modeled as a constant), and initial occupancy. flocker can fit either model.

Borrowing from the HMM literature, flocker also comes with optional functionality to handle $\mathcal{O}$ differently. Any pair of time-invariant colonization and extinction probabilities defines an equilibrium proportion of sites that are occupied. If colonization and extinction remain constant for a long time prior to the first timestep, the initial occupancy probability should simply be this equilibrium frequency, eliminating the need to parameterize and fit initial occupancy separately from colonization and extinction. flocker is equipped to fit models without any separate predictor for initial occupancy, and instead to begin with the equilibrium frequencies implied by the colonization and extinction probabilities modeled for the first timestep. Note that this approach is sometimes valid even if the colonization and extinction probabilities begin to vary during the timeseries being modeled, as long as they are invariant for a substantial period up to and including the first year modeled. As a concrete example, consider a study system that begins in an equilibrium condition, but in which some sites burn in a forest fire partway through the study. If the impact of the burn on colonization and extinction probabilities is modeled adequately, then the modeled probabilities prior to the burn will reflect the equilibrium colonization and extinction probabilities in the pre-burn state. It is for this reason that, when using equilibrium frequencies as initial occupancy probabilities, flocker always computes the equilibrium based on the probabilities modeled for the first timestep.

As an aside, we somewhat regularly encounter confusion that conflates initial occupancy probabilities $\mathcal{O}$ with colonization probabilities $\mathcal{C}$. This confusion apparently arises from the notion that the autologistic model consists of a single-season occupancy model plus an autologistic term, which leads people to incorrectly surmise that the linear predictor without the autologistic term gives an occupancy probability. It does not; it gives a colonization probability, which is typically much lower. Any modeling exercise that enforces equality between initial occupancy probabilities and subsequent colonization probabilities should be viewed skeptically.

Examples

Let's simulate some data that is valid for use with multiseason models. Here, we will simulate data for three seasons with one unit covariate and one event covariate. The data will be simulated under a colonization-extinction model with explicit inits, but we will be able to fit other models (autologistic, equilibrium inits) to the same data.^[note that simulate_flocker_data() can also simulate directly from these other model types.]

library(flocker)
multi_data <- simulate_flocker_data(
  n_season = 3,
  n_pt = 300,
  n_sp = 1,
  multiseason = "colex", 
  multi_init = "explicit",
  seed = 1
  )

fd <- make_flocker_data(
  multi_data$obs, 
  multi_data$unit_covs, 
  multi_data$event_covs, 
  type = "multi",
  quiet = TRUE
  )

Here's the colonization-extinction model with an explicit model for occupancy in the first timestep. Depending on hardware, fitting this model might take 1-5 minutes.

multi_colex <- flock(
  f_occ = ~ uc1,
  f_det = ~ uc1 + ec1,
  f_col = ~ uc1,
  f_ex = ~ uc1,
  flocker_data = fd,
  multiseason = "colex",
  multi_init = "explicit",
  backend = "cmdstanr",
  cores = 4
)

Here's the colonization-extinction model using equilibrium occupancy probabilities in the first timestep:

multi_colex_eq <- flock(
  f_det = ~ uc1 + ec1,
  f_col = ~ uc1,
  f_ex = ~ uc1,
  flocker_data = fd,
  multiseason = "colex",
  multi_init = "equilibrium",
  backend = "cmdstanr",
  cores = 4
)

Here's the autologistic model with explicit occupancy probabilities in the first timestep. To reflect the stereotypical autologistic model with a constant logit-scale offset separating colonization and persistence probabilities, we use the formula f_auto = ~ 1, but it is fine to relax this constraint and use, e.g. f_auto = ~ uc1.

multi_auto <- flock(
  f_occ = ~ uc1,
  f_det = ~ uc1 + ec1,
  f_col = ~ uc1,
  f_auto = ~ 1,
  flocker_data = fd,
  multiseason = "autologistic",
  multi_init = "explicit",
  backend = "cmdstanr",
  cores = 4
)

And finally the autologistic model with equilibrium occupancy probabilities in the first timestep:

multi_auto_eq <- flock(
  f_det = ~ uc1 + ec1,
  f_col = ~ uc1,
  f_auto = ~ 1,
  flocker_data = fd,
  multiseason = "autologistic",
  multi_init = "equilibrium",
  backend = "cmdstanr",
  cores = 4
)

We can reconstruct the occupancy probabilities at every unit using the get_Z function:

Z <- get_Z(multi_colex_eq)

We can generate posterior predictions:

pp <- predict_flocker(multi_colex_eq)

We can compare the fit of these various models using approximate leave-one-out cross-validation, where the holdouts consist of entire series (leave-one-series-out cross-validation).

loo_out <- loo_compare_flocker(
  list(multi_colex, multi_colex_eq, multi_auto, multi_auto_eq)
)
loo_out

See the main tutorial vignette for an example of the resulting output (this vignette does not actual run the models to save on computational resources).

{ width=30% style="border:none;" }