In mbjoseph/neuralecology: Fit Neural Hierarchical Models of Ecological Populations

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(fig.align='center')
if (!params$preprint) knitr::opts_chunk$set(fig.pos='H')
library(knitr)
library(tidyverse)
library(here)
here::here()

\renewcommand{\vec}[1]{\mathbf{#1}}

\begin{abstract} Neural networks are increasingly being used in science to infer hidden dynamics of natural systems from noisy observations, a task typically handled by hierarchical models in ecology. This paper describes a class of hierarchical models parameterized by neural networks: neural hierarchical models. The derivation of such models analogizes the relationship between regression and neural networks. A case study is developed for a neural dynamic occupancy model of North American bird populations, trained on millions of detection/non-detection time series for hundreds of species, providing insights into colonization and extinction at a continental scale. Flexible models are increasingly needed that scale to large data and represent ecological processes. Neural hierarchical models satisfy this need, providing a bridge between deep learning and ecological modeling that combines the function representation power of neural networks with the inferential capacity of hierarchical models. \end{abstract}

Introduction {-}

Deep neural networks have proved useful in myriad tasks due their ability to represent complex functions over structured domains [@lecun2015deep]. While ecologists are beginning to use such approaches, e.g., to identify plants and animals in images [@norouzzadeh2018automatically; @fricker2019convolutional], there has been relatively little integration of deep neural networks with ecological models.

Ecological processes are difficult to observe. Inference often proceeds by modeling the relationship between imperfect data and latent quantities or processes of interest with hierarchical models [@wikle2003hierarchical]. For example, occupancy models estimate the presence or absence of a species using imperfect detection data [@mackenzie2002estimating], and "dynamic" occupancy models estimate extinction and colonization dynamics [@mackenzie2003estimating]. Population growth provides another example, motivating hierarchical models that link noisy observations to mechanistic models [@de2002fitting]. In such models, it is often desirable to account for heterogeneity among sample units (e.g., differences among habitats and survey conditions), to better understand ecological dynamics.

Many hierarchical models in ecology account for heterogeneity among sample units using a linear combination of explanatory variables, despite there often being reasons to expect non-linearity [@lek1996application; @austin2002spatial; @oksanen2002continuum]. A variety of solutions exist to account for non-linearity. For example, Gaussian processes have been used for species distribution models [@latimer2009hierarchical; @golding2016fast], in animal movement models [@johnson2008general], and in point process models for distance sampling data [@johnson2010model; @yuan2017point]. Generalized additive models also have been used to account for spatial autocorrelation [@miller2013spatial; @webb2014location], nonlinear responses to habitat characteristics [@knapp2003developing; @bled2013dynamic], and differential catchability in capture-recapture studies [@zwane2004semiparametric].

Machine learning provides additional tools for approximating nonlinear functions in hierarchical models. For example, @hutchinson2011incorporating combined a site-occupancy model with an ensemble of decision trees to predict bird occurrence, combining a structured observation and process model from ecology with a flexible random forest model. Similar hybrid approaches could be developed for other classes of ecological models and/or machine learning methods. Neural networks seem particularly worthy of attention, given their success in other domains [@lecun2015deep].

Neural networks have been used in ecology, but to date have not been integrated into hierarchical models that explicitly distinguish between ecological processes and imperfect observations. For example, neural networks have modeled observed abundances of aquatic organisms, but without distinguishing observed from true abundance [@chon2001patterning; @jeong2001prediction; @jeong2008non; @malek2012applying]. Neural networks also have been used to model stock-recruitment and apparent presence/absence data [@manel1999comparing; @chen2006neural; @ozesmi2006methodological; @harris2015generating; @chen2016deep], but have not yet been extended to account for imperfect detection, despite increasing recognition of its importance [@guillera2017modelling; @tobler2019joint]. Notably, standard neural networks that classify or predict data are of limited use for ecological applications where imperfect data are used to learn about dynamics of hard to observe systems. As a solution, this paper describes neural hierarchical models that combine the function representation capacity of neural networks with hierarchical models that represent ecological processes.

Related work {-}

Neural networks {-}

Neural networks are function approximators. Linear regression is a special case, where an input vector $\vec{x}$ is mapped to a predicted value $y$:

$$y = \vec{w}^T \vec{x},$$

where $\vec{w}$ is a parameter vector (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))a). Predicted values are linear combinations of the inputs $\vec{x}$ due to the product $\vec{w}^T \vec{x}$, which restricts the complexity of the function mapping $\vec{x}$ to $y$.

Instead of modeling outputs as linear functions of $\vec{x}$, a model can be developed that is a linear function of a nonlinear transformation of $\vec{x}$. Nonlinear transformations can be specified via polynomial terms, splines, or another basis expansion [@hefley2017basis], but neural networks parameterize the transformation via a set of sequential "hidden layers" [@goodfellow2016deep].

In a neural network with one hidden layer, the first hidden layer maps the length $D$ input $\vec{x}$ to a length $D^{(1)}$ vector of "activations" $\vec{a}^{(1)} = \vec{W}^{(1)} \vec{x}$, where $\vec{W}^{(1)}$ is a $D^{(1)} \times D$ parameter matrix. The activations are passed to a differentiable nonlinear activation function $g$ to obtain the "hidden units" of the first layer $\vec{h}^{(1)} = g(\vec{a}^{(1)})$ (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))b).

The final layer of a neural network maps the a hidden layer to an output. For a neural network with one hidden layer, if the output variable $\vec{y}$ is a $K$ dimensional vector, the output unit activations are given by $\vec{a}^{(2)} = \vec{W}^{(2)} \vec{h}^{(1)}$, where $\vec{W}^{(2)}$ is a $K \times D^{(1)}$ parameter matrix. The output activation can be written as a composition of functions that transform the inputs $x$:

$$\vec{a}^{(2)} = \vec{W}^{(2)} \Big( g(\vec{W}^{(1)} \vec{x}) \Big).$$

Similar to link functions in generalized linear models, outputs can be transformed by an output activation function. For example, if $\vec{y}$ is unbounded, the identity function can act as an output activation so that $\vec{y} = \vec{a}^{(2)}$. Neural networks that predict probabilities typically use a sigmoid (inverse logit) activation function.

Neural networks usually are trained using stochastic gradient based optimization to minimize a loss function, e.g., the negative log likelihood of a Gaussian distribution for a regression task, a Bernoulli distribution for binary classification, or a Poisson distribution for a count model. Partial derivatives of the model parameters are computed with respect to the loss via backpropagation, and parameters are updated to reduce the loss. In practice, these partial derivatives are often computed via automatic differentiation over a "mini-batch" of samples, which provides a noisy estimate of the gradient [@ruder2016overview].

Neural networks are popular both because of their practical successes in a wide variety of applications, and because they possess some desirable theoretical properties. A neural network with suitable activation functions and a single hidden layer containing a finite number of neurons can approximate nearly any continuous function on a compact domain [@cybenko1989approximation; @hornik1991approximation]. "Deep" neural networks with many sequential hidden layers also act as function approximators [@lu2017expressive]. The field of deep learning, which applies such networks, provides a variety of network architectures to account for temporal structure [@hochreiter1997long], spatial structure on regular grids [@long2015fully], or graphs [@niepert2016learning], sets of unordered irregular points [@li2018pointcnn], and spatiotemporal data on grids or graphs [@xingjian2015convolutional; @jain2016structural].

The potential for deep learning has been recognized in Earth science [@reichstein2019deep], the natural sciences [@ching2018opportunities; @gazestani2019genotype; @roscher2019explainable], physical sciences [@carleo2019machine], chemical sciences [@butler2018machine], and ecology [@christin2018applications; @Desjardins-Proulx161125]. For example, models of lake temperature that combine neural networks with loss functions consistent with known physical mechanisms perform better than physical models and neural networks applied alone [@karpatne2017physics]. Similarly, generative adversarial networks with loss functions that encourage mass-balance have expedited electromagnetic calorimeter data generation from the Large Hadron Collider [@paganini2018accelerating; @radovic2018machine]. Convolutional neural networks also have been successfully deployed in population genetics to make inferences about introgression, recombination, selection, and population sizes [@flagel2018unreasonable]. There are various ways to combine science knowledge with deep learning. @ba2019blending provide a useful taxonomy in the context of physics-based deep learning. In ecology, hierarchical models present an opportunity to build upon existing approaches to derive science-based deep learning methods.

Hierarchical models {-}

Hierarchical models combine a data model, a process model, and a parameter model [@berliner1996hierarchical; @wikle2003hierarchical]. Data models represent the probability distribution of observations conditioned on a process and some parameters, e.g., the probability of capturing a marked animal, given its true state (alive or dead). Process models represent states and their dynamics, conditioned on some parameters. State variables are often incompletely observed, e.g., whether an individual animal is alive or whether a site is occupied. Parameter models represent probability distributions for unknown parameters - priors in a Bayesian framework. In a non-Bayesian setting, parameters are treated as unknowns and estimated from the data, but parameter uncertainty is not represented using probability distributions [@cressie2009accounting].

Neural hierarchical models {-}

fig_src_files <- c(here('fig', 'fig2.pdf'), 
                   here('fig', 'roc-test.pdf'), 
                   here('fig', 'centroid-displacement.pdf'),
                   here('fig', 'persist-dist-plot.pdf'), 
                   here('fig', 'route_tsne.pdf'), 
                   here('fig', 'occupancy_scatter.pdf'))

fig_caps <- c(
  "Computation graphs for (a) linear regression, (b) a neural network with one hidden layer, (c) a dynamic occupancy model, (d) a single species neural dynamic occupancy model, and (e) a deep multi-species neural dynamic occupancy model. Yellow and red indicate quantities specific to process and observation components respectively. Inputs are represented by $x$, and predicted values in panels (a) and (b) by $y$. Hidden layers are represented by $h$, with layer-specific superscripts. Outputs include initial occupancy ($\\psi_1$), persistence ($\\phi$), colonization ($\\gamma$), and detection ($p$) probabilities. Latent species embedding vectors are represented by $v$, and GRU indicates a gated recurrent unit.", 
  '(a) Receiver operator characteristic curves of binarized detection/non-detection data for each survey route in the test set, colored by the area under the curve (AUC). Here, the x-axis is the complement of specificity: the ratio of the number of false positives (incorrectly predicted detections, with no observed detections) to the sum of false positives and true negatives (correctly predicted non-detections with observed non-detections). The y-axis is true positive rate: the ratio of the number of true positives (correctly predicted detections with observed detections) to the sum of true positives and false negatives (incorrectly predicted non-detections with observed detections). The overall distribution of AUC values is shown in (b), and (c) shows the locations of test set routes colored by AUC values, where black dots represent routes that were used to train the final model.',
  '(a) Centroid displacement of bird species (x-axis) from 1997 to 2018 in kilometers vs. finite-sample population growth rate (y-axis), where a growth rate of 1 is stable, values less than one are decreasing, and values greater than one are increasing. Species with the highest population growth rates are highlighted. Panel (b) shows the locations of breeding bird survey range centroids in 1997 and 2018 for each highlighted species, along with grey points that represent survey routes where the species was estimated to be present in at least one year.',
  '(a) Scatter plot relating species-specific distance decay coefficients for colonization and persistence, with focal species from each quadrant highlighted. (b) Mean finite-sample occupancy (x-axis) vs. coefficients relating distance from range centroid and the probability of colonization (y-axis, top row), and persistence (y-axis, bottom row). (c) Maps of colonization and persistence probabilities at each route where focal species were likely absent (in grey) and present (in color, normalized to increase the visibility of gradients). Centroids for each year are shown in solid black. (d) Colonization and persistence probabilities (y-axis) as a function of distance from range centroid, averaged among years.',
  'Clustering of North American Breeding Bird Survey routes by ecoregion and route-level features. All panels show a two dimensional plot of route vectors, computed using t-distributed stochastic neighbor embedding (t-SNE) on the latent vectors for initial occupancy, persistence, colonization, and detection probabilities. Each route is shown as a point. Color indicates (a) EPA level 1 ecoregion and (b) z-standardized continuous route-level features. PC1 refers to the first principal component of WorldClim climate data.',
  'Relationships among species estimated occupancy probabilities through time, by location for species pairs that are nearest neighbors in parameter cosine distance (left column), and most cosine dissimilar (right column). Color indicates EPA level 1 ecoregions. Cosine similarity values for species pairs are printed in the upper corners of each panel. Each route is a line segment that connects estimates of occupancy probabilities for each year from 1997 to 2018. Species pairs with high cosine similarity tend to have positively related occupancy trajectories.'
)

if (params$preprint) knitr::include_graphics(fig_src_files[1])

Neural hierarchical models are hierarchical models in which the observation, process, or parameter model is parameterized by a neural network. These models are hierarchical (sensu @berliner1996hierarchical) if they distinguish between a modeled process and available data, e.g., between partial differential equations and noisy observations of their solutions [@raissi2018deep]. "Deep Markov models" - hidden Markov models parameterized by neural networks - provide an example, with successful applications in polyphonic music structure discovery, patient state reconstruction in medical data, and time series forecasting [@krishnan2017structured; @rangapuram2018deep]. State-space neural networks that use recurrent architectures provide another example dating back two decades [@zamarreno1998state; @van2002freeway]. This class of models inherits the flexibility and scalability of neural networks, along with the inferential power of hierarchical models, but applications in ecology and environmental science are only just beginning to emerge [@wikle2019comparison].

Construction of such models from existing hierarchical models is straightforward. For example, one can propose neural variants of occupancy models [@mackenzie2002estimating], dynamic occupancy models [@mackenzie2003estimating; @royle2007bayesian], N-mixture models [@royle2004n], mark-recapture models [@jolly1965explicit; @calvert2009hierarchical], and other hidden Markov models [@patterson2009classifying; @langrock2012flexible; @patterson2017statistical]. Output activation functions can be determined from inverse link functions, e.g., sigmoid (inverse logit) activations for probabilities, and loss functions can be constructed from the negative log likelihoods (see Appendix S1 in Supporting Information for example model specifications). Specialized neural network architectures that operate on structured data can be readily integrated into such models. For example, Appendix S2 in Supporting Information provides a simulated animal movement case study where aerial imagery is mapped to state transition probabilities of a hidden Markov model with a convolutional neural network. To provide an empirical use case, a neural dynamic occupancy model is developed for extinction and colonization dynamics of North American bird communities.

Case study {-}

summary_df <- read_csv(here('data', 'cleaned', 'bbs-summary.csv'))
printnum <- function(x) {
  trimws(prettyNum(x, big.mark=","))
}

The North American Breeding Bird Survey (BBS) is a large-scale annual survey aimed at characterizing trends in roadside bird populations [@link1998estimating; @sauer2011analysis; @sauer2013north; @bbs2018]. Thousands of routes are surveyed once a year during the breeding season. Surveys consist of volunteer observers that stop 50 times at points 800 meters apart on a transect, recording all birds detected within 400 meters for three minutes. Species can be present, but not detected, and may go locally extinct or colonize new routes from year to year, motivating the development of dynamic occupancy models which use imperfect detection data to estimate latent presence or absence states [@mackenzie2003estimating; @royle2007bayesian].

This case study uses BBS data from 1997-2018 excluding unidentified or hybrid species, restricting the analysis to surveys meeting the official BBS criteria [@bbs2018]. The resulting data consists of r printnum(summary_df$n_species) species sampled at r printnum(summary_df$n_routes) routes, for a total of r printnum(summary_df$n_surveys) surveys (not every route is surveyed in each year), r printnum(summary_df$n_species * summary_df$n_routes) observation history time series, and r printnum(summary_df$n_detection_records) detection/non-detection observations.

Process model {-}

A multi-species dynamic occupancy model for spatially referenced routes $s = 1, ..., S$, surveyed in years $t=1,...T$, for species $j=1,...,J$ aims to estimate colonization and extinction dynamics through time. The true occupancy state $z_{t, s, j}=1$ if species $j$ is present, and $z_{t, s, j} = 0$ if species $j$ is absent. The model represents $z_{t, s, j}$ as a Bernoulli distributed random variable, where $\text{Pr}(z_{t, s, j} = 1) = \psi_{t, s, j}$. The probability of occurrence on the first timestep is $\psi_{t=1, s, j}$. Subsequent dynamics are determined by probabilities of persistence from time $t$ to $t+1$ denoted $\phi_{t,s,j}$, and probabilities of colonization from time $t$ to $t+1$ denoted $\gamma_{t, s, j}$, so that the probability of occurrence in timesteps $t=2, ..., T$ is [@mackenzie2003estimating]:

$$\psi_{t, s, j} = z_{t-1, s, j} \phi_{t-1, s, j} + (1 - z_{t-1, s, j}) \gamma_{t-1, s, j}.$$

Observation model {-}

Let $y_{t, s, j}$ represent the number of stops where species $j$ was detected in year $t$ on route $s$. Conditional on a species being present, it is detected at each stop with probability $p_{t, s, j}$. Assume that there are no false-positive detections, so that if a species is absent, it cannot be detected [but see @royle2006generalized]. With $k=50$ replicate stops on each transect, the observations can be modeled using a Binomial likelihood for one species-route-year combination:

$$[y_{t, s, j} \mid z_{t, s, j}, p_{t, s, j}] = \text{Binomial}(y_{t, s,j } \mid z_{t, s, j} p_{t, s, j}, k),$$

where square brackets denote the probability function (e.g., in this case, $[y_{t, s, j} \mid z_{t, s, j}, p_{t, s, j}] = \text{Pr}(Y_{t, s, j} = y_{t, s, j} \mid z_{t, s, j}, p_{t, s, j})$ where $Y_{t, s, j}$ is a discrete random variable and $y_{t, s, j}$ is a particular value). The joint likelihood corresponds to the product of these terms for all years, routes, and species [@dorazio2010models].

Parameter models {-}

Heterogeneity in parameter values among routes, years, species, and surveys was modeled using three different approaches.

Single-species baseline models mapped input features to occupancy parameters with a linear combination on the logit scale [@mackenzie2003estimating] (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))c).
Single-species neural hierarchical models mapped inputs to occupancy parameters using a neural network with one hidden layer (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))d).
A multi-species deep neural hierarchical model was developed to model occupancy dynamics of all species simultaneously (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))e).

Input features included Environmental Protection Agency (EPA) level one ecoregions, the first eight principal components of the standard 19 WorldClim Bioclimatic variables averaged across the years 1970-2000 [@fick2017worldclim], BBS route spatial coordinates, distance from the coast, elevation, and road density within a 10 km buffer of each route [@meijer2018global]. The model of detection probabilities additionally included survey-level features including temperature, duration, wind, and air conditions.

The neural hierarchical models were motivated by joint species distribution models in which species load onto a shared set of latent factors [@thorson2015spatial; @warton2015so; @ovaskainen2016uncovering; @thorson2016joint; @tikhonov2017using], and by recent work on deep neural basis expansions [@mcdermott2019deep; @wikle2019comparison]. Analogously, the neural networks combined inputs into a latent vector for each route (the hidden layers), which act as latent factors that are mapped to parameters (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))d). Because the single species neural hierarchical models were fit separately for each species, latent factors were species-specific. In contrast, latent factors were shared among species in the multi-species model.

The multi-species neural dynamic occupancy model additionally built upon previous work on deep multi-species embedding. Deep multi-species embedding uses vector-valued "entity embeddings" to represent each species [@chen2016deep; @guo2016entity]. These entity embeddings are mappings from categorical data (e.g., a species identity) to continuous numeric vector representations. Embeddings are used extensively in language models such as word2vec, which maps words to vector spaces [@mikolov2013efficient].

Further, the multi-species model combined species embeddings with encoder-decoder components to estimate occupancy parameters. Encoder-decoder neural networks are used in sequence-to-sequence translation [@sutskever2014sequence]. They encode inputs (e.g., a sequence of words) into a latent vector space, then decode that vector representation to generate another sequence using a neural network. Similarly, the multi-species model encoded route-level features into vector representations. These route vectors were decoded by a recurrent neural network to generate a multivariate time series of latent vectors associated with colonization, persistence, and detection [@chung2014empirical]. Finally, the multi-species model combined these route-level latent vectors with species-level embeddings to compute colonization, persistence, and detection probabilities (Fig. \@ref(r ifelse(params$preprint, "fig:figure2", "fig:endfig1"))e). For additional details, see Appendix S3 in Supporting Information.

Model comparisons {-}

clean_routes <- read_csv(here('data', 'cleaned', 'clean_routes.csv'))
route_counts <- count(clean_routes, group)

To compare the performance of the three modeling approaches, the data were partitioned into a training, validation, and test set at the EPA level two ecoregion level [@roberts2017cross]. All routes within an ecoregion were assigned to the same partition. This resulted in r route_counts$n[route_counts$group == "train"] training routes, r route_counts$n[route_counts$group == "validation"] validation routes, and r route_counts$n[route_counts$group == "test"] test routes. K-fold cross validation would also be possible, though it requires retraining of each model K times [@roberts2017cross]. Because of the size of the BBS data and the computational resources required to train these models, a simpler train/validation/test split was used.

For each of the three modeling approaches, routes in the training set were used for parameter estimation. Single-species models were fit separately for each species (modeling approaches 1 and 2), and one multi-species model (approach 3) was fit using all of the training data. Then, using the trained models, the mean predictive log-density of the validation data was evaluated to identify the best performing model [@gelman2014understanding]. This step indicated which model fit best to the withheld validation data. Finally, the best performing model was retrained using the training and validation data, and its predictive performance was evaluated on the withheld test set [@russell2016artificial].

if (params$preprint) knitr::include_graphics(fig_src_files[2])

Final model evaluation {-}

The final model's performance was evaluated quantitatively and qualitatively. Quantitative predictive performance was evaluated in two ways. First, 95% prediction interval coverage was computed for test set counts. Prediction intervals were constructed using the quantiles of the binomial distribution, marginalizing over latent occupancy states. Second, the area under the receiver operator characteristic curve (AUC) was computed for binarized test set counts. The AUC analysis also marginalized over latent states to derive predicted probabilities, e.g., $\text{Pr}(y_{t, s, j} = 0 \mid p_{t, s, j}, \psi_{t, s, j}) = 1 - \psi_{t, s, j} + \psi_{t, s, j}(1 - p_{t, s, j})^{50}$, where the parameters $p$ and $\psi$ are estimated by the model. These two approaches provide information on how well the final model could predict the number of stops with detections at each route, and whether any detections would occur on each route, respectively.

Qualitative analyses were based on predicted occupancy states from the final model. The most likely occupancy states were computed at all BBS routes for each year and species using the Viterbi algorithm [@viterbi1967error]. The estimated occupancy states were then used to compute finite sample population growth rates for each species [@royle2007bayesian]. Occupancy state estimates were also used to compute annual spatial centroids for each species, by taking the spatial centroids of occupied route coordinates. These are hereafter referred to as "BBS range centroids" to differentiate from the actual centroid of a species' entire range.

To evaluate whether the results were qualitatively consistent with previous findings about colonization and extinction gradients over species ranges, correspondence between BBS range centroids and both colonization and persistence probabilities were assessed using linear regression [@mehlman1997change; @doherty2003local; @royle2007bayesian]. For this analysis, the response variable was colonization or persistence averaged over time, the predictor was distance from BBS range centroid. Survey routes were the sample units, and separate analyses were conducted for each species. To avoid bias associated with recently added BBS routes and variance associated with rare species, these analyses only used data from BBS routes surveyed in every year, species observed in every year, and species that occurred in 100 or more routes [@sauer2017expanding].

Finally, visualizations were developed to graphically interpret the final model. First, route-level features were visualized using t-distributed stochastic neighbor embedding, which maps high dimensional vectors to lower dimensional representations [@maaten2008visualizing]. In this lower dimensional space, routes with similar embeddings are close together, and routes with dissimilar embeddings appear distant [@rauber2016visualizing]. Second, species' loading vectors were compared in terms of cosine similarity, which measures the orientation of loading vectors in latent space. Species' occupancy should be positively related if loading vectors are oriented similarly. This expectation was checked graphically by comparing estimated occupancy time series of species that had the most similar and most different loading vectors.

Results {-}

nll <- read_csv(here('out', 'nll-comps.csv'))
valid_nll <- filter(nll, group == 'Validation data')
train_nll <- filter(nll, group == 'Training data')
coverage_df <- read_csv(here('out', 'coverage_df.csv'))
auc_df <- read_csv(here('out', 'auc_df.csv'))
dec_df <- read_csv(here('out', 'dec_df.csv'))

cosine_sim <- read_csv(here('out', 'cosine_sim.csv'))
bbs_species <- read_csv(here('data', 'cleaned', 'bbs_species.csv')) %>%
  mutate(sciname = paste(genus, species))
select_paths <- read_csv(here('out', 'select_paths.csv')) %>%
  arrange(-growth_rate)

notable_phi_sp <- dec_df %>%
  top_n(1, phid_cor) %>%
  bind_rows(top_n(dec_df, 1, -phid_cor)) %>%
  left_join(bbs_species)

The multi-species neural hierarchical model performed best on the withheld validation routes. The difference in mean validation set negative log likelihood was r round(valid_nll$ss_nll - valid_nll$nn_nll, 3) relative to the baseline model and r round(valid_nll$ss_nll - valid_nll$sn_nll, 3) relative to the single-species neural hierarchical model. For the final model, 95% prediction interval coverage for observed counts at withheld test set routes was r round(mean(coverage_df$coverage), 3) * 100%, with a standard deviation of r round(sd(coverage_df$coverage), 3) * 100%, a minimum of r round(min(coverage_df$coverage), 3) * 100%, and a maximum of r round(max(coverage_df$coverage), 3) * 100%. The model also predicted whether species would be detected on test set routes fairly well, with a mean test set AUC of r round(mean(auc_df$.estimate), 3), an among-route standard deviation of r round(sd(auc_df$.estimate), 3), a minimum of r round(min(auc_df$.estimate), 3), and a maximum of r round(max(auc_df$.estimate), 3) (Fig. \@ref(r ifelse(params$preprint, "fig:roc-test", "fig:endfig2"))).

if (params$preprint) knitr::include_graphics(fig_src_files[3])

highest_growth_rates <- paste0(
  tools::toTitleCase(select_paths$english[2:4]), 
  ' (*', 
  paste(tools::toTitleCase(select_paths$genus), select_paths$species)[2:4], 
  '*)')

Qualitative results related to range shifts and population growth rates were plausible given previous work. The model identified the invasive Eurasian Collared-Dove (Streptopelia decaocto) as having the greatest range centroid displacement from 1997 to 2018 and the highest finite sample population growth rate (Fig. \@ref(r ifelse(params$preprint, "fig:centroid-displacement", "fig:endfig3"))), consistent with its invasion of North America from Florida following its introduction in the 1980's [@bled2011hierarchical]. Increasing trends also have previously been reported for the species with the next three highest population growth rates: r highest_growth_rates[1], r highest_growth_rates[2], and r highest_growth_rates[3] [@sauer2013north].

if (params$preprint) knitr::include_graphics(fig_src_files[4])

The majority (r round(mean(dec_df$phid_cor < 0), 2) * 100%) of common species were less likely to persist at routes that were distant from their estimated BBS range centroids. Similarly, r round(mean(dec_df$gammad_cor < 0), 2) * 100% of common species were less likely to colonize routes distant from their BBS range centroids. There were examples of species with positive and negative distance coefficients for persistence and colonization (Fig. \@ref(r ifelse(params$preprint, "fig:persist-dist", "fig:endfig4"))a). Negative relationships were most apparent for common species that occupied a large fraction of BBS routes (Fig. \@ref(r ifelse(params$preprint, "fig:persist-dist", "fig:endfig4"))b). Results for representative species are displayed in Fig. \@ref(r ifelse(params$preprint, "fig:persist-dist", "fig:endfig4"))c-d.

if (params$preprint) knitr::include_graphics(fig_src_files[5])

Route vectors combined information from the categorical and continuous route-level features. Unsurprisingly (because ecoregion was an input feature), routes in the same ecoregions clustered together (Fig. \@ref(r ifelse(params$preprint, "fig:route-tsne", "fig:endfig5"))a). Route embeddings also revealed relationships among ecoregions. For example, Marine West Coast Forest ecoregion routes were similar to Northwestern Forested Mountains routes, and most different from Northern Forests routes (Fig. \@ref(r ifelse(params$preprint, "fig:route-tsne", "fig:endfig5"))a). Variation within clusters also related to continuous route-level features (Fig. \@ref(r ifelse(params$preprint, "fig:route-tsne", "fig:endfig5"))b).

if (params$preprint) knitr::include_graphics(fig_src_files[6])

The model predicted non-linear dependence among species that is interpretable in terms of the cosine similarity among species-specific parameter vectors. For example, parameters for Mourning Dove (Zenaida macroura) were closest to those of r cosine_sim$nearest_species[cosine_sim$english == 'Mourning Dove'] (r bbs_species$sciname[bbs_species$english == cosine_sim$nearest_species[cosine_sim$english == 'Mourning Dove']]), and most different from r cosine_sim$farthest_species[cosine_sim$english == 'Mourning Dove'] (r bbs_species$sciname[bbs_species$english == cosine_sim$farthest_species[cosine_sim$english == 'Mourning Dove']]) (Fig. \@ref(r ifelse(params$preprint, "fig:occupancy-scatter", "fig:endfig6"))a). On BBS routes where Mourning Doves are likely to occur, r cosine_sim$nearest_species[cosine_sim$english == 'Mourning Dove'] are also likely to occur, and r cosine_sim$farthest_species[cosine_sim$english == 'Mourning Dove'] are unlikely to occur. Species pairs of the most similar and most dissimilar loadings are also provided for Eurasian Collared-Dove and Bald Eagle (Fig. \@ref(r ifelse(params$preprint, "fig:occupancy-scatter", "fig:endfig6"))b-c).

Discussion {-}

Neural hierarchical models provide a bridge between hierarchical models built from scientific knowledge, and neural networks that approximate functions over structured domains. This framework integrates research in science-based deep learning and ecological modeling, and can use existing hierarchical models as a starting point. The case study provides a proof of concept example of constructing a scalable and performant neural hierarchical model based on a multi-species dynamic occupancy model, providing insights about colonization and extinction dynamics of North American bird assemblages at a continental scale.

highest_slope_species_english <- notable_phi_sp$english[which.max(notable_phi_sp$phid_cor)]
highest_slope_species_sciname <- bbs_species %>%
  filter(english == highest_slope_species_english) %>%
  mutate(sciname = paste(genus, species)) %>%
  select(sciname) %>%
  unlist

The breeding bird survey case study indicates that neural hierarchical models can outperform simpler models. Notably, the final model was performant both quantitatively and qualitatively, detecting population increases and range expansions that are consistent with prior work. The case study extends previous analyses of persistence probabilities at range edges [@royle2007bayesian], indicating that common species tend to have higher persistence and colonization probabilities at routes close to their BBS range centroid. Yet, interpreting this result is complicated by irregular range geometry which can lead to centroids that near the edge (or even outside) of a species range, and divergence between actual and estimated ranges due to incomplete sampling [@sagarin2002abundant; @fortin2005species; @dallas2017species; @knouft2018appropriate]. Additional complexity is apparent in Fig. \@ref(r ifelse(params$preprint, "fig:persist-dist", "fig:endfig4"))c, which indicates that gradients in colonization and persistence may not be isotropic (the same in every direction). With those caveats, the results are broadly consistent with a theoretical expectation that range boundaries can arise from gradients in local extinction and colonization rates [@holt2000alternative].

The case study used a gated recurrent neural network architecture to handle temporal structure [@chung2014empirical], and architectures designed for other data structures present additional opportunities for ecological applications. For example, neural hierarchical models could be used to couple a convolutional neural network observation model for camera trap images [@norouzzadeh2018automatically; @tabak2019machine] with an ecological process model that describes animal density, movement, or community composition [@burton2015wildlife]. Gridded data such as modeled climate data, remotely sensed Earth observations, and even 96 well microplates also might use convolutional neural network architectures [@rawat2017deep]. Appendix S2 provides a simulated example where gridded data (aerial imagery) are mapped to a state transition matrix of a hidden Markov model. In addition, many ecological datasets exhibit graph structure related to phylogenies, social networks, or network-like spatial structure. While it is possible to adapt convolutional neural networks to operate on distance matrices computed from graphs [@fioravanti2018phylogenetic], graph representation learning can also provide embeddings for nodes that encode network structure and node attributes [@hamilton2017inductive; @cai2018comprehensive].

Neural hierarchical models can scale to data that are too large to fit in computer memory. Indeed, memory limitations precluded comparison of a fully Bayesian multi-species dynamic occupancy model against the multi-species neural hierarchical model. The current state of the art multi-species occupancy models use approximately one tenth of the number of species, at about half the number of sites, and are static in the sense that colonization and extinction dynamics through time are not represented [@tobler2019joint]. Species distribution models are increasingly scalable due to advances in approximate Gaussian process models [@tikhonov2019computationally], but multi-species dynamic occupancy models have not previously been reported at this scale. This is particularly relevant for extensions of the BBS case study, given the volume of (imperfect) bird data accumulating through citizen science programs [@sullivan2009ebird]. The key strategy providing scalability in the case study is stochastic optimization that uses mini-batches - small subsets of a larger dataset - to generate noisy estimates of model performance and partial derivatives that can be used during training [@ruder2016overview]. With this approach, the entire dataset does not need to be loaded into memory at the same time. This could be useful as a way to scale models that integrate BBS, eBird, and other data [@ngiam2011multimodal; @pacifici2017integrating; @zipkin2017integrating; @MistNet2].

Though neural hierarchical models can perform well in the large data regime, they might be overparameterized in a small data setting. For this reason, performance comparisons with simpler baseline models are useful. Appendix S2 provides a simulated example, where simple baselines work better with small datasets, and more complex neural hierarchical models work better with large datasets. Approaches familiar to ecologists such as k-fold cross validation and (to a lesser extent) information criteria such as the Akaike Information Criterion (AIC) have been used with neural networks, but such approaches are less common than cross-validation with training, validation, and test partitions of the data [@anderson2004model; @jiang2016displacement; @ran2017parameter]. With large datasets, considerable computational resources are required for training, so that retraining the same model many times might make k-fold cross validation infeasible. As a practical matter, even if a neural hierarchical model demonstrates superior predictive performance, a simpler model might be preferred if interpretability and/or explainability is a high priority, as it might be in a decision-making context [@rudin2018please].

Neural networks have a reputation for being "black-box" models. However, the interpretation and explanation of such models is an active area of research [@roscher2019explainable]. Interpretations map abstract concepts to domains that humans can make sense of, e.g., mapping neural network parameters to species identity or spatial location [@montavon2018methods]. Explanations are collections of features in such a domain that contributed to a prediction [@montavon2018methods], e.g., sensitivity of model output to perturbations of the input [@olden2002illuminating; @gevrey2003review].

Looking ahead, in a forecasting or decision-making setting, it would be important to estimate both aleatoric uncertainty (arising from noise in an observation process) and epistemic uncertainty (arising from uncertainty in a model and its parameters) [@clark2001ecological; @kendall2017uncertainties]. Recent advances in accounting for uncertainty in neural network parameter estimates and architectures could be applied in future work. Approaches include "Bayes by backpropagation" [@blundell2015weight], normalizing flows for variational approximations [@kingma2016improved], adversarial training [@lakshminarayanan2017simple], methods that use dropout or its continuous relaxation [@gal2017concrete], and ensemble approaches [@mcdermott2017ensemble]. These methods can also help explain neural networks, e.g., to probabilistically estimate sensitivity to model inputs to random masking [@chang2017interpreting], or to decompose predictive uncertainty into component parts [@thiagarajan2019understanding].

The potential for combining neural networks with mechanistic models was recognized more than thirty years ago [@psichogios1992hybrid; @meade1994numerical; @lagaris1998artificial]. This potential is more easily realized today due to methodological spillover from deep learning into the natural sciences, but also increases in computing power, availability of ecological data, and the proliferation of educational content for quantitative ecology and machine learning. Further, modern deep learning frameworks provide abstractions that allow users to focus on model construction rather than the details of implementation, increasing accessibility in the same way that WinBUGS, JAGS, OpenBUGS, and Stan have done [@lunn2000winbugs; @plummer2003jags; @spiegelhalter2005openbugs; @carpenter2017stan].

Although deep learning and ecological modeling may seem to be separate activities, neural hierarchical models bridge these disciplines. Given the increasing availability of massive ecological data, scalable and flexible science-based models are increasingly needed. Neural hierarchical models satisfy this need, and can provide a framework that links imperfect observational data to ecological processes and mechanisms by construction.

Acknowledgements {-}

Thanks to David Zonana and Roland Knapp for discussions on the potential of neural hierarchical models, and to Susie Ellis for providing feedback on a draft of the manuscript. This manuscript was also greatly improved by three anonymous reviewers. Also thanks to Brandon Edwards for developing the bbsBayes R package, which was used to acquire and parse the North American Breeding Bird Survey data. This work was motivated by a workshop on machine learning in Earth science, hosted by Earth Lab and organized by the Federation of Earth Science Information Partners and the National Aeronatics and Space Administration Advanced Information Systems Technology program. This work was made possible by the CU Boulder Grand Challenge initiative and the Cooperative Institute for Research in the Environmental Sciences through their investment in Earth Lab. Thanks also to the thousands of U.S. and Canadian participants who annually perform and coordinate the North American Breeding Bird Survey.

Literature cited {-}

\clearpage

r if (!params$preprint) "# Figures {-}"

if (!params$preprint) knitr::include_graphics(fig_src_files[1])