Introduction to the rGAI package

The rGAI package is an extension of functionality provided by Dennis et al (2016). The package provides a user interface for calculating a Generalised Abundance Index (GAI) for seasonal invertebrates, with multiple options for covariate inclusion, seasonal flight patterns and observed count distributions, along with the ability to conduct bootstraps to obtain measures of uncertainty for estimated parameter and GAI values.

This vignette will illustrate the features available within the package by providing an example workflow on simulated data.

Installing the Package from Source

In order to install the rGAI package from source, the folder containing the source code must be downloaded from GitHub, or installed via install_github in the devtools package.

```{R loading, results = 'hide', warning = F}

library(devtools, quietly = T)

install_github("calliste-fagard-jenkin/rGAI", quiet = F)

library(rGAI)

# Preparing Data for the `rGAI` package

Let us begin by loading an example data set which is provided within the `rGAI`
package, and examining its structure. These data consist of a simulation from 
a mixture model with three broods, across 50 sites and 26 occasions (one survey
every week from April to September, inclusive, for a given year). The `rGAI`
package comes with a handy function (`extract_counts`) to convert `data.frame`
data to a matrix of counts, to facilitate plotting. This function also checks if 
the survey has any missing occasions, and so can be a useful way of sanitising a 
new dataset.

```{R data, fig.width = 7, fig.height = 5}
# For the pipe operator:
library(magrittr)
data("example_data")
head(example_data)

test_data <- extract_counts(example_data, returnDF = T)$matrix

# Plot the observed densities, averaged across all sites:
test_data %>% apply(2, mean, na.rm = T) %>% 
  plot(x = 1:ncol(test_data), type = 'l', col = rgb(1, 0, 0, 0.6), lty = 2, 
       xlab = 'Week', ylab = 'Observed count, averaged across all sites')

Each row in our input data.frame is used to store the result of the observed count at a given site on a given occasion. We assume that the interval between each occasion is fixed, and that sites are independent. The 'site', 'occasion' and 'count' columns must be given these precise names to allow the rGAI package to recognise them. Missing observations can be listed with an NA entry.

Sites and occasions can have either character or numeric names. Because of this, the rGAI package will assume occasions occurred in the order they first appear in the data frame. If no observations were collected for any sites on a given occasion, this occasion should still be included in the data.frame with NA entries for each site.

The fourth column 'altitude' represents a covariate (with standardised values), and will be used to illustrate how covariates can be included into different models available within the package.

Fitting a Simple Model

As described by Dennis et al (2016), the Generalised Abundance Index is a model comprised of two parts. Firstly, the seasonal flight pattern describes the density of arrivals at each occasion in the study, and can be modelled either by a stopover model, a spline, or an n-component mixture model. The choice of seasonal flight pattern can be specified using the options "stopover", "spline" and "mixture", for each of the above options, respectively.

The second part of the GAI model describes the distribution of observed counts of individuals at each site on a given occasion. The package includes Poisson ("P"), Negative Binomial ("NB") and Zero-Inflated Poisson ("ZIP") as options for this distribution.

The options argument can be used to indicate if broods should all share the same standard deviation, or to provide the desired degree (of the polynomial) and df (degrees of freedom) of any splines used. The verbose = TRUE option can be used to display the negative log likelihood at each iteration of the model fitting process. The option hessian = TRUE can be used to tell the fit_GAI function to obtain an estimate of the hessian of the likelihood, which can be used to estimate the Fisher information.

```{R fitting}

We can load the true parameter values of the simulated data set from the

rGAI package, to more easily find starting parameter values for this example:

data("example_par")

Options for mixture and stopover models:

my_options <- list(B = 3, shared_sigma = T)

Options for spline models:

options_for_splines <- list(df = 20, degree = 3)

Now, fitting the model:

my_mixture_GAI <- fit_GAI(start = example_par, DF = example_data, a_choice = "mixture", dist_choice = "ZIP", options = my_options, verbose = T, hessian = T)

Print the MLEs to show the model has fitted correctly:

my_mixture_GAI$par

Not forgetting to add degrees of freedom and degree of splines information,

and noting that specifying the number of broods will no longer have an

effect:

my_spline_GAI <- fit_GAI(start = rep(0, 20), DF = example_data, a_choice = "splines", dist_choice = "P", options = options_for_splines, verbose = T, hessian = T)

Print the MLEs to show the model has fitted correctly:

my_spline_GAI$par

If an incorrect number of starting values is given to `fit_GAI`, an error will 
be raised, and a printout will indicate the parameter starting values that
`fit_GAI` expects. The order of these starting values is important, and matches 
the order of the output `fit_GAI` produces in this scenario. This feature should
be used to avoid making mistakes when unsure of how many parameters a model
specification will require. Confusingly, NA values are not allowed as starting
values, but functionality exists within the code to aid in finding reasonable
starting values, which is outlined later in this vignette.

It's worth noting that multiple numerical methods exist for finding the MLEs 
of the GAI model. The package uses the `optim` function which makes use of the 
"Nelder-Mead" algorithm by default, but this can be changed to other options,
such as "BFGS" or "SANN" (simulated annealing) by passing in `method = "SANN"`
(for example) as an argument to `fit_GAI`.

```{R mistake}
# We cut off some of our starting parameters on purpose, to cause the exception
# to be raised:
try({my_mixture_GAI <- fit_GAI(start = example_par[1:3], DF = example_data,
                               a_choice = "mixture", dist_choice = "ZIP",
                               options = my_options, verbose = T,
                               hessian = T)}, silent = T)

Including Covariates

Models using 'mixture' and 'stopover' seasonal flight patterns are able to include covariates on the values of the means, dispersions and weights for their normal distribution components.

Covariates can be included either as a general formula, which will apply to all broods, or be specified individually for each brood. Covariates for the normal components should be specified by adding a mu_formula, sigma_formula, or w_formula to the options argument, for the mean, standard deviation, and for weights of each component, respectively.

When specifying covariate formulae one by one for each brood, we have the option of giving completely different formulae for each brood, or using NULL to specify that no covariates should be fitted for a given brood. Because of this flexibility, it is good practice to ensure the list of formulae always has the same number of elements as the number of broods specified. If more formulae are specified than there are broods, these extra formulae will simply be ignored.

It should be noted that the value 0 is often used to initialise covariate parameters, and reflects the a priori assumption that the covariate has no effect. In the case of stopover models, this value leads to a probability of 0.5 for sojourning, once link functions have been applied.

```{R covariates}

To specify a formula which will be identical for each brood,

general_options <- list(B = 3, shared_sigma = T, mu_formula = formula(~altitude))

To be able to fit brood-specific covariate options for the standard deviations

we require the standard deviation to be estimated individually for each brood,

and so we set shared_sigma = F.

brood_specific_options <- list(B = 3, shared_sigma = F, sigma_formula = list(NULL, # for brood 1 formula(~altitude), # for brood 2 formula(~I(altitude^2)))) # for brood 3

We keep the same starting estimates as before, with 0 for the new parameter:

general_fit_start <- example_par[1:(length(example_par) - 1)] %>% c(0)

Having specified our covariate formulae, we can fit the model in exactly the

same way as before:

general_fit <- fit_GAI(start = general_fit_start, DF = example_data, a_choice = "mixture", dist_choice = "P", options = general_options, hessian = T)

Because these covariate data are dummies, we expect the fitted value to be

very close to zero:

general_fit$par

To make the starting values for the new model, we steal the known values from

the example_par vector (using the shared standard deviation across broods as

the estimate for each brood individually). We then also add two 0s as our

starting values for the covariate parameters

brood_specific_start <- c(example_par[c(1:3, rep(4, 3), 5:6)], rep(0, 2)) brood_specific_fit <- fit_GAI(start = brood_specific_start, DF = example_data, a_choice = "mixture", dist_choice = "P", hessian = T, options = brood_specific_options)

Checking the MLEs:

brood_specific_fit$par

A quick example to show a stopover model with the same options:

general_fit_stopover <- fit_GAI(start = c(general_fit_start, 0), DF = example_data, a_choice = "stopover", dist_choice = "P", options = general_options)

general_fit_stopover$par

brood_specific_stopover <- fit_GAI(start = c(brood_specific_start, 0), DF = example_data, a_choice = "stopover", dist_choice = "P", options = brood_specific_options)

brood_specific_stopover$par

And of course, we could also fit covariates to the simple univolitine case

(although not for weights):

univ_options <- list(B = 1, sigma_formula = formula(~altitude)) univoltine_fit <- fit_GAI(start = c(2.2, 1, 0, 0), DF = example_data, a_choice = "stopover", options = univ_options)

univoltine_fit$par

It should be noted that the `rGAI` package comes with no methods for interpolating 
missing covariate values. Because of this, any NA covariate values in `data.frame`
objects passed to `fit_GAI` will cause an error to be outputted to the console. 
The `rGAI` package only supports 'spatial' covariates, that is, covariates whose 
values vary only by site, and are identical for every single sampling occasion, 
for a given site. Supplying time-varying covariates to `fit_GAI` will similarly
lead to an error being thrown.

```{R covErrors}
example_NA <- example_time_varying <- example_data

# Turn roughly 5% of our altitude data to NA values:
indices <- example_NA$altitude %>% length %>% rbinom(size = 1, prob = 0.05)
example_NA$altitude[indices %>% as.logical] <- NA

# Turn our altitude data into a time-varying covariate by making it a function 
# of the occasion:
example_time_varying$altitude <- example_data$altitude * example_data$occasion

# Trying to fit a covariate model with these new data will throw the relevant
# errors:
error_fit <- try(fit_GAI(start = general_fit_start, DF = example_NA,
                         a_choice = "mixture", dist_choice = "P",
                         options = general_options, hessian = T))

# Trying to fit with a time-varying formula: 
error_fit <- try(fit_GAI(start = general_fit_start, DF = example_time_varying,
                         a_choice = "mixture", dist_choice = "P",
                         options = general_options, hessian = T))

Finding Starting Values

To aid numerical routines in estimating the maximum likelihood estimates of parameter values, link functions are used to ensure that parameter guesses can always be made on the entire real line. Because many broods can be used, the rGAI package uses custom link functions for mean arrival times and dispersions, which are slightly more complicated than the typical log and logistic link functions.

The fit_GAI function expects all starting values to be given on this link scale, which can make using our intuition or knowledge of our data very difficult to produce good starting values. To remedy this, the transform_starting_values function can be used to transform parameter guesses on the 'real world' scale to the link scale.

In the below chunk, we annotate the previous plot of observed counts. Adding in lines to mark estimates of the mean brood arrival times, and horizontal lines for the estimates of the standard deviation in arrival times for each brood. By assuming that most observations are contained within two standard deviations of the mean, we use the half width of a single horizontal bar as an estimate of the standard deviation. To estimate the component weights, we assume the height of the peak of a brood's component in the mixture is roughly proportional to the percentage of the population that are in that brood.

```{R transform1, fig.width = 7, fig.height = 5} test_data %>% apply(2, mean, na.rm = T) %>% plot(x = 1:ncol(test_data), type = 'l', col = rgb(1, 0, 0, 0.6), lty = 2, xlab = 'Week', ylab = 'Observed count, averaged across all sites')

plot.col <- 'blue' plot.base <- 1

Mean brood arrival lines:

lines(rep(4.1, 2), c(plot.base, 4.6), col = plot.col) lines(rep(13.0, 2), c(plot.base, 21.3), col = plot.col) lines(rep(23.0, 2), c(plot.base, 8), col = plot.col)

Mean brood dispersion lines:

lines(c(1, 7.5), rep(plot.base, 2), col = plot.col) lines(c(9, 18), rep(plot.base, 2), col = plot.col) lines(c(19, 26), rep(plot.base, 2), col = plot.col)

The length of the three horizontal bars that measure the width of our broods:

brood_widths <- c(6.5, 9, 7) sigma_guesses <- brood_widths / 4

w_guesses <- c(4.6 - plot.base, 21.3 - plot.base, 8 - plot.base) %>% sum_to_one

Our guesses for the means can simply be read off of the lines of code that

produced the plot:

mu_guesses <- c(4.1, 13, 23)

Let's consider our `brood_specific_fit` once more, but using the 
`transform_starting_values` function to obtain better starting values, without
knowing the distribution our data came from. The `transform_starting_values` 
function will assume that 0 is a reasonable starting value for all covariate
parameters and spline parameters, which usually leads to numerically sound
estimation of the MLEs. However, if covariates aren't normalised, and have 
large magnitude, this won't be the case. Note that in the case of 
zero-inflated poisson or negative binomial count distributions, an initial 
guess for the additional distributional parameter can also be provided, using 
the name `dist.par`. When covariates are included in a model specification, we
must also make sure the `transform_starting_values` function is supplied with
the `data.frame` of observations, with the argument `DF`.

```{R transform2}
# We create a list of starting values for parameters with the same structure 
# as the options argument for fit_GAI:
my_starting_guesses <- list(mu = mu_guesses, sigma = sigma_guesses,
                            w = w_guesses)

new_brood_specific_start <-
  transform_starting_values(starting_values = my_starting_guesses,
                            a_choice = "mixture", dist_choice = "P",
                            options = brood_specific_options,
                            DF = example_data)

# Let's print these new starting values, and refit the model:
print(new_brood_specific_start)
new_brood_specific_fit <- fit_GAI(start = new_brood_specific_start,
                                  DF = example_data, a_choice = "mixture",
                                  dist_choice = "P", hessian = T,
                                  options = brood_specific_options)
# Checking the MLEs:
new_brood_specific_fit$par

Using Bootstraps

Bootstraps in the rGAI package are available with two options. The first option will resample sites (and their covariate values) from the observed data set, refitting the model each time it does this. It then extracts the bootstrap distribution of estimated site superpopulation sizes and parameter estimates to create a quantile method confidence interval for these. Because this type of bootstrap can be incredibly computionally expensive, functionality has been provided to do these calculations in parallel.

The second method of bootstrapping involves resampling parameter values from their asymptotic multivariate normal distribution. This involves a numerical estimate of the Fisher Information matrix, and so is only available if a model has been fitted with the hessian = TRUE option in fit_GAI.

The second type of bootstrap runs considerably faster, and is likely to be the only available option when many sites or covariates have been included, due to computational constrains. However, it should be noted that the first type typically produces intervals with coverage closer to the target, as it has no dependence on asymptotic properties or estimates of the Information matrix.

Bootstraps can be produced for transformed covariates, for models with and without covariates, in both the 'refit the model', and the 'resample the MLEs' case.

The option R within the bootstrap function allows us to specify the number of bootstrap iterations the function should perform. refit = TRUE selects the 'refit the model' style of bootstrap. alpha specifies the targeted coverage of the bootstrap, and transform = TRUE indicates the bootstrap should be performed on transformed output parameter values (on the parameter scale, as opposed to the link function scale). The cores option will be used to determine the number of cores used to perform the calculation in parallel, when this is possible. The default value is one less than the number of cores on the device, and therefore, it is important to specify this value when using any computing devices which are shared between multiple users.

```{R bootstrap, warnings = F} general_fit_bootstrap <- bootstrap(general_fit, R = 500, refit = F, alpha = 0.01, transform = T)

refitting_bootstrap <- bootstrap(general_fit, R = 9, refit = T, parallel = F, cores = 3, alpha = 0.01, transform = T)

untransformed_bootstrap <- bootstrap(general_fit, R = 500, refit = F, transform = F, alpha = 0.01)

The bootstraps output contains raw values from each iteration of the estimated
parameter values, the flight path densities for each site and occasion, the estimated site totals (site super-populations), and the average estimated site
total. The $1 - \alpha$ confidence intervals are available in the `$par`,
`$sites` and `$index` elements of the bootstrap output for the parameters,
individual site totals and average site total, respectively:

Bootstrap outputs for parameters can also be transformed for a range of 
custom covariate functions, in order to produce plots, or for interpolation.
This is done with the `transform_bootstrap_parameters' function, which can take
as argument an input `data.frame` of parameter values, a fitted model object 
produced from the `fit_GAI` function (which it uses to extract model
specifications and design matrices), and a `data.frame` of custom covariate 
values for which parameter scale transformations are desired.

```{R intervals}
refitting_bootstrap$par            # parameter estimates
refitting_bootstrap$N[,1:5]        # site super-population estimates
refitting_bootstrap$EC[,1:5, 1:2]  # expected counts at each site, per occasion

transform_bootstrap_parameters(untransformed_bootstrap$par, general_fit, 
                               data.frame(altitude = c(-1e2, 0, 1e2)))

Model Outputs

The rGAI package comes with some simple functionality for summarising models, as well as more detailed outputs, which will be covered in more detail. The standard R summary function can be used to see the model's AIC, average estimate site total across all sites, and also the MLEs for all model parameters.

```{R summary}

Get a basic summary of the model outputs:

summary(my_mixture_GAI)

We can also obtain the AIC on its own using the standard AIC R function if

we aren't interested in producing the rest of the summary:

AIC(my_mixture_GAI)

The output of the `fit_GAI` function is a list with a few important elements.
The `par` element gives named estimates of the MLEs for the model parameters, 
with `value` giving the value of the negative log likelihood
evaluated at the MLEs. `counts`, `convergence`, `message` and `hessian` are all
standard outputs given by the `optim` function with details on the numerical 
process of estimating the MLEs. `spline_specs` contains the user-specified
options for fitting splines, and will be an empty list for mixture and 
stopover models. `dist_choice` and `a_choice` contain the count distribution 
and flight path distributions, respectively. `skeleton` contains the skeleton
list of parameter values that the package uses to fit the model, with 
`options` being the list of options passed to `fit_GAI`. `maxiter` refers to the
maximum number of iterations used to estimate the MLEs of ZIP and NB models. `A`
and `N` contain the matrix of estimated seasonal densities at each site and 
occasion, and the vector of estimated site totals, respectively. `DMs` contains
the list of design matrices used by the `rGAI` package to include the selected
covariate formulas in the model. `obs` contains the count observations in 
matrix form, with sites as rows and occasions as columns. This is the same
format as for `A`. `DF` contains the original `data.frame` supplied to `fit_GAI`,
and finally, `tol` specififies the stopping condition used for the model (an 
epsilon such that during an iterative process for fitting a ZIP or NB model, 
a difference of less than epsilon in the negative log likelihood between two
iterations causes the process to terminate).

# Transforming Parameter Estimates

Numerical solvers that attempt to find MLEs must be able to perform guesses
anywhere on the real scale. This causes issues with certain parameters which 
can only take on values in a particular range, such as probabilities which
must always be between 0 and 1, for example. Further to this, the estimation 
of values such as mean arrival times for broods can be unidentifiable if 
specified ambiguously. In this example, unless we force the mean for the second
brood to be greater than that for the first (and so forth), then multiple sets
of parameters could give the same maximal value of the log likelihood. For
these reasons, the `rGAI` package uses link functions to map between the real line
on which we make parameter value guesses during fitting, and the 'parameter
space', where all parameters are within their correct bounds.

In order to be able to interpret our output values, we must apply the relevant
link functions to transform them back to the appropriate scale. For mu
parameters this can be done with the `means_link` function. For standard
deviations, a log-link is used, and so fitted values should be exponentiated.
Weights also use a custom link function, which can be applied to fitted values
using the `probs_link` function. The extra distributional parameter for the ZIP
uses a logistic link (by virtue of being a probability), which can be applied
with the `plogis` function, whereas the rate parameter for an NB model should
be exponentiated, to ensure it is always positive. Sojourning probabilities in
stopover models also make use of the logistic link (`plogis`).

The `transform_output` function within the `rGAI` package uses model settings stored
in the fitted model object to apply these link functions automatically, making
sure that the effect of covariates has also been taken into account.
```{R backtransform}
# We can create a `data.frame` with custom covariate values, or reuse values that
# we observed during the survey:
DF_to_transform <- `data.frame`(altitude = c(-1e2, 0, 1e2))
DF_to_transform <- general_fit$DF[1:3,]

# The transform_output function deals with all the covariate formulas and link
# functions by using the information contained in the fitted model object:
transform_output(general_fit, DF_to_transform)

# When no covariates were included in the model, a blank `data.frame`, or no 
# `data.frame` at all can be used to only provide the transformed values:
transform_output(my_mixture_GAI)

# We can also use this function to get out the a_func matrix for a set of
# covariate values:
A <- transform_output(general_fit, DF_to_transform, provide_A = T)$A

# It's important to include all covariates exactly as they were named in the 
# call to fit_GAI, otherwise an error will be thrown, giving the name of the 
# missing covariate.
try(transform_output(brood_specific_fit, `data.frame`(altiitude = c(-10, 0, 10))))

# NA covariate values will throw an error, as always:
try(transform_output(brood_specific_fit, `data.frame`(altitude = c(NA, 0, 10))))

Plotting Fitted Curves

The standard R plot function can be used to produce simple plots of the flight path of a fitted model. Optionally, either the flight path for each site can be included as a separate curve on the same graph, or a smooth can be plotted instead. This smooth averages through the different values observed at each site, at a set of quantiles that can be specified by the user. The default behaviour produces plots for the median density across sites for a given occasion, as well as the 5th and 95th quantiles for flight path density.

To better compare the relative abundance at each site throughout the year, especially when covariates have been included in the model, it can be useful to scale the flight path curve of each site by the estimated site total. This can be done by setting scale_by_N = TRUE in the call to the plot function.

If custom colours are to be used by the plotting function, the colour palette can be passed to the plot function as a vector of integers or character colour hex values using the colours argument.

```{R, include = F} library(ggplot2) library(reshape2)

# Convert this document to R code, to make sure the script is up to date:

knitr::purl("vignette.Rmd")

```{R plotting, fig.width = 7, fig.height = 5}
colours <- c("#33FFF9", "#33A8FF", "#4233FF")

# The default behaviour will use quantiles = c(0.05, 0.5, 0.95), and therefore 
# will not plot all sites individually.
plot(general_fit, scale_by_N = T, quantiles = c(0.01, 0.25, 0.5, 0.75, 0.99))

# For the sake of demonstration, a plot with manual colours, with no scaling:
plot(brood_specific_fit, scale_by_N = F, quantiles = c(0.001, 0.5, 0.999),
     colours = colours)


calliste-fagard-jenkin/GAI documentation built on July 26, 2022, 11:20 a.m.