fit_model: Fit VAST to data
In James-Thorson/VAST: Vector-Autoregressive Spatio-Temporal (VAST) Model

fit_model

R Documentation

Fit VAST to data

Description

This function is the user-interface for the multiple mid-level functions that perform separate components of a spatio-temporal analysis:

determine the extrapolation-grid make_extrapolation_info,
define spatial objects make_spatial_info,
build covariates from a formula interface make_covariates,
assemble data make_data,
build model make_model,
estimate parameters fit_tmb, and
check for obvious problems with the estimates check_fit.

Please see reference documetation for each of those functions (e.g., ?make_extrapolation_info) to see a list of arguments used by each mid-level function.

Usage

fit_model(
  settings,
  Lat_i,
  Lon_i,
  t_i,
  b_i,
  a_i,
  c_iz = rep(0, length(b_i)),
  v_i = rep(0, length(b_i)),
  working_dir = tempdir(),
  X1config_cp = NULL,
  X2config_cp = NULL,
  covariate_data,
  X1_formula = ~0,
  X2_formula = ~0,
  Q1config_k = NULL,
  Q2config_k = NULL,
  catchability_data,
  Q1_formula = ~0,
  Q2_formula = ~0,
  newtonsteps = 1,
  silent = TRUE,
  build_model = TRUE,
  run_model = TRUE,
  test_fit = TRUE,
  category_names = NULL,
  year_labels = NULL,
  framework = "TMBad",
  use_new_epsilon = TRUE,
  ...
)

Arguments

`settings`	Output from `make_settings`
`Lat_i`	Numeric vector, providing latitude for each sample
`Lon_i`	Numeric vector, providing longitude for each sample
`t_i`	Vector of integers, providing the time (e.g., calendar year) for each observation i. The first modeled interval will be `min(t_i)`, and last will be `max(t_i)`, and every interval corresponding to integers between `min(t_i)` and `max(t_i)` will also be modeled.
`b_i`	Vector, providing sampled value (biomass, counts, etc.) for each observation i. Users should provide values as class `units`, e.g., `as_units(x, "kg")` for data in units kilograms, `as_units(x, "count")` for numbers, or `as_units(x, "kg/km^2")` for area-standarized biomass, etc. If units are missing, the default behavior is to coerce the vector to units "kg". Units are then used in subsequent plotting. Users can also specify `b_i=NA` for some observations, e.g., when seeking to ensure that the model includes hindcast/forecast years that do not otherwise have sampling data
`a_i`	Vector containing values greater than zero, providing sampled area for each observation i, e.g., `as_units(x, "km^2")` for area swept in square-kilometers. use `as_units(1, unitless)` for observations without a natural area measurement (noting that resulting densities no longer have interpretable units in that case), or for observations where the response `b_i` already standardizes for area, e.g., `b_i = as_units(x, "kg/km^2")`
`c_iz`	Vector of integers ranging from 0 to the number of variables minus 1, providing the category (e.g., species, length-bin) for each observation i. This can be specified as a matrix, such that each observation is associated with multiple categories. Such specification treats samples as arising from the sum across multiple categories, e.g., to account for unlabeled multispecies data.
`v_i`	Vector of integers ranging from 0 to the number of vessels minus 1, providing sampling category (e.g., vessel or tow) associated with overdispersed variation for each observation i (by default `v_i=0` for all samples, which will not affect things given the default values for `OverdispersionConfig`). In some cases a portion of observations have a overdispersion-effect, but not others, and in this case the observations without are specified as `v_i=NA`
`X1config_cp`	matrix of settings for each density covariate for the 1st lienar predictor, where the row corresponds to model category, and column corresponds to each density covariate `X1config_cp[c,p]=0` `X_ip[,p]` has no effect on the 1st linear predictor for category c `X1config_cp[c,p]=1` `X_ip[,p]` has a linear effect on 1st linear predictor for category c `X1config_cp[c,p]=2` `X_ip[,p]` has a spatially varying, zero-centered linear effect on 1st linear predictor for category c `X1config_cp[c,p]=3` `X_ip[,p]` has a spatially varying linear effect on 1st linear predictor for category c `X1config_cp[c,p]=4` Replaces covariate `X_ip[,p]` with `beta1_tc[,c]+beta2_tc[,c]` for the corresponding year and category, which has a zero-centered spatially varying effect on 1st linear predictor. This then represents a density-dependent effect on distribution. (NOTE: still in beta testing) `X1config_cp[c,p]=-1` `X1config_cp[c,p]=-1` is the same as `X1config_cp[c,p]=2`, but without including the log-likelihood term; this is useful in special cases when carefully mirroring spatially varying coefficients, e.g., to use cohort effects in a age-structured spatio-temporal model
`X2config_cp`	Same as argument `X1config_cp` but for 2nd linear predictor
`covariate_data`	data frame of covariate values with columns `Lat`, `Lon`, and `Year`, and other columns matching names in `formula`; `Year=NA` can be used for covariates that do not change among years (e.g., depth)
`X1_formula`	right-sided formula affecting the 1st linear predictor which is then estimated independently for each model category `c_i`, e.g., use `X1_formula=~BOT_DEPTH+BOT_DEPTH^2` for a quadratic effect of variable `BOT_DEPTH` that is estimated independently for each category. The effect of an estimated effect is also used when predicting the value for each location in the extrapolation-grid. Therefore, `X1_formula` is interepreted as affecting the "true" underlying value of each variable, and it affects both samples and extrapolated values. It is allowed to include `Year` in the formula, although please check whether it is interpreted as numeric or factor-valued. Note that `X1_formula` is internally updated (and resulting design-matrices are modified) to avoid any intercept from arising in `X1_ip` and `X1_gctp`, to avoid identifiability issues between covariates and intercepts for each category.
`X2_formula`	same as `X1_formula` but affecting the 2nd linear predictor.
`Q1config_k`	Same as argument `X1config_cp` but affecting affecting the 1st linear predictor for catchability, and note that it is a vector (instead of matrix) given that catchability responses do not vary among variables `c` by default (but can be specified to do so when an appropriate 'formula' is supplied)
`Q2config_k`	Same as argument `Q1config_cp` but affecting affecting the 2nd linear predictor for catchability
`catchability_data`	data-frame of covariates for use when specifying `Q1_formula` and `Q2_formula`
`Q1_formula`	Similar to `X1_formula`, `Q1_formula` affects the 1st linear predictor for samples. However, the effect of `Q1_formula` is not used when predicting values at extrapolation-grid locations. Therefore, the `Q1_formula` is interpreted as affecting "catchability" (a.k.a. "detectabiility"), and it represents processes that affect the outcome of sampling but not the "true" underlying value of a variable being sampled. For example, to estimate the relative performance of differeng gear types, include `catchability_data = data.frame(gear=gear_factor)` where `gear_factor` is a factor-valued indicator for different gear types and `Q1_formula = ... + gear`, and this will estimate the catchability for each level relative to the base level of that factor. Note that `Q1_formula` defines a relationship that is applied to all samples (regardless of category `c_i`), whereas `X1_formula` defines a relationship that is estimated independently for each category. Also note that `make_data` includes `c_iz[,1]` as a column labeled `category` in `catchability_data`, and that `Q1_formula` is internally updated (and resulting design-matrices are modified) to avoid any category-specific intercept from arising in `Q1_ik`, to avoid identifiability issues between category-specific covariates and intercepts. For example, for a catchability covariate that varies by category, include an interaction with category in `Q1_formula`, e.g., `Q1_formula = ... + category:gear` where `gear` is a factor to estimate category-specific catchability ratio for different levels of `gear`.
`Q2_formula`	same as `Q2_formula` but affecting the 2nd linear predictor.
`newtonsteps`	Integer specifying the number of extra newton steps to take after optimization (alternative to `loopnum`). Each newtonstep requires calculating the Hessian matrix and is therefore slow. But for well-behaved models, each Newton step will typically decrease the maximum gradient of the loglikelihood with respect to each fixed effect, and therefore this option can be used to achieve an arbitrarily low final gradient given sufficient time for well-behaved models. However, this option will also perform strangely or have unexpected consequences for poorly-behaved models, e.g., when fixed effects are at upper or lower bounds.
`build_model`	Boolean indicating whether to build the model, `build_model=TRUE`, or simply build the inputs, `build_model=FALSE`
`run_model`	Boolean indicating whether to run the model or simply return the inputs and built TMB object
`test_fit`	Boolean indicating whether to apply `check_fit` before calculating standard errors, to test for parameters hitting bounds etc; defaults to TRUE
`category_names`	character vector specifying names for labeling categories `c_i`
`year_labels`	character vector specifying names for labeling times `t_i`
`framework`	Which AD framework to use ('TMBad' or 'CppAD')
`...`	additional arguments to pass to `make_extrapolation_info`, `make_spatial_info`, `make_data`, `make_model`, or `fit_tmb`, where arguments are matched by name against each function. If an argument doesn't match, it is still passed to `make_data`. Note that `make_spatial_info` passes named arguments to `fm_mesh_2d`.

Details

Specifically, the mid-level functions called by fit_model(.) look for arguments in the following order of precedence (from highest to lowest precedence):

fit_model(.) prioritizes using named arguments passed directly to fit_model(.). If arguments are passed this way, they are used instead of other options below.
If an argument is not passed supplied directly to fit_model(.), then fit_model(.) looks for elements in input settings, as typically created by make_settings.
If an argument is not supplied via (1) or (2) above, then each mid-level function uses default values defined in those function arguments, e.g., see args(make_extrapolation_info) for defaults for function make_extrapolation_info(.)

Collectively, this order of precedence allows users to specify inputs for a specific project via input method (1), the package author to change defaults through changes in the settings defined for a given purpose in make_settings(.) via input method (2), while still defaulting to package defaults via option (3).

Variables are indexed internally for locations g, categories c, and times y. Location index g represents Longitude-Latitude fit$extrapolation_list$Data_Extrap[which(fit$spatial_list$g_e==g),c('Lon','Lat')]; Time index y represents time fit$year_labels; and Category g corresponds to values in fit$data_list$g_i.

Value

Object of class fit_model, containing formatted inputs and outputs from VAST

parameter_estimates: Output from fit_tmb; see that documentation for definition of contents
extrapolation_list: Output from make_extrapolation_info; see that documentation for definition of contents
spatial_list: Output from make_spatial_info; see that documentation for definition of contents
data_list: Output from make_data; see that documentation for definition of contents
tmb_list: Output from make_model; see that documentation for definition of contents
ParHat: Tagged list of maximum likelihood estimatesion of fixed effects and empirical Bayes estimates of random effects, following format of initial values generated by make_parameters; see that documentation for definition of contents
Report: Tagged list of VAST outputs. For example, estimated density for grid g, category c, and time y is available as fit$Report$D_gcy[g,c,y]; see Details section for description of indexing

Examples

## Not run: 
# Load packages
library(VAST)

# load data set
# see `?load_example` for list of stocks with example data
# that are installed automatically with `VAST`.
example = load_example( data_set="EBS_pollock" )

# Make settings
settings = make_settings( n_x=50,
         Region=example$Region,
         purpose="index",
         strata.limits=example$strata.limits )

# Run model
fit = fit_model( "settings"=settings,
    "Lat_i"=example$sampling_data[,'Lat'],
    "Lon_i"=example$sampling_data[,'Lon'],
    "t_i"=example$sampling_data[,'Year'],
    "c_i"=rep(0,nrow(example$sampling_data)),
    "b_i"=example$sampling_data[,'Catch_KG'],
    "a_i"=example$sampling_data[,'AreaSwept_km2'],
    "v_i"=example$sampling_data[,'Vessel'] )

# Plot results
plot_results( settings=settings, fit=fit )

## End(Not run)

James-Thorson/VAST documentation built on Feb. 9, 2025, 9:05 a.m.