fit_model: Fit VAST to data

View source: R/fit_model.R

fit_modelR Documentation

Fit VAST to data

Description

fit_model fits a spatio-temporal model to data

Usage

fit_model(
  settings,
  Lat_i,
  Lon_i,
  t_i,
  b_i,
  a_i,
  c_iz = rep(0, length(b_i)),
  v_i = rep(0, length(b_i)),
  working_dir = tempdir(),
  X1config_cp = NULL,
  X2config_cp = NULL,
  covariate_data,
  X1_formula = ~0,
  X2_formula = ~0,
  Q1config_k = NULL,
  Q2config_k = NULL,
  catchability_data,
  Q1_formula = ~0,
  Q2_formula = ~0,
  newtonsteps = 1,
  silent = TRUE,
  build_model = TRUE,
  run_model = TRUE,
  test_fit = TRUE,
  category_names = NULL,
  year_labels = NULL,
  framework = "TMBad",
  use_new_epsilon = TRUE,
  ...
)

Arguments

settings

Output from make_settings

Lat_i

Numeric vector, providing latitude for each sample

Lon_i

Numeric vector, providing longitude for each sample

t_i

Vector of integers, providing the time (e.g., calendar year) for each observation i. The first modeled interval will be min(t_i), and last will be max(t_i), and every interval corresponding to integers between min(t_i) and max(t_i) will also be modeled.

b_i

Vector, providing sampled value (biomass, counts, etc.) for each observation i. Users should provide values as class units, e.g., as_units(x, "kg") for data in units kilograms, as_units(x, "count") for numbers, or as_units(x, "kg/km^2") for area-standarized biomass, etc. If units are missing, the default behavior is to coerce the vector to units "kg". Units are then used in subsequent plotting. Users can also specify b_i=NA for some observations, e.g., when seeking to ensure that the model includes hindcast/forecast years that do not otherwise have sampling data

a_i

Vector containing values greater than zero, providing sampled area for each observation i, e.g., as_units(x, "km^2") for area swept in square-kilometers. use as_units(1, unitless) for observations without a natural area measurement (noting that resulting densities no longer have interpretable units in that case), or for observations where the response b_i already standardizes for area, e.g., b_i = as_units(x, "kg/km^2")

c_iz

Vector of integers ranging from 0 to the number of variables minus 1, providing the category (e.g., species, length-bin) for each observation i. This can be specified as a matrix, such that each observation is associated with multiple categories. Such specification treats samples as arising from the sum across multiple categories, e.g., to account for unlabeled multispecies data.

v_i

Vector of integers ranging from 0 to the number of vessels minus 1, providing sampling category (e.g., vessel or tow) associated with overdispersed variation for each observation i (by default v_i=0 for all samples, which will not affect things given the default values for OverdispersionConfig). In some cases a portion of observations have a overdispersion-effect, but not others, and in this case the observations without are specified as v_i=NA

X1config_cp

matrix of settings for each density covariate for the 1st lienar predictor, where the row corresponds to model category, and column corresponds to each density covariate

X1config_cp[c,p]=0

X_ip[,p] has no effect on the 1st linear predictor for category c

X1config_cp[c,p]=1

X_ip[,p] has a linear effect on 1st linear predictor for category c

X1config_cp[c,p]=2

X_ip[,p] has a spatially varying, zero-centered linear effect on 1st linear predictor for category c

X1config_cp[c,p]=3

X_ip[,p] has a spatially varying linear effect on 1st linear predictor for category c

X1config_cp[c,p]=4

Replaces covariate X_ip[,p] with beta1_tc[,c]+beta2_tc[,c] for the corresponding year and category, which has a zero-centered spatially varying effect on 1st linear predictor. This then represents a density-dependent effect on distribution. (NOTE: still in beta testing)

X1config_cp[c,p]=-1

X1config_cp[c,p]=-1 is the same as X1config_cp[c,p]=2, but without including the log-likelihood term; this is useful in special cases when carefully mirroring spatially varying coefficients, e.g., to use cohort effects in a age-structured spatio-temporal model

X2config_cp

Same as argument X1config_cp but for 2nd linear predictor

covariate_data

data frame of covariate values with columns Lat, Lon, and Year, and other columns matching names in formula; Year=NA can be used for covariates that do not change among years (e.g., depth)

X1_formula

right-sided formula affecting the 1st linear predictor which is then estimated independently for each model category c_i, e.g., use X1_formula=~BOT_DEPTH+BOT_DEPTH^2 for a quadratic effect of variable BOT_DEPTH that is estimated independently for each category. The effect of an estimated effect is also used when predicting the value for each location in the extrapolation-grid. Therefore, X1_formula is interepreted as affecting the "true" underlying value of each variable, and it affects both samples and extrapolated values. It is allowed to include Year in the formula, although please check whether it is interpreted as numeric or factor-valued. Note that X1_formula is internally updated (and resulting design-matrices are modified) to avoid any intercept from arising in X1_ip and X1_gctp, to avoid identifiability issues between covariates and intercepts for each category.

X2_formula

same as X1_formula but affecting the 2nd linear predictor.

Q1config_k

Same as argument X1config_cp but affecting affecting the 1st linear predictor for catchability, and note that it is a vector (instead of matrix) given that catchability responses do not vary among variables c by default (but can be specified to do so when an appropriate 'formula' is supplied)

Q2config_k

Same as argument Q1config_cp but affecting affecting the 2nd linear predictor for catchability

catchability_data

data-frame of covariates for use when specifying Q1_formula and Q2_formula

Q1_formula

Similar to X1_formula, Q1_formula affects the 1st linear predictor for samples. However, the effect of Q1_formula is not used when predicting values at extrapolation-grid locations. Therefore, the Q1_formula is interpreted as affecting "catchability" (a.k.a. "detectabiility"), and it represents processes that affect the outcome of sampling but not the "true" underlying value of a variable being sampled. For example, to estimate the relative performance of differeng gear types, include catchability_data = data.frame(gear=gear_factor) where gear_factor is a factor-valued indicator for different gear types and Q1_formula = ... + gear, and this will estimate the catchability for each level relative to the base level of that factor. Note that Q1_formula defines a relationship that is applied to all samples (regardless of category c_i), whereas X1_formula defines a relationship that is estimated independently for each category. Also note that make_data includes c_iz[,1] as a column labeled category in catchability_data, and that Q1_formula is internally updated (and resulting design-matrices are modified) to avoid any category-specific intercept from arising in Q1_ik, to avoid identifiability issues between category-specific covariates and intercepts. For example, for a catchability covariate that varies by category, include an interaction with category in Q1_formula, e.g., Q1_formula = ... + category:gear where gear is a factor to estimate category-specific catchability ratio for different levels of gear.

Q2_formula

same as Q2_formula but affecting the 2nd linear predictor.

newtonsteps

Integer specifying the number of extra newton steps to take after optimization (alternative to loopnum). Each newtonstep requires calculating the Hessian matrix and is therefore slow. But for well-behaved models, each Newton step will typically decrease the maximum gradient of the loglikelihood with respect to each fixed effect, and therefore this option can be used to achieve an arbitrarily low final gradient given sufficient time for well-behaved models. However, this option will also perform strangely or have unexpected consequences for poorly-behaved models, e.g., when fixed effects are at upper or lower bounds.

build_model

Boolean indicating whether to build the model, build_model=TRUE, or simply build the inputs, build_model=FALSE

run_model

Boolean indicating whether to run the model or simply return the inputs and built TMB object

test_fit

Boolean indicating whether to apply check_fit before calculating standard errors, to test for parameters hitting bounds etc; defaults to TRUE

category_names

character vector specifying names for labeling categories c_i

year_labels

character vector specifying names for labeling times t_i

framework

Which AD framework to use ('TMBad' or 'CppAD')

...

additional arguments to pass to make_extrapolation_info, make_spatial_info, make_data, make_model, or fit_tmb, where arguments are matched by name against each function. If an argument doesn't match, it is still passed to make_data. Note that make_spatial_info passes named arguments to fm_mesh_2d.

Details

This function is the user-interface for the multiple mid-level functions that perform separate components of a spatio-temporal analysis:

  • determine the extrapolation-grid make_extrapolation_info,

  • define spatial objects make_spatial_info,

  • build covariates from a formula interface make_covariates,

  • assemble data make_data,

  • build model make_model,

  • estimate parameters fit_tmb, and

  • check for obvious problems with the estimates check_fit.

Please see reference documetation for each of those functions (e.g., ?make_extrapolation_info) to see a list of arguments used by each mid-level function.

Specifically, the mid-level functions called by fit_model(.) look for arguments in the following order of precedence (from highest to lowest precedence):

  1. fit_model(.) prioritizes using named arguments passed directly to fit_model(.). If arguments are passed this way, they are used instead of other options below.

  2. If an argument is not passed supplied directly to fit_model(.), then fit_model(.) looks for elements in input settings, as typically created by make_settings.

  3. If an argument is not supplied via (1) or (2) above, then each mid-level function uses default values defined in those function arguments, e.g., see args(make_extrapolation_info) for defaults for function make_extrapolation_info(.)

Collectively, this order of precedence allows users to specify inputs for a specific project via input method (1), the package author to change defaults through changes in the settings defined for a given purpose in make_settings(.) via input method (2), while still defaulting to package defaults via option (3).

Variables are indexed internally for locations g, categories c, and times y. Location index g represents Longitude-Latitude fit$extrapolation_list$Data_Extrap[which(fit$spatial_list$g_e==g),c('Lon','Lat')]; Time index y represents time fit$year_labels; and Category g corresponds to values in fit$data_list$g_i.

Value

Object of class fit_model, containing formatted inputs and outputs from VAST

parameter_estimates

Output from fit_tmb; see that documentation for definition of contents

extrapolation_list

Output from make_extrapolation_info; see that documentation for definition of contents

spatial_list

Output from make_spatial_info; see that documentation for definition of contents

data_list

Output from make_data; see that documentation for definition of contents

tmb_list

Output from make_model; see that documentation for definition of contents

ParHat

Tagged list of maximum likelihood estimatesion of fixed effects and empirical Bayes estimates of random effects, following format of initial values generated by make_parameters; see that documentation for definition of contents

Report

Tagged list of VAST outputs. For example, estimated density for grid g, category c, and time y is available as fit$Report$D_gcy[g,c,y]; see Details section for description of indexing

See Also

VAST for general documentation, make_settings for generic settings, fit_model for model fitting, and plot_results for generic plots

VAST wiki https://github.com/James-Thorson-NOAA/VAST/wiki for examples documenting many different use-cases and features.

GitHub mainpage https://github.com/James-Thorson-NOAA/VAST#description for a list of user resources and publications documenting features

summary.fit_model for methods to summarize output, including obtain a dataframe of estimated densities and an explanation of DHARMa Probability-Integral-Transform residuals

predict.fit_model for methods to predict at new locations using existing or updated covariate values

Other wrapper functions: make_settings(), plot_results()

Examples

## Not run: 
# Load packages
library(VAST)

# load data set
# see `?load_example` for list of stocks with example data
# that are installed automatically with `FishStatsUtils`.
example = load_example( data_set="EBS_pollock" )

# Make settings
settings = make_settings( n_x=50,
         Region=example$Region,
         purpose="index",
         strata.limits=example$strata.limits )

# Run model
fit = fit_model( "settings"=settings,
    "Lat_i"=example$sampling_data[,'Lat'],
    "Lon_i"=example$sampling_data[,'Lon'],
    "t_i"=example$sampling_data[,'Year'],
    "c_i"=rep(0,nrow(example$sampling_data)),
    "b_i"=example$sampling_data[,'Catch_KG'],
    "a_i"=example$sampling_data[,'AreaSwept_km2'],
    "v_i"=example$sampling_data[,'Vessel'] )

# Plot results
plot_results( settings=settings, fit=fit )

## End(Not run)


James-Thorson/FishStatsUtils documentation built on Feb. 6, 2024, 4:26 a.m.