make_data: Build data input for VAST model
In James-Thorson/VAST: Vector-Autoregressive Spatio-Temporal (VAST) Model

make_data

R Documentation

Build data input for VAST model

Description

make_data builds a tagged list of data inputs used by TMB for running the model

Usage

make_data(
  b_i,
  a_i,
  t_i,
  c_iz = rep(0, length(b_i)),
  e_i = c_iz[, 1],
  v_i = rep(0, length(b_i)),
  FieldConfig = c(Omega1 = "IID", Epsilon1 = "IID", Omega2 = "IID", Epsilon2 = "IID"),
  ObsModel_ez = c(PosDist = 1, Link = 0),
  OverdispersionConfig = c(eta1 = 0, eta2 = 0),
  RhoConfig = c(Beta1 = 0, Beta2 = 0, Epsilon1 = 0, Epsilon2 = 0),
  VamConfig = c(Method = 0, Rank = 0, Timing = 0),
  Aniso = TRUE,
  PredTF_i = rep(0, length(b_i)),
  covariate_data = NULL,
  X1_formula = ~0,
  X2_formula = ~0,
  X1config_cp = NULL,
  X2config_cp = NULL,
  X_contrasts = NULL,
  catchability_data = NULL,
  Q1_formula = ~0,
  Q2_formula = ~0,
  Q1config_k = NULL,
  Q2config_k = NULL,
  spatial_list,
  Network_sz = NULL,
  F_ct = NULL,
  F_init = 1,
  CheckForErrors = TRUE,
  yearbounds_zz = NULL,
  Options = c(),
  Expansion_cz = NULL,
  Z_gm = NULL,
  Version = get_latest_version(package = "VAST"),
  overlap_zz = matrix(ncol = 7, nrow = 0),
  ...
)

Arguments

`b_i`	Vector, providing sampled value (biomass, counts, etc.) for each observation i. Users should provide values as class `units`, e.g., `as_units(x, "kg")` for data in units kilograms, `as_units(x, "count")` for numbers, or `as_units(x, "kg/km^2")` for area-standarized biomass, etc. If units are missing, the default behavior is to coerce the vector to units "kg". Units are then used in subsequent plotting. Users can also specify `b_i=NA` for some observations, e.g., when seeking to ensure that the model includes hindcast/forecast years that do not otherwise have sampling data
`a_i`	Vector containing values greater than zero, providing sampled area for each observation i, e.g., `as_units(x, "km^2")` for area swept in square-kilometers. use `as_units(1, unitless)` for observations without a natural area measurement (noting that resulting densities no longer have interpretable units in that case), or for observations where the response `b_i` already standardizes for area, e.g., `b_i = as_units(x, "kg/km^2")`
`t_i`	Vector of integers, providing the time (e.g., calendar year) for each observation i. The first modeled interval will be `min(t_i)`, and last will be `max(t_i)`, and every interval corresponding to integers between `min(t_i)` and `max(t_i)` will also be modeled.
`c_iz`	Vector of integers ranging from 0 to the number of variables minus 1, providing the category (e.g., species, length-bin) for each observation i. This can be specified as a matrix, such that each observation is associated with multiple categories. Such specification treats samples as arising from the sum across multiple categories, e.g., to account for unlabeled multispecies data.
`e_i`	Optional vector of integers ranging from 0 to the number of different error distributions, providing the error distribution to use for each observation i; by default `e_i=c_iz[,1]` such that each category has a unique estimated magnitude of measurement error. This option can be useful to specify (1) time-blocks in the variance of measurement error, (2) differentg distributions for subsets of samples, or various other applications.
`v_i`	Vector of integers ranging from 0 to the number of vessels minus 1, providing sampling category (e.g., vessel or tow) associated with overdispersed variation for each observation i (by default `v_i=0` for all samples, which will not affect things given the default values for `OverdispersionConfig`). In some cases a portion of observations have a overdispersion-effect, but not others, and in this case the observations without are specified as `v_i=NA`
`FieldConfig`	See Details section of `make_data` for details
`ObsModel_ez`	an optional matrix with two columns where the first column specifies the distribution for positive catch rates, and the second column specifies the functional form for encounter probabilities `ObsModel_ez[e,1]=0` Delta-Normal `ObsModel_ez[e,1]=1` Delta-Lognormal, using bias-corrected-mean and log-sd as parameters; I recommend using `ObsModel_ez[e,1]=4` instead for consistent interpretation of logSigmaM as log-CV as used for Gamma and inverse-Gaussian distribution `ObsModel_ez[e,1]=2` Delta-Gamma `ObsModel_ez[e,1]=4` Delta-Lornormal, using bias-corrected-mean and log-coefficient of variation (CV) as parameters; see note for `ObsModel_ez[e,1]=1` `ObsModel_ez[e,1]=5` Zero-inflated negative binomial (1st linear predictor for logit-linked zero-inflation; 2nd linear predictor for log-linked conditional mean of NB), using two variance parameters for linear and quadratic components `ObsModel_ez[e,1]=7` Zero-inflated Poisson (1st linear predictor for logit-linked zero-inflation; 2nd linear predictor for log-linked conditional mean of Poisson) `ObsModel_ez[e,1]=9` Delta-Generalized Gamma, which involves two variance parameters and includes the gamma and lognormal as nested submodels `ObsModel_ez[e,1]=10` Tweedie distribution, where expected biomass (lambda) is the product of 1st-component and 2nd-component, variance scalar (phi) is the 1st component, and logis-SigmaM is the power. This parameterization is fast (i.e., comparable to delta-models) as long as random effects are turned off for the 1st component, but is otherwise extremely slow. `ObsModel_ez[e,1]=11` Zero-inflated Poisson with additional normally-distributed variation overdispersion in the log-intensity of the Poisson distribution `ObsModel_ez[e,1]=12` Poisson distribution (not zero-inflated) with log-intensity from the 1st linear predictor, to be used in combination with the Poisson-link delta model for combining multiple data types `ObsModel_ez[e,1]=13` Bernoulli distribution using complementary log-log (cloglog) link from the 1st linear predictor, to be used in combination with the Poisson-link delta model for combining multiple data types `ObsModel_ez[e,1]=14` Similar to 12, but also including lognormal overdispersion `ObsModel_ez[e,2]=0` Conventional delta-model using logit-link for encounter probability and log-link for positive catch rates `ObsModel_ez[e,2]=1` Alternative "Poisson-link delta-model" using log-link for numbers-density and log-link for biomass per number `ObsModel_ez[e,2]=2` Link function for Tweedie distribution, necessary for `ObsModel_ez[e,1]=8` or `ObsModel_ez[e,1]=10` `ObsModel_ez[e,2]=3` Conventional delta-model, but fixing encounter probability=1 for any year where all samples encounter the species `ObsModel_ez[e,2]=4` Poisson-link delta-model, but fixing encounter probability=1 for any year where all samples encounter the species and encounter probability=0 for any year where no samples encounter the species
`OverdispersionConfig`	a vector of format `c("eta1"=0, "eta2"="AR1")` governing any correlated overdispersion among categories for each level of `v_i`, where eta1 is for encounter probability, and eta2 is for positive catch rates, where `0` is off, code"AR1" is an AR1 process, and aninteger greater than zero (e.g., `2`) is the number of elements in a factor-analysis covariance (by default, `c("eta1"=0, "eta2"=0)` and this turns off overdispersion)
`RhoConfig`	vector of form `c("Beta1"=0,"Beta2"=0,"Epsilon1"=0,"Epsilon2"=0)` specifying whether either intercepts (Beta1 and Beta2) or spatio-temporal variation (Epsilon1 and Epsilon2) is structured among time intervals, e.g. for component `Epsilon2` indicated in the 4rd slot: `RhoConfig[4]=0` Each year as fixed effect `RhoConfig[4]=1` Each year as an independent and identically distributed random effect, thus estimating the variance as fixed effect `RhoConfig[4]=2` Each year as a random effect following a random walk, thus estimating the variance as fixed effect `RhoConfig[4]=3` Constant among years as fixed effect `RhoConfig[4]=4` Each year as a random effect following a first-order autoregressive process, thus estimating the variance as fixed effects and a single first-order autoregression parameter `RhoConfig[4]=5` Each year as a random effect following a first-order autoregressive process, estimating the variance as fixed effects and a separate first-order autoregression parameter for each category `RhoConfig[4]=6` Only possible for `Epsilon2` or `Beta2`, and specifying that that associated hyperparameters parameters have identical values to the first component `Epsilon1` or `Beta1` If missing, the default is to assume a value of zero for each element (i.e., `RhoConfig[1:4]=0`)
`VamConfig`	Options to estimate interactions, containing three slots: `VamConfig[0]` selects method for forming interaction matrix; Turn off feature using 0, or I recommend using 2 by default `VamConfig[1]` indicates the rank of the interaction matrix, indicating the number of community axes that have regulated dynamics `VamConfig[2]` Indicates whether interactions occur before spatio-temporal variation (`VamConfig[2]=0`) or after `VamConfig[2]=1`
`Aniso`	whether to assume isotropy, `Aniso=0`, or geometric anisotropy, `Aniso=1`
`PredTF_i`	OPTIONAL, whether each observation i is included in the likelihood, `PredTF_i[i]=0`, or in the predictive probability, `PredTF_i[i]=1`
`covariate_data`	data-frame of covariates that is used when constructing density covariates. any variable referenced in `X1_formula` and `X2_formula` must be in `covariate_data`
`X1_formula`	right-sided formula affecting the 1st linear predictor which is then estimated independently for each model category `c_i`, e.g., use `X1_formula=~BOT_DEPTH+BOT_DEPTH^2` for a quadratic effect of variable `BOT_DEPTH` that is estimated independently for each category. The effect of an estimated effect is also used when predicting the value for each location in the extrapolation-grid. Therefore, `X1_formula` is interepreted as affecting the "true" underlying value of each variable, and it affects both samples and extrapolated values. It is allowed to include `Year` in the formula, although please check whether it is interpreted as numeric or factor-valued. Note that `X1_formula` is internally updated (and resulting design-matrices are modified) to avoid any intercept from arising in `X1_ip` and `X1_gctp`, to avoid identifiability issues between covariates and intercepts for each category.
`X2_formula`	same as `X1_formula` but affecting the 2nd linear predictor.
`X1config_cp`	matrix of settings for each density covariate for the 1st lienar predictor, where the row corresponds to model category, and column corresponds to each density covariate `X1config_cp[c,p]=0` `X_ip[,p]` has no effect on the 1st linear predictor for category c `X1config_cp[c,p]=1` `X_ip[,p]` has a linear effect on 1st linear predictor for category c `X1config_cp[c,p]=2` `X_ip[,p]` has a spatially varying, zero-centered linear effect on 1st linear predictor for category c `X1config_cp[c,p]=3` `X_ip[,p]` has a spatially varying linear effect on 1st linear predictor for category c `X1config_cp[c,p]=4` Replaces covariate `X_ip[,p]` with `beta1_tc[,c]+beta2_tc[,c]` for the corresponding year and category, which has a zero-centered spatially varying effect on 1st linear predictor. This then represents a density-dependent effect on distribution. (NOTE: still in beta testing) `X1config_cp[c,p]=-1` `X1config_cp[c,p]=-1` is the same as `X1config_cp[c,p]=2`, but without including the log-likelihood term; this is useful in special cases when carefully mirroring spatially varying coefficients, e.g., to use cohort effects in a age-structured spatio-temporal model
`X2config_cp`	Same as argument `X1config_cp` but for 2nd linear predictor
`X_contrasts`	list defining the contrasts for each density-covariate factor specified in `X1_formula` or `X2_formula` with specification according to `contrasts`. By default uses `X1_contrasts = NULL`, which will set the first level of each covariate factor as the reference level.
`catchability_data`	data-frame of covariates for use when specifying `Q1_formula` and `Q2_formula`
`Q1_formula`	Similar to `X1_formula`, `Q1_formula` affects the 1st linear predictor for samples. However, the effect of `Q1_formula` is not used when predicting values at extrapolation-grid locations. Therefore, the `Q1_formula` is interpreted as affecting "catchability" (a.k.a. "detectabiility"), and it represents processes that affect the outcome of sampling but not the "true" underlying value of a variable being sampled. For example, to estimate the relative performance of differeng gear types, include `catchability_data = data.frame(gear=gear_factor)` where `gear_factor` is a factor-valued indicator for different gear types and `Q1_formula = ... + gear`, and this will estimate the catchability for each level relative to the base level of that factor. Note that `Q1_formula` defines a relationship that is applied to all samples (regardless of category `c_i`), whereas `X1_formula` defines a relationship that is estimated independently for each category. Also note that `make_data` includes `c_iz[,1]` as a column labeled `category` in `catchability_data`, and that `Q1_formula` is internally updated (and resulting design-matrices are modified) to avoid any category-specific intercept from arising in `Q1_ik`, to avoid identifiability issues between category-specific covariates and intercepts. For example, for a catchability covariate that varies by category, include an interaction with category in `Q1_formula`, e.g., `Q1_formula = ... + category:gear` where `gear` is a factor to estimate category-specific catchability ratio for different levels of `gear`.
`Q2_formula`	same as `Q2_formula` but affecting the 2nd linear predictor.
`Q1config_k`	Same as argument `X1config_cp` but affecting affecting the 1st linear predictor for catchability, and note that it is a vector (instead of matrix) given that catchability responses do not vary among variables `c` by default (but can be specified to do so when an appropriate 'formula' is supplied)
`Q2config_k`	Same as argument `Q1config_cp` but affecting affecting the 2nd linear predictor for catchability
`spatial_list`	tagged list of locational information from , i.e., from `make_spatial_info`
`Network_sz`	An interface to specify a directed acyclic graphic representing a stream network, used only when `Method="Stream_network"`.
`F_ct`	matrix of instantanous fishing mortality for each category c and year t. Only feasible when using a Poisson-link delta model and specifying temporal structure on intercepts, when the temporal autocorrelation is equivalent to a Spawning Potential Ratio (SPR) proxy for fishing mortality targets given the implied Gompertz density dependence.
`F_init`	In a vector-autoregressive model specifying fishing mortality rates via `F_ct`, the initial conditions can be based on the stationary distribution assuming that the first year of fishing mortality had happened indefinitely prior to data `F_init=2`, or that fishing mortality was zero prior to data, `F_init=1`
`CheckForErrors`	whether to check for errors in input (NOTE: when `CheckForErrors=TRUE`, the function will throw an error if it detects a problem with inputs. However, failing to throw an error is no guarantee that the inputs are all correct)
`yearbounds_zz`	matrix with two columns, giving first and last years for defining one or more periods (rows) used to calculate changes in synchrony over time (only used if `Options['Calculate_Synchrony']=1`)
`Options`	a tagged-vector that is empty by default `Options=c()`, but where the following slots might be helpful to add, either by passing `Options` to `make_settings`, or editing after a call to that function: `Options["Calculate_Range"]=TRUE` Turns on internal calculation and SE for center-of-gravity `Options["Calculate_effective_area"]=TRUE` Turns on internal calculation and SE for effective area occupied measuring range expansion/contraction `Options["Calculate_Cov_SE"]=TRUE` Turns on internal calculation and SE for covariance among categories (i.e. in factor model) `Options["Calculate_proportion"]=TRUE` Turns on internal calculation and SE for proportion of response within each category (e.g., for calculating proportion-at-age or species turnover) `Options["Calculate_Synchrony"]=TRUE` Turns on internal calculation and SE for Loreau metric of synchrony (a.k.a. portfolio effects) `Options["report_additional_variables"]=TRUE` Export additional variables to `Report` object, to use for diagnostics or additional exploration `Options["basin_method"]` Controls how the density-dependent index is generated from model variables. Default `Options["basin_method"]=2`) uses annual mean of betas and epsilons as index. Alternative `Options["basin_method"]=4`) uses a Lagrange multiplier to penalize index towards total abundance `Options["range_fraction"]` The decorrelation range when passing over land relative to over water; the default value `Options["range_fraction"]=0.2` indicates that the range is shorter over land, i.e., that correlations are strongest via water, while changing to `Options["range_fraction"]=5` would represent correlations transfer via land more than water `Options["Project_factors"]` Whether to estimate spatial projection of factors representing covariance among years and/or variables, as used to plot these rank-reduced patterns
`Expansion_cz`	matrix specifying how densities are expanded when calculating annual indices, with a row for each category `c` and two columns. The first column specifies whether to calculate annual index for category `c` as the weighted-sum across density estimates as follows: `Expansion[c,1]=0` density is weighted by area ("area-weighted expansion", the default) `Expansion[c,1]=1` density is weighted by the expanded value for another category ("abundance weighted expansion") `Expansion[c,1]=2` the index is calculated as the weighted average of density weighted by the expanded value for another category ("abundance weighted-average expansion") `Expansion[c,1]=3` area-weighted abundance is added to the expanded abundance for a prior category ("area-weighted cumulative total") `Expansion[c,1]=4` The fraction across categories is calculated, and then multiplied by another area-weighted index ("abundance weighted proportional expansion") The 2nd column is used when `Expansion[c1,1]=1` or `Expansion[c1,1]=2` or `Expansion[c1,1]=3` or `Expansion[c1,1]=4`, and specifies the category to use for abundance-weighted expansion/average/summation/proportions, where `Expansion[c1,2]=c2` and `c2` must be lower than `c1`.
`Z_gm`	matrix specifying coordinates to use when calculating center-of-gravity and range-edge statistics. Defaults to eastings and northings for each knots or extrapolation-grid cell.
`Version`	Which CPP version to use. If missing, defaults to latest version using `get_latest_version`. Can be used to specify using an older CPP, to maintain backwards compatibility.
`overlap_zz`	A matrix with seven columns (and zero rows by default), indicating whether overlap metrics should be calculated as a derived quantity. Each row of `overlap_zz` specifies an additional quantity to be calculated.
`...`	interface to pass deprecated inputs, included for backwards compatibility with previous versions which, e.g., specified elements of `spatial_list` individually instead of as a single object

Details

Specification of FieldConfig can be seen by calling make_settings, which is the recommended way of generating this input for beginning users.

Argument FieldConfig is a matrix of form FieldConfig = matrix( c(0,10,"IID","Identity", "AR1",10,"IID","Identity"), ncol=2, nrow=4, dimnames=list(c("Omega","Epsilon","Beta","Epsilon_year"),c("Component_1","Component_2")). However, for backwards compatibility, FieldConfig can instead be specified as a vector of format FieldConfig = c("Omega1"=0, "Epsilon1"=10, "Omega2"="AR1", "Epsilon2"=10), which generates the same settings as the matrix specification, given the values of each shown in this example.

Both vector (simplified) and matrix (full) specification of FieldConfig involve named elements using the following naming conventions:

Omega: specifies whether spatial variation is present and/or correlated among variables
Epsilon: specifies whether spatio-temporal variation is present and/or correlated among variables
Beta: specifies whether temporal variation (a.k.a. "intercepts") is present and/or correlated among variables
Epsilon_year: specifies whether spatio-temporal variation is correlated among years

Meanwhile, Component_1 (or the numeral "1" after a component name) refers to the 1st lienar predictor (e.g., of a delta-model), while Component_2 (or the numeral "2" after a component name) refers to the 2nd linear predictor. The simplified vector-specification does not include slots for Beta or Epsilon_year and therefore is not as general.

In each slot of FieldConfig, the user can specify various options:

0: turns off a given model component
integer greater than zero: specifies the rank (number of factors) in a factor-analysis covariance matrix
"AR1": specifies that a model component is correlated following an first-order autoregressive process
"IID": specifies that a given model component is a random effect that is independent for every level
"Identity": specifies that a given model component has covariance of an identity-matrix; this is only useful for Epsilon_year to "turn off" covariance among years while still including spatio-temporal variation