make_data: Build data input for VAST model

View source: R/make_data.R

make_dataR Documentation

Build data input for VAST model

Description

make_data builds a tagged list of data inputs used by TMB for running the model

Usage

make_data(
  b_i,
  a_i,
  t_i,
  c_iz = rep(0, length(b_i)),
  e_i = c_iz[, 1],
  v_i = rep(0, length(b_i)),
  FieldConfig = c(Omega1 = "IID", Epsilon1 = "IID", Omega2 = "IID", Epsilon2 = "IID"),
  ObsModel_ez = c(PosDist = 1, Link = 0),
  OverdispersionConfig = c(eta1 = 0, eta2 = 0),
  RhoConfig = c(Beta1 = 0, Beta2 = 0, Epsilon1 = 0, Epsilon2 = 0),
  VamConfig = c(Method = 0, Rank = 0, Timing = 0),
  Aniso = TRUE,
  PredTF_i = rep(0, length(b_i)),
  covariate_data = NULL,
  X1_formula = ~0,
  X2_formula = ~0,
  X1config_cp = NULL,
  X2config_cp = NULL,
  X_contrasts = NULL,
  catchability_data = NULL,
  Q1_formula = ~0,
  Q2_formula = ~0,
  Q1config_k = NULL,
  Q2config_k = NULL,
  spatial_list,
  Network_sz = NULL,
  F_ct = NULL,
  F_init = 1,
  CheckForErrors = TRUE,
  yearbounds_zz = NULL,
  Options = c(),
  Expansion_cz = NULL,
  Z_gm = NULL,
  Version = FishStatsUtils::get_latest_version(package = "VAST"),
  overlap_zz = matrix(ncol = 7, nrow = 0),
  ...
)

Arguments

b_i

Vector, providing sampled value (biomass, counts, etc.) for each observation i. Users should provide values as class units, e.g., as_units(x, "kg") for data in units kilograms, as_units(x, "count") for numbers, or as_units(x, "kg/km^2") for area-standarized biomass, etc. If units are missing, the default behavior is to coerce the vector to units "kg". Units are then used in subsequent plotting. Users can also specify b_i=NA for some observations, e.g., when seeking to ensure that the model includes hindcast/forecast years that do not otherwise have sampling data

a_i

Vector containing values greater than zero, providing sampled area for each observation i, e.g., as_units(x, "km^2") for area swept in square-kilometers. use as_units(1, unitless) for observations without a natural area measurement (noting that resulting densities no longer have interpretable units in that case), or for observations where the response b_i already standardizes for area, e.g., b_i = as_units(x, "kg/km^2")

t_i

Vector of integers, providing the time (e.g., calendar year) for each observation i. The first modeled interval will be min(t_i), and last will be max(t_i), and every interval corresponding to integers between min(t_i) and max(t_i) will also be modeled.

c_iz

Vector of integers ranging from 0 to the number of variables minus 1, providing the category (e.g., species, length-bin) for each observation i. This can be specified as a matrix, such that each observation is associated with multiple categories. Such specification treats samples as arising from the sum across multiple categories, e.g., to account for unlabeled multispecies data.

e_i

Optional vector of integers ranging from 0 to the number of different error distributions, providing the error distribution to use for each observation i; by default e_i=c_iz[,1] such that each category has a unique estimated magnitude of measurement error. This option can be useful to specify (1) time-blocks in the variance of measurement error, (2) differentg distributions for subsets of samples, or various other applications.

v_i

Vector of integers ranging from 0 to the number of vessels minus 1, providing sampling category (e.g., vessel or tow) associated with overdispersed variation for each observation i (by default v_i=0 for all samples, which will not affect things given the default values for OverdispersionConfig). In some cases a portion of observations have a overdispersion-effect, but not others, and in this case the observations without are specified as v_i=NA

FieldConfig

See Details section of make_data for details

ObsModel_ez

an optional matrix with two columns where the first column specifies the distribution for positive catch rates, and the second column specifies the functional form for encounter probabilities

ObsModel_ez[e,1]=0

Delta-Normal

ObsModel_ez[e,1]=1

Delta-Lognormal, using bias-corrected-mean and log-sd as parameters; I recommend using ObsModel_ez[e,1]=4 instead for consistent interpretation of logSigmaM as log-CV as used for Gamma and inverse-Gaussian distribution

ObsModel_ez[e,1]=2

Delta-Gamma

ObsModel_ez[e,1]=4

Delta-Lornormal, using bias-corrected-mean and log-coefficient of variation (CV) as parameters; see note for ObsModel_ez[e,1]=1

ObsModel_ez[e,1]=5

Zero-inflated negative binomial (1st linear predictor for logit-linked zero-inflation; 2nd linear predictor for log-linked conditional mean of NB), using two variance parameters for linear and quadratic components

ObsModel_ez[e,1]=7

Zero-inflated Poisson (1st linear predictor for logit-linked zero-inflation; 2nd linear predictor for log-linked conditional mean of Poisson)

ObsModel_ez[e,1]=9

Delta-Generalized Gamma, which involves two variance parameters and includes the gamma and lognormal as nested submodels

ObsModel_ez[e,1]=10

Tweedie distribution, where expected biomass (lambda) is the product of 1st-component and 2nd-component, variance scalar (phi) is the 1st component, and logis-SigmaM is the power. This parameterization is fast (i.e., comparable to delta-models) as long as random effects are turned off for the 1st component, but is otherwise extremely slow.

ObsModel_ez[e,1]=11

Zero-inflated Poisson with additional normally-distributed variation overdispersion in the log-intensity of the Poisson distribution

ObsModel_ez[e,1]=12

Poisson distribution (not zero-inflated) with log-intensity from the 1st linear predictor, to be used in combination with the Poisson-link delta model for combining multiple data types

ObsModel_ez[e,1]=13

Bernoulli distribution using complementary log-log (cloglog) link from the 1st linear predictor, to be used in combination with the Poisson-link delta model for combining multiple data types

ObsModel_ez[e,1]=14

Similar to 12, but also including lognormal overdispersion

ObsModel_ez[e,2]=0

Conventional delta-model using logit-link for encounter probability and log-link for positive catch rates

ObsModel_ez[e,2]=1

Alternative "Poisson-link delta-model" using log-link for numbers-density and log-link for biomass per number

ObsModel_ez[e,2]=2

Link function for Tweedie distribution, necessary for ObsModel_ez[e,1]=8 or ObsModel_ez[e,1]=10

ObsModel_ez[e,2]=3

Conventional delta-model, but fixing encounter probability=1 for any year where all samples encounter the species

ObsModel_ez[e,2]=4

Poisson-link delta-model, but fixing encounter probability=1 for any year where all samples encounter the species and encounter probability=0 for any year where no samples encounter the species

OverdispersionConfig

a vector of format c("eta1"=0, "eta2"="AR1") governing any correlated overdispersion among categories for each level of v_i, where eta1 is for encounter probability, and eta2 is for positive catch rates, where 0 is off, code"AR1" is an AR1 process, and aninteger greater than zero (e.g., 2) is the number of elements in a factor-analysis covariance (by default, c("eta1"=0, "eta2"=0) and this turns off overdispersion)

RhoConfig

vector of form c("Beta1"=0,"Beta2"=0,"Epsilon1"=0,"Epsilon2"=0) specifying whether either intercepts (Beta1 and Beta2) or spatio-temporal variation (Epsilon1 and Epsilon2) is structured among time intervals, e.g. for component Epsilon2 indicated in the 4rd slot:

RhoConfig[4]=0

Each year as fixed effect

RhoConfig[4]=1

Each year as an independent and identically distributed random effect, thus estimating the variance as fixed effect

RhoConfig[4]=2

Each year as a random effect following a random walk, thus estimating the variance as fixed effect

RhoConfig[4]=3

Constant among years as fixed effect

RhoConfig[4]=4

Each year as a random effect following a first-order autoregressive process, thus estimating the variance as fixed effects and a single first-order autoregression parameter

RhoConfig[4]=5

Each year as a random effect following a first-order autoregressive process, estimating the variance as fixed effects and a separate first-order autoregression parameter for each category

RhoConfig[4]=6

Only possible for Epsilon2 or Beta2, and specifying that that associated hyperparameters parameters have identical values to the first component Epsilon1 or Beta1

If missing, the default is to assume a value of zero for each element (i.e., RhoConfig[1:4]=0)

VamConfig

Options to estimate interactions, containing three slots:

VamConfig[0]

selects method for forming interaction matrix; Turn off feature using 0, or I recommend using 2 by default

VamConfig[1]

indicates the rank of the interaction matrix, indicating the number of community axes that have regulated dynamics

VamConfig[2]

Indicates whether interactions occur before spatio-temporal variation (VamConfig[2]=0) or after VamConfig[2]=1

Aniso

whether to assume isotropy, Aniso=0, or geometric anisotropy, Aniso=1

PredTF_i

OPTIONAL, whether each observation i is included in the likelihood, PredTF_i[i]=0, or in the predictive probability, PredTF_i[i]=1

covariate_data

data-frame of covariates that is used when constructing density covariates. any variable referenced in X1_formula and X2_formula must be in covariate_data

X1_formula

right-sided formula affecting the 1st linear predictor which is then estimated independently for each model category c_i, e.g., use X1_formula=~BOT_DEPTH+BOT_DEPTH^2 for a quadratic effect of variable BOT_DEPTH that is estimated independently for each category. The effect of an estimated effect is also used when predicting the value for each location in the extrapolation-grid. Therefore, X1_formula is interepreted as affecting the "true" underlying value of each variable, and it affects both samples and extrapolated values. It is allowed to include Year in the formula, although please check whether it is interpreted as numeric or factor-valued. Note that X1_formula is internally updated (and resulting design-matrices are modified) to avoid any intercept from arising in X1_ip and X1_gctp, to avoid identifiability issues between covariates and intercepts for each category.

X2_formula

same as X1_formula but affecting the 2nd linear predictor.

X1config_cp

matrix of settings for each density covariate for the 1st lienar predictor, where the row corresponds to model category, and column corresponds to each density covariate

X1config_cp[c,p]=0

X_ip[,p] has no effect on the 1st linear predictor for category c

X1config_cp[c,p]=1

X_ip[,p] has a linear effect on 1st linear predictor for category c

X1config_cp[c,p]=2

X_ip[,p] has a spatially varying, zero-centered linear effect on 1st linear predictor for category c

X1config_cp[c,p]=3

X_ip[,p] has a spatially varying linear effect on 1st linear predictor for category c

X1config_cp[c,p]=4

Replaces covariate X_ip[,p] with beta1_tc[,c]+beta2_tc[,c] for the corresponding year and category, which has a zero-centered spatially varying effect on 1st linear predictor. This then represents a density-dependent effect on distribution. (NOTE: still in beta testing)

X1config_cp[c,p]=-1

X1config_cp[c,p]=-1 is the same as X1config_cp[c,p]=2, but without including the log-likelihood term; this is useful in special cases when carefully mirroring spatially varying coefficients, e.g., to use cohort effects in a age-structured spatio-temporal model

X2config_cp

Same as argument X1config_cp but for 2nd linear predictor

X_contrasts

list defining the contrasts for each density-covariate factor specified in X1_formula or X2_formula with specification according to contrasts. By default uses X1_contrasts = NULL, which will set the first level of each covariate factor as the reference level.

catchability_data

data-frame of covariates for use when specifying Q1_formula and Q2_formula

Q1_formula

Similar to X1_formula, Q1_formula affects the 1st linear predictor for samples. However, the effect of Q1_formula is not used when predicting values at extrapolation-grid locations. Therefore, the Q1_formula is interpreted as affecting "catchability" (a.k.a. "detectabiility"), and it represents processes that affect the outcome of sampling but not the "true" underlying value of a variable being sampled. For example, to estimate the relative performance of differeng gear types, include catchability_data = data.frame(gear=gear_factor) where gear_factor is a factor-valued indicator for different gear types and Q1_formula = ... + gear, and this will estimate the catchability for each level relative to the base level of that factor. Note that Q1_formula defines a relationship that is applied to all samples (regardless of category c_i), whereas X1_formula defines a relationship that is estimated independently for each category. Also note that make_data includes c_iz[,1] as a column labeled category in catchability_data, and that Q1_formula is internally updated (and resulting design-matrices are modified) to avoid any category-specific intercept from arising in Q1_ik, to avoid identifiability issues between category-specific covariates and intercepts. For example, for a catchability covariate that varies by category, include an interaction with category in Q1_formula, e.g., Q1_formula = ... + category:gear where gear is a factor to estimate category-specific catchability ratio for different levels of gear.

Q2_formula

same as Q2_formula but affecting the 2nd linear predictor.

Q1config_k

Same as argument X1config_cp but affecting affecting the 1st linear predictor for catchability, and note that it is a vector (instead of matrix) given that catchability responses do not vary among variables c by default (but can be specified to do so when an appropriate 'formula' is supplied)

Q2config_k

Same as argument Q1config_cp but affecting affecting the 2nd linear predictor for catchability

spatial_list

tagged list of locational information from , i.e., from make_spatial_info

Network_sz

An interface to specify a directed acyclic graphic representing a stream network, used only when Method="Stream_network".

F_ct

matrix of instantanous fishing mortality for each category c and year t. Only feasible when using a Poisson-link delta model and specifying temporal structure on intercepts, when the temporal autocorrelation is equivalent to a Spawning Potential Ratio (SPR) proxy for fishing mortality targets given the implied Gompertz density dependence.

F_init

In a vector-autoregressive model specifying fishing mortality rates via F_ct, the initial conditions can be based on the stationary distribution assuming that the first year of fishing mortality had happened indefinitely prior to data F_init=2, or that fishing mortality was zero prior to data, F_init=1

CheckForErrors

whether to check for errors in input (NOTE: when CheckForErrors=TRUE, the function will throw an error if it detects a problem with inputs. However, failing to throw an error is no guarantee that the inputs are all correct)

yearbounds_zz

matrix with two columns, giving first and last years for defining one or more periods (rows) used to calculate changes in synchrony over time (only used if Options['Calculate_Synchrony']=1)

Expansion_cz

matrix specifying how densities are expanded when calculating annual indices, with a row for each category c and two columns. The first column specifies whether to calculate annual index for category c as the weighted-sum across density estimates as follows:

Expansion[c,1]=0

density is weighted by area ("area-weighted expansion", the default)

Expansion[c,1]=1

density is weighted by the expanded value for another category ("abundance weighted expansion")

Expansion[c,1]=2

the index is calculated as the weighted average of density weighted by the expanded value for another category ("abundance weighted-average expansion")

Expansion[c,1]=3

area-weighted abundance is added to the expanded abundance for a prior category ("area-weighted cumulative total")

Expansion[c,1]=4

The fraction across categories is calculated, and then multiplied by another area-weighted index ("abundance weighted proportional expansion")

The 2nd column is used when Expansion[c1,1]=1 or Expansion[c1,1]=2 or Expansion[c1,1]=3 or Expansion[c1,1]=4, and specifies the category to use for abundance-weighted expansion/average/summation/proportions, where Expansion[c1,2]=c2 and c2 must be lower than c1.

Z_gm

matrix specifying coordinates to use when calculating center-of-gravity and range-edge statistics. Defaults to eastings and northings for each knots or extrapolation-grid cell.

Version

Which CPP version to use. If missing, defaults to latest version using get_latest_version. Can be used to specify using an older CPP, to maintain backwards compatibility.

overlap_zz

A matrix with seven columns (and zero rows by default), indicating whether overlap metrics should be calculated as a derived quantity. Each row of overlap_zz specifies an additional quantity to be calculated.

...

interface to pass deprecated inputs, included for backwards compatibility with previous versions which, e.g., specified elements of spatial_list individually instead of as a single object

Details

Specification of FieldConfig can be seen by calling make_settings, which is the recommended way of generating this input for beginning users.

Argument FieldConfig is a matrix of form FieldConfig = matrix( c(0,10,"IID","Identity", "AR1",10,"IID","Identity"), ncol=2, nrow=4, dimnames=list(c("Omega","Epsilon","Beta","Epsilon_year"),c("Component_1","Component_2")). However, for backwards compatibility, FieldConfig can instead be specified as a vector of format FieldConfig = c("Omega1"=0, "Epsilon1"=10, "Omega2"="AR1", "Epsilon2"=10), which generates the same settings as the matrix specification, given the values of each shown in this example.

Both vector (simplified) and matrix (full) specification of FieldConfig involve named elements using the following naming conventions:

Omega

specifies whether spatial variation is present and/or correlated among variables

Epsilon

specifies whether spatio-temporal variation is present and/or correlated among variables

Beta

specifies whether temporal variation (a.k.a. "intercepts") is present and/or correlated among variables

Epsilon_year

specifies whether spatio-temporal variation is correlated among years

Meanwhile, Component_1 (or the numeral "1" after a component name) refers to the 1st lienar predictor (e.g., of a delta-model), while Component_2 (or the numeral "2" after a component name) refers to the 2nd linear predictor. The simplified vector-specification does not include slots for Beta or Epsilon_year and therefore is not as general.

In each slot of FieldConfig, the user can specify various options:

0

turns off a given model component

integer greater than zero

specifies the rank (number of factors) in a factor-analysis covariance matrix

"AR1"

specifies that a model component is correlated following an first-order autoregressive process

"IID"

specifies that a given model component is a random effect that is independent for every level

"Identity"

specifies that a given model component has covariance of an identity-matrix; this is only useful for Epsilon_year to "turn off" covariance among years while still including spatio-temporal variation

make_data generates arrays of covariates X_gtp and X_ip using make_covariates; see that function for more details.

Value

Object of class make_data, containing inputs to function make_model


James-Thorson/VAST documentation built on Jan. 31, 2024, 12:13 p.m.