crossval_univariate_models: Cross-validation of univariate models
In MichaelHoltonPrice/yada: Yet Another Demographic Analysis package

[crossval_univariate_models] uses the out-of-sample negative log-likelihood to rank univariate models. The ranking involves three considerations. First, ordinal models can be rejected for one of the reasons outlined below. Second, the models are ordered from best to worst by their out-of-sample negative log-likelihoods, which are stored in two arrays (see below): cv_array_ord for ordinal variables and cv_array_cont for continuous variables. Third, all models within cand_tol of the best model are considered equally good, and such models are re-ordered (if necessary) by their simplicity, where the ordering is lin_ord_const < lin_ord_lin_pos_int < log_ord_const < log_ord_lin_pos_int < pow_law_ord_const < pow_law_ord_lin_pos_int for ordinal variables and pow_law_const < pow_law_lin_pos_int for continuous variables. This function assumes that the full set of possible yada candidate models is used (six candidate ordinal models and two candidate continuous models); this assumption could be relaxed in a future release.
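
The three ranking steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not yada's implementation: rank_models is a hypothetical helper, its input is assumed to be one out-of-sample negative log-likelihood per model (already combined across folds), rejected models are assumed to be marked with NA, "within cand_tol" is assumed to mean an absolute difference, and the simplicity ordering is assumed to run from simplest to most complex.

```r
# Candidate ordinal models in the simplicity order given in the text
# (assumed here to run from simplest to most complex).
ord_models <- c("lin_ord_const", "lin_ord_lin_pos_int",
                "log_ord_const", "log_ord_lin_pos_int",
                "pow_law_ord_const", "pow_law_ord_lin_pos_int")

# Hypothetical helper, not part of yada.
# neg_log_lik: named numeric vector, one value per model; NA = rejected model.
# models:      model names in simplicity order.
# cand_tol:    tolerance, treated as an absolute difference (an assumption).
rank_models <- function(neg_log_lik, models, cand_tol) {
  # Step 1: drop rejected models.
  kept <- names(neg_log_lik)[!is.na(neg_log_lik)]
  if (length(kept) == 0) {
    return(character(0))
  }
  # Step 2: order the remaining models from best (lowest) to worst.
  kept <- kept[order(neg_log_lik[kept])]
  # Step 3: models within cand_tol of the best are treated as tied and
  # re-ordered by simplicity; the others keep their likelihood order.
  gap <- unname(neg_log_lik[kept] - neg_log_lik[kept[1]])
  tied <- kept[gap <= cand_tol]
  rest <- setdiff(kept, tied)
  c(tied[order(match(tied, models))], rest)
}

# Made-up values, for illustration only.
nll <- c(lin_ord_const = 105, lin_ord_lin_pos_int = 100.2,
         log_ord_const = 100.5, log_ord_lin_pos_int = NA,
         pow_law_ord_const = 100, pow_law_ord_lin_pos_int = 103)
ranking <- rank_models(nll, ord_models, cand_tol = 1)
print(ranking)
```

In this made-up example pow_law_ord_const has the lowest value, but two simpler models lie within cand_tol of it, so the simplest of the three (lin_ord_lin_pos_int) is ranked first.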

There are four reasons that ordinal models are rejected and not included in the final ranking:

(a) At least one of the folds failed to fit successfully
(b) A log_ord model could not be fit
(c) The scaling exponent is close to zero (less than scale_exp_min), which implies an identifiability problem
(d) The heteroskedastic noise term, beta2, is too large (greater than beta2_max), which implies that the noise at x=0 tends to zero (relative to the response)
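
The four checks can be expressed as a small base R function. This is a sketch only: reject_reason is a hypothetical helper, not a yada function, and it assumes that the caller supplies the fitted scaling exponent and beta2 where a model has them (NA otherwise) and that the log_ord check applies to models whose name starts with "log_ord". The threshold values in the example call are made up.

```r
# Hypothetical helper, not part of yada. Returns the letter of the first
# rejection reason that applies, or NA if the model is kept.
reject_reason <- function(model, folds_ok, can_do_log_ord,
                          scale_exp = NA, beta2 = NA,
                          scale_exp_min, beta2_max) {
  # (a) at least one fold failed to fit
  if (!all(folds_ok)) {
    return("a")
  }
  # (b) a log_ord model could not be fit
  if (startsWith(model, "log_ord") && !can_do_log_ord) {
    return("b")
  }
  # (c) scaling exponent too close to zero
  if (!is.na(scale_exp) && scale_exp < scale_exp_min) {
    return("c")
  }
  # (d) heteroskedastic noise term too large
  if (!is.na(beta2) && beta2 > beta2_max) {
    return("d")
  }
  NA_character_
}

# Made-up inputs: four successful folds, but a tiny scaling exponent.
reason <- reject_reason("pow_law_ord_const",
                        folds_ok = rep(TRUE, 4),
                        can_do_log_ord = TRUE,
                        scale_exp = 0.001,
                        scale_exp_min = 0.01,
                        beta2_max = 5)
print(reason)
```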

The preceding rejection reasons are discussed in greater detail in the following publication:

TODO: add the final citation and link once it is available

Aside from the preceding four reasons, some models have a very small heteroskedastic noise term, beta2, which could be added as another rejection criterion. However, such models are typically very close to constant models, and are thus typically rejected by the combination of using cand_tol and applying the simplicity metric (this was the case for all variables in the publication referenced above).

There are no tailored rejection criteria for continuous models.

[crossval_univariate_models] takes the following inputs:

`data_dir`	The directory with save files and in which to store the results of the cross-validation
`analysis_name`	An "analysis_name" that uniquely specifies this set of models
`scale_exp_min`	The minimum acceptable value of the scaling exponent
`cand_tol`	Candidate model tolerance. The best models are considered equally good if their respective out-of-sample negative log-likelihoods lie within cand_tol of each other. Model simplicity is then used as a "tie-breaker"
`beta2_max`	The maximum acceptable value of beta2, the heteroskedastic noise parameter

The output of [crossval_univariate_models] is a list with the following named elements:

`cv_array_ord`	Ordinal out-of-sample cross-validation array with dimensions num_models_ord x num_folds x J
`cv_array_cont`	Continuous out-of-sample cross-validation array with dimensions num_models_cont x num_folds x K
`num_folds`	The number of cross-validation folds (for the preceding publication, there are 4 folds)
`param_list_ord`	A list of lists of lists with parameter value matrices. The lengths of the lists are J then num_models_ord then num_param x (1+num_folds)
`param_list_cont`	A list of lists of lists with parameter value matrices. The lengths of the lists are K then num_models_cont then num_param x (1+num_folds)
`num_obs_vect`	A vector with the total number of observations for each variable (length J+K)
`can_do_log_ord`	A boolean vector indicating whether the log_ord fits could be done (length J)
`ord_models`	A vector containing the six known ordinal models
`cont_models`	A vector containing the two known continuous models
`mod_select_ord`	A list of data frames giving the model selection information for ordinal variables. The list has length J, with each element of the list having dimensions num_models_ord x 5 (see function documentation for column definitions)
`mod_select_cont`	A list of data frames giving the model selection information for continuous variables. The list has length K, with each element of the list having dimensions num_models_cont x 5 (see function documentation for column definitions)
`cand_tol`	The value for this input parameter
`scale_exp_min`	The value for this input parameter
`beta2_max`	The value for this input parameter

where the following definitions apply:

`num_models_ord`	The number of candidate ordinal models (for the preceding publication, there are six candidate models)
`J`	The number of ordinal variables
`K`	The number of continuous variables
`num_models_cont`	The number of candidate continuous models (for the preceding publication, there are two candidate models)
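
To make the array layout concrete, the following base R sketch builds a mock array with the documented dimensions num_models_ord x num_folds x J and collapses it over the fold dimension. The values are random placeholders, not real results, and summing across folds is an assumption about how per-fold values are combined, not a statement of what yada does internally.

```r
# Mock dimensions: six ordinal models, four folds, two ordinal variables.
num_models_ord <- 6L
num_folds <- 4L
J <- 2L

# Random placeholder values standing in for out-of-sample negative
# log-likelihoods (illustration only).
set.seed(1)
cv_array_ord <- array(runif(num_models_ord * num_folds * J, 50, 60),
                      dim = c(num_models_ord, num_folds, J))

# Collapse the fold dimension (assumed here to be a sum), which gives a
# num_models_ord x J matrix: one row per model, one column per variable.
total_nll <- apply(cv_array_ord, c(1, 3), sum)

# Row index of the lowest total for each variable.
best_model_index <- apply(total_nll, 2, which.min)
print(dim(total_nll))
```

The same indexing applies to cv_array_cont, with num_models_cont and K in place of num_models_ord and J.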

In addition to being returned, the preceding output list is saved to an .rds file in the data_dir directory.

crossval_univariate_models(
  data_dir,
  analysis_name,
  cand_tol,
  scale_exp_min,
  beta2_max
)

`data_dir`	The directory with save files and in which to store the results of the cross-validation
`analysis_name`	An "analysis_name" that uniquely specifies this set of models
`cand_tol`	Candidate model tolerance. The best models are considered equally good if their respective out-of-sample negative log-likelihoods lie within cand_tol of each other. Model simplicity is then used as a "tie-breaker"
`scale_exp_min`	The minimum acceptable value of the scaling exponent
`beta2_max`	The maximum acceptable value of beta2, the heteroskedastic noise parameter
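
A call might look like the sketch below. Everything specific in it is an assumption made for illustration: the directory, the analysis name, and the three numeric values are placeholders rather than recommended settings, and the call is guarded so that it runs only if yada is installed and the directory exists.

```r
# Placeholder inputs (hypothetical; replace with your own).
data_dir <- file.path(tempdir(), "yada_results")
analysis_name <- "my_analysis"

# Run only if yada is installed and the save files are available.
if (requireNamespace("yada", quietly = TRUE) && dir.exists(data_dir)) {
  cv <- yada::crossval_univariate_models(
    data_dir = data_dir,
    analysis_name = analysis_name,
    cand_tol = 0.05,       # placeholder value
    scale_exp_min = 0.01,  # placeholder value
    beta2_max = 5          # placeholder value
  )
  # Named elements of the returned list, as documented above.
  print(names(cv))
  # Model selection information for the first ordinal variable.
  print(cv$mod_select_ord[[1]])
}
```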