model_dataset: Model dataset

View source: R/model-dataset.R

Description

Analyses a dataset of chord sequences by constructing and optimising a viewpoint regression model, and using this model to generate predictions for these sequences.

Usage

model_dataset(
  corpus_test,
  corpus_pretrain,
  output_dir,
  viewpoints = hvr::hvr_viewpoints,
  weights = NULL,
  poly_degree = 4L,
  max_iter = 500,
  corpus_test_folds = list(seq_along(corpus_test)),
  allow_repeats = FALSE,
  max_sample = Inf,
  sample_seed = 1,
  stm_opt = stm_options(),
  ltm_opt = ltm_options(),
  na_val = 0,
  perm_int = TRUE,
  perm_int_seed = 1,
  perm_int_reps = 5,
  allow_negative_weights = FALSE
)

Arguments

corpus_test

Corpus of chord sequences to predict, as created by corpus.

corpus_pretrain

Corpus of chord sequences with which to pretrain the model, as created by corpus. These chord sequences are used solely to pretrain the discrete viewpoint models; continuous viewpoint effects and discrete viewpoint weights are optimised on corpus_test.

output_dir

(Character scalar) Directory in which to save the model outputs.

viewpoints

List of viewpoints to apply, as created by new_viewpoint. Defaults to a fairly comprehensive list, hvr_viewpoints.

weights

(NULL or numeric vector) An optional set of viewpoint regression weights; if not provided, weights will be optimised automatically. These weights should be provided as a named numeric vector in a specific order; the best way to find this format is to fit a pilot regression model with the desired viewpoint set.

poly_degree

(Integer scalar) Degree of the polynomials to compute for the continuous features.

max_iter

(Integer scalar) Maximum number of iterations for the optimisation routine.

corpus_test_folds

List of cross-validation folds for applying discrete viewpoint models to the sequences in corpus_test. Each list element should be an integer vector indexing into corpus_test. These integer vectors must exhaustively partition the sequences in corpus_test. The algorithm iterates over the folds, predicting the sequences within each fold after training the model on the combination of a) the sequences from the other folds in corpus_test_folds and b) the sequences in corpus_pretrain. By default, there is just one fold corresponding to the entirety of corpus_test, meaning that no cross-validation is applied.
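The fold format described above can be sketched in base R as follows. This is only an illustration: n_seq stands in for length(corpus_test), and the number of folds k and the random assignment are assumptions, not choices made by hvr.

```r
# Hypothetical corpus size; in practice use length(corpus_test).
n_seq <- 10L
k <- 3L  # number of folds (an illustrative choice)

# Assign each sequence index to one of k folds at random.
set.seed(1)
fold_id <- sample(rep_len(seq_len(k), n_seq))

# Build an unnamed list of integer vectors, one per fold.
folds <- unname(split(seq_len(n_seq), fold_id))

# Check that the folds exhaustively partition the indices:
# every index appears exactly once across all folds.
stopifnot(identical(sort(unlist(folds)), seq_len(n_seq)))
```

A list built this way could then be passed as corpus_test_folds.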

allow_repeats

(Logical scalar) Whether repeated chords are theoretically permitted in the chord sequences. It is recommended to remove such repetitions before modelling.

max_sample

(Numeric scalar) Maximum number of events to sample for the model matrix. Defaults to Inf (no downsampling); lower values prompt random downsampling.

sample_seed

(Integer scalar) Random seed to make the downsampling reproducible.
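To show how max_sample and sample_seed might interact, here is a minimal base R sketch of seeded downsampling. The helper downsample_rows is hypothetical and is not hvr's implementation; it only shows that fixing the seed makes the selected rows repeatable.

```r
# Illustrative helper (not part of hvr): keep at most max_sample row indices,
# chosen at random with a fixed seed so the result is reproducible.
downsample_rows <- function(n_rows, max_sample = Inf, sample_seed = 1) {
  if (n_rows <= max_sample) return(seq_len(n_rows))  # no downsampling needed
  set.seed(sample_seed)
  sort(sample.int(n_rows, size = max_sample))
}

a <- downsample_rows(1000, max_sample = 50, sample_seed = 1)
b <- downsample_rows(1000, max_sample = 50, sample_seed = 1)
identical(a, b)  # the same seed selects the same rows
```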

stm_opt

Options list for the short-term PPM models, as created by the function stm_options.

ltm_opt

Options list for the long-term PPM models, as created by the function ltm_options.

na_val

(Numeric scalar) Value used to code NA in the model matrix. The statistical analyses are mostly unaffected by this value.

perm_int

(Logical scalar) Whether to compute permutation-based feature importances.

perm_int_seed

(Integer scalar) Random seed for the permutation-based feature importances.

perm_int_reps

(Integer scalar) Number of replicates for the permutation-based feature importances (the final estimates are averages over these replicates).
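The general idea of permutation-based feature importance, averaged over replicates, can be sketched in base R. This is a generic illustration using a linear model and mean squared error as stand-ins; the model, the performance metric, and the helper names here are assumptions and do not describe how hvr computes its importances.

```r
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n, sd = 0.5)  # x2 is irrelevant by construction

fit <- lm(y ~ x1 + x2, data = dat)

# Mean squared prediction error of a fitted model on a dataset.
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}
baseline <- mse(fit, dat)

# Shuffle one feature, measure the increase in error, and average
# the increase over several replicates.
perm_importance <- function(feature, reps = 5) {
  increases <- vapply(seq_len(reps), function(i) {
    shuffled <- dat
    shuffled[[feature]] <- sample(shuffled[[feature]])
    mse(fit, shuffled) - baseline
  }, numeric(1))
  mean(increases)
}

importance <- c(x1 = perm_importance("x1"), x2 = perm_importance("x2"))
```

In this toy setting, shuffling x1 should degrade the predictions far more than shuffling x2, so x1 receives the larger importance.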

allow_negative_weights

(Logical scalar) Whether negative weights should be allowed for discrete features (FALSE by default).

Details

This function wraps the following sub-routines:

  1. compute_viewpoints

  2. compute_ppm_analyses

  3. compute_model_matrix

  4. viewpoint_regression

  5. compute_predictions

Users may wish to use these sub-routines explicitly if performing repeated analyses with different parameter settings, to save redundant computation.

Value

Various model outputs are saved to output_dir. The function returns a tibble of predicted probabilities for the chords in corpus_test; see compute_predictions for an explanation of this tibble.
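A call might be wrapped as in the sketch below. The wrapper run_cv_model is hypothetical; it only forwards arguments that appear in the usage signature above, and it assumes the corpora and folds have already been prepared by the caller. Setting perm_int = FALSE is an illustrative choice that skips the permutation-based importances.

```r
# Hypothetical convenience wrapper (not part of hvr).
# corpus_test and corpus_pretrain are assumed to be corpus objects,
# and folds a list of integer vectors partitioning corpus_test.
run_cv_model <- function(corpus_test, corpus_pretrain, output_dir, folds) {
  hvr::model_dataset(
    corpus_test = corpus_test,
    corpus_pretrain = corpus_pretrain,
    output_dir = output_dir,
    corpus_test_folds = folds,
    perm_int = FALSE  # skip permutation-based feature importances
  )
}
```

The value returned by such a call would be the tibble of predicted probabilities described above.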


pmcharrison/hvr documentation built on April 14, 2020, 2:47 a.m.