View source: R/cross_validate.R
cross_validate | R Documentation |
Performs k-fold cross-validation of a data set over a set of input bias parameters. Cross-validation allows the space of bias parameters to be searched to find the settings that best support generalization to unseen data.
cross_validate(
input,
k,
mu_values,
sigma_values,
grid_search = FALSE,
output_path = NA,
out_sep = ",",
control_params = NA,
upper_bound = DEFAULT_UPPER_BOUND,
encoding = "unknown",
model_name = NA,
allow_negative_weights = FALSE
)
input |
The input data frame/data table/tibble. This should contain one or more OT tableaux consisting of mappings between underlying and surface forms with observed frequency and violation profiles. Constraint violations must be numeric. For an example of the data frame format, see inst/extdata/sample_data_frame.csv. You can read this file into a data frame using read.csv or into a tibble using readr::read_csv. This function also supports the legacy OTSoft file format. To use this format, pass the path to the OTSoft file as a string instead of a data frame. For an example of the OTSoft format, see inst/extdata/sample_data_file.txt. |
k |
The number of folds to use in cross-validation. |
mu_values |
A vector or list of mu bias parameters to use in cross-validation. Parameters may either be scalars, in which case the same mu parameter will be applied to every constraint, or vectors/lists containing a separate mu bias parameter for each constraint. |
sigma_values |
A vector or list of sigma bias parameters to use in cross-validation. Parameters may either be scalars, in which case the same sigma parameter will be applied to every constraint, or vectors/lists containing a separate sigma bias parameter for each constraint. |
grid_search |
(optional) If TRUE, the Cartesian product of the values in mu_values and sigma_values is used as the set of mu/sigma combinations to test. Defaults to FALSE. |
output_path |
(optional) A string specifying the path to a file to which the cross-validation results will be saved. If the file exists it will be overwritten. If this argument isn't provided, the output will not be written to a file. |
out_sep |
(optional) The delimiter used in the output files. Defaults to commas. |
control_params |
(optional) A named list of control parameters that
will be passed to the optim function. See the documentation
of that function for details. Note that some parameter settings may
interfere with optimization. The parameter |
upper_bound |
(optional) The maximum value for constraint weights. Defaults to 100. |
encoding |
(optional) The character encoding of the input file. Defaults to "unknown". |
model_name |
(optional) A name for the model. If not provided, the file name will be used if the input is a file path. If the input is a data frame the name of the variable will be used. |
allow_negative_weights |
(optional) Whether the optimizer should allow negative weights. Defaults to FALSE. |
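The effect of grid_search can be sketched in base R. This is only an illustration and does not call cross_validate. It assumes that, when grid_search is FALSE, values of mu and sigma are paired by position, which is an assumption rather than something stated in the argument descriptions.

```r
# Candidate bias parameters (the same values used in this page's examples)
mus <- c(0, 1)
sigmas <- c(0.01, 0.1)

# Grid search: every mu combined with every sigma (Cartesian product)
grid <- expand.grid(mu = mus, sigma = sigmas)
print(grid)

# Assumption: without grid search, values are paired by position
paired <- data.frame(mu = mus, sigma = sigmas)
print(paired)
```

With two values for each parameter, the grid contains four combinations while the paired version contains two, so grid search multiplies the number of models to fit.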
The cross-validation procedure is as follows:
1. Randomly divide the data into k partitions.
2. Iterate through every combination of mu and sigma specified in the input arguments (see the documentation for the grid_search argument for details on how this is done).
3. For each combination, for each of the k partitions, train a model on the other (k-1) partitions using optimize_weights and then run predict_probabilities on the remaining partition.
4. Record the mean log likelihood that the models assign to the held-out partitions.
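The procedure can be sketched in base R with a toy model. This is a minimal illustration under stated assumptions, not the package's implementation: the data are simulated binary outcomes, and a smoothed frequency estimate stands in for optimize_weights and predict_probabilities. All variable names here are hypothetical.

```r
set.seed(1)  # make the random partition reproducible

# Toy data: 20 binary outcomes (simulated, for illustration only)
y <- rbinom(20, size = 1, prob = 0.7)
k <- 4

# Randomly assign each observation to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = length(y)))

# For each fold: train on the other k-1 folds, score the held-out fold
held_out_ll <- vapply(seq_len(k), function(fold) {
  train <- y[fold_id != fold]
  test <- y[fold_id == fold]
  # "Training" is a smoothed estimate of P(y = 1); this stands in for
  # optimize_weights and is not the package's model
  p <- (sum(train) + 1) / (length(train) + 2)
  # Log likelihood of the held-out observations under the trained model
  sum(dbinom(test, size = 1, prob = p, log = TRUE))
}, numeric(1))

# Mean held-out log likelihood across the k folds
mean_ll <- mean(held_out_ll)
print(mean_ll)
```

In a full search, this loop would be repeated once per mu/sigma combination, and the resulting mean log likelihoods compared.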
A data frame with the following columns:
model_name: the name of the model
mu: the value(s) of mu tested
sigma: the value(s) of sigma tested
folds: the number of folds
mean_ll: the mean log likelihood of k-fold cross-validation using these bias parameters
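Once a result of this shape is available, the best-performing setting is the row with the highest mean_ll. The sketch below uses a hand-built data frame with the documented column names; the values are made up for illustration and are not real output from cross_validate.

```r
# Hypothetical results with the documented columns; the numbers are
# invented for illustration and are not real output
cv_results <- data.frame(
  model_name = "demo",
  mu = c(0, 1),
  sigma = c(0.01, 0.1),
  folds = 2,
  mean_ll = c(-12.5, -10.2)
)

# The best setting is the row with the highest (least negative) mean_ll
best <- cv_results[which.max(cv_results$mean_ll), ]
print(best)
```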
# Get the path to a sample tableaux file and read it into a data frame. Note
# that you can also pass the path to an OTSoft file into this function, as
# described in the documentation for `optimize_weights`.
data_file <- system.file(
  "extdata", "amp_demo_grammar.csv", package = "maxent.ot"
)
tableaux_df <- read.csv(data_file)
# Define mu and sigma parameters to try
mus <- c(0, 1)
sigmas <- c(0.01, 0.1)
# Do 2-fold cross-validation
cross_validate(tableaux_df, 2, mus, sigmas)
# Do 2-fold cross-validation with grid search of parameters
cross_validate(tableaux_df, 2, mus, sigmas, grid_search=TRUE)
# You can also use vectors/lists for some/all of the bias parameters to set
# separate biases for each constraint
mus_v <- list(
  c(0, 1),
  c(1, 0)
)
sigmas_v <- list(
  c(0.01, 0.1),
  c(0.1, 0.01)
)
cross_validate(tableaux_df, 2, mus_v, sigmas_v)
# Save cross-validation results to a file
tmp_output <- tempfile()
cross_validate(tableaux_df, 2, mus, sigmas, output_path=tmp_output)