learn_curve: Get the learning curve of a model as training data quantity...
In mirvie/mirmodels: Models Built by the Informatics Team at Mirvie

Description Usage Arguments Value See Also Examples

Given a training and test set, fit a model on increasing fractions of the training set, up to the full set, with a constant test set per repeat (each repeat has a different test set). The default fractions are 10%, 20%, 30%, ..., 90%, 100%. Each fraction is a superset of the previous one: for example, all samples present in the 10% subset are also present in the 20% subset. This simulates the addition of more data, as opposed to drawing a fresh random sample of more data. Optionally, you can pass all of your data in as the training_data and have the function do the train/test splitting for you.
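The nested-subset behaviour described above can be sketched in base R. This is only an illustration of the idea, not the package's actual implementation: the row count `n` and the names `shuffled` and `nested_indices` are hypothetical. The rows are shuffled once, and each fraction takes a longer prefix of that single ordering, so every subset contains the previous one.

```r
# Sketch of nested training subsets (an assumption about the approach,
# not code taken from mirmodels).
set.seed(1)
n <- 50                                   # hypothetical number of training rows
training_fracs <- seq(0.1, 1, by = 0.1)

# Shuffle the row indices once and reuse that ordering for every fraction.
shuffled <- sample.int(n)

# For each fraction, keep the first round(frac * n) shuffled indices.
nested_indices <- lapply(
  training_fracs,
  function(frac) shuffled[seq_len(round(frac * n))]
)

# Check that each subset is fully contained in the next larger one.
is_nested <- mapply(
  function(smaller, larger) all(smaller %in% larger),
  head(nested_indices, -1),
  tail(nested_indices, -1)
)
```

Drawing an independent random sample for each fraction would not have this property, which is why a single shuffled ordering is used in the sketch.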

learn_curve(
  model_evaluate,
  training_data,
  outcome,
  testing_data = NULL,
  testing_frac = NULL,
  training_fracs = seq(0.1, 1, by = 0.1),
  repeats = 1,
  strata = NULL,
  n_cores = 1
)

`model_evaluate`	A function with exactly two arguments: `training_data` and `testing_data` that trains the model of choice on `training_data` and then produces predictions on `testing_data`, finally evaluating those predictions and outputting a length two numeric vector with names "cv" and "test" giving the cross-validation and test scores from the evaluation.
`training_data`	A data frame. Subsets of this will be used for training. If `testing_data` is `NULL` and `testing_frac` is not, this will be split into training and testing sets, with `testing_frac` used for testing.
`outcome`	A string. The name of the outcome variable. This must be a column in `training_data`.
`testing_data`	A data frame. The trained models will all be tested against this constant test set.
`testing_frac`	A numeric vector with values between 0 and 1/3. The fraction of `training_data` to use for the test set. This can only be used if `testing_data` is `NULL`. To try many different fractions, specify all of them as a numeric vector.
`training_fracs`	A numeric vector. Fractions of the training data to use. This must be a positive, increasing vector of real numbers ending in 1.
`repeats`	A positive integer. The number of times to repeat the sampling for each proportion in `testing_frac`. This can be greater than 1 only if `testing_data` is `NULL` and `testing_frac` is not `NULL`. For each repeat, a different split of `training_data` into training and testing sets takes place.
`strata`	A string. Variable to stratify on when splitting data.
`n_cores`	A positive integer. The cross-validation can optionally be done in parallel. Specify the number of cores for parallel processing here.
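As a rough illustration of the `model_evaluate` contract described in the arguments above, the following base R sketch returns a length-two numeric vector named "cv" and "test". Everything specific here is an assumption made for the example: the `mtcars` data, the `mpg ~ wt + hp` linear model, mean absolute error as the metric, the hypothetical `mae()` helper, and the choice of 5 folds.

```r
# Sketch of a model_evaluate function returning c(cv = ..., test = ...).
# Assumptions: linear model, MAE metric, 5 folds, outcome column "mpg".

# Hypothetical helper: mean absolute error.
mae <- function(truth, estimate) mean(abs(truth - estimate))

model_evaluate <- function(training_data, testing_data) {
  n_folds <- 5
  # Assign each training row to one of the folds at random.
  fold_id <- sample(rep_len(seq_len(n_folds), nrow(training_data)))
  # Fit on all folds but one and score on the held-out fold.
  fold_scores <- vapply(seq_len(n_folds), function(k) {
    fit <- lm(mpg ~ wt + hp, data = training_data[fold_id != k, , drop = FALSE])
    held_out <- training_data[fold_id == k, , drop = FALSE]
    mae(held_out$mpg, predict(fit, newdata = held_out))
  }, numeric(1))
  # Refit on all training rows, then score on the constant test set.
  final_fit <- lm(mpg ~ wt + hp, data = training_data)
  test_preds <- predict(final_fit, newdata = testing_data)
  c(cv = mean(fold_scores), test = mae(testing_data$mpg, test_preds))
}

# Try the function on a simple random train/test split of mtcars.
set.seed(42)
test_rows <- sample.int(nrow(mtcars), 8)
scores <- model_evaluate(mtcars[-test_rows, ], mtcars[test_rows, ])
```

A function of this shape could then be passed as the first argument to `learn_curve()`, with "mpg" as the `outcome`.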

A data frame with the following columns.

rep: The repeat number.
testing_frac: The fraction of training_data that is set aside for testing. If the testing_data argument is specified, testing_frac will be 0, because none of training_data is set aside for testing.
training_frac: The fraction of the (post train/test split) training data used for learning.
testing_indices: The row indices of the training_data argument that were set aside for testing. If testing_data is specified (and hence none of training_data needs to be set aside for testing), this will be a vector of NAs with length equal to the number of rows in testing_data.
training_indices: The row indices of the training_data that were used for learning.
cv: The cross-validation score.
test: The test score.

autoplot.mirvie_learning_curve()

# Load the Boston housing data and keep only the numeric columns.
data("BostonHousing", package = "mlbench")
bh <- dplyr::select_if(BostonHousing, is.numeric)
# Fit a linear model on the training data and score it with mean absolute error.
model_evaluate <- function(training_data, testing_data) {
  trained_mod <- lm(medv ~ ., training_data)
  training_preds <- predict(trained_mod, newdata = training_data)
  preds <- predict(trained_mod, newdata = testing_data)
  c(
    train = yardstick::mae_vec(training_data$medv, training_preds),
    test = yardstick::mae_vec(testing_data$medv, preds)
  )
}
# Compute the learning curve for two test fractions, with 8 repeats each,
# stratifying the splits on the outcome and using 4 cores.
mlc <- mlc0 <- suppressWarnings(
  learn_curve(model_evaluate, bh, "medv",
    training_fracs = c(seq(0.1, 0.7, 0.2), 0.85),
    testing_frac = c(0.25, 0.5), repeats = 8,
    strata = "medv", n_cores = 4
  )
)