learn_curve: Get the learning curve of a model as training data quantity...


View source: R/learning-curve.R

Description

Given a training and test set, fit a model on increasing fractions of the training set, up to the full set, with a constant test set per repeat (each repeat has a different test set). The default fractions are 10%, 20%, 30%, ..., 90%, 100%. Each fraction is a superset of the previous one: for example, all samples present in the 10% subset are also present in the 20% subset. This simulates the addition of more data, as opposed to drawing a fresh random sample of a larger size. Optionally, you can pass all of your data in as training_data and have the function do the splitting for you.
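The nesting property described above can be illustrated with a minimal base R sketch. This is not the mirmodels implementation: nested_subsets() is a hypothetical helper that assumes one random ordering of the rows, with each fraction taking a prefix of that ordering.

```r
# Hypothetical helper (not part of mirmodels): build nested training subsets.
# Assumption: a single shuffle of the row indices, with each fraction taking
# a prefix of that shuffle, so smaller subsets are contained in larger ones.
nested_subsets <- function(n, fracs = seq(0.1, 1, by = 0.1)) {
  shuffled <- sample.int(n)                         # one random ordering of 1..n
  sizes <- pmax(1L, as.integer(round(fracs * n)))   # rows per fraction, at least 1
  lapply(sizes, function(k) shuffled[seq_len(k)])   # prefixes of the same ordering
}

set.seed(1)
subsets <- nested_subsets(50)

# Every subset is contained in the next, larger one.
is_nested <- all(mapply(
  function(smaller, larger) all(smaller %in% larger),
  head(subsets, -1), tail(subsets, -1)
))
print(is_nested)
print(lengths(subsets))
```

Taking prefixes of a single shuffle is one simple way to guarantee that moving from one fraction to the next only ever adds rows.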

Usage

learn_curve(
  model_evaluate,
  training_data,
  outcome,
  testing_data = NULL,
  testing_frac = NULL,
  training_fracs = seq(0.1, 1, by = 0.1),
  repeats = 1,
  strata = NULL,
  n_cores = 1
)

Arguments

model_evaluate

A function with exactly two arguments, training_data and testing_data. It trains the model of choice on training_data, produces predictions on testing_data, and evaluates those predictions, returning a length-two numeric vector with names "cv" and "test" giving the cross-validation and test scores from the evaluation.

training_data

A data frame. Subsets of this will be used for training. If testing_data is NULL and testing_frac is not, this will be split into training and testing sets, with testing_frac used for testing.

outcome

A string. The name of the outcome variable. This must be a column in training_data.

testing_data

A data frame. The trained models will all be tested against this constant test set.

testing_frac

A numeric vector with values between 0 and 1/3. The fraction of training_data to use for the test set. This can only be used if testing_data is NULL. To try many different fractions, specify all of them as a numeric vector.

training_fracs

A numeric vector. Fractions of the training data to use. This must be a positive, increasing vector of real numbers ending in 1.

repeats

A positive integer. The number of times to repeat the sampling for each proportion in testing_frac. This can be greater than 1 only if testing_data is NULL and testing_frac is not NULL. For each repeat, a different split of training_data into training and test sets takes place.

strata

A string. Variable to stratify on when splitting data.

n_cores

A positive integer. The cross-validation can optionally be done in parallel. Specify the number of cores for parallel processing here.
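How testing_frac and repeats combine can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: split_once() is a hypothetical helper, the split is unstratified, and the data frame is synthetic.

```r
# Hypothetical helper (not part of mirmodels): one random train/test split.
# Assumption: a plain unstratified random split of the rows.
split_once <- function(df, testing_frac) {
  n_test <- max(1L, as.integer(round(testing_frac * nrow(df))))
  test_rows <- sample.int(nrow(df), n_test)
  list(
    training = df[-test_rows, , drop = FALSE],
    testing = df[test_rows, , drop = FALSE]
  )
}

set.seed(1)
df <- data.frame(x = seq_len(40), y = rnorm(40))  # synthetic data for illustration

# One fresh split per (testing_frac, repeat) combination.
grid <- expand.grid(testing_frac = c(0.2, 0.3), rep = seq_len(3))
splits <- Map(
  function(frac, rep) split_once(df, frac),
  grid$testing_frac, grid$rep
)

# Test-set size for each combination.
test_sizes <- vapply(splits, function(s) nrow(s$testing), integer(1))
print(cbind(grid, test_size = test_sizes))
```

Within each split, the test set stays fixed while the training portion would be subsampled at each value of training_fracs.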

Value

A data frame with the following columns.

See Also

autoplot.mirvie_learning_curve()

Examples

# Load the Boston housing data and keep only the numeric columns.
data("BostonHousing", package = "mlbench")
bh <- dplyr::select_if(BostonHousing, is.numeric)

# Fit a linear model on the training data and return the mean absolute
# error on both the training and the testing data.
model_evaluate <- function(training_data, testing_data) {
  trained_mod <- lm(medv ~ ., training_data)
  training_preds <- predict(trained_mod, newdata = training_data)
  preds <- predict(trained_mod, newdata = testing_data)
  c(
    train = yardstick::mae_vec(training_data$medv, training_preds),
    test = yardstick::mae_vec(testing_data$medv, preds)
  )
}

# Compute the learning curve: two test fractions, eight repeats each,
# stratified on the outcome and run on four cores.
mlc <- mlc0 <- suppressWarnings(
  learn_curve(model_evaluate, bh, "medv",
    training_fracs = c(seq(0.1, 0.7, 0.2), 0.85),
    testing_frac = c(0.25, 0.5), repeats = 8,
    strata = "medv", n_cores = 4
  )
)

mirvie/mirmodels documentation built on Jan. 14, 2022, 11:12 a.m.