developer_functions: Developer functions for creating recipes steps

developer_functions R Documentation

Developer functions for creating recipes steps

Description

This page provides a comprehensive list of the exported functions for creating recipes steps and guidance on how to use them.

Creating steps

add_step() and add_check() are required when creating a new step or check. Every step function should return the output of add_step(), in the following format:

step_example <- function(recipe,
                         ...,
                         role = NA,
                         trained = FALSE,
                         skip = FALSE,
                         id = rand_id("example")) {
  add_step(
    recipe,
    step_example_new(
      terms = enquos(...),
      role = role,
      trained = trained,
      skip = skip,
      id = id
    )
  )
}

rand_id() should be used in the arguments of step_example() to specify the id argument, as in the example above.

recipes_pkg_check() should be used in step_example() functions together with required_pkgs() to alert users that certain other packages are required. The standard way of using this function is the following format:

recipes_pkg_check(required_pkgs.step_example())
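
The call above relies on a required_pkgs() method for the step, which this page does not show. A minimal sketch of such a method follows, assuming a hypothetical step_example() that depends on the stats package in addition to recipes. The package list is illustrative only.

```r
library(recipes)

# Hypothetical method for the example step: list every package that
# step_example() needs at prep() or bake() time. "stats" is an assumption
# made for illustration.
required_pkgs.step_example <- function(x, ...) {
  c("recipes", "stats")
}

# Inside step_example(), the method is called without a step object,
# which works here because the argument x is never used.
pkgs <- required_pkgs.step_example()

# Informs the user about any listed package that is not installed.
recipes_pkg_check(pkgs)
```

In a package, the method would also need to be registered as an S3 method so that required_pkgs() can dispatch on the step class.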

step() and check() are used within the step_*_new() function of your new step or check. They are used in the following way:

step_example_new <- function(terms, role, trained, skip, id) {
  step(
    subclass = "example",
    terms = terms,
    role = role,
    trained = trained,
    skip = skip,
    id = id
  )
}

recipes_eval_select() is used within prep.step_*() functions to turn the terms object into a character vector of the selected variables.

It will most likely be used like so:

col_names <- recipes_eval_select(x$terms, training, info)

check_type() can be used within prep.step_*() functions to check that the variables passed in are of the right types. We recommend that you use the types argument, as it offers more flexibility and matches the types defined by .get_data_types(). When using types we find it better to be explicit, e.g. writing types = c("double", "integer") instead of types = "numeric", as it produces cleaner error messages.

It should be used like so:

check_type(training[, col_names], types = c("double", "integer"))

check_new_data() should be used within bake.step_*(). This function checks that the required columns are present in the data. It should be one of the first lines inside the function.

It should be used like so:

check_new_data(names(object$columns), object, new_data)
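
The pieces described so far can be combined into one small step. The following is a sketch only: step_center_sketch() and its means field are hypothetical names made up for illustration, and the step simply subtracts the training mean from each selected column. Because the trained values are stored in means, the bake method passes names(object$means) to check_new_data() rather than names(object$columns).

```r
library(recipes)

# Hypothetical step constructor, following the step_example() template.
step_center_sketch <- function(recipe, ..., role = NA, trained = FALSE,
                               means = NULL, skip = FALSE,
                               id = rand_id("center_sketch")) {
  add_step(
    recipe,
    step_center_sketch_new(
      terms = rlang::enquos(...),
      role = role, trained = trained,
      means = means, skip = skip, id = id
    )
  )
}

step_center_sketch_new <- function(terms, role, trained, means, skip, id) {
  step(
    subclass = "center_sketch",
    terms = terms, role = role, trained = trained,
    means = means, skip = skip, id = id
  )
}

prep.step_center_sketch <- function(x, training, info = NULL, ...) {
  # Resolve the selectors into column names, then check their types.
  col_names <- recipes_eval_select(x$terms, training, info)
  check_type(training[, col_names], types = c("double", "integer"))

  # Learn one mean per selected column from the training data.
  means <- vapply(training[, col_names], mean, numeric(1), na.rm = TRUE)

  step_center_sketch_new(
    terms = x$terms, role = x$role, trained = TRUE,
    means = means, skip = x$skip, id = x$id
  )
}

bake.step_center_sketch <- function(object, new_data, ...) {
  col_names <- names(object$means)
  # Fail early if a column seen during training is missing.
  check_new_data(col_names, object, new_data)

  for (col_name in col_names) {
    new_data[[col_name]] <- new_data[[col_name]] - object$means[[col_name]]
  }
  new_data
}

# Usage: center two columns of mtcars.
rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_center_sketch(rec, disp, hp)
trained_rec <- prep(rec, training = mtcars)
baked <- bake(trained_rec, new_data = mtcars)
```

A complete step would also define print, tidy, and required_pkgs methods, which are omitted here to keep the sketch short.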

check_name() should be used in bake.step_*() functions for steps that add new columns to the data set. The function throws an error if the column names already exist in the data set. It should be called before adding the new columns to the data set.

get_keep_original_cols() and remove_original_cols() are used within steps with the keep_original_cols argument. get_keep_original_cols() is used in prep.step_*() functions for steps that were created before the keep_original_cols argument was added, and acts as a way to throw a warning that the user should regenerate the recipe. remove_original_cols() should be used in bake.step_*() functions to remove the original columns. Note that remove_original_cols() can remove multiple columns at once, so when possible it should be called once, outside any for loop.

new_data <- remove_original_cols(new_data, object, names_of_original_cols)

recipes_remove_cols() should be used in prep.step_*() functions, and is used to remove columns from the data set, either by using the object$removals field or by using the col_names argument.

recipes_names_predictors() and recipes_names_outcomes() should be used in prep.step_*() functions, and are used to get names of predictors and outcomes.

get_case_weights() and are_weights_used() help you extract case weights and determine whether they are used within the step. They will typically be used within the prep.step_*() functions if the step in question supports case weights.

print_step() is used inside print.step_*() functions. This function replaces the internally deprecated printer() function.

sel2char() is mostly used within tidy.step_*() functions to turn selections into character vectors.

names0() creates a series of num names with a common prefix. The names are numbered with leading zeros (e.g. prefix01-prefix10 instead of prefix1-prefix10). This is useful for many types of steps that produce new columns.
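
A short illustration of the zero-padding, assuming the recipes package is attached. The prefixes are arbitrary.

```r
library(recipes)

# Ten names share a two-digit width, so they also sort correctly as text.
new_names <- names0(10, prefix = "prefix")
print(new_names)

# With fewer than ten names there is nothing to pad.
short_names <- names0(3, prefix = "PC")
print(short_names)
```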

Interacting with recipe objects

recipes_ptype() returns the ptype (the expected variables and their types) that a recipe object expects at prep() or bake() time, as selected by the stage argument. Functions that interact with recipes can use it to verify that data is correct before passing it to prep() and bake().

recipes_ptype_validate() checks that a recipe and its data are compatible, using the information extracted by recipes_ptype(). Functions that interact with recipes can use it to verify that data is correct before passing it to prep() and bake().
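
As a rough illustration of the stage argument, the sketch below extracts both ptypes from a recipe on mtcars. It assumes a version of recipes recent enough to export recipes_ptype(), and that the default role requirements are in effect, under which outcomes are not required at bake() time.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)

# Zero-row tibble describing the columns expected by prep().
prep_ptype <- recipes_ptype(rec, stage = "prep")

# Columns expected by bake(); the outcome mpg is not required here
# under the default role requirements (assumption stated above).
bake_ptype <- recipes_ptype(rec, stage = "bake")

print(names(prep_ptype))
print(names(bake_ptype))
```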

detect_step() returns a logical indicator to determine if a given step or check is included in a recipe.

fully_trained() returns a logical indicator of whether the recipe is fully trained. The function is_trained() can be used to check whether an individual step is trained.
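
A small sketch of both checks, using the built-in step_center() on mtcars. The object names are arbitrary.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_center(rec, all_numeric_predictors())

# The recipe has a step that has not been prepped yet.
trained_before <- fully_trained(rec)

trained_rec <- prep(rec, training = mtcars)

# After prep(), every step in the recipe is trained.
trained_after <- fully_trained(trained_rec)

# is_trained() looks at one step rather than the whole recipe.
first_step_trained <- is_trained(trained_rec$steps[[1]])
```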

.get_data_types() is an S3 method that is used for selections. This method can be extended to work with column types not supported by recipes.

Package authors are encouraged to use recipes_extension_check() to make sure that all steps have prep.step_*(), bake.step_*(), print.step_*(), tidy.step_*(), and required_pkgs.step_*() methods. It should be used as a test, preferably like this:

test_that("recipes_extension_check", {
  expect_snapshot(
    recipes::recipes_extension_check(
      pkg = "pkgname"
    )
  )
})

tidymodels/recipes documentation built on Nov. 29, 2024, 3:05 p.m.