step_mas: Transform set-valued variables to logical membership...

View source: R/step-mas.r

step_mas    R Documentation

Transform set-valued variables to logical membership variables

Description

The function step_mas() creates a specification of a recipe step that will create binary variables from set-valued attributes.

Usage

step_mas(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  max_length = Inf,
  min_support = 0.01,
  min_all_confidence = 0.1,
  min_overlap = 12L,
  itemsets = NULL,
  itemnums = NULL,
  itemlabs = NULL,
  skip = FALSE,
  id = rand_id("mas")
)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose which variables are affected by the step.

role

For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model.

trained

A logical value indicating whether the values used for binarization have been checked.

max_length, min_support, min_all_confidence, min_overlap

Parameters used by the MAS algorithm.

itemsets, itemnums, itemlabs

A named list of itemsets, the numbers of items in each, and the unique items that appear in each. These are NULL until the step is trained by recipes::prep.recipe().

skip

A logical value indicating whether the step should be skipped when the recipe is baked by bake.recipe().

id

A character string that is unique to this step, used to identify it.

Details

step_mas() will construct a collection of binary variables that encode maximal itemsets from within a set-valued attribute using the MAS (Maximal-frequent All-confident pattern Selection) algorithm of Zhong et al. (2020).
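
The quantities behind min_support and min_all_confidence can be illustrated with a minimal base R sketch. It uses the standard definitions of support and all-confidence and is not the package's internal code; the helper names support(), all_confidence() and membership() are hypothetical and exist only for this illustration.

```r
# Toy transactions: each element is one row's set of items
# (the same values as the letter column in the Examples).
transactions <- list(
  c("a", "b", "d"),
  c("a", "b", "c"),
  c("b", "d"),
  c("b"),
  c("a", "b", "d"),
  c("a", "b", "d", "e")
)

# Logical membership (hypothetical helper): TRUE where a row
# contains every item of the itemset.
membership <- function(itemset, transactions) {
  vapply(transactions, function(tr) all(itemset %in% tr), logical(1))
}

# Support (hypothetical helper): fraction of rows containing the itemset.
support <- function(itemset, transactions) {
  mean(membership(itemset, transactions))
}

# All-confidence (hypothetical helper): support of the itemset divided by
# the largest support among its single items.
all_confidence <- function(itemset, transactions) {
  item_support <- vapply(
    itemset,
    function(item) support(item, transactions),
    numeric(1)
  )
  support(itemset, transactions) / max(item_support)
}

support(c("a", "d"), transactions)        # 0.5
all_confidence(c("a", "d"), transactions) # 0.75
membership(c("a", "d"), transactions)     # TRUE FALSE FALSE FALSE TRUE TRUE
```

Under these definitions an itemset passes the thresholds only if it is both frequent (support) and cohesive (all-confidence). The membership vector is the kind of logical variable the step is described as creating.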

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

References

Zhong H, Loukides G, & Gwadera R (2020) "Clustering datasets with demographics and diagnosis codes". Journal of Biomedical Informatics 102, 103360. doi: 10.1016/j.jbi.2019.103360

Examples

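# packages assumed by this example: dplyr for filter() and %>%,
# recipes for recipe(), prep() and bake(), imtidy for step_mas()
library(dplyr)
library(recipes)
library(imtidy)

# toy data set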
toy_data <- data.frame(
  id = LETTERS[seq(21L, 26L)],
  letter = I(list(
    c("a", "b", "d"),
    c("a", "b", "c"),
    c("b", "d"),
    c("b"),
    c("a", "b", "d"),
    c("a", "b", "d", "e")
  )),
  part = rep(c("train", "test"), each = 3L)
)
# each part contains values missing from the other
print(toy_data)

# build preprocessing recipe
toy_data %>%
  filter(part == "train") %>%
  recipe() %>%
  step_mas(letter) %>%
  prep(strings_as_factors = FALSE) ->
  toy_rec

# preprocess training data
bake(toy_rec, new_data = NULL)

# preprocess testing data
toy_data %>%
  filter(part == "test") %>%
  bake(object = toy_rec)
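
The note that each part contains values missing from the other can be checked with base R alone. The sketch below repeats the item sets from the toy data so that it runs on its own; the object names are hypothetical. How bake() treats an item that was absent during training is not stated on this page, so the sketch only identifies those items.

```r
# Item sets copied from toy_data$letter, split by part
# (rows 1-3 are "train", rows 4-6 are "test").
train_sets <- list(c("a", "b", "d"), c("a", "b", "c"), c("b", "d"))
test_sets <- list(c("b"), c("a", "b", "d"), c("a", "b", "d", "e"))

# Unique items seen in each part
train_items <- sort(unique(unlist(train_sets)))
test_items <- sort(unique(unlist(test_sets)))

setdiff(train_items, test_items) # "c": only in the training part
setdiff(test_items, train_items) # "e": only in the testing part
```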

corybrunson/imtidy documentation built on Sept. 15, 2022, 1:11 a.m.