pipe_categorical_filter: Remove values from categorical variables that do not occur...


Description

Removes values from categorical variables that do not occur often enough, replacing them with a marker value.

Usage

pipe_categorical_filter(train, response,
  insufficient_occurance_marker = "insignificant_category",
  categorical_columns = colnames(train)[purrr::map_lgl(train, is.character)],
  threshold_function = function(data) 30)

Arguments

train

The train dataset, as a data.frame or data.table. Data.tables may be changed by reference.

response

The response column, as a string. Only used to ensure that this column is not included in categorical_columns.

insufficient_occurance_marker

The value to substitute when another value in the categorical column doesn't occur often enough. This can be a string, integer or numeric.

categorical_columns

The columns to apply the filter over. Should be a character vector.

threshold_function

A function that takes the train dataset and returns a scalar: the minimum number of times a value has to occur in order to be preserved.
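As an illustration of what a threshold_function can look like, the sketch below defines two candidates and evaluates them on a small made-up dataset. The data and the relative_threshold function are assumptions for this example only; fixed_threshold mirrors the default shown under Usage.

```r
# Hypothetical example data: 1000 rows with one character column.
train <- data.frame(
  colour = c(rep("red", 600), rep("blue", 395), rep("teal", 5)),
  y = seq_len(1000),
  stringsAsFactors = FALSE
)

# Fixed threshold, the same as the default in the usage signature.
fixed_threshold <- function(data) 30

# Relative threshold (an assumption for illustration): 1% of the rows,
# rounded up, so the cut-off scales with the size of the dataset.
relative_threshold <- function(data) ceiling(nrow(data) / 100)

fixed_threshold(train)     # 30
relative_threshold(train)  # 10
```

A relative threshold may be preferable when the same pipeline is applied to datasets of very different sizes, since a fixed count of 30 means something different for 100 rows than for a million.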

Details

Be careful: if only one value in a column falls below the threshold, the insufficient_occurance_marker value simply takes its place and occurs just as rarely, so the problem of a rare category remains.
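The caveat can be reproduced with a few lines of base R. The helper replace_rare below is hypothetical and is not the datapiper implementation; it only mimics the documented behaviour, assuming that values occurring fewer than threshold times are replaced by the marker.

```r
# Hypothetical helper, NOT the datapiper implementation: replace every value
# that occurs fewer than `threshold` times with `marker`.
replace_rare <- function(x, threshold, marker = "insignificant_category") {
  counts <- table(x)
  rare <- names(counts)[counts < threshold]
  x[x %in% rare] <- marker
  x
}

# Only "teal" (5 occurrences) falls below the threshold of 30.
colour <- c(rep("red", 60), rep("blue", 40), rep("teal", 5))
filtered <- replace_rare(colour, threshold = 30)

# The marker now occurs 5 times: it has merely renamed the rare category.
table(filtered)
```

When two or more rare values are merged into the marker, their counts add up, which is the situation in which the filter is actually useful.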

Value

A list containing the transformed train dataset and a trained pipe.
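A minimal call might look like the sketch below. The example data is made up, and the names of the elements in the returned list are not stated in this page, so the sketch inspects the result with str() rather than assuming them. The call is skipped when the datapiper package is not installed.

```r
# Hypothetical example data with one character column and a response column.
train <- data.frame(
  colour = c(rep("red", 60), rep("blue", 40), rep("teal", 5)),
  y = seq_len(105),
  stringsAsFactors = FALSE
)

if (requireNamespace("datapiper", quietly = TRUE)) {
  # Uses the defaults from the usage signature for all other arguments.
  result <- datapiper::pipe_categorical_filter(train = train, response = "y")

  # Inspect the top-level structure of the returned list.
  str(result, max.level = 1)
}
```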


jeroenvdhoven/datapiper documentation built on July 14, 2019, 9:34 p.m.