Remove values from categorical variables that do not occur often enough
pipe_categorical_filter(train, response,
  insufficient_occurance_marker = "insignificant_category",
  categorical_columns = colnames(train)[purrr::map_lgl(train,
    is.character)], threshold_function = function(data) 30)
train
    The train dataset, as a data.frame or data.table. Data.tables may be changed by reference.
response
    The response column, as a string. Will only be used to ensure this is not included in the
insufficient_occurance_marker
    The value to substitute when another value in the categorical column doesn't occur often enough. This can be a string, integer or numeric.
categorical_columns
    The columns to apply the filter over. Should be a character vector.
threshold_function
    A function that takes the train dataset and produces a scalar value: the minimum number of times a value has to occur if it should be preserved.
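The default threshold_function ignores its input and always returns 30. As a minimal sketch of a data-dependent alternative, the function below scales the threshold with the number of rows. The name relative_threshold and the 1% rule are assumptions made for illustration and are not part of the package.

```r
# Hypothetical threshold function (not part of the package): require at least
# 30 occurrences, or 1% of the rows, whichever is larger. Like the default, it
# takes the train dataset and returns a single number.
relative_threshold <- function(data) max(30, ceiling(nrow(data) / 100))

small <- data.frame(x = seq_len(500))
large <- data.frame(x = seq_len(10000))

relative_threshold(small)  # 30: 1% of 500 rows is 5, so the floor of 30 applies
relative_threshold(large)  # 100: 1% of 10000 rows
```

A function of this shape could then be supplied as threshold_function = relative_threshold in place of the default.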
Be careful: if, for instance, only one value in a column falls below the threshold, the insufficient_occurance_marker
value simply takes its place and occurs just as rarely, so the problem persists.
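The caveat can be illustrated with a small base R sketch. The helper replace_rare_values is hypothetical and only mimics the idea of the filter on a single character vector; it is not the package's implementation.

```r
# Hypothetical helper (not the package's code): replace every value that
# occurs fewer than `threshold` times with `marker`.
replace_rare_values <- function(values, threshold,
                                marker = "insignificant_category") {
  counts <- table(values)
  rare <- names(counts)[as.vector(counts) < threshold]
  values[values %in% rare] <- marker
  values
}

# Two rare values: "green" (3) and "teal" (2) are pooled into one marker
# category with 5 occurrences, which meets the threshold of 5.
colour <- c(rep("red", 40), rep("blue", 35), rep("green", 3), rep("teal", 2))
filtered <- replace_rare_values(colour, threshold = 5)
table(filtered)

# One rare value: "green" is merely renamed, and the marker category still
# has only 3 occurrences, below the threshold.
colour_single <- c(rep("red", 40), rep("blue", 35), rep("green", 3))
filtered_single <- replace_rare_values(colour_single, threshold = 5)
table(filtered_single)
```

In the second case the column still contains a category that is too rare, so checking the marker's own count after filtering may be worthwhile.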
A list containing the transformed train dataset and a trained pipe.