encode_freq: encode_freq


View source: R/encode_freq.R

Description

This function takes one or more categorical variables and transforms them using frequency encoding. The first level (1) is reserved for unknown values, the second level (2) is the most frequent level, the third level (3) is the second most frequent level, and so on. Small levels can be grouped together using the n_levels argument; the grouped level is placed at the end.
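The ordering described above can be illustrated with a minimal base R sketch. The helper freq_encode_sketch is hypothetical and is not the admr implementation: it assumes the encoded values are integer codes, treats only NA and "" as unknown, and omits the grouping of small levels.

```r
# Hypothetical helper for illustration only; not the admr implementation.
# Code 1 is reserved for unknown values, code 2 is the most frequent level,
# code 3 the second most frequent, and so on.
freq_encode_sketch <- function(x, unknown = c("")) {
  x <- as.character(x)
  is_unknown <- is.na(x) | x %in% unknown
  # Count the known values and order the levels by decreasing frequency
  counts <- sort(table(x[!is_unknown]), decreasing = TRUE)
  levels <- c("Unknown", names(counts))
  # Shift by one so that code 1 stays free for "Unknown"
  codes <- match(x, names(counts)) + 1L
  codes[is_unknown] <- 1L
  list(data = codes, levels = levels)
}

pets <- c(rep("cat", 2), rep("dog", 3), rep("fish", 4), "llama", NA)
res <- freq_encode_sketch(pets)
res$levels  # "Unknown" "fish" "dog" "cat" "llama"
res$data    # 4 4 3 3 3 2 2 2 2 5 1
```

Here "fish" is the most frequent value and so receives code 2, while the single NA is mapped to code 1.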

Usage

encode_freq(
  data,
  n_levels = NULL,
  min_level_count = NULL,
  unknown_levels = NULL,
  unknown_treatment_method = 1
)

Arguments

data

vector or dataframe - data to frequency encode. For a dataframe, the columns are encoded.

n_levels

numeric or named list of numerics - The maximum number of categorical levels to include (note that "Unknown" and "Other" are always added). If a named list is given, the names must match the dataframe colnames.

min_level_count

numeric or named list of numerics - The minimum number of instances a level needs in order to be counted (this is stricter than a given value for n_levels). If a named list is given, the names must match the dataframe colnames.

unknown_levels

vector[String] or named list of vector[String] - These values will be treated as unknown. NA and "" are always treated as unknown. If a named list is given, the names must match the dataframe colnames.

unknown_treatment_method

numeric or named list of numerics - Must be 1. Intended to give the option to treat unknowns differently; not yet implemented. If a named list is given, the names must match the dataframe colnames.
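One way to read n_levels and min_level_count together is sketched below in base R. The helper group_small_levels is hypothetical and is not the admr implementation. It assumes that n_levels counts levels excluding "Unknown" and "Other", that a level must satisfy both limits to be kept, and that every other known value falls into a trailing "Other" level.

```r
# Hypothetical helper for illustration only; not the admr implementation.
group_small_levels <- function(x, n_levels = NULL, min_level_count = NULL) {
  x <- as.character(x)
  is_unknown <- is.na(x) | x == ""
  counts <- sort(table(x[!is_unknown]), decreasing = TRUE)
  keep <- names(counts)
  # Drop levels with too few instances (assumed meaning of min_level_count)
  if (!is.null(min_level_count)) keep <- keep[counts[keep] >= min_level_count]
  # Keep at most n_levels of the remaining, most frequent levels
  if (!is.null(n_levels)) keep <- head(keep, n_levels)
  levels <- c("Unknown", keep, "Other")
  codes <- match(x, keep) + 1L
  codes[is.na(codes)] <- length(levels)  # everything not kept becomes "Other"
  codes[is_unknown] <- 1L                # unknown values always get code 1
  list(data = codes, levels = levels)
}

pets <- c(rep("cat", 2), rep("dog", 3), rep("fish", 4), "llama", NA)
grouped <- group_small_levels(pets, n_levels = 2)
grouped$levels  # "Unknown" "fish" "dog" "Other"
grouped$data    # 4 4 3 3 3 2 2 2 2 4 1
```

In this sketch, calling group_small_levels(pets, min_level_count = 3) keeps the same two levels, because only "fish" and "dog" occur at least three times.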

Value

list(data, levels) - data is the transformed data, in the same shape as the input data. levels is a vector (if data is a vector) or a named list of vectors (if data is a dataframe) containing the order of the categorical levels.
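Because levels records the order of the categories, it can be used to map encoded values back to labels. The sketch below uses a hand-written, hypothetical result rather than real encode_freq output, and assumes that data holds integer codes that index into levels.

```r
# Hypothetical result with the documented list(data, levels) shape.
# The values are hand-written for illustration, not produced by encode_freq.
res <- list(
  data = c(4L, 3L, 2L, 2L, 1L),
  levels = c("Unknown", "fish", "dog", "cat")
)

# Vector case: index the level labels by the integer codes
decoded <- res$levels[res$data]
decoded  # "cat" "dog" "fish" "fish" "Unknown"

# Dataframe case: one levels vector per column, matched by column name
res_df <- list(
  data = data.frame(pet = c(2L, 3L, 1L), letter = c(2L, 2L, 3L)),
  levels = list(pet = c("Unknown", "fish", "dog"),
                letter = c("Unknown", "a", "b"))
)
decoded_df <- as.data.frame(
  Map(function(codes, lv) lv[codes],
      res_df$data, res_df$levels[names(res_df$data)]),
  stringsAsFactors = FALSE
)
```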

Examples

# Encode a single character vector (the NA is treated as unknown)
data_in <- c(rep("cat", 2), rep("dog", 3), rep("fish", 4), "llama", NA)
encode_freq(data = data_in)

# Encode the columns of a dataframe
data_in_df <- data.frame(
  pet = c(rep("cat", 2), rep("dog", 3), rep("fish", 4), "llama", NA),
  letter = c(rep("a", 5), rep("b", 5), "c")
)

encode_freq(data = data_in_df)

gloverd2/admr documentation built on Dec. 2, 2020, 11:16 p.m.