knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())

Introduction

This vignette shows the different encodings available in helperFuncs and the effect of the parameters of each function. Normally, these functions would be stored as a package. However, it is (probably) better practice to store the scripts with your project, so that the project is reproducible and more easily inspectable.

We work with some sample data:

# Loads the functions
#source("R/encodings.R")

getwd()

# Loads the data.
#dt <- readRDS("Data/catEncoding.RDS")
#dt$floatingPoint <- rnorm(nrow(dt))

The 'floatingPoint' column is meant to be left out from these encodings, since you would normally apply other types of transformations on a numeric column like this. Notice that GarageCars was included, even though it is also a number. Sometimes you may wish to perform these encodings on numeric values, so it is included in this example.

Frequency Encoding

Frequency encoding is a common encoding type for tree models (gradient boosting, random forest). The general premise is that it converts a categorical variable into a number by replacing each value with the number of times that value shows up in the data.
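The idea can be sketched in a few lines of base R. This is only an illustration on made-up data, not the helperFuncs implementation: the data frame `df`, the column `Neighborhood`, and the output column `Neighborhood_freq` are all hypothetical names.

```r
# Toy data (hypothetical; not the vignette's catEncoding data).
df <- data.frame(
  Neighborhood = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Count how many times each value occurs in the column.
freq <- table(df$Neighborhood)

# Replace each value with its count by looking it up by name.
df$Neighborhood_freq <- as.integer(freq[as.character(df$Neighborhood)])

print(df)
```

In practice the counts would be computed on the training data and then reused to encode new data, so values unseen during training need a fallback such as 0 or NA.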

Rare Variable Encoding

Dummy Variables

Dummy variables are created when you one-hot encode your data. This requires making one boolean column for every possible value of a variable. The result can be a massive dataset, many times the size of the original, depending on missingness, data types, and so on.
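As a rough illustration of how the column count grows, the sketch below one-hot encodes a single factor with base R's `model.matrix()`. It is not the helperFuncs function; the data frame `df` and its `GarageType` values are invented for the example.

```r
# Toy data (hypothetical): one categorical column with three distinct values.
df <- data.frame(
  GarageType = factor(c("Attchd", "Detchd", "Attchd", "BuiltIn"))
)

# "- 1" drops the intercept, so every level gets its own 0/1 column.
dummies <- model.matrix(~ GarageType - 1, data = df)

# One column per level: a single input column became three.
print(colnames(dummies))
print(dim(dummies))
```

Each row has exactly one 1 across the new columns. A variable with hundreds of distinct values would produce hundreds of columns, which is where the growth in size comes from.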



AnotherSamWilson/helperFuncs documentation built on Oct. 1, 2019, 8:51 p.m.