knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
This vignette shows the different encodings available in helperFuncs and the effect of the different parameters in each function. Normally, these functions would be stored as a package. However, it is (probably) better practice to store the scripts with your project, so that it is reproducible and more easily inspectable.
We work with some sample data:
# Loads the functions
#source("R/encodings.R")
getwd()
# Loads the data.
#dt <- readRDS("Data/catEncoding.RDS")
#dt$floatingPoint <- rnorm(nrow(dt))
The 'floatingPoint' column is meant to be left out of these encodings, since you would normally apply other types of transformations to a numeric column like this. Notice that GarageCars is included, even though it is also a number. Sometimes you may wish to perform these encodings on numeric values, so it is part of this example.
Frequency encoding is a common encoding type for tree models (gradient boosting, random forest). The general premise is that it converts a categorical variable into a number by replacing each value with the number of times that value shows up in the data.
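The idea can be sketched in a few lines of base R. This is only an illustration, not the helperFuncs implementation: the function name `freq_encode`, the toy data, and the `Foundation` column are all hypothetical, and the package's own function may take different parameters.

```r
# Hypothetical toy data; only GarageCars is a column name taken from the vignette.
dt <- data.frame(
  Foundation = c("PConc", "CBlock", "PConc", "Slab", "PConc", "CBlock"),
  GarageCars = c(2, 1, 2, 0, 2, 1),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of helperFuncs):
# count how often each value occurs, then map every row to its value's count.
freq_encode <- function(x) {
  counts <- table(x)                   # occurrences per distinct value
  as.integer(counts[as.character(x)])  # look up each row's count by name
}

dt$Foundation_freq <- freq_encode(dt$Foundation)
dt$GarageCars_freq <- freq_encode(dt$GarageCars)  # also works on numeric columns
```

Here "PConc" appears three times, so every "PConc" row is replaced by 3. Missing values come through as `NA` in this sketch, because `table()` does not count them by default.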
Dummy variables are created when you one-hot encode your data. This requires one boolean column for every possible value of a variable. The resulting dataset can be many times the size of the original, depending on the number of distinct values, missingness, data types, and so on.
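To make the growth in width concrete, here is a minimal sketch using base R's `model.matrix()`. It is not the helperFuncs function, and the `Foundation` column and its values are made up for illustration.

```r
# Hypothetical toy data with a single categorical column.
dt <- data.frame(
  Foundation = factor(c("PConc", "CBlock", "PConc", "Slab"))
)

# "- 1" drops the intercept, so every level gets its own 0/1 column.
dummies <- model.matrix(~ Foundation - 1, data = dt)

colnames(dummies)  # one column per level of Foundation
dim(dummies)       # same number of rows, one column per level
```

One column with three levels becomes three columns, so a variable with hundreds of levels adds hundreds of columns. Note that with R's default `na.action`, `model.matrix()` drops rows that contain missing values, so missingness needs to be handled separately if those rows must be kept.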