knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
options(width = 1000)

Dependencies

The functions in this package require the following packages:

install.packages("data.table")    # data manipulation (Everything)
install.packages("e1071")         # skewness function for BoxCox transform (encodings.R)
install.packages("odbc")          # connecting to databases (odbc.R)

Introduction

This vignette shows you the different encodings available in helperFuncs and the effect of different parameters in each function. Normally, these functions would be stored as a package; however, it is (probably) better practice to store the scripts with your project, so it is reproducible and more easily inspectable. You also don't run into dependency problems.

One of the most difficult, painful things in predictive modeling is the introduction of new levels into categorical variables. These functions allow you to handle new levels however you want. These functions are not meant to replace the entire feature engineering process. They should be applied after a proper EDA, with an understanding of how the functions work.

The following encodings are available:

* Missing Value Imputation
* Frequency Encoding
* Rare Level Grouping
* Uniform (Range) Transformation
* Gaussian Scaling (mean centered, stdev adjusted)
* Box Cox Transformation
* Dummy Variables (lightning fast!)

We work with some sample data:

# data.table must be loaded to use these functions.
require(data.table,quietly = TRUE)

# Loads the functions
source("R/encodings.R")

# Loads the data.
catEncoding <- readRDS("Support/Data/catEncoding.RDS")
numericEncodings <- readRDS("Support/Data/numericEncodings.RDS")
catEncoding$floatingPoint <- rnorm(nrow(catEncoding))

The 'floatingPoint' column is meant to be left out of these encodings, since you would normally apply other types of transformations to a numeric column like this. Notice that GarageCars was included, even though it is also a number. Sometimes you may wish to perform these encodings on numeric values, so it is included in this example.

There is also a 'FoundationFactr' variable, which shows that these functions work as intended on factors.


Frequency Encoding

Frequency encoding is a common encoding type for tree models (gradient boosting, random forest). The general premise is that it converts a categorical variable into a number by replacing each value with the number of times that value shows up in the data.
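As a minimal base-R sketch of the idea (the package functions below handle NAs, new levels, and multiple columns for you):

# A toy vector, encoded by replacing each value with its count.
x <- c("a", "b", "b", "c", "c", "c")
counts <- table(x)       # a = 1, b = 2, c = 3
as.integer(counts[x])    # 1 2 2 3 3 3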

First, we use the frequencyEncode function to create a 'freqDefs' object. This object gives you all the information you need to know about how your data will be encoded:

# Which variables do we want to encode - Remember we don't want to encode floatingPoint
catVars <- c("Foundation","FireplaceQu","GarageCars","Street","FoundationFactr")

# Create the freqDefs object
freqEncod_TRUE <- frequencyEncode(catEncoding,catVars,encodeNA = TRUE, allowNewLevels = TRUE)
freqEncod_FALSE <- frequencyEncode(catEncoding,catVars,encodeNA = FALSE, allowNewLevels = FALSE)

Searching through these objects, you will find the parameters used to make the object, as well as some 'tables'. The enc column is the value that will replace each level when you apply this transformation.

freqEncod_TRUE$tables

Check out the freqEncod_FALSE$tables object to see how encodeNA and allowNewLevels affected the tables:
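freqEncod_FALSE$tables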


Applying The Frequency Encoding

The frequencyEncode function does not return a dataset on purpose. Often, you will want to apply the same encoding to multiple datasets; this is essentially a required setup if you want to run your model on new samples in the future. If frequencyEncode simply returned an encoded dataset, you would get different results if you ran it again on a different dataset in the future.

Therefore, good practice is to save your encoding objects as an RDS so they can be used again.
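For example (the file path here is purely illustrative):

# Save the encoding object so the exact same mapping can be reused later.
# The path is an example - use whatever location fits your project.
saveRDS(freqEncod_TRUE, "Support/Encodings/freqEncod_TRUE.RDS")
freqEncod_TRUE <- readRDS("Support/Encodings/freqEncod_TRUE.RDS")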

To apply the encoding, simply use applyEncoding:

freqDT_TRUE <- applyEncoding(catEncoding,freqEncod_TRUE, inPlace = FALSE)
freqDT_FALSE <- applyEncoding(catEncoding,freqEncod_FALSE, inPlace = FALSE)


# Notice the difference that encodeNA made.
data.table(
    "freqDT_TRUE$FireplaceQu" = freqDT_TRUE$FireplaceQu
  , "   freqDT_FALSE$FireplaceQu" = freqDT_FALSE$FireplaceQu
)

You can also replace the columns in your original dataset by specifying inPlace = TRUE:

freqDT_inPlace <- applyEncoding(catEncoding,freqEncod_TRUE, inPlace = TRUE)
print(freqDT_inPlace)


Effect of New Levels, and How to Use the allowNewLevels Parameter

You'll notice that we set the allowNewLevels parameter in the above frequencyEncode functions. Here is what happens if you encounter new levels on a future dataset, and try to apply the encoding:

# Add a row with a bunch of new levels
catEncWithNewLevels <- rbindlist(list(catEncoding,list("New","TA",10,"Pave","New",0)))

# When allowNewLevels = TRUE, a warning is thrown.
freqDT_TnewLevels <- applyEncoding(catEncWithNewLevels,freqEncod_TRUE)

# When allowNewLevels = FALSE, an error is thrown. It is commented out here,
# since an error would stop the vignette from building.
# freqDT_FnewLevels <- applyEncoding(catEncWithNewLevels,freqEncod_FALSE)


Rare Variable Encoding

Grouping uncommon levels together is good practice. It is cumbersome to do manually, so this function unintelligently groups all uncommon categorical levels into one new group. The syntax is very similar to frequency encoding. If you want to do any other kind of specific grouping, it is better to do that separately.

Here, we group together all levels that make up less than 5% of the data we are using. It's up to you to be smart about what percentage you use.
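One quick way to sanity-check a threshold (plain base R, not part of the package API) is to look at the level proportions directly:

# Level proportions for one variable - anything under 0.05 will be grouped.
sort(prop.table(table(catEncoding$FireplaceQu, useNA = "ifany")))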

# Create the rareDefs object
rareEncod_TRUE <- rareEncode(catEncoding,catVars,minPerc = 0.05, encodeNA = TRUE, allowNewLevels = TRUE)
rareEncod_FALSE <- rareEncode(catEncoding,catVars,minPerc = 0.05, encodeNA = FALSE, allowNewLevels = FALSE)

You can view exactly how the algorithm will group these variables by looking at the 'tables' list again:

rareEncod_TRUE$tables

We can apply this to the data.table the same way as our freqDefs object:

rareDT_TRUE <- applyEncoding(catEncoding,rareEncod_TRUE)
rareDT_FALSE <- applyEncoding(catEncoding,rareEncod_FALSE)

# Notice that encodeNA did nothing in this case. That's because NAs are not rare, so they do not need to be encoded. If they _were_ rare, then the algorithm would have grouped them into 'rareGroup' if encodeNA = TRUE.
data.table(
    "rareDT_T$FireplaceQu" = rareDT_TRUE$FireplaceQu
  , "   rareDT_F$FireplaceQu" = rareDT_FALSE$FireplaceQu
)
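You can verify that NAs are not rare in this column (a quick base-R check, not part of the package):

# Proportion of NA in FireplaceQu - well above the 5% cutoff, so not rare.
mean(is.na(catEncoding$FireplaceQu))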


Dummy Variables

Dummy variables are created when you one-hot encode your data: one Boolean column is made for every possible value of a variable. This can result in a massive dataset, many times the size of your original set, depending on the number of levels, missingness, data types, etc.
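For intuition, base R's model.matrix does the same thing on a single factor (shown here only to illustrate the concept):

# Base-R illustration of one-hot encoding a single factor.
# The "- 1" removes the intercept so every level gets its own 0/1 column.
f <- factor(c("brick", "stone", "brick", "wood"))
model.matrix(~ f - 1)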

dummyEncode provides an efficient way to do this. This is a more complex encoding strategy, with several options.

dummyEnc1 <- dummyEncode(catEncoding,catVars,"newLevel")

dummyDT1 <- applyEncoding(catEncoding,dummyEnc1)

# Look at how 1 variable was transformed, instead of printing all of the columns
dummyDT1[,dummyEnc1$lvlNames$FireplaceQu,with = FALSE]

This implementation is much more efficient than the caret implementation for large datasets. It is not more efficient for very small datasets, but who cares about speed when you're talking about 0.01 seconds.

# Our implementation
dt2 <- catEncoding[sample(1:nrow(catEncoding),size=10000000,replace=TRUE)]
system.time(
   {
      dummyEnc1 <- dummyEncode(dt2, catVars)
      dummyDT1 <- applyEncoding(dt2, dummyEnc1)
   }
)

# caret's implementation
library(caret, quietly = TRUE) # The caret package must be loaded to use dummyVars
system.time(
  {
    caretDummy <- dummyVars(~.,dt2[,catVars,with=FALSE])
    caretDumDT <- predict(caretDummy,dt2)

    # Predict returns a named list of vectors... We still need to convert this to a data.table.
    # This is where caret underperforms.
    caretDumDT <- as.data.table(caretDumDT)
  }
)

The result from this parameter combination is less obvious:

dummyEnc2 <- dummyEncode(catEncoding,catVars,treatNA = "ghost", values = c(-1,1))

dummyDT2 <- applyEncoding(catEncoding,dummyEnc2)

dummyDT2[,dummyEnc2$lvlNames$FireplaceQu,with = FALSE]

Setting treatNA = 'ghost' causes any NA values to receive the 'false' encoding in every dummy column; typically that would be 0. In this case, values = c(-1,1) specifies that the 'false' encoding should be -1 instead of 0. You won't need to do this often; it just shows the combination of effects.
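To see the effect directly (this assumes applyEncoding preserves the original row order, as these examples suggest):

# Rows where FireplaceQu was NA should show -1 in every one of its dummy columns.
dummyDT2[is.na(catEncoding$FireplaceQu), dummyEnc2$lvlNames$FireplaceQu, with = FALSE]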


Distribution Adjustments

Sometimes, distributions matter. If you are using almost anything besides a tree model, you probably want your data to fall within a certain range and have a well-behaved distribution.

Gaussian (normal) Scaling

# Which variables do we want to encode in numericEncodings
numerVars <- c("LotFrontage","LotArea","GarageCars","BsmtFinSF2")

# Create encoding objects
gaussEncStNorm <- gaussianEncode(numericEncodings,numerVars)
gaussEncCustom <- gaussianEncode(numericEncodings,numerVars,newMean = 2,newSD = 2)

# Apply encoding objects
gaussStNormDT <- applyEncoding(numericEncodings,gaussEncStNorm)
gaussCustomDT <- applyEncoding(numericEncodings,gaussEncCustom)

# Show mean and SD of each
lapply(gaussStNormDT[,gaussEncStNorm$vars,with=FALSE],function(x) c(mean = mean(x,na.rm=TRUE),SD = sd(x,na.rm=TRUE)))
lapply(gaussCustomDT[,gaussEncCustom$vars,with=FALSE],function(x) c(mean = mean(x,na.rm=TRUE),SD = sd(x,na.rm=TRUE)))
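For reference, the underlying arithmetic is ordinary standardization followed by rescaling (a base-R sketch, assuming gaussianEncode uses the sample mean and standard deviation):

# Equivalent base-R computation for one column:
x <- numericEncodings$BsmtFinSF2
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # standard normal
custom <- z * 2 + 2                                     # newSD = 2, newMean = 2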

You can see the differences that newMean and newSD make:

require(ggplot2, quietly = TRUE)
gaussPlotDT <- melt(data.table(StandardNormal = gaussStNormDT$BsmtFinSF2, Custom = gaussCustomDT$BsmtFinSF2), measure.vars = c("StandardNormal","Custom"))
ggplot(gaussPlotDT[!is.na(value)], aes(x = value, color = variable)) +geom_density() + xlab("Gaussian Transformed BsmtFinSF2")


Box Cox Transformation

This transformation is useful if your data is skewed. Outliers can play hell with a model, and the Box-Cox transformation can reduce their negative effects. This transformation is a little complex, and inherently risky, since passing values <= 0 will cause Box-Cox to fail.
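For reference, the one-parameter Box-Cox transformation looks like this (a sketch; boxCoxEncode estimates the lambda for each column):

# One-parameter Box-Cox: (x^lambda - 1) / lambda, or log(x) when lambda is 0.
# Anything <= 0 breaks it, which is why the safeguards below exist.
boxCox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}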

# Create the box-cox object with default parameters
boxCoxEnc <- boxCoxEncode(numericEncodings,numerVars)

# Apply the transformation to numericEncodings
boxCoxDT <- applyEncoding(numericEncodings,boxCoxEnc)

# Plot results on LotArea
require(gridExtra,quietly = TRUE)
p1 <- ggplot(boxCoxDT[!is.na(LotArea)], aes(x = LotArea)) + geom_density() + ggtitle("Transformed With Box-Cox  |  Skewness = 0")
p2 <- ggplot(numericEncodings[!is.na(LotArea)], aes(x = LotArea)) + geom_density() + ggtitle("Original Data  |  Skewness = 12.18")
grid.arrange(p2,p1, ncol = 1)

The minimum LotArea in the original data is 1300. Now imagine if, in the future, we start passing unknown values as -1. All of a sudden, our Box-Cox fails, and that can be dangerous. There are two parameters in this function that allow you to be more risk averse. The first is minNormalize, which represents the number of standard deviations you wish the lower bound of your data to be above 0. Take LotArea, for example: if you set minNormalize to 0.5 (a very large value), the formula would shift your values over by 0.5 * sd(LotArea) - min(LotArea). Shifting the values does not affect the reasonableness of the distribution too much, since you are already shifting in the Box-Cox transformation.
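In code, the shift described above would be (a sketch of the arithmetic, not the package's internals):

# With minNormalize = 0.5, the shift applied to LotArea would be:
shift <- 0.5 * sd(numericEncodings$LotArea, na.rm = TRUE) -
  min(numericEncodings$LotArea, na.rm = TRUE)
# After adding 'shift', the minimum sits 0.5 standard deviations above 0.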

The second parameter that reduces riskiness is capNegPredOutliers. This min-caps incoming values at a certain number of standard deviations above 0. It should be lower than or equal to minNormalize, or else negative values will be capped above the original minimum value of your variable.

boxCoxEncCust <- boxCoxEncode(
    numericEncodings
  , numerVars
  , minNormalize = list(LotFrontage = 1, LotArea = 0.05, GarageCars = 0.05, BsmtFinSF2 = 0.05)
  , capNegPredOutliers = 0.02)

# Apply the custom encoding (note: boxCoxEncCust, not the default object)
boxCoxDT <- applyEncoding(numericEncodings,boxCoxEncCust)

# Minimums of the original data
lapply(numericEncodings[,numerVars,with=FALSE],min,na.rm=TRUE)
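For comparison, here are the minimums after the custom transformation, computed the same way:

# Minimums after the custom Box-Cox encoding
lapply(boxCoxDT[,numerVars,with=FALSE],min,na.rm=TRUE)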

