Warning: this package contains binary RDS files in the Support/Data folder. These hold tabular data used to showcase the functionality of this package. I purposely avoid S3 methods, because I want to source this script for its functions rather than actually use it as a package; I only set it up as a package for documentation and easy dependency management.
This package includes helper functions that I commonly use while modeling. So far, it includes the encoding helpers demonstrated below.
All transformations return encoding objects that are then applied to a dataset (except na-encode, where there is no point). This lets me apply the same transformation to every dataset for a given model, which is surprisingly tricky to do by hand.
# data.table must be loaded to use these functions.
require(data.table, quietly = TRUE)
# Loads the functions
source("R/encodings.R")
# Loads the data.
catEncoding <- readRDS("Support/Data/catEncoding.RDS")
numericEncodings <- readRDS("Support/Data/numericEncodings.RDS")
catEncoding$floatingPoint <- rnorm(nrow(catEncoding))
The ‘floatingPoint’ column is meant to be left out of these encodings, since you would normally apply other types of transformations to a numeric column like this. Notice that GarageCars is included even though it is also numeric; sometimes you may wish to perform these encodings on numeric values, so it is included in this example.
There is also a ‘FoundationFactr’ variable, which shows that these functions work as intended on factors.
Frequency encoding is a common encoding type for tree models (gradient boosting, random forest). The general premise is that it converts a categorical variable into a number by replacing each level with a value based on how often that level shows up in the data (as the tables below show, this implementation uses the level's rank by frequency rather than the raw count).
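As a toy illustration of the premise (this is just the idea, not this package's implementation):
# Rank each level by how often it appears; the most common level gets the highest rank
toy <- data.table(color = c("red", "red", "blue", "green", "red", "blue"))
toyCounts <- toy[, .(freq = .N), by = color]
toyCounts[, enc := frank(freq, ties.method = "dense")]
print(toyCounts)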
First, we use the frequencyEncode function to create a ‘freqDefs’ object. This object gives you all the information you need to know about how your data will be encoded:
# Which variables do we want to encode - remember, we don't want to encode floatingPoint
catVars <- c("Foundation", "FireplaceQu", "GarageCars", "Street", "FoundationFactr")
# Create the freqDefs object
freqEncod_TRUE <- frequencyEncode(catEncoding, catVars, encodeNA = TRUE, allowNewLevels = TRUE)
freqEncod_FALSE <- frequencyEncode(catEncoding, catVars, encodeNA = FALSE, allowNewLevels = FALSE)
Searching through these objects, you will find the parameters used to make the object, as well as some ‘tables’. The enc column is the value that will replace each level when you apply this transformation.
freqEncod_TRUE$tables
## $Foundation
## freq Foundation enc
## 1: 647 PConc 6
## 2: 634 CBlock 5
## 3: 146 BrkTil 4
## 4: 24 Slab 3
## 5: 6 Stone 2
## 6: 3 Wood 1
## 7: 0 <NA> 0
## 8: 0 __NEWLEVEL__ -1
##
## $FireplaceQu
## freq FireplaceQu enc
## 1: 690 <NA> 6
## 2: 380 Gd 5
## 3: 313 TA 4
## 4: 33 Fa 3
## 5: 24 Ex 2
## 6: 20 Po 1
## 7: 0 __NEWLEVEL__ -1
##
## $GarageCars
## freq GarageCars enc
## 1: 824 2 5
## 2: 369 1 4
## 3: 181 3 3
## 4: 81 0 2
## 5: 5 4 1
## 6: 0 <NA> 0
## 7: 0 __NEWLEVEL__ -1
##
## $Street
## freq Street enc
## 1: 1454 Pave 2
## 2: 6 Grvl 1
## 3: 0 <NA> 0
## 4: 0 __NEWLEVEL__ -1
##
## $FoundationFactr
## freq FoundationFactr enc
## 1: 647 PConc 6
## 2: 634 CBlock 5
## 3: 146 BrkTil 4
## 4: 24 Slab 3
## 5: 6 Stone 2
## 6: 3 Wood 1
## 7: 0 <NA> 0
## 8: 0 __NEWLEVEL__ -1
Check out the freqEncod_FALSE$tables object to see how encodeNA and allowNewLevels affected the tables.
The frequencyEncode function does not return a dataset, on purpose. Often, you will want to apply the same encoding to multiple datasets; this is essentially required if you want to run your model on new samples in the future. If frequencyEncode simply returned an encoded dataset, you would get different results when you ran it again on a different dataset in the future.
Therefore, good practice is to save your encoding objects as an RDS so they can be used again.
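For example (the file path is illustrative, and newData stands in for a future sample with the same columns):
# Fit the encoding once, persist it, and reuse it in future scoring runs
saveRDS(freqEncod_TRUE, "Support/Data/freqEncod_TRUE.RDS")
# Later, in a scoring script:
# freqEncod_TRUE <- readRDS("Support/Data/freqEncod_TRUE.RDS")
# scoredDT <- applyEncoding(newData, freqEncod_TRUE)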
To apply the encoding, simply use applyEncoding:
freqDT_TRUE <- applyEncoding(catEncoding, freqEncod_TRUE, inPlace = FALSE)
freqDT_FALSE <- applyEncoding(catEncoding, freqEncod_FALSE, inPlace = FALSE)
# Notice the difference that encodeNA made.
data.table(
    "freqDT_TRUE$FireplaceQu" = freqDT_TRUE$FireplaceQu
  , "freqDT_FALSE$FireplaceQu" = freqDT_FALSE$FireplaceQu
)
## freqDT_TRUE$FireplaceQu freqDT_FALSE$FireplaceQu
## 1: 6 NA
## 2: 4 4
## 3: 4 4
## 4: 5 5
## 5: 4 4
## ---
## 1456: 4 4
## 1457: 4 4
## 1458: 5 5
## 1459: 6 NA
## 1460: 6 NA
You can also just replace the columns in your original dataset by specifying inPlace = TRUE:
freqDT_inPlace <- applyEncoding(catEncoding,freqEncod_TRUE, inPlace = TRUE)
print(freqDT_inPlace)
## Foundation FireplaceQu GarageCars Street FoundationFactr
## 1: 6 6 5 2 6
## 2: 5 4 5 2 5
## 3: 6 4 5 2 6
## 4: 4 5 3 2 4
## 5: 6 4 3 2 6
## ---
## 1456: 6 4 5 2 6
## 1457: 5 4 5 2 5
## 1458: 2 5 4 2 2
## 1459: 5 6 4 2 5
## 1460: 5 6 4 2 5
## floatingPoint
## 1: -0.06709947
## 2: 0.22028390
## 3: 2.34340690
## 4: 0.05794125
## 5: 1.69450183
## ---
## 1456: -0.15630265
## 1457: -0.05790603
## 1458: 0.79469596
## 1459: 1.13994491
## 1460: 0.08397666
You’ll notice that we set the allowNewLevels parameter in the frequencyEncode calls above. Here is what happens if you encounter new levels in a future dataset and try to apply the encoding:
# Add a row containing levels the encoding has never seen
catEncWithNewLevels <- rbindlist(list(catEncoding, list("New", "TA", 10, "Pave", "New", 0)))
# When allowNewLevels = TRUE, a warning is thrown.
freqDT_TnewLevels <- applyEncoding(catEncWithNewLevels,freqEncod_TRUE)
## Warning in applyRFEncoding(dt, obj, inPlace): WARNING: NEW LEVEL DETECTED
## IN VARIABLE Foundation. allowNewLevels IS SET TO TRUE, SO THESE WILL BE
## ENCODED AS newString or -1.
## Warning in applyRFEncoding(dt, obj, inPlace): WARNING: NEW LEVEL DETECTED
## IN VARIABLE GarageCars. allowNewLevels IS SET TO TRUE, SO THESE WILL BE
## ENCODED AS newString or -1.
## Warning in applyRFEncoding(dt, obj, inPlace): WARNING: NEW LEVEL DETECTED
## IN VARIABLE FoundationFactr. allowNewLevels IS SET TO TRUE, SO THESE WILL
## BE ENCODED AS newString or -1.
# When allowNewLevels = FALSE, an error is thrown. It can't be shown here, or the vignette wouldn't build.
# freqDT_FnewLevels <- applyEncoding(catEncWithNewLevels, freqEncod_FALSE)
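If a scoring pipeline needs to survive that error instead of stopping, one option is to trap it yourself. This is just a sketch; the exact error message comes from the package:
# Trap the error thrown when allowNewLevels = FALSE, so the pipeline can fail gracefully
freqDT_FnewLevels <- tryCatch(
  applyEncoding(catEncWithNewLevels, freqEncod_FALSE)
  , error = function(e) {
      message("Encoding failed: ", conditionMessage(e))
      NULL
    }
)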
Grouping uncommon levels together is good practice. It is cumbersome to do manually, so this function unintelligently groups all uncommon categorical levels into a single new group. The syntax is very similar to frequency encoding. If you want to do any other kind of specific grouping, it is better to do that separately.
Here, we group together all levels that make up less than 5% of the data we are using. It’s up to you to be smart about what percentage you use.
# Create the rareDefs object
rareEncod_TRUE <- rareEncode(catEncoding, catVars, minPerc = 0.05, encodeNA = TRUE, allowNewLevels = TRUE)
rareEncod_FALSE <- rareEncode(catEncoding, catVars, minPerc = 0.05, encodeNA = FALSE, allowNewLevels = FALSE)
You can view exactly how the algorithm will group these variables by looking at the ‘tables’ list again:
rareEncod_TRUE$tables
## $Foundation
## freq Foundation enc
## 1: 0.443150685 PConc PConc
## 2: 0.434246575 CBlock CBlock
## 3: 0.100000000 BrkTil BrkTil
## 4: 0.016438356 Slab rareGroup
## 5: 0.004109589 Stone rareGroup
## 6: 0.002054795 Wood rareGroup
## 7: 0.000000000 <NA> rareGroup
## 8: 0.000000000 __NEWLEVEL__ rareGroup
##
## $FireplaceQu
## freq FireplaceQu enc
## 1: 0.47260274 <NA> <NA>
## 2: 0.26027397 Gd Gd
## 3: 0.21438356 TA TA
## 4: 0.02260274 Fa rareGroup
## 5: 0.01643836 Ex rareGroup
## 6: 0.01369863 Po rareGroup
## 7: 0.00000000 __NEWLEVEL__ rareGroup
##
## $GarageCars
## freq GarageCars enc
## 1: 0.564383562 2 2
## 2: 0.252739726 1 1
## 3: 0.123972603 3 3
## 4: 0.055479452 0 0
## 5: 0.003424658 4 rareGroup
## 6: 0.000000000 <NA> rareGroup
## 7: 0.000000000 __NEWLEVEL__ rareGroup
##
## $Street
## freq Street enc
## 1: 0.995890411 Pave Pave
## 2: 0.004109589 Grvl rareGroup
## 3: 0.000000000 <NA> rareGroup
## 4: 0.000000000 __NEWLEVEL__ rareGroup
##
## $FoundationFactr
## freq FoundationFactr enc
## 1: 0.443150685 PConc PConc
## 2: 0.434246575 CBlock CBlock
## 3: 0.100000000 BrkTil BrkTil
## 4: 0.016438356 Slab rareGroup
## 5: 0.004109589 Stone rareGroup
## 6: 0.002054795 Wood rareGroup
## 7: 0.000000000 <NA> rareGroup
## 8: 0.000000000 __NEWLEVEL__ rareGroup
We can apply this to the data.table the same way as our freqDefs object:
rareDT_TRUE <- applyEncoding(catEncoding,rareEncod_TRUE)
rareDT_FALSE <- applyEncoding(catEncoding,rareEncod_FALSE)
# Notice that encodeNA did nothing in this case. That's because NAs are not rare here,
# so they do not need to be encoded. If they _were_ rare, the algorithm would have
# grouped them into 'rareGroup' when encodeNA = TRUE.
data.table(
    "rareDT_T$FireplaceQu" = rareDT_TRUE$FireplaceQu
  , "rareDT_F$FireplaceQu" = rareDT_FALSE$FireplaceQu
)
## rareDT_T$FireplaceQu rareDT_F$FireplaceQu
## 1: <NA> <NA>
## 2: TA TA
## 3: TA TA
## 4: Gd Gd
## 5: TA TA
## ---
## 1456: TA TA
## 1457: TA TA
## 1458: Gd Gd
## 1459: <NA> <NA>
## 1460: <NA> <NA>
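As an illustrative sanity check, you can also confirm that the rare Foundation levels (Slab, Stone, Wood) were collapsed into the single rareGroup label:
# Count rows per encoded Foundation level; the rare levels should appear under 'rareGroup'
rareDT_TRUE[, .N, by = Foundation][order(-N)]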
Dummy variables are created when you one-hot encode your data: one boolean column is made for every possible value of a variable. This can produce a massive dataset, many times the size of your original set, depending on missingness, data types, and so on.
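As a toy illustration of the idea (not this package's implementation):
# One 0/1 column per observed level
toy <- data.table(quality = c("Gd", "TA", "Gd", "Fa"))
for (lvl in unique(toy$quality)) {
  set(toy, j = paste0("quality.", lvl), value = as.integer(toy$quality == lvl))
}
print(toy)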
dummyEncode provides an efficient way to do this. It is a more complex encoding strategy, with several options.
dummyEnc1 <- dummyEncode(catEncoding, catVars, "newLevel")
dummyDT1 <- applyEncoding(catEncoding, dummyEnc1)
# Look at how one variable was transformed, instead of printing all of the columns
dummyDT1[, dummyEnc1$lvlNames$FireplaceQu, with = FALSE]
## FireplaceQu.na FireplaceQu.Gd FireplaceQu.TA FireplaceQu.Fa
## 1: 1 0 0 0
## 2: 0 0 1 0
## 3: 0 0 1 0
## 4: 0 1 0 0
## 5: 0 0 1 0
## ---
## 1456: 0 0 1 0
## 1457: 0 0 1 0
## 1458: 0 1 0 0
## 1459: 1 0 0 0
## 1460: 1 0 0 0
## FireplaceQu.Ex
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## ---
## 1456: 0
## 1457: 0
## 1458: 0
## 1459: 0
## 1460: 0
This implementation is much more efficient than the caret implementation for large datasets. It is not more efficient for very small datasets, but who cares about speed when you’re talking about 0.01 seconds?
# Our implementation
dt2 <- catEncoding[sample(1:nrow(catEncoding), size = 10000000, replace = TRUE)]
system.time(
{
dummyEnc1 <- dummyEncode(dt2, catVars)
dummyDT1 <- applyEncoding(dt2, dummyEnc1)
}
)
## user system elapsed
## 19.19 8.22 28.66
# caret's implementation
library(caret, quietly = TRUE) # The caret package must be loaded to use dummyVars
system.time(
{
caretDummy <- dummyVars(~.,dt2[,catVars,with=FALSE])
caretDumDT <- predict(caretDummy,dt2)
# Predict returns a named list of vectors... We still need to convert this to a data.table.
# This is where caret underperforms.
caretDumDT <- as.data.table(caretDumDT)
}
)
## user system elapsed
## 127.01 6.22 137.52
The result of the following parameter combination is less obvious:
dummyEnc2 <- dummyEncode(catEncoding, catVars, treatNA = "ghost", values = c(-1, 1))
dummyDT2 <- applyEncoding(catEncoding, dummyEnc2)
dummyDT2[, dummyEnc2$lvlNames$FireplaceQu, with = FALSE]
## FireplaceQu.Gd FireplaceQu.TA FireplaceQu.Fa FireplaceQu.Ex
## 1: -1 -1 -1 -1
## 2: -1 1 -1 -1
## 3: -1 1 -1 -1
## 4: 1 -1 -1 -1
## 5: -1 1 -1 -1
## ---
## 1456: -1 1 -1 -1
## 1457: -1 1 -1 -1
## 1458: 1 -1 -1 -1
## 1459: -1 -1 -1 -1
## 1460: -1 -1 -1 -1
treatNA = 'ghost' causes any NA values to be set to your specification for the ‘false’ encoding, i.e. they would typically become 0 in every dummy variable. In this case, I specified that my ‘false’ encoding should be -1 instead of 0. You won’t need to do this often; it just shows the combination of effects.
Sometimes, distributions matter. If you are using basically anything besides a tree model, you probably want your data to lie within a certain range and have a well-behaved distribution.
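Under the hood, this is presumably a standard linear rescaling (an assumption on my part; see R/encodings.R for the actual implementation):
# Assumed form of the transformation: center, scale, then shift
rescale <- function(x, newMean = 0, newSD = 1) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) * newSD + newMean
}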
# Which variables do we want to encode in numericEncodings
numerVars <- c("LotFrontage", "LotArea", "GarageCars", "BsmtFinSF2")
# Create encoding objects
gaussEncStNorm <- gaussianEncode(numericEncodings, numerVars)
gaussEncCustom <- gaussianEncode(numericEncodings, numerVars, newMean = 2, newSD = 2)
# Apply encoding objects
gaussStNormDT <- applyEncoding(numericEncodings, gaussEncStNorm)
gaussCustomDT <- applyEncoding(numericEncodings, gaussEncCustom)
# Show mean and SD of each
lapply(gaussStNormDT[, gaussEncStNorm$vars, with = FALSE], function(x) c(mean = mean(x, na.rm = TRUE), SD = sd(x, na.rm = TRUE)))
## $LotFrontage
## mean SD
## -2.700881e-16 1.000000e+00
##
## $LotArea
## mean SD
## -6.781942e-17 1.000000e+00
##
## $GarageCars
## mean SD
## 7.421809e-17 1.000000e+00
##
## $BsmtFinSF2
## mean SD
## 3.368834e-17 1.000000e+00
lapply(gaussCustomDT[, gaussEncCustom$vars, with = FALSE], function(x) c(mean = mean(x, na.rm = TRUE), SD = sd(x, na.rm = TRUE)))
## $LotFrontage
## mean SD
## 2 2
##
## $LotArea
## mean SD
## 2 2
##
## $GarageCars
## mean SD
## 2 2
##
## $BsmtFinSF2
## mean SD
## 2 2
You can see the differences that newMean and newSD make:
require(ggplot2, quietly = TRUE)
gaussPlotDT <- melt(
  data.table(StandardNormal = gaussStNormDT$BsmtFinSF2, Custom = gaussCustomDT$BsmtFinSF2)
  , measure.vars = c("StandardNormal", "Custom")
)
ggplot(gaussPlotDT[!is.na(value)], aes(x = value, color = variable)) + geom_density() + xlab("Gaussian Transformed BsmtFinSF2")
This transformation is useful if your data is skewed. Outliers can play hell with a model, and the Box-Cox transformation can reduce their negative effects. This transformation is a little complex, and it is inherently risky, since passing values <= 0 will cause Box-Cox to fail.
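For reference, the Box-Cox transformation itself is defined as follows (the textbook definition, not necessarily this package's exact code):
boxCox <- function(x, lambda) {
  # log(x) at lambda = 0; otherwise the power transform.
  # Any x <= 0 yields NaN or -Inf, which is why the transformation fails there.
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}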
# Create the box-cox object with default parameters
boxCoxEnc <- boxCoxEncode(numericEncodings,numerVars)
# Apply the transformation to numericEncodings
boxCoxDT <- applyEncoding(numericEncodings,boxCoxEnc)
# Plot results on LotArea
require(gridExtra,quietly = TRUE)
p1 <- ggplot(boxCoxDT[!is.na(LotArea)], aes(x = LotArea)) + geom_density() + ggtitle("Transformed With Box-Cox | Skewness = 0")
p2 <- ggplot(numericEncodings[!is.na(LotArea)], aes(x = LotArea)) + geom_density() + ggtitle("Original Data | Skewness = 12.18")
grid.arrange(p2,p1, ncol = 1)
The minimum LotArea in the original data is 1300. Now imagine if, in the future, we start passing unknown values as -1. All of a sudden, our Box-Cox fails, and that can be dangerous. There are two parameters in this function that allow you to be more risk averse. The first is minNormalize, which represents the number of standard deviations you wish the lower bound of your data to sit above 0. Take LotArea, for example: if you set minNormalize to 0.5 (a very large value), the formula would shift your values over by 0.5 * sd(LotArea) - min(LotArea). Shifting the values does not affect the reasonableness of the distribution too much, since you are already shifting in the Box-Cox transformation.
The second parameter that reduces riskiness is capNegPredOutliers. This caps values from below at a certain number of standard deviations above 0. It should be lower than or equal to minNormalize, or else negative values will be capped to more than the original minimum value of your variable.
boxCoxEncCust <- boxCoxEncode(
    numericEncodings
  , numerVars
  , minNormalize = list(LotFrontage = 1, LotArea = 0.05, GarageCars = 0.05, BsmtFinSF2 = 0.05)
  , capNegPredOutliers = 0.02
)
boxCoxDTCust <- applyEncoding(numericEncodings, boxCoxEncCust)
# Minimums of the original, untransformed data
lapply(numericEncodings[, numerVars, with = FALSE], min, na.rm = TRUE)
## $LotFrontage
## [1] 21
##
## $LotArea
## [1] 1300
##
## $GarageCars
## [1] 0
##
## $BsmtFinSF2
## [1] 28
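As an illustrative check, you can inspect the minimums of the data transformed with the custom parameters as well, to see the effect of minNormalize and capNegPredOutliers:
# Minimums after the custom Box-Cox encoding
lapply(boxCoxDTCust[, numerVars, with = FALSE], min, na.rm = TRUE)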