README.md

Introduction

Warning: This package contains binary RDS files in the Support/Data folder. This is tabular data used to showcase the functionality of the package. I purposely avoid S3 methods, since I want to source these files as standalone functions rather than use this as a full package. It is set up as a package for documentation and easy dependency management.

This package includes helper functions that I commonly use while modeling. So far, it includes:

  1. Creating dummy variables (very fast and flexible)
  2. Frequency encoding
  3. Rare variable grouping
  4. Gaussian scaling
  5. Uniform (linear) encoding
  6. Funky Box-Cox transformation (read the vignette)
  7. NA encoding
  8. Common DBI data request/fetch functions
  9. Easy data upload to a SQL server
  10. Easy data.table -> csv tempfile

All transformations return encoding objects that are then applied to a dataset (except NA encoding, where there is no point). This lets me apply the same transformation to every dataset for a given model, which is surprisingly tricky to get right.

# data.table must be loaded to use these functions.
require(data.table,quietly = TRUE)

# Loads the functions
source("R/encodings.R")

# Loads the data.
catEncoding <- readRDS("Support/Data/catEncoding.RDS")
numericEncodings <- readRDS("Support/Data/numericEncodings.RDS")
catEncoding$floatingPoint <- rnorm(nrow(catEncoding))

The ‘floatingPoint’ column is meant to be left out of these encodings, since you would normally apply other types of transformations to a numeric column like this. Notice that GarageCars was included, even though it is also numeric. Sometimes you may wish to perform these encodings on numeric values, so it is included in this example.

There is also a ‘FoundationFactr’ variable, which shows that these functions work as intended on factors.

Frequency Encoding

Frequency encoding is a common encoding type for tree models (gradient boosting, random forest). The general premise is that it converts a categorical variable into a number by replacing each value with the number of times that value shows up in the data.
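As a plain base-R sketch of that idea (an illustration only, not this package's API), the replacement can be done with table():

```r
# Toy frequency encoding: replace each value with its count in the data.
x <- c("a", "b", "a", "c", "a", "b")
counts <- table(x)     # a = 3, b = 2, c = 1
as.integer(counts[x])  # 3 2 3 1 3 2
```

Note that, judging by the tables printed below, this package stores a frequency rank in the enc column rather than the raw count; for tree models both carry the same ordering information.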

First, we use the frequencyEncode function to create a ‘freqDefs’ object. This object gives you all the information you need to know about how your data will be encoded:

# Which variables do we want to encode - Remember we don't want to encode floatingPoint
catVars <- c("Foundation","FireplaceQu","GarageCars","Street","FoundationFactr")

# Create the freqDefs object
freqEncod_TRUE <- frequencyEncode(catEncoding,catVars,encodeNA = TRUE, allowNewLevels = TRUE)
freqEncod_FALSE <- frequencyEncode(catEncoding,catVars,encodeNA = FALSE, allowNewLevels = FALSE)

Searching through these objects, you will find the parameters used to make the object, as well as some ‘tables’. The enc column is the value that will replace each level when you apply this transformation.

freqEncod_TRUE$tables

## $Foundation
##    freq   Foundation enc
## 1:  647        PConc   6
## 2:  634       CBlock   5
## 3:  146       BrkTil   4
## 4:   24         Slab   3
## 5:    6        Stone   2
## 6:    3         Wood   1
## 7:    0         <NA>   0
## 8:    0 __NEWLEVEL__  -1
## 
## $FireplaceQu
##    freq  FireplaceQu enc
## 1:  690         <NA>   6
## 2:  380           Gd   5
## 3:  313           TA   4
## 4:   33           Fa   3
## 5:   24           Ex   2
## 6:   20           Po   1
## 7:    0 __NEWLEVEL__  -1
## 
## $GarageCars
##    freq   GarageCars enc
## 1:  824            2   5
## 2:  369            1   4
## 3:  181            3   3
## 4:   81            0   2
## 5:    5            4   1
## 6:    0         <NA>   0
## 7:    0 __NEWLEVEL__  -1
## 
## $Street
##    freq       Street enc
## 1: 1454         Pave   2
## 2:    6         Grvl   1
## 3:    0         <NA>   0
## 4:    0 __NEWLEVEL__  -1
## 
## $FoundationFactr
##    freq FoundationFactr enc
## 1:  647           PConc   6
## 2:  634          CBlock   5
## 3:  146          BrkTil   4
## 4:   24            Slab   3
## 5:    6           Stone   2
## 6:    3            Wood   1
## 7:    0            <NA>   0
## 8:    0    __NEWLEVEL__  -1

Check out the freqEncod_FALSE$tables object to see how encodeNA and allowNewLevels affected the tables.

Applying The Frequency Encoding

The frequencyEncode function deliberately does not return a dataset. Oftentimes, you will want to apply the same encoding to multiple datasets. This is basically a required setup if you want to run your model on new samples in the future. If frequencyEncode only returned an encoded dataset, you would get different results when you ran it again on a different dataset later.

Therefore, good practice is to save your encoding objects as RDS files so they can be used again.
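A minimal sketch of that round trip, using a stand-in list in place of a real freqDefs object:

```r
# Persist an encoding object so the exact same transformation can be reused later.
enc <- list(vars = c("Foundation", "Street"), encodeNA = TRUE)  # stand-in, not a real freqDefs object
path <- tempfile(fileext = ".RDS")
saveRDS(enc, path)

# ... later, in a scoring script ...
enc2 <- readRDS(path)
identical(enc, enc2)  # TRUE
```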

To apply the encoding, simply use applyEncoding:

freqDT_TRUE <- applyEncoding(catEncoding,freqEncod_TRUE)
freqDT_FALSE <- applyEncoding(catEncoding,freqEncod_FALSE)


# Notice the difference that encodeNA made.
data.table(
    "freqDT_TRUE$FireplaceQu" = freqDT_TRUE$FireplaceQu
  , "   freqDT_FALSE$FireplaceQu" = freqDT_FALSE$FireplaceQu
)

##       freqDT_TRUE$FireplaceQu    freqDT_FALSE$FireplaceQu
##    1:                       6                          NA
##    2:                       4                           4
##    3:                       4                           4
##    4:                       5                           5
##    5:                       4                           4
##   ---                                                    
## 1456:                       4                           4
## 1457:                       4                           4
## 1458:                       5                           5
## 1459:                       6                          NA
## 1460:                       6                          NA

You can also replace the columns in your original dataset by specifying inPlace = TRUE:

freqDT_inPlace <- applyEncoding(catEncoding,freqEncod_TRUE, inPlace = TRUE)
print(freqDT_inPlace)

##       Foundation FireplaceQu GarageCars Street FoundationFactr
##    1:          6           6          5      2               6
##    2:          5           4          5      2               5
##    3:          6           4          5      2               6
##    4:          4           5          3      2               4
##    5:          6           4          3      2               6
##   ---                                                         
## 1456:          6           4          5      2               6
## 1457:          5           4          5      2               5
## 1458:          2           5          4      2               2
## 1459:          5           6          4      2               5
## 1460:          5           6          4      2               5
##       floatingPoint
##    1:   -0.06709947
##    2:    0.22028390
##    3:    2.34340690
##    4:    0.05794125
##    5:    1.69450183
##   ---              
## 1456:   -0.15630265
## 1457:   -0.05790603
## 1458:    0.79469596
## 1459:    1.13994491
## 1460:    0.08397666

Effect of New Levels, and How to Use the allowNewLevels Parameter

You’ll notice that we set the allowNewLevels parameter in the frequencyEncode calls above. Here is what happens if a future dataset contains new levels and you try to apply the encoding:

# Add a row containing levels that were not present in the original data
catEncWithNewLevels <- rbindlist(list(catEncoding,list("New","TA",10,"Pave","New",0)))

# When allowNewLevels = TRUE, a warning is thrown.
freqDT_TnewLevels <- applyEncoding(catEncWithNewLevels,freqEncod_TRUE)

## Warning in applyRFEncoding(dt, obj, inPlace): WARNING: NEW LEVEL DETECTED
## IN VARIABLE Foundation. allowNewLevels IS SET TO TRUE, SO THESE WILL BE
## ENCODED AS newString or -1.

## Warning in applyRFEncoding(dt, obj, inPlace): WARNING: NEW LEVEL DETECTED
## IN VARIABLE GarageCars. allowNewLevels IS SET TO TRUE, SO THESE WILL BE
## ENCODED AS newString or -1.

## Warning in applyRFEncoding(dt, obj, inPlace): WARNING: NEW LEVEL DETECTED
## IN VARIABLE FoundationFactr. allowNewLevels IS SET TO TRUE, SO THESE WILL
## BE ENCODED AS newString or -1.

# When allowNewLevels = FALSE, an error is thrown. It is not run here, since the error would stop the vignette from building.
# freqDT_FnewLevels <- applyEncoding(catEncWithNewLevels,freqEncod_FALSE)

Rare Variable Encoding

Grouping uncommon levels together is good practice. It is cumbersome to do manually, so this function naively groups all uncommon categorical levels into a single new group. The syntax is very similar to frequency encoding. If you want to do any other kind of specific grouping, it is better to do that separately.

Here, we group together all levels that make up less than 5% of the data we are using. It’s up to you to be smart about what percentage you use.
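The grouping rule itself is simple; here is a base-R sketch of the idea (an illustration, not this package's implementation):

```r
# Group every level that makes up less than 5% of the data into one bucket.
x <- c(rep("PConc", 90), rep("Wood", 6), rep("Stone", 4))
perc <- table(x) / length(x)            # PConc 0.90, Wood 0.06, Stone 0.04
rareLevels <- names(perc)[perc < 0.05]  # only "Stone" falls under the threshold
x[x %in% rareLevels] <- "rareGroup"
table(x)                                # PConc 90, Wood 6, rareGroup 4
```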

# Create the rareDefs object
rareEncod_TRUE <- rareEncode(catEncoding,catVars,minPerc = 0.05, encodeNA = TRUE, allowNewLevels = TRUE)
rareEncod_FALSE <- rareEncode(catEncoding,catVars,minPerc = 0.05, encodeNA = FALSE, allowNewLevels = FALSE)

You can view exactly how the algorithm is going to group these variables by looking at the ‘tables’ list again:

rareEncod_TRUE$tables

## $Foundation
##           freq   Foundation       enc
## 1: 0.443150685        PConc     PConc
## 2: 0.434246575       CBlock    CBlock
## 3: 0.100000000       BrkTil    BrkTil
## 4: 0.016438356         Slab rareGroup
## 5: 0.004109589        Stone rareGroup
## 6: 0.002054795         Wood rareGroup
## 7: 0.000000000         <NA> rareGroup
## 8: 0.000000000 __NEWLEVEL__ rareGroup
## 
## $FireplaceQu
##          freq  FireplaceQu       enc
## 1: 0.47260274         <NA>      <NA>
## 2: 0.26027397           Gd        Gd
## 3: 0.21438356           TA        TA
## 4: 0.02260274           Fa rareGroup
## 5: 0.01643836           Ex rareGroup
## 6: 0.01369863           Po rareGroup
## 7: 0.00000000 __NEWLEVEL__ rareGroup
## 
## $GarageCars
##           freq   GarageCars       enc
## 1: 0.564383562            2         2
## 2: 0.252739726            1         1
## 3: 0.123972603            3         3
## 4: 0.055479452            0         0
## 5: 0.003424658            4 rareGroup
## 6: 0.000000000         <NA> rareGroup
## 7: 0.000000000 __NEWLEVEL__ rareGroup
## 
## $Street
##           freq       Street       enc
## 1: 0.995890411         Pave      Pave
## 2: 0.004109589         Grvl rareGroup
## 3: 0.000000000         <NA> rareGroup
## 4: 0.000000000 __NEWLEVEL__ rareGroup
## 
## $FoundationFactr
##           freq FoundationFactr       enc
## 1: 0.443150685           PConc     PConc
## 2: 0.434246575          CBlock    CBlock
## 3: 0.100000000          BrkTil    BrkTil
## 4: 0.016438356            Slab rareGroup
## 5: 0.004109589           Stone rareGroup
## 6: 0.002054795            Wood rareGroup
## 7: 0.000000000            <NA> rareGroup
## 8: 0.000000000    __NEWLEVEL__ rareGroup

We can apply this to the data.table the same way as our freqDefs object:

rareDT_TRUE <- applyEncoding(catEncoding,rareEncod_TRUE)
rareDT_FALSE <- applyEncoding(catEncoding,rareEncod_FALSE)

# Notice that encodeNA did nothing in this case. That's because NAs are not rare, so they do not need to be encoded. If they _were_ rare, then the algorithm would have grouped them into 'rareGroup' if encodeNA = TRUE.
data.table(
    "rareDT_T$FireplaceQu" = rareDT_TRUE$FireplaceQu
  , "   rareDT_F$FireplaceQu" = rareDT_FALSE$FireplaceQu
)

##       rareDT_T$FireplaceQu    rareDT_F$FireplaceQu
##    1:                 <NA>                    <NA>
##    2:                   TA                      TA
##    3:                   TA                      TA
##    4:                   Gd                      Gd
##    5:                   TA                      TA
##   ---                                             
## 1456:                   TA                      TA
## 1457:                   TA                      TA
## 1458:                   Gd                      Gd
## 1459:                 <NA>                    <NA>
## 1460:                 <NA>                    <NA>

Dummy Variables

Dummy variables are created when you one-hot-encode your data. This requires making one boolean column for every possible value of a variable, which can produce a massive dataset, many times the size of your original set, depending on missingness, data types, etc.
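For reference, base R can produce the same kind of columns with model.matrix(), though without this package's NA handling or speed:

```r
# One 0/1 column per factor level (columns xFa, xGd, xTA).
x <- factor(c("Gd", "TA", "Gd", "Fa"))
m <- model.matrix(~ x - 1)  # "- 1" drops the intercept so no level is omitted
m
```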

dummyEncode provides an efficient way to do this. This is a more complex encoding strategy, with several options.

dummyEnc1 <- dummyEncode(catEncoding,catVars,"newLevel")

dummyDT1 <- applyEncoding(catEncoding,dummyEnc1)

# Look at how 1 variable was transformed, instead of printing all of the columns
dummyDT1[,dummyEnc1$lvlNames$FireplaceQu,with = FALSE]

##       FireplaceQu.na FireplaceQu.Gd FireplaceQu.TA FireplaceQu.Fa
##    1:              1              0              0              0
##    2:              0              0              1              0
##    3:              0              0              1              0
##    4:              0              1              0              0
##    5:              0              0              1              0
##   ---                                                            
## 1456:              0              0              1              0
## 1457:              0              0              1              0
## 1458:              0              1              0              0
## 1459:              1              0              0              0
## 1460:              1              0              0              0
##       FireplaceQu.Ex
##    1:              0
##    2:              0
##    3:              0
##    4:              0
##    5:              0
##   ---               
## 1456:              0
## 1457:              0
## 1458:              0
## 1459:              0
## 1460:              0

This implementation is much more efficient than the caret implementation for large datasets. It is not more efficient for very small datasets, but who cares about speed when you’re talking about 0.01 seconds?

# Our implementation
dt2 <- catEncoding[sample(1:nrow(catEncoding),size=10000000,replace=TRUE)]
system.time(
   {
      dummyEnc1 <- dummyEncode(dt2, catVars)
      dummyDT1 <- applyEncoding(dt2, dummyEnc1)
   }
)

##    user  system elapsed 
##   19.19    8.22   28.66

# caret's implementation
library(caret, quietly = TRUE) # The caret package must be loaded to use dummyVars
system.time(
  {
    caretDummy <- dummyVars(~.,dt2[,catVars,with=FALSE])
    caretDumDT <- predict(caretDummy,dt2)

    # Predict returns a named list of vectors... We still need to convert this to a data.table.
    # This is where caret underperforms.
    caretDumDT <- as.data.table(caretDumDT)
  }
)

##    user  system elapsed 
##  127.01    6.22  137.52

The result from this parameter combination is less obvious:

dummyEnc2 <- dummyEncode(catEncoding,catVars,treatNA = "ghost", values = c(-1,1))

dummyDT2 <- applyEncoding(catEncoding,dummyEnc2)

dummyDT2[,dummyEnc2$lvlNames$FireplaceQu,with = FALSE]

##       FireplaceQu.Gd FireplaceQu.TA FireplaceQu.Fa FireplaceQu.Ex
##    1:             -1             -1             -1             -1
##    2:             -1              1             -1             -1
##    3:             -1              1             -1             -1
##    4:              1             -1             -1             -1
##    5:             -1              1             -1             -1
##   ---                                                            
## 1456:             -1              1             -1             -1
## 1457:             -1              1             -1             -1
## 1458:              1             -1             -1             -1
## 1459:             -1             -1             -1             -1
## 1460:             -1             -1             -1             -1

treatNA = 'ghost' causes any NA values to be set to your specification for the ‘false’ encoding; typically this would be 0 for all of the dummy variables. In this case, I specified that my ‘false’ encoding shouldn’t be 0, it should be -1. You won’t need to do this often; it just shows the combination of effects.

Distribution Adjustments

Sometimes, distributions matter. If you are using basically anything besides a tree model, you probably prefer your data to be within a certain range and have a well-behaved distribution.

Gaussian (normal) Scaling

# Which variables do we want to encode in numericEncodings
numerVars <- c("LotFrontage","LotArea","GarageCars","BsmtFinSF2")

# Create encoding objects
gaussEncStNorm <- gaussianEncode(numericEncodings,numerVars)
gaussEncCustom <- gaussianEncode(numericEncodings,numerVars,newMean = 2,newSD = 2)

# Apply encoding objects
gaussStNormDT <- applyEncoding(numericEncodings,gaussEncStNorm)
gaussCustomDT <- applyEncoding(numericEncodings,gaussEncCustom)

# Show mean and SD of each
lapply(gaussStNormDT[,gaussEncStNorm$vars,with=FALSE],function(x) c(mean = mean(x,na.rm=TRUE),SD = sd(x,na.rm=TRUE)))

## $LotFrontage
##          mean            SD 
## -2.700881e-16  1.000000e+00 
## 
## $LotArea
##          mean            SD 
## -6.781942e-17  1.000000e+00 
## 
## $GarageCars
##         mean           SD 
## 7.421809e-17 1.000000e+00 
## 
## $BsmtFinSF2
##         mean           SD 
## 3.368834e-17 1.000000e+00

lapply(gaussCustomDT[,gaussEncCustom$vars,with=FALSE],function(x) c(mean = mean(x,na.rm=TRUE),SD = sd(x,na.rm=TRUE)))

## $LotFrontage
## mean   SD 
##    2    2 
## 
## $LotArea
## mean   SD 
##    2    2 
## 
## $GarageCars
## mean   SD 
##    2    2 
## 
## $BsmtFinSF2
## mean   SD 
##    2    2

You can see the differences that newMean and newSD make:

require(ggplot2, quietly = TRUE)
gaussPlotDT <- melt(data.table(StandardNormal = gaussStNormDT$BsmtFinSF2, Custom = gaussCustomDT$BsmtFinSF2), measure.vars = c("StandardNormal","Custom"))
ggplot(gaussPlotDT[!is.na(value)], aes(x = value, color = variable)) +geom_density() + xlab("Gaussian Transformed BsmtFinSF2")
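The underlying arithmetic is presumably just a shift and rescale; here is a base-R sketch (an assumption about what gaussianEncode computes, not its actual code):

```r
# Standard gaussian scaling, then remapping to a custom mean and sd.
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)   # mean 0, sd 1
custom <- z * 2 + 2          # the newMean = 2, newSD = 2 version
c(mean(custom), sd(custom))  # 2 2
```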

Box Cox Transformation

This transformation is useful if your data is skewed. Outliers can play hell with a model, and the Box-Cox transformation can reduce their negative effects. This transformation is a little complex, and inherently risky, since passing values <= 0 will cause Box-Cox to fail.
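The core transformation is the usual one-parameter Box-Cox; a base-R sketch for a fixed lambda (in practice lambda is estimated from the data, e.g. by MASS::boxcox):

```r
# One-parameter Box-Cox for a known lambda.
boxCox1 <- function(x, lambda) {
  stopifnot(all(x > 0))  # this is exactly why values <= 0 are dangerous
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
boxCox1(c(1, 10, 100), lambda = 0)  # equivalent to log(x)
```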

# Create the box-cox object with default parameters
boxCoxEnc <- boxCoxEncode(numericEncodings,numerVars)

# Apply the transformation to numericEncodings
boxCoxDT <- applyEncoding(numericEncodings,boxCoxEnc)

# Plot results on LotArea
require(gridExtra,quietly = TRUE)
p1 <- ggplot(boxCoxDT[!is.na(LotArea)], aes(x = LotArea)) + geom_density() + ggtitle("Transformed With Box-Cox  |  Skewness = 0")
p2 <- ggplot(numericEncodings[!is.na(LotArea)], aes(x = LotArea)) + geom_density() + ggtitle("Original Data  |  Skewness = 12.18")
grid.arrange(p2,p1, ncol = 1)

The minimum LotArea in the original data is 1300. Now imagine if, in the future, we start passing unknown values as -1. All of a sudden, our Box-Cox fails, and that can be dangerous. There are two parameters in this function that allow you to be more risk averse. The first is minNormalize, which represents the number of standard deviations you wish the lower bound of your data to be above 0. Take LotArea for example: if you set minNormalize to 0.5 (a very large value), the formula would shift your values over by 0.5 * sd(LotArea) - min(LotArea). Shifting the values does not affect the reasonableness of the distribution too much, since you are already shifting in the Box-Cox transformation.

The second parameter that reduces riskiness is capNegPredOutliers, which min-caps values at a certain number of standard deviations above 0. It should be less than or equal to minNormalize, or else negative values will be capped above the original minimum value of your variable.

boxCoxEncCust <- boxCoxEncode(
    numericEncodings
  , numerVars
  , minNormalize = list(LotFrontage = 1, LotArea = 0.05, GarageCars = 0.05, BsmtFinSF2 = 0.05)
  , capNegPredOutliers = 0.02)

boxCoxDT <- applyEncoding(numericEncodings,boxCoxEncCust)
lapply(numericEncodings[,numerVars,with=FALSE],min,na.rm=TRUE)

## $LotFrontage
## [1] 21
## 
## $LotArea
## [1] 1300
## 
## $GarageCars
## [1] 0
## 
## $BsmtFinSF2
## [1] 28


AnotherSamWilson/helperFuncs documentation built on Oct. 1, 2019, 8:51 p.m.