Installation

The package is available on CRAN, but you can get the latest version from the GitHub page using the install_github function from the devtools package:

install.packages("devtools")
devtools::install_github("awstringer/modellingTools")

To use the package interactively, you must, of course, load it in your session:

library(modellingTools)

Using the development version is recommended; if something is documented in this vignette, you can expect it to work regardless of whether I have bothered to submit a recent version to CRAN.

What This Package Offers

R is a programming language designed primarily to facilitate statistical modelling. It is not a power-user language like SAS, which is designed to make common data analysis tasks easier. This is what makes R great-- it assumes that the user is dedicated enough to program their own custom solutions to problems, and it makes doing so pleasant and productive.

Sometimes though, you just want to analyze your data-- you are doing so as part of a larger project, and don't have time to spend hours programming solutions to simple problems like getting frequency distributions or discretizing continuous variables. This is where modellingTools comes in. This package offers simple, easy to use functions for performing common data analysis tasks.

The core principle behind the package is that data analysis and modelling should work fluidly together. Having data and tools that work together to perform both tasks minimizes analyst effort and errors, and provides a more engaging and repeatable experience.

The package has three modules: Helpful Functions, Descriptive Statistics, and Variable Discretization.

All of the functions are documented, e.g. ?simple_bin. This vignette serves as a comprehensive list of what is available, and includes select commentary and examples of when certain functions might be useful.

Important Note: to run the examples in this vignette, you should have the dplyr and magrittr packages installed and attached to the search path:

# install.packages(c("dplyr","magrittr"))
library(dplyr)
library(magrittr)

You should always have these packages attached when running R interactively, because they are the best.

Of course, you also need the modellingTools package installed!


Modules

Helpful Functions

This module contains some common functions that I use when programming to simplify tasks that, frankly, should be simple. Most of them are simple one or two-liners, but they have been unit tested and documented for your convenience.

column_vector

This is a simple convenience function that grabs a single column of a dataframe and returns it as a vector (numeric, factor, or character, depending on its type in the dataframe). It is designed to work with the tbl_df class from the dplyr package, but there is no reason it won't work on ordinary dataframes.

I find that using this function makes my code cleaner and more predictable. For example,

d <- iris
d[,1]

should return a numeric vector. However,

d[,c(1,2)]

returns a dataframe. Maybe this makes sense-- but how many of us would look at the above code snippets and know what would be returned without thinking? The dplyr package provides the tbl_df class (which you should always use!):

d <- tbl_df(iris)
d[,1]
d[,c(1,2)]

which provides consistent behaviour. But then, how do you get a numeric vector containing just one column?

unlist(d[,1])

Save yourself some typing and use column_vector:

column_vector(d,1)

or, equivalently,

column_vector(d,"Sepal.Length")

This will work with both types of dataframe, as well as objects of class matrix and array, so your code will be clean and predictable.
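
For example, here is a quick check on a matrix (this assumes the same interface carries over to matrices, as claimed above):

m <- as.matrix(iris[, 1:4])
column_vector(m, 1)
column_vector(m, "Sepal.Length")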

Since I am claiming that behaviour of functions should be simple, I should point out the following. What happens if you run

column_vector(d,c(1,2))

Probably not what you would expect, or want -- I recommend against using it this way. Indeed, column_vector will give a warning if the user requests multiple columns.

Descriptive Statistics

This module provides functions for performing simple, yet sometimes tricky-to-implement, data analysis tasks.

For example, everyone should be able to easily compute the frequency distribution of their variables. R provides many useful functions for doing this; however, they are often confusing and have inconsistent output formats.

proc_freq

How do you get the frequency distribution of a variable in R? You can use table:

data(CO2)
CO2 %$% table(Plant)

This works for cross-tabulation as well:

CO2 %$% table(Plant,Type)

What about three-way tables? You can try the following yourself:

CO2 %$% table(Plant,Type,Treatment)

(By the way, the %$% operator is from the magrittr package. I highly recommend checking it out, as it will change the way you program in R.)

This works well, but table is more complicated than it needs to be.

Why should there be 10 arguments to a function that only does one thing? proc_freq has three: the dataset that the variable is in, the variable that you want the distribution of, and an optional awesome argument that specifies whether you want to discretize a continuous variable into bins before summarizing.

The usage is simple:

proc_freq(iris,"Species")

The output is a tbl_df, a great data structure from the dplyr package, and contains counts and percentages.

What about the mysterious third argument to proc_freq, bins? This is used to discretize a continuous variable prior to summarizing. Why might one want to do this? When dealing with smaller, evenly distributed datasets, the need for this is not apparent:

proc_freq(iris,"Sepal.Length")

This provides manageable, interpretable output. In my day-to-day work though, I deal with massive datasets (10,000,000+ rows), which often have continuous variables with huge numbers of unique values, along with plenty of missing data.

I don't want to know how many times each unique value occurs in a dataset with 10,000,000 rows, and I definitely don't want to exclude missing values (have you noticed that this is the default behaviour of table?). What happens when you use table?

library(foreach) # Check this out if you haven't already
# Try with 10,000 observations, for illustration purposes
d <- data_frame(v1 = times(1e04) %do% if (runif(1) < .1) NA else rnorm(1))

Try running the following yourself:

d %$% table(v1)
proc_freq(d,"v1")

The output of table is a giant vector that is essentially unreadable without some further intervention on the part of the programmer -- not ideal. The output of proc_freq is readable but useless, since there are so many unique values. There is a better way:

proc_freq(d,"v1",bins = 3)

Using bins = 3 with proc_freq gives the user insight into the distribution of the data and the proportion of missing/special values using one line of code and a single, consistent output format.
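
In case you have never noticed table's default handling of missing values, here is a quick base R illustration of that point:

table(c(1, NA, 2))                    # the NA silently disappears
table(c(1, NA, 2), useNA = "ifany")   # you have to ask for it explicitly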

get_top_corrs

This function helps to answer the simple question "are any input variables correlated with the response of interest, and if so, which ones and how much?". You can get this information using base R by, for example:

x <- mtcars[,-1]
y <- mtcars[,1]
apply(x,2,function(a,b) cor(a,b),b = y)

This works okay, but the output is a bit verbose and difficult to work with. Oh, and it is hard to read if you have more than a few variables. Oh, and what if some of them are non-numeric? Okay, I'll build in an if statement to check... and darn, what about NAs...
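
To make the pain concrete, here is a rough sketch of where that road leads (plain base R, purely illustrative):

# keep only the numeric input columns, then correlate each one with the
# response, dropping missing values pairwise
num_cols <- vapply(mtcars, is.numeric, logical(1))
x_num <- mtcars[, num_cols & names(mtcars) != "mpg"]
sapply(x_num, function(a) cor(a, mtcars$mpg, use = "complete.obs"))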

Stop. Use get_top_corrs instead! This function computes the correlation between the response and each of the numeric input variables for you, and returns the results in one convenient, readable table.

So we can do

get_top_corrs(mtcars,"mpg")

and gain immediate insights, instead of trying to do it the hard way for 5 minutes and just giving up.

Variable Discretization

Discretizing continuous variables is often a useful practice for removing undesirable patterns or improving model predictions. R provides useful functions (such as cut) for binning individual vectors, but it takes some work on the part of the programmer to apply these consistently to a whole dataset.

Discretization, also known as binning, means converting a numeric variable into a categorical one by assigning each observation to a "bin" depending on its numeric value. For example, if you aren't interested in whether a value is 1.9 or 2.1, but only whether it is < 2 or >= 2, you might want to try binning your data.
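
In base R, that particular example looks something like this:

# bin values as "< 2" versus ">= 2"
cut(c(1.9, 2.1), breaks = c(-Inf, 2, Inf), right = FALSE)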

When building a predictive model, binning can also mitigate the effect of the ubiquitous "linearity" assumption by better tracking nonlinear patterns in the data. Of course it does this at the expense of simplicity; but since the results are still highly interpretable and datasets are usually quite large these days, this isn't often a great concern.
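
As a toy illustration of that point (base R only, not modellingTools): a step function built from bins can track a nonlinear pattern that a single linear term misses.

set.seed(1)
x_toy <- runif(200, 0, 10)
y_toy <- sin(x_toy) + rnorm(200, sd = 0.2)
fit_linear <- lm(y_toy ~ x_toy)                        # one linear term
fit_binned <- lm(y_toy ~ cut(x_toy, breaks = 5))       # a step function via binning
c(linear = summary(fit_linear)$r.squared,
  binned = summary(fit_binned)$r.squared)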

There are four types of binning available in modellingTools: equal-width bins, equal-height bins, custom (user-supplied) cutpoints, and optimal bins chosen with respect to a binary response.

simple_bin

This is currently the main function in the discretization module. The purpose is to make it simple to convert each continuous variable in a dataset to a categorical variable by applying equal-height or equal-width bins, and to make it simple to bin the same variables in a new dataset into those same bins. The primary use would be in building a model on a training set and applying it to a test set, which requires the categorical variables in both to have the same levels.

For details on the arguments to simple_bin, see the documentation ?modellingTools::simple_bin. We can observe the basic usage of simple_bin as follows:

data(Seatbelts)
d <- tbl_df(as.data.frame(Seatbelts))
d

d_bin <- simple_bin(d,bins = 3)
d_bin

Notice that the variable law retains its dbl class -- simple_bin automatically detects when a variable does not have enough unique values to be properly binned, and leaves it untouched. Variables that are already categorical are also automatically ignored. If you would like to know the cutpoints for the bins, check out modellingTools::binned_data_cutpoints below.

When building a model, you very well may wish to fit the model on one subset of the data (a training set) and evaluate predictions on a different subset (a testing set). If dealing with variables that have been discretized (by you), this requires the variables in the testing set to be binned into the same ranges.

simple_bin automatically bins an optional test dataset into the same bins used on the training set:

d %<>% mutate(on_train = times(n()) %do% if (runif(1) < .7) 1 else 0)
d_train <- d %>% filter(on_train == 1) %>% select(-on_train)
d_test <- d %>% filter(on_train == 0) %>% select(-on_train)

d_split_binned <- simple_bin(d_train,test = d_test,bins = 3)
d_split_binned

But wait-- why can't we just bin the data before we split it up into training/test sets? This implicitly assumes that the test data will always be distributed exactly the same as the training data. This is true by definition when you are creating the train/test splits yourself, but many validation cases involve testing on samples not available at the time of fitting the model.

Even if you don't have the test set at the time you build the model, or if you wish to use the same model on future datasets, simple_bin can take custom bins. These must be passed as a named list, which is the exact format output by modellingTools::binned_data_cutpoints (see below):

d_train_bin <- simple_bin(d_train,bins = 3)
train_cutpoints <- binned_data_cutpoints(d_train_bin)
train_cutpoints

d_test_bin <- simple_bin(d_test,bins = train_cutpoints)
d_test_bin

identical(d_test_bin,d_split_binned$test)

optimal_bin

This function takes a dataset and desired input variables, and computes the optimal bins with respect to a binary response. The function doesn't actually bin the data; it outputs a named list of the optimal cutpoints for each variable, which can then be passed to simple_bin:

# iterate over the Species values one row at a time
d <- iris %>% mutate(response = foreach(s = as.character(Species), .combine = c) %do% {
                                  if (s == "setosa" & runif(1) < 0.3) 1
                                  else if (s == "versicolor" & runif(1) < 0.5) 1
                                  else if (s == "virginica" & runif(1) < 0.7) 1
                                  else 0
                                })
opt_bins <- optimal_bin(train = d,
                        response = "response")

Note the automatic exclusion of categorical variables; no filtering necessary.
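
Since the output is in the same named-list format that simple_bin accepts for custom bins, applying the optimal cutpoints takes one more line (a sketch, assuming the names in opt_bins match the columns of d):

d_opt_bin <- simple_bin(d, bins = opt_bins)
d_opt_bin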

vector_bin

This function makes it easy to discretize a vector of values into equal-height, equal-width, or custom bins. It is a convenience function, essentially a wrapper around cut/cut_number/cut_interval, with one major difference: by default, missing values are given their own bin. This can be useful when modelling data that is missing not at random.

Usage is straightforward:

x <- rnorm(10)
vector_bin(x,bins = 3)
vector_bin(x,bins = 3,type = "width")
vector_bin(x,bins = c(-1,0,1))
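
To see the default treatment of missing values (they receive their own bin), try something like:

x_na <- c(rnorm(10), NA, NA)
vector_bin(x_na, bins = 3)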

get_vector_cutpoints

I found it surprisingly difficult in R to perform the simple task of figuring out the actual cutpoints of a vector that has been discretized using one of the cut family of functions. What happens if you try the following?

x <- vector_bin(rnorm(100),bins = 3)
levels(x)

You get a vector of character strings, containing brackets and commas. Okay, well I can use a string replace function from stringr to get rid of the brackets (don't forget to escape them twice!), then split the string on the commas, oh darn that returns a list, okay unlist it...

Stop. Use get_vector_cutpoints. This uses a regular expression to parse the unique numeric values from the above character vector and returns a numeric vector of the (sorted!) unique cutpoints.

get_vector_cutpoints(x)

binned_data_cutpoints

Typically data is stored in dataframes, not individual vectors. Rather than extracting each column with column_vector, calling get_vector_cutpoints on it, and repeating this for every variable in your dataset (more programming!), use binned_data_cutpoints to quickly get a named list, with each element being the numeric vector of cutpoints for the corresponding factor variable.

For example, remember the Seatbelts data that we binned? If you'd like to know the actual cutpoints used for each variable, try:

binned_data_cutpoints(d_bin)

create_model_matrix

When using the base R functions for modelling (lm, glm, etc.) and many established functions in other packages (nlme::nlme, glmm::glmm, etc.), the user is able to specify a model using a formula interface:

mod <- lm(y ~ x,data = mydata)

This is very convenient, because R automatically creates the dummy variables needed to represent factor variables in the model. Not every package offers a formula interface, though -- some expect a numeric matrix of inputs.

Base R provides the model.matrix function, which takes a formula and a matrix/dataframe and creates the corresponding dummy variables. create_model_matrix is a wrapper around this function that takes only a dataframe, automatically creates dummies for the factor variables only, and outputs either a matrix or a tbl_df.

The usage is straightforward:

# Print this in your own R console:
x <- create_model_matrix(d_train_bin)

# Prettier output:
create_model_matrix(d_train_bin,matrix_out = FALSE)

The default matrix output is suitable for input into other R packages, such as glmnet.
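
For example, here is a hedged sketch of feeding the matrix to glmnet (this assumes glmnet is installed, that simple_bin preserves the column names of d_train, and DriversKilled is an arbitrary choice of response):

# library(glmnet)
# y <- column_vector(d_train, "DriversKilled")
# x <- create_model_matrix(d_train_bin %>% select(-DriversKilled))
# fit <- glmnet(x = x, y = y)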


