Installation

The package is available on CRAN, but you can get the latest version from the GitHub page using the install_github function from the devtools package:

install.packages("devtools")
devtools::install_github("awstringer/modellingTools")

To use the package interactively, you must, of course, load it in your session:

library(modellingTools)

Using the development version is recommended; if something is documented in this vignette, you can expect it to work regardless of whether I have bothered to submit a recent version to CRAN.

What This Package Offers

R is a programming language designed primarily to facilitate statistical modelling. It is not a power-user language like SAS, which is designed to make common data analysis tasks easier. This is what makes R great-- it assumes that the user is dedicated enough to program their own custom solutions to problems, and it makes doing so pleasant and productive.

Sometimes though, you just want to analyze your data-- you are doing so as part of a larger project, and don't have time to spend hours programming solutions to simple problems like getting frequency distributions or discretizing continuous variables. This is where modellingTools comes in. This package offers simple, easy to use functions for performing common data analysis tasks.

The core principle behind the package is that data analysis and modelling should work fluidly together. Having data and tools that work together to perform both tasks minimizes analyst effort and errors, and provides a more engaging and repeatable experience.

The package has three modules: Helpful Functions, Descriptive Statistics, and Variable Discretization.

All of the functions are documented, e.g. ?simple_bin. This vignette serves as a comprehensive list of what is available, and includes select commentary and examples of when certain functions might be useful.

Important Note: to run the examples in this vignette, you should have the dplyr and magrittr packages installed and attached to the search path:

# install.packages(c("dplyr","magrittr"))
library(dplyr)
library(magrittr)

You should always have these packages attached when running R interactively, because they are the best.

Of course, you also need the modellingTools package installed!


Modules

Helpful Functions

This module contains some common functions that I use when programming to simplify tasks that, frankly, should be simple. Most of them are simple one or two-liners, but they have been unit tested and documented for your convenience.

column_vector

This is a simple convenience function that grabs a single column of a dataframe and returns it as a vector (numeric, factor, or character, depending on its type in the dataframe). It is designed to work with the tbl_df class from the dplyr package, but there is no reason it won't work on ordinary dataframes.

I find that using this function makes my code cleaner and more predictable. For example,

d <- iris
d[,1]

should return a numeric vector. However,

d[,c(1,2)]

returns a dataframe. Maybe this makes sense-- but how many of us would look at the above code snippets and know what would be returned without thinking? The dplyr package provides the tbl_df class (which you should always use!):

d <- tbl_df(iris)
d[,1]
d[,c(1,2)]

which provides consistent behaviour. But then, how do you get a numeric vector containing just one column?

unlist(d[,1])

Save yourself some typing and use column_vector:

column_vector(d,1)

or, equivalently,

column_vector(d,"Sepal.Length")

This will work with both types of dataframe, as well as objects of class matrix and array, so your code will be clean and predictable.
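
For example, here is a quick check on a matrix (this assumes the same interface carries over to matrices, as claimed above):

m <- as.matrix(iris[, 1:4])
column_vector(m, 1)
column_vector(m, "Sepal.Length")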

Since I am claiming that behaviour of functions should be simple, I should point out the following. What happens if you run

column_vector(d,c(1,2))

Probably not what you would expect, or want -- I recommend against using it this way. Indeed, column_vector will give a warning if the user requests multiple columns.

Descriptive Statistics

This module provides functions for performing simple, yet sometimes tricky-to-implement, data analysis tasks.

For example, everyone should be able to easily compute the frequency distribution of their variables. R provides many useful functions for doing this; however, they are often confusing and have inconsistent output formats.

proc_freq

How do you get the frequency distribution of a variable in R? You can use table:

data(CO2)
CO2 %$% table(Plant)

This works for cross-tabulation as well:

CO2 %$% table(Plant,Type)

What about three-way tables? You can try the following yourself:

CO2 %$% table(Plant,Type,Treatment)

(By the way, the %$% operator is from the magrittr package. I highly recommend checking it out, as it will change the way you program in R.)

This works well, but table is more complicated than it needs to be.

Why should there be 10 arguments to a function that only does one thing? proc_freq has three: the dataset that the variable is in, the variable that you want the distribution of, and an optional awesome argument that specifies whether you want to discretize a continuous variable into bins before summarizing.

The usage is simple:

proc_freq(iris,"Species")

The output is a tbl_df, a great data structure from the dplyr package, and contains counts and percentages.

What about the mysterious third argument to proc_freq, bins? This is used to discretize a continuous variable prior to summarizing. Why might one want to do this? When dealing with smaller, evenly distributed datasets, the need for this is not apparent:

proc_freq(iris,"Sepal.Length")

This provides manageable, interpretable output. In my day-to-day work though, I deal with massive datasets (10,000,000+ rows), which often have continuous variables with huge numbers of unique values, along with plenty of missing data.

I don't want to know how many times each unique value occurs in a dataset with 10,000,000 rows, and I definitely don't want to exclude missing values (have you noticed that this is the default behaviour of table?). What happens when you use table?

library(foreach) # Check this out if you haven't already
# Try with 10,000 observations, for illustration purposes
d <- data_frame(v1 = times(1e04) %do% if (runif(1) < .1) NA else rnorm(1))

Try running the following yourself:

d %$% table(v1)
proc_freq(d,"v1")

The output of table is a giant vector that is essentially unreadable without some further intervention on the part of the programmer -- not ideal. The output of proc_freq is readable but useless, since there are so many unique values. There is a better way:

proc_freq(d,"v1",bins = 3)

Using bins = 3 with proc_freq gives the user insight into the distribution of the data and the proportion of missing/special values using one line of code and a single, consistent output format.
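
In case you have never noticed table's default handling of missing values, here is a quick base R illustration of that point:

table(c(1, NA, 2))                    # the NA silently disappears
table(c(1, NA, 2), useNA = "ifany")   # you have to ask for it explicitly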

get_top_corrs

This function helps to answer the simple question "are any input variables correlated with the response of interest, and if so, which ones and how much?". You can get this information using base R by, for example:

x <- mtcars[,-1]
y <- mtcars[,1]
apply(x,2,function(a,b) cor(a,b),b = y)

This works okay, but the output is a bit verbose and difficult to work with. Oh, and it is hard to read if you have more than a few variables. Oh, and what if some of them are non-numeric? Okay, I'll build in an if statement to check... and darn, what about NAs...
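
To make the pain concrete, here is a rough sketch of where that road leads (plain base R, purely illustrative):

# keep only the numeric input columns, then correlate each one with the
# response, dropping missing values pairwise
num_cols <- vapply(mtcars, is.numeric, logical(1))
x_num <- mtcars[, num_cols & names(mtcars) != "mpg"]
sapply(x_num, function(a) cor(a, mtcars$mpg, use = "complete.obs"))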

Stop. Use get_top_corrs instead! This function computes the correlation between the response and each of the numeric input variables for you, and returns the results in one convenient, readable table.

So we can do

get_top_corrs(mtcars,"mpg")

and gain immediate insights, instead of trying to do it the hard way for 5 minutes and just giving up.

Variable Discretization

Discretizing continuous variables is often a useful practice for removing undesirable patterns or improving model predictions. R provides useful functions (such as cut) for binning individual vectors, but it takes some work on the part of the programmer to apply these consistently to a whole dataset.

Discretization, also known as binning, means converting a numeric variable into a categorical one by assigning each observation to a "bin" depending on its numeric value. For example, if you aren't interested in whether a value is 1.9 or 2.1, but only whether it is < 2 or >= 2, you might want to try binning your data.
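
In base R, that particular example looks something like this:

# bin values as "< 2" versus ">= 2"
cut(c(1.9, 2.1), breaks = c(-Inf, 2, Inf), right = FALSE)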

When building a predictive model, binning can also mitigate the effect of the ubiquitous "linearity" assumption by better tracking nonlinear patterns in the data. Of course it does this at the expense of simplicity; but since the results are still highly interpretable and datasets are usually quite large these days, this isn't often a great concern.
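
As a toy illustration of that point (base R only, not modellingTools): a step function built from bins can track a nonlinear pattern that a single linear term misses.

set.seed(1)
x_toy <- runif(200, 0, 10)
y_toy <- sin(x_toy) + rnorm(200, sd = 0.2)
fit_linear <- lm(y_toy ~ x_toy)                        # one linear term
fit_binned <- lm(y_toy ~ cut(x_toy, breaks = 5))       # a step function via binning
c(linear = summary(fit_linear)$r.squared,
  binned = summary(fit_binned)$r.squared)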

There are four types of binning available in modellingTools: equal-width bins, equal-height bins, custom (user-supplied) cutpoints, and optimal bins chosen with respect to a binary response.

simple_bin

This is currently the main function in the discretization module. The purpose is to make it simple to convert each continuous variable in a dataset to a categorical variable by applying equal-height or equal-width bins, and to make it simple to bin the same variables in a new dataset into those same bins. The primary use would be in building a model on a training set and applying it to a test set, which requires the categorical variables in both to have the same levels.

For details on the arguments to simple_bin, see the documentation ?modellingTools::simple_bin. We can observe the basic usage of simple_bin as follows:

data(Seatbelts)
d <- tbl_df(as.data.frame(Seatbelts))
d

d_bin <- simple_bin(d,bins = 3)
d_bin

Notice that the variable law retains its dbl class -- simple_bin automatically detects when a variable does not have enough unique values to be properly binned, and leaves it untouched. Variables that are already categorical are also automatically ignored. If you would like to know the cutpoints for the bins, check out modellingTools::binned_data_cutpoints below.

When building a model, you very well may wish to fit the model on one subset of the data (a training set) and evaluate predictions on a different subset (a testing set). If dealing with variables that have been discretized (by you), this requires the variables in the testing set to be binned into the same ranges.

simple_bin automatically bins an optional test dataset into the same bins used on the training set:

d %<>% mutate(on_train = times(n()) %do% if (runif(1) < .7) 1 else 0)
d_train <- d %>% filter(on_train == 1) %>% select(-on_train)
d_test <- d %>% filter(on_train == 0) %>% select(-on_train)

d_split_binned <- simple_bin(d_train,test = d_test,bins = 3)
d_split_binned

But wait-- why can't we just bin the data before we split it up into training/test sets? This implicitly assumes that the test data will always be distributed exactly the same as the training data. This is true by definition when you are creating the train/test splits yourself, but many validation cases involve testing on samples not available at the time of fitting the model.

Even if you don't have the test set at the time you build the model, or if you wish to use the same model on future datasets, simple_bin can take custom bins. These must be passed as a named list, which is the exact format output by modellingTools::binned_data_cutpoints (see below):

d_train_bin <- simple_bin(d_train,bins = 3)
train_cutpoints <- binned_data_cutpoints(d_train_bin)
train_cutpoints

d_test_bin <- simple_bin(d_test,bins = train_cutpoints)
d_test_bin

identical(d_test_bin,d_split_binned$test)

optimal_bin

This function takes a dataset and desired input variables, and computes the optimal bins with respect to a binary response. The function doesn't actually bin the data; it outputs a named list of the optimal cutpoints for each variable, which can then be passed to simple_bin:

# iterate over the Species values one row at a time
d <- iris %>% mutate(response = foreach(s = as.character(Species), .combine = c) %do% {
                                  if (s == "setosa" & runif(1) < 0.3) 1
                                  else if (s == "versicolor" & runif(1) < 0.5) 1
                                  else if (s == "virginica" & runif(1) < 0.7) 1
                                  else 0
                                })
opt_bins <- optimal_bin(train = d,
                        response = "response")

Note the automatic exclusion of categorical variables; no filtering necessary.
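
Since the output is in the same named-list format that simple_bin accepts for custom bins, applying the optimal cutpoints takes one more line (a sketch, assuming the names in opt_bins match the columns of d):

d_opt_bin <- simple_bin(d, bins = opt_bins)
d_opt_bin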

vector_bin

This function makes it easy to discretize a vector of values into equal-height, equal-width, or custom bins. It is a convenience function, essentially a wrapper around cut/cut_number/cut_interval, with one major difference: by default, missing values are given their own bin. This can be useful when modelling data that is missing not at random.

Usage is straightforward:

x <- rnorm(10)
vector_bin(x,bins = 3)
vector_bin(x,bins = 3,type = "width")
vector_bin(x,bins = c(-1,0,1))
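
To see the default treatment of missing values (they receive their own bin), try something like:

x_na <- c(rnorm(10), NA, NA)
vector_bin(x_na, bins = 3)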

get_vector_cutpoints

I found it surprisingly difficult in R to perform the simple task of figuring out the actual cutpoints of a vector that has been discretized using one of the cut family of functions. What happens if you try the following?

x <- vector_bin(rnorm(100),bins = 3)
levels(x)

You get a vector of character strings, containing brackets and commas. Okay, well I can use a string replace function from stringr to get rid of the brackets (don't forget to escape them twice!), then split the string on the commas, oh darn that returns a list, okay unlist it...

Stop. Use get_vector_cutpoints. This uses a regular expression to parse the unique numeric values from the above character vector and returns a numeric vector of the (sorted!) unique cutpoints.

get_vector_cutpoints(x)

binned_data_cutpoints

Typically data is stored in dataframes, not individual vectors. Rather than extracting each column with column_vector, calling get_vector_cutpoints on it, and repeating this for every variable in your dataset (more programming!), use binned_data_cutpoints to quickly get a named list, with each element being the numeric vector of cutpoints for the corresponding factor variable.

For example, remember the Seatbelts data that we binned? If you'd like to know the actual cutpoints used for each variable, try:

binned_data_cutpoints(d_bin)

create_model_matrix

When using the base R functions for modelling (lm, glm, etc.) and many established functions in other packages (nlme::nlme, glmm::glmm, etc.), the user is able to specify a model using a formula interface:

mod <- lm(y ~ x,data = mydata)

This is very convenient, because R automatically creates the dummy variables needed to represent factor variables in the model. Not every package offers a formula interface, though -- some expect a numeric matrix of inputs.

Base R provides the model.matrix function, which takes a formula and a matrix/dataframe and creates the corresponding dummy variables. create_model_matrix is a wrapper around this function that takes only a dataframe, automatically creates dummies for the factor variables only, and outputs either a matrix or a tbl_df.

The usage is straightforward:

# Print this in your own R console:
x <- create_model_matrix(d_train_bin)

# Prettier output:
create_model_matrix(d_train_bin,matrix_out = FALSE)

The default matrix output is suitable for input into other R packages, such as glmnet.
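
For example, here is a hedged sketch of feeding the matrix to glmnet (this assumes glmnet is installed, that simple_bin preserves the column names of d_train, and DriversKilled is an arbitrary choice of response):

# library(glmnet)
# y <- column_vector(d_train, "DriversKilled")
# x <- create_model_matrix(d_train_bin %>% select(-DriversKilled))
# fit <- glmnet(x = x, y = y)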


