In tidyverse/modelr: Modelling Functions that Work with the Pipe

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(1014)

modelr

Overview

The modelr package provides functions that help you create elegant pipelines when modelling. It was designed primarily to support teaching the basics of modelling for the 1st edition of R for Data Science.

We no longer recommend it and instead suggest https://www.tidymodels.org/ for a more comprehensive framework for modelling within the tidyverse.

Installation

# The easiest way to get modelr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just modelr:
install.packages("modelr")

Getting started

library(modelr)

Partitioning and sampling

The resample class stores a "reference" to the original dataset and a vector of row indices. A resample can be turned into a dataframe by calling as.data.frame(). The indices can be extracted using as.integer():

# a subsample of the first ten rows in the data frame
rs <- resample(mtcars, 1:10)
as.data.frame(rs)
as.integer(rs)

The class can be utilized in generating an exclusive partitioning of a data frame:

# generate a 30% testing partition and a 70% training partition
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
lapply(ex, dim)

modelr offers several resampling methods that result in a list of resample objects (organized in a data frame):

# bootstrap
boot <- bootstrap(mtcars, 100)
# k-fold cross-validation
cv1 <- crossv_kfold(mtcars, 5)
# Monte Carlo cross-validation
cv2 <- crossv_mc(mtcars, 100)

dim(boot$strap[[1]])
dim(cv1$train[[1]])
dim(cv1$test[[1]])
dim(cv2$train[[1]])
dim(cv2$test[[1]])

Model quality metrics

modelr includes several often-used model quality metrics:

mod <- lm(mpg ~ wt, data = mtcars)
rmse(mod, mtcars)
rsquare(mod, mtcars)
mae(mod, mtcars)
qae(mod, mtcars)

Interacting with models

A set of functions let you seamlessly add predictions and residuals as additional columns to an existing data frame:

set.seed(1014)
df <- tibble::tibble(
  x = sort(runif(100)),
  y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)

mod <- lm(y ~ x, data = df)
df %>% add_predictions(mod)
df %>% add_residuals(mod)

For visualization purposes it is often useful to use an evenly spaced grid of points from the data:

data_grid(mtcars, wt = seq_range(wt, 10), cyl, vs)

# For continuous variables, seq_range is useful
mtcars_mod <- lm(mpg ~ wt + cyl + vs, data = mtcars)
data_grid(mtcars, wt = seq_range(wt, 10), cyl, vs) %>% add_predictions(mtcars_mod)