default_formula_blueprint: Default formula blueprint


View source: R/blueprint-formula-default.R

Description

This page holds the details for the formula preprocessing blueprint. This is the blueprint used by default by mold() if x is a formula.

Usage

default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  indicators = "traditional",
  composition = "tibble"
)

## S3 method for class 'formula'
mold(formula, data, ..., blueprint = NULL)

Arguments

intercept

A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function list.

allow_novel_levels

A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().

indicators

A single character string. Control how factors are expanded into dummy variable indicator columns. One of:

  • "traditional" - The default. Create dummy variables using the traditional model.matrix() infrastructure. Generally this creates K - 1 indicator columns for each factor, where K is the number of levels in that factor.

  • "none" - Leave factor variables alone. No expansion is done.

  • "one_hot" - Create dummy variables using a one-hot encoding approach that expands unordered factors into all K indicator columns, rather than K - 1.

composition

Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown.

formula

A formula specifying the predictors and the outcomes.

data

A data frame or matrix containing the outcomes and predictors.

...

Not used.

blueprint

A preprocessing blueprint. If left as NULL, then a default_formula_blueprint() is used.

Details

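The behavior of expanding factors into dummy variables when indicators = "traditional" and no intercept is present matches base R, but it is not always intuitive: the first factor in the formula is expanded into all K indicator columns (where K is its number of levels), and each subsequent factor is expanded into K - 1 columns.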
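As a rough illustration using base R only, the sketch below compares the columns that model.matrix() produces with and without an intercept. The data frame df and its columns f and g are made up for this sketch and are not part of hardhat.

```r
# Hypothetical toy data: `f` has 3 levels, `g` has 2 levels
df <- data.frame(
  f = factor(c("a", "b", "c", "a")),
  g = factor(c("x", "y", "x", "y"))
)

# With an intercept, each factor contributes K - 1 indicator columns
with_intercept <- model.matrix(~ f + g, df)
colnames(with_intercept)

# Without an intercept, the first factor gets all K columns,
# while later factors still get K - 1
without_intercept <- model.matrix(~ 0 + f + g, df)
colnames(without_intercept)
```

The `~ 0 +` syntax is used here only because this is plain base R. With mold(), intercept-modifying terms are not allowed in the formula, and the intercept is controlled through the intercept argument of the blueprint.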

Offsets can be included in the formula method through the use of the inline function stats::offset(). These are returned as a tibble with 1 column named ".offset" in the $extras$offset slot of the return value.
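For background, the base R sketch below shows how multiple offset() terms combine in a model frame. It does not show how hardhat extracts offsets internally; the data frame and the column names exposure_1 and exposure_2 are hypothetical.

```r
# Hypothetical toy data; `exposure_1` and `exposure_2` are made-up names
df <- data.frame(
  y = c(1, 2, 3),
  x = c(0.5, 1.5, 2.5),
  exposure_1 = c(10, 20, 30),
  exposure_2 = c(1, 2, 3)
)

frame <- stats::model.frame(
  y ~ x + offset(exposure_1) + offset(exposure_2),
  df
)

# stats::model.offset() returns the sum of all offset terms in the frame
combined <- stats::model.offset(frame)
combined
```

The single combined vector is comparable to what ends up in the ".offset" column when several offsets are supplied to mold().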

Value

For default_formula_blueprint(), a formula blueprint.

Mold

When mold() is used with the default formula blueprint:

Forge

When forge() is used with the default formula blueprint:

Differences From Base R

There are a number of differences from base R regarding how formulas are processed by mold() that require some explanation.

Multivariate outcomes can be specified on the LHS using syntax that is similar to the RHS (i.e. outcome_1 + outcome_2 ~ predictors). If any complex calculations are done on the LHS and they return matrices (like stats::poly()), then those matrices are flattened into multiple columns of the tibble after the call to model.frame(). While this is possible, it is not recommended, and if a large amount of preprocessing is required on the outcomes, then you are better off using a recipes::recipe().
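To see why flattening is needed, the base R sketch below shows that model.frame() stores the result of stats::poly() on the LHS as a single matrix column. The toy data and the flattening step at the end are illustrative assumptions, not hardhat's actual implementation.

```r
# Hypothetical toy data
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  g = factor(c("a", "b", "a", "b", "a"))
)

frame <- stats::model.frame(poly(x, degree = 2) ~ g, df)

# The frame has two columns, but the first is itself a 2-column matrix
n_frame_cols <- ncol(frame)
lhs_is_matrix <- is.matrix(frame[[1]])
n_lhs_cols <- ncol(frame[[1]])

# One possible way to flatten the matrix into ordinary columns
# (an illustration only, not the code hardhat uses)
flat <- as.data.frame(unclass(frame[[1]]))
vapply(flat, is.matrix, logical(1))
```

After flattening, each column of the polynomial basis is an ordinary numeric vector, which is the shape of result described above for the outcomes tibble.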

Global variables are not allowed in the formula. An error will be thrown if they are included. All terms in the formula should come from data.

By default, intercepts are not included in the predictor output from the formula. To include an intercept, set blueprint = default_formula_blueprint(intercept = TRUE). The rationale for this is that many packages either always require or never allow an intercept (for example, the earth package), and they do a large amount of extra work to keep the user from supplying one or removing it. This interface standardizes all of that flexibility in one place.

Examples

# ---------------------------------------------------------------------------

library(hardhat)

# Loads `example_train` and `example_test`
data("hardhat-example-data")

# ---------------------------------------------------------------------------
# Formula Example

# Call mold() with the training data
processed <- mold(
  log(num_1) ~ num_2 + fac_1,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(example_test, processed$blueprint)

# Use `outcomes = TRUE` to also extract the preprocessed outcome
forge(example_test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Factors without an intercept

# No intercept is added by default
processed <- mold(num_1 ~ fac_1 + fac_2, example_train)

# So, for factor columns, the first factor is completely expanded into all
# `K` columns (the number of levels), and the subsequent factors are expanded
# into `K - 1` columns.
processed$predictors

# In the above example, `fac_1` is expanded into all three columns, but
# `fac_2` is not. This behavior comes from `model.matrix()`, and is somewhat
# known in the R community, but can lead to a model that is difficult to
# interpret since the corresponding p-values are testing wildly different
# hypotheses.

# To get all indicators for all columns (irrespective of the intercept),
# use the `indicators = "one_hot"` option
processed <- mold(
  num_1 ~ fac_1 + fac_2,
  example_train,
  blueprint = default_formula_blueprint(indicators = "one_hot")
)

processed$predictors

# It is not possible to construct a no-intercept model that expands all
# factors into `K - 1` columns using the formula method. If required, a
# recipe could be used to construct this model.

# ---------------------------------------------------------------------------
# Global variables

y <- rep(1, times = nrow(example_train))

# In base R, global variables are allowed in a model formula
frame <- model.frame(fac_1 ~ y + num_2, example_train)
head(frame)

# mold() does not allow them, and throws an error
try(mold(fac_1 ~ y + num_2, example_train))

# ---------------------------------------------------------------------------
# Dummy variables and interactions

# By default, factor columns are expanded
# and interactions are created, both by
# calling `model.matrix()`. Some models (like
# tree based models) can take factors directly
# but still might want to use the formula method.
# In those cases, set `indicators = "none"` to not
# run `model.matrix()` on factor columns. Interactions
# are still allowed and are run on numeric columns.

bp_no_indicators <- default_formula_blueprint(indicators = "none")

processed <- mold(
  ~ fac_1 + num_1:num_2,
  example_train,
  blueprint = bp_no_indicators
)

processed$predictors

# An informative error is thrown when `indicators = "none"` and
# factors are present in interaction terms or in inline functions
try(mold(num_1 ~ num_2:fac_1, example_train, blueprint = bp_no_indicators))
try(mold(num_1 ~ paste0(fac_1), example_train, blueprint = bp_no_indicators))

# ---------------------------------------------------------------------------
# Multivariate outcomes

# Multivariate formulas can be specified easily
processed <- mold(num_1 + log(num_2) ~ fac_1, example_train)
processed$outcomes

# Inline functions on the LHS are run, but any matrix
# output is flattened (like what happens in `model.matrix()`)
# (essentially this means you don't wind up with columns
# in the tibble that are matrices)
processed <- mold(poly(num_2, degree = 2) ~ fac_1, example_train)
processed$outcomes

# TRUE
ncol(processed$outcomes) == 2

# Multivariate formulas specified in mold()
# carry over into forge()
forge(example_test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Offsets

# Offsets are handled specially in base R, so they deserve special
# treatment here as well. You can add offsets using the inline function
# `offset()`
processed <- mold(num_1 ~ offset(num_2) + fac_1, example_train)

processed$extras$offset

# Multiple offsets can be included, and they get added together
processed <- mold(
  num_1 ~ offset(num_2) + offset(num_3),
  example_train
)

identical(
  processed$extras$offset$.offset,
  example_train$num_2 + example_train$num_3
)

# Forging test data will also require
# and include the offset
forge(example_test, processed$blueprint)

# ---------------------------------------------------------------------------
# Intercept only

# Because `1` and `0` are intercept modifying terms, they are
# not allowed in the formula and are instead controlled by the
# `intercept` argument of the blueprint. To use an intercept
# only formula, you should supply `NULL` on the RHS of the formula.
mold(
  ~ NULL,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_formula_blueprint(composition = "dgCMatrix")
processed <- mold(log(num_1) ~ num_2 + fac_1, example_train, blueprint = bp)
class(processed$predictors)

DavisVaughan/hardhat documentation built on Oct. 5, 2021, 9:53 a.m.