options(cli.width = 70, width = 70, cli.unicode = FALSE) # Load them early on so package conflict messages don't show up suppressPackageStartupMessages({ library(parsnip) library(recipes) library(workflows) library(modeldata) })
Some modeling functions in R create indicator/dummy variables from categorical data when you use a model formula, and some do not. When you specify and fit a model with a workflow()
, parsnip and workflows match and reproduce the underlying behavior of the user-specified model's computational engine.
In the [modeldata::Sacramento] data set of real estate prices, the type
variable has three levels: "Residential"
, "Condo"
, and "Multi-Family"
. This base workflow()
contains a formula added via [add_formula()] to predict property price from property type, square footage, number of beds, and number of baths:
set.seed(123) library(parsnip) library(recipes) library(workflows) library(modeldata) data("Sacramento") base_wf <- workflow() %>% add_formula(price ~ type + sqft + beds + baths)
This first model does create dummy/indicator variables:
lm_spec <- linear_reg() %>% set_engine("lm") base_wf %>% add_model(lm_spec) %>% fit(Sacramento)
There are five independent variables in the fitted model for this OLS linear regression. With this model type and engine, the factor predictor type
of the real estate properties was converted to two binary predictors, typeMulti_Family
and typeResidential
. (The third type, for condos, does not need its own column because it is the baseline level).
This second model does not create dummy/indicator variables:
rf_spec <- rand_forest() %>% set_mode("regression") %>% set_engine("ranger") base_wf %>% add_model(rf_spec) %>% fit(Sacramento)
Note that there are four independent variables in the fitted model for this ranger random forest. With this model type and engine, indicator variables were not created for the type
of real estate property being sold. Tree-based models such as random forest models can handle factor predictors directly, and don't need any conversion to numeric binary variables.
When you specify a model with a workflow()
and a recipe preprocessor via [add_recipe()], the recipe controls whether dummy variables are created or not; the recipe overrides any underlying behavior from the model's computational engine.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.