library(dplyr)

By default, model.matrix() generates binary indicator variables for factor predictors. When the formula does not remove an intercept, an incomplete set of indicators are created; no indicator is made for the first level of the factor.

For example, species and island both have three levels but model.matrix() creates two indicator variables for each:

library(dplyr)
library(modeldata)
data(penguins)

levels(penguins$species)
levels(penguins$island)

model.matrix(~ species + island, data = penguins) %>% 
  colnames()

For a formula with no intercept, the first factor is expanded to indicators for all factor levels but all other factors are expanded to all but one (as above):

model.matrix(~ 0 + species + island, data = penguins) %>% 
  colnames()

For inference, this hybrid encoding can be problematic.

To generate all indicators, use this contrast:

# Switch out the contrast method
old_contr <- options("contrasts")$contrasts
new_contr <- old_contr
new_contr["unordered"] <- "contr_one_hot"
options(contrasts = new_contr)

model.matrix(~ species + island, data = penguins) %>% 
  colnames()

options(contrasts = old_contr)

Removing the intercept here does not affect the factor encodings.



topepo/parsnip documentation built on April 16, 2024, 3:23 a.m.