selections | R Documentation |
Tips for selecting columns in step functions.
When selecting variables or model terms in step functions, dplyr-like tools
are used. The selector functions can choose variables based on their name,
current role, data type, or any combination of these. The selectors are
passed as any other argument to the step. If the variables are explicitly
named in the step function, this might look like:
recipe(~ ., data = USArrests) %>%
  step_pca(Murder, Assault, UrbanPop, Rape, num_comp = 3)
The first four arguments indicate which variables should be used in the PCA
while the last argument is a specific argument to step_pca() about the
number of components.
Note that:
- These arguments are not evaluated until the prep function for the step is
  executed.
- The dplyr-like syntax allows for negative signs to exclude variables
  (e.g. -Murder) and the set of selectors will be processed in order.
- A leading exclusion in these arguments (e.g. -Murder) has the effect of
  adding all variables to the list except the excluded variable(s), ignoring
  role information.
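As a minimal sketch of the first two points (assuming the recipes package is installed, and using tidy() only as one convenient way to inspect a step), the selectors are stored unevaluated when the step is added and are resolved to column names by prep():

```r
library(recipes)

# Selectors are captured as expressions when the step is added.
rec <- recipe(~ ., data = USArrests) %>%
  step_center(all_numeric_predictors(), -Murder)

# Before prep(): the terms are still the unevaluated selector calls.
terms_before <- tidy(rec, number = 1)$terms

# After prep(): the selectors have been resolved, in order, to column names.
terms_after <- tidy(prep(rec), number = 1)$terms

print(terms_before)
print(terms_after)
```

Here the exclusion follows another selector, so it only removes Murder from the columns already chosen.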
Select helpers from the tidyselect package can also be used:
tidyselect::starts_with(), tidyselect::ends_with(), tidyselect::contains(),
tidyselect::matches(), tidyselect::num_range(), tidyselect::everything(),
tidyselect::one_of(), tidyselect::all_of(), and tidyselect::any_of().
Note that tidyselect::everything() and the other tidyselect functions aren't
restricted to predictors. They will thus select outcomes, ID, and predictor
columns alike. This is why these functions should be used with care, and why
tidyselect::everything() likely isn't what you need.
For example:
recipe(Species ~ ., data = iris) %>%
  step_center(starts_with("Sepal"), -contains("Width"))
would only select Sepal.Length.
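One way to check a selection like this, sketched here under the assumption that the recipes package is available, is to prep the recipe and read the resolved terms back with tidy():

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris) %>%
  step_center(starts_with("Sepal"), -contains("Width")) %>%
  prep()

# Column names that the first (and only) step was trained on.
selected <- tidy(rec, number = 1)$terms
print(selected)
```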
Columns of the design matrix that may not exist when the step is coded can
also be selected. For example, when using step_pca(), the number of columns
created by feature extraction may not be known when subsequent steps are
defined. In this case, using matches("^PC") will select all of the columns
whose names start with "PC" once those columns are created.
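A small sketch of this pattern follows. It assumes the default column prefix of step_pca(), and the choice of step_center() as the downstream step is purely illustrative:

```r
library(recipes)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2) %>%
  # The PC columns do not exist yet when this step is written;
  # the selector is only resolved once prep() reaches this step.
  step_center(matches("^PC")) %>%
  prep()

# Columns picked up by the third step.
pc_cols <- tidy(rec, number = 3)$terms
print(pc_cols)
```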
There are sets of recipes-specific functions that can be used to select
variables based on their role or type: has_role() and has_type(). For
convenience, there are also functions that are more specific. The functions
all_numeric() and all_nominal() select based on type, with nominal variables
including both character and factor; the functions all_predictors() and
all_outcomes() select based on role. The functions all_numeric_predictors()
and all_nominal_predictors() select intersections of role and type. Any of
these can be used in conjunction with the previous functions described for
selecting variables using their names.
A selection like this:
data(biomass)
recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric(), -all_outcomes())
is equivalent to:
data(biomass)
recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric_predictors())
Both result in all the numeric predictors: carbon, hydrogen, oxygen, nitrogen, and sulfur.
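The equivalence can be checked by prepping both recipes and comparing the resolved terms. This is only a sketch; it assumes the biomass data is loaded from the modeldata package, which may differ from how it is made available in your session:

```r
library(recipes)

# Assumption: biomass is taken from the modeldata package.
data(biomass, package = "modeldata")

rec_a <- recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  prep()

rec_b <- recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric_predictors()) %>%
  prep()

# Resolved column names for each version of the step.
sel_a <- tidy(rec_a, number = 1)$terms
sel_b <- tidy(rec_b, number = 1)$terms

print(setequal(sel_a, sel_b))
```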
If a role for a variable has not been defined, it will never be selected using role-specific selectors.
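To illustrate, the sketch below starts from a recipe created without a formula, so no roles are assigned, and then gives roles to only a few columns with update_role(). The choice of columns is arbitrary and only for illustration:

```r
library(recipes)

# No formula is given, so every column starts without a role.
rec <- recipe(mtcars) %>%
  update_role(mpg, new_role = "outcome") %>%
  update_role(disp, hp, new_role = "predictor") %>%
  step_center(all_predictors()) %>%
  prep()

# Only columns that were given the predictor role are selected;
# columns such as wt, which still have no role, are skipped.
centered <- tidy(rec, number = 1)$terms
print(centered)
```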
Selectors can be used in step_interact() in similar ways but must be
embedded in a model formula (as opposed to a sequence of selectors). For
example, the interaction specification could be
~ starts_with("Species"):Sepal.Width. This can be useful if Species was
converted to dummy variables previously using step_dummy(). The
implementation of step_interact() is special, and is more restricted than
the other step functions. Only the selector functions from recipes and
tidyselect are allowed. User-defined selector functions will not be
recognized. Additionally, the tidyselect domain-specific language is not
recognized here, meaning that &, |, !, and - will not work.
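A minimal sketch of the interaction specification described above, assuming the recipes package is installed; the outcome chosen for the formula is arbitrary:

```r
library(recipes)

rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  # Species becomes a set of dummy columns whose names start with "Species".
  step_dummy(Species) %>%
  # The selector sits inside a model formula, not a list of selectors.
  step_interact(terms = ~ starts_with("Species"):Sepal.Width) %>%
  prep()

# Column names of the processed training set.
baked_names <- names(bake(rec, new_data = NULL))
print(baked_names)
```

Each dummy column produced by step_dummy() gets its own interaction with Sepal.Width.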
When creating variable selections:
- If you are using column filtering steps, such as step_corr(), try to avoid
  hardcoding specific variable names in downstream steps in case those
  columns are removed by the filter. Instead, use dplyr::any_of() and
  dplyr::all_of().
  - dplyr::any_of() will be tolerant if a column has been removed.
  - dplyr::all_of() will fail unless all of the columns are present in the
    data.
- For both of these functions, if you are going to save the recipe as a
  binary object to use in another R session, try to avoid referring to a
  vector in your workspace.
  - Preferred: any_of(!!var_names)
  - Avoid: any_of(var_names)
Some examples:
some_vars <- names(mtcars)[4:6]

# No filter steps, OK for not saving the recipe
rec_1 <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_log(all_of(some_vars)) %>%
  prep()

# No filter steps, saving the recipe
rec_2 <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_log(!!!some_vars) %>%
  prep()

# This fails since `wt` is not in the data
try(
  recipe(mpg ~ ., data = mtcars) %>%
    step_rm(wt) %>%
    step_log(!!!some_vars) %>%
    prep(),
  silent = TRUE
)

# Best for filters (using any_of()) and when
# saving the recipe
rec_4 <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_rm(wt) %>%
  step_log(any_of(!!some_vars)) %>%
  # equal to step_log(any_of(c("hp", "drat", "wt")))
  prep()