library(learnr)
library(tidyverse)
library(tidymodels)
library(embed)
library(corrr)
library(tidytext)
library(gradethis)
library(sortable)
library(learntidymodels)

knitr::opts_chunk$set(echo = FALSE, exercise.checker = gradethis::grade_learnr)

zoo_names <- c("animal_name", "hair", "feathers", "eggs", "milk", "airborne", "aquatic", "predator", "toothed", "backbone", "breathes", "venomous", "fins", "legs", "tail", "domestic", "catsize", "class")
anim_types <- tribble(~class, ~type,
                      1, "mammal",
                      2, "bird",
                      3, "reptile",
                      4, "fish",
                      5, "amphibian",
                      6, "insect",
                      7, "other_arthropods")
zoo <- 
  read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data", 
           col_names = zoo_names) %>%
  left_join(anim_types) %>%
  select(-class) %>%
  rename(animal_type=type)


### correlation ###
zoo_corr <- zoo %>%
  select(-animal_name, -animal_type) %>%
  correlate() %>%
  rearrange()


### PCA ####
pca_rec <- recipe(data = zoo, formula = ~ .) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_pca(all_predictors(), id = "pca")

pca_prep <- prep(pca_rec)
pca_loading <- tidy(pca_prep, id="pca")
pca_variances <- tidy(pca_prep, id = "pca", type = "variance")

pca_bake <- bake(pca_prep, zoo)

zoo_rec <- recipe(data = zoo, formula = ~.) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

zoo_prep <- prep(zoo_rec)
zoo_bake <- bake(zoo_prep, zoo)
zoo_juice <- juice(zoo_prep)

### UMAP ###
set.seed(123) 
umap_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors())

umap_prep <- prep(umap_rec)
umap_bake <- bake(umap_prep, zoo)

Welcome

Dimension reduction is a regularly used unsupervised method in exploratory data analysis and predictive models.

This tutorial will teach you how to apply these methods using the recipes package, which is a part of the tidymodels ecosystem, a collection of modeling packages designed with common APIs and a shared philosophy.

Learning objectives

This tutorial focuses on transforming whole groups of predictors together using two different dimension reduction algorithms:

  1. Linear dimensionality reduction with Principal component analysis (PCA)
  2. Non-linear dimensionality reduction with UMAP

Here, we will apply these methods to explore our data. These methods can also be used for feature extraction prior to modeling.

While applying these methods, we will cover how to create a recipe, update variable roles, add pre-processing steps, and then prep and bake the recipe.

Pre-requisites

If you are new to tidymodels, you can learn what you need with the five Get Started articles on tidymodels.org.

The second article, Preprocessing your data with recipes, shows how to use functions from the recipes package to pre-process your data prior to model fitting.

If you aren't familiar with the recipes functions yet, reading the Preprocessing your data with recipes article would be helpful before going through this tutorial.

Let's get started!

The zoo data

We will use the zoo dataset to explore these methods. zoo contains observations collected on `r nrow(zoo)` zoo animals.

To see the first ten rows of the data set, click on Run Code.
You can use the black triangle that appears at the top right of the table to scroll through all of the columns in zoo.

zoo

Alternatively, use glimpse() from the dplyr package to see columns in a more compact way. You can click on the Solution button to get help.

library(tidyverse)
glimpse(___)
glimpse(zoo)

We can see that zoo has `r nrow(zoo)` rows and `r ncol(zoo)` columns, two of which (animal_name and animal_type) are character columns.

Let's count the number of animals for each animal_type.

zoo %>% 
  count(___)
zoo %>% 
  count(animal_type)

While looking at the numbers helps, plotting is always a good idea to get an overall view of the data, especially if many sub-categories are present.

Plot the number of animals in each animal_type category. Fill in the blanks and click on Run Code to generate the plot.

zoo %>%
  ggplot(aes(___)) +
  geom____(fill="#CA225E") +
  theme_minimal()
zoo %>%
  ggplot(aes(animal_type)) +
  geom_bar(fill="#CA225E") +
  theme_minimal()

We can also look at the distribution of egg-laying animals across animal types. The eggs column is coded as 0 for animals that don't lay eggs and 1 for animals that do. Let's do some data wrangling with the recode() function to plot it neatly. Click on Solution if you are stuck.

zoo %>%
  mutate(eggs = recode(eggs, `0`=___, `1`=___)) %>%
  ggplot(aes(___, fill=___)) +
  geom___() +
  scale_fill_manual(values = c("#372F60", "#CA225E")) +
  theme_minimal() +
  theme(legend.position = "top")
zoo %>%
  mutate(eggs = recode(eggs, `0`="doesn't lay eggs", `1`="lays eggs")) %>%
  ggplot(aes(animal_type, fill=eggs)) +
  geom_bar() +
  scale_fill_manual(values = c("#372F60", "#CA225E")) + 
  theme_minimal() +
  theme(legend.position = "top")

Not so surprisingly, there are very few mammals that lay eggs. Let's get the actual count.

zoo %>% 
  count(___, ___)
zoo %>% 
  count(animal_type, eggs)

It looks like there is one mammal that lays eggs! Can you find the name of that animal?

zoo %>%
  filter(___ == ___) %>%
  # select relevant columns for a compact view
  select(animal_name, animal_type, eggs) 
zoo %>%
  filter(animal_type == "mammal",
         eggs == 1) %>%
  # select relevant columns for a compact view
  select(animal_name, animal_type, eggs) 

Correlation matrix

Having some familiarity with the animal kingdom, we would expect that most animals that produce milk do not lay eggs. In other words, we would expect to see a negative correlation between these features.

Let's see how these animal features correlate with each other to get a sense of these relationships.

Run the code to plot the correlation matrix using the corrr package.

Here we are using three functions from the corrr package:

  1. correlate() to compute the pair-wise correlations and return them as a data frame
  2. rearrange() to order the variables so that highly correlated pairs sit close together
  3. shave() to set the upper triangle of the matrix to NA, removing duplicated values

library(corrr)
zoo_corr <- zoo %>%
  # drop non-numeric columns
  select(___, ___) %>%
  correlate() %>%
  rearrange() %>%
  shave()

zoo_corr
library(corrr)
zoo_corr <- zoo %>%
  # drop non-numeric columns
  select(-animal_name, -animal_type) %>%
  correlate() %>%
  rearrange() %>%
  shave() 
zoo_corr

The output is a data frame containing pair-wise correlation coefficients between variables. But it would take us a long time to get an overall sense of these relationships just by looking at raw numbers. Let's plot the correlation matrix with corrr::rplot to help our brains out.

zoo_corr %>%
  rplot(shape = 15, colours = c("#372F60", "white", "#CA225E"), print_cor = TRUE) +
  theme(axis.text.x = element_text(angle = 20, hjust = .6))

corrr::rplot() is quite handy because it returns a ggplot object, which can be further customized with ggplot2 functions.

So much better!

We can see that producing eggs and producing milk have a very strong negative correlation. (The oddball platypus is one reason it isn't exactly -1.)
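If you want to confirm a single pair, corrr::focus() can pull one variable's correlations out of a correlation data frame. Here is a minimal sketch (an aside, not one of the exercises) that checks the milk/eggs pair on a fresh, un-shaved correlation matrix:

zoo %>%
  select(-animal_name, -animal_type) %>%
  correlate() %>%
  # keep only the column of correlations with milk
  focus(milk) %>%
  filter(term == "eggs")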

Now, see if you can answer the question correctly.

question("What is the pair of animal features that has the strongest _positive_ correlation?",
         answer("Tail & Backbone"),
         answer("Fins & Aquatic"),
         answer("Milk & Hair", correct = TRUE),
         answer("Feathers & Airborne"),
         incorrect = "Incorrect. While these two features have a positive correlation it is not the strongest.",
         allow_retry = TRUE
         )

Principal component analysis

Principal component analysis (PCA) is a handy data reduction technique that takes a covariance or correlation matrix of a set of observed variables (just like the one we visualized) and summarizes it with a smaller set of linear combinations called principal components (PCs). These components are uncorrelated with one another and capture the maximum amount of information (i.e. variance) in the original variables. This means the components can be used to combat large inter-variable correlations in statistical modeling. PCA can also help us explore the similarities between observations and the groups they belong to.
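Since PCA is a standard technique, we can sanity-check the recipe workflow that follows against base R. Here is a minimal sketch (an aside; zoo_features and pca_base are illustrative names) using prcomp(), which performs the same centering, scaling, and decomposition the recipe steps below spell out:

# an aside: the same computation with base R's prcomp()
zoo_features <- zoo %>% select(-animal_name, -animal_type)
pca_base <- prcomp(zoo_features, center = TRUE, scale. = TRUE)
summary(pca_base) # standard deviation and proportion of variance per component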

Here's the scatter plot of the first two principal components (PC1 and PC2) of the zoo data:

pca_bake %>%
  ggplot(aes(PC1, PC2, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = "Animal Type") +
  theme_minimal()

Each dot on the plot represents an observation (animal) that is colored by the animal_type and labeled by animal_name.

Overall, we can see that animals of the same type cluster closely together. This suggests that these features (having hair, feathers, laying eggs, etc.) are doing a relatively good job at identifying the clusters within the zoo data.

Create a recipe

Let's implement principal component analysis (PCA) using recipes.

First, we initiate our recipe with the zoo data.

library(tidymodels)
pca_rec <- recipe(~., data = ___) 
pca_rec <- recipe(~., data = zoo) 

Here, we define two arguments:

  1. A formula (~.), stating that all variables in the data should be included
  2. The data set to build the recipe with (data = zoo)

Once we initiate the recipe, we can keep adding new roles and steps.

For example, we already told our recipe to include all variables with our formula; however, we want to exclude the identifier columns animal_name and animal_type from our analysis. At the same time, we will need these variables later when plotting our results. By using update_role() we exclude these variables from the analysis without dropping them completely in the next steps:

pca_rec <- recipe(~., data = zoo) %>%
  # update the role for animal_name and animal_type
  update_role(___, ___, new_role = "id")
pca_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id")

Try using summary() to see the defined roles in pca_rec and arrange them by the role column.

summary(___) %>%
  arrange(___)
summary(pca_rec) %>%
  arrange(role)

We can see that the role of animal_name and animal_type is now defined as id and the remaining variables are listed as predictor.

Good job! Now, let's add some steps to our recipe.

Add steps to a recipe

Since PCA is a variance maximizing exercise, it is important to scale variables so their variance is commensurable. We will achieve this by adding two steps to our recipe:

  1. A step to scale each predictor to a standard deviation of one
  2. A step to center each predictor at a mean of zero

We can also use the helper function all_predictors() to select all the variables that have a role defined as predictor.

pca_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  # add steps to scale and center
  step____(all_predictors()) %>%
  step____(all_predictors())
pca_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

Alternatively, we can accomplish scaling and centering in one single step. Take a look at this group of step functions on the recipes reference page, and see if you can answer the question below correctly:

question("What function can replace both centering and scaling steps?",
         answer("step_interact"),
         answer("step_regex"),
         answer("step_normalize", correct = TRUE),
         answer("step_date"),
         incorrect = "Incorrect. Try again.",
         allow_retry = TRUE
         )

We are ready to add our final step and compute our principal components!

Use step_pca() to tell the recipe to convert all variables (except animal_name and animal_type) into principal components.

pca_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  # add step for PCA computation
  step____(___, id = "pca")
pca_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_pca(all_predictors(), id = "pca")

Did you notice the additional argument id = "pca" there? If we take a look at the step_pca help page, we see that this argument allows us to provide a unique string to identify the step. Providing a step id will come in handy when we need to extract additional values from that step. Similarly, we could assign a unique id to any step we would like to work with later.
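As an aside, the same idea works for any step. Here is a hypothetical sketch (norm_rec and the id "norm" are illustrative names, not part of this tutorial) that tags a normalization step and then extracts its trained values directly:

# hypothetical: give a normalization step its own id, then extract its values
norm_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors(), id = "norm")

tidy(prep(norm_rec), id = "norm")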

Now, let's print the pca_rec by running the following code chunk.

pca_rec

We can see that pca_rec has our id and predictor variables as inputs, along with the following operations:

  1. Scaling for all_predictors()
  2. Centering for all_predictors()
  3. PCA extraction with all_predictors()

Are you surprised that we haven't extracted the PCA components yet? This is because so far we have only defined our recipe, but have not trained it. To get the results from our PCA, we need to evaluate our recipe using prep().

Prep a recipe

Let's prep our recipe and print the output:

pca_prep <- prep(___)
pca_prep
pca_prep <- prep(pca_rec)
pca_prep

Can you see the difference between the outputs of pca_rec and pca_prep? After prepping, we can see that the scaling, centering, and PCA extraction steps have been trained on all the columns of interest.

Let's take a look at the steps this recipe contains with tidy():

tidy(___)
tidy(pca_prep)

We can see that three steps are contained in this prepped recipe:

  1. scale
  2. center
  3. pca

With tidy() we can extract the intermediate values computed in each step by providing the step number as an argument.

For example, you can extract the mean values for each predictor variable from the second step of our recipe (center) using the tidy method:

tidy(___, ___)
tidy(pca_prep, 2)

Using the same method, we can also extract the variable loadings for each component from our step_pca:

tidy(___, ___)
tidy(pca_prep, 3)
You can see that these underlying values can be different for each step, but they always appear in a column called value when extracted with the tidy method. You can find the definition of these underlying values under the Value section in the help page of the related step function. For example, take a look at the step_scale help document and scroll down to see the Value section.

Alternatively, we can use the id argument (the one we specifically provided for this step) and specify the type of underlying value we would like to extract.

pca_loading <- tidy(___, ___, ___)

pca_loading
pca_loading <- tidy(pca_prep, id = "pca", type = "coef")

pca_loading
How did we know what to extract? Take a look at the step_pca help document: the description of the type argument explains the values available when using this step with the tidy() method.

In the PCA setting, loadings indicate the correlation between the principal component and the variable. In other words, large loadings suggest that a variable has a strong effect on that principal component.
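Before plotting, a quick dplyr sketch (an aside, reusing the pca_loading object from above) can surface the single largest absolute loading in each component:

pca_loading %>%
  group_by(component) %>%
  # keep the row with the largest absolute loading per component
  slice_max(abs(value), n = 1) %>%
  ungroup()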

Let's take a look at loadings we generated with the zoo data!

We will use the plot_top_loadings() function from the learntidymodels package to plot the absolute values of the loadings, making them easy to compare, colored by the direction of the loading (positive or negative). plot_top_loadings() takes three arguments:

  1. A prepped recipe containing a PCA step
  2. A condition to filter the components, e.g. component_number <= 4
  3. n, the number of top loadings to plot per component

Fill in the blanks to plot the first four principal components and the top six variables with the largest absolute loadings:

library(learntidymodels)
plot_top_loadings(___, component_number ___, n = ___) + 
  scale_fill_manual(values = c("#372F60", "#CA225E")) +
  theme_minimal()
library(learntidymodels)
plot_top_loadings(pca_prep, component_number <= 4, n = 6) + 
  scale_fill_manual(values = c("#372F60", "#CA225E")) +
  theme_minimal()

It looks like PC1 (the first principal component) is mostly about producing milk or eggs and having hair. Notice that the loading directions for milk and hair are the same, and opposite to eggs. Do you remember the strongest positive correlation we found? PC2, on the other hand, seems to be about having fins or being aquatic, both of which point in the opposite direction to breathing. Tails and feathers both load strongly on PC3, and finally, PC4 is mostly about being domestic or being a predator, which have opposite directions. Overall, we can say that PC1 is mostly about being a mammal, PC2 about being a fish or an aquatic animal, PC3 about being a bird, and PC4 about being domesticated.

Bake a recipe

So far we:

  1. Defined preprocessing operations with recipe
  2. Trained our recipe with prep

Finally, in order to apply these computations to our data and extract the principal components, we will use bake by providing two arguments:

  1. A prepped (trained) recipe
  2. The data we would like to apply these computations to
pca_bake <- bake(___, ___)
pca_bake
pca_bake <- bake(pca_prep, zoo)
pca_bake

Now that we have our principal components, we are ready to plot them! `r emojifont::emoji('tada')`

Let's plot the first two principal components, while labeling our points with animal_name and coloring them by animal_type.

library(ggplot2)

pca_bake %>%
  ggplot(aes(___, ___, label=___)) +
  geom_point(aes(color = ___), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()
pca_bake %>%
  ggplot(aes(PC1, PC2, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()

We were able to reproduce our initial plot! Let's take a look at it in more detail.

Look at the mammals: the majority of them are separated from the other types of animals along the PC1 axis. Recall our loadings plot: milk and eggs were the top two features with the largest loadings for PC1. Interestingly, the platypus (our favorite oddball) is placed closer to the reptiles and the penguin on the PC1 axis. This is likely driven by its laying eggs and not having teeth. On the other hand, the seal and especially the dolphin are located closer to the fish and the sea snake, separate from the rest of the mammals along the PC2 axis. Do you remember the top two loadings for PC2? Fins and aquatic! Do you see the pattern here?

A common practice when conducting PCA is to check how much variability in the data is captured by the principal components. Typically, this is done by looking at the eigenvalues or their percent proportions for each component. Let's extract the variance explained by each principal component using the tidy() method.

pca_variances <- tidy(___, id = "pca", type = "___")
pca_variances
pca_variances <- tidy(pca_prep, id = "pca", type = "variance")
pca_variances

When we take a close look at the terms column of pca_variances, we see that various variance calculations are available for our 16 principal components.

pca_variances %>%
  count(___)
pca_variances %>%
  count(terms)

Now, let's plot them to help our brains out once more:

pca_variances %>%
  filter(terms == "percent variance") %>%
  ggplot(aes(___, ___)) +
  geom_col(fill="#372F60") +
  labs(x = "Principal Components", y = "Variance explained (%)") +
  theme_minimal()
pca_variances %>%
  filter(terms == "percent variance") %>%
  ggplot(aes(component, value)) +
  geom_col(fill="#372F60") +
  labs(x = "Principal Components", y = "Variance explained (%)") +
  theme_minimal()

We can see that the first three principal components explain the majority of the variance in the data. But it is difficult to see the cumulative variance explained in this plot. Let's tweak the filter() call to plot the cumulative variance explained:

pca_variances %>%
  filter(terms == "___") %>%
  ggplot(aes(___, ___)) +
  geom_col(fill="#372F60") +
  labs(x = "Principal Components", y = "Cumulative variance explained (%)") +
  theme_minimal()
pca_variances %>%
  filter(terms == "cumulative percent variance") %>%
  ggplot(aes(component, value)) +
  geom_col(fill="#372F60") +
  labs(x = "Principal Components", y = "Cumulative variance explained (%)") +
  theme_minimal()

We can see that 50% of the variance is explained by the first two components. Using more components would capture even more of the variance in the data. That is why it is also common to plot several component pairs to get a better picture of the data.
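A related, practical question is how many components are needed to hit a given variance target. Here is a minimal sketch (an aside; the 90% threshold is arbitrary) using the pca_variances object from above:

pca_variances %>%
  # keep components at or beyond a 90% cumulative variance target
  filter(terms == "cumulative percent variance", value >= 90) %>%
  # the smallest qualifying component number answers the question
  slice_min(component)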

Try plotting PC1 and PC3. Do you see clusters in the data that weren't as obvious with PC1 and PC2?

pca_bake %>%
  ggplot(aes(___, ___, label=___)) +
  geom_point(aes(color = ___), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()
pca_bake %>%
  ggplot(aes(PC1, PC3, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()

UMAP

Uniform manifold approximation and projection (UMAP) is a non-linear, graph-based dimension reduction algorithm. It finds local, low-dimensional representations of the data and can be run unsupervised or supervised with different types of outcome data (e.g. numeric, factor, etc.).
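As an aside, here is a hedged sketch of what the supervised variant could look like via step_umap()'s outcome argument, using animal_type to guide the embedding (illustrative only; the tutorial below runs UMAP unsupervised, and umap_sup_rec is a hypothetical name):

# a sketch of supervised UMAP guided by animal_type
umap_sup_rec <- recipe(animal_type ~ ., data = zoo) %>%
  update_role(animal_name, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(animal_type))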

Among the advantages of UMAP are that it can capture non-linear relationships that PCA would miss and that it scales well to large data sets.

Now that we've learned how to create a recipe for PCA, we can apply that knowledge to create one for UMAP with the zoo data too!

umap_ord <- c(
  "recipe(~., data = zoo) %>%",
  "update_role(animal_name, animal_type, new_role = \"id\") %>%",
  "step_normalize(all_predictors()) %>%",
  "step_umap(all_predictors())"
)

question_rank(
  "Sort the following recipe steps to create a recipe with UMAP:",
  answer(umap_ord, correct = TRUE),
  allow_retry = TRUE
)

Fill in the blanks to create a recipe with UMAP, then prep and bake the recipe to compute UMAP components:

set.seed(123) # set a seed to reproduce random number generation
library(embed) # load the library to use `step_umap()`  

## Create the recipe accordingly
umap_rec <- recipe(~., data = ___) %>%
  update____(___, ___, new_role = "id") %>%
  step_normalize(___) %>%
  step___(___)

## Train your recipe with prep
umap_prep <- prep(___)

## Apply computations with bake
umap_bake <- bake(___, ___)
umap_bake
set.seed(123) # set a seed to reproduce random number generation
library(embed) # load the library to use `step_umap()`  

## Create the recipe accordingly
umap_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors())

## Prep your recipe
umap_prep <- prep(umap_rec)

## Extract UMAP components with bake
umap_bake <- bake(umap_prep, zoo)
umap_bake

Great job!

It's time to plot our UMAP components! `r emojifont::emoji('tada')`

umap_bake %>%
  ggplot(aes(___, ___, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()
umap_bake %>%
  ggplot(aes(UMAP1, UMAP2, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()

Do you think UMAP is doing a better job than PCA?

UMAP hyperparameters

The UMAP algorithm has multiple hyperparameters that can have a significant impact on the results. Four of the major ones, using step_umap's argument names, are:

  1. neighbors: the number of nearest neighbors used to construct the underlying graph
  2. num_comp: the number of UMAP components to compute
  3. min_dist: the effective minimum distance between embedded points
  4. learn_rate: the initial learning rate for the embedding optimization

These hyperparameters can be specified in the step_umap() function. You can find more details in the step_umap help document and additional explanations of the hyperparameters here.

Try setting num_comp (the number of components) to 3 and plot the UMAP components again, but this time use the first and the third components:

set.seed(123) # set a seed to reproduce random number generation
library(embed) # load the library to use `step_umap()`  

## Create the recipe accordingly
umap_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), ___ = ___)

## Prep your recipe
umap_prep <- prep(umap_rec)

## Extract UMAP components with bake
umap_bake <- bake(umap_prep, zoo)

## Plot UMAP components
umap_bake %>%
  ggplot(aes(___, ___, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()
set.seed(123) # set a seed to reproduce random number generation
library(embed) # load the library to use `step_umap()`  

## Create the recipe accordingly
umap_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 3)

## Prep your recipe
umap_prep <- prep(umap_rec)

## Extract UMAP components with bake
umap_bake <- bake(umap_prep, zoo)

## Plot UMAP components
umap_bake %>%
  ggplot(aes(UMAP1, UMAP3, label=animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2)+
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()

Looks like increasing the number of UMAP components did not change our results much. Try other hyperparameters and see if you can produce different plots; one possible starting point is sketched below. Don't forget to modify the plot accordingly!
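For example, here is a hedged sketch (the values are arbitrary and umap_tweak_rec is an illustrative name) that uses a smaller neighbors and a larger min_dist:

set.seed(123)
# a sketch: fewer neighbors emphasizes local structure, larger min_dist spreads points out
umap_tweak_rec <- recipe(~., data = zoo) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), neighbors = 5, min_dist = 0.5)

bake(prep(umap_tweak_rec), zoo) %>%
  ggplot(aes(UMAP1, UMAP2, label = animal_name)) +
  geom_point(aes(color = animal_type), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL) +
  theme_minimal()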

Bake vs juice

Good job! You completed all the steps and applied dimensionality reduction to the zoo data set using the recipes package from tidymodels! `r emojifont::emoji('star2')` But before you start your victory lap, let's go over what we learned one last time.

To implement dimensionality reduction with the recipes package, we took the following steps:

  1. Create a recipe using a data set and formula with recipe()
  2. Update variable roles with update_role()
  3. Define pre-processing steps with step_*()
  4. Train pre-processing steps with prep()
  5. Apply computations and extract pre-processed data with bake()
knitr::include_graphics("https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/recipes.png")

Throughout this tutorial, we used bake to apply the computations from a trained recipe to our data set. The bake method is great because it allows us to apply a set of specifications and computations generated with prep to the data of our choice. This is especially handy when you are dealing with training, validation, or test sets during your modeling process.
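For instance, here is an illustrative sketch (not part of this tutorial; zoo_split, trained_rec, train_pcs, and test_pcs are hypothetical names) of training a recipe on a training split and then baking both splits with it:

# train the recipe on the training split only, then bake both splits
set.seed(123)
zoo_split <- initial_split(zoo)
trained_rec <- prep(zoo_rec, training = training(zoo_split))
train_pcs <- bake(trained_rec, new_data = training(zoo_split))
test_pcs <- bake(trained_rec, new_data = testing(zoo_split))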

However, if we want to extract the computations generated with prep() and don't need to apply them to a new data set, we can simply use juice. For example, throughout this tutorial we only worked with the zoo data, so we could have extracted our principal components with juice(pca_prep).

Let's create our recipe, add steps, and train it with prep one last time. Then we'll extract the principal components, first with bake and then with juice.

zoo_rec <- recipe(data = zoo, formula = ~.) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

zoo_prep <- prep(zoo_rec)

# bake
zoo_bake <- bake(___, ___)
zoo_bake
zoo_rec <- recipe(data = zoo, formula = ~.) %>%
  update_role(animal_name, animal_type, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

zoo_prep <- prep(zoo_rec)

# bake
zoo_bake <- bake(zoo_prep, zoo)
zoo_bake

Now, let's juice! `r emojifont::emoji('tropical_drink')`

# juice
zoo_juice <- juice(___)
zoo_juice
# juice
zoo_juice <- juice(zoo_prep)
zoo_juice

Do you see any difference in the output between bake and juice? Let's compare them with the base R function all.equal(), which returns TRUE if the compared objects are (nearly) equal:

all.equal(___, ___)
all.equal(zoo_bake, zoo_juice)

Final words

Congratulations! You've completed the tutorial! It's time for your victory lap! `r emojifont::emoji('runner')`

Equipped with the necessary know-how, you are now ready to apply these tools in the wild. If you ever face obstacles on your journey, don't forget to check out the resources on tidymodels.org, including the Get Started articles and the recipes reference pages.


