library(tidymodels)
library(keras)
library(embed)
library(ggplot2)
library(ggiraph)
embed:::is_tf_2()
options(digits = 2)
set.seed(24566)

The approach encodes categorical data as multiple numeric variables using a word embedding approach. Originally intended as a way to take a large number of word identifiers and represent them in a smaller dimension. Good references on this are Guo and Berkhahn (2016) and Chapter 6 of Francois and Allaire (2018).

The methodology first translates the C factor levels as a set of integer values then randomly allocates them to the new D numeric columns. These columns are optionally connected in a neural network to an intermediate layer of hidden units. Optionally, other predictors can be added to the network in the usual way (via the predictors argument) that also link to the hidden layer. This implementation uses a single layer with ReLu activations. Finally, an output layer is used with either linear activation (for numeric outcomes) or softmax (for classification).

To translate this model to a set of embeddings, the coefficients of the original embedding layer are used to represent the original factor levels.

As an example, we use the Ames housing data where the sale price of houses are being predicted. One predictor, neighborhood, has the most factor levels of the predictors.

library(tidymodels)
data(ames)
length(levels(ames$Neighborhood))

The distribution of data in the neighborhood is not uniform:

#| fig.alt: "Horizontal bar chart. n along the x axis, neighborhoods along the y-axis. The lengths of the bars vary from near zero for Landmarks and Green_Hills, to almost 450 for North_Ames."
ames %>%
  count(Neighborhood) %>%
  ggplot(aes(n, reorder(Neighborhood, n))) +
  geom_col() +
  labs(y = NULL) +
  theme_bw()

Fo plotting later, we calculate the simple means per neighborhood:

means <-
  ames %>%
  group_by(Neighborhood) %>%
  summarise(
    mean = mean(log10(Sale_Price)),
    n = length(Sale_Price),
    lon = median(Longitude),
    lat = median(Latitude)
  )

We'll fit a model with 10 hidden units and 3 encoding columns:

#| fig-alt: "Line chart with 2 lines. epochs along the x-axis, loss along the y-axis. The two lines are colored according to the type of loss, red for normal loss and blue for validation loss. The lines have high values for small epochs and lower values for higher epochs, with the validation loss being lower at all times."
library(embed)
tf_embed <-
  recipe(Sale_Price ~ ., data = ames) %>%
  step_log(Sale_Price, base = 10) %>%
  # Add some other predictors that can be used by the network
  # We preprocess them first
  step_YeoJohnson(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
  step_range(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
  step_embed(
    Neighborhood,
    outcome = vars(Sale_Price),
    predictors = vars(Lot_Area, Full_Bath, Gr_Liv_Area),
    num_terms = 5,
    hidden_units = 10,
    options = embed_control(epochs = 75, validation_split = 0.2)
  ) %>%
  prep(training = ames)

theme_set(theme_bw() + theme(legend.position = "top"))

tf_embed$steps[[4]]$history %>%
  filter(epochs > 1) %>%
  ggplot(aes(x = epochs, y = loss, col = type)) +
  geom_line() +
  scale_y_log10()

The embeddings are obtained using the tidy method:

hood_coef <-
  tidy(tf_embed, number = 4) %>%
  dplyr::select(-terms, -id) %>%
  dplyr::rename(Neighborhood = level) %>%
  # Make names smaller
  rename_at(
    vars(contains("emb")),
    funs(gsub("Neighborhood_", "", ., fixed = TRUE))
  )
hood_coef

hood_coef <-
  hood_coef %>%
  inner_join(means, by = "Neighborhood")
hood_coef

We can make a simple, interactive plot of the new features versus the outcome:

#| fig-alt: "Faceted scatter chart. Embedding value along the x-axis, Mean (log scale) on the y-axis. Each facet is a diffferent embedding. Appears fairly random for facet 1, 3, and 4. Small trend in embedding 2 and 5."
tf_plot <-
  hood_coef %>%
  dplyr::select(-lon, -lat) %>%
  gather(variable, value, starts_with("embed")) %>%
  # Clean up the embedding names
  # Add a new variable as a hover-over/tool tip
  mutate(
    label = paste0(gsub("_", " ", Neighborhood), " (n=", n, ")"),
    variable = gsub("_", " ", variable)
  ) %>%
  ggplot(aes(x = value, y = mean)) +
  geom_point_interactive(aes(size = sqrt(n), tooltip = label), alpha = .5) +
  facet_wrap(~variable, scales = "free_x") +
  theme_bw() +
  theme(legend.position = "top") +
  labs(y = "Mean (log scale)", x = "Embedding")

ggiraph(ggobj = tf_plot)

However, this has induced some between-predictor correlations:

hood_coef %>%
  dplyr::select(contains("emb")) %>%
  cor() %>%
  round(2)


topepo/embed documentation built on March 26, 2024, 4:11 a.m.