title: "Example 2: Penguin dataset with torch and luz" output: rmarkdown::html_vignette: toc: true always_allow_html: yes vignette: > %\VignetteIndexEntry{Example 2: Penguin dataset with torch and luz} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} &\VignetteDepends{luz, palmerpenguins, GGally}
knitr::opts_chunk$set( fig.dpi = ifelse(Sys.getenv("RENDER_PLOTLY", unset = 0) == 1, 400, 50), collapse = TRUE, comment = "#>", fig.align = "center", out.width = "90%", eval = requireNamespace("palmerpenguins") & requireNamespace("luz") & torch::torch_is_installed() )
Sys.setenv(LANG = "en_US.UTF-8") set.seed(111) torch::torch_manual_seed(111)
r knitr::asis_output("\U1F4DD")
Note
Since the innsight package relies on the torch package for each method and requires a successful installation of LibTorch and other dependencies (torch::install_torch()
), no examples can be run in the R vignette for memory reasons. For the executed code chunks, we refer to our website.
This example is even slightly more complex than the first one, considering the Iris dataset, because we are using the penguin dataset, which has both numeric and categorical variables. Since our package innsight currently only accepts numeric values as inputs, you have to keep a few points in mind and use little tricks. But in the future, it should be possible to use categorical variables directly with the package. In addition, we want to show how to use torch models trained with the higher level API luz in this example.
The dataset includes measurements of penguin species Adelie ($162$), Chinstrap ($68$), and Gentoo ($124$) collected near Palmer Station, Antarctica. For each of the $344$ penguins, numerical values of bill length/depth, flipper length, body mass, and study year were measured, and additional categorical characteristics like sex and island were recorded. To give you a brief overview of the data, let's create a pair plot of some variables using GGally:
library(palmerpenguins) # remove NAs penguin_data <- na.omit(penguins) # create plot GGally::ggpairs(penguin_data, ggplot2::aes(color = species, alpha = 0.75), columns = 3:7, progress = FALSE )
Based on this graphic, we can already hypothesize which characteristics might be relevant for the individual classes. For example, it is visible that the Gentoo species (blue) is much heavier, has longer flippers and thinner bills than the other species. On the other hand, for the species Adelie (red), the bill length could be a decisive feature for the classification. In the following, we will see if these hypotheses match the results of the feature attribution methods.
The package luz makes things much easier for training a torch model,
e.g., by doing the for-loop over the epochs for you. However, the model must be
defined as nn_module
and the data as a torch::dataset
,
but currently, innsight only works with torch models of class nn_sequential
.
Therefore a data preparation step and a few tricks are necessary to use
sequential models trained in luz with the package innsight.
As previously mentioned, innsight needs numeric input, and luz needs a
torch dataset. For this reason, we create a dataset with already encoded
categorical variables (for more information about the dataset
defined for
torch models see the
torch vignette).
# Load packages library(torch) library(luz) # Create torch penguin dataset penguin_dataset <- dataset( name = "penguin_dataset", initialize = function(df) { df <- na.omit(df) # remove NAs # Get all numeric features and transform them to `torch_tensor` x_cont <- df[, c( "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year" )] x_cont <- torch_tensor(as.matrix(x_cont)) # Get and encode (one-hot) categorical features and transform them to `torch_tensor` x_cat <- sapply(df[, c("island", "sex")], as.integer) x_cat <- torch_tensor(x_cat, dtype = torch_int64()) x_cat <- torch_hstack(list( nnf_one_hot(x_cat[, 1]), nnf_one_hot(x_cat[, 2]) )) # Stack and store all features together in the field 'x' self$x <- torch_hstack(list(x_cont, x_cat)) # Get, transform and store the target variable (the three penguin species) self$y <- torch_tensor(as.integer(df$species)) }, .getitem = function(i) { list(x = self$x[i, ], y = self$y[i]) }, .length = function() { self$y$size()[[1]] } )
Using the above method, we can now create a torch dataset from any subset of
instances from the penguin dataset. Consequently, we can now get a training
and validation dataset and build a data loader provided by the function
torch::dataloader
.
Beforehand, however, we normalize all numerical inputs:
# Normalize inputs scaled_penguin_ds <- penguin_data scaled_penguin_ds[, c(3:6, 8)] <- scale(penguin_data[, c(3:6, 8)]) # Create train and validation split idx <- sample(seq_len(nrow(penguin_data))) train_idx <- idx[1:250] valid_idx <- idx[251:333] # Create train and validation datasets train_ds <- penguin_dataset(scaled_penguin_ds[train_idx, ]) valid_ds <- penguin_dataset(scaled_penguin_ds[valid_idx, ]) # Create dataloaders train_dl <- dataloader(train_ds, batch_size = 32, shuffle = TRUE) valid_dl <- dataloader(valid_ds, batch_size = 32, shuffle = FALSE) # We use the whole dataset as the test data test_ds <- penguin_dataset(scaled_penguin_ds)
The package luz requires an object of the class nn_module
, which cannot be
used for innsight. But a sequential model can be nested in a nn_module
and
extracted afterward. For this reason, we define a model that uses only the
sequential network of class nn_sequential
stored as a field. In the same way,
you can assemble a model from several sequential models and, if necessary, apply
each model individually with innsight. For example, if you want to use several
data pre-processing layers before the actual model.
# Create the model net <- nn_module( initialize = function(dim_in) { # Here, we define our sequential model self$seq_model <- nn_sequential( nn_linear(dim_in, 256), nn_relu(), nn_dropout(p = 0.4), nn_linear(256, 256), nn_relu(), nn_dropout(p = 0.4), nn_linear(256, 64), nn_relu(), nn_dropout(p = 0.4), nn_linear(64, 3), nn_softmax(dim = 2) ) }, # The forward pass should only contain the call of the sequential model forward = function(x) { self$seq_model(x) } )
Now we can easily train the model with luz and extract the sequential model to be used for innsight:
# We have imbalanced classes, so we weight the output classes accordingly weight <- length(train_ds$y) / (3 * torch_stack(lapply(1:3, function(i) sum(train_ds$y == i)))) # Fit the model fitted <- net %>% setup( loss = function(input, target) nnf_nll_loss(log(input), target, weight = weight), optimizer = optim_adam, metrics = list( luz_metric_accuracy() ) ) %>% set_hparams(dim_in = 10) %>% fit(train_dl, epochs = 50, valid_data = valid_dl) # Extract the sequential model model <- fitted$model$seq_model # Show result get_metrics(fitted)[c(99, 100, 199, 200), ]
Analogous to the first
example
considering the Iris dataset, we can easily create the Converter
object by
passing the extracted model and the input dimension to the R6 class. However,
since the categorical variables have already been encoded, there are a total of
$10$ inputs ($5$ numeric features, $3$ islands and $2$ genders) and the input
names (optional argument) must be expanded accordingly.
library(innsight) # Define input and output names input_names <- c( "bill_length", "bill_depth", "flipper_length", "body_mass", "year", "island_Biscoe", "island_Dream", "island_Torgersen", "sex_female", "sex_male" ) output_names <- c("Adelie", "Chinstrap", "Gentoo") # Create the `Converter` object converter_1 <- convert(model, input_dim = 10, input_names = input_names, output_names = output_names )
But it is also possible, with the help of a small hack, to combine the results of the categorical variables, for example, to get only one relevance value for the overall feature island and not for each of the three islands. This will be explained in the next step and we create a converter for this variant here:
# Create a second `Converter` object for combined categorical features converter_2 <- convert(model, input_dim = 10, output_names = output_names )
Now we can apply the implemented methods to our trained model. Although these
methods are different, they are called more or less the same way in
innsight. Essential arguments are, of course, the converter object
(converter
) and the data (data
) to which the method is to be applied. There
are also other basic and model-specific arguments, but they are already
explained in the other more detailed
vignette.
Now two possibilities are presented with the categorical variables: either
considering the categorical expressions individually or summarizing them.
In this variant, nothing needs to be changed, because it already corresponds to the default result, i.e., you can use the method of your choice as usual:
# Data to be analyzed (in this case, we use the whole dataset) data <- test_ds$x # Apply method 'LRP' with rule alpha-beta lrp_ab_1 <- run_lrp(converter_1, data, rule_name = "alpha_beta", rule_param = 2) # the result for 333 instances, 10 inputs and all 3 outputs dim(get_result(lrp_ab_1))
In this case, we need to use a few tricks to make it clear to the method
object that instead of $10$ inputs, there are only $7$ ($5$ numeric and $2$
categorical). On the one hand, the input dimension and the input names have
to be adjusted in the passed converter object; On the other hand, the results
of the categorical features have to be combined so that the field
result
fits the input dimensions again:
# Apply method as in the other case lrp_ab_2 <- run_lrp(converter_2, data, rule_name = "alpha_beta", rule_param = 2) # Adjust input dimension and input names in the method converter object lrp_ab_2$converter$input_dim[[1]] <- 7 lrp_ab_2$converter$input_names[[1]][[1]] <- c( "bill_length", "bill_depth", "flipper_length", "body_mass", "year", "island", "sex" ) # Combine (sum) the results for feature 'island' and 'sex' lrp_ab_2$result[[1]][[1]] <- torch_cat(list( lrp_ab_2$result[[1]][[1]][, 1:5, ], # results for all numeric features lrp_ab_2$result[[1]][[1]][, 6:8, ]$sum(2, keepdim = TRUE), # results for feature island lrp_ab_2$result[[1]][[1]][, 9:10, ]$sum(2, keepdim = TRUE) # results for feature sex ), dim = 2 ) # Now we have the desired output shape with combined categorical features dim(get_result(lrp_ab_2)) #> [1] 333 7 3
r knitr::asis_output("\U1F4DD")
Note
Even if the input names can be passed to the converter in other ways, they will be stored internally as a list (input layer) of lists (for each dimension). Since, in this case, we have only one input layer and only one feature dimension, it is sufficient to modify onlylrp_ab_2$converter$input_names[[1]][[1]]
. In the same way, regardless of the model's architecture, the result is stored as a list (output layer) of lists (input layer), which is why in our case two lists are indexed by1
before the results are modified.
The package innsight provides two ways to visualize the results of a method,
namely as innsight_ggplot2
or innsight_plotly
object. Both are S4 classes to
combine multiple plots nicely and to be able to make visual modifications or
adjustments to the selection of plots even after the object has been created.
The first class is based on ggplot2 and
behaves partly like an ordinary ggplot2 object. Whereas the other one is
based on the plotly package and creates an
interactive graph with more detailed information about each variable. For more
information on the S4 classes innsight_ggplot
and innsight_plotly
see the
in-depth
vignette
or the respective R documentation (?innsight_ggplot2
or ?innsight_plotly
).
For each of these classes and thus, of course, also for each method, there are two plot functions:
plot()
shows only individual data points andplot_global()
visualizes summaries of multiple data points using summary
statistics (you can also use the alias boxplot()
for tabular and signal data).The function plot()
is implemented for each of the available methods. You can
select your desired data points and output nodes/classes with the data_idx
and output_idx
arguments, respectively. To switch between a ggplot2 and
plotly based plot, you can use the logical as_plotly
argument, but
this requires a successful installation of the suggested plotly package.
Now we plot the results for the first instance and the classes Adelie and Gentoo (output indices $1$ and $3$), but once with separated and once with merged categorical variables:
library(ggplot2) # Separated categorical features plot(lrp_ab_1, output_idx = c(1, 3)) + coord_flip() # Combined categorical features plot(lrp_ab_2, output_idx = c(1, 3)) + coord_flip()
Interpretation
In this case, we analyze the first data point of the test data test_ds
,
which is also the first instance in the penguin dataset penguin_data
.
Unnormalized, it looks like this:
knitr::kable(penguin_data[1, ])
This is an Adelie penguin, and it exactly shows what we thought at the
beginning of this example: The key feature that argues for the class Adelie
is the bill length (bill_length_mm
). As you can see in the figures above,
it has by far the highest positive relevance for the class Adelie and a very
high negative relevance for the class Gentoo. In other words, the comparatively
short bill length strongly argues for the class Adelie and, on the other hand,
prevents the instance from being the species Gentoo.
Of course, you can also investigate the other features more deeply, such as
the island (island
) on which the penguin was spotted. The Gentoo penguins were only
sighted on Biscoe Island and Chinstrap penguins only on Dream, whereas the
Adelie penguins were equally sighted on all three islands. For this reason,
it makes sense that for our example penguin sighted on Torgersen Island, a
positive relevance for the species Adelie and a negative relevance for Gentoo
resulted.
The function plot_global()
is implemented for each of the available local methods.
Since we are dealing with tabular data in this case, the method boxplot()
can
alternatively be used instead of plot_global()
.
You can select your desired data points (default is 'all'
) and output
nodes/classes with the data_idx
and output_idx
arguments, respectively.
To switch between a ggplot2 and plotly based plot, you can use the
logical as_plotly
argument, but this requires a successful installation of
the suggested plotly package. In addition, you can use ref_data_idx
to
select a single data point that will be visualized in red as a reference
value, and preprocess_FUN
to select a function that will be applied to all
relevances in advance (e.g., the absolute value).
library(ggplot2) boxplot(lrp_ab_2, output_idx = c(1, 3), preprocess_FUN = identity, ref_data_idx = 1 ) + coord_flip() + facet_grid(rows = vars(output_node))
Interpretation
This global result also reflects the ideas from the introduction: The feature
bill_length
seems to have a very strong positive influence on the species
Adelie and a negative one on the class Gentoo. On the other hand, the variables
of the penguin's body weight (body_mass
), flipper length (flipper_length
)
and bill depth (bill_depth
) have mostly a positive relevance on the class
Gentoo.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.