```r
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  fig.align = "center"
)
```
This vignette presents the predict_aspects() function on two datasets: titanic_imputed and apartments (both are available in the DALEX package).
At the beginning, we load the titanic_imputed dataset and build a logistic regression model.
library("DALEX") titanic <- titanic_imputed head(titanic) model_titanic_glm <- glm(survived == 1 ~ class + gender + age + sibsp + parch + fare + embarked, titanic, family = "binomial")
Before using predict_aspects(), we need to group the features of the dataset into aspects and choose an observation for which we want to explain the model's prediction.
```r
aspects_titanic <-
  list(wealth = c("class", "fare"),
       family = c("sibsp", "parch"),
       personal = c("age", "gender"),
       embarked = "embarked")

passenger <- data.frame(
  class = factor("3rd",
                 levels = c("1st", "2nd", "3rd", "deck crew",
                            "engineering crew", "restaurant staff",
                            "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 18,
  embarked = factor("Southampton",
                    levels = c("Belfast", "Cherbourg", "Queenstown",
                               "Southampton"))
)

passenger

predict(model_titanic_glm, passenger, type = "response")
```
Now we can call the predict_aspects() function and see that the features included in wealth (that is, class and fare) have the largest contribution to the survival prediction for this passenger, and that contribution is negative. The personal and family aspects have a smaller, positive influence. The embarked aspect, with its single feature, has a very small contribution.
library("ggplot2") library("triplot") set.seed(123) titanic_without_target <- titanic[,colnames(titanic) != "survived"] explain_titanic_glm <- explain(model_titanic_glm, data = titanic_without_target, y = titanic$survived == 1, predict_function = predict, label = "Logistic Regression", verbose = FALSE) titanic_glm_ai <- predict_aspects(explain_titanic_glm, new_observation = passenger, variable_groups = aspects_titanic, N = 1000) print(titanic_glm_ai, show_features = TRUE) plot(titanic_glm_ai) + ggtitle("Aspect importance for the selected passenger")
Now, we prepare a random forest model for the titanic dataset.
library("randomForest") model_titanic_rf <- randomForest(factor(survived) == 1 ~ gender + age + class + embarked + fare + sibsp + parch, data = titanic) predict(model_titanic_rf, passenger)
After calling predict_aspects(), we can see why the survival prediction for the passenger was much higher in the random forest model (0.5) than in the logistic regression case (0.18).
In this example, the personal features (age and gender) have the largest, positive influence. The wealth (class, fare) and embarked aspects both have much smaller contributions, and those are negative. The family aspect has very little influence on the prediction.
```r
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_without_target,
                              y = titanic$survived == 1,
                              predict_function = predict,
                              label = "Random Forest",
                              verbose = FALSE)

titanic_rf_ai <- predict_aspects(explain_titanic_rf,
                                 new_observation = passenger,
                                 variable_groups = aspects_titanic,
                                 N = 1000)

print(titanic_rf_ai, show_features = TRUE)

plot(titanic_rf_ai) +
  ggtitle("Aspect importance for the selected passenger")
```
The predict_aspects() function can calculate the coefficients (that is, the aspects' importance) using either linear regression or lasso regression. With lasso, we can control how many nonzero coefficients (nonzero aspect importance values) are present in the final explanation.
To use predict_aspects() with lasso, we have to provide the n_var parameter, which declares how many nonzero aspect importance values we would like to get in the predict_aspects() results.
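As a side note, here is a minimal, generic sketch of why lasso can cap the number of nonzero importances: the penalty shrinks some coefficients exactly to zero. It uses glmnet directly on the titanic data and is only an illustration of the technique, not triplot's internal code; the glmnet package is assumed to be installed.

```r
# Illustration only (not triplot internals): along the lasso path,
# stronger regularization leaves fewer nonzero coefficients.
library("glmnet")

x <- model.matrix(survived ~ ., data = titanic)[, -1] # dummy-coded features
y <- titanic$survived

fit <- glmnet(x, y, family = "binomial") # lasso path over a grid of lambdas

# Number of nonzero coefficients at each lambda on the path:
colSums(as.matrix(coef(fit)) != 0)
```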
For this example, we use the titanic_imputed dataset and the random forest model again. With the help of the lasso technique, we would like to check the importance of the variables' aspects while requiring that two of them be equal to 0. Since we have four aspects, we call predict_aspects() with the n_var parameter set to 2.
```r
titanic_rf_ai_lasso <- predict_aspects(explain_titanic_rf,
                                       new_observation = passenger,
                                       variable_groups = aspects_titanic,
                                       N = 1000,
                                       n_var = 2)

print(titanic_rf_ai_lasso, show_features = TRUE)
```
In the examples described above, we had to manually group features into aspects. On the apartments dataset, we will test a function that automatically groups features for us (the grouping is based on the features' correlation). The function works only on numeric variables.
We import apartments from the DALEX package and choose the columns with numeric features. Then we fit a linear model to the data and choose an observation to be explained. The target variable is m2.price.
library(DALEX) data("apartments") apartments_num <- apartments[,unlist(lapply(apartments, is.numeric))] #excluding non numeric features head(apartments_num) new_observation_apartments <- apartments_num[6,] model_apartments <- lm(m2.price ~ ., data = apartments_num) new_observation_apartments predict(model_apartments, new_observation_apartments)
We run the group_variables() function with the cut-off level set to 0.6. As a result, we get a list of variable groups (aspects) in which the absolute value of the features' pairwise correlation is at least 0.6.
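To see which feature pairs clear that cut-off, we can also inspect the absolute correlation matrix ourselves. This is just an optional sanity check; group_variables() in the next chunk does the grouping for us.

```r
# Optional check: absolute pairwise correlations of the numeric features
# (the first column, the target m2.price, is dropped).
round(abs(cor(apartments_num[, -1])), 2)
```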
Afterwards, we print the predict_aspects() results with the parameter show_cor = TRUE, to check how the features are grouped into aspects, what the minimal value of pairwise correlation is in each group, and whether any pair of features is correlated negatively (neg) or positively (pos).
```r
apartments_no_target <- apartments_num[, -1] # excluding the target variable

aspects_apartments <- group_variables(apartments_no_target, 0.6)

explain_apartments_lm <- explain(model_apartments,
                                 data = apartments_no_target,
                                 y = apartments_num$m2.price,
                                 predict_function = predict,
                                 label = "Linear Model",
                                 verbose = FALSE)

apartments_ai <- predict_aspects(x = explain_apartments_lm,
                                 new_observation = new_observation_apartments[-1],
                                 variable_groups = aspects_apartments,
                                 N = 1000)

print(apartments_ai, show_features = TRUE, show_cor = TRUE)
```
Triplot is one more tool that allows us to better understand the inner workings of a black box model. It illustrates, in one place:

- the importance of every single feature,
- hierarchical aspects importance,
- the order of grouping features into aspects, as in group_variables().
Hierarchical aspects importance allows us to check the values of aspect importance for different levels of variable grouping. The method starts by looking at the aspect importance where every aspect contains a single variable. Afterwards, it iteratively creates bigger aspects by merging those with the highest level of absolute correlation into one aspect and calculating its contribution to the prediction.
It should be noted that, similarly to group_variables(), calculate_triplot() works only for datasets with exclusively numerical variables.
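The merging order that drives this hierarchy can be previewed with ordinary hierarchical clustering on a correlation-based distance. This is a minimal sketch of the idea, not triplot's internal code; it reuses apartments_num from the earlier chunk and the same "complete" linkage that we pass as clust_method below.

```r
# Sketch: features that are strongly correlated (small 1 - |cor| distance)
# get merged into common aspects first.
features <- apartments_num[, -1] # numeric features, target dropped
corr_dist <- as.dist(1 - abs(cor(features)))
hc <- hclust(corr_dist, method = "complete")
plot(hc, main = "Order of merging features into aspects")
```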
Looking at the triplot below, we can trace, for the given observation, how the contributions of single features change as they are merged into larger and larger aspects.
```r
set.seed(123)

apartments_tri <- predict_triplot(explain_apartments_lm,
                                  new_observation = new_observation_apartments[-1],
                                  N = 1000,
                                  clust_method = "complete")

plot(apartments_tri,
     absolute_value = FALSE,
     cumulative_max = FALSE,
     add_importance_labels = FALSE,
     abbrev_labels = 15,
     add_last_group = TRUE,
     margin_mid = 0)
```
```r
sessionInfo()
```