General introduction: Survival on the RMS Titanic

knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

Data for Titanic survival

Let's see an example for DALEX package for classification models for the survival problem for Titanic dataset. Here we are using a dataset titanic_imputed avaliable in the DALEX package. Note that this data was copied from the stablelearner package and changed for practicality.

library("DALEX")
head(titanic_imputed)

Model for Titanic survival

Ok, now it's time to create a model. Let's use the Random Forest model.

# prepare model
library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked +
                           fare + sibsp + parch,
                           data = titanic_imputed, probability = TRUE)
model_titanic_rf

Explainer for Titanic survival

The third step (it's optional but useful) is to create a DALEX explainer for random forest model.

library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[,-8],
                              y = titanic_imputed[,8],
                              label = "Random Forest")

Model Level Feature Importance

Use the feature_importance() explainer to present importance of particular features. Note that type = "difference" normalizes dropouts, and now they all start in 0.

library("ingredients")

fi_rf <- feature_importance(explain_titanic_rf)
head(fi_rf)
plot(fi_rf)

Feature effects

As we see the most important feature is gender. Next three importnat features are class, age and fare. Let's see the link between model response and these features.

Such univariate relation can be calculated with partial_dependence().

age

Kids 5 years old and younger have much higher survival probability.

Partial Dependence Profiles

pp_age  <- partial_dependence(explain_titanic_rf, variables =  c("age", "fare"))
head(pp_age)
plot(pp_age)

Conditional Dependence Profiles

cp_age  <- conditional_dependence(explain_titanic_rf, variables =  c("age", "fare"))
plot(cp_age)

Accumulated Local Effect Profiles

ap_age  <- accumulated_dependence(explain_titanic_rf, variables =  c("age", "fare"))
plot(ap_age)

Instance level explanations

Let's see break down explanation for model predictions for 8 years old male from 1st class that embarked from port C.

First Ceteris Paribus Profiles for numerical variables

new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)

sp_rf <- ceteris_paribus(explain_titanic_rf, new_passanger)
plot(sp_rf) +
  show_observations(sp_rf)

And for selected categorical variables. Note, that sibsp is numerical but here is presented as a categorical variable.

plot(sp_rf,
     variables = c("class", "embarked", "gender", "sibsp"),
     variable_type = "categorical")

It looks like the most important feature for this passenger is age and sex. After all his odds for survival are higher than for the average passenger. Mainly because of the young age and despite of being a male.

Profile clustering

passangers <- select_sample(titanic, n = 100)

sp_rf <- ceteris_paribus(explain_titanic_rf, passangers)
clust_rf <- cluster_profiles(sp_rf, k = 3)
head(clust_rf)
plot(sp_rf, alpha = 0.1) +
  show_aggregated_profiles(clust_rf, color = "_label_", size = 2)

Session info

sessionInfo()


Try the ingredients package in your browser

Any scripts or data that you put into this service are public.

ingredients documentation built on April 10, 2021, 5:06 p.m.