discretizeDecisions: Discretize numerical variables in decision ensemble
In aruaud/endoR: Interpret and Visualize Stable Tree Ensemble Models

discretizeDecisions

R Documentation

Discretize numerical variables in decision ensemble

Description

This function replaces in a decision ensemble the boundaries of numerical features by their corresponding levels when the variable is discretized. If discretized data are not passed, data are first discretized into Kmax categories based on their quantiles (see discretizeData). The error, prediction, importance and multiplicity of decisions are updated after discretization.

Usage

discretizeDecisions(
  rules,
  data = NULL,
  target,
  mode = "data",
  K = 2,
  splitV = NULL,
  classPos = NULL,
  in_parallel = FALSE,
  n_cores = detectCores() - 1,
  cluster = NULL
)

Arguments

`rules`	a data frame with a column "condition".
`data`	data to discretize.
`target`	response variable.
`mode`	whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').
`K`	numeric, number of categories to create from numeric variables (default: K = 2).
`splitV`	instead of running internally discretizeData, one can provide a list with, for each variable to discretize in rules, the thresholds delimiting each new category.
`classPos`	for classification, the positive class.
`in_parallel`	if TRUE, the function is run in parallel.
`n_cores`	if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.
`cluster`	the cluster to use to run the function in parallel.

Value

Decision ensemble with discretized variables in the condition. Decisions with the same condition are aggregated: their importances are summed, and all other metrics are averaged.

Examples

library(randomForest)
library(caret)
library(data.table)

# import data and fit model
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# Let's get the decision ensemble. One could use the wrapping function
# model2DE() but, we will run each function separately.

# Get the raw decision ensemble
de <- preCluster(model = mod, model_type = "rf", data = iris[, -5]
        , target = iris$Species, classPos = "setosa"
        , times = 1 # number of bootstraps, here just one
        , discretize = FALSE) # we will discretize outside for the example
summary(de)
# exec = the decision ensemble
# partitions = list of sample indexes for boostrapping
# if we had done discretization, the new data would be in data_ctg
de <- de$exec

# Discretize variables in 3 categories - optional
de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species
        , K = 3, classPos = "setosa", mode = "data")
data_ctg <- de$data_ctg
de <- de$rules_ctg

# Homogenize the decision ensemble
de <- de[, condition := sapply(condition, function(x) {
  paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ")
})]
de <- unique(
          as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition]
          )

# Calculate decision metrics, we don't need the importances yet since we will
# do pruning. Otherwise, set importances = TRUE and skip the next 2 steps.
de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species
            , classPos = "setosa", importances = FALSE)
de <- de[de_met, on = "condition"]

# Pruning - optional
de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species
        , classPos = "setosa")

# Decision importances
de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species
        , classPos = "setosa")

# Filter out decisions with the lowest importance: min_imp = the minimal
# importance in the decision ensemble compared to the maximal one.
# E.g., if min_imp = 0.5, then at least all decisions with an
# importance > 0.5*max(importance) will be kept.
# This ensures that we don't throw out too much.
# Since the decision ensemble is quite small, we don't need to filter much...
de <- filterDecisionsImportances(rules = de, min_imp = 0.1)

# Get the network
de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species
            , classPos = "setosa")

# Plot the feature importance/influence and the network
plotFeatures(de_net, levels_order = c("Low", "Medium", "High"))
plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr")

aruaud/endoR documentation built on Jan. 25, 2025, 2:20 a.m.