model2DE_resampling: Run model2DE on several bootstrap resamples.
In leylabmpi/endoR: Interpret and Visualize Stable Tree Ensemble Models

model2DE_resampling

R Documentation

Run model2DE on several bootstrap resamples.

Description

Wrapper around the model2DE function to run it on several bootstrap resamples.

Usage

model2DE_resampling(
  model,
  model_type,
  data,
  target,
  classPos = NULL,
  times = 10,
  p = 0.5,
  sample_weight = NULL,
  ntree = "all",
  maxdepth = Inf,
  dummy_var = NULL,
  prune = TRUE,
  maxDecay = 0.05,
  typeDecay = 2,
  discretize = TRUE,
  K = 2,
  mode = "data",
  filter = TRUE,
  min_imp = 0.9,
  seed = 0,
  in_parallel = FALSE,
  n_cores = detectCores() - 1,
  cluster = NULL
)

Arguments

`model`	model to extract rules from.
`model_type`	character string: 'RF', 'random forest', 'rf', 'xgboost', 'XGBOOST', 'xgb', 'XGB', 'ranger', 'Ranger', 'gbm' or 'GBM'.
`data`	data with the same columns than data used to fit the model.
`target`	response variable.
`classPos`	the positive class predicted by decisions
`times`	number of bootstraps
`p`	fraction of data to resample.
`sample_weight`	numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classpos argument).
`ntree`	number of trees to use from the model (default = all)
`maxdepth`	maximal node depth to use for extracting rules (by default, full branches are used).
`dummy_var`	if multiclass variables were transformed into dummy variables before fitting the model, one can pass their names in a vector here to avoid multiple levels to be used in a same rule (recommended).
`prune`	should unimportant features be removed from decisions (= pruning)? Features are removed by calculating the difference in prediction error of the decision with versus without the feature. If the difference is small (< maxDecay), then the feature is removed. The difference can be absolute (typeDecay = 1) or relative (typeDecay = 2, default). See pruneDecisions() for details.
`maxDecay`	when pruning, threshold for the increase in error; if maxDecay = -Inf, no pruning is done; if maxDecay = 0, only variables increasing the error are pruned from decisions.
`typeDecay`	if typeDecay = 1, the absolute increase in error is computed, and the relative one is computed if typeDecay = 2 (default).
`discretize`	should numeric variables be transformed to categorical variables? If TRUE, K categories are created for each variable based on their distribution in data (mode = 'data') or based on the thresholds used in the decision ensemble (mode = 'model')
`K`	numeric, number of categories to create from numeric variables (default: K = 2).
`mode`	whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').
`filter`	should decisions with low importance be removed from the decision ensemble? If TRUE, then decisions are filtered in a heuristic manner according to their importance and multiplicity (see filterDecisionsImportances() ).
`min_imp`	minimal relative importance of the decisions that must be kept, the threshold to remove decisions is thus going to take lower values than max(imp)*min_imp.
`seed`	which seed to use to make the random bootstraps - it is fixed for reproducibility
`in_parallel`	if TRUE, the function is run in parallel
`n_cores`	if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1
`cluster`	the cluster to use to run the function in parallel

Value

A list with the row numbers of partitioned data, the rules originally extracted from the model, a list with results from each bootstrap (use stabilitySelection to obtain the stable decison ensemble).

Examples

library(randomForest)
library(caret)

# import data and fit model
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# Get decision ensemble with bootstrapping.

# Run 1 bootstrap after the other (times = 2 bootstraps)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Same but use different sample weights for bootstrapping
n_setosa <- sum(iris$Species == "setosa")
n_samp <- length(iris$Species)
samp_weight <- round(
    ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp)
    , digits = 2)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, sample_weight = samp_weight
    , in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Run the bootstraps in parallel
# First do all steps before bootstrapping
preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5]
    , target = iris$Species, classPos = "setosa", times = 2
    , discretize = TRUE, in_parallel = FALSE)

# Remove the special characters from column names
colnames(preclu$data) <- compatibleNames(colnames(preclu$data))

# Parameters for clustermq: can also run on HPC environment
library(clustermq)
options(clustermq.scheduler = "multiprocess")
# ... and run in parallel on each bootstrap
# (preclu$partitions = list of sample indexes for each bootstraps)
endo_setosa <- Q(model2DE_cluster
    , partition = preclu$partitions
    , export = list(data = preclu$data
                , target = iris$Species
                , exec = preclu$exec
                , classPos = "setosa"
                , prune = TRUE, filter = FALSE
                , maxDecay = 0.05 # values needed for maxDecay and typeDecay
                , typeDecay = 2 # here default ones, see pruneDecisions()
                , in_parallel = FALSE # can parallelize within each boostrap!
           )
    , n_jobs = 2 # max number of bootstraps that can be ran in parallel
    , pkgs = c("data.table", "parallel", "caret", "stringr", "scales"
                , "dplyr", "inTrees", "endoR")
    , log_worker = FALSE # to keep a log of the runs, e.g. if it fails..
)


# Stability selection
# First we can look at the effect of the alpha parameter on selection;
# alpha = expected number of false decisions
alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10)
                        , data = preclu$data)
alphas$summary_table

# perform stability selection with alpha = 1
de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7)

# Plot the decision ensemble:
# Plants from the setosa species have small petal and narrow long sepals.
plotFeatures(de_final, levels_order = c("Low", "Medium", "High"))

# there is no interaction between variables (all decisions with len = 1,
# the number of variables in the rules)
de_final$rules_summary
# hence the network would be empty and couldn't be plotted...
# plotNetwork(de_final, hide_isolated_nodes = FALSE)

leylabmpi/endoR documentation built on Oct. 20, 2023, 10:53 p.m.