model2DE_resampling: Run model2DE on several bootstrap resamples.

View source: R/model2DE_resampling.R

model2DE_resampling    R Documentation

Description

Wrapper around the model2DE function to run it on several bootstrap resamples.

Usage

model2DE_resampling(
  model,
  model_type,
  data,
  target,
  classPos = NULL,
  times = 10,
  p = 0.5,
  sample_weight = NULL,
  ntree = "all",
  maxdepth = Inf,
  dummy_var = NULL,
  prune = TRUE,
  maxDecay = 0.05,
  typeDecay = 2,
  discretize = TRUE,
  K = 2,
  mode = "data",
  filter = TRUE,
  min_imp = 0.9,
  seed = 0,
  in_parallel = FALSE,
  n_cores = detectCores() - 1,
  cluster = NULL
)

Arguments

model

model to extract rules from.

model_type

character string: 'RF', 'random forest', 'rf', 'xgboost', 'XGBOOST', 'xgb', 'XGB', 'ranger', 'Ranger', 'gbm' or 'GBM'.

data

data with the same columns as the data used to fit the model.

target

response variable.

classPos

the positive class predicted by decisions.

times

number of bootstraps

p

fraction of data to resample.

sample_weight

numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classPos argument).

ntree

number of trees to use from the model (default = all)

maxdepth

maximal node depth to use for extracting rules (by default, full branches are used).

dummy_var

if multiclass variables were transformed into dummy variables before fitting the model, one can pass their names in a vector here to prevent multiple levels from being used in the same rule (recommended).

prune

should unimportant features be removed from decisions (= pruning)? Features are removed by calculating the difference in prediction error of the decision with versus without the feature. If the difference is small (< maxDecay), then the feature is removed. The difference can be absolute (typeDecay = 1) or relative (typeDecay = 2, default). See pruneDecisions() for details.

maxDecay

when pruning, threshold for the increase in error; if maxDecay = -Inf, no pruning is done; if maxDecay = 0, only variables whose presence increases the error are pruned from decisions.

typeDecay

if typeDecay = 1, the absolute increase in error is computed, and the relative one is computed if typeDecay = 2 (default).

discretize

should numeric variables be transformed to categorical variables? If TRUE, K categories are created for each variable based on their distribution in data (mode = 'data') or based on the thresholds used in the decision ensemble (mode = 'model')

K

numeric, number of categories to create from numeric variables (default: K = 2).

mode

whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').

filter

should decisions with low importance be removed from the decision ensemble? If TRUE, then decisions are filtered in a heuristic manner according to their importance and multiplicity (see filterDecisionsImportances() ).

min_imp

minimal relative importance of the decisions that must be kept; the threshold for removing decisions will thus take values lower than max(imp)*min_imp.

seed

seed used to generate the random bootstraps; it is fixed for reproducibility.

in_parallel

if TRUE, the function is run in parallel

n_cores

if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1

cluster

the cluster to use to run the function in parallel
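
To make the roles of times, p, sample_weight and seed more concrete, the sketch below draws resamples with base R only. It is an illustration under stated assumptions, not endoR's internal code: make_partitions() is a hypothetical helper, and drawing round(p * n) rows without replacement is an assumption about what "fraction of data to resample" means.

```r
# Illustrative sketch only: base-R resampling, not endoR's internal code.
# make_partitions() is a hypothetical helper, not an endoR function.
make_partitions <- function(n, times = 10, p = 0.5,
                            sample_weight = NULL, seed = 0) {
  set.seed(seed)        # fixed seed: the same partitions are drawn on every call
  size <- round(p * n)  # number of rows drawn per resample (assumption)
  lapply(seq_len(times), function(i) {
    # draw row indexes without replacement (assumption);
    # prob = NULL falls back to uniform sampling
    sort(sample.int(n, size = size, replace = FALSE, prob = sample_weight))
  })
}

data(iris)
# give setosa samples a different weight than the other species
w <- ifelse(iris$Species == "setosa", 2/3, 1/3)
parts <- make_partitions(n = nrow(iris), times = 2, p = 0.5, sample_weight = w)
lengths(parts)  # number of rows in each resample
```

Each element of parts is a vector of row indexes, comparable in spirit to the "row numbers of partitioned data" that the function returns.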

Value

A list with the row numbers of partitioned data, the rules originally extracted from the model, and a list with results from each bootstrap (use stabilitySelection to obtain the stable decision ensemble).

Examples

library(randomForest)
library(caret)

# import data and fit model
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# Get decision ensemble with bootstrapping.

# Run 1 bootstrap after the other (times = 2 bootstraps)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Same but use different sample weights for bootstrapping
n_setosa <- sum(iris$Species == "setosa")
n_samp <- length(iris$Species)
samp_weight <- round(
    ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp)
    , digits = 2)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, sample_weight = samp_weight
    , in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Run the bootstraps in parallel
# First do all steps before bootstrapping
preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5]
    , target = iris$Species, classPos = "setosa", times = 2
    , discretize = TRUE, in_parallel = FALSE)

# Remove the special characters from column names
colnames(preclu$data) <- compatibleNames(colnames(preclu$data))

# Parameters for clustermq: can also run on HPC environment
library(clustermq)
options(clustermq.scheduler = "multiprocess")
# ... and run in parallel on each bootstrap
# (preclu$partitions = list of sample indexes for each bootstrap)
endo_setosa <- Q(model2DE_cluster
    , partition = preclu$partitions
    , export = list(data = preclu$data
                , target = iris$Species
                , exec = preclu$exec
                , classPos = "setosa"
                , prune = TRUE, filter = FALSE
                , maxDecay = 0.05 # values needed for maxDecay and typeDecay
                , typeDecay = 2 # here default ones, see pruneDecisions()
                , in_parallel = FALSE # can parallelize within each bootstrap!
           )
    , n_jobs = 2 # max number of bootstraps that can be run in parallel
    , pkgs = c("data.table", "parallel", "caret", "stringr", "scales"
                , "dplyr", "inTrees", "endoR")
    , log_worker = FALSE # set to TRUE to keep a log of the runs, e.g. if it fails
)


# Stability selection
# First we can look at the effect of the alpha parameter on selection;
# alpha = expected number of false decisions
alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10)
                        , data = preclu$data)
alphas$summary_table

# perform stability selection with alpha = 7
de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7)

# Plot the decision ensemble:
# Plants from the setosa species have small petals and narrow, long sepals.
plotFeatures(de_final, levels_order = c("Low", "Medium", "High"))

# there is no interaction between variables (all decisions with len = 1,
# the number of variables in the rules)
de_final$rules_summary
# hence the network would be empty and couldn't be plotted...
# plotNetwork(de_final, hide_isolated_nodes = FALSE)
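
The level names passed to plotFeatures above come from discretizing numeric variables (discretize = TRUE). As a rough illustration of discretization based on the data distribution (mode = "data"), the sketch below cuts a numeric vector into K quantile-based categories. It is not endoR's implementation: discretize_by_quantiles() is a hypothetical helper, and both the use of equal-frequency quantiles and the label names for K = 2 are assumptions.

```r
# Illustrative sketch only, not endoR's implementation.
# discretize_by_quantiles() and its labels are hypothetical.
discretize_by_quantiles <- function(x, K = 2) {
  # K + 1 quantile breakpoints define K intervals with roughly equal counts
  breaks <- quantile(x, probs = seq(0, 1, length.out = K + 1), na.rm = TRUE)
  if (anyDuplicated(breaks) > 0) {
    stop("tied quantiles: too many categories for this variable")
  }
  # readable labels for the common cases, generic ones otherwise
  labels <- switch(as.character(K),
                   "2" = c("Low", "High"),
                   "3" = c("Low", "Medium", "High"),
                   paste0("Q", seq_len(K)))
  # include.lowest = TRUE keeps the minimum value in the first category
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

data(iris)
petal_cat <- discretize_by_quantiles(iris$Petal.Length, K = 3)
table(petal_cat, iris$Species)  # how the categories line up with species
```

Larger K gives finer categories but fewer samples per category, which is one reason a small default such as K = 2 can be a reasonable starting point.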

leylabmpi/endoR documentation built on Oct. 20, 2023, 10:53 p.m.