evaluateAlpha: Calculate the number fo decisions and of predicted samples...
In aruaud/endoR: Interpret and Visualize Stable Tree Ensemble Models

evaluateAlpha

R Documentation

Calculate the number fo decisions and of predicted samples from decision ensemble obtained with different alpha values

Description

The aim is to help picking an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha value. The number of decisions and of samples that follow decisions are also calculated. The procedure is adapted from Meinshausenand and Buehlmann (2010): the best decisions from each bootstrap are pre-seleected and the the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble.

Usage

evaluateAlpha(
  rules,
  alphas = c(5, 10, 15, 20, 30, 50, 75),
  pi_thr = 0.7,
  data = NULL,
  decision_ensembles = TRUE,
  aggregate_taxa = FALSE,
  taxa = NULL
)

Arguments

`rules`	list of bootstrap results
`alphas`	expected number of false positive decision selected (default = 1).
`pi_thr`	fraction of bootstraps in which a decision should have been selected in to be included in the stable decision ensemble (default = 0.7).
`data`	data for which to calculate how many samples follow each decision. Columns should be the same as for fitting the decision ensemble but samples can be any.
`decision_ensembles`	should the decision ensemble be returned?
`aggregate_taxa`	should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species)
`taxa`	if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s)

Value

A list with all decisions from all bootstrasps, the summary of decisions across bootstraps, the feature and interaction importance and influence in the nodes and edges dataframes, as well as the the decision-wise feature and interaction importances and influences the nodes_agg and edges_agg dataframes.

Examples

library(randomForest)
library(caret)

# import data and fit model
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# Get decision ensemble with bootstrapping.

# Run 1 bootstrap after the other (times = 2 bootstraps)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Same but use different sample weights for bootstrapping
n_setosa <- sum(iris$Species == "setosa")
n_samp <- length(iris$Species)
samp_weight <- round(
    ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp)
    , digits = 2)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, sample_weight = samp_weight
    , in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Run the bootstraps in parallel
# First do all steps before bootstrapping
preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5]
    , target = iris$Species, classPos = "setosa", times = 2
    , discretize = TRUE, in_parallel = FALSE)

# Remove the special characters from column names
colnames(preclu$data) <- compatibleNames(colnames(preclu$data))

# Parameters for clustermq: can also run on HPC environment
library(clustermq)
options(clustermq.scheduler = "multiprocess")
# ... and run in parallel on each bootstrap
# (preclu$partitions = list of sample indexes for each bootstraps)
endo_setosa <- Q(model2DE_cluster
    , partition = preclu$partitions
    , export = list(data = preclu$data
                , target = iris$Species
                , exec = preclu$exec
                , classPos = "setosa"
                , prune = TRUE, filter = FALSE
                , maxDecay = 0.05 # values needed for maxDecay and typeDecay
                , typeDecay = 2 # here default ones, see pruneDecisions()
                , in_parallel = FALSE # can parallelize within each boostrap!
           )
    , n_jobs = 2 # max number of bootstraps that can be ran in parallel
    , pkgs = c("data.table", "parallel", "caret", "stringr", "scales"
                , "dplyr", "inTrees", "endoR")
    , log_worker = FALSE # to keep a log of the runs, e.g. if it fails..
)


# Stability selection
# First we can look at the effect of the alpha parameter on selection;
# alpha = expected number of false decisions
alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10)
                        , data = preclu$data)
alphas$summary_table

# perform stability selection with alpha = 1
de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7)

# Plot the decision ensemble:
# Plants from the setosa species have small petal and narrow long sepals.
plotFeatures(de_final, levels_order = c("Low", "Medium", "High"))

# there is no interaction between variables (all decisions with len = 1,
# the number of variables in the rules)
de_final$rules_summary
# hence the network would be empty and couldn't be plotted...
# plotNetwork(de_final, hide_isolated_nodes = FALSE)

aruaud/endoR documentation built on Jan. 25, 2025, 2:20 a.m.