stabilitySelection: Obtain a stable decision ensemble

View source: R/stabilitySelection.R

stabilitySelection    R Documentation

Description

Performs stability selection on the bootstrap results obtained with the model2DE_cluster or model2DE_resampling functions. The procedure is adapted from Meinshausen and Bühlmann (2010): the best decisions from each bootstrap are pre-selected, and those pre-selected in at least a given fraction of bootstraps (pi_thr) are included in the stable decision ensemble. The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before the feature and interaction importances and influences are computed from the stable decision ensemble.
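
As a rough illustration of the procedure described above, the following base R sketch runs the pre-selection and frequency-threshold steps on toy data. It is not the endoR implementation: the toy `boot` list, the column names `decision` and `imp`, and the use of the Meinshausen and Bühlmann bound, E(V) <= q^2 / ((2 * pi_thr - 1) * p), to derive the number q of decisions pre-selected per bootstrap are all assumptions made for this example.

```r
# Toy input (hypothetical): one data.frame per bootstrap, holding a decision
# identifier and the importance of that decision in the bootstrap.
set.seed(1)
decisions <- paste0("d", 1:20)
boot <- lapply(1:10, function(b) {
    data.frame(decision = decisions,
               imp = c(runif(2, 0.8, 1), runif(18, 0, 0.5)))
})

alpha_error <- 1   # tolerated expected number of false positive decisions
pi_thr <- 0.7      # minimal fraction of bootstraps selecting a decision
p <- length(decisions)

# Assumption: q is obtained by solving the Meinshausen and Buehlmann (2010)
# bound, E(V) <= q^2 / ((2 * pi_thr - 1) * p), for q with E(V) = alpha_error.
q <- floor(sqrt(alpha_error * (2 * pi_thr - 1) * p))

# Pre-select the q most important decisions in each bootstrap
preselected <- lapply(boot, function(d) {
    d$decision[order(d$imp, decreasing = TRUE)][seq_len(q)]
})

# Fraction of bootstraps in which each decision was pre-selected
freq <- table(factor(unlist(preselected), levels = decisions)) / length(boot)

# Stable decisions: pre-selected in at least pi_thr of the bootstraps
stable <- names(freq)[freq >= pi_thr]

# Average the importance of the stable decisions across bootstraps
all_boot <- do.call(rbind, boot)
mean_imp <- tapply(all_boot$imp, all_boot$decision, mean)[stable]
```

In this toy setting only d1 and d2 are consistently important, so they are the only decisions that end up in `stable`. Raising `alpha_error` increases q and lets more decisions through, at the cost of more expected false positives.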

Usage

stabilitySelection(
  rules,
  alpha_error = 1,
  pi_thr = 0.7,
  aggregate_taxa = FALSE,
  taxa = NULL
)

Arguments

rules

list of bootstrap results, as obtained with model2DE_resampling or model2DE_cluster

alpha_error

expected number of false positive decisions selected (default = 1).

pi_thr

minimum fraction of bootstraps in which a decision must have been pre-selected to be included in the stable decision ensemble (default = 0.7).

aggregate_taxa

should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species)? (default = FALSE)

taxa

if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s)
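
For illustration, a `taxa` table of the shape described above could be built as follows. The taxon names and their formatting are placeholders chosen for this example, and reading the columns f, g, and s as family, genus, and species is inferred from the column names rather than stated on this page.

```r
# Hypothetical taxonomy table: one row per taxon in the dataset, one column
# per taxonomic rank (assumed: f = family, g = genus, s = species).
taxa <- data.frame(
    f = c("Lachnospiraceae", "Lachnospiraceae", "Bacteroidaceae"),
    g = c("Blautia", "Blautia", "Bacteroides"),
    s = c("Blautia_obeum", "Blautia_wexlerae", "Bacteroides_fragilis"),
    stringsAsFactors = FALSE
)

# Number of species per genus: a genus represented by a single species
# (here Bacteroides) is the case mentioned for species-level aggregation.
species_per_genus <- table(taxa$g)
species_per_genus
```

Such a table would then be passed as `taxa` together with `aggregate_taxa = TRUE`.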

Value

A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importances and influences in the nodes and edges data frames, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg data frames.

Examples

library(randomForest)
library(caret)

# import data and fit model
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# Get decision ensemble with bootstrapping.

# Run one bootstrap after the other (times = 2 bootstraps);
# in_parallel = TRUE parallelizes the computations within each bootstrap
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Same but use different sample weights for bootstrapping
n_setosa <- sum(iris$Species == "setosa")
n_samp <- length(iris$Species)
samp_weight <- round(
    ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp)
    , digits = 2)
endo_setosa <- model2DE_resampling(model = mod, model_type = "rf"
    , data = iris[, -5], target = iris$Species, classPos = "setosa"
    , times = 2, sample_weight = samp_weight
    , in_parallel = TRUE, n_cores = 2, filter = FALSE)

# Run the bootstraps in parallel
# First do all steps before bootstrapping
preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5]
    , target = iris$Species, classPos = "setosa", times = 2
    , discretize = TRUE, in_parallel = FALSE)

# Remove the special characters from column names
colnames(preclu$data) <- compatibleNames(colnames(preclu$data))

# Parameters for clustermq: can also run on HPC environment
library(clustermq)
options(clustermq.scheduler = "multiprocess")
# ... and run in parallel on each bootstrap
# (preclu$partitions = list of sample indexes for each bootstrap)
endo_setosa <- Q(model2DE_cluster
    , partition = preclu$partitions
    , export = list(data = preclu$data
                , target = iris$Species
                , exec = preclu$exec
                , classPos = "setosa"
                , prune = TRUE, filter = FALSE
                , maxDecay = 0.05 # values needed for maxDecay and typeDecay
                , typeDecay = 2 # here default ones, see pruneDecisions()
                , in_parallel = FALSE # can parallelize within each bootstrap!
           )
    , n_jobs = 2 # max number of bootstraps that can be run in parallel
    , pkgs = c("data.table", "parallel", "caret", "stringr", "scales"
                , "dplyr", "inTrees", "endoR")
    , log_worker = FALSE # set to TRUE to keep a log of the runs, e.g. to debug failures
)


# Stability selection
# First we can look at the effect of the alpha parameter on selection;
# alpha = expected number of false decisions
alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10)
                        , data = preclu$data)
alphas$summary_table

# perform stability selection with alpha = 7
de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7)

# Plot the decision ensemble:
# Plants from the setosa species have small petals and narrow, long sepals.
plotFeatures(de_final, levels_order = c("Low", "Medium", "High"))

# there is no interaction between variables (all decisions with len = 1,
# the number of variables in the rules)
de_final$rules_summary
# hence the network would be empty and couldn't be plotted...
# plotNetwork(de_final, hide_isolated_nodes = FALSE)
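
The sample weights used in the examples above can be checked by hand. This short sketch relies only on the iris data: it recomputes the weights and shows that they give the "setosa" class and the two other species pooled together roughly equal total weight. That reading of their purpose is an interpretation, not something stated in the package documentation.

```r
data(iris)
n_setosa <- sum(iris$Species == "setosa")   # 50 samples
n_samp <- length(iris$Species)              # 150 samples
samp_weight <- round(
    ifelse(iris$Species == "setosa", 1 - n_setosa / n_samp, n_setosa / n_samp),
    digits = 2
)

# One weight per group: 0.67 for setosa, 0.33 for the two other species
unique(samp_weight)

# Total weight per group: 33.5 for setosa (TRUE) versus 33 for the rest (FALSE)
group_weight <- tapply(samp_weight, iris$Species == "setosa", sum)
group_weight
```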

aruaud/endoR documentation built on Jan. 25, 2025, 2:20 a.m.