sirus.fit: Fit SIRUS.

View source: R/sirus.R


Fit SIRUS.

Description

Fit SIRUS for a given number of rules (10 by default) or a given p0.
SIRUS is a regression and classification algorithm, based on random forests (Breiman, 2001), that takes the form of a short list of rules. SIRUS combines the simplicity of rule-based algorithms and decision trees with an accuracy close to random forests. More importantly, the rule selection is stable with respect to data perturbation. SIRUS for classification is defined in Benard et al. (2021a), and the extension to regression is provided in Benard et al. (2021b).
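
For instance, the number of rules can be set either directly through num.rule or indirectly through the selection threshold p0. A minimal sketch, assuming data and y are prepared as in the Examples section below (the p0 value 0.05 is purely illustrative):

## fit SIRUS with a fixed number of rules
sirus.m <- sirus.fit(data, y, num.rule = 10)
## or fit SIRUS with a selection threshold p0 (0.05 is an arbitrary illustrative value)
sirus.m <- sirus.fit(data, y, p0 = 0.05)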

Usage

sirus.fit(
  data,
  y,
  type = "auto",
  num.rule = 10,
  p0 = NULL,
  num.rule.max = 25,
  q = 10,
  discrete.limit = 10,
  num.trees.step = 1000,
  alpha = 0.05,
  mtry = NULL,
  max.depth = 2,
  num.trees = NULL,
  num.threads = NULL,
  replace = TRUE,
  sample.fraction = ifelse(replace, 1, 0.632),
  verbose = TRUE,
  seed = NULL
)

Arguments

data

Input dataframe, each row is an observation vector. Each column is an input variable and is numeric or factor.

y

Numeric response variable. For classification, y takes only 0 and 1 values.

type

'reg' for regression, 'classif' for classification and 'auto' for automatic detection (classification if y takes only 0 and 1 values).

num.rule

Number of rules in SIRUS model. Default is 10. Ignored if a p0 value is provided. For regression, the effective number of rules can be smaller than num.rule because of null coefficients in the final linear aggregation of the rules.

p0

Selection threshold on the frequency of appearance of a path in the forest to set the number of rules. Default is NULL and num.rule is used to select rules. sirus.cv provides the optimal p0 by cross-validation.

num.rule.max

Maximum number of rules in SIRUS model. Ignored if num.rule is provided.

q

Number of quantiles used for node splitting in the forest construction. Default and recommended value is 10.

discrete.limit

Maximum number of distinct values for a variable to be considered discrete. Variables with more distinct values are treated as continuous.

num.trees.step

Number of trees grown between two evaluations of the stopping criterion. Ignored if num.trees is provided.

alpha

Parameter of the stopping criterion for the number of trees: stability has to reach 1-alpha to stop the growing of the forest. Ignored if num.trees is provided. Default value is 0.05.

mtry

Number of variables to possibly split at each node. Default is the number of variables divided by 3.

max.depth

Maximal tree depth. Default and recommended value is 2.

num.trees

Number of trees grown in the forest. Default is NULL. If NULL (recommended), the number of trees is automatically set using a stability-based stopping criterion.

num.threads

Number of threads used to grow the forest. Default is the number of CPUs available.

replace

Boolean. If true (default), sample with replacement.

sample.fraction

Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement.

verbose

Boolean. If true, information messages are printed.

seed

Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed.

Details

If the output y takes only 0 and 1 values, a classification model is fit; otherwise a regression model is fit. The SIRUS algorithm proceeds in the following steps:

  1. Discretize data

  2. Fit a random forest

  3. Extract rules from tree nodes

  4. Select the most frequent rules (which occur in at least a fraction p0 of the trees)

  5. Filter rules to remove linear dependence between them

  6. Aggregate the selected rules

    • Classification: rules are averaged

    • Regression: rules are linearly combined via a ridge regression constrained to have non-negative coefficients (see the sketch below)
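
To illustrate the regression aggregation, here is a minimal sketch of a non-negative ridge fit of rule indicators using glmnet. This is not the package's internal code: the rule matrix R and the response y.reg below are hypothetical, simulated only to show the general technique.

## minimal sketch of non-negative ridge aggregation (not the package internals)
## R is a hypothetical 0/1 matrix of rule indicators, y.reg a hypothetical response
require(glmnet)
set.seed(42)
n <- 200
R <- matrix(rbinom(n * 5, 1, 0.5), nrow = n)
y.reg <- as.numeric(R %*% c(1, 0.5, 0, 2, 0) + rnorm(n))
agg <- cv.glmnet(R, y.reg, alpha = 0, lower.limits = 0)  ## ridge with coefficients >= 0
coef(agg, s = "lambda.min")                              ## non-negative rule weights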

The hyperparameter p0 can be tuned using sirus.cv to set the optimal number of rules.
The number of trees is automatically set with a stopping criterion based on stability: the growing of the forest is stopped when the number of trees is high enough to ensure that, on average, 95% of the rules are identical over two runs of SIRUS on the provided dataset.
Data is discretized depending on variable types: numerical variables are binned using q-quantiles, categorical variables are transformed into ordered variables as in ranger (the standard method to handle categorical variables in trees), while discrete variables (numerical variables with at most discrete.limit distinct values) are left untouched. Note that categorical variables with a high number of categories should be discarded or transformed, as SIRUS is likely to identify associated irrelevant rules.
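
A hedged sketch of the p0 tuning workflow with sirus.cv: the name of the returned element holding the selected threshold is assumed to be p0.pred here, check ?sirus.cv for the exact output names.

## tune p0 by cross-validation, then refit with the selected threshold
## (the element name p0.pred is an assumption; see ?sirus.cv)
cv.grid <- sirus.cv(data, y)
sirus.m.cv <- sirus.fit(data, y, p0 = cv.grid$p0.pred)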

Value

SIRUS model with elements

rules

List of rules in SIRUS model.

rules.out

List of rule outputs. rule.out: the output mean when the rule is satisfied and when it is not. supp.size: the number of training points inside and outside the rule.

proba

Frequency of occurrence of paths in the forest.

paths

List of selected paths (symbolic representation with quantile order for continuous variables).

rule.weights

Vector of non-negative coefficients assigned to each rule for the linear aggregation (1/number of rules for classification).

rule.glm

Fitted glmnet object for regression (linear rule aggregation with ridge penalty).

type

Type of SIRUS model: 'reg' for regression, 'classif' for classification.

num.trees

Number of trees used to build SIRUS.

data.names

Names of input variables.

mean

Mean output over the full training data. Default model output if no rule is selected.

bins

List of type and possible split values for all input variables.
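
For instance, these elements can be inspected directly on a fitted model (sirus.m as built in the Examples section below):

## inspect a fitted SIRUS model
sirus.m$num.trees   ## number of trees grown before the stability criterion stopped
sirus.m$proba       ## frequency of occurrence of the selected paths in the forest
sirus.m$rules[[1]]  ## first selected rule
sirus.m$mean        ## mean output over the training data, used if no rule is selected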

References

  • Benard, C., Biau, G., Da Veiga, S. & Scornet, E. (2021a). SIRUS: Stable and Interpretable RUle Set for Classification. Electronic Journal of Statistics, 15:427-505. doi: 10.1214/20-EJS1792.

  • Benard, C., Biau, G., Da Veiga, S. & Scornet, E. (2021b). Interpretable Random Forests via Rule Extraction. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:937-945. http://proceedings.mlr.press/v130/benard21a.

  • Breiman, L. (2001). Random forests. Machine learning, 45, 5-32.

  • Wright, M. N. & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 77:1-17. doi: 10.18637/jss.v077.i01.

Examples

## load SIRUS
require(sirus)

## prepare data
data <- iris
y <- rep(0, nrow(data))
y[data$Species == 'setosa'] <- 1
data$Species <- NULL

## fit SIRUS
sirus.m <- sirus.fit(data, y)
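
## The fitted model can then be printed and used for prediction; a short
## extension of the example above (see ?sirus.print and ?sirus.predict),
## plus an illustrative regression fit on a continuous response.

## display the selected rules and predict on the training data
sirus.print(sirus.m)
pred <- sirus.predict(sirus.m, data)

## regression example: a continuous response triggers a regression fit
data.reg <- iris[, c('Sepal.Width', 'Petal.Length', 'Petal.Width')]
y.reg <- iris$Sepal.Length
sirus.reg <- sirus.fit(data.reg, y.reg)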

