cv.prune: Optimal pruning via cross-validation

Description

The logic decision tree of a fitted logicDT model can be optimally (post-)pruned using k-fold cross-validation.

Usage

cv.prune(
  model,
  nfolds = 10,
  scoring_rule = "deviance",
  choose = "1se",
  simplify = TRUE
)

Arguments

model

A fitted logicDT model

nfolds

Number of cross-validation folds

scoring_rule

The scoring rule for evaluating the cross-validation error and its standard error. For classification tasks, "deviance" or "Brier" should be used.

choose

Model selection scheme. Set choose = "min" to select the model that minimizes the cross-validation error. Otherwise, choose = "1se" selects the simplest model whose cross-validation error lies within one standard error of the minimum.

simplify

Should the pruned model be simplified with regard to the input terms, i.e., should terms that are no longer contained in the tree be removed from the model?
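
The following is a minimal usage sketch added for illustration (not taken from the package manual); the simulated data and the assumption that logicDT() takes the binary predictor matrix X and the response y as its first two arguments are illustrative:

## Simulated example data: 5 binary predictors, binary outcome
set.seed(1)
X <- matrix(rbinom(500 * 5, 1, 0.5), ncol = 5)
y <- rbinom(500, 1, 0.3 + 0.4 * (X[, 1] & !X[, 2]))

## Fit a logicDT model (assumed interface) and prune its tree
model <- logicDT(X, y)
pruned <- cv.prune(model, nfolds = 10, scoring_rule = "deviance",
                   choose = "1se")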

Details

Similar to Breiman et al. (1984), we implement post-pruning by first computing the optimal pruning path and then using cross-validation for identifying the best generalizing model.

To handle continuous covariables with regression models fitted in each leaf, we propose, analogous to the likelihood-ratio splitting criterion in logicDT, using the log-likelihood as the impurity criterion for computing the pruning path. In particular, for each node t, the weighted node impurity p(t)i(t) has to be calculated, and the inequality

\Delta i(s, t) := i(t) - p(t_L \mid t) i(t_L) - p(t_R \mid t) i(t_R) \geq 0

has to be fulfilled for each possible split s splitting t into two subnodes t_L and t_R. Here, i(t) describes the impurity of a node t, p(t) the proportion of data points falling into t, and p(t' | t) the proportion of data points falling from t into t'. Since the regression models are fitted using maximum likelihood, the maximum likelihood criterion fulfills this property and can also be seen as an extension of the entropy impurity criterion in the case of classification or an extension of the MSE impurity criterion in the case of regression.
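
As a concrete illustration (a sketch added for clarity, not from the original manual), define i(t) as the negative maximized log-likelihood in node t divided by the node size n(t). For a constant Bernoulli model in a classification leaf with class proportion \hat{p}(t), this yields

i(t) = -\hat{p}(t) \log \hat{p}(t) - (1 - \hat{p}(t)) \log(1 - \hat{p}(t)),

i.e., exactly the entropy criterion. For a constant Gaussian model in a regression leaf,

i(t) = \frac{1}{2} \log\left(2 \pi \hat{\sigma}^2(t)\right) + \frac{1}{2},

which is a strictly increasing function of the node MSE \hat{\sigma}^2(t).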

The default model selection chooses the most parsimonious model that yields a cross-validation error of at most \mathrm{CV}_{\min} + \mathrm{SE}_{\min}, where \mathrm{CV}_{\min} is the minimal cross-validation error and \mathrm{SE}_{\min} its corresponding standard error. For a more robust standard error estimation, the scores are calculated per training observation; hence, the AUC is not an appropriate choice, and the deviance or the Brier score should be used in the case of classification.
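
The "1se" rule itself can be sketched in a few lines of R; note that the column names beta, score, and se of cv.res are assumptions for illustration and may differ from the actual column names:

## Hypothetical sketch of the "1se" selection rule
cv.res <- pruned$cv.res
threshold <- min(cv.res$score) + cv.res$se[which.min(cv.res$score)]
## Among all penalties whose score lies below the threshold, the largest
## penalty corresponds to the most parsimonious model
eligible <- cv.res[cv.res$score <= threshold, ]
beta.1se <- max(eligible$beta)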

Value

A list containing

model

The new logicDT model containing the optimally pruned tree

cv.res

A data frame containing the penalties, the cross-validation scores and the corresponding standard errors

best.beta

The optimal penalty value
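
Continuing the sketch above, the components can be accessed as usual for R lists:

pruned$model      # optimally pruned logicDT model
pruned$cv.res     # penalties, CV scores, and standard errors
pruned$best.beta  # selected penalty value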

References

  • Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi: 10.1201/9781315139470

