tree.control: Control parameters for fitting decision trees


tree.control R Documentation

Control parameters for fitting decision trees

Description

Configure the fitting process of individual decision trees.

Usage

tree.control(
  nodesize = 10,
  split_criterion = "gini",
  alpha = 0.05,
  cp = 0.001,
  smoothing = "none",
  mtry = "none",
  covariable = "final_4pl"
)

Arguments

nodesize

Minimum number of samples in a terminal node. This ensures that each leaf contains enough samples for prediction, which may involve fitting a 4pL model.

split_criterion

Splitting criterion for deciding when and how to split. The default is "gini"/"mse", which uses the Gini criterion for binary risk estimation tasks and the mean squared error as the impurity measure for regression tasks. Alternatively, "4pl" can be used if a quantitative covariable is supplied and covariable is set so that 4pL model fitting is enabled, i.e., covariable = "final_4pl" or covariable = "full_4pl". A computationally faster alternative is "linear", which likewise requires a matching covariable setting, i.e., covariable = "final_linear" or covariable = "full_linear".

alpha

Significance threshold for the likelihood ratio tests when using split_criterion = "4pl". Only splits that achieve a p-value smaller than alpha are eligible.

cp

Complexity parameter. This parameter sets the minimum amount by which the impurity must be reduced for a node to be split further. The total tree impurity is considered here; see Details for the formula. Only used if split_criterion = "gini" or "mse".

smoothing

Shall the leaf predictions for risk estimation be smoothed? "laplace" yields Laplace smoothing. The default is "none" which does not employ smoothing.

mtry

Shall the tree fitting process be randomized as in random forests? Currently, only "sqrt", which uses \sqrt{p} randomly drawn predictors for splitting at each node, and "none" (default), which fits conventional decision trees, are supported.

covariable

How shall optional quantitative covariables be handled? "constant" ignores them. Alternatively, they can be considered as splitting variables ("_split"), used for fitting 4pL models in each leaf ("_4pl"), or used for fitting linear models in each leaf ("_linear"). If either splitting or model fitting is chosen, one should state if this should be handled over the whole search ("full_", computationally expensive) or just the final trees ("final_"). Thus, "final_4pl" would lead to fitting 4pL in each leaf but only for the final fitting of trees.
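As an illustration of three of the arguments above, the following base R sketch shows add-one (Laplace) smoothing of a leaf's risk estimate, drawing \sqrt{p} candidate predictors at a node, and the set of covariable values implied by the description. The helper names, the (k + 1)/(n + 2) form of the smoothing, and rounding \sqrt{p} down are assumptions made for illustration; they are not taken from the logicDT source.

```r
# Hypothetical helpers; none of these are part of logicDT.

# Add-one (Laplace) smoothing of a binary leaf estimate:
# (number of cases + 1) / (number of samples + 2).
laplace_risk <- function(y) (sum(y) + 1) / (length(y) + 2)

# Unsmoothed leaf estimate for comparison.
raw_risk <- function(y) mean(y)

# Draw floor(sqrt(p)) predictor indices out of p (the rounding is an assumption).
sample_predictors <- function(p) sort(sample.int(p, floor(sqrt(p))))

# covariable values implied by the description: "constant" plus every
# combination of scope ("full", "final") and mode ("split", "4pl", "linear").
grid <- expand.grid(scope = c("full", "final"),
                    mode = c("split", "4pl", "linear"),
                    stringsAsFactors = FALSE)
covariable_values <- c("constant", paste(grid$scope, grid$mode, sep = "_"))

leaf <- c(1, 1, 1, 1)   # a pure leaf with four cases
raw_risk(leaf)          # 1
laplace_risk(leaf)      # 5/6, pulled away from the boundary
set.seed(1)
sample_predictors(16)   # four distinct indices between 1 and 16
covariable_values
```

Smoothing matters most in small, pure leaves, where the raw estimate would be exactly 0 or 1.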

Details

For the Gini or MSE splitting criterion, if any considered split s leads to

P(t) \cdot \Delta I(s,t) > \texttt{cp}

for a node t, where P(t) is the empirical node probability and \Delta I(s,t) is the impurity reduction, then the node is split further. Otherwise, the node is declared a leaf. For continuous outcomes, cp is scaled by the empirical variance of y to put it on the right scale, i.e., cp <- cp * var(y). Since the impurity measure for continuous outcomes is the mean squared error, cp can then be interpreted as the minimum required reduction of the normalized mean squared error (the squared NRMSE).
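The rule can be sketched in base R as follows. This is a minimal illustration and not the package's implementation: the function names are hypothetical, the Gini impurity is taken as 2p(1 - p) for a binary outcome, the children are weighted by their share of the node's samples, and empty child nodes are not handled.

```r
# Gini impurity of a binary node, here taken as 2 * p * (1 - p) (an assumption).
gini <- function(y) {
  p <- mean(y)
  2 * p * (1 - p)
}

# Impurity reduction when the node outcomes y are split by the logical vector `left`.
impurity_reduction <- function(y, left, impurity = gini) {
  w_left <- mean(left)  # share of the node's samples sent to the left child
  impurity(y) - w_left * impurity(y[left]) - (1 - w_left) * impurity(y[!left])
}

# Split rule from the Details: P(t) * Delta I(s, t) > cp.
split_allowed <- function(y, left, n_total, cp = 0.001, impurity = gini) {
  p_t <- length(y) / n_total  # empirical node probability
  p_t * impurity_reduction(y, left, impurity) > cp
}

y_node <- c(0, 0, 0, 1, 1, 1, 1, 0)
left <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
split_allowed(y_node, left, n_total = 40)  # this split separates the classes: TRUE

# For continuous outcomes, the impurity is the MSE and cp is rescaled by var(y),
# where y_cont stands in for the full outcome vector.
mse <- function(y) mean((y - mean(y))^2)
y_cont <- c(1.2, 0.8, 1.1, 3.9, 4.2, 4.0, 3.8, 1.0)
cp_scaled <- 0.001 * var(y_cont)
split_allowed(y_cont, left, n_total = 8, cp = cp_scaled, impurity = mse)
```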

If the 4pL or linear splitting criterion is chosen, likelihood ratio tests are employed that test the alternative that individual models fitted in the two child nodes fit better than a single model. The corresponding test statistic asymptotically follows a \chi^2 distribution whose degrees of freedom are given by the difference in the number of model parameters, i.e., 2 \cdot 4 - 4 = 4 degrees of freedom for 4pL models and 2 \cdot 2 - 2 = 2 degrees of freedom for linear models.
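To make the role of alpha concrete, the base R sketch below computes the \chi^2 critical values for both model types and the p-value of a likelihood ratio statistic. The function lr_p_value and the log-likelihood values are hypothetical; the statistic is assumed to have the usual form 2 * (log-likelihood of the two child models combined - log-likelihood of the single parent model).

```r
alpha <- 0.05

# Degrees of freedom: two child models versus one parent model.
df_4pl <- 2 * 4 - 4      # 4
df_linear <- 2 * 2 - 2   # 2

# Critical values that a test statistic has to exceed at level alpha.
qchisq(1 - alpha, df = df_4pl)      # about 9.49
qchisq(1 - alpha, df = df_linear)   # about 5.99

# p-value of a likelihood ratio statistic (hypothetical helper).
lr_p_value <- function(loglik_parent, loglik_children, df) {
  stat <- 2 * (loglik_children - loglik_parent)
  pchisq(stat, df = df, lower.tail = FALSE)
}

# Made-up log-likelihoods, for illustration only.
p <- lr_p_value(loglik_parent = -120.4, loglik_children = -113.1, df = df_4pl)
p < alpha   # the split is eligible only if this is TRUE
```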

For binary outcomes, choosing to fit linear models for evaluating the splits or for modeling the leaves actually leads to fitting LDA (linear discriminant analysis) models.

Value

An object of class tree.control, which is a list of all necessary tree parameters.
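A call might look like the following sketch. The argument names come from the Usage section; the chosen values are illustrative only, and the call is guarded so that the code also runs when logicDT is not installed.

```r
# Arguments as listed under Usage; the values are illustrative.
ctrl_args <- list(
  nodesize = 20,
  split_criterion = "4pl",
  alpha = 0.01,
  covariable = "final_4pl"   # enables 4pL fitting, as split_criterion = "4pl" requires
)

# Build the control object only if logicDT is available.
if (requireNamespace("logicDT", quietly = TRUE)) {
  ctrl <- do.call(logicDT::tree.control, ctrl_args)
  class(ctrl)
}
```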


logicDT documentation built on Jan. 14, 2023, 5:06 p.m.