Control for Conditional Inference Trees

Description

Various parameters that control aspects of the ‘ctree’ fit.

Usage

1
2
3
4
5
6
ctree_control(teststat = c("quad", "max"),
    testtype = c("Bonferroni", "Univariate", "Teststatistic"),
    mincriterion = 0.95, minsplit = 20L, minbucket = 7L, 
    minprob = 0.01, stump = FALSE, maxsurrogate = 0L, mtry = Inf, 
    maxdepth = Inf, multiway = FALSE, splittry = 2L, majority = FALSE,
    applyfun = NULL, cores = NULL)

Arguments

teststat

a character specifying the type of the test statistic to be applied.

testtype

a character specifying how to compute the distribution of the test statistic.

mincriterion

the value of the test statistic or 1 - p-value that must be exceeded in order to implement a split.

minsplit

the minimum sum of weights in a node in order to be considered for splitting.

minbucket

the minimum sum of weights in a terminal node.

minprob

proportion of observations needed to establish a terminal node.

stump

a logical determining whether a stump (a tree with three nodes only) is to be computed.

maxsurrogate

number of surrogate splits to evaluate. Note the currently only surrogate splits in ordered covariables are implemented.

mtry

number of input variables randomly sampled as candidates at each node for random forest like algorithms. The default mtry = Inf means that no random selection takes place.

maxdepth

maximum depth of the tree. The default maxdepth = Inf means that no restrictions are applied to tree sizes.

multiway

a logical indicating if multiway splits for all factor levels are implemented for unordered factors.

splittry

number of variables that are inspected for admissible splits if the best split doesn't meet the sample size constraints.

majority

if FALSE, observations which can't be classified to a daughter node because of missing information are randomly assigned (following the node distribution). If FALSE, they go with the majority (the default in ctree).

applyfun

an optional lapply-style function with arguments function(X, FUN, ...). It is used for computing the variable selection criterion. The default is to use the basic lapply function unless the cores argument is specified (see below).

cores

numeric. If set to an integer the applyfun is set to mclapply with the desired number of cores.

Details

The arguments teststat, testtype and mincriterion determine how the global null hypothesis of independence between all input variables and the response is tested (see ctree). The variable with most extreme p-value or test statistic is selected for splitting. If this isn't possible due to sample size constraints explained in the next paragraph, up to splittry other variables are inspected for possible splits.

A split is established when all of the following criteria are met: 1) the sum of the weights in the current node is larger than minsplit, 2) a fraction of the sum of weights of more than minprob will be contained in all daughter nodes, 3) the sum of the weights in all daughter nodes exceeds minbucket, and 4) the depth of the tree is smaller than maxdepth. This avoids pathological splits deep down the tree. When stump = TRUE, a tree with at most two terminal nodes is computed.

The argument mtry > 0 means that a random forest like 'variable selection', i.e., a random selection of mtry input variables, is performed in each node.

In each inner node, maxsurrogate surrogate splits are computed (regardless of any missing values in the learning sample). Factors in test samples whose levels were empty in the learning sample are treated as missing when computing predictions (in contrast to ctree. Note also the different behaviour of majority in the two implementations.

Value

A list.