Contents: Description; Details; Attribute/feature evaluation; Decision/regression tree construction; Stop tree building; Models in the tree leaves; Constructive induction aka. feature construction; Attribute discretization and binarization; Tree pruning; Prediction; Random forests; General tree ensembles; Read data directly from files; Miscellaneous; Author(s); References; See Also

The behavior of CORElearn is controlled by several parameters. This is a short overview.

Many parameters are available. Some are general and can be used in many
learning or feature evaluation algorithms. All the values actually used by
the classifier / regressor can be written to a file (or read from it) using `paramCoreIO`.
The parameters for the methods are split into several groups and documented below.
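As a rough illustration of how parameters are passed and stored, the sketch below gives a few of the parameters documented on this page to `CoreModel` as named arguments and then writes the values actually used to a file with `paramCoreIO`. The data set (`iris`), the chosen values, and the temporary file name are assumptions made for the example, and the model-building part only runs if CORElearn is installed.

```r
# Temporary file for the parameter dump (an arbitrary choice for this sketch).
parFile <- tempfile(fileext = ".par")

if (requireNamespace("CORElearn", quietly = TRUE)) {
  library(CORElearn)

  # Parameters from this page are passed as additional named arguments.
  # The particular values are illustrative, not recommendations.
  tree <- CoreModel(Species ~ ., iris, model = "tree",
                    selectionEstimator = "MDL",
                    minNodeWeightTree = 5,
                    maxThreads = 1)

  # Write all parameter values actually used by the model to the file.
  paramCoreIO(tree, parFile, io = "write")

  destroyModels(tree)  # release the memory held by the model
}
```

The written file can later be inspected or read back with `io = "read"`.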

The parameters in this group may be used inside model construction
via `CoreModel` and feature evaluation in `attrEval`. See `attrEval` for a description of the relevant evaluation methods.

Parameters `attrEvaluationInstances`, `binaryEvaluation`, and `binaryEvaluateNumericAttributes` are applicable to all attribute evaluation methods. In models which need feature evaluation (e.g., trees,
random forests) they affect the selection of splits in the nodes.
Other parameters may be used only in context-sensitive measures, i.e., ReliefF in classification
and RReliefF in regression and their variants.
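A minimal sketch of how these parameters might be passed to `attrEval` follows. It assumes the `iris` data set and the estimator names `"ReliefFexpRank"` and `"InfGain"` as listed in the `attrEval` documentation; the parameter values are arbitrary, and the code only runs if CORElearn is installed.

```r
if (requireNamespace("CORElearn", quietly = TRUE)) {
  library(CORElearn)

  # ReliefF-type (context-sensitive) measure, so ReliefIterations applies.
  estRelief <- attrEval(Species ~ ., iris, estimator = "ReliefFexpRank",
                        ReliefIterations = 30)

  # Non-ReliefF measure: numeric attributes are binarized before evaluation
  # (TRUE is the documented default and is spelled out only for illustration).
  estGain <- attrEval(Species ~ ., iris, estimator = "InfGain",
                      binaryEvaluateNumericAttributes = TRUE)

  # Both results hold one quality estimate per attribute.
  print(sort(estRelief, decreasing = TRUE))
  print(sort(estGain, decreasing = TRUE))
}
```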

- binaryEvaluation
type: logical, default value: FALSE

Should all attributes be treated as binary and binarized before evaluation if necessary? If `TRUE`, then for all multivalued discrete and all numeric features a search for the best binarization is performed, and the evaluation of the best binarization found is reported. If `FALSE`, then multivalued discrete features are evaluated "as is" with multivalued versions of estimators. With ReliefF-type measures, numeric features are also evaluated "as is". For evaluation of numeric features with other (non-ReliefF-type) measures, they are first binarized or discretized. The choice between binarization and discretization is controlled by `binaryEvaluateNumericAttributes`. For performance reasons it is recommended that `binaryEvaluation=FALSE` is used. See also `discretizationSample`.

- binaryEvaluateNumericAttributes
type: logical, default value: TRUE

ReliefF-like measures can evaluate numeric attributes intrinsically; others have to discretize or binarize them before evaluation. For those measures this parameter selects binarization (default) or discretization (computationally more demanding).

- multiclassEvaluation
type: integer, default value: 1, value range: 1, 4

multi-class extension for two-class-only evaluation measures (1=average of all-pairs, 2=best of all-pairs, 3=average of one-against-all, 4=best of one-against-all).

- attrEvaluationInstances
type: integer, default value: 0, value range: 0, Inf

number of instances for attribute evaluation (0=all available).

- minNodeWeightEst
type: numeric, default value: 2, value range: 0, Inf

minimal number of instances (weight) in the resulting split to take it into consideration.

- ReliefIterations
type: integer, default value: 0, value range: -2, Inf

number of iterations for all variants of Relief (0=DataSize, -1=ln(DataSize), -2=sqrt(DataSize)).

- numAttrProportionEqual
type: numeric, default value: 0.04, value range: 0, 1

used in the ramp function: proportion of a numerical attribute's range within which two values are considered equal.

- numAttrProportionDifferent
type: numeric, default value: 0.1, value range: 0, 1

used in the ramp function: proportion of a numerical attribute's range beyond which two values are considered different.

- kNearestEqual
type: integer, default value: 10, value range: 0, Inf

number of neighbors to consider in equal k-nearest attribute evaluation.

- kNearestExpRank
type: integer, default value: 70, value range: 0, Inf

number of neighbors to consider in exponential rank distance attribute evaluation.

- quotientExpRankDistance
type: numeric, default value: 20, value range: 0, Inf

quotient in exponential rank distance attribute evaluation.
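The ramp function mentioned for `numAttrProportionEqual` and `numAttrProportionDifferent` can be sketched in plain R as below. This is an illustration of the usual ReliefF ramp under the assumption that both thresholds are applied to the distance divided by the attribute's range; the function name `ramp_diff` and its arguments are hypothetical and are not part of CORElearn.

```r
# Ramp-style difference between two values of a numeric attribute.
# 'eq' and 'dif' play the roles of numAttrProportionEqual and
# numAttrProportionDifferent (dif must be larger than eq);
# 'attr_range' is the range of the attribute's values.
ramp_diff <- function(v1, v2, attr_range, eq = 0.04, dif = 0.1) {
  d <- abs(v1 - v2) / attr_range            # distance as a proportion of the range
  pmin(1, pmax(0, (d - eq) / (dif - eq)))   # 0 up to 'eq', 1 from 'dif', linear between
}

ramp_diff(0, 0.02, attr_range = 1)  # inside the "equal" threshold
ramp_diff(0, 0.07, attr_range = 1)  # halfway between the two thresholds
ramp_diff(0, 0.50, attr_range = 1)  # beyond the "different" threshold
```

With the default thresholds, differences below 4% of the range count as no difference and those above 10% count as a full difference.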

Several parameters control the construction of the tree model. Some are described here,
but the attribute evaluation, stop building, model, constructive induction, discretization,
and pruning options described in this document are also applicable.
Splits in trees are always binary; however, the option `binaryEvaluation` influences the
feature selection for the split. Namely, the best feature for the split is selected with the given
value of `binaryEvaluation`. If `binaryEvaluation=FALSE`, the features are first evaluated and
the best one is finally binarized. If `binaryEvaluation=TRUE`, the features are binarized before
selection. In this case, a search for the best binarization for all considered features is performed and
the best binarizations found are used for splits. The latter option is computationally more intensive,
but typically does not produce better trees.

- selectionEstimator
type: character, default value: "MDL", possible values: all from `attrEval`, section classification

estimator for selection of attributes and binarization in classification.

- selectionEstimatorReg
type: character, default value: "RReliefFexpRank", possible values: all from `attrEval`, section regression

estimator for selection of attributes and binarization in regression.

- minReliefEstimate
type: numeric, default value: 0, value range: -1, 1

for all variants of Relief attribute estimator: the minimal evaluation of attribute to consider the attribute useful in further processing.

- minInstanceWeight
type: numeric, default value: 0.05, value range: 0, 1

minimal weight of an instance to use it further in splitting.

During tree construction, a node is split recursively until one of the following stopping conditions is fulfilled.

- minNodeWeightTree
type: numeric, default value: 5, value range: 0, Inf

minimal number of instances (weight) of a leaf in the decision or regression tree model.

- minNodeWeightRF
type: numeric, default value: 2, value range: 0, Inf

minimal number of instances (weight) of a leaf in the random forest tree.

- relMinNodeWeight
type: numeric, default value: 0, value range: 0, 1

minimal proportion of training instances in a tree node to split it further.

- majorClassProportion
type: numeric, default value: 1, value range: 0, 1

proportion of majority class in a classification tree node to stop splitting it.

- rootStdDevProportion
type: numeric, default value: 0, value range: 0, 1

proportion of root's standard deviation in a regression tree node to stop splitting it.

The leaves of the tree model can contain various prediction models. For example, instead of classifying with the majority class one can use naive Bayes in classification, or a linear model in regression, thereby expanding the expressive power of the tree model.

- modelType
type: integer, default value: 1, value range: 1, 4

type of models used in classification tree leaves (1=majority class, 2=k-nearest neighbors, 3=k-nearest neighbors with kernel, 4=naive Bayes).

- modelTypeReg
type: integer, default value: 5, value range: 1, 8

type of models used in regression tree leaves (1=mean predicted value, 2=median predicted value, 3=linear by MSE, 4=linear by MDL, 5=linear reduced as in M5, 6=kNN, 7=Gaussian kernel regression, 8=locally weighted linear regression).

- kInNN
type: integer, default value: 10, value range: 0, Inf

number of neighbors in k-nearest neighbors models (0=all).

- nnKernelWidth
type: numeric, default value: 2, value range: 0, Inf

kernel width in k-nearest neighbors models.

- bayesDiscretization
type: integer, default value: 2, value range: 1, 3

type of discretization for naive Bayesian models (1=greedy with selection estimator, 2=equal frequency, 3=equal width).

- discretizationIntervals
type: integer, default value: 4, value range: 1, Inf

number of intervals in equal frequency or equal width discretizations.
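To make the effect of the leaf model type concrete, the sketch below builds two regression trees that differ only in `modelTypeReg`. The `mtcars` data set, the formula, and the parameter values are assumptions chosen for the example, and the code only runs if CORElearn is installed.

```r
if (requireNamespace("CORElearn", quietly = TRUE)) {
  library(CORElearn)

  # Regression tree whose leaves predict the mean of the response.
  rtMean <- CoreModel(mpg ~ wt + hp, mtcars, model = "regTree",
                      modelTypeReg = 1, minNodeWeightTree = 5)

  # Regression tree with linear models (fitted by MSE) in the leaves.
  rtLin <- CoreModel(mpg ~ wt + hp, mtcars, model = "regTree",
                     modelTypeReg = 3, minNodeWeightTree = 5)

  # Predictions on the training data, one value per row.
  predMean <- predict(rtMean, mtcars)
  predLin <- predict(rtLin, mtcars)

  # Compare the two leaf model types by mean absolute error on the training data.
  print(c(mean = mean(abs(predMean - mtcars$mpg)),
          linear = mean(abs(predLin - mtcars$mpg))))

  destroyModels(rtMean)
  destroyModels(rtLin)
}
```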

The expressive power of tree models can be increased by incorporating additional types of splits. Operator-based constructive induction is implemented in both classification and regression. The best construct is searched for with beam search. At each step, new constructs are evaluated with the selected feature evaluation measure. With different types of operators one can control the expressions in the interior tree nodes.

- constructionMode
type: integer, default value: 15, value range: 1, 15

sum of constructive operators (1=single attributes, 2=conjunction, 4=addition, 8=multiplication); all=1+2+4+8=15.

- constructionDepth
type: integer, default value: 0, value range: 0, Inf

maximal depth of the tree for constructive induction (0=do not do construction, 1=only at root, ...).

- noCachedInNode
type: integer, default value: 5, value range: 0, Inf

number of cached attributes in each node where construction was performed.

- constructionEstimator
type: character, default value: "MDL", possible values: all from `attrEval`, section classification

estimator for constructive induction in classification.

- constructionEstimatorReg
type: character, default value: "RReliefFexpRank", possible values: all from `attrEval`, section regression

estimator for constructive induction in regression.

- beamSize
type: integer, default value: 20, value range: 1, Inf

size of the beam in search for best feature in constructive induction.

- maxConstructSize
type: integer, default value: 3, value range: 1, Inf

maximal size of constructs in constructive induction.
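Since `constructionMode` is a sum of operator codes, a given value can be decoded with simple integer arithmetic. The helper below is a hypothetical illustration in base R (the function name is not part of CORElearn); it only restates the codes listed above.

```r
# Decode a constructionMode value into the constructive operators it enables.
decode_construction_mode <- function(mode) {
  operators <- c("single attributes" = 1, "conjunction" = 2,
                 "addition" = 4, "multiplication" = 8)
  # An operator is enabled when its bit is set in 'mode'.
  enabled <- (mode %/% operators) %% 2 == 1
  names(operators)[enabled]
}

decode_construction_mode(15)  # all four operators (the default)
decode_construction_mode(5)   # 1 + 4: single attributes and addition
```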

Some algorithms cannot deal with numeric attributes directly, so we have to discretize them. Also, the tree models use
binary splits in nodes. The discretization algorithm evaluates split candidates and forms intervals of values.
Note that setting `discretizationSample=1` will force random selection of the splitting point, which will speed up the algorithm
and may be perfectly acceptable for random forest ensembles.

CORElearn builds binary trees, so multivalued discrete attributes have to be binarized, i.e., values have to be split into
two subsets, one going left and the other going right in a node. The method used depends on the parameters
and the number of attribute values. Possible methods are exhaustive (if the number of attribute values is less than or equal to
`maxValues4Exhaustive`), greedy (if the number of attribute values is less than or equal to `maxValues4Greedy`),
and random (if the number of attribute values is more than `maxValues4Greedy`).
Setting `maxValues4Greedy=2` will always randomly select the splitting point.

- discretizationLookahead
type: integer, default value: 3, value range: 0, Inf

Discretization is performed with a greedy algorithm which adds a new boundary until there is no improvement in the evaluation function for `discretizationLookahead` number of times (0=try all possibilities). Candidate boundaries are chosen from a random sample of boundaries, whose size is `discretizationSample`.

- discretizationSample
type: integer, default value: 50, value range: 0, Inf

Maximal number of points to try discretization (0=all sensible). For ReliefF-type measures, binarization of numeric features is performed with `discretizationSample` randomly chosen splits. For other measures, the split is searched among all possible splits.

- maxValues4Exhaustive
type: integer, default value: 7, value range: 2, Inf

Maximal number of values of a discrete attribute to try finding split exhaustively. If the attribute has more values, the split will be searched greedily or selected randomly, based on the value of parameter `maxValues4Greedy`.

- maxValues4Greedy
type: integer, default value: 30, value range: 2, Inf

Maximal number of values of a discrete attribute to try finding split greedily. If the attribute has more values, the split will be selected randomly. Setting this parameter to 2 will force random but balanced selection of splits, which may be acceptable for random forest ensembles and will greatly speed up tree construction.

After the tree is constructed, it is beneficial to prune it to reduce the influence of noise.

- selectedPruner
type: integer, default value: 1, value range: 0, 1

decision tree pruning method used (0=none, 1=with m-estimate).

- selectedPrunerReg
type: integer, default value: 2, value range: 0, 4

regression tree pruning method used (0=none, 1=MDL, 2=with m-estimate, 3=as in M5, 4=error complexity as in CART (fixed alpha)).

- mdlModelPrecision
type: numeric, default value: 0.1, value range: 0, Inf

precision of model coefficients in MDL tree pruning.

- mdlErrorPrecision
type: numeric, default value: 0.01, value range: 0, Inf

precision of errors in MDL tree pruning.

- mEstPruning
type: numeric, default value: 2, value range: 0, Inf

m-estimate for pruning with m-estimate.

- alphaErrorComplexity
type: numeric, default value: 0, value range: 0, Inf

alpha for error complexity pruning.

For some models (decision trees, random forests, naive Bayes, and regression trees) one can smooth the output predictions. In classification models the output probabilities are smoothed, and in regression the predicted value is smoothed.

- smoothingType
type: integer, default value: 0, value range: 0, 4

default value 0 means no smoothing (in classification one gets relative frequencies), value 1 stands for additive smoothing, 2 is pure Laplace's smoothing, 3 is m-estimate smoothing, and 4 means Zadrozny-Elkan type of m-estimate smoothing where `smoothingValue` is interpreted as *m * Pc* and *Pc* is the prior probability of the least probable class value; for regression `smoothingType` has no effect, as the smoothing is controlled solely by `smoothingValue`.

- smoothingValue
type: numeric, default value: 0, value range: 0, Inf

additional parameter for some sorts of smoothing; in classification it is needed for additive, m-estimate, and Zadrozny-Elkan type of smoothing; in regression trees 0 means no smoothing, and values larger than 0 shift the predicted value towards the predictions of the models in ancestor nodes.
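The classification smoothing types can be illustrated with their textbook formulas. The function below is a base R sketch, not CORElearn's implementation: the name `smooth_probs`, its arguments, and the exact formulas (in particular for type 4, where m is assumed to be `smoothingValue` divided by the prior of the least probable class) are assumptions made for illustration.

```r
# Smooth class probabilities estimated from class counts in a leaf.
# 'counts' are per-class instance counts; 'prior' are prior class probabilities.
# 'type' and 'value' mimic smoothingType and smoothingValue.
smooth_probs <- function(counts, type = 0, value = 0,
                         prior = rep(1 / length(counts), length(counts))) {
  n <- sum(counts)      # number of instances in the leaf
  k <- length(counts)   # number of classes
  switch(as.character(type),
    "0" = counts / n,                              # relative frequencies
    "1" = (counts + value) / (n + value * k),      # additive smoothing
    "2" = (counts + 1) / (n + k),                  # Laplace's smoothing
    "3" = (counts + value * prior) / (n + value),  # m-estimate with m = value
    "4" = {                                        # value = m * Pc (assumed)
      m <- value / min(prior)
      (counts + m * prior) / (n + m)
    },
    stop("unknown smoothing type"))
}

smooth_probs(c(3, 1))                         # relative frequencies
smooth_probs(c(3, 1), type = 2)               # Laplace's smoothing
smooth_probs(c(3, 1), type = 3, value = 2)    # m-estimate with uniform prior
smooth_probs(c(3, 1), type = 4, value = 1, prior = c(0.8, 0.2))
```

In every variant the smoothed probabilities still sum to 1, and larger values of `value` pull the estimates further from the observed frequencies.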

Random forest is a fairly complex model whose construction can be controlled with several parameters. Currently only the classification version of the algorithm is implemented. Besides the parameters in this section, one can apply the majority of the parameters that control decision trees (except constructive induction and tree pruning).

- rfNoTrees
type: integer, default value: 100, value range: 1, Inf

number of trees in the random forest.

- rfNoSelAttr
type: integer, default value: 0, value range: -2, Inf

number of randomly selected attributes in the node (0=sqrt(numOfAttr), -1=log2(numOfAttr)+1, -2=all).

- rfMultipleEst
type: logical, default value: FALSE

use multiple attribute estimators in the forest? If TRUE the algorithm uses some preselected attribute evaluation measures on different trees.

- rfkNearestEqual
type: integer, default value: 30, value range: 0, Inf

number of nearest instances for weighted random forest classification (0=no weighting).

- rfPropWeightedTrees
type: numeric, default value: 0, value range: 0, 1

Proportion of trees where attribute probabilities are weighted with their quality. As attribute weighting might reduce the variance between the models, the default value switches the weighting off.

- rfPredictClass
type: logical, default value: FALSE

shall individual trees predict with majority class (otherwise with class distribution).
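The special values of `rfNoSelAttr` can be spelled out as follows. This base R helper is hypothetical (the name `n_selected_attrs` is not part of CORElearn) and returns the unrounded value, since the description above does not say how non-integer results are rounded; treating positive values as the number of attributes itself is likewise an assumption.

```r
# Number of randomly selected attributes per node implied by rfNoSelAttr.
n_selected_attrs <- function(rfNoSelAttr, numOfAttr) {
  if (rfNoSelAttr == 0) {
    sqrt(numOfAttr)          # 0: square root of the number of attributes
  } else if (rfNoSelAttr == -1) {
    log2(numOfAttr) + 1      # -1: log2(numOfAttr) + 1
  } else if (rfNoSelAttr == -2) {
    numOfAttr                # -2: all attributes
  } else {
    rfNoSelAttr              # positive values: assumed to be used directly
  }
}

n_selected_attrs(0, 16)    # sqrt(16)
n_selected_attrs(-1, 16)   # log2(16) + 1
n_selected_attrs(-2, 16)   # all 16 attributes
```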

In the same manner as random forests more general tree ensembles can be constructed. Additional options control sampling, tree size and regularization.

- rfSampleProp
type: numeric, default value: 0, value range: 0, 1

proportion of the training set to be used in learning (0=bootstrap replication).

- rfNoTerminals
type: integer, default value: 0, value range: 0, Inf

maximal number of leaves in each tree (0=build the whole tree).

- rfRegType
type: integer, default value: 2, value range: 0, 2

type of regularization (0=no regularization, 1=global regularization, 2=local regularization).

- rfRegLambda
type: numeric, default value: 0, value range: 0, Inf

regularization parameter lambda (0=no regularization).

For very large data sets it is useful to bypass **R** and read data directly from files, as the standalone learning system CORElearn
does. Supported file formats are C4.5, M5, and the native format of CORElearn. See documentation at http://lkm.fri.uni-lj.si/rmarko/software/.

- domainName
type: character

name of a problem to read from files with suffixes .dsc, .dat, .names, .data, .cm, and .costs.

- dataDirectory
type: character

folder where data files are stored.

- NAstring
type: character, default value: "?"

character string which represents missing and NA values in the data files.

- maxThreads
type: integer, default value: 0, value range: 0, Inf

maximal number of active threads (0=allow OpenMP to set its defaults).

As a side effect, this parameter changes the number of active threads in all subsequent executions (until `maxThreads` is set again).

Marko Robnik-Sikonja, Petr Savicky

B. Zadrozny, C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 2001.

`CORElearn`, `CoreModel`, `predict.CoreModel`, `attrEval`, `ordEval`, `paramCoreIO`.
