selectBestFeature-RFTree: selectBestFeature


Description

Finds the best ('splitFeature', 'splitValue') pair, where 'splitFeature' is one of the features specified by 'featureList'. The chosen 'splitFeature' and its corresponding 'splitValue' minimize the specified 'splitrule'. The implementation differs slightly from the original implementation because the tree holds both an averaging dataset and a splitting dataset. To enforce the minimum node size, the method checks both datasets according to 'sampleIndex' and 'nodesize'.

Usage

selectBestFeature(x, y, se, featureList,
  sampleIndex = list(averagingSampleIndex = 1:length(y),
                     splittingSampleIndex = 1:length(y)),
  nodesize = list(splittingNodeSize = 5, averagingNodeSize = 5),
  splitrule = "variance", categoricalFeatureCols = list())

Arguments

x

A data frame of all training predictors.

y

A vector of all training responses.

featureList

A list of candidate variables at the current split.

sampleIndex

A list of the row indices of the dataset used in this node and its children. 'sampleIndex' contains two keys, 'averagingSampleIndex' and 'splittingSampleIndex'. 'averagingSampleIndex' is used to generate the aggregated prediction for the node. 'splittingSampleIndex' is used for 'honestRF' and stores the splitting data used when creating the tree. By default, 'splittingSampleIndex' is the same as 'averagingSampleIndex'.

nodesize

The minimum number of observations contained in terminal nodes. This parameter is a list containing the values for both 'splittingNodeSize' and 'averagingNodeSize'.

splitrule

A string specifying how to find the best split among all candidate feature values. The current version only supports 'variance', which minimizes the overall MSE after splitting. The default value is 'variance'.

categoricalFeatureCols

A list of the column indices of all categorical features. Used by trees to detect categorical columns.

mtry

The number of variables randomly selected at each split point. The default value is one third of the total number of features in the training data.
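
As a rough illustration of how 'sampleIndex' and 'nodesize' fit together, the sketch below builds the two index lists and checks whether a candidate split leaves enough observations on each side in both datasets. The helper 'isSplitAdmissible', the toy data, and the convention that rows with a feature value below the split value go left are assumptions made for this example; none of them is taken from the package.

```r
# Illustrative only: toy index lists in the shape described for 'sampleIndex'.
set.seed(1)
n <- 20                                  # hypothetical number of training rows
shuffled <- sample(n)                    # random permutation of 1..n

# Honest-style setup: disjoint rows for averaging and splitting.
honestIndex <- list(
  averagingSampleIndex = sort(shuffled[1:10]),
  splittingSampleIndex = sort(shuffled[11:20])
)

# Default setup: both keys refer to all rows.
defaultIndex <- list(
  averagingSampleIndex = 1:n,
  splittingSampleIndex = 1:n
)

nodesize <- list(splittingNodeSize = 5, averagingNodeSize = 5)

# Hypothetical helper (not part of the package): TRUE when splitting
# 'feature' at 'value' keeps at least the required number of rows on
# both sides, in both the splitting and the averaging dataset.
# Assumption: rows with feature < value go to the left child.
isSplitAdmissible <- function(feature, value, sampleIndex, nodesize) {
  sideCounts <- function(idx) {
    left <- sum(feature[idx] < value)
    c(left = left, right = length(idx) - left)
  }
  all(sideCounts(sampleIndex$splittingSampleIndex) >= nodesize$splittingNodeSize) &&
    all(sideCounts(sampleIndex$averagingSampleIndex) >= nodesize$averagingNodeSize)
}

feature <- as.numeric(seq_len(n))
okMiddle <- isSplitAdmissible(feature, 10.5, defaultIndex, nodesize)  # 10 vs 10 rows
okEdge <- isSplitAdmissible(feature, 3.5, defaultIndex, nodesize)     # 3 vs 17 rows
```

With the default index, the split at 10.5 passes both checks, while the split at 3.5 fails because only three rows fall on the left side.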

Format

An object of class NULL of length 0.

Value

A list of two outputs: "splitFeature" is the best feature to split on in order to minimize the split loss, and "splitValue" is its corresponding split value.
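
To make the shape of the result concrete, here is a minimal sketch of a variance-based search that returns a list with "splitFeature" and "splitValue". It is not the package's implementation: the function 'bestVarianceSplit', the midpoint candidates, the restriction to numeric features, and the use of the summed squared error of the two children as the loss (which, for a fixed number of rows, ranks splits the same way as the overall MSE) are all assumptions made for illustration. It also ignores 'sampleIndex', 'nodesize', and categorical columns.

```r
# Sum of squared errors around the mean; 0 for an empty vector.
sse <- function(v) if (length(v) == 0) 0 else sum((v - mean(v))^2)

# Hypothetical stand-in for the search (numeric features only).
# Candidate split values are midpoints between consecutive distinct values.
bestVarianceSplit <- function(x, y, featureList) {
  best <- list(splitFeature = NA, splitValue = NA, loss = Inf)
  for (j in featureList) {
    values <- sort(unique(x[[j]]))
    if (length(values) < 2) next         # constant feature: nothing to split
    candidates <- (values[-length(values)] + values[-1]) / 2
    for (s in candidates) {
      left <- x[[j]] < s                 # assumption: left child is x < s
      loss <- sse(y[left]) + sse(y[!left])
      if (loss < best$loss) {
        best <- list(splitFeature = j, splitValue = s, loss = loss)
      }
    }
  }
  best[c("splitFeature", "splitValue")]
}

# Toy data: column 1 separates the responses perfectly at 3.5.
x <- data.frame(a = c(1, 2, 3, 4, 5, 6), b = c(3, 1, 2, 3, 1, 2))
y <- c(0, 0, 0, 10, 10, 10)
result <- bestVarianceSplit(x, y, featureList = list(1, 2))
```

On this toy data the sketch selects column 1 with a split value of 3.5, since that split leaves zero squared error in both children.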

Note

This function is currently deprecated. It has been replaced by the C++ version in the package, although the functionality and parameters are exactly the same. Update the oldFeature value before proceeding.


soerenkuenzel/hte documentation built on June 12, 2018, 4:26 p.m.