EvaluateFeaturesUsingClassifier: Feature selection for node split using RF/rpart models

Description Usage Arguments Details Value

View source: R/EvaluateFeaturesUsingClassifier.R

Description

This function determines which feature to use to split a node in the tree. The intuition is that if a particular feature can be predicted well from the other features, then there is a clear demarcation of behaviour across the values of that feature. The function therefore considers each node candidate in turn, checks how well it can be predicted as a function of the other features, and returns the candidate with the best predictive performance.
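A minimal sketch of the idea described above, not the package's actual implementation: for each node candidate, fit a classifier that predicts it from the remaining features and keep the candidate that is predicted best. The function name, the use of rpart, in-sample accuracy as the score, and the toy dataset are all illustrative assumptions.

```r
library(rpart)

# Hypothetical helper, loosely mirroring the Description: score each
# candidate by how well the other features predict it, return the best.
EvaluateCandidatesSketch <- function(dtDataset, vNodeCandidates, vPredictorFeatures) {
  vAccuracy <- sapply(vNodeCandidates, function(cCandidate) {
    vOtherFeatures <- setdiff(vPredictorFeatures, cCandidate)
    frm <- as.formula(paste(cCandidate, "~", paste(vOtherFeatures, collapse = " + ")))
    model <- rpart(frm, data = dtDataset, method = "class")
    vPredicted <- predict(model, dtDataset, type = "class")
    # In-sample accuracy, purely for illustration; the real function may
    # use out-of-bag or cross-validated performance instead.
    mean(vPredicted == dtDataset[[cCandidate]])
  })
  names(which.max(vAccuracy))
}

# Toy example: Species is largely determined by the other iris measurements,
# while RandomNoise is not, so Species should be the more predictable candidate.
set.seed(42)
dt <- iris
dt$RandomNoise <- factor(sample(c("a", "b"), nrow(dt), replace = TRUE))
cBest <- EvaluateCandidatesSketch(
  dt,
  vNodeCandidates    = c("Species", "RandomNoise"),
  vPredictorFeatures = c("Species", "RandomNoise", "Sepal.Length", "Petal.Length")
)
print(cBest)
```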

Usage

EvaluateFeaturesUsingClassifier(dtDataset, vRows, vNodeCandidates,
  vPredictorFeatures, vKeyFeatures, dtFeatureChunking, cClassifier,
  bUseOptimalChunkPerformance = TRUE)

Arguments

dtDataset

todo

vRows

todo

vNodeCandidates

todo

vPredictorFeatures

todo

vKeyFeatures

todo

dtFeatureChunking

todo

cClassifier

todo

bUseOptimalChunkPerformance

todo

Details

Caveats:

1. If there is only one feature left to use as a predictor, then RF cannot be used: the randomForest package does not handle this degenerate case (it makes little sense to build a forest of stumps). In those cases, rpart with 3-fold cross-validation is used instead.

2. If a feature has a lot of categories, or for ordinal categories like APbin, no grouping of categories is currently done. For massively categorical features, grouping makes sense; for ordinal features, grouping adjacent categories makes sense. How to do this efficiently still needs to be figured out. Would analysing 'confused' groups of classes in the confusion matrix help?
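The first caveat's fallback can be sketched as follows. This is an assumed illustration, not the package's internals: `rpart.control(xval = 3)` asks rpart for 3-fold cross-validation, whose results appear in the fitted model's `cptable` under `xerror`. The function name and the iris example are hypothetical.

```r
library(rpart)

# Hypothetical fallback for the degenerate single-predictor case, where
# randomForest cannot be used: fit an rpart tree with 3-fold CV instead.
FitSinglePredictorFallback <- function(dtDataset, cTarget, cPredictor) {
  frm <- as.formula(paste(cTarget, "~", cPredictor))
  rpart(frm, data = dtDataset, method = "class",
        control = rpart.control(xval = 3))
}

model <- FitSinglePredictorFallback(iris, "Species", "Petal.Length")
# The cross-validated relative error per complexity level is in cptable.
print(model$cptable)
```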

Value

todo


thecomeonman/CURD documentation built on May 20, 2019, 7:37 a.m.