EvaluateFeaturesUsingClassifier: Feature selection for node split using RF/rpart models

Description Usage Arguments Details Value

View source: R/EvaluateFeaturesUsingClassifier.R

Description

This function determines which feature to use to split a node in the tree. The intuition is that if a particular feature can be predicted well from the other features, then there is a clear demarcation of behaviour across the values of that feature. The function therefore considers each node candidate in turn, checks how well it can be predicted as a function of the other features, and returns the candidate with the best predictive performance.
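A minimal sketch of the idea described above, not the package's actual implementation: for each node candidate, fit a classifier that predicts it from the remaining features and keep the candidate that is predicted best. The function name, the use of rpart, in-sample accuracy as the score, and the toy dataset are all illustrative assumptions.

```r
library(rpart)

# Hypothetical helper, loosely mirroring the Description: score each
# candidate by how well the other features predict it, return the best.
EvaluateCandidatesSketch <- function(dtDataset, vNodeCandidates, vPredictorFeatures) {
  vAccuracy <- sapply(vNodeCandidates, function(cCandidate) {
    vOtherFeatures <- setdiff(vPredictorFeatures, cCandidate)
    frm <- as.formula(paste(cCandidate, "~", paste(vOtherFeatures, collapse = " + ")))
    model <- rpart(frm, data = dtDataset, method = "class")
    vPredicted <- predict(model, dtDataset, type = "class")
    # In-sample accuracy, purely for illustration; the real function may
    # use out-of-bag or cross-validated performance instead.
    mean(vPredicted == dtDataset[[cCandidate]])
  })
  names(which.max(vAccuracy))
}

# Toy example: Species is largely determined by the other iris measurements,
# while RandomNoise is not, so Species should be the more predictable candidate.
set.seed(42)
dt <- iris
dt$RandomNoise <- factor(sample(c("a", "b"), nrow(dt), replace = TRUE))
cBest <- EvaluateCandidatesSketch(
  dt,
  vNodeCandidates    = c("Species", "RandomNoise"),
  vPredictorFeatures = c("Species", "RandomNoise", "Sepal.Length", "Petal.Length")
)
print(cBest)
```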

Usage

EvaluateFeaturesUsingClassifier(dtDataset, vRows, vNodeCandidates,
  vPredictorFeatures, vKeyFeatures, dtFeatureChunking, cClassifier,
  bUseOptimalChunkPerformance = TRUE)

Arguments

dtDataset

todo

vRows

todo

vNodeCandidates

todo

vPredictorFeatures

todo

vKeyFeatures

todo

dtFeatureChunking

todo

cClassifier

todo

bUseOptimalChunkPerformance

todo

Details

Caveats:

1. If there is only one feature left to use as a predictor, then RF cannot be used: the randomForest package does not handle this degenerate case (it makes little sense to build a forest of stumps). In those cases, rpart with 3-fold cross-validation is used instead.

2. If a feature has a lot of categories, or for ordinal categories like APbin, no grouping of categories is currently done. For massively categorical features, grouping makes sense; for ordinal features, grouping adjacent categories makes sense. How to do this efficiently still needs to be figured out. Would analysing 'confused' groups of classes in the confusion matrix help?
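The first caveat's fallback can be sketched as follows. This is an assumed illustration, not the package's internals: `rpart.control(xval = 3)` asks rpart for 3-fold cross-validation, whose results appear in the fitted model's `cptable` under `xerror`. The function name and the iris example are hypothetical.

```r
library(rpart)

# Hypothetical fallback for the degenerate single-predictor case, where
# randomForest cannot be used: fit an rpart tree with 3-fold CV instead.
FitSinglePredictorFallback <- function(dtDataset, cTarget, cPredictor) {
  frm <- as.formula(paste(cTarget, "~", cPredictor))
  rpart(frm, data = dtDataset, method = "class",
        control = rpart.control(xval = 3))
}

model <- FitSinglePredictorFallback(iris, "Species", "Petal.Length")
# The cross-validated relative error per complexity level is in cptable.
print(model$cptable)
```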

Value

todo


thecomeonman/CURD documentation built on May 20, 2019, 7:37 a.m.