OptimizeNominalFeatureChunks: Figure out how to divide nominal features into chunks


View source: R/OptimizeNominalFeatureChunks.R

Description

The higher the number of categories, the more splits on a node. This not only results in too many segments (generally not a desirable characteristic), but possibly also in segments with too little data to proceed. A better approach, therefore, is to form groups of categories such that the data points whose category values for the given feature fall in the same group are relatively homogeneous.

Usage

OptimizeNominalFeatureChunks(mValConfMat, nMinSplits = 2,
  nMaxSplits = 3)

Arguments

mValConfMat

The confusion matrix, on the proportion scale described in Details, of a classifier built to predict the given categorical feature on a validation sample.

nMinSplits

The minimum number of groups to consider splitting the category values into.

nMaxSplits

The maximum number of groups to consider splitting the category values into.
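
As a hedged illustration of how mValConfMat might be prepared (the data, the classifier output, and the proportion scaling below are assumptions based on the Details section, not a documented interface):

# Hypothetical validation-sample vectors: vActual holds the observed category
# values of the nominal feature, vPredicted holds a classifier's predictions.
vActual    <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"))
vPredicted <- factor(c("A", "B", "B", "A", "C", "C", "B", "A"),
                     levels = levels(vActual))

# Confusion matrix on the proportion scale described in Details: entry (i, j)
# is the proportion of validation instances in category i classified as j.
mValConfMat <- unclass(prop.table(table(vActual, vPredicted)))

# Illustrative call; the structure of the return value is not documented here.
OptimizeNominalFeatureChunks(mValConfMat, nMinSplits = 2, nMaxSplits = 3)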

Details

A simple heuristic to determine this grouping can be based on the confusion matrix of a classifier built to predict the given categorical feature. Let $C_{p \times p}$ be the confusion matrix for a predictive model for a feature with $p$ category values, where $C_{i,j}$ represents the proportion of instances in the validation sample where a data point in category $i$ has been classified as being in category $j$. Let $\tilde{C} = 1 - (C + C^T)/2$ (where $(\cdot)^T$ is the transpose operator) be a symmetric matrix that represents the pairwise distance between category values. The intuition here is that, if the classifier cannot distinguish between categories $i$ and $j$, then they can be combined; the pairwise distance matrix $\tilde{C}$ captures how close or far the category values are in this respect.

A hierarchical clustering algorithm applied to the category values, based on the distances in $\tilde{C}$, yields a grouping that can be used to split the node. The grouping can first be applied to the confusion matrix $C$ so that its order equals the number of groups, and the collapsed matrix can then be used to evaluate the goodness of the split.
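
A minimal sketch of this grouping logic in base R follows; the distance definition comes from the text above, while the toy confusion matrix, the complete-linkage clustering, and the choice of k = 3 groups are assumptions (the package's own implementation may differ).

# Toy p x p confusion matrix C on the proportion scale (rows = actual
# category, columns = predicted category); the values here are made up.
set.seed(1)
p <- 4
mValConfMat <- matrix(runif(p * p), nrow = p,
                      dimnames = list(LETTERS[1:p], LETTERS[1:p]))
mValConfMat <- mValConfMat / sum(mValConfMat)

# Pairwise distance between category values: 1 - (C + t(C)) / 2.
# Categories that the classifier confuses often end up close together.
mDist <- 1 - (mValConfMat + t(mValConfMat)) / 2

# Hierarchical clustering on the category values (complete linkage assumed);
# as.dist() uses only the lower triangle, so the diagonal is ignored.
hcCategories <- hclust(as.dist(mDist))

# Cut the tree into k groups; k = 3 is an example within
# [nMinSplits, nMaxSplits]. Each category value gets a group label.
vGroups <- cutree(hcCategories, k = 3)

# Collapse C to group level so that its order equals the number of groups;
# this collapsed matrix can be used to score the goodness of the split.
mGroupConfMat <- t(rowsum(t(rowsum(mValConfMat, vGroups)), vGroups))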

Value

todo

