woe.tree.binning: Binning via Tree-Like Segmentation

Description Usage Arguments Value Binning of Numeric Variables Binning of Factors Adjustment of 0 Frequencies Handling of Missing Data See Also Examples

Description

woe.tree.binning generates a supervised tree-like segmentation of numeric variables and factors with respect to a dichotomous target variable. Its parameters provide flexibility in finding a binning that fits specific data characteristics and practical needs.

Usage

1
2
woe.tree.binning(df, target.var, pred.var, min.perc.total,
                min.perc.class, stop.limit, abbrev.fact.levels, event.class)

Arguments

df

Name of data frame with input data.

target.var

Name of dichotomous target variable in quotes. Only target variables with two distinct values (e.g. 0, 1 or “Y”, “N”) are accepted; cases with NAs in the target variable will be ignored.

pred.var

Name of predictor variable(s) to be binned in quotes. A single variable name can be provided, e.g. “varname1”, or a list of variable names, e.g. c(“varname1”, “varname2”). Alternatively one can repeat the name of the input data frame; the function will be applied to all its variables apart from the target variable then. Numeric variables and factors are supported and may contain NAs.

min.perc.total

For numeric variables this parameter defines the number of initial classes before any merging or tree-like splitting is applied. For example min.perc.total=0.05 (5%) will result in 20 initial classes. For factors the original levels with a percentage below this limit are collected in a ‘miscellaneous’ level before the merging based on the min.perc.class and the tree-like splitting based on the WOE values starts. Increasing the min.perc.total parameter will avoid sparse bins. Accepted range: 0.0001-0.2; default: 0.01.

min.perc.class

If a column percentage of one of the target classes within a bin is below this limit (e.g. below 0.01=1%) then the respective bin will be joined with others. In case of numeric variables adjacent predictor classes are merged. For factors respective levels (including sparse NAs) are assigned to a ‘miscellaneous’ level. Setting min.perc.class>0 may provide more reliable WOE values. Accepted range: 0-0.2; default: 0, i.e. no merging with respect to sparse target classes is applied.

stop.limit

Stops WOE based segmentation of the predictor's classes/levels in case the resulting information value (IV) increases less than x% (e.g. 0.05 = 5%) compared to the preceding binning step. Increasing the stop.limit will simplify the binning solution and may avoid overfitting. Accepted range: 0-0.5; default: 0.1.

abbrev.fact.levels

Abbreviates the names of new (merged) factor levels via the base R abbreviate function in case the specified number of characters is exceeded. Accepted range: 0-1000; default: 200. 0 will prevent applying any abbreviation, i.e. only factor levels with more than 1000 characters will be truncated then. This option is particularly relevant in case one wants to generate dummy variables via the woe.binning.deploy function, because the factor levels will be part of the dummy variable names then.

event.class

Optional parameter for specifying the class of the target event. This class typically indicates a negative event like a loan default or a disease. Use integers (e.g. 1) or characters in quotes (e.g. “bad”). This class will be represented by negative WOE values then.

Value

woe.tree.binning generates an object with the information necessary for studying and applying the realized binning solution. When saved it can be used with the functions woe.binning.plot, woe.binning.table and woe.binning.deploy.

Binning of Numeric Variables

Numeric variables (continuous and ordinal) are binned beginning with initial classes with similar frequencies. The number of initial bins results from the min.perc.total parameter: min.perc.total will result in trunc(1/min.perc.total) initial bins, whereby trunc is needed to guarantee bins with similar frequencies. For example min.perc.total=0.07 will cause trunc(14.3)=14 initial classes. Next, if min.perc.class>0, bins with sparse target classes will be merged with the next upper bin, and in case of the last bin with the next lower one. NAs have their own bin and will not be merged with others. Finally the actual tree-like procedure starts: binary splits iteratively assign nearby classes with similar weight of evidence (WOE) values to segments in a way that maximizes the resulting information value (IV). The procedure stops when the IV increases less then specified by a percentage value (stop.limit parameter).

Binning of Factors

Factors (categorical variables) are binned via factor levels. As a start sparse levels (defined via the min.perc.total and min.perc.class parameters) are merged to a ‘miscellaneous’ level: if possible, respective levels (including sparse NAs) are bundled as ‘misc. level pos.’ (associated with positive WOE values), respectively as ‘misc. level neg.’ (associated with negative WOE values). In case a misc. level contains only NAs it will be named ‘Missing’. Afterwards the actual tree-like procedure starts: binary splits iteratively assign levels with similar WOE values to segments in a way that maximizes the resulting information value (IV). The procedure stops when the IV increases less then specified by a percentage value (stop.limit parameter).

Adjustment of 0 Frequencies

In case the crosstab of the bins with the target classes contains frequencies = 0 the column percentages are adjusted to be able to compute the WOE and IV values: the offset 0.0001 (=0.01%) is added to each column percentage cell and the column percentages are recomputed then. This allows considering bins associated with one target class only, but may cause extreme WOE values for these bins. If a correction is not appropriate choose min.perc.class>0; bins with sparse target classes will be merged then before computing any WOE or IV value.

Handling of Missing Data

Cases with NAs in the target variable will be ignored. For predictor variables the following applies: in case NAs already occurred when generating the binning solution the code ‘Missing’ is displayed and a corresponding WOE value can be computed. (Note that factor NAs may be joined with other sparse levels to a ‘miscellaneous’ level - see above; only this ‘miscellaneous’ level will be displayed then.) In case NAs occur in the deployment scenario only ‘Missing’ is displayed for numeric variables and ‘unknown’ for factors; and the corresponding WOE values will be NA then, as well.

See Also

Other binning functions: woe.binning

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
# Load German credit data and create subset
data(germancredit)
df <- germancredit[, c('creditability', 'credit.amount', 'duration.in.month',
                  'savings.account.and.bonds', 'purpose')]

# Bin a single numeric variable
binning <- woe.tree.binning(df, 'creditability', 'duration.in.month',
                           min.perc.total=0.01, min.perc.class=0.01,
                           stop.limit=0.1, event.class='bad')

# Bin a single factor
binning <- woe.tree.binning(df, 'creditability', 'purpose',
                           min.perc.total=0.05, min.perc.class=0, stop.limit=0.1,
                           abbrev.fact.levels=50, event.class='bad')

# Bin two variables (one numeric and one factor)
# with default parameter settings
binning <- woe.tree.binning(df, 'creditability', c('credit.amount','purpose'))

# Bin all variables of the data frame (apart from the target variable)
# with default parameter settings
binning <- woe.tree.binning(df, 'creditability', df)

woeBinning documentation built on May 2, 2019, 9:23 a.m.