A numeric independent variable is discretized and returned as a factor. A binary dependent variable is used to select the bins using a simple, fast algorithm based on quantiles.

`iv` |
A numeric independent variable that will be cut into bins. Missing values will be ignored during
binning and replaced using |

`dv` |
The dependent variable must be an array of values with the same length as |

`nbins` |
The number of bins to break |

`minBin` |
Each bin will have at least |

`woeDelta` |
If the absolute difference in the Weight of Evidence between adjacent
bins falls below this threshold, then the bins are merged.
See |

`bins` |
If TRUE the breaks are returned, along with the factor, in a list.

`debug` |
If TRUE debug information will be printed to the screen.

This function is similar to cut, but it uses a dependent variable to inform the binning. The algorithm is designed to be fast and simple; it is a slightly modified version of an equal frequency approach (quantiles).

The algorithm works as follows:

The independent variable is filtered to keep only non-missing values that belong to the smaller class of the dependent variable.

The filtered independent variable is used to compute `nbins` quantiles. For the special case where there are fewer unique values than bins, the unique values are used as the quantiles.

The first and last quantiles are adjusted, if necessary, to include all independent variable values regardless of their dependent variable class.

The independent variable is cut into bins using the quantiles as boundaries.

Each class of the dependent variable is counted in each bin. If the count is below `minBin` for either class, the bin is merged with the smallest adjacent bin. This merge process continues until all bins have a sufficient count of dependent variable values, or until there are 2 bins left.

The Weight of Evidence (WOE) is calculated for each bin. If the difference in the WOE for adjacent bins falls below a threshold defined in terms of `woeDelta`, the bins are merged.
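The steps above can be sketched in base R. This is an illustration only, not the package's source code. The simulated data and the values chosen for `nbins`, `minBin` and `woeDelta` are assumptions. So is the use of `nbins + 1` quantile boundaries to form `nbins` bins, and so is the WOE definition (log ratio of the two class shares in each bin), which is one common convention and may differ from what `Woe` computes. The sketch only flags bins that would be merged; it does not perform the merging.

```
# Illustrative sketch only; simulated data, not the Causata implementation.
set.seed(1)
n  <- 2000
iv <- c(rexp(n - 50, rate = 0.2), rep(NA, 50))   # numeric predictor with 50 missing values
dv <- rbinom(n, 1, plogis(-2 + 0.15 * ifelse(is.na(iv), 0, iv)))  # binary outcome
nbins    <- 5     # assumed value
minBin   <- 15    # assumed value
woeDelta <- 0.1   # assumed value, used directly as the threshold here

# Step 1: keep non-missing values from the smaller class of dv
minority <- as.integer(names(which.min(table(dv))))
iv_small <- iv[!is.na(iv) & dv == minority]

# Step 2: quantile boundaries (assumption: nbins + 1 boundaries give nbins bins);
# fall back to the unique values when there are fewer of them than bins
if (length(unique(iv_small)) < nbins) {
  breaks <- sort(unique(iv_small))
} else {
  probs  <- seq(0, 1, length.out = nbins + 1)
  breaks <- unique(quantile(iv_small, probs = probs, names = FALSE))
}

# Step 3: stretch the outer boundaries so every value of iv is covered
breaks[1]              <- min(iv, na.rm = TRUE)
breaks[length(breaks)] <- max(iv, na.rm = TRUE)

# Step 4: cut the full variable; missing values stay NA
f <- cut(iv, breaks = breaks, include.lowest = TRUE)

# Step 5: count each class in each bin and flag bins below minBin
counts   <- table(f, dv)
too_small <- rowSums(counts < minBin) > 0

# Step 6: WOE per bin (assumed definition) and adjacent pairs closer than woeDelta
woe <- log((counts[, "1"] / sum(counts[, "1"])) /
           (counts[, "0"] / sum(counts[, "0"])))
close_pairs <- abs(diff(woe)) < woeDelta

print(counts)
print(round(woe, 3))
```

Dropping the interior boundary shared by two flagged bins and calling `cut` again would be one simple way to merge them.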

If `bins` is FALSE then a factor with up to `nbins` levels is returned, where the level names are as found from `cut`. Missing values in the independent variable are returned as missing values in the output, and are not counted as a bin.
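As a small illustration of the level names and the missing-value behaviour described here, the following uses base `cut` directly. The breaks are hypothetical, and `include.lowest = TRUE` is an assumption about how the lowest boundary is treated rather than a documented detail of `BinaryCut`.

```
# Hypothetical cutpoints applied with base cut(); not output from BinaryCut.
iv     <- c(0.5, 2, NA, 7.5, 10)
breaks <- c(0, 3, 10)
f <- cut(iv, breaks = breaks, include.lowest = TRUE)

levels(f)   # "[0,3]"  "(3,10]"
is.na(f)    # FALSE FALSE  TRUE FALSE FALSE
table(f)    # the NA is not counted as a bin: 2 values in each level
```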

If `bins` is TRUE then a list is returned with two elements:

`fiv`

A factor representation of the independent variable, as described above.

`breaks`

A vector of breaks or cutpoints used to discretize the independent variable.

Justin Hemann <support@causata.com>

```
library(Causata)
data(df.causata)
dv <- df.causata$has.responded.mobile.logoff_next.hour_466
iv <- df.causata$online.number.of.page.views_last.30.days_3
f <- BinaryCut(iv,dv)
# compute the weight of evidence for each bin
woe <- Woe(f, dv)
# adjust plot margins to increase space for bin labels
par(oma=c(1,8,1,1))
# plot the bins against the weight of evidence
barplot(woe$woe.levels, names.arg=levels(f), horiz=TRUE, las=1,
        main="Weight of Evidence for clicking a banner for a mobile app.",
        sub="WOE vs. Page View Count, Last 30 Days")
```
