Cuts a numeric independent variable into bins.
A numeric independent variable is discretized and returned as a factor. A binary dependent variable is used to select the bins using a simple, fast algorithm based on quantiles.
1 2 3
A numeric independent variable that will be cut into bins. Missing values will be ignored during
binning and replaced using
The dependent variable must be an array of values with the same length as
The number of bins to break
Each bin will have at least
If the absolute value of the Weight Of Evidence for adjacent
bins falls below this threshold, then the bins are merged.
If TRUE the breaks are returned, along with the factor, in a list.
If TRUE debug information will be printed to the screen.
This function is similar to cut, but it uses a dependent variable to inform the binning. The algorithm is designed to be fast and simple; it is a slightly modified version of an equal frequency approach (quantiles).
The algorithm works as follows:
The independent variable is filtered to include only non-missing values, and values from the smaller class of the dependent varaible.
The filtered independent variable is used to compute
nbinsquantiles. For the special case where there are fewer unique values than bins the unique values are used as the quantiles.
The first and last quantiles are adjusted, if necessary, to include all independent variable values regardless of their dependent variable class.
The independent variable is cut into bins using the quantiles as boundaries.
Each class of the dependent variable is counted in each bin. If the count is below
minBinfor either class then the bin is merged with the smallest adjacent bin. This merge process continues until all bins have a sufficient count of dependent variable values, or until there are 2 bins left.
The Weight of Evidence is calculated for each bin. If the difference in the WOE for adjacent bins falls below a threshold defined in terms of
woeDeltathen the bins are merged.
bins is FALSE then a factor with up to
nbins levels is returned,
where the level names are as found from cut. Missing values in the independent
variable are returned as missing values in the output, and are not counted as a bin.
bins is TRUE then a list is returned with two elements:
fivA factor representation of the independent variable, as described above.
breaksA vector of breaks or cutpoints used to discretize the independent variable.
Justin Hemann <email@example.com>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
data(df.causata) dv <- df.causata$has.responded.mobile.logoff_next.hour_466 iv <- df.causata$online.number.of.page.views_last.30.days_3 f <- BinaryCut(iv,dv) # compute the weight of evidence for each bin woe <- Woe(f, dv) # adjust plot margins to increase space for bin labels par(oma=c(1,8,1,1)) # plot the bins against the weight of evidence barplot(woe$woe.levels, names.arg=levels(f), horiz=TRUE, las=1, main="Weight of Evidence for clicking a banner for a mobile app.", sub="WOE vs. Page View Count, Last 30 Days" )
Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.