plotYXbin: Response (y) Statistics Within Levels/Bins of a Predictor (x)

Description

Statistics for y are plotted for each level or bin of x. The plotted statistic can be a proportion, log-odds, or weight-of-evidence (WoE) value. Bins can be created from raw factor levels, quantile breakpoints, uniform breakpoints, or recursive partitioning; additional arguments may be passed to rpart.control() to fine-tune the partitioning. Plots showing the chosen ymetric and the total volume in each bin are printed to the current graphics device. In addition, two measures of the overall strength of the predictive relationship (Information Value and ChiSq) are calculated and returned.

Usage

plotYXbin(y, x, ymetric = "proportion", xsplit = "quantile", nbins = 10,
  nabin = TRUE, yticks = 6, ...)

Arguments

y

(numeric) binary response vector

x

(numeric or factor) predictor vector

ymetric

(character) statistic to calculate for y: c('proportion', 'logodds', 'woe')

xsplit

(character) method used to bin x: c('quantile', 'uniform', 'rpart')

nbins

(numeric) number of bins to create from x

nabin

(logical) whether to include an additional bin for missing x values

yticks

(numeric) number of tick marks to display on the y-axis of plots

...

(args) additional arguments to pass to rpart.control()

Details

If xsplit='rpart', bins are created by recursive partitioning for both numeric and factor variables, and the nbins argument is ignored. Pass additional control parameters (e.g. cp, minbucket) in the function call to control partitioning behavior. If the rpart control settings produce zero bins or more than 20 bins, the function throws an error. If x is a numeric variable, the x-axis labels are the range cutpoints for each bin created by recursive partitioning. If x is a factor variable, the x-axis labels on the returned plots correspond to the index positions of the levels of x in each bin, not to the factor labels themselves. Using recursive partitioning with more than 50 factor levels is generally not a good idea.
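The following is a minimal sketch of how recursive partitioning can turn a numeric predictor into bins. It is not the package's internal code: the simulated data and the names fit, leaf, n_bins and cuts are assumptions made for illustration, and only rpart() and rpart.control() from the rpart package are used.

```r
library(rpart)

# Simulated data (hypothetical): the event rate steps up with x
set.seed(42)
x <- runif(2000)
y <- rbinom(2000, 1, ifelse(x < 0.3, 0.1, ifelse(x < 0.7, 0.4, 0.8)))

# Fit a classification tree of y on x; cp and minbucket play the same
# role as the extra arguments forwarded to rpart.control()
fit <- rpart(y ~ x, data = data.frame(y = y, x = x), method = "class",
             control = rpart.control(cp = 1e-4, minbucket = 100))

# fit$where holds the terminal node (leaf) of each observation;
# each distinct leaf can be treated as one bin
leaf   <- fit$where
n_bins <- length(unique(leaf))

# Cutpoints on x chosen by the tree, in increasing order
cuts <- sort(unique(fit$splits[, "index"]))

table(leaf)  # volume per bin
```

Raising minbucket or cp in a setup like this gives fewer, larger bins, which is one way to stay under a limit on the number of bins.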

If xsplit is 'uniform' or 'quantile' and x is a factor variable, its levels are used directly as bins and the nbins argument is ignored. If x is a numeric variable, bins are calculated by dividing the range of x into buckets of either equal width (uniform) or equal count (quantile). If quantile breakpoints are not unique, adjacent identical bins are combined.
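The difference between the two numeric binning schemes, and the effect of non-unique quantile breakpoints, can be sketched with base R. This is an illustration under assumed data, not the function's actual implementation; the variable names are hypothetical.

```r
# Skewed numeric predictor with many ties at zero (simulated, hypothetical)
set.seed(1)
x <- c(rexp(200), rep(0, 50))
nbins <- 10

# Uniform: nbins equal-width intervals spanning the range of x
uniform_breaks <- seq(min(x), max(x), length.out = nbins + 1)
uniform_bins   <- cut(x, breaks = uniform_breaks, include.lowest = TRUE)

# Quantile: breakpoints at equally spaced probabilities. Ties in x can
# make some breakpoints identical, so duplicates are dropped before
# cutting, which merges the affected bins
quantile_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = nbins + 1),
                                   names = FALSE))
quantile_bins   <- cut(x, breaks = quantile_breaks, include.lowest = TRUE)

table(uniform_bins)   # very uneven counts for skewed data
table(quantile_bins)  # roughly even counts, but fewer than nbins bins
```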

If any bins have either zero volume or zero variance in y (all responses in the bin are identical), log-odds and WoE cannot be calculated for them. Such bins are excluded from both the displayed plots and the calculation of information value for the variable. This problem can typically be solved by using quantile binning and/or reducing the number of bins.
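To see why such bins are a problem, the sketch below computes the three statistics from per-bin counts using their common textbook definitions. The counts are made up, and the sign convention for WoE (events over non-events) and the handling of excluded bins are assumptions; the package may differ on both.

```r
# Hypothetical per-bin counts, not output from plotYXbin()
bins <- data.frame(
  bin    = c("A", "B", "C", "D"),
  events = c(10, 30, 60, 0),    # observations with y == 1
  nonev  = c(90, 70, 40, 25)    # observations with y == 0
)

bins$proportion <- bins$events / (bins$events + bins$nonev)
bins$logodds    <- log(bins$events / bins$nonev)

# WoE compares each bin's share of all events with its share of all non-events
pct_events <- bins$events / sum(bins$events)
pct_nonev  <- bins$nonev  / sum(bins$nonev)
bins$woe   <- log(pct_events / pct_nonev)

# Bin D has no events (zero variance in y), so its log-odds and WoE are -Inf.
# Only bins with finite values contribute to the information value
ok <- is.finite(bins$logodds) & is.finite(bins$woe)
iv <- sum((pct_events[ok] - pct_nonev[ok]) * bins$woe[ok])

bins[ok, ]
iv
```

An empty bin would give 0/0 = NaN for every statistic and would be filtered out by the same is.finite() check.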

Value

a list containing the following elements:

Examples

data(diamonds, package = 'ggplot2')
y  <- as.numeric(diamonds$price > mean(diamonds$price))
x1 <- diamonds$carat
x2 <- diamonds$clarity
x3 <- diamonds$y

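# Defaults: proportion of y == 1 within ten quantile bins of carat
res <- plotYXbin(y, x1)
# Eight quantile bins, with no separate bin for missing values
res <- plotYXbin(y, x1, nbins = 8, nabin = FALSE)
# Factor predictor: clarity levels are used directly as bins; plot WoE
res <- plotYXbin(y, x2, ymetric = 'woe')
# Recursive partitioning; cp and minbucket are passed on to rpart.control()
res <- plotYXbin(y, x3, ymetric = 'proportion', xsplit = 'rpart', cp = 1e-4, minbucket = 100)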

etlundquist/eRic documentation built on May 16, 2019, 9:07 a.m.