BinaryPredictor: Univariate analysis for binary classification.
In Causata: Analysis utilities for binary classification and Causata users.


An independent variable is evaluated as a predictor for a binary dependent variable. The independent variable may be numeric, a factor, or a data frame containing numeric and factor columns.

## S3 method for class 'factor'
BinaryPredictor(iv, dv, min.power=0.01, min.robustness=0.5, 
  max.missing=0.99, max.levels=20, civ=NULL, copy.data=FALSE, name=NULL, ...)
  
## S3 method for class 'numeric'
BinaryPredictor(iv, dv, min.power=0.01, min.robustness=0.5, 
  max.missing=0.99, copy.data=FALSE, name=NULL, ...)
  
## S3 method for class 'data.frame'
BinaryPredictor(iv, dv, min.power=0.01, min.robustness=0.5, 
  max.missing=0.99, verbose=FALSE, copy.data=FALSE, ...)

## Default S3 method:
BinaryPredictor(iv, dv, ...)

## S3 method for class 'BinaryPredictor'
plot(x, y=NULL, type="bin", plot.missing=TRUE, ...)

## S3 method for class 'BinaryPredictorList'
print(x, file=NULL, silent=FALSE, ...)

`iv`	The independent variable(s). May be a factor, numeric, or a data frame.
`dv`	The dependent variable, which may have only two unique values. The length / number of rows in `iv` must match the length of `dv`.
`min.power`	The minimum predictive power from `PredictivePowerCv` for a variable to be kept.
`min.robustness`	The minimum robustness from `PredictivePowerCv` for a variable to be kept.
`max.missing`	The maximum allowable fraction of missing values for a variable to be kept.
`max.levels`	For factors, this controls the merging of small bins using `MergeLevels`.
`civ`	When a continuous variable is discretized, the original continuous data can be provided in `civ` so that linearity can be computed. See `Woe` for more information.
`copy.data`	Reserved for future use, indicates if the data should be copied.
`name`	The variable name. If NULL it will be extracted from the deparsed input `iv`.
`...`	For the `BinaryPredictor` functions the extra arguments are passed to `PredictivePowerCv`. If `iv` is numeric then extra arguments are also passed to `BinaryCut`. For `plot` the extra arguments are passed to `ShortenStrings`, which is used to shorten the names of factor levels in plots.
`verbose`	If `TRUE`, calculation information is printed.
`x`	Output from one of the `BinaryPredictor` functions.
`y`	Unused argument for the generic `plot` function.
`plot.missing`	When plotting numeric variables a `TRUE` value will add a horizontal line representing the log odds associated with missing values.
`type`	Reserved for future use, indicates the type of plot to be generated. The only valid value now is 'bin'.
`file`	If a filename is provided then summary information will be written to a text file.
`silent`	If set to `TRUE` then nothing is printed to the screen.

The BinaryPredictor family of functions evaluates predictors of a binary outcome. Checks are run on the variable class (only numeric, integer, and factor are allowed), missing values, predictive power, and robustness. If any check fails, the "keep" flag is set to FALSE; otherwise it is TRUE.
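
The keep / reason logic can be pictured with a small base R sketch. This is not the package's implementation: `check_predictor` is a hypothetical helper, the predictive power and robustness are passed in as plain numbers instead of being computed by `PredictivePowerCv`, and both the order of the checks and the reason strings are assumptions.

```r
# Hypothetical helper illustrating the keep / reason logic; not part of Causata.
# `power` and `robustness` stand in for values PredictivePowerCv would supply.
check_predictor <- function(iv, power, robustness,
                            min.power = 0.01, min.robustness = 0.5,
                            max.missing = 0.99) {
  missing.frac <- mean(is.na(iv))   # fraction of NA values
  reason <- NA_character_           # NA means no check has failed
  if (!(is.numeric(iv) || is.factor(iv))) {
    reason <- "unsupported class"
  } else if (missing.frac > max.missing) {
    reason <- "too many missing values"
  } else if (power < min.power) {
    reason <- "predictive power too low"
  } else if (robustness < min.robustness) {
    reason <- "robustness too low"
  }
  # Only the first failed check is recorded, mirroring the `reason` field.
  list(keep = is.na(reason), reason = reason, missing = missing.frac)
}

res.ok  <- check_predictor(c(1.2, NA, 3.4, 5.6), power = 0.20,  robustness = 0.90)
res.bad <- check_predictor(c(1.2, NA, 3.4, 5.6), power = 0.005, robustness = 0.90)
```

Here `res.ok$keep` is `TRUE`, while `res.bad$keep` is `FALSE` because the supplied power falls below the default `min.power`.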

The plot function generates a summary plot of the predictor. Predictive power and robustness are printed in the plot title, along with the smallest and largest bin sizes used during discretization. For numeric variables a count of missing values is also printed.

The print function writes a table of variable summary information to the screen or to a file.

If iv is a vector then an object of class BinaryPredictor is returned with the following items:

`name`	The variable name.
`keep`	A boolean indicating if the variable meets the criteria for missing values, predictive power, etc.
`reason`	If `keep=FALSE` then this field contains a text string indicating the first criterion the variable failed to meet.
`missing`	The fraction of values that are missing / NA.
`class`	The variable class.
`predictivePower`	Results from `PredictivePowerCv`.
`woe`	Results from `Woe`.

If iv is a data frame then a list of BinaryPredictor objects is returned with class BinaryPredictorList.

The print.BinaryPredictorList function returns a data frame with columns for the values in the BinaryPredictor output. The values include the variable name, predictive power, robustness, etc.
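
Because a data frame input yields a list of BinaryPredictor objects, the documented `keep` field can be used to select the variables that passed every check. The sketch below uses mock objects carrying only the `name`, `keep`, and `reason` fields so that it runs without the package; the reason text is a placeholder, and a real list would come from calling `BinaryPredictor` on a data frame.

```r
# Mock stand-ins for BinaryPredictor objects; only documented fields are used.
# The reason string is a placeholder, not actual package output.
mock.list <- list(
  list(name = "carat", keep = TRUE,  reason = ""),
  list(name = "depth", keep = FALSE, reason = "placeholder reason")
)

# Keep only the predictors whose keep flag is TRUE, then collect their names.
kept <- Filter(function(bp) isTRUE(bp$keep), mock.list)
kept.names <- vapply(kept, function(bp) bp$name, character(1))
```

With the mock data above, `kept.names` contains only `"carat"`.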

Justin Hemann <support@causata.com>

PredictivePowerCv, BinaryCut, MergeLevels, Woe, ShortenStrings.

library(ggplot2)
data(diamonds)
# set a dependent variable that is TRUE when the price is above $5000
dv <- diamonds$price > 5000

# convert ordered to factor
diamonds$cut <- as.factor(as.character(diamonds$cut))
diamonds$color <- as.factor(as.character(diamonds$color))
diamonds$clarity <- as.factor(as.character(diamonds$clarity))

# evaluate diamond cut and carats, and generate a plot for each
bp.cut <- BinaryPredictor(diamonds$cut, dv)
plot(bp.cut)
bp.carat <- BinaryPredictor(diamonds$carat, dv)
plot(bp.carat)

# Evaluate all predictors and print a summary to the screen.
# Note that price does not have 100% predictive power
# because the discretization boundary is not $5000.
# A sample of 10k records and 3 folds of cross validation
# are used for greater speed.
set.seed(98765)
idx <- sample.int(nrow(diamonds), 10000)
bpList <- BinaryPredictor(diamonds[idx, ], dv[idx], folds=3)
df.summary <- print(bpList)