BinaryPredictor: Univariate analysis for binary classification.

Description

An independent variable is evaluated as a predictor for a binary dependent variable. The independent variable may be numeric, a factor, or a data frame containing numeric and factor columns.

Usage

## S3 method for class 'factor'
BinaryPredictor(iv, dv, min.power=0.01, min.robustness=0.5, 
  max.missing=0.99, max.levels=20, civ=NULL, copy.data=FALSE, name=NULL, ...)
  
## S3 method for class 'numeric'
BinaryPredictor(iv, dv, min.power=0.01, min.robustness=0.5, 
  max.missing=0.99, copy.data=FALSE, name=NULL, ...)
  
## S3 method for class 'data.frame'
BinaryPredictor(iv, dv, min.power=0.01, min.robustness=0.5, 
  max.missing=0.99, verbose=FALSE, copy.data=FALSE, ...)

## Default S3 method:
BinaryPredictor(iv, dv, ...)

## S3 method for class 'BinaryPredictor'
plot(x, y=NULL, type="bin", plot.missing=TRUE, ...)

## S3 method for class 'BinaryPredictorList'
print(x, file=NULL, silent=FALSE, ...)

Arguments

iv

The independent variable(s). May be a factor, numeric, or a data frame.

dv

The dependent variable, which is restricted to two unique values. The length of iv (or its number of rows, if iv is a data frame) must match the length of dv.

min.power

The minimum predictive power from PredictivePowerCv for a variable to be kept.

min.robustness

The minimum robustness from PredictivePowerCv for a variable to be kept.

max.missing

The maximum allowable fraction of missing values for a variable to be kept.

max.levels

For factors, this controls the merging of small bins using MergeLevels.

civ

When a continuous variable is discretized, the original continuous data can be provided in civ so that linearity can be computed. See Woe for more information.

copy.data

Reserved for future use. Indicates whether the data should be copied.

name

The variable name. If NULL it will be extracted from the deparsed input iv.

...

For the BinaryPredictor functions the extra arguments are passed to PredictivePowerCv. If iv is numeric then extra arguments are also passed to BinaryCut. For plot the extra arguments are passed to ShortenStrings, which is used to shorten the names of factor levels in plots.

verbose

If TRUE, calculation information is printed.

x

Output from one of the BinaryPredictor functions.

y

Unused argument for the generic plot function.

plot.missing

When plotting numeric variables, a value of TRUE adds a horizontal line representing the log odds associated with missing values.

type

Reserved for future use. Indicates the type of plot to be generated. The only valid value at present is "bin".

file

If a filename is provided then summary information will be written to a text file.

silent

If set to TRUE then nothing is printed to the screen.
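
The constraints on dv described above (two unique values, and a length matching iv) can be checked before calling BinaryPredictor. The following is a minimal base R sketch, not code from the package: check_binary_dv is a hypothetical helper, and ignoring NA values in dv when counting unique values is an assumption.

```r
# Hypothetical pre-flight check, not part of Causata.
check_binary_dv <- function(iv, dv) {
  # For a data frame compare rows; for a vector or factor compare length.
  n.iv <- if (is.data.frame(iv)) nrow(iv) else length(iv)
  stopifnot(n.iv == length(dv))
  # Assumption: NA values in dv are ignored when counting unique values.
  n.unique <- length(unique(dv[!is.na(dv)]))
  stopifnot(n.unique == 2)
  invisible(TRUE)
}

# A data frame with four rows and a logical dv of length four passes.
ok <- check_binary_dv(data.frame(a = 1:4, b = letters[1:4]),
                      c(TRUE, FALSE, TRUE, FALSE))

# A dv with three unique values is rejected.
bad <- tryCatch(check_binary_dv(1:3, c(1, 2, 3)),
                error = function(e) "rejected")
```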

Details

The BinaryPredictor family of functions is used to evaluate predictors of a binary outcome. Checks are executed for the variable class (only numeric, integer, and factor are allowed), missing values, predictive power, and robustness. If any check fails then the "keep" flag is set to FALSE; otherwise it is TRUE.
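
To illustrate how a sequence of checks leads to a "keep" flag and a reason, here is a minimal base R sketch covering only the class and missing-value checks. It is not the package's implementation: sketch_keep_check is a hypothetical name, and the order of the checks, the strict ">" comparison, and the reason strings are all assumptions. The predictive power and robustness checks are omitted because they depend on PredictivePowerCv.

```r
# Hypothetical helper, not part of Causata. Mirrors only the class and
# missing-value checks described above.
sketch_keep_check <- function(iv, max.missing = 0.99) {
  frac.missing <- mean(is.na(iv))
  keep <- TRUE
  reason <- ""
  if (!(is.numeric(iv) || is.factor(iv))) {
    # is.numeric() is TRUE for both double and integer vectors.
    keep <- FALSE
    reason <- "unsupported class"
  } else if (frac.missing > max.missing) {
    # Assumption: the threshold comparison is strict.
    keep <- FALSE
    reason <- "too many missing values"
  }
  list(keep = keep, reason = reason,
       missing = frac.missing, class = class(iv)[1])
}

res.numeric <- sketch_keep_check(c(1.5, NA, 3, 4))          # kept
res.char    <- sketch_keep_check(c("a", "b"))               # wrong class
res.sparse  <- sketch_keep_check(c(NA, NA, NA, 1),
                                 max.missing = 0.5)         # 75% missing
```

Recording only the first failed check matches the description of the reason item in the Value section.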

The plot function generates a summary plot of the predictor. Predictive power and robustness are printed in the plot title, along with the smallest and largest bin sizes used during discretization. For numeric variables a count of missing values is also printed.

The print function writes a table of variable summary information to the screen or to a file.

Value

If iv is a vector then an object of class BinaryPredictor is returned with the following items:

name

The variable name.

keep

A logical value indicating whether the variable meets the criteria for missing values, predictive power, etc.

reason

If keep=FALSE then this field contains a text string indicating the first criterion the variable failed to meet.

missing

The fraction of values that are missing / NA.

class

The variable class.

predictivePower

Results from PredictivePowerCv.

woe

Results from Woe.

If iv is a data frame then a list of BinaryPredictor objects is returned with class BinaryPredictorList.

The print.BinaryPredictorList function returns a data frame with columns for the values in the BinaryPredictor output. The values include the variable name, predictive power, robustness, etc.
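
The keep and reason items lend themselves to filtering a list of results. The sketch below uses hand-built stand-in lists with made-up values rather than real BinaryPredictor output, so that it runs without the package. It assumes the items are accessible with `$`, as the item list above suggests.

```r
# Stand-in objects with made-up values. Real objects would come from
# BinaryPredictor() and would also carry predictivePower and woe items.
mock.list <- list(
  list(name = "carat", keep = TRUE,  reason = "", missing = 0),
  list(name = "id",    keep = FALSE, reason = "illustrative failure text",
       missing = 0)
)

# Names of the variables that passed every check.
kept <- Filter(function(bp) isTRUE(bp$keep), mock.list)
kept.names <- vapply(kept, function(bp) bp$name, character(1))

# Reasons recorded for the variables that failed.
dropped <- Filter(function(bp) !isTRUE(bp$keep), mock.list)
dropped.reasons <- vapply(dropped, function(bp) bp$reason, character(1))
```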

Author(s)

Justin Hemann <support@causata.com>

See Also

PredictivePowerCv, BinaryCut, MergeLevels, Woe, ShortenStrings.

Examples

library(ggplot2)
data(diamonds)
# set a dependent variable that is TRUE when the price is above $5000
dv <- diamonds$price > 5000

# convert ordered to factor
diamonds$cut <- as.factor(as.character(diamonds$cut))
diamonds$color <- as.factor(as.character(diamonds$color))
diamonds$clarity <- as.factor(as.character(diamonds$clarity))

# evaluate diamond cut and carats, and generate a plot for each
bp.cut <- BinaryPredictor(diamonds$cut, dv)
plot(bp.cut)
bp.carat <- BinaryPredictor(diamonds$carat, dv)
plot(bp.carat)

# Evaluate all predictors and print a summary to the screen.
# Note that price does not have 100% predictive power
# since the discretization boundary is not $5000.
# A sample of 10k records and 3 folds of cross validation
# are used for greater speed.
set.seed(98765)
idx <- sample.int(nrow(diamonds), 10000)
bpList <- BinaryPredictor(diamonds[idx, ], dv[idx], folds=3)
df.summary <- print(bpList)

Causata documentation built on May 2, 2019, 3:26 a.m.