PredictivePower: Predictive power for a single variable. In Causata: Analysis utilities for binary classification and Causata users.

Description

This function computes predictive power for a single independent variable and a binary dependent variable.

Usage

    ## S3 method for class 'factor'
    PredictivePower(iv, dv, warn.levels=30, cv=NULL, debug=FALSE, ...)

    ## S3 method for class 'numeric'
    PredictivePower(iv, dv, warn.levels=30, cv=NULL, debug=FALSE, ...)

    PredictivePowerCv(iv, dv, warn.levels=30, debug=FALSE, folds=10, ...)

Arguments

iv: The independent variable.

dv: The dependent variable, which may have only two unique values.

warn.levels: If the number of levels in iv exceeds this value then a warning will be issued.

debug: If set to TRUE then debugging information is printed to the screen.

cv: If NULL then all data are used to compute the predictive power. If an index of boolean values is provided then it is used to separate the data into two parts for cross validation. See the Details below for more information.

...: Additional arguments are passed to BinaryCut.

folds: This argument is used to specify the folds used for cross validation. If a number between 2 and 10 is provided then data will be assigned to the selected number of folds at random. If a vector of values is provided then it will be used as an index to assign data to folds. The number of unique values must be between 2 and 10, and the vector length must match the length of iv.
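As a rough illustration of the vector form of folds, the sketch below builds a fold index in base R. The names n, k, and fold.id are hypothetical, and the balanced random assignment is an assumption; the package may assign folds differently when it is given a single number.

```r
# Hypothetical fold index for 1e5 observations and 5 folds.
set.seed(1234)
n <- 1e5
k <- 5

# Repeat 1..k out to length n, then shuffle so fold membership is random.
fold.id <- sample(rep(seq_len(k), length.out = n))

# Each fold holds n / k observations.
table(fold.id)

# With Causata loaded, the index could then be supplied as:
# PredictivePowerCv(iv, dv, folds = fold.id)
```

The index has one entry per observation and k unique values, which is what the folds argument requires of a vector.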

Details

Predictive power is defined as the area under the gains chart for the provided independent variable divided by the area under the gains chart for a perfect predictor. A random predictor would have a predictive power value of 0, and a perfect predictor would have a value of 1.
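The definition above can be sketched in base R. This is a minimal illustration, not the package's implementation: the function name gains_power_sketch is hypothetical, and it assumes that both areas are measured relative to the diagonal of the gains chart, which is what makes a random predictor score 0.

```r
# Hypothetical helper, not part of Causata: predictive power of a factor
# computed from a discretized gains chart, relative to the diagonal.
gains_power_sketch <- function(iv, dv) {
  keep <- !is.na(iv)                    # rows with missing iv are ignored
  iv <- droplevels(factor(iv[keep]))
  dv <- factor(dv[keep])
  stopifnot(nlevels(dv) == 2)
  pos <- dv == levels(dv)[2]            # second level is the "positive" class

  n.bin   <- as.numeric(table(iv))             # observations per level
  pos.bin <- as.numeric(tapply(pos, iv, sum))  # positives per level

  # Order levels by positive rate (the same ordering as ordering by odds).
  ord <- order(pos.bin / n.bin, decreasing = TRUE)

  # Gains chart: cumulative share of population vs. cumulative share of positives.
  x <- c(0, cumsum(n.bin[ord]) / sum(n.bin))
  y <- c(0, cumsum(pos.bin[ord]) / sum(pos.bin))

  # Trapezoid area under the chart, minus the 0.5 under the diagonal.
  area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

  # A perfect predictor's chart rises to 1 at x = p, so its area above
  # the diagonal is (1 - p) / 2, where p is the share of positives.
  p <- sum(pos.bin) / sum(n.bin)
  (area - 0.5) / ((1 - p) / 2)
}

# Two levels whose positive rates are 2/3 and 1/3: power is about 1/3.
gains_power_sketch(factor(c("a", "a", "a", "b", "b", "b")),
                   c(1, 1, 0, 0, 0, 1))
```

Under these assumptions the sketch gives 1 when the levels separate the classes completely and 0 when every level has the same positive rate.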

The power calculation is derived from a discretized gains chart. As such it only works with categorical variables. Numeric variables are discretized before power is computed. The PredictivePower.numeric function discretizes continuous data using the BinaryCut function. Note that the predictive power will depend, in part, on the discretization method.
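To show what discretizing a numeric variable involves, here is a small base R sketch using equal-frequency bins. It is a stand-in only: how BinaryCut chooses its bins is not described here, so the quantile-based breaks and the names x, nbins, and x.binned are assumptions for illustration.

```r
# Equal-frequency binning with base R, as a stand-in for BinaryCut.
set.seed(1234)
x <- rnorm(1000)
nbins <- 10

# Break points at evenly spaced quantiles of the data.
breaks <- quantile(x, probs = seq(0, 1, length.out = nbins + 1))

# include.lowest = TRUE keeps the minimum value inside the first bin.
x.binned <- cut(x, breaks = breaks, include.lowest = TRUE)

# The result is a factor, so it can be treated like any categorical variable.
table(x.binned)
```

A different choice of breaks (for example equal-width bins) would give a different factor and therefore a different power value, which is the dependence noted above.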

By default the second level of dv is used as the "positive" class during power calculations. This can be controlled by ordering the levels in a factor supplied as dv.
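The level ordering can be checked and changed with base R before the dependent variable is passed in. The variable names below are hypothetical.

```r
dv.raw <- c("yes", "no", "no", "yes", "no")

# Default factor levels are sorted, so "yes" is the second level.
dv.default <- factor(dv.raw)
levels(dv.default)[2]

# Listing the levels explicitly makes "no" the second level instead.
dv.flipped <- factor(dv.raw, levels = c("yes", "no"))
levels(dv.flipped)[2]
```

Whichever value ends up as the second level is the one that would be treated as the "positive" class.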

Missing values in iv are allowed in PredictivePower.factor – they are ignored during the calculations, as are the corresponding dependent variable values. The missing values can be used in the power calculations if the missing values are mapped to a non-missing level in the factor. See CleanNaFromFactor. Missing values are not allowed in dv.
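Mapping missing values to their own level can also be done in base R, as sketched below. This is an alternative shown for illustration, not the implementation of CleanNaFromFactor; the level name "missing" is an arbitrary choice.

```r
iv <- factor(c("a", "b", NA, "a", NA))

# addNA() turns NA into an explicit (NA-named) level of the factor.
iv.filled <- addNA(iv)

# Rename that level so the values are no longer treated as missing.
levels(iv.filled)[is.na(levels(iv.filled))] <- "missing"

table(iv.filled)
```

After this step the former NA rows belong to a regular level, so they would take part in the power calculation instead of being dropped.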

Cross validation is executed using the PredictivePowerCv function as a wrapper for the PredictivePower functions. When constructing the gains chart the bins are ordered by the odds for a "positive" within each bin. During cross validation the ordering is derived from one set of data, and the area under the curve is calculated with the other set.
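The split between ordering and evaluation can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the function name gains_power_holdout_sketch and the train argument are hypothetical, TRUE rows are assumed to supply the ordering, areas are measured relative to the diagonal, and negative results are not clamped.

```r
# Hypothetical helper: rank levels on one part of the data, then measure
# the gains chart on the other part using that ranking.
gains_power_holdout_sketch <- function(iv, dv, train) {
  iv <- factor(iv)
  dv <- factor(dv)
  pos <- dv == levels(dv)[2]            # second level is the "positive" class

  # Rank the levels by positive rate using the training rows only
  # (the same ranking as ranking by odds).
  rate.train <- tapply(pos[train], iv[train], mean)
  ord <- order(rate.train, decreasing = TRUE)

  # Counts on the held-out rows, arranged in the training ranking.
  n.test   <- as.numeric(table(iv[!train]))[ord]
  pos.test <- as.numeric(tapply(pos[!train], iv[!train], sum))[ord]
  pos.test[is.na(pos.test)] <- 0        # levels absent from the held-out rows

  x <- c(0, cumsum(n.test) / sum(n.test))
  y <- c(0, cumsum(pos.test) / sum(pos.test))
  area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
  p <- sum(pos.test) / sum(n.test)
  (area - 0.5) / ((1 - p) / 2)
}

iv <- rep(c("a", "b"), each = 4)
train <- rep(c(TRUE, FALSE), 4)

# Same pattern in both parts: the training ranking holds up on held-out rows.
gains_power_holdout_sketch(iv, rep(c(1, 0), each = 4), train)

# Pattern reversed in the held-out rows: the training ranking is wrong there.
gains_power_holdout_sketch(iv, c(1, 0, 1, 0, 0, 1, 0, 1), train)
```

A ranking that does not generalize lowers the held-out value, which is why cross-validated power for a noisy variable tends to fall below the value computed on all of the data.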

Value

The PredictivePower functions return a numeric value representing the predictive power, between 0 and 1.

PredictivePowerCv returns a list as follows:

predictive.power: An array of predictive power values, one for each fold of cross validation.

mean: The mean predictive power value.

sd: The standard deviation of predictive power values.

robustness: A measure of stability defined as 1-sd/mean. Values will be between zero (unstable) and 1 (stable).
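The summary values can be recomputed by hand from a vector of fold results. The sketch below uses the ten fold values shown in the example output on this page; clamping at zero is an assumption inferred from the stated range of robustness.

```r
# Fold values copied from the example output of PredictivePowerCv.
pp <- c(0.010157211, 0, 0, 0.008199210, 0, 0,
        0, 0.008762443, 0.016981207, 0.019199225)

pp.mean <- mean(pp)
pp.sd   <- sd(pp)     # sample standard deviation

# 1 - sd/mean is negative whenever sd exceeds the mean, so it is
# clamped at zero here to stay within the documented range.
robustness <- max(0, 1 - pp.sd / pp.mean)

c(mean = pp.mean, sd = pp.sd, robustness = robustness)
```

Here the standard deviation is larger than the mean, so the robustness is 0, in line with the example output for random data.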

Author(s)

Justin Hemann <support@causata.com>

References

Inspired by Miller, H. (2009) Predicting customer behaviour: The University of Melbourne's KDD Cup report.

Examples

    library(stringr)
    library(Causata)

    # Power is 1/3 where levels differ by 1/3, missing values in iv are ignored.
    PredictivePower(factor(c(str_split("a a a b b b", " ")[[1]], NA, NA)),
                    c(1, 1, 0, 0, 0, 1, 1, 1))

    # Power is 1.0 for perfect predictor
    PredictivePower(factor(c(str_split("a a a a a b b b b b", " "))[[1]]),
                    factor(c(str_split("1 1 1 1 1 0 0 0 0 0", " "))[[1]]))

    # Power is 0 for random predictor
    PredictivePower(factor(c(str_split("a a a a b b b b", " "))[[1]]),
                    factor(c(str_split("1 1 0 0 1 1 0 0", " "))[[1]]))

    # compute power for random data, power and robustness should be low
    set.seed(1234)
    fl <- as.factor(sample(letters, size=1e5, replace=TRUE))
    dv <- sample(c(0,1), size=1e5, replace=TRUE)
    PredictivePowerCv(fl, dv)

    # compute power for numeric data, send nbins argument to BinaryCut
    ivn <- rnorm(1e5)
    dvn <- rep(0, 1e5)
    dvn[(ivn + rnorm(1e5, sd=0.5)) > 0] <- 1
    PredictivePower(ivn, dvn, nbins=10)

Example output

[1] 0.3333333
[1] 1
[1] 0
$predictive.power
 [1] 0.010157211 0.000000000 0.000000000 0.008199210 0.000000000 0.000000000
 [7] 0.000000000 0.008762443 0.016981207 0.019199225

$mean
[1] 0.006329929

$sd
[1] 0.007479364

$robustness
[1] 0

[1] 0.8087834

Causata documentation built on May 2, 2019, 3:26 a.m.