Weight of evidence for each level of a factor.

Description

Computes the weight of evidence (log odds) of a binary dependent variable for each level of a factor.

Usage

## S3 method for class 'factor'
Woe(iv, dv, maxOdds=10000, civ=NULL, ...)

Arguments

iv

A factor, the independent variable. Missing values, if present, are replaced using CleanNaFromFactor.

dv

The dependent variable, which may have only two unique values. Missing values are not allowed.

maxOdds

When the odds for a level are greater than maxOdds or less than 1/maxOdds, they are replaced with the corresponding threshold value (maxOdds or 1/maxOdds).

civ

If iv is a discretized version of a continuous variable, then the original continuous variable can be provided in this argument so that linearity can be calculated. See the Value section below for more information.

...

Extra unused arguments.

Details

This function computes the log odds (aka weight of evidence) for each level in a factor as follows:

woe = \log \frac{nPositive}{nNegative}

where nPositive is the number of "positive" values in the dependent variable, and nNegative is the number of "negative" values.
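The formula can be illustrated with a short base R sketch. This is not the package's implementation: woe_sketch is a hypothetical helper, and treating the second level of dv as "positive" and clamping the odds with maxOdds are assumptions taken from the descriptions on this page.

```r
# Hypothetical helper, not part of the package: per-level log odds in base R.
woe_sketch <- function(iv, dv, maxOdds = 10000) {
  dv <- factor(dv)
  stopifnot(nlevels(dv) == 2, !anyNA(dv))
  counts <- table(iv, dv)        # rows: levels of iv, columns: levels of dv
  nNegative <- counts[, 1]       # first level of dv taken as "negative"
  nPositive <- counts[, 2]       # second level of dv taken as "positive"
  odds <- nPositive / nNegative  # can be 0 or Inf for sparse levels
  # Clamp to [1/maxOdds, maxOdds]; a level with no data stays NaN.
  odds <- pmin(pmax(odds, 1 / maxOdds), maxOdds)
  list(odds = odds, woe = log(odds))
}

iv <- factor(c("a", "a", "a", "b", "b", "b"))
dv <- c(1, 1, 0, 0, 0, 1)
res <- woe_sketch(iv, dv)
res$odds  # a: two positives to one negative, b: one positive to two negatives
res$woe
```

Clamping before taking the logarithm keeps the result finite when a level contains only positive or only negative cases.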

By default the second level of dv is used as the "positive" class when the odds are calculated. This can be controlled by ordering the levels of a factor supplied as dv.

Other metrics returned include the information value and the log density ratio.
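This page does not give formulas for these two metrics. The sketch below uses the conventional definitions, which may differ from what the package computes: the log density ratio of a level is the log of its share of all positives over its share of all negatives, and the information value of a level is the difference of those shares multiplied by the log density ratio. density_sketch is a hypothetical helper.

```r
# Hypothetical helper using the conventional definitions (an assumption,
# not taken from the package source).
density_sketch <- function(iv, dv) {
  counts <- table(iv, factor(dv))
  pNeg <- counts[, 1] / sum(counts[, 1])  # share of all negatives in each level
  pPos <- counts[, 2] / sum(counts[, 2])  # share of all positives in each level
  ldr <- log(pPos / pNeg)
  list(log.density.ratio = ldr,
       information.value = (pPos - pNeg) * ldr)
}

iv <- factor(c("a", "a", "a", "b", "b", "b"))
dv <- c(1, 1, 0, 0, 0, 1)
m <- density_sketch(iv, dv)
m$log.density.ratio
sum(m$information.value)  # total information value of the factor
```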

Value

A list with the following elements:

woe.levels

A vector of WOE values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

woe

A vector of WOE values with the same length as iv. Essentially each factor value is replaced with the associated log odds.

odds

A vector of odds values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

bin.count

A count of data points in each level of the factor iv.

true.count

A count of "true" dependent variable values in each level of the factor iv. The number of "false" values is bin.count - true.count.

log.density.ratio

A vector of log density ratio values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

information.value

A vector of information values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

linearity

A measure of correlation between the log-odds of the dependent variable and the binned values of the continuous independent variable civ. This is calculated if the civ argument was provided, otherwise it's NA.
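The exact statistic behind linearity is not stated here. One plausible reading, sketched below as an assumption only, is the Pearson correlation between the per-bin log odds and the mean of civ within each bin. The data are a toy example, and base R cut stands in for the package's own binning.

```r
# Toy data: four equal-width bins with a rising share of positives.
civ <- 1:40
iv <- cut(civ, breaks = c(0, 10, 20, 30, 40))
dv <- c(rep(c(1, 0), c(2, 8)), rep(c(1, 0), c(4, 6)),
        rep(c(1, 0), c(6, 4)), rep(c(1, 0), c(8, 2)))

counts <- table(iv, dv)
woe_levels <- log(counts[, "1"] / counts[, "0"])  # log odds per bin
bin_means <- as.numeric(tapply(civ, iv, mean))    # mean of civ per bin

# Assumed definition: correlation of bin means with per-bin log odds.
linearity <- cor(bin_means, woe_levels)
linearity
```

For this monotone toy data the value is close to 1. A value near 0 would suggest that the log odds do not vary linearly with the continuous variable.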

Author(s)

Justin Hemann <support@causata.com>

See Also

CleanNaFromFactor.

Examples

library(Causata)  # Woe, BinaryCut and df.causata come from the Causata package
library(stringr)

# create a factor with three levels
# - odds of 1 for a:  2:1 = 2.0
# - odds of 1 for b:  1:2 = 0.5
# - odds of 1 for NA: 1:1 = 1.0
f1  <- factor(c(str_split("a a a b b b", " ")[[1]], NA,NA))
dv1 <- c(                  1,1,0,0,0,1,              1, 0 )
fw1 <- Woe(f1,dv1)
fw1$odds

# discretize a continuous variable into a factor with 10 levels and compute WOE
data(df.causata)
dv <- df.causata$has.responded.mobile.logoff_next.hour_466
f2 <- BinaryCut(df.causata$online.average.authentications.per.month_all.past_406, dv)
fw2 <- Woe(f2, dv, civ=df.causata$online.average.authentications.per.month_all.past_406)
fw2$odds
fw2$linearity
