Information: Data exploration with information theory (weight-of-evidence...

Description

The Information package performs exploratory data analysis and variable screening for binary classification models using information theory (WOE and IV).

The package also supports exploratory analysis and variable screening for uplift models (NWOE and NIV).

Note that the only functions you will need to use are create_infotables() and plot_infotables():

- create_infotables() creates WOE or NWOE tables and outputs a variable-strength summary data.frame (IV or NIV)

- plot_infotables() creates WOE or NWOE bar charts for one or more variables

Details

Given a data.frame with a set of predictive variables and a binary response variable, create_infotables() will cycle through all variables and create NWOE or WOE tables. It will also rank all variables by their respective IV or NIV values and return the results in a data.frame.
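As a rough illustration of the quantities behind these tables, the sketch below computes WOE and IV by hand in base R, using the standard definitions (WOE per bin is the log ratio of the event share to the non-event share, and IV is the sum of the share differences times WOE). The bin labels and counts are made up for illustration, and this is not the package's internal code, which also chooses the bins for you.

```r
# Hypothetical counts per bin for one predictor (illustrative numbers only).
bins      <- c("[0,1]", "[2,3]", "[4,6]", "[7,20]")
events    <- c(10, 20, 30, 40)   # rows with response == 1 in each bin
nonevents <- c(40, 30, 20, 10)   # rows with response == 0 in each bin

# Share of all events / non-events that falls into each bin.
p_event    <- events / sum(events)
p_nonevent <- nonevents / sum(nonevents)

# Weight of evidence per bin and its contribution to the information value.
woe        <- log(p_event / p_nonevent)
iv_contrib <- (p_event - p_nonevent) * woe
iv         <- sum(iv_contrib)

# Layout loosely modeled on a WOE table: the IV column is cumulative,
# so its last row equals the variable's total IV.
woe_table <- data.frame(
  Bin     = bins,
  N       = events + nonevents,
  Percent = (events + nonevents) / sum(events + nonevents),
  WOE     = woe,
  IV      = cumsum(iv_contrib)
)
print(woe_table, row.names = FALSE)
```

Ranking variables by `iv` computed this way is the idea behind the variable-strength summary; in practice create_infotables() does all of this for every variable in one call.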

The package needs minimal inputs. You do not have to explicitly specify which variables to evaluate or provide bins: create_infotables() will process all variables in the dataset and generate appropriate bins for WOE/NWOE analysis.

If requested, calculations can be distributed across multiple cores for better performance.

Note that NWOE analysis is only for uplift models. Thus, for NWOE analysis, you must have a "treatment" and a "control" group in your dataset. The treatment and control groups should be identified by a binary indicator variable (1/0).
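To make the treatment/control requirement concrete, here is a minimal base R sketch that treats the net weight of evidence as the WOE in the treatment group minus the WOE in the control group, bin by bin. The counts and the helper `woe_from_counts()` are hypothetical and exist only for this illustration; the sketch does not compute NIV and is not the package's own implementation.

```r
# Hypothetical helper: WOE per bin from event and non-event counts.
woe_from_counts <- function(events, nonevents) {
  log((events / sum(events)) / (nonevents / sum(nonevents)))
}

# Made-up counts per bin, split by the binary treatment indicator.
trt_events    <- c(10, 20, 30, 40)   # treatment == 1, response == 1
trt_nonevents <- c(40, 30, 20, 10)   # treatment == 1, response == 0
ctl_events    <- c(20, 25, 25, 30)   # treatment == 0, response == 1
ctl_nonevents <- c(30, 25, 25, 20)   # treatment == 0, response == 0

woe_trt <- woe_from_counts(trt_events, trt_nonevents)
woe_ctl <- woe_from_counts(ctl_events, ctl_nonevents)

# Net WOE: how much stronger (or weaker) the bin's evidence is under treatment.
nwoe <- woe_trt - woe_ctl
print(data.frame(Bin = 1:4, WOE_trt = woe_trt, WOE_ctl = woe_ctl, NWOE = nwoe),
      row.names = FALSE)
```

Without the indicator there is no control group to subtract, which is why regular WOE analysis needs only the response variable.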

For regular WOE analysis, on the other hand, all you need is a binary response variable (dependent variable).

You can cross-validate your IV or NIV values by supplying a validation dataset. This will produce penalized IV/NIV values.
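A hedged sketch of what supplying a validation dataset might look like is shown below. It assumes the package is installed, that it ships a `valid` dataset alongside `train`, and that create_infotables() accepts it through a `valid` argument; the names `train_trt`, `valid_trt` and `IV_cv` are chosen here for illustration. The call is guarded so the snippet does nothing when the package is unavailable.

```r
# Only run when the Information package is installed.
if (requireNamespace("Information", quietly = TRUE)) {
  data(train, package = "Information")
  data(valid, package = "Information")   # assumed companion validation dataset

  # Regular WOE analysis: keep a single group, as in the examples on this page.
  train_trt <- subset(train, TREATMENT == 1)
  valid_trt <- subset(valid, TREATMENT == 1)

  # Passing valid= requests cross-validated (penalized) IV values.
  IV_cv <- Information::create_infotables(data = train_trt, valid = valid_trt,
                                          y = "PURCHASE", parallel = FALSE)
  print(head(IV_cv$Summary), row.names = FALSE)
}
```

Inspect the columns of the resulting summary on your own installation to see how the penalized values are reported.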

To learn more about the Information package, start with the vignette: browseVignettes(package = "Information")

Author(s)

Kim Larsen (kblarsen4 at gmail.com)

Examples

##------------------------------------------------------------
## WOE analysis, no validation
##------------------------------------------------------------
library(Information)

data(train, package="Information")
train <- subset(train, TREATMENT==1)
IV <- Information::create_infotables(data=train, y="PURCHASE", parallel=FALSE)

print(head(IV$Summary), row.names=FALSE)
print(IV$Tables$N_OPEN_REV_ACTS, row.names=FALSE)

# Plotting a single variable
Information::plot_infotables(IV, "N_OPEN_REV_ACTS")

# Plotting multiple variables
Information::plot_infotables(IV, IV$Summary$Variable[1:4], same_scale=TRUE)

# If the goal is to plot multiple variables individually, as opposed to a comparison-grid, we can
# loop through the variable names and create individual plots
## Not run: 
# Avoid masking base::names() and use seq_along() so an empty list is handled safely
var_names <- names(IV$Tables)
plots <- list()
for (i in seq_along(var_names)){
  plots[[i]] <- plot_infotables(IV, var_names[i])
}
# Showing the top 18 variables
plots[1:18]

## End(Not run)

# create_infotables() can run faster with parallel=TRUE, which is the default setting
# If ncore is left at its default, it is set to the number of available cores minus one
## Not run: 
train <- subset(train, TREATMENT==1)
IV <- Information::create_infotables(data=train, y="PURCHASE")

## End(Not run)
closeAllConnections()

Example output

[1] "Variable TREATMENT was removed because it has only 1 unique level"
                    Variable        IV
             N_OPEN_REV_ACTS 1.0107695
        TOT_HI_CRDT_CRDT_LMT 0.9345902
        RATIO_BAL_TO_HI_CRDT 0.8232539
 D_NA_M_SNC_MST_RCNT_ACT_OPN 0.6355466
  M_SNC_OLDST_RETAIL_ACT_OPN 0.5573438
      M_SNC_MST_RCNT_ACT_OPN 0.5026402
 N_OPEN_REV_ACTS    N    Percent        WOE        IV
           [0,0] 1469 0.29545455 -2.0465968 0.6401443
           [1,2]  958 0.19267900 -0.5900120 0.6958705
           [3,3]  310 0.06234916  0.2033085 0.6986029
           [4,5]  583 0.11725664  0.4419768 0.7244762
           [6,8]  632 0.12711183  0.6148243 0.7810611
          [9,11]  453 0.09111022  0.8815772 0.8692672
         [12,48]  567 0.11403862  0.9883818 1.0107695
[[1]] to [[18]]: one WOE bar chart per variable (plots not reproduced here)

[1] "Variable TREATMENT was removed because it has only 1 unique level"

Information documentation built on May 2, 2019, 7:15 a.m.