pct.bin: Monotonic binning based on percentiles

View source: R/01_PCT_BINNING.R

pct.bin    R Documentation

Monotonic binning based on percentiles

Description

pct.bin implements percentile-based monotonic binning through iterative discretization.

Usage

pct.bin(
  x,
  y,
  sc = c(NA, NaN, Inf, -Inf),
  sc.method = "together",
  g = 15,
  y.type = NA,
  woe.trend = TRUE,
  force.trend = NA
)

Arguments

x

Numeric vector to be binned.

y

Numeric target vector (binary or continuous).

sc

Numeric vector of special case values. The default is c(NA, NaN, Inf, -Inf). It is recommended to always keep the default values and to add new ones as needed. If any of these values occur in x but are not listed in sc, some statistics cannot be calculated properly.
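As a minimal sketch of why the sc values matter, the base R snippet below separates special cases from complete cases before any statistic is computed. It is an illustration only, not the package's internal code; the vector x and the sentinel value 9999999999 are made up for the example.

```r
# Illustrative data: a numeric vector containing several special values
x  <- c(12, NA, 35, Inf, 9999999999, 48, NaN, 27)
sc <- c(NA, NaN, Inf, -Inf, 9999999999)

# %in% matches NA and NaN explicitly, whereas == would return NA for them
is.special <- x %in% sc

x.special  <- x[is.special]   # values that would go to the special case bin(s)
x.complete <- x[!is.special]  # values that remain for percentile binning

# Statistics are only meaningful on the complete cases
mean(x)           # not usable: the special values contaminate the result
mean(x.complete)
```

If 9999999999 were left out of sc here, it would be treated as an ordinary value and would distort the percentiles and bin averages.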

sc.method

Defines how special cases are treated: all together in one bin or in separate bins. Possible values are "together" and "separately". Default is "together".

g

Number of starting groups. Default is 15.
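To illustrate what a set of starting groups based on percentiles can look like, the sketch below cuts a vector into g groups at equally spaced quantiles using base R. The exact cut-point rule used inside pct.bin may differ; the use of quantile() with its default type and of cut() is an assumption made for this example.

```r
# Illustrative data and a small number of starting groups
x <- 1:30
g <- 5

# Equally spaced percentiles; unique() collapses duplicated cut points,
# so heavily tied data can yield fewer than g groups
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = g + 1)))

# include.lowest = TRUE keeps the minimum of x inside the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

table(bins)
```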

y.type

Type of y. Possible options are "bina" (binary) and "cont" (continuous). If the default value (NA) is passed, the algorithm identifies whether y is a 0/1 or a continuous variable.
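One plausible way to detect the target type automatically is to check whether all non-missing values of y are 0 or 1. The helper guess.y.type below is hypothetical and only sketches that idea; it is not a monobin function, and the package's actual check may differ.

```r
# Hypothetical helper (not part of monobin): infer the target type from y
guess.y.type <- function(y) {
  y.cc <- y[!is.na(y)]                      # ignore missing values
  if (all(y.cc %in% c(0, 1))) "bina" else "cont"
}

guess.y.type(c(0, 1, 1, 0, NA))
guess.y.type(c(0.2, 1.5, 3))
```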

woe.trend

Applies only to a continuous target (y): whether to perform the weights of evidence (WoE) trend check. Default is TRUE.

force.trend

Whether the expected trend should be forced. Possible values: "i" for an increasing trend (y increases as x increases) and "d" for a decreasing trend (y decreases as x increases). Default is NA. With the default, the algorithm stops once a perfect negative or positive Spearman correlation is achieved between the average y and the average x per bin. Otherwise, it stops only when the forced trend is achieved.
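The stopping rule described above can be sketched in base R as follows. The function sketch.pct.bin is hypothetical and deliberately simplified: it assumes x contains no special cases, reduces the number of groups by one per iteration, and ignores the WoE check. It is meant to show the idea of iterating until the Spearman correlation of the bin averages is perfect, not to reproduce the package's implementation.

```r
# Hypothetical, simplified sketch of percentile-based monotonic binning
sketch.pct.bin <- function(x, y, g = 15, force.trend = NA) {
  for (k in g:2) {
    # Percentile cut points for k groups; duplicates are collapsed
    breaks <- unique(quantile(x, probs = seq(0, 1, length.out = k + 1)))
    if (length(breaks) < 3) next             # fewer than two bins
    bins <- droplevels(cut(x, breaks = breaks, include.lowest = TRUE))
    if (nlevels(bins) < 2) next

    # Average x and average y per bin
    x.avg <- as.vector(tapply(x, bins, mean))
    y.avg <- as.vector(tapply(y, bins, mean))
    rho   <- suppressWarnings(cor(x.avg, y.avg, method = "spearman"))

    # Stop on a perfect trend, or only on the forced direction if requested
    hit <- if (is.na(force.trend)) {
      isTRUE(all.equal(abs(rho), 1))
    } else if (force.trend == "i") {
      isTRUE(all.equal(rho, 1))
    } else {
      isTRUE(all.equal(rho, -1))
    }
    if (hit) return(list(bins = bins, rho = rho))
  }
  NULL  # no grouping satisfied the stopping rule
}

# Illustrative data: y increases with x
x <- 1:60
y <- sqrt(x)
res <- sketch.pct.bin(x, y)
nlevels(res$bins)
res$rho
```

In this sketch, forcing a trend that the data do not support (for example force.trend = "d" on the data above) returns NULL, because no number of groups produces a perfectly decreasing sequence of bin averages.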

Value

pct.bin returns a list of two objects. The first, the data frame summary.tbl, is a summary table of the final binning; the second, x.trans, is a vector of discretized values. If x or y has only a single unique value among the complete cases (cases other than special cases), a data frame with info is returned instead.

Examples

suppressMessages(library(monobin))
data(gcd)
#binary target
mat.bin <- pct.bin(x = gcd$maturity, y = gcd$qual)
mat.bin[[1]]
table(mat.bin[[2]])
#binary target, forced decreasing trend, separate groups for special cases
set.seed(123)
gcd$age.d <- gcd$age
gcd$age.d[sample(1:nrow(gcd), 10)] <- NA
gcd$age.d[sample(1:nrow(gcd), 3)] <- 9999999999
age.d.bin <- pct.bin(x = gcd$age.d,
                     y = gcd$qual,
                     sc = c(NA, NaN, Inf, -Inf, 9999999999),
                     sc.method = "separately",
                     force.trend = "d")
age.d.bin[[1]]
gcd$age.d.bin <- age.d.bin[[2]]
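#group_by() and summarise() come from dplyr, which must be attached
suppressMessages(library(dplyr))
gcd %>% group_by(age.d.bin) %>% summarise(n = n(), y.avg = mean(qual))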


monobin documentation built on July 21, 2022, 5:11 p.m.