smote: Synthetic Minority Oversampling Technique (SMOTE)
In PDtoolkit: Collection of Tools for PD Rating Model Development and Validation

smote

R Documentation

Synthetic Minority Oversampling Technique (SMOTE)

Description

smote performs a type of data augmentation for the selected (usually minority) class. To process continuous and categorical risk factors simultaneously, the Heterogeneous Euclidean-Overlap Metric (HEOM) is used in the nearest neighbors algorithm.
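
As an illustration of how HEOM can combine both kinds of risk factors, here is a minimal base R sketch. It follows the usual textbook definition (range-normalized absolute difference for numeric variables, 0/1 overlap for categorical ones, distance 1 when a value is missing) and is not taken from the PDtoolkit source; the helper `heom_dist` and the `toy` data set are hypothetical names introduced only for this example.

```r
# Hypothetical helper: HEOM distance from row i to every row of df.
heom_dist <- function(df, i) {
  d2 <- numeric(nrow(df))
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.numeric(x)) {
      # numeric risk factor: absolute difference scaled by the observed range
      rng <- diff(range(x, na.rm = TRUE))
      d <- if (rng > 0) abs(x - x[i]) / rng else rep(0, length(x))
    } else {
      # categorical risk factor: overlap metric (0 if equal, 1 otherwise)
      d <- as.numeric(as.character(x) != as.character(x[i]))
    }
    # missing value on either side counts as maximal distance
    d[is.na(x) | is.na(x[i])] <- 1
    d2 <- d2 + d^2
  }
  sqrt(d2)
}

# Toy data with two numeric and one categorical risk factor
toy <- data.frame(age     = c(20, 30, 40, 60),
                  amount  = c(1000, 1500, 5000, 2000),
                  purpose = c("car", "car", "house", NA),
                  stringsAsFactors = FALSE)

d  <- heom_dist(toy, i = 1)
k  <- 2
# indices of the k nearest neighbors of row 1, excluding row 1 itself
nn <- head(setdiff(order(d), 1), k)
nn
```

Because every per-variable distance lies in [0, 1], no single risk factor dominates the neighbor search just because of its scale.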

Usage

smote(
  db,
  target,
  minority.class,
  osr,
  ordinal.rf = NULL,
  num.rf.const = NULL,
  k = 5,
  seed = 81000
)

Arguments

`db`	Data set of risk factors and target variable.
`target`	Name of target variable within `db` argument.
`minority.class`	Value of the minority class. It can be a numeric or character value, but it has to exist in the target variable.
`osr`	Oversampling rate. It has to be a numeric value greater than 0 (for example, 0.2 for 20% oversampling).
`ordinal.rf`	Character vector of ordinal risk factors. Default value is `NULL`.
`num.rf.const`	Data frame with constraints for numeric risk factors. It has to contain the following columns: `rf` (numeric risk factor names from `db`), `lower` (lower bound of numeric risk factor), `upper` (upper bound of numeric risk factor), `type` (type of numeric risk factor - `"numeric"` or `"integer"`). Constraints are used to correct the synthetic data for the selected numeric risk factors. Default value is `NULL`, which means that no corrections are applied.
`k`	Number of nearest neighbors. Default value is 5.
`seed`	Random seed used to ensure reproducibility of the results. Default is 81000.
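
To make the role of `num.rf.const` more concrete, the sketch below shows one plausible way such constraints could be applied to synthetic values: round when the type is `"integer"`, then clip to the `[lower, upper]` interval. This is an assumption about the kind of correction involved, not the package's actual code; `apply_constraints`, `const` and `synthetic` are hypothetical names, and the synthetic values are made up for illustration.

```r
# Hypothetical helper: correct synthetic values of one numeric risk factor.
apply_constraints <- function(x, lower, upper, type = c("numeric", "integer")) {
  type <- match.arg(type)
  if (type == "integer") x <- round(x)  # assumed handling of integer factors
  pmin(pmax(x, lower), upper)           # clip to the allowed interval
}

# Constraints in the same layout as the num.rf.const argument
const <- data.frame(rf    = c("Credit Amount", "Age (years)"),
                    lower = c(250, 19),
                    upper = c(20000, 75),
                    type  = c("numeric", "integer"),
                    stringsAsFactors = FALSE)

# Made-up synthetic values, some of them outside the bounds
synthetic <- data.frame("Credit Amount" = c(120.5, 3100.7, 25000),
                        "Age (years)"   = c(17.6, 42.4, 80.2),
                        check.names = FALSE)

for (j in seq_len(nrow(const))) {
  nm <- const$rf[j]
  synthetic[[nm]] <- apply_constraints(synthetic[[nm]],
                                       lower = const$lower[j],
                                       upper = const$upper[j],
                                       type  = const$type[j])
}
synthetic
```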

Value

The command smote returns a data frame with added synthetic observations for the selected minority class. The data frame contains all variables from the db data frame plus an additional variable (smote) that serves as an indicator for distinguishing between original and synthetic observations.
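
For orientation, in classical SMOTE a synthetic observation is generated by picking one of the k nearest minority neighbors and moving a random fraction of the way toward it. The base R sketch below covers numeric values only and is not the PDtoolkit implementation, which also has to handle categorical and ordinal risk factors; `smote_numeric`, `x` and `neighbors` are hypothetical names used for illustration.

```r
set.seed(81000)

# Hypothetical helper: classical SMOTE interpolation for numeric features.
# x         - numeric vector, one minority observation
# neighbors - numeric matrix, one row per nearest neighbor of x
# n_new     - number of synthetic observations to generate
smote_numeric <- function(x, neighbors, n_new) {
  out <- matrix(NA_real_, nrow = n_new, ncol = length(x))
  for (r in seq_len(n_new)) {
    nb <- neighbors[sample.int(nrow(neighbors), 1), ]  # random neighbor
    u  <- runif(1)                                     # random fraction in [0, 1]
    out[r, ] <- x + u * (nb - x)                       # point on the segment x -> nb
  }
  out
}

x         <- c(30, 1500)
neighbors <- rbind(c(20, 1000), c(40, 5000))
syn       <- smote_numeric(x, neighbors, n_new = 5)
syn
```

Each synthetic row lies on the line segment between the original observation and one of its neighbors, so it stays within the region already occupied by the minority class.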

Examples

suppressMessages(library(PDtoolkit))
data(loans)
#check numeric variables (note that one of the variables is the target, not a risk factor)
names(loans)[sapply(loans, is.numeric)]
#define constraints of numeric risk factors
num.rf.const <- data.frame(rf = c("Duration of Credit (month)", "Credit Amount", "Age (years)"),
                           lower = c(4, 250, 19),
                           upper = c(72, 20000, 75),
                           type = c("integer", "numeric", "integer"))
num.rf.const

#loans$"Account Balance"[990:1000] <- NA
#loans$"Credit Amount"[900:920] <- NA

loans.s <- smote(db = loans,
                 target = "Creditability",
                 minority.class = 1,
                 osr = 0.05,
                 ordinal.rf = NULL,
                 num.rf.const = num.rf.const,
                 k = 5,
                 seed = 81000)
str(loans.s)
table(loans.s$Creditability, loans.s$smote)
#select minority class
loans.mc <- loans.s[loans.s$Creditability %in% 1, ]
head(loans.mc)

PDtoolkit documentation built on Sept. 20, 2023, 9:06 a.m.