sampling_target: Extract the data to fit the model

View source: R/sampling.R

sampling_targetR Documentation

Extract the data to fit the model

Description

To solve the imbalanced class, perform sampling in the train set of split_df.

Usage

sampling_target(
  .data,
  method = c("ubUnder", "ubOver", "ubSMOTE"),
  seed = NULL,
  perc = 50,
  k = ifelse(method == "ubSMOTE", 5, 0),
  perc.over = 200,
  perc.under = 200
)

Arguments

.data

an object of class "split_df", usually, a result of a call to split_df().

method

character. sampling methods. "ubUnder" is under-sampling, and "ubOver" is over-sampling, "ubSMOTE" is SMOTE(Synthetic Minority Over-sampling TEchnique).

seed

integer. random seed used for sampling

perc

integer. The percentage of positive class in the final dataset. It is used only in under-sampling. The default is 50. perc can not exceed 50.

k

integer. It is used only in over-sampling and SMOTE. If over-sampling and if K=0: sample with replacement from the minority class until we have the same number of instances in each class. under-sampling and if K>0: sample with replacement from the minority class until we have k-times the original number of minority instances. If SMOTE, the number of neighbours to consider as the pool from where the new examples are generated

perc.over

integer. It is used only in SMOTE. per.over/100 is the number of new instances generated for each rare instance. If perc.over < 100 a single instance is generated.

perc.under

integer. It is used only in SMOTE. perc.under/100 is the number of "normal" (majority class) instances that are randomly selected for each smoted observation.

Details

In order to solve the problem of imbalanced class, sampling is performed by under sampling, over sampling, SMOTE method.

Value

An object of train_df.

attributes of train_df class

The attributes of the train_df class are as follows.:

  • sample_seed : integer. random seed used for sampling

  • method : character. sampling methods.

  • perc : integer. perc argument value

  • k : integer. k argument value

  • perc.over : integer. perc.over argument value

  • perc.under : integer. perc.under argument value

  • binary : logical. whether the target variable is a binary class

  • target : character. target variable name

  • minority : character. the level of the minority class

  • majority : character. the level of the majority class

Examples

library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

# under-sampling with random seed
under <- sb %>%
  sampling_target(seed = 1234L)

under %>%
  count(default)

# under-sampling with random seed, and minority class frequency is 40%
under40 <- sb %>%
  sampling_target(seed = 1234L, perc = 40)

under40 %>%
  count(default)

# over-sampling with random seed
over <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L)

over %>%
  count(default)

# over-sampling with random seed, and k = 10
over10 <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L, k = 10)

over10 %>%
  count(default)

# SMOTE with random seed
smote <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L)

smote %>%
  count(default)

# SMOTE with random seed, and perc.under = 250
smote250 <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)

smote250 %>%
  count(default)


alookr documentation built on May 29, 2024, 10:38 a.m.