ImbC: Synthetic Imbalanced Data Set for a Multi-class Task

Description

Synthetic imbalanced data set for a multi-class task. The data set has a numeric feature ("X1"), a nominal feature ("X2") and a target class named "Class". The three classes of the problem ("normal", "rare1" and "rare2") are assigned according to the rules described below. These rules depend on the two features ("X1" and "X2").

Usage

data(ImbC)

Format

The data set has one continuous feature (X1) and one nominal feature (X2). The target class (denoted as Class) has three possible values ("normal", "rare1" and "rare2"). Classes "rare1" and "rare2" are the minority classes: examples of class "rare1" make up 1% of the data and those of class "rare2" make up 13.1%. The remaining class, "normal", is the majority class and accounts for 85.9% of the data. Data set ImbC has 1000 examples, distributed across classes "rare1", "rare2" and "normal" with 10, 131 and 859 examples respectively.
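
As a quick arithmetic check of the percentages quoted above, the following R sketch recomputes them from the stated class counts. It does not load the data set itself; the names `counts` and `pct` are illustrative only.

```r
# Class counts as stated in the Format section
counts <- c(rare1 = 10, rare2 = 131, normal = 859)

# The three counts should add up to the 1000 examples of ImbC
stopifnot(sum(counts) == 1000)

# Share of each class, in percent, rounded to one decimal place
pct <- round(100 * counts / sum(counts), 1)
print(pct)
```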

ImbC data has been simulated as follows:

-

X1 \sim \mathbf{N}\left(0, 4\right)

-

The X2 labels "cat", "fish" and "dog" were assigned at random, subject to the restriction of having frequencies of 30%, 30% and 40% respectively.

-

To obtain the target variable Class, we define the following sets:

  • S_1=\{(X1, X2) : X1 > 9 \wedge (X2 \in \{"cat", "dog"\})\}

  • S_2=\{(X1, X2) : X1 > 7 \wedge X2 = "fish" \}

  • S_3=\{(X1, X2) : -1 < X1 < 0.5\}

  • S_4=\{(X1, X2) : X1 < -7 \wedge X2 = "fish"\}

-

The following conditions define the target variable distribution of the ImbC synthetic data set:

  • Assign class label "rare1" to: a random sample of 90% of set S_1 and a random sample of 40% of set S_2

  • Assign class label "rare2" to: a random sample of 80% of set S_3 and a random sample of 70% of set S_4

  • Assign class label "normal" to the remaining examples.
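
The generation rules above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the authors' original code: the second parameter of the normal distribution is read here as a standard deviation of 4 (the text does not say whether it is a standard deviation or a variance), the seed is arbitrary, sampled fractions are rounded to whole examples, and the helper `pick()` and the object `sim` are hypothetical names introduced for this example. Because the draw is random, the resulting class counts will generally differ from those of the shipped ImbC data.

```r
set.seed(123)  # arbitrary seed, for reproducibility of this sketch only
n <- 1000

# X1 ~ N(0, 4). ASSUMPTION: 4 is taken as the standard deviation.
X1 <- rnorm(n, mean = 0, sd = 4)

# X2 with exact frequencies of 30% / 30% / 40%, shuffled into random order
X2 <- sample(rep(c("cat", "fish", "dog"), times = round(n * c(0.3, 0.3, 0.4))))

# The sets S_1 to S_4 as logical masks over the n examples
S1 <- X1 > 9 & X2 %in% c("cat", "dog")
S2 <- X1 > 7 & X2 == "fish"
S3 <- X1 > -1 & X1 < 0.5
S4 <- X1 < -7 & X2 == "fish"

# Hypothetical helper: indices of a random fraction `frac` of the rows in `mask`
pick <- function(mask, frac) {
  idx <- which(mask)
  idx[sample.int(length(idx), size = round(frac * length(idx)))]
}

# Everything starts as "normal"; the sampled subsets are then relabelled
Class <- rep("normal", n)
Class[c(pick(S1, 0.9), pick(S2, 0.4))] <- "rare1"
Class[c(pick(S3, 0.8), pick(S4, 0.7))] <- "rare2"

sim <- data.frame(
  X1 = X1,
  X2 = factor(X2),
  Class = factor(Class, levels = c("normal", "rare1", "rare2"))
)
print(table(sim$Class))
```

Since the four sets are pairwise disjoint, the order of the two relabelling steps does not matter, and examples inside a set that are not sampled keep the label "normal".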

Author(s)

Paula Branco paobranco@gmail.com, Rita Ribeiro rpribeiro@dcc.fc.up.pt and Luis Torgo ltorgo@dcc.fc.up.pt

Examples

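library(UBL)      # provides the ImbC data set
library(ggplot2)
data(ImbC)
summary(ImbC)     # feature summaries and class counts
# Jittered plot of X1 against the X2 levels, colored by class
ggplot(data = ImbC, aes(x = X2, y = X1, color = Class)) + geom_jitter()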

paobranco/UBL documentation built on May 6, 2021, 6:57 p.m.