ImbC: Synthetic Imbalanced Data Set for a Multi-class Task


Description

Synthetic imbalanced data set for a multi-class task. The data set has a numeric feature ("X1"), a nominal feature ("X2") and a target class named "Class". The three classes of the problem ("normal", "rare1" and "rare2") are assigned according to the rules described below. These rules depend on the two features ("X1" and "X2").

Usage

data(ImbC)

Format

The data set has one continuous feature (X1) and one nominal feature (X2). The target class (denoted as Class) has three possible values ("normal", "rare1" and "rare2"). Classes "rare1" and "rare2" are the minority classes: examples of class "rare1" make up 1% of the data and those of class "rare2" make up 13.1%. The remaining class, "normal", is the majority class and accounts for 85.9% of the data. Data set ImbC has 1000 examples distributed in classes "rare1", "rare2" and "normal" with 10, 131 and 859 examples respectively.
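The percentages above follow directly from the class counts. The short sketch below recomputes them, together with the ratio of the majority class to each minority class; the object names (counts, pct, ratio) are illustrative and not part of the package.

```r
# Class counts as stated in the Format section
counts <- c(rare1 = 10, rare2 = 131, normal = 859)

# Percentage of the 1000 examples that falls in each class
pct <- 100 * counts / sum(counts)
print(pct)

# Number of "normal" examples per minority-class example
ratio <- counts[["normal"]] / counts[c("rare1", "rare2")]
print(round(ratio, 2))
```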

ImbC data has been simulated as follows:

- X1 \sim \mathbf{N}\left(0, 4\right)

- X2: the labels "cat", "fish" and "dog" were randomly distributed, restricted to frequencies of 30%, 30% and 40%, respectively.

- To obtain the target variable Class, we defined the following sets:

  • S_1=\{(X1, X2) : X1 > 9 \wedge (X2 \in \{"cat", "dog"\})\}

  • S_2=\{(X1, X2) : X1 > 7 \wedge X2 = "fish" \}

  • S_3=\{(X1, X2) : -1 < X1 < 0.5\}

  • S_4=\{(X1, X2) : X1 < -7 \wedge X2 = "fish"\}

- The following conditions define the target variable distribution of the ImbC synthetic data set:

  • Assign class label "rare1" to: a random sample of 90% of set S_1 and a random sample of 40% of set S_2

  • Assign class label "rare2" to: a random sample of 80% of set S_3 and a random sample of 70% of set S_4

  • Assign class label "normal" to the remaining examples.
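The simulation steps above can be sketched in base R as follows. This is not the authors' original script: the seed, the helper pick() and the object names are hypothetical, and the second parameter of N(0, 4) is assumed to be the standard deviation (the quartiles in the example output are consistent with that reading). Because the draws are random, the resulting class counts will be close to, but generally not equal to, those of ImbC.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 1000

# X1 ~ N(0, 4); ASSUMPTION: 4 is the standard deviation
X1 <- rnorm(n, mean = 0, sd = 4)

# X2: exactly 30% "cat", 30% "fish", 40% "dog", in random order
X2 <- factor(sample(rep(c("cat", "fish", "dog"),
                        times = round(n * c(0.3, 0.3, 0.4)))))

# Row indices of the sets S_1 to S_4
S1 <- which(X1 > 9 & X2 %in% c("cat", "dog"))
S2 <- which(X1 > 7 & X2 == "fish")
S3 <- which(X1 > -1 & X1 < 0.5)
S4 <- which(X1 < -7 & X2 == "fish")

# Hypothetical helper: random sample of a given fraction of an index set
pick <- function(idx, frac) {
  k <- round(frac * length(idx))
  idx[sample.int(length(idx), k)]
}

# Assign the minority labels; every other example stays "normal"
Class <- rep("normal", n)
Class[c(pick(S1, 0.9), pick(S2, 0.4))] <- "rare1"
Class[c(pick(S3, 0.8), pick(S4, 0.7))] <- "rare2"

sim <- data.frame(X1, X2,
                  Class = factor(Class, levels = c("normal", "rare1", "rare2")))
print(table(sim$Class))
```

The helper samples positions with sample.int() rather than calling sample() on the index vector itself, which avoids the well-known surprise when the vector has length one.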

Author(s)

Paula Branco, Rita Ribeiro and Luis Torgo

Examples

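library(UBL)      # provides the ImbC data set
library(ggplot2)
data(ImbC)
summary(ImbC)
# jittered plot of X1 for each level of X2, coloured by class
ggplot(data = ImbC, aes(x = X2, y = X1, color = Class)) + geom_jitter()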

Example output

Loading required package: MBA
Loading required package: gstat
Loading required package: automap
Loading required package: sp
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: ggplot2

Attaching package: 'ggplot2'

The following object is masked from 'package:randomForest':

    margin

       X1              X2         Class    
 Min.   :-13.5843   cat :300   normal:859  
 1st Qu.: -2.6930   dog :400   rare1 : 10  
 Median : -0.1592   fish:300   rare2 :131  
 Mean   : -0.1064                          
 3rd Qu.:  2.4633                          
 Max.   : 12.7836                          

UBL documentation built on July 13, 2017, 5:02 p.m.