ImbC: Synthetic Imbalanced Data Set for a Multi-class Task

In UBL: An Implementation of Re-Sampling Approaches to Utility-Based Learning for Both Classification and Regression Tasks

Description

Synthetic imbalanced data set for a multi-class task. The data set has a numeric feature ("X1"), a nominal feature ("X2") and a target class named "Class". The three classes of the problem ("normal", "rare1" and "rare2") are assigned according to the rules described below. These rules depend on the two features ("X1" and "X2").

Usage

data(ImbC)

Format

The data set has one continuous feature (X1) and one nominal feature (X2). The target class (denoted as Class) has three possible values ("normal", "rare1" and "rare2"). Classes "rare1" and "rare2" are the minority classes. Examples of class "rare1" occur in 1% of the data while those of class "rare2" occur in 13.1% of the data. The remaining class, "normal", is the majority class and occurs in 85.9% of the data. Data set ImbC has 1000 examples distributed in classes "rare1", "rare2" and "normal" with 10, 131 and 859 examples respectively.
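As a quick arithmetic check, the percentages quoted above follow directly from the stated class counts. The sketch below uses only those counts, so it does not need the UBL package; the variable names are illustrative.

```r
# Class counts as stated in the Format section
counts <- c(rare1 = 10, rare2 = 131, normal = 859)

# The counts should add up to the 1000 examples of the data set
stopifnot(sum(counts) == 1000)

# Relative frequency of each class, in percent
pct <- 100 * counts / sum(counts)
print(round(pct, 1))
```

With the package installed, the same figures could presumably be obtained from the data itself with `prop.table(table(ImbC$Class))`.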

ImbC data has been simulated as follows:

- X1 \sim \mathbf{N}\left(0, 4\right)

- X2 labels "cat", "fish" and "dog" were randomly distributed with the restriction of having a frequency of 30%, 30% and 40% respectively.

- To obtain the target variable Class, we have defined the following sets:

• S_1=\{(X1, X2) : X1 > 9 \wedge (X2 \in \{"cat", "dog"\})\}

• S_2=\{(X1, X2) : X1 > 7 \wedge X2 = "fish" \}

• S_3=\{(X1, X2) : -1 < X1 < 0.5\}

• S_4=\{(X1, X2) : X1 < -7 \wedge X2 = "fish"\}

- The following conditions define the target variable distribution of the ImbC synthetic data set:

• Assign class label "rare1" to: a random sample of 90% of set S_1 and a random sample of 40% of set S_2

• Assign class label "rare2" to: a random sample of 80% of set S_3 and a random sample of 70% of set S_4

• Assign class label "normal" to the remaining examples.
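The rules above can be sketched in base R as follows. This is an illustrative reconstruction, not the authors' original script. It assumes that the 4 in N(0, 4) is the standard deviation, that fractions of a set are rounded to the nearest whole example, and it uses an arbitrary seed. The helper `take()` and the object `sim` are hypothetical names.

```r
# Illustrative re-creation of the ImbC generation rules (not the original script)
set.seed(42)   # assumption: arbitrary seed, only for reproducibility
n <- 1000

# X1 ~ N(0, 4); assumption: 4 is the standard deviation
X1 <- rnorm(n, mean = 0, sd = 4)

# X2: exactly 30% "cat", 30% "fish" and 40% "dog", in random order
X2 <- factor(sample(rep(c("cat", "fish", "dog"), times = c(300, 300, 400))))

# Index sets S_1 to S_4 as defined in the text
S1 <- which(X1 > 9 & X2 %in% c("cat", "dog"))
S2 <- which(X1 > 7 & X2 == "fish")
S3 <- which(X1 > -1 & X1 < 0.5)
S4 <- which(X1 < -7 & X2 == "fish")

# Hypothetical helper: random subset holding a given fraction of an index set
take <- function(idx, frac) {
  idx[sample.int(length(idx), round(frac * length(idx)))]
}

# Label assignment: "rare1", then "rare2", everything else stays "normal"
Class <- rep("normal", n)
Class[c(take(S1, 0.9), take(S2, 0.4))] <- "rare1"
Class[c(take(S3, 0.8), take(S4, 0.7))] <- "rare2"

sim <- data.frame(
  X1 = X1,
  X2 = X2,
  Class = factor(Class, levels = c("normal", "rare1", "rare2"))
)

table(sim$Class)
```

Because the sets S_1 and S_2 (X1 > 7) never overlap S_3 or S_4 (X1 < 0.5), the order of the two assignments does not matter. The class counts of such a simulation will vary with the seed and will generally differ from the 10, 131 and 859 examples of the shipped data set.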

Author(s)

Paula Branco, Rita Ribeiro and Luis Torgo

Examples

library(UBL)      # provides the ImbC data set
library(ggplot2)
data(ImbC)
summary(ImbC)
ggplot(data = ImbC, aes(x = X2, y = X1, color = Class)) + geom_jitter()

Example output

Loading required package: MBA
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'ggplot2'

The following object is masked from 'package:randomForest':

margin

       X1                X2         Class
 Min.   :-13.5843   cat :300   normal:859
 1st Qu.: -2.6930   dog :400   rare1 : 10
 Median : -0.1592   fish:300   rare2 :131
 Mean   : -0.1064
 3rd Qu.:  2.4633
 Max.   : 12.7836
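The quartiles of X1 in this summary give a rough way to check how N(0, 4) is parameterised. The sketch below compares them with the theoretical quartiles of a normal distribution with mean 0 and standard deviation 4; reading the 4 as a standard deviation is an assumption, and the observed values are copied by hand from the output above.

```r
# Theoretical quartiles under the assumption X1 ~ N(mean = 0, sd = 4)
theo <- qnorm(c(0.25, 0.75), mean = 0, sd = 4)

# 1st and 3rd quartiles of X1, copied from the summary output above
obs <- c(-2.6930, 2.4633)

data.frame(
  quantile    = c("25%", "75%"),
  theoretical = round(theo, 4),
  observed    = obs
)
```

The observed quartiles sit close to the theoretical ones, which is consistent with a standard deviation of 4. A variance of 4 (standard deviation 2) would put the quartiles near ±1.35, noticeably further from the observed values.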


UBL documentation built on July 13, 2017, 5:02 p.m.