# Artificial data for testing classification algorithms

### Description

The generator produces classification data with 2 classes, 7 discrete and 3 numeric attributes.

### Usage

```
classDataGen(noInst, t1=0.7, t2=0.9, t3=0.34, t4=0.32,
             p1=0.5, classNoise=0)
```

### Arguments

| Argument | Description |
|---|---|
| `noInst` | Number of instances to generate. |
| `t1, t2, t3` | Parameters that control the hardness of the discrete attributes. |
| `t4` | Parameter that controls the hardness of the numeric attributes. |
| `p1` | Probability of class 1. |
| `classNoise` | Proportion of noise in the class variable for classification or virtual class variable for regression. |
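
As a rough illustration of how these arguments are passed, the sketch below generates a larger sample with a non-default class prior and a small amount of class noise, then tabulates the observed class proportions. It assumes the CORElearn package is installed and attached; the particular values (`noInst = 1000`, `p1 = 0.3`, `classNoise = 0.05`) are arbitrary choices for the example, not recommendations.

```r
library(CORElearn)

# Generate 1000 instances; class 1 is drawn with probability 0.3,
# and 5% class noise is requested (values chosen only for illustration).
d <- classDataGen(noInst = 1000, p1 = 0.3, classNoise = 0.05)

# Observed class proportions in the generated sample
print(prop.table(table(d$class)))
```

The observed proportions vary from run to run, because the data are random and class noise is applied to the class variable.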

### Details

Class probabilities are `p1` and `1 - p1`, respectively. The conditional distribution of the attributes under each of the classes depends on the parameters `t1, t2, t3, t4`, which take values in [0,1]. Attributes a7 and x3 are irrelevant for all values of the parameters.

Examples of extreme settings of the parameters.

Any setting satisfying t1*t2 = t3 implies no difference between the distributions of the individual discrete attributes in the two classes. However, if t1 < 1, their joint distribution differs between the two classes.

Setting t1 = 1 and t2 = t3 implies no difference between the joint distribution of the discrete attributes among the two classes.

Setting t1 = 1, t2 = 1, t3 = 0 implies disjoint supports of the distributions of a1, a2, a4, a5, so this allows exact classification.

Setting t4 = 1 implies no difference between the distribution of x1, x2 between the classes. Setting t4 = 0 allows correct classification with probability one only using x1 and x2.

For class 1 the attributes have distributions

| Attributes | Distribution |
|---|---|
| (a1, a2, a3) | D_1(t1, t2) |
| a4, a5, a6 | D_2(t3) |
| a7 | irrelevant attribute, probabilities of {a,b,c,d} are (1/2, 1/6, 1/6, 1/6) |
| x1, x2, x3 | independent normal variables with mean 0 and standard deviations 1, t4, 1 |
| x4, x5 | independent uniformly distributed variables on [0,1] |

For class 2 the attributes have distributions

| Attributes | Distribution |
|---|---|
| a1, a2, a3 | D_2(t3) |
| (a4, a5, a6) | D_1(t1, t2) |
| a7 | irrelevant attribute, probabilities of {a,b,c,d} are (1/2, 1/6, 1/6, 1/6) |
| x1, x2, x3 | independent normal variables with mean 0 and standard deviations t4, 1, 1 |
| x4, x5 | independent uniformly distributed variables on [0,1] |

x3 is irrelevant for classification, since it has the same distribution under both classes.

Attributes in a bracket are mutually dependent. Otherwise, the attributes are conditionally independent for each of the two classes. This means that if we consider groups of the attributes such that the attributes in each of the two brackets form a group and each of the remaining attributes forms a group with one element, then for each class, we have 7 groups, which are conditionally independent for the given class. Note that the splitting into groups differs for class 1 and 2.

Distribution *D_1(t1,t2)* consists of three dependent attributes. The
distribution of the individual attributes depends only on the product t1*t2.
For a given value of t1*t2, the level of dependence decreases with t1 and
increases with t2. For a fixed product there are two extreme settings:

Setting t1 = 1, with t2 equal to the product, has the largest t1 and the
smallest t2, and all three attributes are independent.

Setting t2 = 1, with t1 equal to the product, has the smallest t1 and the
largest t2, and also the largest dependence between the attributes.

Distribution *D_2(t3)* is equal to *D_1(1, t3)*, so it contains three independent
attributes, whose distributions are the same as in *D_1(t1,t2)* for every
setting satisfying t1*t2 = t3.

In other words, if t3 = t1*t2, then the distributions *D_1(t1, t2)* and *D_2(t3)*
have the same distributions of the individual attributes and may differ only
in the dependences. There are none in *D_2(t3)*, and there are some in
*D_1(t1, t2)* if t1 < 1.

*Hardness of the discrete part*

Setting t1 = 1 and t2 = t3 implies no difference between the discrete attributes among the two classes.

Any setting satisfying t1*t2 = t3 implies no difference between the distributions of the individual discrete attributes in the two classes. However, there may be a difference in the dependences.

Setting t1 = 1, t2 = 1, t3 = 0 implies disjoint supports of the distributions of a1, a2, a4, a5, so this allows exact classification.

*Hardness of the continuous part*

The hardness of the continuous part depends monotonically on t4. Setting t4 = 1 implies no difference between the classes. Setting t4 = 0 allows correct classification with probability one.
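
The effect of t4 can be illustrated with a minimal base-R sketch that simulates only the normal attributes x1, x2, x3, using the class-conditional standard deviations listed in the tables above. The helper `simNumericPart` is hypothetical and is not part of CORElearn; it is not the package's implementation, only a re-creation of the documented distributions under that reading of the tables.

```r
# Hypothetical helper (not part of CORElearn): simulates x1, x2, x3 only.
# Documented standard deviations: class 1 -> (1, t4, 1), class 2 -> (t4, 1, 1).
simNumericPart <- function(n, t4, p1 = 0.5) {
  cls <- ifelse(runif(n) < p1, 1L, 2L)   # class 1 with probability p1
  sdX1 <- ifelse(cls == 1L, 1, t4)       # x1 is narrow in class 2
  sdX2 <- ifelse(cls == 1L, t4, 1)       # x2 is narrow in class 1
  data.frame(
    x1 = rnorm(n, mean = 0, sd = sdX1),
    x2 = rnorm(n, mean = 0, sd = sdX2),
    x3 = rnorm(n, mean = 0, sd = 1),     # same in both classes: irrelevant
    class = factor(cls, levels = 1:2)
  )
}

set.seed(42)
easy <- simNumericPart(2000, t4 = 0.1)   # classes differ strongly in spread
hard <- simNumericPart(2000, t4 = 1)     # classes are indistinguishable

# Per-class sample standard deviation of x1 under both settings
sdEasy <- tapply(easy$x1, easy$class, sd)
sdHard <- tapply(hard$x1, hard$class, sd)
print(round(sdEasy, 2))
print(round(sdHard, 2))
```

With a small t4 the two classes differ clearly in the spread of x1 (and, symmetrically, of x2), while with t4 = 1 the per-class spreads coincide up to sampling error.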

### Value

The method `classDataGen` returns a `data.frame` with `noInst` rows and 11 columns.
The ranges of values of the attributes and the class are:

| Column | Values |
|---|---|
| `a1` | 0,1 |
| `a2` | 0,1 |
| `a3` | a,b,c,d |
| `a4` | 0,1 |
| `a5` | 0,1 |
| `a6` | a,b,c,d |
| `a7` | a,b,c,d |
| `x1` | numeric |
| `x2` | numeric |
| `x3` | numeric |
| `class` | 1,2 |

For a detailed specification of the attributes (columns), see the Details section above.
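
A quick way to check the shape of the returned object is sketched below. It assumes CORElearn is installed and attached; the storage type of each column (for example, whether the discrete attributes are factors) is not stated here, so the sketch only prints it rather than relying on it.

```r
library(CORElearn)

d <- classDataGen(noInst = 50)

print(dim(d))             # expected per the description: 50 rows, 11 columns
print(names(d))           # column names of the generated data frame
print(sapply(d, class))   # storage class of each column, as returned
```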

### Author(s)

Petr Savicky

### See Also

`regDataGen`, `ordDataGen`, `CoreModel`.

### Examples

```
library(CORElearn)

# prepare a classification data set
classData <- classDataGen(noInst=200)
# build a random forest model with certain parameters
modelRF <- CoreModel(class~., classData, model="rf",
                     selectionEstimator="MDL", minNodeWeightRF=5,
                     rfNoTrees=100, maxThreads=1)
print(modelRF)
destroyModels(modelRF) # clean up
```