Artificial data for testing classification algorithms

Description

The generator produces classification data with 2 classes, 7 discrete and 3 numeric attributes.

Usage

  classDataGen(noInst, t1=0.7, t2=0.9, t3=0.34, t4=0.32, 
               p1=0.5, classNoise=0)

Arguments

noInst

Number of instances to generate.

t1, t2, t3

Parameters that control the hardness of the discrete attributes.

t4

Parameter that controls the hardness of the numeric attributes.

p1

Probability of class 1.

classNoise

Proportion of noise in the class variable for classification or virtual class variable for regression.

Details

Class probabilities are p1 and 1 - p1, respectively. The conditional distribution of attributes under each of the classes depends on parameters t1, t2, t3, t4 from [0,1]. Attributes a7 and x3 are irrelevant for all values of parameters.

Examples of extreme settings of the parameters.

  • Any setting satisfying t1*t2 = t3 implies no difference between the distributions of individual discrete attributes in the two classes. However, if t1 < 1, then their joint distribution differs between the two classes.

  • Setting t1 = 1 and t2 = t3 implies no difference between the joint distribution of the discrete attributes among the two classes.

  • Setting t1 = 1, t2 = 1, t3 = 0 implies disjoint supports of the distributions of a1, a2, a4, a5, so this allows exact classification.

  • Setting t4 = 1 implies no difference in the distribution of x1 and x2 between the classes. Setting t4 = 0 allows correct classification with probability one using only x1 and x2.

For class 1 the attributes have distributions

(a1, a2, a3): D_1(t1, t2)
a4, a5, a6: D_2(t3)
a7: irrelevant attribute, probabilities of {a,b,c,d} are (1/2, 1/6, 1/6, 1/6)
x1, x2, x3: independent normal variables with mean 0 and standard deviations 1, t4, 1, respectively
x4, x5: independent uniformly distributed variables on [0,1]

For class 2 the attributes have distributions

a1, a2, a3: D_2(t3)
(a4, a5, a6): D_1(t1, t2)
a7: irrelevant attribute, probabilities of {a,b,c,d} are (1/2, 1/6, 1/6, 1/6)
x1, x2, x3: independent normal variables with mean 0 and standard deviations t4, 1, 1, respectively
x4, x5: independent uniformly distributed variables on [0,1]

x3 is irrelevant for classification, since it has the same distribution under both classes.

Attributes in a bracket are mutually dependent. Otherwise, the attributes are conditionally independent for each of the two classes. This means that if we consider groups of the attributes such that the attributes in each of the two brackets form a group and each of the remaining attributes forms a group with one element, then for each class, we have 7 groups, which are conditionally independent for the given class. Note that the splitting into groups differs for class 1 and 2.

Distribution D_1(t1,t2) consists of three dependent attributes. The distribution of individual attributes depends only on t1*t2. For a given t1*t2, the level of dependence decreases with t1 and increases with t2. There are two extreme settings: Setting t1 = 1, t2 = t1*t2 has the largest t1 and the smallest t2 and all three attributes are independent. Setting t1 = t1*t2, t2 = 1 has the smallest t1 and the largest t2 and also the largest dependence between attributes.

Distribution D_2(t3) is equal to D_1(1, t3), so it contains three independent attributes, whose distributions are the same as in D_1(t1,t2) for every setting satisfying t1*t2 = t3.

In other words, if t3 = t1*t2, then the distributions D_1(t1, t2) and D_2(t3) have the same distributions of individual attributes and may differ only in the dependences. There are no dependences in D_2(t3), and there are some in D_1(t1, t2) if t1 < 1.
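The equality of the individual (marginal) distributions under t3 = t1*t2 can be checked empirically. The following is a minimal sketch, assuming the CORElearn package is installed (the check is skipped otherwise); the variable names d, a1ByClass and marginalGap are illustrative only. It compares the distribution of a1 within each class for a setting with t1 < 1.

```r
# Skipped when CORElearn is not available; marginalGap then stays NA.
marginalGap <- NA_real_
if (requireNamespace("CORElearn", quietly = TRUE)) {
  set.seed(42)
  # t1 * t2 = 0.5 * 0.8 = 0.4 = t3, and t1 < 1
  d <- CORElearn::classDataGen(noInst = 20000, t1 = 0.5, t2 = 0.8, t3 = 0.4)
  # Column-wise proportions: distribution of a1 within each class
  a1ByClass <- prop.table(table(d$a1, d$class), margin = 2)
  # Largest difference between the two class-conditional distributions of a1
  marginalGap <- max(abs(a1ByClass[, 1] - a1ByClass[, 2]))
  print(round(a1ByClass, 3))
}
```

According to the description above, the two columns should agree up to sampling noise, even though the joint distribution of (a1, a2, a3) differs between the classes.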

Hardness of the discrete part

Setting t1 = 1 and t2 = t3 implies no difference between the discrete attributes among the two classes.

Any setting satisfying t1*t2 = t3 implies no difference between the distributions of individual discrete attributes in the two classes. However, there may be a difference in dependences.

Setting t1 = 1, t2 = 1, t3 = 0 implies disjoint supports of the distributions of a1, a2, a4, a5, so this allows exact classification.

Hardness of the continuous part

The hardness of the continuous part depends monotonically on t4. Setting t4 = 1 implies no difference between the classes. Setting t4 = 0 allows correct classification with probability one.
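The two extreme values of t4 can be illustrated in base R without the package. This is only a sketch of the numeric part as listed in the Details, not the package's implementation: simNumericPart is a hypothetical helper that draws x1 and x2 with the standard deviations given for each class.

```r
# Hypothetical helper, not part of CORElearn: simulates only x1 and x2
# with the class-conditional standard deviations listed in the Details.
simNumericPart <- function(noInst, t4, p1 = 0.5) {
  cls <- ifelse(runif(noInst) < p1, 1L, 2L)
  sd1 <- ifelse(cls == 1L, 1, t4)  # st. dev. of x1: 1 in class 1, t4 in class 2
  sd2 <- ifelse(cls == 1L, t4, 1)  # st. dev. of x2: t4 in class 1, 1 in class 2
  data.frame(x1 = rnorm(noInst, mean = 0, sd = sd1),
             x2 = rnorm(noInst, mean = 0, sd = sd2),
             class = cls)
}

# Simple rule: predict class 1 when |x1| exceeds |x2|, class 2 otherwise.
ruleAccuracy <- function(d) {
  predicted <- ifelse(abs(d$x1) > abs(d$x2), 1L, 2L)
  mean(predicted == d$class)
}

set.seed(1)
# t4 = 0: x2 is exactly 0 in class 1 and x1 is exactly 0 in class 2
accuracyEasy <- ruleAccuracy(simNumericPart(1000, t4 = 0))
# t4 = 1: x1 and x2 have the same distribution in both classes
accuracyHard <- ruleAccuracy(simNumericPart(1000, t4 = 1))
print(c(easy = accuracyEasy, hard = accuracyHard))
```

With t4 = 0 the rule classifies every instance correctly, while with t4 = 1 it does no better than guessing, since x1 and x2 carry no class information.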

Value

The method classDataGen returns a data.frame with noInst rows and 11 columns. The ranges of values of the attributes and the class are:

a1

0,1

a2

0,1

a3

a,b,c,d

a4

0,1

a5

0,1

a6

a,b,c,d

a7

a,b,c,d

x1

numeric

x2

numeric

x3

numeric

class

1,2

For a detailed specification of the attributes (columns), see the Details section above.
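The structure of the returned data.frame can be inspected directly. The sketch below assumes the CORElearn package is installed and is skipped otherwise; the value p1 = 0.3 is chosen only for illustration.

```r
# Skipped when CORElearn is not available; classData then stays NULL.
classData <- NULL
if (requireNamespace("CORElearn", quietly = TRUE)) {
  classData <- CORElearn::classDataGen(noInst = 200, p1 = 0.3)
  str(classData)                 # column names and types
  print(table(classData$class))  # class counts, roughly 30% in class 1
}
```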

Author(s)

Petr Savicky

See Also

regDataGen, ordDataGen, CoreModel.

Examples

# prepare a classification data set
classData <- classDataGen(noInst=200)

# build a random forest model with certain parameters
modelRF <- CoreModel(class~., classData, model="rf",
              selectionEstimator="MDL", minNodeWeightRF=5,
              rfNoTrees=100, maxThreads=1)
print(modelRF)
destroyModels(modelRF) # clean up
