simulated: Simulated Complexity Measures


Description

These complexity measures quantify the ambiguity of the classes, the sparsity and dimensionality of the data, and the complexity of the boundary separating the classes.

Usage

simulated(...)

## Default S3 method:
simulated(x, y, features = "all", ...)

## S3 method for class 'formula'
simulated(formula, data, features = "all", ...)

Arguments

...

Further arguments passed to the summarization functions.

x

A data.frame containing only the input attributes.

y

A factor response vector with one label for each row/component of x.

features

A list of feature names, or "all" to include all of them.

formula

A formula to define the class column.

data

A data.frame containing the input attributes and the class. The Details section describes the valid values for this argument.

Details

The following features are allowed for this method:

"F1"

Maximum Fisher's Discriminant Ratio (F1) measures the overlap between the values of the features and takes the value of the largest discriminant ratio among all the available features.
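As an illustration, a minimal base-R sketch of the idea behind F1 (a hypothetical implementation for two or more classes; the package's own computation and normalization may differ):

```r
# Sketch of F1: for each feature, compute the ratio of between-class to
# within-class variance, then take the maximum over all features.
f1 <- function(x, y) {
  max(sapply(x, function(f) {
    overall <- mean(f)
    between <- sum(tapply(f, y, function(v) length(v) * (mean(v) - overall)^2))
    within  <- sum(tapply(f, y, function(v) sum((v - mean(v))^2)))
    between / within
  }))
}

f1(iris[1:4], iris$Species)
```

Larger values indicate at least one feature whose class means are well separated relative to the within-class spread.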

"F1v"

Directional-vector maximum Fisher's discriminant ratio (F1v) complements F1 by searching for a vector able to separate two classes after the training examples have been projected into it.

"F2"

Volume of the overlapping region (F2) computes the overlap of the distributions of the feature values within the classes. F2 can be determined by finding, for each feature, its minimum and maximum values in the classes.

"F3"

The maximum individual feature efficiency (F3) of each feature is given by the ratio between the number of examples that are not in the overlapping region of two classes and the total number of examples. This measure returns the maximum of the values found among the input features.

"F4"

Collective feature efficiency (F4) gives an overview of how various features may work together in separating the data. First, the most discriminative feature according to F3 is selected, and all examples that can be separated by this feature are removed from the dataset. This step is repeated on the remaining dataset until all the features have been considered or no example remains. F4 returns the ratio of examples that have been discriminated.

"N1"

Fraction of borderline points (N1) computes the percentage of vertexes incident to edges connecting examples of opposite classes in a Minimum Spanning Tree (MST).
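A self-contained base-R sketch of N1 (assumption: Euclidean distances on scaled features and a naive Prim's algorithm for the MST, used here only for illustration and far slower than a real implementation):

```r
# Sketch of N1: build an MST over the pairwise distance matrix, then count
# the fraction of vertices incident to an edge joining opposite classes.
n1 <- function(x, y) {
  d <- as.matrix(dist(scale(x)))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(0, n - 1, 2)
  for (k in seq_len(n - 1)) {           # Prim's algorithm, O(n^3) naive form
    d_sub <- d[in_tree, !in_tree, drop = FALSE]
    idx <- which(d_sub == min(d_sub), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to   <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  mixed <- y[edges[, 1]] != y[edges[, 2]]   # edges crossing class boundaries
  length(unique(c(edges[mixed, ]))) / n
}

n1(iris[1:4], iris$Species)
```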

"N2"

Ratio of intra/extra class nearest neighbor distance (N2) computes the ratio of two sums: intra-class and inter-class. The former corresponds to the sum of the distances between each example and its closest neighbor from the same class. The latter is the sum of the distances between each example and its closest neighbor from another class (nearest enemy).
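The two sums can be sketched directly from the distance matrix (a hypothetical base-R version; the package may scale or normalize differently):

```r
# Sketch of N2: sum of intra-class nearest-neighbor distances divided by
# the sum of distances to each example's nearest enemy.
n2 <- function(x, y) {
  d <- as.matrix(dist(scale(x)))
  diag(d) <- Inf                    # an example is not its own neighbor
  intra <- sapply(seq_len(nrow(d)), function(i) min(d[i, y == y[i]]))
  extra <- sapply(seq_len(nrow(d)), function(i) min(d[i, y != y[i]]))
  sum(intra) / sum(extra)
}

n2(iris[1:4], iris$Species)
```

Low values indicate compact classes that sit far from their nearest enemies.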

"N3"

Error rate of the nearest neighbor classifier (N3) corresponds to the error rate of a one Nearest Neighbor (1NN) classifier, estimated using a leave-one-out procedure on the dataset.
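This can be sketched with the class package, whose knn.cv performs leave-one-out 1NN (an assumption about the setup; the package's own distance handling may differ):

```r
library(class)  # ships with base R distributions; knn.cv does leave-one-out

# Sketch of N3: leave-one-out 1NN error rate on scaled features.
n3 <- function(x, y) {
  pred <- knn.cv(scale(x), y, k = 1)
  mean(pred != y)
}

n3(iris[1:4], iris$Species)
```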

"N4"

Non-linearity of the nearest neighbor classifier (N4) creates a new dataset by randomly interpolating pairs of training examples of the same class, then induces a 1NN classifier on the original data and measures the error rate on the new data points.

"T1"

Fraction of hyperspheres covering data (T1) builds hyperspheres centered at each of the training examples, whose radii grow until the hypersphere reaches an example of another class. Afterwards, smaller hyperspheres contained in larger hyperspheres are eliminated. T1 is finally defined as the ratio between the number of remaining hyperspheres and the total number of examples in the dataset.

"LSC"

Local Set Average Cardinality (LSC) is based on the Local Set (LS) of an example: the set of points from the dataset whose distance to that example is smaller than the example's distance to its nearest enemy (its closest example from a different class). LSC is the average cardinality of the local sets.
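A compact base-R sketch of the local-set idea (hypothetical; the package may normalize the average by the dataset size):

```r
# Sketch of LSC: the local set of an example contains the points closer to
# it than its nearest enemy; LSC averages the local-set cardinalities.
lsc <- function(x, y) {
  d <- as.matrix(dist(scale(x)))
  mean(sapply(seq_len(nrow(d)), function(i) {
    enemy <- min(d[i, y != y[i]])    # distance to the nearest enemy
    sum(d[i, ] < enemy) - 1          # exclude the example itself (d = 0)
  }))
}

lsc(iris[1:4], iris$Species)
```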

"L1"

Sum of the error distance by linear programming (L1) computes the sum of the distances of incorrectly classified examples to a linear boundary used in their classification.

"L2"

Error rate of linear classifier (L2) computes the error rate of the linear SVM classifier induced from dataset.
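L2 can be sketched with the e1071 package's linear SVM (an assumption: e1071 is installed; the package's SVM parameters may differ):

```r
library(e1071)  # assumed available for svm() with a linear kernel

# Sketch of L2: training error rate of a linear SVM on the dataset.
l2 <- function(x, y) {
  model <- svm(x, y, kernel = "linear")
  mean(predict(model, x) != y)
}

l2(iris[1:4], iris$Species)
```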

"L3"

Non-linearity of a linear classifier (L3) creates a new dataset by randomly interpolating pairs of training examples of the same class, then induces a linear SVM on the original data and measures the error rate on the new data points.

"Density"

Average Density of the network (Density) represents the number of edges in the graph, divided by the maximum possible number of edges between pairs of data points.

"ClsCoef"

Clustering coefficient (ClsCoef) averages the clustering tendency of the vertices, given for each vertex by the ratio between the number of edges that exist between its neighbors and the maximum number of edges that could possibly exist between them.

"Hubs"

Hubs score (Hubs) scores each node by the number of connections it has to other nodes, weighted by the number of connections these neighbors have.

Value

A list named by the requested meta-features.

References

Ana C. Lorena, Luis P. F. Garcia, Jens Lehmann, Marcilio C. P. de Souto and Tin K. Ho. How complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591, 2018.

Examples

## Extract all complexity measures using formula
simulated(Species ~ ., iris)

## Extract some complexity measures
simulated(iris[1:4], iris[5], c("F2", "F3", "F4"))

lpfgarcia/SCoL documentation built on May 29, 2019, 9:31 a.m.