statistical: Statistical meta-features

Description

Statistical meta-features are standard statistical measures that describe the numerical properties of a data distribution. Because they require numerical attributes, categorical data are transformed to numerical.

Usage

statistical(...)

## Default S3 method:
statistical(
  x,
  y,
  features = "all",
  summary = c("mean", "sd"),
  by.class = FALSE,
  transform = TRUE,
  ...
)

## S3 method for class 'formula'
statistical(
  formula,
  data,
  features = "all",
  summary = c("mean", "sd"),
  by.class = FALSE,
  transform = TRUE,
  ...
)

Arguments

...

Further arguments passed to the summarization functions.

x

A data.frame containing only the input attributes.

y

A factor response vector with one label for each row/component of x.

features

A list of feature names or "all" to include all of them. The Details section describes the valid values for this group.

summary

A list of summarization functions or empty for all values. See the post.processing method for more information. (Default: c("mean", "sd"))

by.class

A logical value indicating if the meta-features must be computed for each group of samples belonging to different output classes. (Default: FALSE)

transform

A logical value indicating if the categorical attributes should be transformed. If FALSE they will be ignored. (Default: TRUE)

formula

A formula to define the class column.

data

A data.frame dataset containing the input attributes and the class.
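
As a rough illustration of what the summary argument does, the sketch below applies named summarization functions to a multi-valued measure using base R only. The helper name summarize_values is hypothetical and is not part of the package; the package's own post.processing method may behave differently.

```r
# Hypothetical helper (not part of mfe): apply each named summarization
# function to a vector of per-attribute values.
summarize_values <- function(values, summary = c("mean", "sd")) {
  # An empty summary returns the raw, non-aggregated values.
  if (length(summary) == 0) {
    return(values)
  }
  sapply(summary, function(f) do.call(f, list(values)))
}

# A multi-valued measure: one mean per numeric attribute of iris.
attr_means <- colMeans(iris[1:4])

res <- summarize_values(attr_means)
print(res)
```

On iris this gives a mean of about 3.4645 and a standard deviation of about 1.9185, in line with the $mean entry of the sample output on this page.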

Details

The following features are allowed for this method:

"canCor"

Canonical correlations between the predictive attributes and the class (multi-valued).

"gravity"

Center of gravity, which is the distance between the center instance of the majority class and the center instance of the minority class.

"cor"

Absolute attribute correlation, which measures the absolute correlation between each pair of numeric attributes in the dataset (multi-valued). This measure accepts an extra argument called method = c("pearson", "kendall", "spearman"). See cor for more details.

"cov"

Absolute attribute covariance, which measures the absolute covariance between each pair of numeric attributes in the dataset (multi-valued).

"nrDisc"

Number of discriminant functions.

"eigenvalues"

Eigenvalues of the covariance matrix (multi-valued).

"gMean"

Geometric mean of attributes (multi-valued).

"hMean"

Harmonic mean of attributes (multi-valued).

"iqRange"

Interquartile range of attributes (multi-valued).

"kurtosis"

Kurtosis of attributes (multi-valued).

"mad"

Median absolute deviation of attributes (multi-valued).

"max"

Maximum value of attributes (multi-valued).

"mean"

Mean value of attributes (multi-valued).

"median"

Median value of attributes (multi-valued).

"min"

Minimum value of attributes (multi-valued).

"nrCorAttr"

Number of attribute pairs with high correlation (multi-valued only when by.class=TRUE).

"nrNorm"

Number of attributes with normal distribution. The Shapiro-Wilk normality test is used to assess whether an attribute is normally distributed (multi-valued only when by.class=TRUE).

"nrOutliers"

Number of attributes with outlier values. Tukey's boxplot algorithm is used to determine whether an attribute has outliers (multi-valued only when by.class=TRUE).

"range"

Range of attributes (multi-valued).

"sd"

Standard deviation of the attributes (multi-valued).

"sdRatio"

Statistical test for homogeneity of covariances.

"skewness"

Skewness of attributes (multi-valued).

"sparsity"

Attribute sparsity, which represents the degree of discreteness of each attribute in the dataset (multi-valued).

"tMean"

Trimmed mean of attributes (multi-valued). It is the arithmetic mean computed after excluding the lowest 20% and the highest 20% of instances.

"var"

Attribute variance (multi-valued).

"wLambda"

Wilks' Lambda.
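
To make two of the multi-valued measures above concrete, the following base R sketch approximates "cor" and "tMean" on iris. It is an illustration that assumes the measures follow the descriptions given here (absolute pairwise correlation, and a mean with 20% trimmed from each end); it is not the package's source.

```r
num <- iris[1:4]

# "cor": absolute correlation for each pair of numeric attributes.
cm <- abs(cor(num, method = "pearson"))
pair_cor <- cm[upper.tri(cm)]   # one value per attribute pair (6 pairs)

# "tMean": trimmed mean of each attribute, dropping 20% at each end.
t_mean <- sapply(num, mean, trim = 0.2)

print(round(mean(pair_cor), 6))
print(t_mean)
```

On iris the mean of the six pairwise values is about 0.594, which agrees with the $cor entry in the sample output on this page.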

This method uses simple binarization to transform the categorical attributes when transform=TRUE.
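
The package source is not reproduced here, so the exact encoding is an assumption; one common reading of simple binarization is one 0/1 indicator column per category level. A minimal sketch with a made-up data frame:

```r
# Toy data (invented for illustration): one numeric and one categorical column.
df <- data.frame(
  size  = c(1.2, 3.4, 2.2, 5.1),
  color = factor(c("red", "green", "blue", "green"))
)

# One 0/1 indicator column per level of the factor.
indicators <- sapply(levels(df$color), function(lv) as.integer(df$color == lv))

# Numeric-only data frame: original numeric column plus the indicators.
numeric_df <- cbind(df["size"], indicators)
print(numeric_df)
```

After this step every column is numeric, so measures such as the mean or the correlation can be computed over all attributes.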

Value

A list named by the requested meta-features.
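
Because the result is a plain named list, standard list tools apply. The sketch below uses a hand-built list that copies two entries from the sample output on this page, so it runs without the package installed; with the package, the same calls should apply to the object returned by statistical().

```r
# Hand-built stand-in for a statistical() result (values copied from the
# sample output on this page).
res <- list(
  gravity = 3.208281,
  cor     = c(mean = 0.5941160, sd = 0.3375443)
)

cor_mean <- res$cor[["mean"]]   # access one summarized value
flat <- unlist(res)             # flatten into a single named numeric vector
print(names(flat))
```

Flattening with unlist() is convenient when each dataset should contribute one row of a meta-dataset.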

References

Ciro Castiello, Giovanna Castellano, and Anna M. Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457 - 468, 2005.

Shawkat Ali, and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, volume 6, pages 119 - 138, 2006.

See Also

Other meta-features: clustering(), complexity(), concept(), general(), infotheo(), itemset(), landmarking(), model.based(), relative()

Examples

## Extract all meta-features
statistical(Species ~ ., iris)

## Extract some meta-features
statistical(iris[1:4], iris[5], c("cor", "nrNorm"))

## Extract all meta-features without summarizing the results
statistical(Species ~ ., iris, summary=c())

## Use another summarization function
statistical(Species ~ ., iris, summary=c("min", "median", "max"))

## Extract statistical measures using by.class approach
statistical(Species ~ ., iris, by.class=TRUE)

## Do not transform the data (using only numerical attributes)
statistical(Species ~ ., iris, transform=FALSE)
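
For intuition about by.class=TRUE, the base R sketch below computes one measure (the per-attribute mean) separately within each class and then summarizes all resulting values together. That the package pools per-class values in this way is an assumption drawn from the argument description, not from its source.

```r
# Split the numeric attributes by class label.
groups <- split(iris[1:4], iris$Species)

# Per-attribute means within each class: a 4 x 3 matrix
# (attributes in rows, classes in columns).
class_means <- sapply(groups, colMeans)

# Summarize all per-class values together.
pooled <- as.vector(class_means)
res <- c(mean = mean(pooled), sd = sd(pooled))
print(res)
```

On iris the pooled mean equals the overall per-attribute mean because the three classes have the same size; the spread differs because between-class variation is now included.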

Example output

$canCor
     mean        sd 
0.7280090 0.3631869 

$gravity
[1] 3.208281

$cor
     mean        sd 
0.5941160 0.3375443 

$cov
     mean        sd 
0.5966542 0.5582672 

$nrDisc
[1] 2

$eigenvalues
    mean       sd 
1.143239 2.058771 

$gMean
    mean       sd 
3.223073 2.022943 

$hMean
    mean       sd 
2.978389 2.145948 

$iqRange
    mean       sd 
1.700000 1.275408 

$kurtosis
      mean         sd 
-0.8105361  0.7326910 

$mad
     mean        sd 
1.0934175 0.5785782 

$max
    mean       sd 
5.425000 2.443188 

$mean
    mean       sd 
3.464500 1.918485 

$median
    mean       sd 
3.612500 1.919364 

$min
    mean       sd 
1.850000 1.808314 

$nrCorAttr
[1] 0.5

$nrNorm
[1] 1

$nrOutliers
[1] 1

$range
 mean    sd 
3.575 1.650 

$sd
     mean        sd 
0.9478671 0.5712994 

$sdRatio
[1] 1.277229

$skewness
      mean         sd 
0.06273198 0.29439896 

$sparsity
      mean         sd 
0.02871478 0.01103236 

$tMean
    mean       sd 
3.470556 1.904802 

$var
    mean       sd 
1.143239 1.332546 

$wLambda
[1] 0.02343863

$cor
     mean        sd 
0.5941160 0.3375443 

$nrNorm
[1] 1

$canCor
non.aggregated1 non.aggregated2 
      0.9848209       0.4711970 

$gravity
[1] 3.208281

$cor
non.aggregated1 non.aggregated2 non.aggregated3 non.aggregated4 non.aggregated5 
      0.1175698       0.8717538       0.4284401       0.8179411       0.3661259 
non.aggregated6 
      0.9628654 

$cov
non.aggregated1 non.aggregated2 non.aggregated3 non.aggregated4 non.aggregated5 
      0.0424340       1.2743154       0.3296564       0.5162707       0.1216394 
non.aggregated6 
      1.2956094 

$nrDisc
[1] 2

$eigenvalues
non.aggregated1 non.aggregated2 non.aggregated3 non.aggregated4 
     4.22824171      0.24267075      0.07820950      0.02383509 

$gMean
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                  5.7857204                   3.0265978 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                  3.2382668                   0.8417075 

$hMean
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                  5.7289051                   2.9958151 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                  2.6941655                   0.4946708 

$iqRange
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                        1.3                         0.5 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                        3.5                         1.5 

$kurtosis
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                 -0.6058125                   0.1387047 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                 -1.4168574                  -1.3581792 

$mad
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                    1.03782                     0.44478 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                    1.85325                     1.03782 

$max
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                        7.9                         4.4 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                        6.9                         2.5 

$mean
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                   5.843333                    3.057333 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                   3.758000                    1.199333 

$median
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                       5.80                        3.00 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                       4.35                        1.30 

$min
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                        4.3                         2.0 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                        1.0                         0.1 

$nrCorAttr
[1] 0.5

$nrNorm
[1] 1

$nrOutliers
[1] 1

$range
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                        3.6                         2.4 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                        5.9                         2.4 

$sd
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                  0.8280661                   0.4358663 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                  1.7652982                   0.7622377 

$sdRatio
[1] 1.277229

$skewness
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                  0.3086407                   0.3126147 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                 -0.2694109                  -0.1009166 

$sparsity
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                 0.02205177                  0.03705865 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                 0.01670048                  0.03904820 

$tMean
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                   5.797778                    3.040000 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                   3.842222                    1.202222 

$var
non.aggregated.Sepal.Length  non.aggregated.Sepal.Width 
                  0.6856935                   0.1899794 
non.aggregated.Petal.Length  non.aggregated.Petal.Width 
                  3.1162779                   0.5810063 

$wLambda
[1] 0.02343863

$canCor
      min    median       max 
0.4711970 0.7280090 0.9848209 

$gravity
[1] 3.208281

$cor
      min    median       max 
0.1175698 0.6231906 0.9628654 

$cov
      min    median       max 
0.0424340 0.4229635 1.2956094 

$nrDisc
[1] 2

$eigenvalues
       min     median        max 
0.02383509 0.16044012 4.22824171 

$gMean
      min    median       max 
0.8417075 3.1324323 5.7857204 

$hMean
      min    median       max 
0.4946708 2.8449903 5.7289051 

$iqRange
   min median    max 
   0.5    1.4    3.5 

$kurtosis
       min     median        max 
-1.4168574 -0.9819959  0.1387047 

$mad
    min  median     max 
0.44478 1.03782 1.85325 

$max
   min median    max 
  2.50   5.65   7.90 

$mean
     min   median      max 
1.199333 3.407667 5.843333 

$median
   min median    max 
 1.300  3.675  5.800 

$min
   min median    max 
   0.1    1.5    4.3 

$nrCorAttr
[1] 0.5

$nrNorm
[1] 1

$nrOutliers
[1] 1

$range
   min median    max 
   2.4    3.0    5.9 

$sd
      min    median       max 
0.4358663 0.7951519 1.7652982 

$sdRatio
[1] 1.277229

$skewness
       min     median        max 
-0.2694109  0.1038621  0.3126147 

$sparsity
       min     median        max 
0.01670048 0.02955521 0.03904820 

$tMean
     min   median      max 
1.202222 3.441111 5.797778 

$var
      min    median       max 
0.1899794 0.6333499 3.1162779 

$wLambda
[1] 0.02343863

$canCor
     mean        sd 
0.7280090 0.3631869 

$gravity
[1] 3.208281

$cor
     mean        sd 
0.4850530 0.2124471 

$cov
      mean         sd 
0.07154263 0.07234487 

$nrDisc
[1] 2

$eigenvalues
     mean        sd 
0.1518663 0.2187384 

$gMean
    mean       sd 
3.444764 2.018251 

$hMean
    mean       sd 
3.424851 2.014514 

$iqRange
     mean        sd 
0.4625000 0.2071177 

$kurtosis
       mean          sd 
-0.07541906  0.64345348 

$mad
     mean        sd 
0.3521175 0.1925954 

$max
    mean       sd 
4.258333 2.333339 

$mean
    mean       sd 
3.464500 2.021852 

$median
    mean       sd 
3.458333 2.014587 

$min
    mean       sd 
2.633333 1.669150 

$nrCorAttr
     mean        sd 
0.5000000 0.4409586 

$nrNorm
     mean        sd 
2.6666667 0.5773503 

$nrOutliers
mean   sd 
   2    1 

$range
     mean        sd 
1.6250000 0.7374711 

$sd
     mean        sd 
0.3577631 0.1613754 

$sdRatio
[1] 1.277229

$skewness
     mean        sd 
0.1199744 0.4378457 

$sparsity
      mean         sd 
0.06017094 0.03608774 

$tMean
    mean       sd 
3.455833 2.011284 

$var
     mean        sd 
0.1518663 0.1221409 

$wLambda
[1] 0.02343863

$canCor
     mean        sd 
0.7280090 0.3631869 

$gravity
[1] 3.208281

$cor
     mean        sd 
0.5941160 0.3375443 

$cov
     mean        sd 
0.5966542 0.5582672 

$nrDisc
[1] 2

$eigenvalues
    mean       sd 
1.143239 2.058771 

$gMean
    mean       sd 
3.223073 2.022943 

$hMean
    mean       sd 
2.978389 2.145948 

$iqRange
    mean       sd 
1.700000 1.275408 

$kurtosis
      mean         sd 
-0.8105361  0.7326910 

$mad
     mean        sd 
1.0934175 0.5785782 

$max
    mean       sd 
5.425000 2.443188 

$mean
    mean       sd 
3.464500 1.918485 

$median
    mean       sd 
3.612500 1.919364 

$min
    mean       sd 
1.850000 1.808314 

$nrCorAttr
[1] 0.5

$nrNorm
[1] 1

$nrOutliers
[1] 1

$range
 mean    sd 
3.575 1.650 

$sd
     mean        sd 
0.9478671 0.5712994 

$sdRatio
[1] 1.277229

$skewness
      mean         sd 
0.06273198 0.29439896 

$sparsity
      mean         sd 
0.02871478 0.01103236 

$tMean
    mean       sd 
3.470556 1.904802 

$var
    mean       sd 
1.143239 1.332546 

$wLambda
[1] 0.02343863

mfe documentation built on July 1, 2020, 10:46 p.m.