Extracts meta-features from datasets to support the design of recommendation systems based on Meta-Learning (MtL). The meta-features, also called characterization measures, characterize the complexity of datasets and provide estimates of algorithm performance. The package contains both standard and more recent characterization measures. By making a large set of meta-feature extraction functions available, the package supports comprehensive data characterization, deep data exploration, and a wide range of MtL-based data analyses.
In MtL, meta-features are designed to extract general properties that characterize datasets. The meta-feature values should provide relevant evidence about the performance of algorithms, allowing the design of MtL-based recommendation systems. Thus, these measures must be able to predict, at a low computational cost, the performance of the algorithms under evaluation. In this package, the meta-feature measures are divided into groups, each described in its own section below.
In the following sections we briefly introduce how to use the mfe package to extract all the measures using the standard method, as well as how to extract specific measures using the methods for each group. Once the package is loaded, the vignette is also available inside R with the command browseVignettes.
The standard way to extract meta-features is to use the metafeatures method. The method can be called with a symbolic description of the model (formula) or with a data frame. The parameters are the dataset and the groups of measures to be extracted. By default, the method extracts all the measures. For instance:
library(mfe)
## Extract all measures using formula
iris.info <- metafeatures(Species ~ ., iris)
## Extract all measures using data frame
iris.info <- metafeatures(iris[,1:4], iris[,5])
## Extract general, statistical and information-theoretic measures
iris.info <- metafeatures(Species ~ ., iris, groups=c("general", "statistical", "infotheo"))
Several measures return more than one value. To aggregate them, post processing methods can be used. It is possible to compute min, max, mean, median, kurtosis, standard deviation, among others. The default methods are the mean and the sd. For instance:
## Compute all measures using min, median and max
iris.info <- metafeatures(Species ~ ., iris, summary=c("min", "median", "max"))
## Compute all measures using quantile
iris.info <- metafeatures(Species ~ ., iris, summary="quantile")
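To make the idea of summarization concrete, the following base R sketch mimics what a post processing step does to a multi-valued measure. It does not call mfe and is not the package's implementation; the helper summarise_values is a hypothetical name introduced only for illustration.

```r
# Illustration only: base R, no mfe calls. `summarise_values` is hypothetical.

# A multi-valued measure: absolute correlation of each pair of attributes.
cm <- abs(cor(iris[, 1:4]))
vals <- cm[upper.tri(cm)]   # 4 attributes give 6 pairs

# Apply each summary function, given by name, to the raw values.
summarise_values <- function(v, funs = c("mean", "sd")) {
  out <- lapply(funs, function(f) match.fun(f)(v))
  names(out) <- funs
  unlist(out)   # functions returning several values (e.g. quantile) keep them all
}

summarise_values(vals)                        # two values: mean and sd
summarise_values(vals, c("min", "quantile"))  # one value plus five quantiles
```

A summary function that returns a single number yields one entry per measure, while one that returns a distribution (such as quantile) yields several.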
To customize the measure extraction, it is necessary to use the specific methods for each group of measures. For instance, infotheo and statistical compute the information theoretical and the statistical measures, respectively. The following examples illustrate these cases:
## Extract two information theoretical measures
inf.iris <- infotheo(Species ~ ., iris, features=c("attrEnt", "jointEnt"))
## Extract three statistical measures
stat.iris <- statistical(Species ~ ., iris, features=c("cancor", "cor", "iqr"))
## Extract the histogram for the correlation measure
hist.iris <- statistical(Species ~ ., iris, features="cor", summary="hist")
Unlike the metafeatures method, these methods receive a parameter called features, which defines the features required, and return a list instead of a numeric vector. In addition, some groups can be customized using additional arguments.
There are five measure groups: general information about the dataset, statistical information, information theoretical descriptors, measures that describe characteristics of a decision tree (DT) model induced from the data, and landmarks, which represent the performance of simple algorithms applied to the dataset. The following example shows the available groups:
## Show the available groups
ls.metafeatures()
These are the simplest measures for extracting general properties of the datasets. For instance, nrAttr and nrClass are the total number of attributes in the dataset and the number of output values (classes) in the dataset, respectively. To list the measures of this group use ls.general(). The following examples illustrate these measures:
## Show the available general measures
ls.general()
## Extract all general measures
general.iris <- general(Species ~ ., iris)
## Extract two general measures
general(Species ~ ., iris, features=c("nrAttr", "nrClass"))
The general measures return a list named by the requested measures. The post.processing methods are applied only to the freqClass meta-feature. For instance, to extract the minimum, the maximum and the standard deviation of the class proportions use:
## Extract the min, max and sd of the class proportions
general(Species ~ ., iris, features="freqClass", summary=c("min", "max", "sd"))
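As a rough cross-check of what these measures mean, the same quantities can be computed by hand in base R. This sketch does not call mfe; it assumes that nrAttr counts the predictive attributes only and that freqClass is the proportion of instances in each class, which may differ in detail from the package.

```r
# Illustration only: base R equivalents of three general measures.
x <- iris[, 1:4]     # predictive attributes
y <- iris$Species    # class attribute

nr_attr    <- ncol(x)                            # number of attributes
nr_class   <- nlevels(y)                         # number of classes
freq_class <- as.numeric(table(y)) / length(y)   # proportion of each class

# freqClass has one value per class, so it is the measure that gets summarized.
c(min = min(freq_class), max = max(freq_class), sd = sd(freq_class))
```

Because the three iris classes are perfectly balanced, the minimum and maximum proportions coincide and the standard deviation is zero.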
Statistical meta-features are the standard statistical measures used to describe the numerical properties of a distribution of data. As they require only numerical attributes, categorical data are transformed to numerical. For instance, cor and skewness are the absolute correlation between each pair of attributes and the skewness of the numeric attributes in the dataset, respectively. To list the measures of this group use ls.statistical(). The following examples illustrate these measures:
## Show the available statistical measures
ls.statistical()
## Extract all statistical measures
stat.iris <- statistical(Species ~ ., iris)
## Extract two statistical measures
statistical(Species ~ ., iris, features=c("cor", "skewness"))
The statistical group accepts two additional parameters, by.class and transform. For the former, the default is by.class=FALSE, which means that the meta-features are computed without considering the class values; otherwise, the measures are extracted from the instances of each class separately. For the latter, the default is transform=TRUE, which means that categorical attributes are transformed to numeric. The following examples show the use of these two parameters:
## Extract correlation using instances by classes
statistical(Species ~ ., iris, features="cor", by.class=TRUE)
## Ignore the categorical attributes (no transformation to numeric)
aux <- cbind(class=iris$Species, iris)
statistical(Species ~ ., aux, transform=FALSE)
Note that in the first example the values and the cardinality of the measure are different, since the correlation between the attributes was computed using the instances of each class separately. The post.processing methods are applied to these measures since they return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the post.processing method.
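The change in cardinality can be seen with a small base R sketch. It does not call mfe; the helper abs_pair_cor is a hypothetical name, and the package may order or aggregate the per-class values differently.

```r
# Illustration only: base R, no mfe calls. `abs_pair_cor` is hypothetical.

# Absolute correlation of each pair of attributes (upper triangle only).
abs_pair_cor <- function(x) {
  cm <- abs(cor(x))
  cm[upper.tri(cm)]
}

x <- iris[, 1:4]
y <- iris$Species

all_vals   <- abs_pair_cor(x)                            # whole dataset
class_vals <- unlist(lapply(split(x, y), abs_pair_cor))  # one set per class

length(all_vals)     # 6 pairs for 4 attributes
length(class_vals)   # 6 pairs for each of the 3 classes, 18 in total

# Either set of values can then be summarized, e.g. with mean and sd.
c(mean = mean(class_vals), sd = sd(class_vals))
```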
Information theoretical meta-features are particularly appropriate to describe discrete (categorical) attributes, but they also fit continuous ones using a discretization process. These measures are based on information theory. For instance, normClassEnt and mutInf are the normalized entropy of the class and the common information shared between each attribute and the class in the dataset, respectively. To list the measures of this group use ls.infotheo(). The following examples illustrate these measures:
## Show the available information theoretical measures
ls.infotheo()
## Extract all information theoretical measures
inf.iris <- infotheo(Species ~ ., iris)
## Extract two information theoretical measures
infotheo(Species ~ ., iris, features=c("normClassEnt", "mutInf"))
The information theoretical group accepts one additional parameter, transform. With the default value transform=TRUE, the continuous attributes are discretized. The following example shows the use of this parameter:
## Ignore the discretization process
aux <- cbind(class=iris$Species, iris)
infotheo(Species ~ ., aux, transform=FALSE)
The information theoretical measures return a list named by the requested measures. The post.processing methods are applied to some measures since they return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the section Post Processing Methods.
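For readers who want to see the underlying quantities, here is a minimal base R sketch of a normalized class entropy and of the mutual information between each attribute and the class. It does not call mfe. The helper names are hypothetical, and three choices are assumptions rather than the package's documented behaviour: entropy is measured in bits, it is normalized by the logarithm of the number of classes, and continuous attributes are discretized into five equal-width bins.

```r
# Illustration only: base R, no mfe calls. Helper names are hypothetical.

# Shannon entropy (in bits) of a discrete variable.
entropy <- function(f) {
  p <- table(f) / length(f)
  p <- p[p > 0]              # empty categories contribute nothing
  -sum(p * log2(p))
}

# Mutual information: I(A; B) = H(A) + H(B) - H(A, B).
mut_inf <- function(a, b) {
  entropy(a) + entropy(b) - entropy(interaction(a, b, drop = TRUE))
}

y <- iris$Species

# Class entropy normalized by its maximum, log2(number of classes).
norm_class_ent <- entropy(y) / log2(nlevels(y))

# Discretize each continuous attribute (assumed: 5 equal-width bins).
x_disc <- lapply(iris[, 1:4], cut, breaks = 5)

# One mutual information value per attribute.
mi <- sapply(x_disc, mut_inf, b = y)
mi
```

Since the iris classes are balanced, the normalized class entropy reaches its maximum of 1. The mutual information returns one value per attribute, which is why a summary step is needed for it.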
These measures describe characteristics of the investigated models. These meta-features can include, for example, the description of the decision tree (DT) induced for a dataset, such as its number of leaves (leaves) and its number of nodes (nodes). The following examples illustrate these measures:
## Show the available model.based measures
ls.model.based()
## Extract all model.based measures
model.iris <- model.based(Species ~ ., iris)
## Extract two model.based measures
model.based(Species ~ ., iris, features=c("leaves", "nodes"))
The DT model based measures return a list named by the requested measures. The post.processing methods are applied to these measures since they return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the post.processing method.
Landmarking measures are simple and fast algorithms, from which performance characteristics can be extracted. These measures include the performance of simple and efficient learning algorithms like Naive Bayes (naiveBayes) and 1-Nearest Neighbor (oneNN). The following examples illustrate these measures:
## Show the available landmarking measures
ls.landmarking()
## Extract all landmarking measures
land.iris <- landmarking(Species ~ ., iris)
## Extract two landmarking measures
landmarking(Species ~ ., iris, features=c("naiveBayes", "oneNN"))
Extracting the performance of these measures without a cross-validation step can cause the model to overfit the data. Therefore, the landmarking function has the parameter folds, to define the number of folds k used in k-fold cross-validation, and the parameter score, to select the performance measure. The following examples show how to set these values:
## Extract one landmarking measure with folds=2
landmarking(Species ~ ., iris, features="naiveBayes", folds=2)
## Extract one landmarking measure using the kappa score
landmarking(Species ~ ., iris, features="naiveBayes", score="kappa")
The landmarking measures return a list named by the requested measures. The post.processing methods are applied to these measures since they return multiple values. To define which of them should be applied, use the summary parameter, as detailed in the post.processing method.
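The reason a landmarker yields multiple values is that it produces one score per fold. The following base R sketch shows this for a hand-written 1-Nearest Neighbor classifier evaluated with k-fold cross-validation and accuracy. It does not call mfe; the helper names, the random (non-stratified) fold assignment and the fixed seed are all assumptions made for illustration.

```r
# Illustration only: base R, no mfe calls. Helper names are hypothetical.

# Predict the class of each test row from its closest training row.
one_nn_predict <- function(train_x, train_y, test_x) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  idx <- apply(test_x, 1, function(row) {
    d <- colSums((t(train_x) - row)^2)   # squared Euclidean distances
    which.min(d)
  })
  train_y[idx]
}

# Accuracy of 1-NN on each fold of a k-fold cross-validation.
cv_accuracy <- function(x, y, folds = 10, seed = 1) {
  set.seed(seed)
  fold_id <- sample(rep_len(seq_len(folds), nrow(x)))   # random fold per row
  sapply(seq_len(folds), function(k) {
    test <- fold_id == k
    pred <- one_nn_predict(x[!test, ], y[!test], x[test, , drop = FALSE])
    mean(pred == y[test])
  })
}

acc <- cv_accuracy(iris[, 1:4], iris$Species, folds = 2)
acc                                  # one accuracy value per fold
c(mean = mean(acc), sd = sd(acc))    # summarized as a post processing step
```

Each fold is scored on instances that were not used for training, which is what guards against the overfitting mentioned above.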
The relative group is landmarking with sampling and ranking strategies. The sampling strategy decreases the computational cost of landmarking by selecting a subsample of the original examples. The ranking strategy captures relative information about the performance of the algorithms. The following examples illustrate these measures:
## Show the available relative measures
ls.relative()
## Extract all relative measures
rel.iris <- relative(Species ~ ., iris)
## Extract all relative measures with half of the samples
relative(Species ~ ., iris, size=0.5)
## Extract two relative measures
relative(Species ~ ., iris, features=c("naiveBayes", "oneNN"))
Clustering measures extract information about the dataset based on external validation indexes. The main idea is to measure the complexity of the dataset using indexes that relate the predictive attributes to the label. The following examples illustrate these measures:
## Show the available clustering measures
ls.clustering()
## Extract all clustering measures
clus.iris <- clustering(Species ~ ., iris)
## Extract two clustering measures
clustering(Species ~ ., iris, features=c("vdu", "vdb"))
Several meta-features generate multiple values, and mean and sd are the default methods used to summarize these values. To increase flexibility, the mfe package implements post processing methods to deal with measures that return multiple values. These methods can produce a descriptive statistic (resulting in a single value) or a distribution (resulting in multiple values). The post processing methods are set using the parameter summary. It is possible to compute min, max, mean, median, kurtosis, standard deviation, among others. Any R method can be used, as illustrated in the following examples:
## Apply several statistical measures as post processing
statistical(Species ~ ., iris, "cor", summary=c("kurtosis", "max", "mean", "median", "min", "sd", "skewness", "var"))
## Apply quantile as post processing method
statistical(Species ~ ., iris, "cor", summary="quantile")
## Get the default values without summarizing them
statistical(Species ~ ., iris, "cor", summary=c())
Beyond these default R methods, two additional post processing methods are available in the mfe package: hist and non.aggregated. The first computes a histogram of the values and returns the frequencies in each bin. The extra parameter bins can be used to define the number of values to be returned. The parameters min and max are used to define the range of the data. The second is a way to obtain all the values from the measure and has the same effect as using an empty list. The following code illustrates the use of these post processing methods:
## Apply histogram as post processing method
statistical(Species ~ ., iris, "cor", summary="hist")
## Apply histogram as post processing method and customize it
statistical(Species ~ ., iris, "cor", summary="hist", bins=5, min=0, max=1)
## Extract all correlation values
statistical(Species ~ ., iris, "cor", summary="non.aggregated")
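The effect of a histogram summary can be approximated in base R as follows. This sketch does not call mfe; the helper hist_summary and its arguments lo and hi are hypothetical stand-ins for the min and max parameters, and returning relative frequencies (rather than raw counts) is an assumption.

```r
# Illustration only: base R, no mfe calls. `hist_summary` is hypothetical.

# Bin the values into `bins` equal-width intervals over [lo, hi].
hist_summary <- function(v, bins = 10, lo = min(v), hi = max(v)) {
  breaks <- seq(lo, hi, length.out = bins + 1)
  counts <- hist(v, breaks = breaks, plot = FALSE)$counts
  counts / length(v)   # relative frequency per bin (an assumption)
}

# Multi-valued measure: absolute correlation of each pair of attributes.
cm <- abs(cor(iris[, 1:4]))
vals <- cm[upper.tri(cm)]

# Five bins over the fixed range [0, 1], whatever the observed values are.
hist_summary(vals, bins = 5, lo = 0, hi = 1)
```

Fixing the range makes the result comparable across datasets, because each position in the returned vector always refers to the same interval.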
It is also possible to define a custom post processing method, like this:
## Compute the absolute difference between the mean and the median
my.method <- function(x, ...) abs(mean(x) - median(x))
## Using the user defined post processing method
statistical(Species ~ ., iris, "cor", summary="my.method")
This paper introduced the mfe package, which extracts meta-features from datasets. The functions supplied by the package can be used both in MtL experiments and for data analysis based on characterization measures that describe datasets. Currently, six groups of meta-features can be extracted for any classification dataset. These groups and features represent the standard and the state-of-the-art characterization measures.
The mfe package was designed to be easily customized and extended. Its development will continue in the near future by including new meta-features, groups of measures supporting regression problems, and MtL evaluation measures.