Using `smartdata`


Purpose of smartdata

The R programming language has a wide variety of packages that target machine learning topics. However, each package has its own API, which makes using several of them together quite time-consuming. The main purpose of smartdata is to provide a common interface to a collection of widely used machine learning packages, so that using methods from different libraries becomes easier. It also standardizes argument names: if one method has a parameter num_iterations and another has a parameter iterations with a similar meaning, both methods take the same parameter name in smartdata.

smartdata includes preprocessing algorithms for oversampling, instance selection, feature selection, normalization, discretization, space transformation, outliers treatment, missing values imputation and noise cleaning.

Each of the aforementioned topics has its corresponding wrapper: oversample, instance_selection, feature_selection, normalize, discretize, space_transformation, clean_outliers, impute_missing and clean_noise, respectively.

In addition to that, magrittr has been used to provide a pipeline, so that a typical preprocessing workflow can be expressed as:

result <- dataset %>% impute_missing %>% clean_noise %>% oversample %>% feature_selection
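
As a reminder of what the pipe does, `x %>% f(y)` is evaluated as `f(x, y)`, so a chain of wrappers is only a more readable way of writing nested calls. The following minimal sketch illustrates the equivalence with base R functions instead of smartdata wrappers; the functions chosen are just an assumption for illustration.

```r
library(magrittr)

# Piped form: each step receives the previous result as its first argument
piped <- iris %>% subset(Sepal.Length > 5) %>% head(3)

# Equivalent nested form of the same computation
nested <- head(subset(iris, Sepal.Length > 5), 3)

identical(piped, nested)
```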

Basic help

To check the methods a certain wrapper can be called with, we can simply call the help function included in smartdata, which_options, with the name of the wrapper, for example which_options("instance_selection").


To check parameters for a wrapper, it suffices to do ?{name of the wrapper}. For example: ?instance_selection, which would output:


instance_selection(dataset, method, class_attr = "Class", ...)


dataset     we want to perform an instance selection on
method      selected method of instance selection
class_attr  character. Indicates the class attribute from dataset. Must exist in it
...         Further arguments for method

The arguments that appear most often in wrapper documentation are dataset, method, class_attr and ..., as shown above, plus exclude in the wrappers that support it.

Moreover, if we already know which method we want to use but we do not recall its arguments, we can check its list of parameters with:

which_options("instance_selection", "multiedit")

That call provides a brief description of each possible parameter, as well as a reference to the original function, in case the description of some parameter is not clear enough (although the mapping between the original function's arguments and the smartdata wrapper is not exact).

In summary, any of the following would be a valid call for multiedit, with fewer method-specific arguments supplied each time:

super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2, 
                                          null_passes = 10, class_attr = "Species")


super_iris <- iris %>% instance_selection("multiedit", k = 3, null_passes = 10,                                           
                                          class_attr = "Species")


super_iris <- iris %>% instance_selection("multiedit", k = 3, 
                                          class_attr = "Species")

Or even:

super_iris <- iris %>% instance_selection("multiedit", class_attr = "Species")

Techniques included in smartdata

Let $S = \{(x_i, y_i)\}_{i=1}^m$ be our set of data from this moment on, where $x_i \in \mathcal{X} \subset \mathbb{R}^n$ and $y_i \in \mathcal{Y}$, with $\mathcal{Y}$ a finite set of labels. Each pair $(x_i, y_i)$ is called an instance.


Oversampling

Given $\mathcal{Y}$ with $|\mathcal{Y}| = 2$, if there are more instances labeled with one class than with the other, oversampling generates a set $E$ of synthetic instances labeled with the minority class, so that the resulting dataset $S \cup E$ has better characteristics for classification.

Since oversampling replicates or generates artificial instances belonging to the minority class, a class attribute must be indicated with the class_attr argument, as well as a ratio (the desired proportion of imbalance, when applicable), where this ratio is computed as: \[ \frac{\textrm{number of minority instances}}{\textrm{number of majority instances}} \]
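
The ratio above can be worked out by hand in base R. The sketch below uses a hypothetical label vector and assumes that the ratio argument is read as the target minority/majority proportion; it only illustrates the arithmetic, not how any oversampling method generates instances.

```r
# Toy two-class label vector (hypothetical data, for illustration only)
labels <- factor(c(rep("positive", 20), rep("negative", 80)))

# Current imbalance ratio: minority count divided by majority count
counts <- table(labels)
imbalance_ratio <- min(counts) / max(counts)
imbalance_ratio

# Synthetic minority instances needed to reach a target ratio of 0.8
# (assumption: the majority class is left untouched)
target_ratio <- 0.8
needed <- ceiling(target_ratio * max(counts)) - min(counts)
needed
```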

Also, a filtering argument can be provided, indicating whether to perform a filtering of the generated instances using NEATER.

Possible methods can be listed with which_options("oversample").

Only if the chosen method is wRACOG can an additional parameter wrapper, with possible values 'KNN' and 'C5.0', be provided. That parameter denotes the classifier used to select instances.

As an example:

data(iris0, package = "imbalance")
super_iris <- oversample(iris0, method = "MWMOTE", class_attr = "Class",
                         ratio = 0.8, filtering = TRUE)

Instance selection

Instance selection consists of picking a subset $S' \subseteq S$ so that certain characteristics of the original sample (such as the distribution of classes) are preserved, or at least so that most of the representative instances are kept.

Since the methods try to pick instances while preserving the original class distribution, class_attr must be supplied.
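
To make the idea of preserving the class distribution concrete, here is a minimal sketch in base R that uses naive stratified sampling. This is not one of the instance selection methods offered by the package, and the fraction kept is an arbitrary assumption.

```r
set.seed(42)  # reproducible sampling

# Keep a fixed fraction of the rows of every class (naive stratified sampling)
keep_fraction <- 0.2
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
kept_rows <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(keep_fraction * length(rows)))
}))

iris_subset <- iris[sort(unname(kept_rows)), ]

# Class proportions before and after the selection
prop.table(table(iris$Species))
prop.table(table(iris_subset$Species))
```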

Possible methods can be listed with which_options("instance_selection").

As an example:

super_iris <- instance_selection(iris, method = "CNN", class_attr = "Species")

Feature selection

Feature selection consists of, given a subset of features $T \subseteq \{1, \ldots, n\}$, projecting the tuples of $S = \{(x_i, y_i)\}_{i=1}^m$ onto the set of features given by $T$. That is, if we define \[ p_T((x_{i1}, \ldots, x_{in})) = (x_{ij})_{j \in T} \] as the projection onto the features in $T$, a feature selection that picks the features $T$ results in the set $S' = \{(p_T(x_i), y_i)\}_{i=1}^m$.
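
In data frame terms, the projection is just a column subset that keeps the label. The following sketch assumes a hand-picked $T$ (the two petal measurements of iris); an actual feature selection method would choose $T$ automatically.

```r
# Hypothetical choice of T: keep only the two petal measurements
selected <- c("Petal.Length", "Petal.Width")

# Apply the projection to every instance, keeping the label alongside
iris_projected <- iris[, c(selected, "Species")]
head(iris_projected, 3)
```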

Apart from the parameters for each specific method, class_attr must be supplied for every method, and an additional parameter exclude can be supplied with a vector of feature names, so that those features are stripped before the feature selection and joined back after the procedure.

Possible methods can be listed with which_options("feature_selection").

As an example:

super_iris <- feature_selection(iris, "Boruta", class_attr = "Species")


Normalization

Normalization converts the sample data $S$ into another dataset $S'$ where each tuple is rescaled so that, for example, the data end up with zero mean and unit standard deviation (feature-wise or considered as elements of $\mathbb{R}^n$), or all the data are mapped to the $[0,1]$ interval (e.g. dividing by the maximum feature-wise).
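
Two common rescalings can be sketched in base R: min-max scaling to $[0,1]$ and z-score standardization (zero mean and unit standard deviation per column). The helper name min_max is hypothetical, and the code is only an illustration of the arithmetic, not the implementation behind the wrapper.

```r
numeric_cols <- setdiff(names(iris), "Species")

# Min-max scaling: map every numeric column to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
iris_minmax <- iris
iris_minmax[numeric_cols] <- lapply(iris[numeric_cols], min_max)

# Z-score standardization: mean 0 and standard deviation 1 per column
iris_z <- iris
iris_z[numeric_cols] <- lapply(iris[numeric_cols],
                               function(x) (x - mean(x)) / sd(x))

# Column-wise ranges after min-max, standard deviations after z-score
sapply(iris_minmax[numeric_cols], range)
sapply(iris_z[numeric_cols], sd)
```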

Possible methods can be listed with which_options("normalize").

Apart from the parameters for each specific method, an additional parameter exclude can be supplied with a vector of feature names, so that those features are stripped before the normalization and joined back after the procedure, with the original data unchanged.

As an example:

super_iris <- normalize(iris, method = "min_max", exclude = "Species", by = "column")


Discretization

Discretizing a dataset consists of turning continuous variables into discrete attributes, that is, taking an attribute that can take real values and binning it so that the variable takes values in a finite set. This can be an important step before procedures such as classification.
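
The two simplest binning strategies, equal width and equal frequency, can be sketched with base R's cut. The number of bins (3) is an assumption, and the code is not claimed to reproduce what the wrapper's methods of similar name do internally.

```r
# Equal-width binning of one continuous attribute into 3 intervals
sepal_width_bins <- cut(iris$Sepal.Length, breaks = 3)
table(sepal_width_bins)

# Equal-frequency binning: cut points placed at the tertiles
tertiles <- quantile(iris$Sepal.Length, probs = seq(0, 1, length.out = 4))
sepal_freq_bins <- cut(iris$Sepal.Length, breaks = tertiles,
                       include.lowest = TRUE)
table(sepal_freq_bins)
```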

Possible methods can be listed with which_options("discretize").

Similar to feature selection or normalization wrappers, discretize needs a class_attr in certain cases (methods like equalfreq, equalwidth or globalequalwidth do not need it, and will output an error if this argument is supplied to discretize).

Also an exclude argument can be passed to the method, indicating which variables are going to be ignored when performing the process.

As an example:

super_iris <- discretize(iris, method = "chi2", class_attr = "Species")

Space transformation

Space transformation changes a training set $S = \{(x_i^{(1)}, \ldots, x_i^{(n)})\}_{i=1,\ldots,m}$, where $x_i^{(j)} \in \mathbb{R}$, into another set $S' = \{(\bar{x}_i^{(1)}, \ldots, \bar{x}_i^{(k)})\}_{i=1,\ldots,m}$ with better properties for machine learning methods.
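
As a familiar instance of such a map from $n$ to $k$ coordinates, the sketch below applies principal component analysis with base R's prcomp. PCA is used here only for illustration; it is not claimed to be one of the methods the wrapper offers, and $k = 2$ is an arbitrary choice.

```r
# Numeric part of iris: n = 4 original features
x <- iris[, 1:4]

# PCA on centered and scaled data; keep the first k = 2 components
pca <- prcomp(x, center = TRUE, scale. = TRUE)
k <- 2
x_transformed <- as.data.frame(pca$x[, 1:k])

# Non-numeric column appended unchanged after the transformation
iris_pca <- cbind(x_transformed, Species = iris$Species)
dim(iris_pca)
```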

Possible methods can be listed with which_options("space_transformation").

This treatment accepts an exclude argument in the form of a vector of column names. Those columns are ignored during the procedure and appended afterwards, unchanged and in the same order, before the non-numeric columns (which are appended in the order in which they originally appeared).

As an example:

data(ecoli1, package = "imbalance")
super_ecoli <- space_transformation(ecoli1, "lle_knn", k = 3, num_features = 2,
                                   regularization = 1, exclude = c("Mcg", "Alm1"))


Outliers

Outliers are instances that lie far from the others according to some predefined distance or measure.

Possible methods can be listed with which_options("clean_outliers").

When treating outliers, those can be considered as whole rows of the dataset (that is, an outlier is a whole instance), or as certain values inside each column (that is, outliers inside attributes). The first case is called the multivariate approach, whereas the second is called univariate.

The multivariate approach implies deleting the outlier instances from the dataset. Univariate outliers can be treated by imputing the median or the mean of the column.
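
Both treatments can be sketched in base R. The data, the z-score threshold and the Mahalanobis cutoff below are assumptions chosen for illustration; they are not the detection rules used by the package.

```r
# Univariate case: hypothetical column with one extreme value
x <- c(5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9, 5.3, 25.0)

# Flag values whose absolute z-score exceeds an assumed threshold of 2.5
z <- (x - mean(x)) / sd(x)
is_outlier <- abs(z) > 2.5

# Replace flagged values with the mean of the remaining values
x_clean <- x
x_clean[is_outlier] <- mean(x[!is_outlier])

# Multivariate case: drop whole rows that are far from the centre,
# measured by squared Mahalanobis distance (assumed 97.5% chi-square cutoff)
features <- iris[, 1:4]
d2 <- mahalanobis(features, colMeans(features), cov(features))
iris_kept <- iris[d2 <= qchisq(0.975, df = ncol(features)), ]
nrow(iris_kept)
```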

As an example:

super_iris <- clean_outliers(iris, method = "multivariate", type = "adj")
super_iris <- clean_outliers(iris, method = "univariate", type = "z", fill = "mean")

Missing values

Tuples with missing values lack the value of one or more attributes, so a preliminary imputation can greatly benefit a subsequent machine learning technique, such as classification.
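
The simplest form of imputation replaces each missing entry with the mean of the observed values in its column. The sketch below uses a hypothetical data frame and a hypothetical helper impute_mean; the methods offered by the wrapper are more elaborate than this.

```r
# Toy data frame with missing entries (hypothetical values)
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  weight = c(70.5, 82.0, NA, 64.3, 77.1)
)

# Replace each NA with the mean of the observed values in its column
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
df_imputed <- as.data.frame(lapply(df, impute_mean))

# Missing entries per column before and after imputation
colSums(is.na(df))
colSums(is.na(df_imputed))
```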

Possible methods can be listed with which_options("impute_missing").

An exclude argument in the form of a vector of column names can be passed to impute_missing, so that those columns are ignored during the procedure and appended unchanged after it.

As an example:

data(nhanes, package = "mice")
super_nhanes <- impute_missing(nhanes, "gibbs_sampling")


Noise

Noisy data are instances that include meaningless information in addition to the true information they encode. Removing or repairing those instances before classification can improve its results.
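
One classic way of removing such instances is to discard every instance whose label disagrees with the majority label of its nearest neighbours. The sketch below follows that idea with knn.cv from the class package and an assumed k = 3; it is written in the spirit of edited nearest neighbours filters and is not the package's own implementation of any method.

```r
library(class)  # recommended package shipped with standard R distributions

set.seed(1)  # knn.cv breaks ties at random

features <- iris[, 1:4]
labels <- iris$Species

# Leave-one-out k-NN prediction for every instance
predicted <- knn.cv(train = features, cl = labels, k = 3)

# Instances whose label disagrees with their neighbours are treated as noise
is_noisy <- as.character(predicted) != as.character(labels)
iris_clean <- iris[!is_noisy, ]

sum(is_noisy)
```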

Possible methods can be listed with which_options("clean_noise").

A class_attr must be supplied.

As an example:

super_iris <- clean_noise(iris, method = "AENN", class_attr = "Species", k = 3)


smartdata documentation built on Dec. 19, 2019, 1:08 a.m.