custom_preprocessing: Tailored custom preprocessing function which removes...

View source: R/custom_preprocessing.R

custom_preprocessing    R Documentation

Tailored custom preprocessing function which removes unnecessary data, imputes the missing fields and selects most important features.

Description

The custom preprocessing function is a more advanced option for the preprocessing step. It should be executed before 'train()', and its result has to be passed to 'train()' as a separate parameter. The output of 'custom_preprocessing()' is not required for 'train()' to work, but using it is highly recommended.

Usage

custom_preprocessing(
  data,
  y,
  type = "auto",
  na_indicators = c(""),
  removal_parameters = list(active_modules = c(duplicate_cols = TRUE, id_like_cols =
    TRUE, static_cols = TRUE, sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols =
    TRUE), id_names = c("id", "nr", "number", "idx", "identification", "index"),
    static_threshold = 0.99, sparse_columns_threshold = 0.3, sparse_rows_threshold = 0.3,
    high_correlation_threshold = 0.7),
  imputation_parameters = list(imputation_method = "median-other", k = 10, m = 5),
  feature_selection_parameters = list(feature_selection_method = "BORUTA", max_features =
    "default", nperm = 1, cutoffPermutations = 20, threadsNumber = NULL, method =
    "estevez"),
  verbose = FALSE
)

Arguments

data

A data source in one of the major R formats: data.table, data.frame, matrix, and so on.

y

A string that indicates a target column name.

type

A character, one of 'binary_clf'/'regression'/'survival'/'auto'/'multiclass' that sets the type of the task. If 'auto' (the default option) then forester will figure out 'type' based on the number of unique values in the 'y' variable, or the presence of time/status columns.

na_indicators

A vector containing the values that will be treated as NA indicators. By default it is c(""). WARNING: Do not include NA or NaN, as these are already handled by another criterion.
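To illustrate what treating a value as an NA indicator means, here is a minimal base R sketch. It is not forester's implementation: the helper name `mark_na_indicators` and the toy data frame are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of forester): replace every cell that equals
# one of the indicator values with NA, column by column.
mark_na_indicators <- function(data, na_indicators = c("")) {
  for (col in names(data)) {
    # %in% compares via match(), so mixed numeric/character indicators
    # are compared as strings; existing NA cells are left untouched.
    hit <- data[[col]] %in% na_indicators
    data[[col]][hit] <- NA
  }
  data
}

toy <- data.frame(
  city  = c("Lisbon", "", "Porto"),
  rooms = c(2, 3, -999),
  stringsAsFactors = FALSE
)

# Treat empty strings and the sentinel -999 as missing.
cleaned <- mark_na_indicators(toy, na_indicators = c("", -999))
colSums(is.na(cleaned))
```

In the sketch, both the empty city name and the sentinel room count end up as NA, which is the kind of cell the later imputation step would then fill in.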

removal_parameters

A list containing the parameters used in the removal of unnecessary data. It needs to be provided as presented in the example, with exactly the same element names. The parameters are described below:

  • `active_modules` A logical vector describing active removal modules. By default it is set as 'c(duplicate_cols = TRUE, id_like_cols = TRUE, static_cols = TRUE, sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols = TRUE)', which is equal to c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE). Setting corrupt_rows to FALSE still results in the removal of observations without target value.

  • `id_names` A vector of strings indicating which column names are perceived as ID-like. By default it is c('id', 'nr', 'number', 'idx', 'identification', 'index').

  • `static_threshold` A numeric value from the [0,1] range, which indicates the maximum share of dominating values allowed in a column. If a feature has a larger share of dominating values, it is removed. By default set to 0.99 (as shown in Usage); a value of 1 indicates that all values are equal.

  • `sparse_columns_threshold` A numeric value from the [0,1] range, which indicates the maximum share of missing values allowed in a column. If a column has more missing fields, it is removed. By default set to 0.3.

  • `sparse_rows_threshold` A numeric value from the [0,1] range, which indicates the maximum share of missing values allowed in an observation. If an observation has more missing fields, it is removed. By default set to 0.3.

  • `high_correlation_threshold` A numeric value from the [0,1] range, which indicates when a correlation is considered high. If a feature's correlation surpasses this threshold, the feature is removed. By default set to 0.7.
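The static and sparse thresholds above can be sketched in base R as follows. This is only an illustration of the idea, not forester's actual code: the helper names `dominant_share` and `missing_share`, the toy data, and the choice of `>=` for the static check versus `>` for the sparse checks are all assumptions.

```r
# Share of the most frequent value in a column (NAs ignored).
dominant_share <- function(x) {
  counts <- table(x)
  if (length(counts) == 0) return(1)  # assumption: an all-NA column counts as static
  max(counts) / sum(counts)
}

# Share of missing values in a vector.
missing_share <- function(x) mean(is.na(x))

toy <- data.frame(
  constant  = rep(1, 10),
  mostly_na = c(1, 2, rep(NA, 8)),
  useful    = 1:10
)

# Threshold values taken from the Usage signature.
static_threshold         <- 0.99
sparse_columns_threshold <- 0.3
sparse_rows_threshold    <- 0.3

# Columns dominated by a single value, and columns with too many NAs.
static_cols <- names(toy)[sapply(toy, dominant_share) >= static_threshold]
sparse_cols <- names(toy)[sapply(toy, missing_share) > sparse_columns_threshold]

# Rows whose share of missing fields exceeds the row threshold.
sparse_rows <- unname(which(rowMeans(is.na(toy)) > sparse_rows_threshold))
```

Here `constant` is flagged as static, `mostly_na` as sparse, and rows 3 to 10 (one missing field out of three) exceed the row threshold. In the real function the modules interact, so the order in which columns and rows are dropped may change which rows are affected.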

imputation_parameters

A list containing the parameters used in the imputation of missing data. It needs to be provided as presented in the example, with exactly the same element names. The parameters are described below:

  • `imputation_method` A string value indicating the imputation method. It must be one of 'median-other' (default), 'median-frequency', 'knn', or 'mice'.

  • `k` An integer describing the number of nearest neighbours to use. By default set to 10. The parameter is applicable only if the selected 'imputation_method' is 'knn'.

  • `m` An integer describing the number of multiple imputations to use. By default set to 5. The parameter is applicable only if the selected 'imputation_method' is 'mice'.
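Because the parameter list has to keep all of its element names, one convenient pattern is to start from the defaults and override only what changes. The sketch below uses base R's `modifyList()`; the variable names `default_imputation` and `knn_imputation` are assumptions for illustration, and the default values are copied from the Usage section.

```r
# Defaults copied from the Usage section.
default_imputation <- list(imputation_method = "median-other", k = 10, m = 5)

# Switch to kNN imputation with 5 neighbours. `m` is kept so that the list
# still has all three expected names, even though it only matters for 'mice'.
knn_imputation <- modifyList(
  default_imputation,
  list(imputation_method = "knn", k = 5)
)

names(knn_imputation)
```

The resulting list can then be passed as `imputation_parameters = knn_imputation` in the call to `custom_preprocessing()`.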

feature_selection_parameters

A list containing the parameters used in the feature selection process. It needs to be provided as presented in the example, with exactly the same element names. The parameters are described below:

  • `feature_selection_method` A string value indicating the feature selection method. It must be one of 'VI', 'MCFS', 'MI', 'BORUTA' (default), or 'none' to skip feature selection.

  • `max_features` A positive integer value describing the desired number of selected features. The initial value is 'default', which resolves to min(10, ncol(data) - 1) for 'VI' and 'MI', and to NULL (the number of relevant features is chosen by the method) for 'MCFS'. Only 'MCFS' can use the NULL value. 'BORUTA' doesn't use this parameter.

  • `nperm` An integer describing the number of permutations performed, relevant for the 'VI' method. By default set to 1.

  • `cutoffPermutations` A non-negative integer value that determines the number of permutation runs. At least 20 permutations are needed for a statistically significant result. The minimum value of this parameter is 3; however, if it is set to 0, the permutation method is turned off. By default set to 20. Relevant for the 'MCFS' method.

  • `threadsNumber` A positive integer value describing the number of threads to use in computation. More threads need more CPU cores, and memory usage is slightly higher. It is recommended to set this value to at most the number of available CPU cores. By default set to NULL, which uses the maximal number of cores minus 1. Relevant for the 'MCFS' method.

  • `method` A string that indicates which algorithm will be used for the 'MI' method. Available options are the default 'estevez', which works well for smaller datasets but can raise errors for bigger ones, and the simpler 'peng'. More details are available in the documentation of the varrank method (see ?varrank).
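The rule for the 'default' value of `max_features` described above can be written out as a short sketch. The helper `resolve_max_features` is hypothetical and not part of forester; it only mirrors the documented rule so that the resulting number of features is easy to predict for a given dataset.

```r
# Hypothetical helper (not part of forester) mirroring the documented rule.
resolve_max_features <- function(max_features, method, data) {
  # An explicit value is used as given.
  if (!identical(max_features, "default")) return(max_features)
  switch(method,
    VI   = ,                            # falls through to the 'MI' rule
    MI   = min(10, ncol(data) - 1),
    MCFS = NULL,                        # the method picks the number itself
    NULL                                # 'BORUTA' and 'none' ignore max_features
  )
}

narrow <- data.frame(matrix(0, nrow = 3, ncol = 6))   # 5 features + target
wide   <- data.frame(matrix(0, nrow = 3, ncol = 25))  # 24 features + target

resolve_max_features("default", "VI", narrow)
resolve_max_features("default", "MI", wide)
resolve_max_features("default", "MCFS", wide)
```

With six columns the 'VI' and 'MI' default resolves to 5, while any dataset with more than eleven columns is capped at 10.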

verbose

A logical value. If set to TRUE, all information about the preprocessing process is printed; if FALSE, none is.

Value

A list containing five objects:

  • `data` A dataset after the preprocessing,

  • `rm_colnames` The names of removed columns,

  • `rm_rows` The indexes of removed observations,

  • `bin_labels` The text labels before target binarization,

  • `custom_params` The list of all parameters specified for this function.

Examples

## Not run: 
k <- custom_preprocessing(data = lisbon,
                     y = 'Price',
                     na_indicators = c(''),
                     removal_parameters = list(
                       active_modules = c(duplicate_cols = TRUE, id_like_cols    = TRUE,
                                          static_cols    = TRUE, sparse_cols     = TRUE,
                                          corrupt_rows   = TRUE, correlated_cols = TRUE),
                       id_names = c('id', 'nr', 'number', 'idx', 'identification', 'index'),
                       static_threshold           = 0.99,
                       sparse_columns_threshold   = 0.3,
                       sparse_rows_threshold      = 0.3,
                       high_correlation_threshold = 0.7
                     ),
                     imputation_parameters = list(
                       imputation_method = 'median-other',
                       k = 10,
                       m = 5
                     ),
                     feature_selection_parameters = list(
                       feature_selection_method = 'BORUTA',
                       max_features = 'default',
                       nperm = 1,
                       cutoffPermutations = 20,
                       threadsNumber = NULL,
                       method = 'estevez'
                     ),
                     verbose = FALSE)

# If you want to obtain the same results quickly, just use the code below:
do.call(custom_preprocessing, k$custom_params)


## End(Not run)

ModelOriented/forester documentation built on June 6, 2024, 7:29 a.m.