View source: R/custom_preprocessing.R
custom_preprocessing | R Documentation |
The custom preprocessing function offers a more advanced alternative for the preprocessing step. Run it before 'train()' and pass its result to 'train()' as a separate parameter. The output of 'custom_preprocessing()' is not required for 'train()' to work, but using it is highly recommended.
custom_preprocessing(
  data,
  y,
  type = "auto",
  na_indicators = c(""),
  removal_parameters = list(
    active_modules = c(duplicate_cols = TRUE, id_like_cols = TRUE, static_cols = TRUE,
                       sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols = TRUE),
    id_names = c("id", "nr", "number", "idx", "identification", "index"),
    static_threshold = 0.99,
    sparse_columns_threshold = 0.3,
    sparse_rows_threshold = 0.3,
    high_correlation_threshold = 0.7
  ),
  imputation_parameters = list(imputation_method = "median-other", k = 10, m = 5),
  feature_selection_parameters = list(
    feature_selection_method = "BORUTA",
    max_features = "default",
    nperm = 1,
    cutoffPermutations = 20,
    threadsNumber = NULL,
    method = "estevez"
  ),
  verbose = FALSE
)
data |
A data source in one of the major R formats: data.table, data.frame, matrix, and so on. |
y |
A string that indicates a target column name. |
type |
A character string, one of 'binary_clf'/'regression'/'survival'/'auto'/'multiclass', that sets the type of the task. If 'auto' (the default), forester infers 'type' from the number of unique values in the 'y' variable, or from the presence of time/status columns. |
na_indicators |
A vector containing the values that will be treated as NA indicators. The default is c(""). WARNING: Do not include NA or NaN, as these are already handled by another criterion. |
removal_parameters |
A list containing the parameters used in the removal of unnecessary data. It needs to be provided as presented in the example, with exactly the same element names. The parameters are described below:
|
imputation_parameters |
A list containing the parameters used in the imputation of missing data. It needs to be provided as presented in the example, with exactly the same element names. The parameters are described below:
|
feature_selection_parameters |
A list containing the parameters used in the feature selection process. It needs to be provided as presented in the example, with exactly the same element names. The parameters are described below:
|
verbose |
A logical value. If TRUE, all information about the preprocessing process is printed; if FALSE, nothing is printed. |
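The effect of 'na_indicators' can be pictured with a small base R sketch. This is not forester's implementation: the helper 'replace_na_indicators()' and the toy data are hypothetical, and the sketch only assumes that every listed marker value ends up being treated as a missing value.

```r
# Hypothetical helper (not part of forester): turn marker values into NA.
replace_na_indicators <- function(data, na_indicators = c("")) {
  data <- as.data.frame(data, stringsAsFactors = FALSE)
  for (col in names(data)) {
    # Compare as character so that numeric markers (e.g. -999) also match.
    # Values that are already NA never match and are left untouched.
    hits <- as.character(data[[col]]) %in% as.character(na_indicators)
    data[[col]][hits] <- NA
  }
  data
}

# Toy data with two different missing-value markers: "" and "?".
df <- data.frame(
  city  = c("Lisbon", "", "Porto"),
  price = c("100", "?", "250"),
  stringsAsFactors = FALSE
)

cleaned <- replace_na_indicators(df, na_indicators = c("", "?"))
print(cleaned)
```

NA and NaN are left out of the marker vector on purpose, in line with the warning for 'na_indicators' above.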
A list containing five objects:
`data`
A dataset after the preprocessing,
`rm_colnames`
The names of removed columns,
`rm_rows`
The indexes of removed observations,
`bin_labels`
The text labels before target binarization,
`custom_params`
The list of all parameters specified for this function.
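One possible use of 'rm_rows' is to keep data stored outside the dataset aligned with the preprocessed rows. The sketch below uses a mocked result list rather than a real call, and it assumes that 'rm_rows' holds row positions in the original data; the 'weights' vector is hypothetical.

```r
# Mock of the returned list; in practice it would come from custom_preprocessing().
original <- data.frame(id = 1:5, x = c(1, 2, NA, 4, 5))
result <- list(
  data        = original[-3, "x", drop = FALSE],  # row 3 and column 'id' removed
  rm_colnames = "id",
  rm_rows     = 3L
)

# Hypothetical per-row vector that lives outside the dataset.
weights <- c(0.1, 0.2, 0.3, 0.4, 0.5)

# setdiff() is also safe when rm_rows is empty, whereas weights[-integer(0)]
# would return an empty vector instead of the full one.
keep <- setdiff(seq_along(weights), result$rm_rows)
weights_kept <- weights[keep]

stopifnot(length(weights_kept) == nrow(result$data))
```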
## Not run:
k <- custom_preprocessing(data = lisbon,
                          y = 'Price',
                          na_indicators = c(''),
                          removal_parameters = list(
                            active_modules = c(duplicate_cols = TRUE, id_like_cols = TRUE,
                                               static_cols = TRUE, sparse_cols = TRUE,
                                               corrupt_rows = TRUE, correlated_cols = TRUE),
                            id_names = c('id', 'nr', 'number', 'idx', 'identification', 'index'),
                            static_threshold = 0.99,
                            sparse_columns_threshold = 0.3,
                            sparse_rows_threshold = 0.3,
                            high_correlation_threshold = 0.7
                          ),
                          imputation_parameters = list(
                            imputation_method = 'median-other',
                            k = 10,
                            m = 5
                          ),
                          feature_selection_parameters = list(
                            feature_selection_method = 'BORUTA',
                            max_features = 'default',
                            nperm = 1,
                            cutoffPermutations = 20,
                            threadsNumber = NULL,
                            method = 'estevez'
                          ),
                          verbose = FALSE)
# If you want to obtain the same results quickly, just use the code below:
do.call(custom_preprocessing, k$custom_params)
## End(Not run)
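The 'do.call()' idiom used in the example works for any function that stores its own arguments in a named list. The sketch below shows the mechanism with a hypothetical stand-in function, 'toy_preprocessing()', so that it runs without forester; the exact contents of 'custom_params' in forester are not shown here and are an assumption.

```r
# Hypothetical stand-in: drops duplicated rows and records its own parameters.
toy_preprocessing <- function(data, y, verbose = FALSE) {
  list(
    data          = data[!duplicated(data), , drop = FALSE],
    custom_params = list(data = data, y = y, verbose = verbose)
  )
}

k <- toy_preprocessing(data = mtcars, y = "mpg")

# do.call() matches the names of the list elements to the argument names,
# so the stored parameters reproduce the original call.
k2 <- do.call(toy_preprocessing, k$custom_params)

same_result <- identical(k$data, k2$data)
print(same_result)
```

Because the arguments are matched by name, the order of the elements in the stored list does not matter.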