View source: R/preprocessing_removal.R
preprocessing_removal | R Documentation |
This function provides 6 modules for removing unwanted features and observations. It can remove duplicate columns, ID-like columns, static columns (above a specified staticity threshold), sparse columns (above a specified sparsity threshold), and highly correlated columns (above a specified correlation threshold). It can also remove observations that are too sparse (above a specified sparsity threshold) or that have a missing target value. Each module can be turned on or off by setting the corresponding logical value in 'active_modules'.
preprocessing_removal(
data,
y,
active_modules = c(duplicate_cols = TRUE, id_like_cols = TRUE, static_cols = TRUE,
sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols = TRUE),
id_names = c("id", "nr", "number", "idx", "identification", "index"),
static_threshold = 0.99,
sparse_columns_threshold = 0.3,
sparse_rows_threshold = 0.3,
na_indicators = c(""),
high_correlation_threshold = 0.7,
verbose = FALSE
)
data |
A data source, that is one of the major R formats: data.table, data.frame, matrix, and so on. |
y |
A string that indicates a target column name. |
active_modules |
A named logical vector describing which removal modules are active. By default it is set to 'c(duplicate_cols = TRUE, id_like_cols = TRUE, static_cols = TRUE, sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols = TRUE)', which is equivalent to c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE). Setting corrupt_rows to FALSE still results in the removal of observations without a target value. |
id_names |
A vector of strings indicating which column names are perceived as ID-like. By default the vector is c("id", "nr", "number", "idx", "identification", "index"). |
static_threshold |
A numeric value from the [0,1] range, which indicates the maximum allowed share of the dominating value in a column. If a feature has a larger share of its dominating value, it is removed. By default set to 0.99, as shown in the usage; a value of 1 indicates that all values are equal. |
sparse_columns_threshold |
A numeric value from the [0,1] range, which indicates the maximum allowed share of missing values in a column. If a column has a larger share of missing fields, it is removed. By default set to 0.3. |
sparse_rows_threshold |
A numeric value from the [0,1] range, which indicates the maximum allowed share of missing values in an observation. If an observation has a larger share of missing fields, it is removed. By default set to 0.3. |
na_indicators |
A list containing the values that will be treated as NA indicators. By default the list is c(""). WARNING: Do not include NA or NaN, as these are already checked by another criterion. |
high_correlation_threshold |
A numeric value from the [0,1] range, which indicates when the correlation is considered high. If a feature's correlation with another feature surpasses this threshold, it is removed. By default set to 0.7. |
verbose |
A logical value. If set to TRUE, the function prints all information about the preprocessing process; if FALSE, it prints none. |
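The threshold arguments above can be illustrated with a small base R sketch. This is not the function's actual implementation: the helper names (`dominating_share`, `missing_share`), the toy data, and the use of a strict `>` comparison against each threshold are assumptions made only to show how such criteria could be evaluated.

```r
# Illustrative sketch only; NOT the implementation of preprocessing_removal.
# Toy data with made-up column names.
df <- data.frame(
  constant  = rep(1, 10),
  mostly_na = c(1, rep(NA, 9)),
  blanks    = c("a", "", "", "", "", "b", "c", "d", "e", "f"),
  x         = 1:10,
  x_scaled  = (1:10) * 2 + 0.5,
  stringsAsFactors = FALSE
)

# Thresholds mirroring the defaults shown in the usage.
static_threshold           <- 0.99
sparse_columns_threshold   <- 0.3
sparse_rows_threshold      <- 0.3
high_correlation_threshold <- 0.7
na_indicators              <- c("")

# Hypothetical helper: share of the most frequent value in a column.
dominating_share <- function(col) {
  max(table(col, useNA = "ifany")) / length(col)
}

# Hypothetical helper: share of entries that are NA or an NA indicator.
missing_share <- function(col, indicators) {
  mean(is.na(col) | col %in% indicators)
}

static_share <- sapply(df, dominating_share)
sparse_share <- sapply(df, missing_share, indicators = na_indicators)

# Columns exceeding the staticity and sparsity thresholds.
static_cols <- names(which(static_share > static_threshold))
sparse_cols <- names(which(sparse_share > sparse_columns_threshold))

# Per-row share of missing fields, using the same notion of "missing".
is_missing  <- sapply(df, function(col) is.na(col) | col %in% na_indicators)
row_share   <- rowMeans(is_missing)
sparse_rows <- unname(which(row_share > sparse_rows_threshold))

# Absolute Pearson correlation between two numeric columns.
x_correlated <- abs(cor(df$x, df$x_scaled)) > high_correlation_threshold

print(static_cols)
print(sparse_cols)
print(sparse_rows)
print(x_correlated)
```

In this toy data, `constant` is flagged as static, `mostly_na` and `blanks` are flagged as sparse columns, rows 2 to 5 are flagged as sparse rows, and `x` and `x_scaled` count as highly correlated. The real function may order the modules differently or treat boundary values in another way.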
A list containing three objects:
`data`
The dataset with the flagged observations and columns removed.
`rm_col`
The indexes of removed columns.
`rm_row`
The indexes of removed rows.
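A call and the use of its return value might look like the sketch below. The toy data frame and its column names are invented for illustration, and the call is guarded with `exists()` so the snippet does nothing when the package providing `preprocessing_removal` is not attached. Only arguments listed in the usage are passed.

```r
# Toy data; the column names are made up for illustration.
df <- data.frame(
  id       = 1:6,
  feature  = c(2.1, 3.4, NA, 5.0, 1.2, 4.8),
  constant = rep("a", 6),
  target   = c(1, 0, 1, NA, 0, 1)
)

# Guard: run only if the function is available in the current session.
if (exists("preprocessing_removal")) {
  res <- preprocessing_removal(
    data = df,
    y = "target",
    # Switch off the correlation module, keep the others active.
    active_modules = c(duplicate_cols = TRUE, id_like_cols = TRUE,
                       static_cols = TRUE, sparse_cols = TRUE,
                       corrupt_rows = TRUE, correlated_cols = FALSE),
    verbose = TRUE
  )

  str(res$data)      # the cleaned dataset
  print(res$rm_col)  # indexes of removed columns
  print(res$rm_row)  # indexes of removed rows
}
```

Keeping `rm_col` and `rm_row` alongside the cleaned data makes it possible to check afterwards which parts of the original dataset were dropped.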