data_prep | R Documentation |
Prepares a dataset for modelling by cleaning, banding, imputing and applying other optional steps.
data_prep(
  data,
  response,
  intervals = NULL,
  buckets = NULL,
  na_bucket,
  unmatched_bucket,
  trunc_left = FALSE,
  trunc_right = FALSE,
  include_left = TRUE,
  split_or_fold = 1,
  holdout_ratio = 0,
  unique_row = TRUE,
  rm_low_var = FALSE,
  freq_thresh = 95/5,
  impute = FALSE,
  impute_method = "mice",
  pred_ignore = c(),
  impute_ignore = c(),
  rm_outliers = Inf,
  vif_select = FALSE,
  vif_ignore = c(),
  vif_thresh = 5,
  balance = FALSE,
  balance_class,
  balance_method = "under",
  balance_prop = 0.5,
  scale = FALSE,
  seed = 628,
  quiet = FALSE,
  thresh = 10,
  retain_names = TRUE,
  target_encode = FALSE,
  encode_cols,
  blend = FALSE,
  encode_inflec = 50,
  smoothing = 20,
  noise
)
data |
dataset to be analysed. |
response |
response variable to be used in modelling. |
intervals |
a list defining the bands for each of the variables. |
buckets |
a list defining the names of the bands for each of the variables. |
na_bucket |
a character or a list defining the bucket name for entries with |
unmatched_bucket |
a character or a list defining the bucket name for unmatched entries. |
trunc_left |
a logical specifying whether the band to |
trunc_right |
a logical specifying whether the band to |
include_left |
a logical specifying whether to include the left or the right endpoint of each interval. |
split_or_fold |
a numeric. If between 0 and 1, the dataset is split into training and testing datasets, and the number specifies the proportion of rows to be kept for training. If 1, the dataset is not split and all rows are kept for training. If an integer greater than 1, it specifies the number of folds to divide the dataset into. |
holdout_ratio |
a numeric between 0 and 1. Specifies the proportion of rows in the original dataset to be used for the holdout dataset. |
unique_row |
a logical. Whether duplicate rows should be deleted or retained. |
rm_low_var |
a logical. Whether variables with low variance should be deleted or retained. |
freq_thresh |
the cutoff for the ratio of the most common value to the second most
common value in which a variable is removed when |
impute |
a logical. Whether missing values and outliers should be imputed. |
impute_method |
method of imputation. Possible methods are "mice" or "mean/mode". |
pred_ignore |
columns in the dataset that should not be used in the data imputation process. Only required if |
impute_ignore |
columns in the dataset that should not be imputed. |
rm_outliers |
a numeric where values which are |
vif_select |
a logical. Whether stepwise VIF selection should be performed. |
vif_ignore |
columns in the dataset that should not be removed. Only relevant if |
vif_thresh |
threshold of VIF for variables to be removed. |
balance |
a logical. Whether the dataset should be balanced. |
balance_class |
categorical variable in dataset to be balanced by. This is an optional argument. |
balance_method |
specifies whether undersampling or oversampling should be performed. Takes the value "under" or "over". |
balance_prop |
desired distribution of the response for each class. |
scale |
a logical specifying whether the numeric variables should be scaled to have mean 0 and standard deviation 1. |
seed |
seed for |
quiet |
a logical specifying whether messages should be output to the console. |
target_encode |
a logical specifying whether to perform target encoding on factor variables. |
encode_cols |
a character vector. Factor type variables to be target encoded. |
blend |
a logical specifying whether the target average should be weighted based on the count of the group. |
encode_inflec |
a numeric. This determines half of the minimal sample size for which
the estimate based on the sample in the particular level is completely trusted. This
value is only valid when |
smoothing |
a numeric. The smoothing value is used for blending. Only valid when
|
noise |
a numeric. Specify the amount of random noise that should be added to the target average in order to prevent overfitting. Set to 0 to disable noise. |
used |
for determining whether building a regression or classification model. If number
of unique levels in |
The purpose of this function is to provide a general, off-the-shelf process to
quickly prepare raw datasets for modelling. The function is flexible, with options to
band variables, impute missing values and outliers, perform stepwise VIF selection and
balance the dataset by class. For further details on these processes and their arguments,
see the following functions in the nano package, respectively: band_data, impute, vif_step
and balance_data.
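As a rough illustration of what banding a numeric variable involves, the sketch below uses base R's cut() rather than band_data itself. The variable, break points and band names are made up for the example, and the mapping of right = FALSE to include_left = TRUE is an assumption about the intended behaviour, not something the documentation confirms.

```r
# Hypothetical values; none of these come from the nano package.
age    <- c(18, 25, 40, 65)
breaks <- c(18, 25, 40, 65, Inf)                # band boundaries
labels <- c("18-24", "25-39", "40-64", "65+")  # one name per band

# right = FALSE closes each band on the left: [18, 25), [25, 40), ...
# so a value sitting on a boundary falls into the band that starts there.
banded <- cut(age, breaks = breaks, labels = labels, right = FALSE)
print(as.character(banded))
```

With right = TRUE the bands would instead be closed on the right, so 25 would fall in the first band and 18 would fall outside every band.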
This function can also split the dataset into training and testing datasets or into k
folds, and can create a holdout dataset, via the split_or_fold and holdout_ratio arguments.
To split the dataset into training and testing datasets, set split_or_fold to a number
between 0 and 1. To divide the dataset into k folds, set split_or_fold to an integer greater
than 1. A holdout dataset can additionally be created with the holdout_ratio argument, but
only if the dataset has been split into training and testing, or into k folds
(i.e. split_or_fold != 1).
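The three ways split_or_fold is interpreted can be sketched in base R as follows. The helper split_rows is hypothetical and is not part of nano; it only mirrors the rules described above and ignores the holdout step, so the real function may assign rows differently.

```r
set.seed(628)  # same value as the default seed in the usage above

# Hypothetical helper: returns row indices grouped according to split_or_fold.
split_rows <- function(n, split_or_fold) {
  idx <- sample(n)  # shuffled row indices 1..n
  if (split_or_fold > 0 && split_or_fold < 1) {
    # proportion of rows kept for training, the rest for testing
    n_train <- round(split_or_fold * n)
    list(train = idx[seq_len(n_train)], test = idx[-seq_len(n_train)])
  } else if (split_or_fold == 1) {
    # no split: every row is kept for training
    list(train = idx)
  } else {
    # k folds of near-equal size
    unname(split(idx, rep_len(seq_len(split_or_fold), n)))
  }
}

parts <- split_rows(100, 0.7)  # training/testing split
folds <- split_rows(100, 5)    # 5 folds
```

Here parts$train holds 70 row indices and parts$test the remaining 30, while folds is a list of 5 index vectors that together cover every row once.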
Other features available in this function are removing variables with low variance via the
rm_low_var argument, target encoding via the target_encode argument, and scaling numeric
variables via the scale argument.
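The freq_thresh argument is described in terms of the ratio of the most common value to the second most common value. A minimal base R sketch of that ratio is given below. The helper freq_ratio is hypothetical and not part of nano, and comparing the ratio against the default of 95/5 with ">" is an assumption about how the cutoff is applied.

```r
# Hypothetical helper: ratio of the most common value's count to the
# second most common value's count.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # fewer than two distinct values
  as.numeric(counts[1]) / as.numeric(counts[2])
}

x <- c(rep("a", 97), rep("b", 3))  # made-up, heavily dominated variable
ratio    <- freq_ratio(x)          # 97 / 3
low_var  <- ratio > 95 / 5         # assumed rule: flag when above the cutoff
print(low_var)
```

A variable split 60/40 between two values would have a ratio of 1.5 and would not be flagged under this rule.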
A list containing the prepared dataset, along with other metrics and summaries that depend on the arguments supplied.
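## Not run:
if (interactive()) {
  library(nano)  # package named in the details above
  data(property_prices)
  # 70/30 training/testing split with a 10% holdout, imputation and VIF selection
  data_prep(data = property_prices,
            split_or_fold = 0.7,
            holdout_ratio = 0.1,
            impute = TRUE,
            vif_select = TRUE,
            quiet = TRUE)
}
## End(Not run)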