data_prep: Prepare Dataset for Modelling

View source: R/data_prep.R

data_prep    R Documentation

Prepare Dataset for Modelling

Description

Prepares a dataset for modelling through cleaning, banding, imputation and other optional steps.

Usage

data_prep(
  data,
  response,
  intervals = NULL,
  buckets = NULL,
  na_bucket,
  unmatched_bucket,
  trunc_left = FALSE,
  trunc_right = FALSE,
  include_left = TRUE,
  split_or_fold = 1,
  holdout_ratio = 0,
  unique_row = TRUE,
  rm_low_var = FALSE,
  freq_thresh = 95/5,
  impute = FALSE,
  impute_method = "mice",
  pred_ignore = c(),
  impute_ignore = c(),
  rm_outliers = Inf,
  vif_select = FALSE,
  vif_ignore = c(),
  vif_thresh = 5,
  balance = FALSE,
  balance_class,
  balance_method = "under",
  balance_prop = 0.5,
  scale = FALSE,
  seed = 628,
  quiet = FALSE,
  thresh = 10,
  retain_names = TRUE,
  target_encode = FALSE,
  encode_cols,
  blend = FALSE,
  encode_inflec = 50,
  smoothing = 20,
  noise
)

Arguments

data

dataset to be analysed.

response

response variable to be used in modelling.

intervals

a list defining the bands for each of the variables.

buckets

a list defining the names of the bands for each of the variables.

na_bucket

a character or a list defining the bucket name for entries with NA.

unmatched_bucket

a character or a list defining the bucket name for unmatched entries.

trunc_left

a logical specifying whether the band to -Inf should be created.

trunc_right

a logical specifying whether the band to Inf should be created.

include_left

a logical specifying whether each interval includes its left endpoint (TRUE) or its right endpoint (FALSE).

split_or_fold

a numeric. If between 0 and 1, the dataset is split into training and testing datasets, and the number specifies the proportion of rows to be kept for training. If 1, the dataset is not split and all rows are kept for training. If an integer greater than 1, it specifies the number of folds to divide the dataset into.

holdout_ratio

a numeric between 0 and 1. Specifies what proportion of rows in the original dataset should be used for the holdout dataset.

unique_row

a logical. Whether duplicate rows should be deleted or retained.

rm_low_var

a logical. Whether variables with low variance should be deleted or retained.

freq_thresh

the cutoff for the ratio of the frequency of the most common value to the frequency of the second most common value. When rm_low_var = TRUE, a variable whose ratio is above this cutoff is removed.

impute

a logical. Whether missing values and outliers should be imputed.

impute_method

method of imputation. Possible methods are "mice" or "mean/mode".

pred_ignore

columns in the dataset that should not be used in the imputation process. Only required if impute_method = "mice".

impute_ignore

columns in the dataset that should not be imputed.

rm_outliers

a numeric. Values that are more than rm_outliers standard deviations away from the mean will be imputed. Can either be a single number or a vector of numbers, one for each column in data (including variables in impute_ignore). By default, set to Inf, hence no outliers are imputed.

vif_select

a logical. Whether stepwise VIF selection should be performed.

vif_ignore

columns in the dataset that should not be removed. Only relevant if vif_select is TRUE.

vif_thresh

threshold of VIF for variables to be removed.

balance

a logical. Whether the dataset should be balanced.

balance_class

categorical variable in dataset to be balanced by. This is an optional argument.

balance_method

specifies whether undersampling or oversampling should be performed. Takes the value "under" or "over".

balance_prop

desired distribution of response per each class.

scale

a logical specifying whether the numeric variables should be scaled to have mean 0 and standard deviation 1.

seed

seed for set.seed.

quiet

a logical specifying whether messages should be output to the console.

target_encode

a logical specifying whether to perform target encoding on factor variables.

encode_cols

a character vector. Factor type variables to be target encoded.

blend

a logical specifying whether the target average should be weighted based on the count of the group.

encode_inflec

a numeric. This determines half of the minimal sample size for which the estimate based on the sample in a particular level is completely trusted. This value is only used when blend = TRUE.

smoothing

a numeric. The smoothing value is used for blending. Only valid when blend = TRUE.

noise

a numeric. Specify the amount of random noise that should be added to the target average in order to prevent overfitting. Set to 0 to disable noise.

thresh

a numeric used for determining whether a regression or classification model is being built. If the number of unique levels in response is less than thresh, the model is treated as classification, otherwise as regression.
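The banding arguments (intervals, buckets, include_left and na_bucket) can be pictured with base R's cut(). The snippet below is only a sketch under assumptions: the vectors age, breaks and labels are made up for illustration, and the exact format that band_data expects for intervals and buckets is not shown on this page.

```r
# Hypothetical inputs for illustration; not taken from the nano package.
age    <- c(5, 18, 40, 65, NA)
breaks <- c(0, 18, 65, Inf)               # band edges for one variable
labels <- c("child", "adult", "senior")   # band names for the same variable

# right = FALSE closes each band on the left, [a, b),
# which is the behaviour described for include_left = TRUE.
banded_left <- cut(age, breaks = breaks, labels = labels, right = FALSE)

# right = TRUE closes each band on the right, (a, b].
banded_right <- cut(age, breaks = breaks, labels = labels, right = TRUE)

# Give missing entries their own bucket, in the spirit of na_bucket.
banded <- as.character(banded_left)
banded[is.na(banded)] <- "missing"
banded
```

Note how the boundary values 18 and 65 move to a different band depending on which endpoint is included.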

Details

The purpose of this function is to provide a general, off-the-shelf process for quickly preparing raw datasets for modelling. The function is flexible and has options to band variables, impute missing values and outliers, perform stepwise VIF selection and balance the dataset by class. For further details on these processes and their arguments, see the following functions in the nano package, respectively: band_data, impute, vif_step and balance_data.
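To make the stepwise VIF selection concrete, here is a minimal base R sketch of the general technique: compute each variable's variance inflation factor, drop the variable with the largest one, and repeat until every remaining value is at or below the threshold. The helpers vif_all and vif_step_sketch are hypothetical names written for this illustration; they are not the nano vif_step function, whose implementation may differ.

```r
# VIF of each column: 1 / (1 - R^2) from regressing that column on all others.
vif_all <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Hypothetical helper: repeatedly drop the highest-VIF column until all
# remaining VIFs are <= thresh. Columns named in `ignore` are never dropped.
vif_step_sketch <- function(df, thresh = 5, ignore = character()) {
  repeat {
    if (ncol(df) < 2) break                      # nothing left to regress on
    vifs <- vif_all(df)
    candidates <- vifs[setdiff(names(vifs), ignore)]
    if (length(candidates) == 0 || max(candidates) <= thresh) break
    df[[names(which.max(candidates))]] <- NULL   # remove the worst offender
  }
  df
}

# Simulated data: x3 is nearly a linear combination of x1 and x2.
set.seed(628)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200, sd = 0.05)
kept <- vif_step_sketch(data.frame(x1, x2, x3), thresh = 5)
names(kept)
```

In this sketch the ignore argument plays the role described for vif_ignore, and thresh the role of vif_thresh.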

This function can also split the dataset into training and testing datasets or into k folds, and optionally set aside a holdout dataset, via the split_or_fold and holdout_ratio arguments. To split the dataset into training and testing datasets, set split_or_fold to a number between 0 and 1. To divide the dataset into k folds, set split_or_fold to an integer greater than 1. In either case, a holdout dataset can additionally be created with the holdout_ratio argument. Importantly, the holdout dataset can only be created if the dataset has been split into training and testing datasets or into k folds (i.e. split_or_fold != 1).
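The splitting rules can be sketched in base R as follows. This is an illustration only: split_sketch is a hypothetical helper, not the nano implementation, and it assumes that the holdout rows are taken first as a share of the original dataset and that the training proportion is then applied to the remaining rows. The package may order or round these steps differently.

```r
# Hypothetical helper for illustration; not part of the nano package.
split_sketch <- function(data, split_or_fold = 1, holdout_ratio = 0, seed = 628) {
  if (split_or_fold == 1 && holdout_ratio > 0) {
    stop("a holdout dataset requires split_or_fold != 1")
  }
  set.seed(seed)
  n   <- nrow(data)
  idx <- sample(n)                               # shuffled row indices

  # Take the holdout rows first, as a share of the original dataset.
  n_hold  <- floor(holdout_ratio * n)
  holdout <- data[idx[seq_len(n_hold)], , drop = FALSE]
  idx     <- idx[seq_len(n - n_hold) + n_hold]   # rows left after the holdout

  if (split_or_fold > 1) {
    # k folds: deal the remaining rows into folds of near-equal size.
    fold_id <- rep_len(seq_len(split_or_fold), length(idx))
    list(folds = split(data[idx, , drop = FALSE], fold_id), holdout = holdout)
  } else {
    # Proportion in (0, 1]: the first share is training, the rest is testing.
    n_train <- round(split_or_fold * length(idx))
    list(train   = data[idx[seq_len(n_train)], , drop = FALSE],
         test    = data[idx[seq_len(length(idx) - n_train) + n_train], , drop = FALSE],
         holdout = holdout)
  }
}

df     <- data.frame(x = 1:100)
parts  <- split_sketch(df, split_or_fold = 0.7, holdout_ratio = 0.1)
kfolds <- split_sketch(df, split_or_fold = 5, holdout_ratio = 0.1)
sapply(parts, nrow)
```

Every row ends up in exactly one of the resulting pieces, so the training, testing and holdout datasets never overlap.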

Other features available in this function are: removing variables with low variance via the rm_low_var argument, target encoding via the target_encode argument and scaling numeric variables via the scale argument.
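Two of these features can be illustrated in base R. The first helper computes the frequency ratio that freq_thresh is compared against. The second shows one widely used way to blend a level's target average with the overall average, using a sigmoid weight controlled by an inflection point and a smoothing value. Both freq_ratio and blend_encode are hypothetical helpers written for this sketch; whether nano uses exactly this blending formula is an assumption.

```r
# Hypothetical helpers for illustration; not taken from the nano package.

# Ratio of the count of the most common value to the count of the second
# most common value. Large ratios indicate a low-variance column.
freq_ratio <- function(x) {
  counts <- as.vector(sort(table(x), decreasing = TRUE))
  if (length(counts) < 2) return(Inf)   # a constant column has no second value
  counts[1] / counts[2]
}

# Blended target encoding (assumed sigmoid form):
#   lambda  = 1 / (1 + exp(-(n - inflec) / smoothing))
#   encoded = lambda * level_mean + (1 - lambda) * global_mean
# where n is the number of rows in the level.
blend_encode <- function(x, y, inflec = 50, smoothing = 20) {
  n          <- ave(y, x, FUN = length)   # level size, repeated per row
  level_mean <- ave(y, x, FUN = mean)     # level target average, per row
  lambda     <- 1 / (1 + exp(-(n - inflec) / smoothing))
  lambda * level_mean + (1 - lambda) * mean(y)
}

# A column dominated by one value: 97 / 3 is above the default cutoff 95 / 5.
flag <- c(rep("no", 97), rep("yes", 3))
freq_ratio(flag) > 95 / 5

# Level "a" has 50 rows (weight 0.5); level "b" has only 10 rows, so its
# encoding is pulled most of the way towards the overall average.
x   <- c(rep("a", 50), rep("b", 10))
y   <- c(rep(1, 50), rep(0, 10))
enc <- blend_encode(x, y, inflec = 50, smoothing = 20)
unique(enc)
```

Under this form, a level with exactly inflec rows is weighted half-and-half between its own average and the overall average, and larger levels are trusted progressively more.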

Value

A list containing the prepared dataset, along with other metrics and summaries that depend on the arguments supplied.

Examples

## Not run: 
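if (interactive()) {
  # Load the property_prices dataset.
  data(property_prices)
  # Keep 0.7 of the rows for training, set aside 0.1 of the original rows as a
  # holdout dataset, impute missing values, run stepwise VIF selection and
  # suppress console messages.
  data_prep(data          = property_prices,
            split_or_fold = 0.7,
            holdout_ratio = 0.1,
            impute        = TRUE,
            vif_select    = TRUE,
            quiet         = TRUE)
}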

## End(Not run)

Nanoputian628/nano documentation built on Oct. 30, 2023, 3:28 p.m.