data_prep: Prepare Dataset for Modelling
In Nanoputian628/nano: Data Visualisation and Model Selection

data_prep

R Documentation

Prepare Dataset for Modelling

Description

Prepares a dataset for modelling by cleaning, banding, imputing and applying other optional processing steps.

Usage

data_prep(
  data,
  response,
  intervals = NULL,
  buckets = NULL,
  na_bucket,
  unmatched_bucket,
  trunc_left = FALSE,
  trunc_right = FALSE,
  include_left = TRUE,
  split_or_fold = 1,
  holdout_ratio = 0,
  unique_row = TRUE,
  rm_low_var = FALSE,
  freq_thresh = 95/5,
  impute = FALSE,
  impute_method = "mice",
  pred_ignore = c(),
  impute_ignore = c(),
  rm_outliers = Inf,
  vif_select = FALSE,
  vif_ignore = c(),
  vif_thresh = 5,
  balance = FALSE,
  balance_class,
  balance_method = "under",
  balance_prop = 0.5,
  scale = FALSE,
  seed = 628,
  quiet = FALSE,
  thresh = 10,
  retain_names = TRUE,
  target_encode = FALSE,
  encode_cols,
  blend = FALSE,
  encode_inflec = 50,
  smoothing = 20,
  noise
)

Arguments

`data`	dataset to be analysed.
`response`	response variable to be used in modelling.
`intervals`	a list defining the bands for each of the variables.
`buckets`	a list defining the names of the bands for each of the variables.
`na_bucket`	a character or a list defining the bucket name for entries with `NA`.
`unmatched_bucket`	a character or a list defining the bucket name for unmatched entries.
`trunc_left`	a logical specifying whether the band to `-Inf` should be created.
`trunc_right`	a logical specifying whether the band to `Inf` should be created.
`include_left`	a logical specifying whether each interval should include its left endpoint (`TRUE`) or its right endpoint (`FALSE`).
`split_or_fold`	a numeric. If between 0 and 1, the dataset is split into training and testing datasets, and the number specifies the proportion of rows to be kept for training. If 1, the dataset is not split and all rows are kept for training. If an integer greater than 1, specifies the number of folds to divide the dataset into.
`holdout_ratio`	a numeric between 0 and 1. Specifies the proportion of rows in the original dataset to be used for the holdout dataset.
`unique_row`	a logical. Whether duplicate rows should be deleted or retained.
`rm_low_var`	a logical. Whether variables with low variance should be deleted or retained.
`freq_thresh`	the cutoff for the ratio of the frequency of the most common value to that of the second most common value, above which a variable is removed when `rm_low_var = TRUE`.
`impute`	a logical. Whether missing values and outliers should be imputed.
`impute_method`	method of imputation. Possible methods are "mice" or "mean/mode".
`pred_ignore`	columns in the dataset not to be used as predictors in the imputation process. Only required if `impute_method` = "mice".
`impute_ignore`	columns in dataset to be not imputed.
`rm_outliers`	a numeric where values which are more than `rm_outliers` standard deviations away from the mean will be imputed. Can either be a single number or a vector of numbers for each column in `data` (including variables in `impute_ignore`). By default, set to `Inf`, hence no outliers are imputed.
`vif_select`	a logical. Whether stepwise VIF selection should be performed.
`vif_ignore`	columns in the dataset which should not be removed. Only relevant if `vif_select` is `TRUE`.
`vif_thresh`	threshold of VIF for variables to be removed.
`balance`	a logical. Whether the dataset should be balanced.
`balance_class`	categorical variable in dataset to be balanced by. This is an optional argument.
`balance_method`	specifies whether undersampling or oversampling should be performed. Takes the value "under" or "over".
`balance_prop`	desired distribution of response per each class.
`scale`	a logical specifying whether the numeric variables should be scaled with 0 mean and 1 standard deviation.
`seed`	seed for `set.seed`.
`quiet`	a logical specifying whether messages should be output to the console.
`target_encode`	a logical specifying whether to perform target encoding on factor variables.
`encode_cols`	a character vector. Factor type variables to be target encoded.
`blend`	a logical specifying whether the target average should be weighted based on the count of the group.
`encode_inflec`	a numeric. This determines half of the minimal sample size for which the estimate based on the sample in the particular level is completely trusted. This value is only used when `blend = TRUE`.
`smoothing`	a numeric. The smoothing value is used for blending. Only valid when `blend = TRUE`.
`noise`	a numeric. Specify the amount of random noise that should be added to the target average in order to prevent overfitting. Set to 0 to disable noise.
`thresh`	a numeric used for determining whether a regression or classification model is being built. If the number of unique levels in `response` is less than `thresh`, a classification model is assumed; otherwise, a regression model.
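
The roles of `encode_inflec` and `smoothing` when `blend = TRUE` can be illustrated with a sigmoid blending weight, a formulation used by some target-encoding implementations. This is a minimal sketch under that assumption and is not confirmed to be the formula used by this package; `blend_weight` and `blended_mean` are hypothetical helper names.

```r
# Assumed sigmoid blend: the weight on the level's own mean grows with the
# level's sample size n, reaching 0.5 at n = encode_inflec. `smoothing`
# controls how quickly the weight moves between 0 and 1.
blend_weight <- function(n, encode_inflec = 50, smoothing = 20) {
  1 / (1 + exp(-(n - encode_inflec) / smoothing))
}

# Blend the level mean with the overall (prior) mean of the response.
blended_mean <- function(level_mean, n, prior_mean,
                         encode_inflec = 50, smoothing = 20) {
  w <- blend_weight(n, encode_inflec, smoothing)
  w * level_mean + (1 - w) * prior_mean
}

# A level with few rows is pulled towards the prior mean,
# a level with many rows keeps most of its own mean.
small_level <- blended_mean(level_mean = 1, n = 5,   prior_mean = 0.2)
large_level <- blended_mean(level_mean = 1, n = 500, prior_mean = 0.2)
```

Under this assumed formulation, a level with exactly `encode_inflec` rows receives equal weight on its own mean and on the overall mean.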

Details

The purpose of this function is to provide a general, off-the-shelf process to quickly prepare raw datasets for modelling. The function is highly flexible, with options to: band variables, impute missing values and outliers, perform stepwise VIF selection and balance the dataset by class. For further details on these processes and their arguments, see the following functions in the nano package, respectively: band_data, impute, vif_step and balance_data.
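
As a rough illustration of what banding with `intervals`, `buckets`, `include_left` and `na_bucket` means, the sketch below uses base R's `cut()`. It is not the band_data implementation; the values, break points and bucket names are made up for the example.

```r
# Illustrative values and bands (hypothetical, not from the package).
x       <- c(5, 10, 25, 40, NA)
breaks  <- c(0, 10, 30, 50)          # plays the role of `intervals`
labels  <- c("low", "mid", "high")   # plays the role of `buckets`

# include_left = TRUE corresponds to left-closed intervals [a, b),
# which in cut() is right = FALSE.
include_left <- TRUE
banded <- as.character(
  cut(x, breaks = breaks, labels = labels, right = !include_left)
)

# Missing entries get their own bucket, analogous to `na_bucket`.
banded[is.na(x)] <- "missing"
banded
```

With `include_left <- FALSE`, the value 10 would instead fall in the "low" band, since the intervals become (0, 10], (10, 30] and (30, 50].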

This function can also split the dataset into training and testing datasets, into k folds, and into a holdout dataset, via the split_or_fold and holdout_ratio arguments. To split the dataset into training and testing datasets, set split_or_fold to a number between 0 and 1. To divide the dataset into k folds, set split_or_fold to an integer greater than 1. In either case, a holdout dataset can additionally be created using the holdout_ratio argument. Importantly, the holdout dataset can only be created if the dataset has been split into training and testing datasets, or into k folds (i.e. split_or_fold != 1).
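
The three interpretations of split_or_fold can be sketched in base R as follows. This is only an illustration of the rules described above, not the package's code; `assign_partition` is a hypothetical helper and the labels it returns are arbitrary.

```r
# Hypothetical helper: label each of n rows according to split_or_fold.
assign_partition <- function(n, split_or_fold = 1, seed = 628) {
  set.seed(seed)
  if (split_or_fold == 1) {
    # No split: every row is kept for training.
    return(rep("train", n))
  }
  if (split_or_fold > 0 && split_or_fold < 1) {
    # Train/test split: split_or_fold is the proportion kept for training.
    n_train <- round(split_or_fold * n)
    shuffled <- sample.int(n)
    out <- rep("test", n)
    out[shuffled[seq_len(n_train)]] <- "train"
    return(out)
  }
  if (split_or_fold > 1 && split_or_fold == round(split_or_fold)) {
    # k folds: spread fold labels as evenly as possible, then shuffle.
    folds <- rep_len(seq_len(split_or_fold), n)
    return(paste0("fold_", sample(folds)))
  }
  stop("split_or_fold must be in (0, 1], or an integer greater than 1")
}

split_labels <- assign_partition(100, split_or_fold = 0.7)
fold_labels  <- assign_partition(100, split_or_fold = 5)
table(split_labels)
table(fold_labels)
```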

Other features available in this function are: removing variables with low variance via the rm_low_var argument, target encoding via the target_encode argument and scaling numeric variables via the scale argument.
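
To make the freq_thresh criterion concrete, the sketch below computes the ratio of the most common value's frequency to the second most common value's frequency and compares it with the default cutoff of 95/5. It assumes a variable is flagged when the ratio exceeds the cutoff; `freq_ratio` is a hypothetical helper, not a function from the package.

```r
# Hypothetical helper: frequency of the most common value divided by
# the frequency of the second most common value.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) {
    return(Inf)  # a single distinct value has no variation at all
  }
  as.numeric(counts[1] / counts[2])
}

freq_thresh <- 95 / 5  # default cutoff, i.e. 19

nearly_constant <- c(rep("a", 97), rep("b", 3))
well_spread     <- c(rep("a", 60), rep("b", 40))

freq_ratio(nearly_constant) > freq_thresh  # TRUE: 97/3 is about 32.3
freq_ratio(well_spread) > freq_thresh      # FALSE: 60/40 is 1.5
```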

Value

A list containing the prepared dataset, along with other metrics and summaries depending on the arguments supplied.

Examples

## Not run: 
if(interactive()){
 data(property_prices)
 data_prep(data          = property_prices, 
           split_or_fold = 0.7, 
           holdout_ratio = 0.1, 
           impute        = TRUE, 
           vif_select    = TRUE,
           quiet         = TRUE)
 }

## End(Not run)

Nanoputian628/nano documentation built on Oct. 30, 2023, 3:28 p.m.