data_cleansing: Data Cleaning
In creditmodel: Toolkit for Credit Modeling, Analysis and Visualization


The data_cleansing function is a simple wrapper for data cleaning functions. It performs the following steps: checking the format of dat and target; deleting variables whose values are all NAs; deleting low variance variables; replacing null, NULL, or blank with NA; encoding variables whose NA or missing value rate is more than 95%; encoding variables whose unique value rate is more than 95%; merging categories of character variables that have more than 10 categories; converting time variables to date format; removing duplicated observations; processing outliers; processing NAs.
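A few of the steps listed above can be sketched in base R. This is a minimal illustration on a made-up data frame (`toy` and its columns are hypothetical); it is not the creditmodel implementation, which may handle more cases.

```r
# Base-R sketch of three cleaning steps; not the package's own code.
toy <- data.frame(
  id    = c(1, 2, 2, 3),
  city  = c("A", "", "", "NULL"),
  empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# 1. Replace "null", "NULL", or blank strings with NA in character columns.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[trimws(x) %in% c("", "null", "NULL")] <- NA
  x
})

# 2. Delete variables whose values are all NA.
all_na <- vapply(toy, function(x) all(is.na(x)), logical(1))
toy <- toy[, !all_na, drop = FALSE]

# 3. Remove duplicated observations (rows identical in every column).
toy <- toy[!duplicated(toy), , drop = FALSE]

print(toy)
```

After these steps the `empty` column is gone and the repeated row with `id` 2 appears only once.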

data_cleansing(
  dat,
  target = NULL,
  obs_id = NULL,
  occur_time = NULL,
  pos_flag = NULL,
  x_list = NULL,
  ex_cols = NULL,
  miss_values = NULL,
  remove_dup = TRUE,
  outlier_proc = TRUE,
  missing_proc = "median",
  low_var = 0.999,
  missing_rate = 0.999,
  merge_cat = TRUE,
  note = TRUE,
  parallel = FALSE,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir()
)

`dat`	A data frame with x and target.
`target`	The name of target variable.
`obs_id`	The name of the ID of observations. Default is NULL.
`occur_time`	The name of the occurrence time of observations. Default is NULL.
`pos_flag`	The value of positive class of target variable, default: "1".
`x_list`	A list of x variables.
`ex_cols`	A list of excluded variables. Default is NULL.
`miss_values`	Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded to -1 or "missing".
`remove_dup`	Logical, if TRUE, remove the duplicated observations.
`outlier_proc`	Logical, process outliers or not. Default is TRUE.
`missing_proc`	If logical, process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned -1 or "missing". Otherwise, missing values are processed according to the results of missing analysis.
`low_var`	The maximum percent of unique values (including NAs) for filtering low variance variables.
`missing_rate`	The maximum percent of missing values for recoding values to missing and non_missing.
`merge_cat`	The minimum number of categories for merging categories of character variables.
`note`	Logical. Outputs info. Default is TRUE.
`parallel`	Logical, parallel computing or not. Default is FALSE.
`save_data`	Logical, save the result or not. Default is FALSE.
`file_name`	The name for periodically saved data file. Default is NULL.
`dir_path`	The path for periodically saved data file. Default is tempdir().
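To make the `low_var`, `missing_rate`, and `missing_proc = "default"` arguments more concrete, here is a base-R sketch of how such thresholds could be applied. It rests on assumptions: the comparison direction (strictly greater than the threshold), the order of the steps, and the toy thresholds 0.97 and 0.9 are chosen for illustration, and the data frame `toy` is hypothetical. The package's internal rules may differ.

```r
# Illustrative only: hypothetical data and rules, not the creditmodel internals.
toy <- data.frame(
  const  = c(rep(1, 39), 2),            # one value covers 97.5% of rows
  sparse = c(rep(NA, 38), 5, 6),        # 95% of values are missing
  amount = c(NA, seq_len(39)),          # 2.5% missing
  grade  = c(NA, rep(c("a", "b"), length.out = 39)),
  stringsAsFactors = FALSE
)
low_var      <- 0.97   # example value, lower than the documented default
missing_rate <- 0.90   # example value, lower than the documented default

# low_var: drop variables dominated by a single value (NA counted as a value).
top_share <- vapply(
  toy,
  function(x) max(table(x, useNA = "ifany")) / length(x),
  numeric(1)
)
toy <- toy[, top_share <= low_var, drop = FALSE]

# missing_rate: recode mostly-missing variables to missing / non_missing.
na_share <- vapply(toy, function(x) mean(is.na(x)), numeric(1))
for (v in names(na_share)[na_share > missing_rate]) {
  toy[[v]] <- ifelse(is.na(toy[[v]]), "missing", "non_missing")
}

# missing_proc = "default": remaining NAs become -1 (numeric) or "missing".
for (v in names(toy)) {
  x <- toy[[v]]
  if (is.numeric(x)) x[is.na(x)] <- -1 else x[is.na(x)] <- "missing"
  toy[[v]] <- x
}

str(toy)
```

In this sketch `const` is dropped, `sparse` becomes a two-level indicator, and the single NA in `amount` and in `grade` is filled with -1 and "missing" respectively.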

A preprocessed data.frame

remove_duplicated, null_blank_na, entry_rate_na, low_variance_filter, process_nas, process_outliers

# data cleaning on the first 2000 rows of the UCICreditCard data set
library(creditmodel)
dat_cl = data_cleansing(dat = UCICreditCard[1:2000,],
                       target = "default.payment.next.month",
                       x_list = NULL,
                       obs_id = "ID",
                       occur_time = "apply_date",
                       ex_cols = c("PAY_6|BILL_"),
                       outlier_proc = TRUE,
                       missing_proc = TRUE,
                       low_var = TRUE,
                       save_data = FALSE)