data_cleansing: Data Cleaning


Description

The data_cleansing function is a simple wrapper around several data cleaning functions. It checks the format of dat and target; deletes variables whose values are all NA; deletes low-variance variables; replaces null, NULL, or blank values with NA; encodes variables whose NA and missing-value rate is more than 95%; encodes variables whose unique-value rate is more than 95%; merges the categories of character variables that have more than 10 categories; converts time variables to date format; removes duplicated observations; processes outliers; and processes NAs.
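Three of the steps in the description (replacing null/blank strings with NA, deleting all-NA variables, and removing duplicated observations) can be sketched in base R. This is only an illustration of the idea: basic_clean and the toy data frame are hypothetical and are not part of creditmodel, whose internal implementation may differ.

```r
# Hypothetical helper, NOT part of creditmodel: a base R sketch of three
# of the cleaning steps listed in the description.
basic_clean <- function(dat) {
  # Replace "null", "NULL" and blank strings with NA in character columns
  for (col in names(dat)) {
    if (is.character(dat[[col]])) {
      x <- trimws(dat[[col]])
      x[x %in% c("", "null", "NULL")] <- NA
      dat[[col]] <- x
    }
  }
  # Delete variables whose values are all NA
  keep <- vapply(dat, function(x) !all(is.na(x)), logical(1))
  dat <- dat[, keep, drop = FALSE]
  # Remove duplicated observations
  dat <- dat[!duplicated(dat), , drop = FALSE]
  rownames(dat) <- NULL
  dat
}

# Toy data (made up for illustration)
toy <- data.frame(
  id = c(1, 2, 2, 3),
  city = c("A", "NULL", "NULL", " "),
  empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
cleaned <- basic_clean(toy)
print(cleaned)
```

The order matters in this sketch: strings are converted to NA first so that a column holding only "NULL" or blanks would also be caught by the all-NA check.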

Usage

data_cleansing(
  dat,
  target = NULL,
  obs_id = NULL,
  occur_time = NULL,
  pos_flag = NULL,
  x_list = NULL,
  ex_cols = NULL,
  miss_values = NULL,
  remove_dup = TRUE,
  outlier_proc = TRUE,
  missing_proc = "median",
  low_var = 0.999,
  missing_rate = 0.999,
  merge_cat = TRUE,
  note = TRUE,
  parallel = FALSE,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir()
)

Arguments

dat

A data frame containing the x variables and the target.

target

The name of the target variable.

obs_id

The name of the ID variable of the observations. Default is NULL.

occur_time

The name of the variable holding the occurrence time of the observations. Default is NULL.

pos_flag

The value of positive class of target variable, default: "1".

x_list

A list of x variables.

ex_cols

A list of excluded variables. Default is NULL.

miss_values

Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded to -1 or "missing".
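The recoding described for miss_values can be pictured with a small base R sketch. The helper recode_miss_values is hypothetical and only mirrors the documented behaviour (-1 for numeric variables, "missing" for character variables); it is not the package's own code.

```r
# Hypothetical helper, NOT creditmodel's implementation: recode sentinel
# values that stand for "missing" to -1 (numeric) or "missing" (character).
recode_miss_values <- function(x, miss_values) {
  if (is.numeric(x)) {
    x[x %in% miss_values] <- -1
  } else {
    x <- as.character(x)
    x[x %in% as.character(miss_values)] <- "missing"
  }
  x
}

# Toy vectors (made up for illustration)
income <- c(1200, -9999, 3400, -9998)
grade  <- c("a", "-9999", "b")

income_rc <- recode_miss_values(income, c(-9999, -9998))
grade_rc  <- recode_miss_values(grade, c(-9999, -9998))
print(income_rc)
print(grade_rc)
```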

remove_dup

Logical, if TRUE, remove the duplicated observations.

outlier_proc

Logical, process outliers or not. Default is TRUE.

missing_proc

If logical, whether to process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned to -1 or "missing". Otherwise, missing values are processed according to the results of the missing analysis.
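To make the "median" option more concrete, the following base R sketch imputes each NA with the median of the k nearest rows that have a known value. The function knn_median_impute, the use of Euclidean distance, and the choice of features are all assumptions made for illustration; the documentation does not state how creditmodel selects neighbors.

```r
# Hypothetical sketch, NOT creditmodel's implementation: impute NAs in x
# with the median of the k nearest rows (Euclidean distance on `features`)
# whose value of x is known.
knn_median_impute <- function(x, features, k = 3) {
  features <- as.matrix(features)
  known <- which(!is.na(x))  # fixed up front, so imputed values are not reused
  for (i in which(is.na(x))) {
    # Distance from row i to every row with a known value
    diffs <- sweep(features[known, , drop = FALSE], 2, features[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- known[order(d)[seq_len(min(k, length(known)))]]
    x[i] <- median(x[nearest])
  }
  x
}

# Toy data (made up for illustration)
age    <- c(20, 22, 24, 40, 42, 44)
income <- c(100, NA, 120, 300, NA, 340)

imputed <- knn_median_impute(income, data.frame(age = age), k = 2)
print(imputed)
```

An "avg_dist" variant would, under the same assumptions, replace the median with an average of the neighbors' values weighted by inverse distance.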

low_var

The maximum percent of unique values (including NAs) for filtering low variance variables.
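One way to read this threshold is sketched below in base R, assuming that a variable is dropped when the share of its single most frequent value (with NA counted as a value) exceeds low_var. Both this interpretation and the helper low_var_filter are assumptions for illustration; see low_variance_filter in the package for the actual behaviour.

```r
# Hypothetical sketch, NOT creditmodel's implementation: drop variables in
# which one value (NA counted as a value) makes up more than `low_var` of rows.
low_var_filter <- function(dat, low_var = 0.999) {
  top_share <- vapply(dat, function(x) {
    max(table(x, useNA = "ifany")) / length(x)
  }, numeric(1))
  dat[, top_share <= low_var, drop = FALSE]
}

# Toy data (made up for illustration)
toy <- data.frame(
  flag   = rep(1, 10),              # constant: top share is 1
  mostly = c(rep("a", 9), "b"),     # top share is 0.9
  x      = 1:10,                    # top share is 0.1
  stringsAsFactors = FALSE
)

filtered <- low_var_filter(toy, low_var = 0.95)
print(names(filtered))
```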

missing_rate

The maximum percent of missing values for recoding values to missing and non_missing.

merge_cat

The minimum number of categories for merging categories of character variables.

note

Logical, whether to output progress information. Default is TRUE.

parallel

Logical, parallel computing or not. Default is FALSE.

save_data

Logical, save the result or not. Default is FALSE.

file_name

The name for periodically saved data file. Default is NULL.

dir_path

The path for periodically saved data file. Default is tempdir().

Value

A preprocessed data.frame.

See Also

remove_duplicated, null_blank_na, entry_rate_na, low_variance_filter, process_nas, process_outliers

Examples

library(creditmodel)

# Data cleaning on the first 2000 rows of UCICreditCard
dat_cl = data_cleansing(dat = UCICreditCard[1:2000,],
                       target = "default.payment.next.month",
                       x_list = NULL,
                       obs_id = "ID",
                       occur_time = "apply_date",
                       ex_cols = c("PAY_6|BILL_"),
                       outlier_proc = TRUE,
                       missing_proc = TRUE,
                       low_var = TRUE,
                       save_data = FALSE)

creditmodel documentation built on Jan. 7, 2022, 5:06 p.m.