The data_cleansing function is a simple wrapper around the data cleaning functions. It performs the following steps:
delete variables whose values are all NA;
check the format of dat and target;
delete low-variance variables;
replace null, NULL, or blank values with NA;
encode variables whose NA or missing-value rate is more than 95%;
encode variables whose unique-value rate is more than 95%;
merge categories of character variables that have more than 10 categories;
convert time variables to date format;
remove duplicated observations;
process outliers;
process NAs.
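As a rough illustration of three of the steps listed above (replacing null or blank strings with NA, dropping all-NA variables, and removing duplicated observations), here is a minimal base R sketch. The helper `basic_clean` and the toy data are hypothetical and are not the package's implementation; the real function does considerably more.

```r
# Hypothetical helper for illustration only; not part of the package.
basic_clean <- function(dat) {
  # 1. Replace "null", "NULL", and blank strings with NA in character columns.
  is_chr <- vapply(dat, is.character, logical(1))
  if (any(is_chr)) {
    dat[is_chr] <- lapply(dat[is_chr], function(x) {
      x[trimws(x) %in% c("null", "NULL", "")] <- NA_character_
      x
    })
  }
  # 2. Drop variables whose values are all NA.
  all_na <- vapply(dat, function(x) all(is.na(x)), logical(1))
  dat <- dat[, !all_na, drop = FALSE]
  # 3. Remove duplicated observations (rows).
  dat[!duplicated(dat), , drop = FALSE]
}

# Toy data: one all-NA column, one duplicated row, several null/blank entries.
toy <- data.frame(
  id = c(1, 2, 2, 3),
  city = c("A", "null", "null", " "),
  empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
cleaned <- basic_clean(toy)
print(cleaned)
```

In this toy case the `empty` column is dropped and the repeated row is removed, leaving three observations.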
data_cleansing(
  dat,
  target = NULL,
  obs_id = NULL,
  occur_time = NULL,
  pos_flag = NULL,
  x_list = NULL,
  ex_cols = NULL,
  miss_values = NULL,
  remove_dup = TRUE,
  outlier_proc = TRUE,
  missing_proc = "median",
  low_var = 0.999,
  missing_rate = 0.999,
  merge_cat = TRUE,
  note = TRUE,
  parallel = FALSE,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir()
)
dat | A data frame with x and target.
target | The name of the target variable.
obs_id | The name of the ID of observations. Default is NULL.
occur_time | The name of the occurrence time of observations. Default is NULL.
pos_flag | The value of the positive class of the target variable, default: "1".
x_list | A list of x variables.
ex_cols | A list of excluded variables. Default is NULL.
miss_values | Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded to -1 or "missing".
remove_dup | Logical, if TRUE, remove the duplicated observations.
outlier_proc | Logical, process outliers or not. Default is TRUE.
missing_proc | If logical, process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned to -1 or "missing". Otherwise, missing values are processed according to the results of the missing analysis.
low_var | The maximum percent of unique values (including NAs) for filtering low variance variables.
missing_rate | The maximum percent of missing values for recoding values to missing and non_missing.
merge_cat | The minimum number of categories for merging categories of character variables.
note | Logical. Outputs info. Default is TRUE.
parallel | Logical, parallel computing or not. Default is FALSE.
save_data | Logical, save the result or not. Default is FALSE.
file_name | The name for the periodically saved data file. Default is NULL.
dir_path | The path for the periodically saved data file. Default is tempdir().
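To make the `miss_values` description more concrete, the sketch below shows one way sentinel values and NAs could be recoded to -1 (numeric) or "missing" (character), which is also what the "default" option of `missing_proc` is described as doing. The helper `encode_missing` and the toy data are hypothetical; this is an illustration under those assumptions, not the package's own code.

```r
# Hypothetical helper for illustration only; not part of the package.
encode_missing <- function(dat, miss_values = NULL) {
  dat[] <- lapply(dat, function(x) {
    if (is.factor(x)) x <- as.character(x)
    # Treat both real NAs and the user-supplied sentinel values as missing.
    hit <- is.na(x) | x %in% miss_values
    if (is.numeric(x)) {
      x[hit] <- -1
    } else if (is.character(x)) {
      x[hit] <- "missing"
    }
    x
  })
  dat
}

# Toy data using the sentinel values mentioned in the argument description.
toy <- data.frame(
  income = c(5000, -9999, NA, 7200),
  job = c("clerk", NA, "-9998", "nurse"),
  stringsAsFactors = FALSE
)
encoded <- encode_missing(toy, miss_values = c(-9999, -9998))
print(encoded)
```

Columns that are neither numeric nor character (for example logical columns) are left unchanged in this sketch.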
A preprocessed data.frame
remove_duplicated,
null_blank_na,
entry_rate_na,
low_variance_filter,
process_nas,
process_outliers
# data cleaning
dat_cl = data_cleansing(dat = UCICreditCard[1:2000,],
                       target = "default.payment.next.month",
                       x_list = NULL,
                       obs_id = "ID",
                       occur_time = "apply_date",
                       ex_cols = c("PAY_6|BILL_"),
                       outlier_proc = TRUE,
                       missing_proc = TRUE,
                       low_var = TRUE,
                       save_data = FALSE)
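The `ex_cols` value in the example above, "PAY_6|BILL_", looks like a regular expression rather than a literal column name. Assuming it is matched against the column names (an assumption, since the argument description does not say so), the selection could be sketched in base R as follows. The column names here are an illustrative subset, not the full data set.

```r
# Illustrative column names; assumed for this sketch.
cols <- c("ID", "LIMIT_BAL", "PAY_5", "PAY_6",
          "BILL_AMT1", "BILL_AMT2", "PAY_AMT1")

# Assumption: ex_cols is treated as a regular expression over column names.
pattern <- "PAY_6|BILL_"
excluded <- cols[grepl(pattern, cols)]
kept <- setdiff(cols, excluded)

print(excluded)
print(kept)
```

Under this reading, PAY_6 and every column whose name contains BILL_ are excluded, while PAY_5 and PAY_AMT1 are kept.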