feature_selector: Feature Selection Wrapper

View source: R/variable_selection.R

Description

feature_selector uses four different methods (IV, PSI, correlation, xgboost) to select important features. The correlation filter must be used together with IV.
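The page does not spell out why the correlation filter depends on IV. One plausible reading, given here as an assumption rather than the package's actual algorithm, is that IV decides which member of a highly correlated pair is kept. The helper `cor_iv_sketch` and the IV values below are hypothetical and written in base R only.

```r
# Hypothetical helper, not part of creditmodel: walk through the features
# from highest to lowest IV and keep a feature only if its absolute
# correlation with every feature kept so far is below cor_cp.
cor_iv_sketch <- function(dat, iv, cor_cp = 0.98) {
  vars <- names(sort(iv, decreasing = TRUE))      # strongest features first
  cm <- abs(cor(dat[, vars, drop = FALSE]))       # absolute Pearson correlations
  kept <- character(0)
  for (v in vars) {
    if (length(kept) == 0 || all(cm[v, kept] < cor_cp)) {
      kept <- c(kept, v)
    }
  }
  kept
}

set.seed(46)
a <- rnorm(200)
dat <- data.frame(a = a,
                  b = a + rnorm(200, sd = 0.01),  # near-duplicate of a
                  c = rnorm(200))                 # unrelated feature
iv <- c(a = 0.30, b = 0.25, c = 0.05)             # made-up IV values for illustration
selected <- cor_iv_sketch(dat, iv, cor_cp = 0.98)
print(selected)
```

In this toy data `a` and `b` are near-duplicates, so only the one with the higher assumed IV survives. Without IV values there would be no principled way to choose between them.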

Usage

feature_selector(
  dat_train,
  dat_test = NULL,
  x_list = NULL,
  target = NULL,
  pos_flag = NULL,
  occur_time = NULL,
  ex_cols = NULL,
  filter = c("IV", "PSI", "XGB", "COR"),
  cv_folds = 1,
  iv_cp = 0.01,
  psi_cp = 0.5,
  xgb_cp = 0,
  cor_cp = 0.98,
  breaks_list = NULL,
  hopper = FALSE,
  vars_name = TRUE,
  parallel = FALSE,
  note = TRUE,
  seed = 46,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir(),
  ...
)

Arguments

dat_train

A data.frame with independent variables and target variable.

dat_test

A data.frame of test data. Default is NULL.

x_list

Names of independent variables.

target

The name of the target variable.

pos_flag

The value of the positive class of the target variable, e.g. "1". Default is NULL.

occur_time

The name of the variable that represents the time at which each observation takes place.

ex_cols

A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL.

filter

The methods for selecting important and stable variables.

cv_folds

Number of cross-validation folds. Default is 1.

iv_cp

The minimum threshold of IV. iv_cp > 0; values from 0.01 to 0.1 usually work. Default is 0.01.

psi_cp

The maximum threshold of PSI. 0 <= psi_cp <= 1; values from 0.05 to 0.2 usually work. Default is 0.5.

xgb_cp

Threshold of an XGB feature's Gain. 0 <= xgb_cp <= 1. Default is 0.

cor_cp

Threshold of correlation between features. 0 <= cor_cp <=1; 0.7 to 0.98 usually work. Default is 0.98.

breaks_list

A table containing a list of splitting points for each independent variable. Default is NULL.

hopper

Logical. Filtering screening. Default is FALSE.

vars_name

Logical. If TRUE, output a list of the filtered variable names; if FALSE, output a table with the detailed IV and PSI values of each variable. Default is TRUE.

parallel

Logical, parallel computing. Default is FALSE.

note

Logical. Whether to output information messages. Default is TRUE.

seed

Random number seed. Default is 46.

save_data

Logical, save results in locally specified folder. Default is FALSE.

file_name

The name for periodically saved result files, e.g. "select_vars". Default is NULL.

dir_path

The path for periodically saved result files. Default is tempdir().

...

Other parameters.
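To make the iv_cp and psi_cp thresholds concrete, the sketch below computes IV and PSI from their standard textbook definitions on toy data. It is not the package's implementation: `iv_sketch`, `psi_sketch`, the fixed bin edges and the 0.5 smoothing count are all assumptions made for illustration.

```r
# Hypothetical helper: information value of a binned feature
# against a binary target, IV = sum((pos - neg) * log(pos / neg)).
iv_sketch <- function(bins, y, pos_flag = "1") {
  is_pos <- factor(y == pos_flag, levels = c(FALSE, TRUE))
  tab <- table(bins, is_pos)                # rows: bins; columns: "FALSE", "TRUE"
  neg <- tab[, "FALSE"] + 0.5               # 0.5 avoids log(0) in empty cells (assumption)
  pos <- tab[, "TRUE"] + 0.5
  neg <- neg / sum(neg)                     # share of negatives falling in each bin
  pos <- pos / sum(pos)                     # share of positives falling in each bin
  sum((pos - neg) * log(pos / neg))
}

# Hypothetical helper: population stability index between two samples
# of the same binned feature, PSI = sum((a - e) * log(a / e)).
psi_sketch <- function(expected_bins, actual_bins) {
  lv <- union(as.character(expected_bins), as.character(actual_bins))
  e <- table(factor(expected_bins, levels = lv)) + 0.5
  a <- table(factor(actual_bins, levels = lv)) + 0.5
  e <- e / sum(e)
  a <- a / sum(a)
  sum((a - e) * log(a / e))
}

set.seed(46)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(x))              # toy target that depends on x
bins <- cut(x, breaks = c(-Inf, -1, 0, 1, Inf))

iv  <- iv_sketch(bins, y)
psi <- psi_sketch(bins[1:250], bins[251:500])

# A feature passes when it is predictive enough and stable enough.
keep <- iv >= 0.01 && psi <= 0.1
```

Under these definitions a feature is retained when its IV is at least iv_cp and its PSI is at most psi_cp, which is how the two thresholds are described above.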

Value

A list of selected features

See Also

psi_iv_filter, xgb_filter, gbm_filter

Examples

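library(creditmodel)  # provides feature_selector() and the UCICreditCard data

# Filter the selected columns of the first 1000 rows by IV and PSI,
# returning the detailed table (vars_name = FALSE) without messages.
feature_selector(dat_train = UCICreditCard[1:1000, c(2, 8:12, 26)],
                 dat_test = NULL, target = "default.payment.next.month",
                 occur_time = "apply_date", filter = c("IV", "PSI"),
                 cv_folds = 1, iv_cp = 0.01, psi_cp = 0.1, xgb_cp = 0, cor_cp = 0.98,
                 vars_name = FALSE, note = FALSE)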

Example output

Package 'creditmodel' version 1.2.7
  Feature    IV   PSI
1   PAY_0 1.019 0.032
2   PAY_2 0.467 0.022
3   PAY_3 0.419 0.017
4   PAY_4 0.255 0.014
5   PAY_5 0.325 0.018
Warning message:
In train_test_split(dat = dat_train, split_type = "OOT", prop = 0.7,  :
  apply_date is  not date or time, unable to use OOT , split random.

creditmodel documentation built on Jan. 7, 2022, 5:06 p.m.