training_model: Training model

View source: R/model_training.R

Description

training_model is a model builder: it optionally preprocesses the data, filters features, and trains the selected algorithms.

Usage

training_model(
  model_name = "mymodel",
  dat,
  dat_test = NULL,
  target = NULL,
  occur_time = NULL,
  obs_id = NULL,
  x_list = NULL,
  ex_cols = NULL,
  pos_flag = NULL,
  prop = 0.7,
  split_type = if (!is.null(occur_time)) "OOT" else "Random",
  preproc = TRUE,
  low_var = 0.99,
  missing_rate = 0.98,
  merge_cat = 30,
  remove_dup = TRUE,
  outlier_proc = TRUE,
  missing_proc = "median",
  default_miss = list(-1, "missing"),
  miss_values = NULL,
  one_hot = FALSE,
  trans_log = FALSE,
  feature_filter = list(filter = c("IV", "PSI", "COR", "XGB"), iv_cp = 0.02,
    psi_cp = 0.1, xgb_cp = 0, cv_folds = 1, hopper = FALSE),
  algorithm = list("LR", "XGB", "GBM", "RF"),
  LR.params = lr_params(),
  XGB.params = xgb_params(),
  GBM.params = gbm_params(),
  RF.params = rf_params(),
  breaks_list = NULL,
  parallel = FALSE,
  cores_num = NULL,
  save_pmml = FALSE,
  plot_show = FALSE,
  vars_plot = TRUE,
  model_path = tempdir(),
  seed = 46,
  ...
)

Arguments

model_name

A string, the name of the project. Default is "mymodel".

dat

A data.frame with independent variables and target variable.

dat_test

A data.frame of test data. Default is NULL.

target

The name of the target variable.

occur_time

The name of the variable that represents the time at which each observation takes place. Default is NULL.

obs_id

The name of the observation ID or key variable of the data. Default is NULL.

x_list

Names of independent variables. Default is NULL.

ex_cols

Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL.

pos_flag

The value of the positive class of the target variable. Default: "1".

prop

Proportion of the data assigned to the training set after the partition. Default is 0.7.

split_type

Method for partitioning. Default is "OOT" if occur_time is supplied, otherwise "Random". See details at: train_test_split.

preproc

Logical. Preprocess data. Default is TRUE.

low_var

Threshold for deleting low-variance variables. Default is 0.99.

missing_rate

The maximum percentage of missing values above which a variable's values are recoded to missing and non_missing. Default is 0.98.

merge_cat

Merge the categories of character variables that have more than merge_cat categories. Default is 30.

remove_dup

Logical. If TRUE, remove duplicated observations. Default is TRUE.

outlier_proc

Logical, process outliers or not. Default is TRUE.

missing_proc

If logical, whether to process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned -1 or "missing". Otherwise, missing values are processed according to the results of the missing-value analysis. Default is "median".

default_miss

Default values for missing data imputation. Default is list(-1, "missing").

miss_values

Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded to -1 or "missing". Default is NULL.

one_hot

Logical. If TRUE, apply one-hot encoding to category variables. Default is FALSE.

trans_log

Logical. If TRUE, apply a logarithmic transformation. Default is FALSE.

feature_filter

Parameters for selecting important and stable features. See details at: feature_selector.

algorithm

Algorithms for training a model. list("LR", "XGB", "GBM", "RF") are available.

LR.params

Parameters of logistic regression & scorecard. See details at: lr_params.

XGB.params

Parameters of xgboost. See details at: xgb_params.

GBM.params

Parameters of GBM. See details at: gbm_params.

RF.params

Parameters of Random Forest. See details at: rf_params.

breaks_list

A table containing a list of splitting points for each independent variable. Default is NULL.

parallel

Logical, whether to use parallel computing. Default is FALSE.

cores_num

The number of CPU cores to use. Default is NULL.

save_pmml

Logical, save the model in PMML format. Default is FALSE.

plot_show

Logical, show model performance in the current graphics device. Default is FALSE.

vars_plot

Logical. If TRUE, plot the distribution, correlation, or partial dependence of the model input variables. Default is TRUE.

model_path

The path where data files are periodically saved. Default is tempdir().

seed

Random number seed. Default is 46.

...

Other parameters.
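
Several of the arguments above describe simple data transformations. The following base R sketch illustrates three of them on a toy data frame: recoding miss_values to the default_miss values, one-hot encoding a character variable, and partitioning by prop and split_type. It does not call creditmodel and is not the package's implementation. The data frame, its column names, and the reading of "OOT" as an out-of-time split ordered by the occur_time variable are assumptions made for illustration.

```r
# Base R illustration only; none of this is creditmodel's internal code.
# The toy data frame and its column names are hypothetical.
toy <- data.frame(
  id         = 1:10,
  apply_date = as.Date("2021-01-01") + 0:9,
  income     = c(50, -9999, 62, 48, NA, 71, 55, -9999, 66, 59),
  channel    = c("web", "app", NA, "web", "app", "web", "app", "web", NA, "app"),
  y          = rep(c(0, 1), 5),
  stringsAsFactors = FALSE
)

# 1. miss_values / default_miss: treat sentinel codes as missing, then
#    fill numeric columns with -1 and character columns with "missing".
miss_values  <- c(-9999, -9998)
default_miss <- list(-1, "missing")
toy$income[toy$income %in% miss_values | is.na(toy$income)] <- default_miss[[1]]
toy$channel[is.na(toy$channel)] <- default_miss[[2]]

# 2. one_hot: expand the character column into 0/1 indicator columns.
#    "- 1" drops the intercept so every level gets its own column.
dummies     <- model.matrix(~ channel - 1, data = toy)
toy_encoded <- cbind(toy[, c("id", "apply_date", "income", "y")], dummies)

# 3. prop / split_type: 70% of the rows go to the training set.
prop    <- 0.7
n_train <- round(prop * nrow(toy_encoded))

set.seed(46)  # plays the same role as the seed argument
random_idx   <- sample(nrow(toy_encoded), n_train)
random_train <- toy_encoded[random_idx, ]   # "Random" partition
random_test  <- toy_encoded[-random_idx, ]

by_time   <- toy_encoded[order(toy_encoded$apply_date), ]
oot_train <- by_time[seq_len(n_train), ]    # assumed "OOT": earliest rows train
oot_test  <- by_time[-seq_len(n_train), ]   # latest rows are held out
```

In the out-of-time variant every training row precedes every test row, which is why a time variable such as occur_time is needed for that kind of split.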

Value

A list containing Model Objects.

See Also

train_test_split, data_cleansing, feature_selector, lr_params, xgb_params, gbm_params, rf_params, fast_high_cor_filter, get_breaks_all, lasso_filter, woe_trans_all, get_logistic_coef, score_transfer, get_score_card, model_key_index, ks_psi_plot, ks_table_plot

Examples

# Load the package that provides training_model() and the UCICreditCard data.
library(creditmodel)

# Take the first element of a 30-fold split as a small subsample.
sub <- cv_split(UCICreditCard, k = 30)[[1]]
dat <- UCICreditCard[sub, ]
x_list <- c("LIMIT_BAL")

# Train a logistic regression on a single predictor, with preprocessing,
# feature filtering, and plotting switched off.
B_model <- training_model(
  dat = dat,
  model_name = "UCICreditCard",
  target = "default.payment.next.month",
  x_list = x_list,
  occur_time = NULL,
  obs_id = NULL,
  dat_test = NULL,
  preproc = FALSE,
  outlier_proc = FALSE,
  missing_proc = FALSE,
  feature_filter = NULL,
  algorithm = list("LR"),
  LR.params = lr_params(lasso = FALSE,
                        step_wise = FALSE,
                        score_card = FALSE),
  breaks_list = NULL,
  parallel = FALSE,
  cores_num = NULL,
  save_pmml = FALSE,
  plot_show = FALSE,
  vars_plot = FALSE,
  model_path = tempdir(),
  seed = 46
)

creditmodel documentation built on Jan. 7, 2022, 5:06 p.m.