process_pipeline: process_pipeline

process_pipeline R Documentation

process_pipeline

Description

The main EDA function to call. It is a wrapper around the functions that bin the attributes in the dataframe and return summarized tables.

Usage

process_pipeline(
  run_id,
  df,
  unique_id_var,
  dv_var,
  dv_type = "Binary",
  dv_denominator = NULL,
  var_list,
  num_nbins = 20,
  num_min_pct = 0.02,
  num_binning_type = "Bucketing",
  num_monotonic = TRUE,
  cat_max_levels = 200,
  cat_min_pct = 0.02,
  bin_random_together = 0.005,
  eda_tracking = TRUE,
  path_2_save = getwd()
)
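
As an illustration of the signature above, the following is a minimal sketch of a call on toy data. The package name `eda` (taken from the repository name), the column names, and passing var_list as a character vector of column names are assumptions, not confirmed by this page; the call is guarded so it only runs if such a package and function are available.

```r
# Toy data: every column name here is hypothetical
set.seed(42)
n <- 500
toy_df <- data.frame(
  id     = seq_len(n),                          # unique record identifier
  age    = round(runif(n, 18, 80)),             # a numeric attribute
  region = sample(c("north", "south", "east"), n, replace = TRUE),
  target = rbinom(n, 1, 0.2)                    # binary dependent variable
)

# Assumption: the package installs under the name "eda" and exports
# process_pipeline(); otherwise this block is skipped
if (requireNamespace("eda", quietly = TRUE) &&
    "process_pipeline" %in% getNamespaceExports("eda")) {
  results <- eda::process_pipeline(
    run_id        = "MyRun1",
    df            = toy_df,
    unique_id_var = "id",
    dv_var        = "target",
    dv_type       = "Binary",
    var_list      = c("age", "region"),  # assumed format: character vector
    path_2_save   = tempdir()            # keep outputs out of the working dir
  )
  names(results)
}
```

All other arguments are left at the defaults shown in the Usage block.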

Arguments

run_id

An identifier that will be used when naming output tables to the specified path (path_2_save parameter). Example: 'MyRun1'

df

The dataframe you want to analyze

unique_id_var

A variable in your dataframe that uniquely identifies a record. Can only be 1 variable.

dv_var

The name of the dependent variable (dv). Example: 'target'

dv_type

Can take one of two inputs: c('Binary','Frequency'). In both cases the dependent variable should be numeric. If 'Frequency' is selected and the dependent variable is a rate, dv_var should be the numerator; the denominator is specified separately through the dv_denominator parameter

dv_denominator

The denominator of your dependent variable. In many cases, this can be considered the exposure.
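
To make the numerator/denominator split concrete, here is a small base R sketch of how a rate could be aggregated per group for a 'Frequency' dependent variable. The column names and the sum-over-sum aggregation are assumptions for illustration; this page does not state how the package computes the rate internally.

```r
# Hypothetical records: claim counts (numerator) and exposure (denominator)
toy <- data.frame(
  bin      = c("A", "A", "B", "B"),
  claims   = c(1, 0, 2, 1),
  exposure = c(1.0, 0.5, 1.0, 0.5)
)

# Sum numerator and denominator within each bin, then divide
agg <- aggregate(cbind(claims, exposure) ~ bin, data = toy, FUN = sum)
agg$rate <- agg$claims / agg$exposure
agg
```

Dividing the summed numerator by the summed denominator weights each record by its exposure, which differs from averaging per-record rates.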

var_list

A list of non-numeric variables to analyze and create bins for

num_nbins

For numeric variables, the maximum number of bins to initially split each variable into. Default is 20

num_min_pct

For numeric variables, the minimum percent of records a final bin should have. The input should be in the interval (0,1). Generally applies only to bins that are not NA. Default is 0.02 (or 2 percent)

num_binning_type

The type of binning to use when splitting the variable. One of two can be selected: c("Bucketing","Quantiles"). "Bucketing" uses the cut() function where breaks=nbins. "Quantiles" uses the cut() function where breaks=c(-Inf, unique(quantile(tmpDF[,i], probs=seq(0,1,by=1/nbins), include.lowest=TRUE, na.rm=TRUE))). Default is "Bucketing"
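
The difference between the two binning types can be sketched in base R as follows. The vector `x` and the value of `nbins` are made up for illustration, and include.lowest is passed to cut() here, which is where base R defines that argument; this is a sketch of the two break definitions, not the package's internal code.

```r
set.seed(1)
x <- c(rexp(200), NA)   # skewed toy variable with one missing value
nbins <- 5

# "Bucketing": cut() with a single number splits the range into
# nbins intervals of equal width
bucket_bins <- cut(x, breaks = nbins)

# "Quantiles": breaks at the empirical quantiles, with -Inf prepended
# and duplicates removed
q_breaks <- c(-Inf, unique(quantile(x, probs = seq(0, 1, by = 1 / nbins),
                                    na.rm = TRUE)))
quantile_bins <- cut(x, breaks = q_breaks, include.lowest = TRUE)

table(bucket_bins, useNA = "ifany")
table(quantile_bins, useNA = "ifany")
```

On skewed data, equal-width buckets tend to leave the upper intervals sparsely populated, while quantile breaks spread records more evenly. Note that prepending -Inf creates an extra first interval ending at the minimum.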

num_monotonic

For numeric variables, this is a Logical TRUE/FALSE input. If TRUE, it will force the bins to be monotonic based on the event rate. Default is TRUE
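
As a rough illustration of what monotonic bins mean, the sketch below checks whether event rates move in one direction across ordered bins and shows that merging two adjacent bins can restore monotonicity. The counts are invented, and merging adjacent bins is only one possible approach; this page does not describe the method the package actually uses.

```r
# Hypothetical per-bin counts, bins ordered by the numeric variable
events  <- c(2, 5, 4, 9)
records <- c(100, 100, 100, 100)
event_rate <- events / records   # 0.02, 0.05, 0.04, 0.09

# TRUE when the rates never change direction across bins
is_monotonic <- function(r) all(diff(r) >= 0) || all(diff(r) <= 0)
before <- is_monotonic(event_rate)

# Merge bins 2 and 3 by pooling their counts
merged_events  <- c(events[1],  sum(events[2:3]),  events[4])
merged_records <- c(records[1], sum(records[2:3]), records[4])
after <- is_monotonic(merged_events / merged_records)

c(before = before, after = after)
```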

cat_max_levels

For non-numeric variables, if a variable initially has more unique levels than cat_max_levels, it will be skipped. Default is 200

cat_min_pct

For non-numeric variables, the minimum percent of records a final bin should have. The input should be in the interval (0,1). Generally applies only to bins that are not NA. Default is 0.02 (or 2 percent)

bin_random_together

The threshold used to identify whether a level belongs in a random bin. The input should be in the interval (0,1). Generally applies only to bins that are not NA. Default is 0.005 (or 0.5 percent)
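
One plausible reading of this threshold is that levels whose share of records falls below it are pooled into a single bin. The sketch below illustrates that reading only; the variable, the pooled bin name "pooled_rare", and the strict less-than comparison are assumptions, not the package's documented behavior.

```r
# Hypothetical categorical variable with two very rare levels
region <- c(rep("north", 600), rep("south", 396), rep("east", 3), "west")
bin_random_together <- 0.005

# Share of records per level
share <- prop.table(table(region))

# Levels below the threshold are pooled into one bin
rare_levels   <- names(share)[share < bin_random_together]
region_binned <- ifelse(region %in% rare_levels, "pooled_rare", region)

table(region_binned)
```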

eda_tracking

Logical TRUE/FALSE input. If set to TRUE, the user will be able to see which variable the function is currently analyzing. Default is TRUE

path_2_save

A path to a folder where the outputs will be stored. Default is: getwd(). Or an example: /store/outputs/in/this/folder

Value

A list of dataframes. The first, 'Numeric_eda', is an aggregated dataframe showing the groups created along with other key information. The second, 'numeric_iv', is a dataframe listing each variable processed and its information value. The last, 'numeric_logics', is a dataframe with the information needed to transform the variables in your dataframe; it is the input to apply_numeric_logic(logic_df=numeric_logics)
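
For readers unfamiliar with information value, the sketch below computes it for one binned variable with a binary dependent variable, using the common weight-of-evidence definition. The counts are invented, and this page does not state the exact formula the package uses, so treat this as the textbook version rather than the package's implementation.

```r
# Hypothetical per-bin counts for one variable
events     <- c(10, 20, 70)     # records with dv = 1
non_events <- c(300, 400, 200)  # records with dv = 0

# Distribution of events and non-events across bins
pct_event     <- events / sum(events)
pct_non_event <- non_events / sum(non_events)

# Weight of evidence per bin, then information value for the variable
woe <- log(pct_non_event / pct_event)
iv  <- sum((pct_non_event - pct_event) * woe)
iv
```

Each term in the sum is non-negative, so a larger value indicates that the bins separate events from non-events more strongly. A bin with zero events or zero non-events makes the logarithm undefined and needs separate handling.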


cjodice10/eda documentation built on Feb. 7, 2023, 3:26 p.m.