discretize_get_bins: Get the data frame thresholds for discretization


Description

Takes a data frame and returns another data frame containing the thresholds (cut points) of each bin (or segment), which are then used to discretize each variable.

Usage

discretize_get_bins(data, n_bins = 5, input = NULL)

Arguments

data

Data frame source

n_bins

The number of desired bins (or segments) that each variable will have.

input

Vector of strings containing the names of all the variables to process. If empty, the function runs for every numerical variable whose number of unique values is higher than the value of the 'n_bins' parameter. NA values are handled automatically by converting them into another category (more info at https://livebook.datascienceheroes.com/data-preparation.html#treating-missing-values-in-numerical-variables). This function must be used together with discretize_df. If a different number of bins is needed per variable, the function must be called more than once.
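The default selection rule described for 'input' can be sketched in base R. The helper `select_binnable` and the `toy` data frame are hypothetical names used only for illustration; this is not the package's code, and whether the package counts NA as a unique value is not covered here.

```r
# Hypothetical helper (not part of funModeling): returns the names of the
# numeric columns that have more unique values than n_bins.
select_binnable <- function(data, n_bins = 5) {
  is_candidate <- vapply(
    data,
    function(col) is.numeric(col) && length(unique(col)) > n_bins,
    logical(1)
  )
  names(data)[is_candidate]
}

# Toy data: only 'age' is numeric with more than 5 unique values.
toy <- data.frame(
  age        = c(23, 35, 41, 52, 58, 64, 70),
  n_children = c(0, 1, 2, 0, 1, 2, 0),
  city       = c("a", "b", "a", "c", "b", "a", "c")
)

selected <- select_binnable(toy, n_bins = 5)
print(selected)
```

Under this rule a variable such as `n_children`, with only three distinct values, is skipped, since it already has fewer values than the requested number of bins.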

Value

Data frame containing the thresholds or cuts to bin every variable
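To illustrate what such thresholds look like and how they can be reused on new data, here is a minimal base R sketch. It assumes equal-frequency (quantile-based) cuts, which is an assumption not stated on this page, and `get_cuts` and `apply_cuts` are hypothetical helpers, not funModeling functions.

```r
# Hypothetical helper: inner cut points that split x into n_bins
# groups of roughly equal size (equal-frequency assumption).
get_cuts <- function(x, n_bins = 5) {
  probs <- seq_len(n_bins - 1) / n_bins
  unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
}

# Hypothetical helper: apply previously computed cut points to any vector.
# Open-ended outer breaks let unseen, more extreme values fall into the
# first or last bin.
apply_cuts <- function(x, cuts) {
  cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)
}

train <- 1:100
cuts <- get_cuts(train, n_bins = 5)

# Discretize the data the thresholds were learned from
binned_train <- apply_cuts(train, cuts)
print(table(binned_train))

# Reuse the same thresholds on new data, including an out-of-range value
binned_new <- apply_cuts(c(5, 50, 150), cuts)
print(binned_new)
```

Keeping the thresholds separate from the data they were computed on is what allows the same binning to be applied later to a different data frame.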

Examples

## Not run: 
library(funModeling)  # provides the functions and the heart_disease data used below

# Get the bin thresholds for each variable. If input is missing, it runs for all numerical variables.
d_bins <- discretize_get_bins(data = heart_disease,
                              input = c("resting_blood_pressure", "oldpeak"),
                              n_bins = 5)

# The thresholds can now be applied to the same data frame or to a new one (for example,
# in a predictive model whose data changes over time)
heart_disease_discretized <- discretize_df(data = heart_disease, data_bins = d_bins, stringsAsFactors = TRUE)

# Checking results
df_status(heart_disease_discretized)

## End(Not run)

Example output

Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

funModeling v.1.9.4 :)
Examples and tutorials at livebook.datascienceheroes.com
 / Now in Spanish: librovivodecienciadedatos.ai
Variables processed: resting_blood_pressure, oldpeak
Variables processed: resting_blood_pressure, oldpeak
Warning message:
`funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
                 variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
1                     age       0    0.00    0 0.00     0     0 integer     41
2                  gender       0    0.00    0 0.00     0     0  factor      2
3              chest_pain       0    0.00    0 0.00     0     0  factor      4
4  resting_blood_pressure       0    0.00    0 0.00     0     0  factor      5
5       serum_cholestoral       0    0.00    0 0.00     0     0 integer    152
6     fasting_blood_sugar     258   85.15    0 0.00     0     0  factor      2
7         resting_electro     151   49.83    0 0.00     0     0  factor      3
8          max_heart_rate       0    0.00    0 0.00     0     0 integer     91
9             exer_angina     204   67.33    0 0.00     0     0 integer      2
10                oldpeak       0    0.00    0 0.00     0     0  factor      5
11                  slope       0    0.00    0 0.00     0     0 integer      3
12      num_vessels_flour     176   58.09    4 1.32     0     0 integer      4
13                   thal       0    0.00    2 0.66     0     0  factor      3
14 heart_disease_severity     164   54.13    0 0.00     0     0 integer      5
15           exter_angina     204   67.33    0 0.00     0     0  factor      2
16      has_heart_disease       0    0.00    0 0.00     0     0  factor      2

funModeling documentation built on July 1, 2020, 5:40 p.m.