knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(rbin)
Binning is the process of transforming numerical or continuous data into categorical data. It is a common data pre-processing step of the model building process.
rbin has the following features:
For manual binning, you need to specify the cut points for the bins. rbin
follows the left closed and
right open interval ([0,1) = {x | 0 ≤ x < 1}
) for creating bins. The number of cut points you specify
is one less than the number of bins you want to create i.e. if you want to create 10 bins, you
need to specify only 9 cut points as shown in the below example. The accompanying RStudio addin,
rbinAddin()
can be used to iteratively bin the data and to enforce monotonic increasing/decreasing trend.
After finalizing the bins, you can use rbin_create()
to create the dummy variables.
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56)) bins
# plot plot(bins)
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56)) rbin_create(mbank, age, bins)
You can collapse or combine levels of a factor/categorical variable using rbin_factor_combine()
and then use rbin_factor()
to look at weight of evidence, entropy and information value. After
finalizing the bins, you can use rbin_factor_create()
to create the dummy variables. You can
use the RStudio addin, rbinFactorAddin()
to interactively combine the levels and create dummy variables
after finalizing the bins.
upper <- c("secondary", "tertiary") out <- rbin_factor_combine(mbank, education, upper, "upper") table(out$education) out <- rbin_factor_combine(mbank, education, c("secondary", "tertiary"), "upper") table(out$education)
bins <- rbin_factor(mbank, y, education) bins
# plot plot(bins)
upper <- c("secondary", "tertiary") out <- rbin_factor_combine(mbank, education, upper, "upper") rbin_factor_create(out, education)
Quantile binning aims to bin the data into roughly equal groups using quantiles.
bins <- rbin_quantiles(mbank, y, age, 10) bins
# plot plot(bins)
Equal length binning creates bins of equal widths. It is different from equal frequency binning which creates bins of equal size.
bins <- rbin_equal_length(mbank, y, age, 10) bins
# plot plot(bins)
Winsorized binning is similar to equal length binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data pre-processing stage. For Winsorized binning, the Winsorized statistics are computed first. After the minimum and maximum have been found, the split points are calculated the same way as in equal length binning.
bins <- rbin_winsorize(mbank, y, age, 10, winsor_rate = 0.05) bins
# plot plot(bins)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.