discretization | R Documentation |

This function discretizes a data frame of possibly continuous random variables using one of several discretization rules. The discretization algorithms are unsupervised and univariate. See details for the complete list (the number of states of each random variable can also be provided).

discretization(data.df = NULL, data.dists = NULL, discretization.method = "sturges", nb.states = FALSE)

`data.df` |
a data frame containing the data to discretize. Binary variables must be declared as factors, others as numeric vectors. The data frame must be named. |

`data.dists` |
a named list giving the distribution for each node in the network. |

`discretization.method` |
a character vector giving the discretization method to use; see details. If a number is provided, the variable will be discretized by equal binning. |

`nb.states` |
logical variable to select the output. If set to |

`discretization()` supports multiple rules for discretization. Below is the list of supported rules. IQR() stands for the interquartile range.

`fd`

stands for the Freedman-Diaconis rule. The number of bins is given by:

*range(x) n^{1/3} / (2 IQR(x))*

The Freedman-Diaconis rule is known to be less sensitive to outliers than Scott's rule.

`doane`

stands for Doane's rule. The number of bins is given by:

*1 + \log_{2}(n) + \log_{2}(1 + \frac{|g|}{σ_{g}})*

where g is the estimated skewness of x and σ_{g} = √(6(n-2) / ((n+1)(n+3))). This is a modification of Sturges' formula, which attempts to improve its performance with non-normal data.

`sqrt`

The number of bins is given by:

*√(n)*

`cencov`

stands for Cencov's rule. The number of bins is given by:

*n^{1/3}*

`rice`

stands for the Rice rule. The number of bins is given by:

*2 n^{1/3}*

`terrell-scott`

stands for Terrell-Scott's rule. The number of bins is given by:

*(2 n)^{1/3}*

The Cencov, Rice, and Terrell-Scott rules are known to over-estimate the number of bins k compared to the other rules, owing to their simplicity.

`sturges`

stands for Sturges's rule. The number of bins is given by:

*1 + \log_2(n)*

`scott`

stands for Scott's rule. The number of bins is given by:

*range(x) / (σ(x) n^{-1/3})*
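The rules listed above can be evaluated directly in base R. The following is a minimal sketch, not the function's own implementation: `bin_counts` is a hypothetical helper, range(x) is taken to mean the width max(x) - min(x), the skewness g is computed from the second and third central moments, and rounding up to the next integer is an assumption, since the rounding is not stated here.

```r
# Sketch: number of bins suggested by each rule, using base R only.
# `bin_counts` is a hypothetical helper and not part of any package.
bin_counts <- function(x) {
  n <- length(x)
  rng <- diff(range(x))  # max(x) - min(x)

  # Moment-based skewness g and its standard error sigma_g (Doane's rule)
  centred <- x - mean(x)
  g <- mean(centred^3) / mean(centred^2)^(3 / 2)
  sigma_g <- sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))

  k <- c(
    "fd"            = rng * n^(1 / 3) / (2 * IQR(x)),
    "doane"         = 1 + log2(n) + log2(1 + abs(g) / sigma_g),
    "sqrt"          = sqrt(n),
    "cencov"        = n^(1 / 3),
    "rice"          = 2 * n^(1 / 3),
    "terrell-scott" = (2 * n)^(1 / 3),
    "sturges"       = 1 + log2(n),
    "scott"         = rng / (sd(x) * n^(-1 / 3))
  )

  # Assumption: round up to a whole number of bins
  ceiling(k)
}

set.seed(1)
x <- rnorm(n = 100, mean = 5, sd = 2)
bin_counts(x)
```

The `sqrt`, `cencov`, `rice`, `terrell-scott`, and `sturges` entries depend only on n, whereas `fd`, `doane`, and `scott` also depend on the spread or shape of the data.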

The discretized data frame, or a list containing the table of counts for each bin of the discretized data frame.
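To illustrate what equal binning and a table of counts per bin look like for a single numeric variable, here is a small base R sketch. It is an illustration of the idea only and makes no claim about how `discretization()` builds its output; the variable names and the choice of 8 bins are assumptions made for the example.

```r
# Sketch: equal-width binning of one numeric variable with base R.
# Illustration only; this is not the code used by discretization().
set.seed(1)
x <- rnorm(n = 100, mean = 5, sd = 2)
k <- 8  # assumed number of bins, e.g. ceiling(1 + log2(100)) from Sturges' rule

# cut() with a single number splits the range of x into k equal-width intervals
x_binned <- cut(x, breaks = k)

# Table of counts per bin
counts <- table(x_binned)
counts
```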

Gilles Kratzer

Garcia, S., et al. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. *IEEE Transactions on Knowledge and Data Engineering*, 25.4, 734-750.

Cebeci, Z. and Yildiz, F. (2017). Unsupervised Discretization of Continuous Variables in a Chicken Egg Quality Traits Dataset. *Turkish Journal of Agriculture-Food Science and Technology*, 5.4, 315-320.

    ## Generate random variable
    rv <- rnorm(n = 100, mean = 5, sd = 2)
    dist <- list("gaussian")
    names(dist) <- c("rv")

    ## Compute the entropy through discretization
    entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
                                             discretization.method = "sturges",
                                             nb.states = FALSE))
