The `glmdisc` package: discretization at its finest

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.align = "center",
  fig.height = 4
)

Context

This research has been financed by Crédit Agricole Consumer Finance (CA CF), a subsidiary of the Crédit Agricole Group, which provides all kinds of banking and insurance services. CA CF specializes in consumer loans. It is a joint work at Inria Nord-Europe between Adrien Ehrhardt (CA CF, Inria), Christophe Biernacki (Inria, Lille University), Vincent Vandewalle (Inria, Lille University) and Philippe Heinrich (Lille University).

In order to accept or reject loan applications more efficiently (both more quickly and by selecting better applicants), most financial institutions resort to Credit Scoring: given the applicant's characteristics, he or she is given a Credit Score, which has been statistically designed using previously accepted applicants and which partly decides whether the financial institution will grant the loan or not.

The current methodology for building a Credit Score in most financial institutions is based on logistic regression, for several (mostly practical) reasons: it generally gives satisfactory discriminant results, it is reasonably easy to explain (contrary to a random forest, for example), and it is robust to missing data (in particular non-financed clients) that follow a MAR missingness mechanism.

Example

In practice, the statistical modeler has historical data about each customer's characteristics. For obvious reasons, only data available at the time of inquiry must be used to build a future application scorecard. These data often take the form of a well-structured table with one line per client alongside their performance (did they pay back their loan or not?), as can be seen in the following table:

knitr::kable(data.frame(
     Job = c("Craftsman", "Technician", "Executive", "Office employee"),
     Habitation = c("Owner", "Renter", "Starter", "By family"),
     Time_in_job = c(10, 20, 5, 2),
     Children = c(0, 1, 2, 3),
     Family_status = c("Divorced", "Widower", "Single", "Married"),
     Default = c("No", "No", "Yes", "No")))

Notations

In the rest of the vignette, the random vector $X=(X_j)_1^d$ will designate the predictive features, i.e. the characteristics of a client. The random variable $Y \in \{0,1\}$ will designate the label, i.e. if the client has defaulted ($Y=1$) or not ($Y=0$).

We are provided with an i.i.d. sample $(\bar{x},\bar{y}) = (x_i,y_i)_1^n$ consisting of $n$ observations of $X$ and $Y$.

Logistic regression

The logistic regression model assumes the following relation between $X$ (supposed continuous here) and $Y$: $$\ln \left( \frac{p_\theta(Y=1|x)}{p_\theta(Y=0|x)} \right) = \theta_0 + \sum_{j=1}^d \theta_j x_j$$ where $\theta = (\theta_j)_0^d$ are estimated using $(\bar{x},\bar{y})$.

Clearly, the model assumes linearity of the logit transform of the response $Y$ with respect to $X$.

Common problems with logistic regression on "raw" data

Fitting a logistic regression model on "raw" data presents several problems, among which some are tackled here.

Feature selection

First, among all the information collected on individuals, some features are irrelevant for predicting $Y$. Their coefficients $\theta_j$ should consequently be $0$, which will (at best) only be the case asymptotically (i.e. as $n \rightarrow \infty$).

Second, some collected features are highly correlated, which degrades the estimation of each other's coefficients.

As a consequence, data scientists often perform feature selection before training a machine learning algorithm such as logistic regression.

There already exist methods and packages to perform feature selection; see for example the caret package.
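As an aside, and purely for illustration (this is base R's stepwise selection, not what glmdisc does), a simple feature selection baseline on simulated data could look like:

# illustrative baseline: backward stepwise selection with AIC on simulated data
set.seed(1)
d_sel <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d_sel$y <- rbinom(200, 1, plogis(2 * d_sel$x1))   # only x1 is truly predictive
full_model <- glm(y ~ x1 + x2 + x3, family = binomial, data = d_sel)
step(full_model, direction = "backward", trace = 0)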

glmdisc is not a feature selection tool, but it performs feature selection as a side effect, as we will see in the next section.

Linearity

When provided with continuous features, the logistic regression model assumes linearity of the logit transform of the response $Y$ with respect to $X$. This might not be the case at all.

For example, we can simulate a logistic model with an arbitrary power of $X$ and then try to fit a linear logistic model:

x <- matrix(runif(1000), nrow = 1000, ncol = 1)
p <- 1 / (1 + exp(-3 * x^5))            # true probability: the logit is 3 * x^5
y <- rbinom(1000, 1, p)
modele_lin <- glm(y ~ x, family = binomial(link = "logit"))
pred_lin <- predict(modele_lin, as.data.frame(x), type = "response")
pred_lin_logit <- predict(modele_lin, as.data.frame(x))   # predictions on the logit scale
knitr::kable(head(data.frame(True_prob = p, Pred_lin = pred_lin)))

Of course, providing the glm function with a formula object containing $X^5$ would solve the problem. This cannot be done in practice for two reasons: first, it is too time-consuming to examine all features and all candidate polynomial transformations; second, we would lose the interpretability of the logistic decision function, which was of primary interest.
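For illustration only, and assuming we somehow knew the true exponent, such a fit would simply be:

modele_poly <- glm(y ~ I(x^5), family = binomial(link = "logit"))
pred_poly <- predict(modele_poly, type = "response")
knitr::kable(head(data.frame(True_prob = p, Pred_poly = pred_poly)))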

Consequently, we wish to discretize the input variable $X$ into a categorical feature which will "minimize" the error with respect to the "true" underlying relation:

x_disc <- factor(cut(x, c(-Inf, 0.5, 0.7, 0.8, 0.9, +Inf)), labels = c(1, 2, 3, 4, 5))
modele_disc <- glm(y ~ x_disc, family = binomial(link = "logit"))
pred_disc <- predict(modele_disc, as.data.frame(x_disc), type = "response")
pred_disc_logit <- predict(modele_disc, as.data.frame(x_disc))   # logit scale
knitr::kable(head(data.frame(True_prob = p, Pred_lin = pred_lin, Pred_disc = pred_disc)))
plot(x, 3 * x^5, main = "Estimated logit transform of p(Y|X)", ylab = "p(Y|X) under different models")
lines(x, pred_lin_logit, type = "p", col = "red")    # linear fit
lines(x, pred_disc_logit, type = "p", col = "blue")  # discretized fit

Too many values per categorical feature

When provided with categorical features, the logistic regression model fits a coefficient for each of their values (except one, taken as a reference). A common problem arises when there are too many values: each value is then taken by only a small number of observations $x_{i,j}$, which makes the estimation of the corresponding logistic regression coefficient unstable:

x_disc_bad_idea <- factor(cut(x,c(-Inf,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,+Inf)),labels = c(1,2,3,4,5,6,7,8,9,10))

If we split the training set into 10 parts and estimate the variance of each coefficient across them, we get:

liste_coef <- list()

# fit a logistic regression on each tenth of the data and store its coefficients
for (k in 1:10) {
     x_part <- factor(x_disc_bad_idea[((k - 1) * nrow(x) / 10 + 1):(k / 10 * nrow(x))])
     y_part <- y[((k - 1) * length(y) / 10 + 1):(k / 10 * length(y))]
     modele_part <- glm(y_part ~ x_part, family = binomial(link = "logit"))
     liste_coef[[k]] <- modele_part$coefficients
}

estim_coef <- matrix(NA, nrow = nlevels(x_disc_bad_idea), ncol = 10)

# gather, for each factor level, the coefficient estimated on each subsample
for (i in 1:nlevels(x_disc_bad_idea)) {
     estim_coef[i, ] <- unlist(lapply(liste_coef, function(batch) batch[paste0("x_part", levels(factor(x_disc_bad_idea))[i])]))
}

stats_coef <- matrix(NA, nrow = nlevels(x_disc_bad_idea), ncol = 3)

# mean, standard deviation and number of missing estimates per level
for (i in 1:nlevels(x_disc_bad_idea)) {
     stats_coef[i, 1] <- mean(estim_coef[i, ], na.rm = TRUE)
     stats_coef[i, 2] <- sd(estim_coef[i, ], na.rm = TRUE)
     stats_coef[i, 3] <- sum(is.na(estim_coef[i, ]))
}

stats_coef <- stats_coef[-1, ]   # drop the reference level (no coefficient is estimated for it)
row.names(stats_coef) <- levels(x_disc_bad_idea)[2:nlevels(x_disc_bad_idea)]

plot(as.numeric(row.names(stats_coef)), stats_coef[, 1], ylab = "Estimated coefficient", xlab = "Factor value of x", ylim = c(-1, 8))
segments(as.numeric(row.names(stats_coef)), stats_coef[, 1] - stats_coef[, 2], as.numeric(row.names(stats_coef)), stats_coef[, 1] + stats_coef[, 2])
lines(as.numeric(row.names(stats_coef)), rep(0, length(row.names(stats_coef))), col = "red")

All intervals crossing $0$ correspond to non-significant coefficients! We should group factor values to get a stable estimation and (hopefully) significant coefficients.
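As a minimal, hand-made illustration of such grouping (glmdisc automates this step, as described below), merging the ten levels into two coarser groups already yields far fewer, more stable coefficients:

x_disc_grouped <- factor(ifelse(as.numeric(x_disc_bad_idea) <= 5, "low", "high"), levels = c("low", "high"))
modele_grouped <- glm(y ~ x_disc_grouped, family = binomial(link = "logit"))
summary(modele_grouped)$coefficients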

Discretization and grouping: theoretical background

Notations

Let $Q=(Q_j)_1^d$ be the latent discretized transform of $X$, i.e. $Q_j$ takes values in $\{1,\ldots,m_j\}$, where the number of values $m_j$ of each discretized covariate is also latent.

The fitted logistic regression model is now: $$\ln \left( \frac{p_\theta(Y=1|q)}{p_\theta(Y=0|q)} \right) = \theta_0 + \sum_{j=1}^d \sum_{k=1}^{m_j} \theta_j^k \, \mathbb{1}_{q_j=k}$$ Clearly, the number of parameters has grown, which allows for a flexible approximation of the true underlying model $p(Y|Q)$.

Best discretization?

Our goal is to obtain the model $p_\theta(Y|q)$ with the best predictive power. As $Q$ and $\theta$ are both optimized, a formal goodness-of-fit criterion could be: $$ (\hat{\theta},\hat{\bar{q}}) = \arg \min_{\theta,\bar{q}} \text{AIC}(p_\theta(\bar{y}|\bar{q})) $$ where AIC stands for the Akaike Information Criterion (lower is better).
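As a quick illustration using the two fits from the Linearity section (modele_lin and modele_disc, both already in memory), AIC values are directly comparable in R, the lower value indicating the better fit-complexity trade-off:

AIC(modele_lin)
AIC(modele_disc)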

Combinatorics

The problem seems well-posed: if we were able to generate all discretization schemes transforming $X$ into $Q$, learn $p_\theta(y|q)$ for each of them and compare their AIC values, the problem would be solved.

Unfortunately, there are way too many candidates to follow this procedure. Suppose we want to construct $k$ intervals of $Q_j$ given $n$ distinct values $(x_{i,j})_1^n$. There are $\binom{n}{k}$ such models. The true value of $k$ is unknown, so it must be looped over as well. Finally, as logistic regression is a multivariate model, the discretization of $Q_j$ can influence the discretization of $Q_k$, $k \neq j$.
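To give an order of magnitude (an illustrative figure, not taken from the package), for a single feature with $n = 1000$ distinct values and a fixed $k = 5$ intervals:

choose(1000, 5)   # about 8.25e12 candidate discretizations for one feature and one k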

As a consequence, existing approaches to discretization (in particular discretization of continuous attributes) rely on strong assumptions to simplify the search for good candidates, as can be seen in the review of Ramírez-Gallego, S. et al. (2016) - see the References section.

Discretization and grouping: estimation

Likelihood estimation

$Q$ can be introduced in $p(Y|X)$: $$\forall \: x,y, \; p(y|x) = \sum_q p(y|x,q)p(q|x)$$

First, we assume that all the information about $Y$ contained in $X$ is already contained in $Q$, so that: $$\forall \: x,y,q, \; p(y|x,q)=p(y|q)$$ Second, we assume the conditional independence of $Q_j$ given $X_j$, i.e. knowing $X_j$, the discretization $Q_j$ is independent of the other features $X_k$ and $Q_k$ for all $k \neq j$: $$\forall \:x, \, k\neq j, \; Q_j | x_j \perp Q_k | x_k$$ The first equation becomes: $$\forall \: x,y, \; p(y|x) = \sum_q p(y|q) \prod_{j=1}^d p(q_j|x_j)$$ As said earlier, we consider only logistic regression models $p_\theta(y|q)$ on the discretized data. Additionally, we have to make further assumptions on the nature of the relationship between $q_j$ and $x_j$. We choose polytomous logistic regressions for continuous $X_j$ and contingency tables for qualitative $X_j$. This is an arbitrary choice and future versions will include the possibility of plugging in your own model.

With these parametric models, the previous equation becomes: $$\forall \: x,y, \; p(y|x) = \sum_q p_\theta(y|q) \prod_{j=1}^d p_{\alpha_j}(q_j|x_j)$$

The SEM algorithm

It is still hard to optimize over $p(y|x;\theta,\alpha)$ as the number of candidate discretizations is gigantic as said earlier.

However, calculating $p(y,q|x)$ is easy: $$\forall \: x,y, \; p(y,q|x) = p_\theta(y|q) \prod_{j=1}^d p_{\alpha_j}(q_j|x_j)$$

As a consequence, we will draw random candidates $q$ approximately at the mode of the distribution $p(y,\cdot|x)$ using an SEM algorithm (see the References section).

Gibbs sampling

To update, at each random draw, the parameters $\theta$ and $\alpha$ and propose a new discretization $q$, we use the following equation: $$p(q_j|x_j,y,q_{-j}) \propto p_\theta(y|q) p_{\alpha_j}(q_j|x_j)$$ Note that we draw $q_j$ knowing all the other variables, in particular $q_{-j}$: this is a Gibbs sampler (see the References section). It is an MCMC method whose target distribution is $p(q|x,y)$.

Interaction discovery

What do "interactions" mean?

In a continuous setting, if there is an interaction between features $k$ and $l$ (with $k > l$), the logistic regression model becomes $\text{logit}[p_\theta(1|x)] = \theta_0 + \sum_{j=1}^d \theta_j x_j + \theta_{kl} x_k x_l$. It amounts to adding some flexibility to the model by introducing a non-linearity, as was done with discretization.

Our proposal

We consider that the features interacting in the logistic regression model are unknown. We denote by $\Delta$ the corresponding random matrix and by $\delta$ its observation, with $\delta_{kl} = 0$ for $k \leq l$ and, for $k > l$, $\delta_{kl} = 1$ if features $k$ and $l$ interact and $\delta_{kl} = 0$ otherwise.

Introducing $\Delta$ as a latent variable, as we did for discretization, we get $p(\delta|x,y) \propto p_\theta(y|x,\delta) = \text{logit}^{-1} [\theta_0 + \sum_{j=1}^d \theta_j x_j + \sum_{k>l} \delta_{kl} \theta_{kl} x_k x_l]$.

Again, a simple strategy would consist in estimating $\hat{\theta}_\delta$ for every value of $\Delta$, i.e. fitting $2^{\frac{d(d-1)}{2}}$ models, and comparing their AIC or BIC values.
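For instance, with only $d = 10$ features, exhausting all interaction patterns would already require fitting:

2^(10 * 9 / 2)   # 2^45, i.e. about 3.5e13 models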

As this approach is intractable, we propose instead an MCMC procedure to draw $\delta$ from $p(\delta|x,y)$.

An MCMC estimation procedure

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm (see the References section) is an MCMC procedure which relies on:

1. A proposal distribution $q_{kl}$: the conditional probability of proposing to change the current state of $\delta_{kl}$.
2. An acceptance distribution $a_{kl}$: the conditional probability of accepting the proposed change of $\delta_{kl}$.

The transition probability is the product of these terms: $p_{kl} = q_{kl}a_{kl}$.

Proposal

We must choose a proposal distribution $q_{kl}$. A good hint for whether there actually is an interaction between features $k$ and $l$ is to compare the bivariate logistic regression model with $x_k$ and $x_l$ to the logistic regression model containing $x_k$, $x_l$ and their interaction term. The comparison can be done via the BIC criterion, for example: $\text{hint}_{kl} \propto \exp(-\text{BIC}(\text{with interaction}) + \text{BIC}(\text{without interaction}))$.
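As a minimal sketch of this hint for a single (hypothetical) pair of continuous features, before any normalization over all pairs:

set.seed(1)
d_toy <- data.frame(x1 = runif(200), x2 = runif(200))
d_toy$y <- rbinom(200, 1, plogis(d_toy$x1 + d_toy$x2 - 2 * d_toy$x1 * d_toy$x2))
bic_without <- BIC(glm(y ~ x1 + x2, family = binomial, data = d_toy))
bic_with <- BIC(glm(y ~ x1 * x2, family = binomial, data = d_toy))
exp(bic_without - bic_with)   # large values hint at a useful interaction term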

Given the current state of $\Delta$, denoted by $\delta^{\text{current}}$, we want to propose changing the $\delta_{kl}$ entry if it is far from our "hint", i.e.: $q_{kl}=|\delta^{\text{current}}_{kl} - \text{hint}_{kl}|$.

We draw the pair $kl$ from $q_{kl}$ and subsequently obtain $\delta^{\text{new}}$, whose $kl$ entry is $1-\delta^{\text{current}}_{kl}$.

This gives $a_{kl} = \min \left( 1,\exp(-\text{BIC}(\delta^{\text{new}})+\text{BIC}(\delta^{\text{current}})) \frac{q_{kl}}{1-q_{kl}} \right)$.

This approach would work as a "stand-alone" model selection tool, but it can also be embedded in the discretization algorithm described in the previous section. This is what is implemented in the glmdisc package.

The glmdisc package

The glmdisc function

The glmdisc function implements the algorithm described in the previous sections. Its parameters are described first, then its internals are briefly discussed. We finally focus on its outputs.

Parameters

The number of iterations of the SEM algorithm is controlled through the iter parameter. It can be useful to first run the glmdisc function with a low iter value (10-50) to get a better idea of how long a full run will take.

The validation and test boolean parameters control whether the provided dataset should be divided into training, validation and/or test sets. The validation set is used to evaluate the quality of the model fit at each iteration, while the test set provides the quality measure of the final chosen model.

The criterion parameter lets the user choose between standard model selection statistics like aic and bic and the gini index performance measure (an affine transform of the more traditional AUC measure). Note that if validation=TRUE, there is no need to penalize the log-likelihood and aic and bic become equivalent. Conversely, if criterion="gini" and validation=FALSE, the algorithm may overfit the training data.

The m_start parameter controls the maximum number of categories of $Q_j$ for a continuous $X_j$. The SEM algorithm will start with a random $Q_j$ taking values in $\{1,\ldots,m_{\text{start}}\}$. For qualitative features $X_j$, $Q_j$ is initialized with as many values as $X_j$, so that m_start has no effect.

The reg_type parameter controls the model used to fit $Q_j$ to a continuous $X_j$. The default behavior (reg_type=poly) uses multinom from the nnet package, which is a polytomous logistic regression (see the References section). The other reg_type value is polr from the MASS package, which is an ordered logistic regression (simpler to estimate but less flexible as it fits far fewer parameters).

Empirical studies show that with a reasonably small training dataset ($\leq 100{,}000$ rows) and a small m_start parameter ($\leq 20$), approximately $500$ to $1500$ iterations are largely sufficient to obtain a satisfactory model $p_\theta(y|q)$.
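As an illustration (not run here), a call combining the parameters discussed above might look like the following, where x and y stand for your own design matrix and binary response:

# discretization <- glmdisc(x, y, iter = 500, m_start = 10,
#                           validation = TRUE, test = TRUE, criterion = "gini",
#                           reg_type = "poly", interact = TRUE)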

Internals

First, the discretized version $Q$ of $X$ is initialized at random, using the user-provided maximum number of values for quantitative features given through the m_start parameter.

Then we iterate iter times: first, the model $p_\theta(y|q)$, where $q$ denotes the current value of $Q$, is fitted. Then, for each feature $X_j$, the model $p_{\alpha_j}(q_j|x_j)$ is fitted (depending on the type of $X_j$, it is either a polytomous logistic regression or a contingency table). Finally, we draw a new candidate $q_j$ with probability proportional to $p_\theta(y|q)p_{\alpha_j}(q_j|x_j)$.
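To make this loop concrete, here is a heavily simplified, purely illustrative sketch for a single continuous feature; it mimics the three steps above but is not the actual glmdisc implementation:

library(nnet)   # for the polytomous logistic regression p_alpha(q|x)

set.seed(42)
n_toy <- 500
x_toy <- runif(n_toy)
y_toy <- rbinom(n_toy, 1, plogis(2 * (x_toy > 0.5) - 1))   # toy response
q_toy <- factor(sample(1:3, n_toy, replace = TRUE))        # random initial discretization

for (s in 1:5) {
     q_toy <- droplevels(q_toy)
     lev <- levels(q_toy)
     # 1. fit p_theta(y|q) on the current discretization
     fit_y <- glm(y_toy ~ q_toy, family = binomial(link = "logit"))
     # 2. fit p_alpha(q|x) with a polytomous logistic regression
     fit_q <- multinom(q_toy ~ x_toy, trace = FALSE)
     p_q_x <- predict(fit_q, type = "probs")
     if (is.null(dim(p_q_x))) p_q_x <- cbind(1 - p_q_x, p_q_x)   # two-level case
     colnames(p_q_x) <- lev
     # 3. unnormalized weights p_theta(y_i|q_i = k) * p_alpha(q_i = k|x_i) for each level k
     w <- sapply(lev, function(k) {
          p_y <- predict(fit_y, newdata = data.frame(q_toy = factor(k, levels = lev)), type = "response")
          dbinom(y_toy, 1, p_y) * p_q_x[, k]
     })
     # 4. draw the new discretization of each observation from these weights
     q_toy <- factor(apply(w, 1, function(p) sample(lev, 1, prob = p)), levels = lev)
}
table(q_toy, x_toy > 0.5)   # the sampled levels should roughly align with the true threshold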

The performance metric chosen through the criterion parameter determines the best candidate $q$.

Results

First we simulate a "true" underlying discrete model:

x = matrix(runif(300), nrow = 100, ncol = 3)
cuts = seq(0,1,length.out= 4)
xd = apply(x,2, function(col) as.numeric(cut(col,cuts)))
theta = t(matrix(c(0,0,0,2,2,2,-2,-2,-2),ncol=3,nrow=3))
log_odd = rowSums(t(sapply(seq_along(xd[,1]), function(row_id) sapply(seq_along(xd[row_id,]),
function(element) theta[xd[row_id,element],element]))))
y = rbinom(100,1,1/(1+exp(-log_odd)))

The glmdisc function will try to "recover" the hidden true discretization xd when provided only with x and y:

library(glmdisc)
set.seed(1)
discretization <- glmdisc(x,y,iter=50,m_start=5,test=FALSE,validation=FALSE,criterion="aic",interact=FALSE)

For now, we simulated a discrete model without interactions. We can verify that our proposed approach for interaction discovery selects a model without interactions.

all_formula <- list()

for (l in 1:10) {
     set.seed(l)
     x = matrix(runif(300), nrow = 100, ncol = 3)
     cuts = seq(0,1,length.out= 4)
     xd = apply(x,2, function(col) as.numeric(cut(col,cuts)))
     theta = t(matrix(c(0,0,0,2,2,2,-2,-2,-2),ncol=3,nrow=3))
     log_odd = rowSums(t(sapply(seq_along(xd[,1]), function(row_id) sapply(seq_along(xd[row_id,]),
     function(element) theta[xd[row_id,element],element]))))
     y = rbinom(100,1,1/(1+exp(-log_odd)))

     discretization <- glmdisc(x,y,iter=50,m_start=5,test=FALSE,validation=FALSE,criterion="aic",interact=TRUE)
     all_formula[[l]] <- discretization@best.disc$formulaOfBestLogisticRegression
}

#barplot(table(grepl(":",all_formula)),names.arg=c("No interaction","One interaction"),xlim = 2)

Conversely, we can simulate a discrete model with an interaction and report how many times the proposed algorithm detected the right interaction.

all_formula <- list()

for (l in 1:10) {
     set.seed(100+l)
     x = matrix(runif(3000), nrow = 1000, ncol = 3)
     cuts = seq(0,1,length.out= 4)
     xd = apply(x,2, function(col) as.numeric(cut(col,cuts)))
     theta = t(matrix(c(0,0,0,2,2,2,-2,-2,-2),ncol=3,nrow=3))
     log_odd = matrix(0,1000,1)
       for (i in 1:1000) {
          log_odd[i] = 2*(xd[i,1]==1)+
               (-2)*(xd[i,1]==2)+
               2*(xd[i,1]==1)*(xd[i,2]==1)+
               4*(xd[i,1]==2)*(xd[i,2]==2)+
               (-2)*(xd[i,2]==1)*(xd[i,3]==2)+
               (-4)*(xd[i,2]==2)*(xd[i,3]==1)+
               1*(xd[i,2]==3)*(xd[i,3]==1)+
               (-1)*(xd[i,1]==1)*(xd[i,3]==3)+
               3*(xd[i,1]==3)*(xd[i,3]==2)+
               (-1)*(xd[i,1]==2)*(xd[i,3]==3)
       }

     y = rbinom(1000,1,1/(1+exp(-log_odd)))

     discretization <- glmdisc(x,y,iter=50,m_start=5,test=FALSE,validation=FALSE,criterion="aic",interact=TRUE)
     all_formula[[l]] <- discretization@best.disc$formulaOfBestLogisticRegression
}

#barplot(table(grepl(":",all_formula)),names.arg=c("No interaction","One interaction"),xlim = 2)

The parameters slot

The parameters slot refers back to the user-provided parameters given to the glmdisc function. Consequently, users can compare the results of different glmdisc runs with respect to their parameters.

discretization@parameters

The best.disc slot

The best.disc slot contains all that is needed to reproduce the best discretization scheme obtained by the glmdisc function: the first element is the best logistic regression model $p_\theta(y|q)$; the second element is its associated "link" function, i.e. either the multinom or polr models fitted to each continuous feature, or the contingency tables fitted to each qualitative feature.

discretization@best.disc[[1]]

# The first link function is:
discretization@best.disc[[2]][[1]]

The performance slot

The performance slot contains both the performance achieved by the final model, as defined by the criterion, test and validation parameters, and the value of this performance metric over all the iterations.

discretization was fit with no test nor validation and we used the AIC criterion. The best AIC achieved is therefore:

discretization@performance[[1]]

The first 5 AIC values obtained on the training set over the iterations are:

discretization@performance[[2]][1:5]

The disc.data slot

The disc.data slot contains the discretized data obtained by applying the best discretization scheme found: to the test set if test=TRUE, to the validation set if validation=TRUE, or to the training set otherwise.

In our example, discretization has validation=FALSE and test=FALSE so that we get the discretized training set:

discretization@disc.data
knitr::kable(head(discretization@disc.data))

The cont.data slot

The cont.data slot contains the original "continuous" counterpart of the disc.data slot.

In our example, discretization has validation=FALSE and test=FALSE so that we get the "raw" training set:

discretization@cont.data
knitr::kable(head(discretization@cont.data))

How well did we do?

To compare the estimated and the true discretization schemes, we can represent them with respect to the input "raw" data x:

plot(x[,1],xd[,1])
plot(discretization@cont.data[,1],discretization@disc.data[,1])

"Classical" methods

A few classical R methods have been adapted to the glmdisc S4 object.

The print function is the standard print.summary.glm method applied to the best discretization scheme found by the glmdisc function:

print(discretization)

Its S4 equivalent show is also available:

show(discretization)

The discretize method

The discretize method allows the user to discretize a new "raw" input set by applying the best discretization scheme found by the glmdisc function in a provided glmdisc object.

x_new <- discretize(discretization,x)
knitr::kable(head(x_new))

The predict method

The predict method allows the user to predict the response for a new "raw" input set by first discretizing it using the previously described discretize method and a previously trained glmdisc object. It then predicts using the standard predict.glm method and the best discretization scheme found by the glmdisc function.

# pred_new <- predict(discretization, data.frame(x))
# knitr::kable(head(pred_new))

Future developments

Enhancing the discretization package

The discretization package available from CRAN implements some existing supervised univariate discretization methods (applicable only to continuous inputs): ChiMerge (chiM), Chi2 (chi2), Extended Chi2 (extendChi2), Modified Chi2 (modChi2), MDLP (mdlp), and CAIM, CACC and AMEVA (disc.Topdown).
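For reference, a typical call to one of these methods (ChiMerge) is sketched below; it is shown commented out since the discretization package is not a dependency of glmdisc, and the exact arguments should be checked against that package's documentation:

# library(discretization)
# chi_result <- chiM(data.frame(x, y), alpha = 0.05)   # class label must be the last column
# head(chi_result$Disc.data)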

To compare the results of such methods to the ones we propose, the discretization package will be enhanced to support missing values (by putting them in a separate class, for example) and qualitative features (through a Chi2 grouping method, for example).

Possibility of changing model assumptions

In the third section, we described two fundamental modelling hypotheses that were made: first, that the discretized features $Q$ capture all the information about $Y$ contained in $X$ (i.e. $p(y|x,q)=p(y|q)$); second, that the discretizations $Q_j$ are conditionally independent given $X_j$.

These hypotheses are "building blocks" that could be changed at the modeller's will: the discretization could be made to optimize other models.

References

Celeux, G., Chauveau, D. and Diebolt, J. (1995). On Stochastic Versions of the EM Algorithm. Research Report RR-2514, INRIA.

Agresti, A. (2002). Categorical Data Analysis. Second edition. Wiley.

Ramírez‐Gallego, S., García, S., Mouriño‐Talín, H., Martínez‐Rego, D., Bolón‐Canedo, V., Alonso‐Betanzos, A. and Herrera, F. (2016). Data discretization: taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1), 5-21.

Geman, S., & Geman, D. (1987). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision (pp. 564-584).

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97-109.


