knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
Detect linear separation in binary classification problems using linear programming.
To cite the author of this method [1]:
The parameter estimates of a binary logistic regression model fit using the method of maximum likelihood sometimes do not converge to finite values. This phenomenon (also known as monotone likelihood or infinite parameters) occurs because of a condition among the sample points known as separation. There are two classes of separation.
When complete separation is present among the sample points, iterative procedures for maximizing the likelihood tend to break down, when it would be clear that there is a problem with the model. However, when quasicomplete separation is present among the sample points, the iterative procedures for maximizing the likelihood tend to satisfy their convergence criterion before revealing any indication of separation.
The package exports a single assertion function to test for separation. The linear program is solved using the ROI package that offeres a unified interface for a variety of solvers. A solver is a program (package) that can solve those linear programs.
I recommend using glpk
or lpSolve
as the original implementation. But you can also use your favorite commerical solver. However large linear programs
could still be difficult to solve. The complexity is probably mainly determined by the design matrix (i.e. #cols, #rows, sparsity).
~~You can install the released version of losep from~~ CRAN with:
install.packages("losep")
Or from Github:
remotes::install_github("dirkschumacher/losep")
losep
can help you detect separation before bad things happen.
As part of your analysis, just add assert_no_separation
after the model fit to test for separation.
... # make sure at least one solver is loaded that can handle linear programs library(ROI.plugin.glpk) ... my_model <- glm(formula, data, family = "binomial") assert_no_separation(my_model) # this uses a default solver (works in most cases) ...
Or with small toy example. In this case, for x=3
, the response is always 1.
library(losep) library(ROI.plugin.glpk) data <- data.frame( x = factor(c(1, 1, 1, 2, 2, 2, 3, 3)), y = c(1, 1, 0, 1, 1, 0, 1, 1) ) model <- glm(y ~ -1 + x, data = data, family = "binomial") # throws an error if the data is seperable # uses any compatible loaded solver tryCatch(assert_no_separation(model), error = print) # or solve it using GLPK with the option presolve try(assert_no_separation(model, solver = "glpk", presolve = TRUE))
According to Konis (2007): not a lot. The linear program involves no integer variables and should be feasible for fairly large models.
Here is a test with 10^6 observations. Fitting the glm takes longer than solving the LP with glpk.
library(losep) library(ROI.plugin.glpk) n <- 1e6 data <- tibble::tibble( x = as.integer(runif(n) < 0.5), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n), x7 = rnorm(n), y = x ) system.time( model <- glm(y ~ 1 + x + x2 + x3 + x4 + x5 + x6 + x7, data = data, family = "binomial") )
system.time( tryCatch(assert_no_separation(model), error = print) )
And with verbose output and an additional solver option (presolve):
system.time( try( assert_no_separation(model, solver = "glpk", presolve = TRUE, verbose = TRUE ) ) )
If you find any bugs or have further ideas, please let me know. Either by writing an issue, sending a PR or an email. Before you send a PR, please post an issue first.
losep = logistic regression, separation
1: Kjell Konis (2007). Linear programming algorithms for detecting separated data in binary logistic regression models. Ph. D. thesis, University of Oxford.
The author of the method implemented the algorithm in the very useful package safeBinaryRegression.
The package overloads the glm
function and adds a test for separation to it.
This package aims at decoupling the test from glm
and potentially to use it with other inputs as well.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.