Detect separation in binary classification problems using linear programming.
To cite the author of this method [1]:
The parameter estimates of a binary logistic regression model fit using the method of maximum likelihood sometimes do not converge to finite values. This phenomenon (also known as monotone likelihood or infinite parameters) occurs because of a condition among the sample points known as separation. There are two classes of separation. When complete separation is present among the sample points, iterative procedures for maximizing the likelihood tend to break down, when it would be clear that there is a problem with the model. However, when quasicomplete separation is present among the sample points, the iterative procedures for maximizing the likelihood tend to satisfy their convergence criterion before revealing any indication of separation.
The package exports a single assertion function to test for separation. The linear program is solved using the ROI package that offeres a unified interface for a variety of solvers. A solver is a program (package) that can solve those linear programs.
I recommend using glpk
or lpSolve
as the original implementation.
But you can also use your favorite commerical solver. However large
linear programs could still be difficult to solve. The complexity is
probably mainly determined by the design matrix (i.e. #cols, #rows,
sparsity).
~~You can install the released version of losep from~~ CRAN with:
install.packages("losep")
Or from Github:
remotes::install_github("dirkschumacher/losep")
losep
can help you detect separation before bad things happen.
As part of your analysis, just add assert_no_separation
after the
model fit to test for separation.
...
# make sure at least one solver is loaded that can handle linear programs
library(ROI.plugin.glpk)
...
my_model <- glm(formula, data, family = "binomial")
assert_no_separation(my_model) # this uses a default solver (works in most cases)
...
Or with small toy example. In this case, for x=3
, the response is
always 1.
library(losep)
library(ROI.plugin.glpk)
data <- data.frame(
x = factor(c(1, 1, 1, 2, 2, 2, 3, 3)),
y = c(1, 1, 0, 1, 1, 0, 1, 1)
)
model <- glm(y ~ -1 + x, data = data, family = "binomial")
# throws an error if the data is seperable
# uses any compatible loaded solver
tryCatch(assert_no_separation(model), error = print)
#> <simpleError: Separation detected in your model in the following variables:
#> x3>
# or solve it using GLPK with the option presolve
try(assert_no_separation(model, solver = "glpk", presolve = TRUE))
According to Konis (2007): not a lot. The linear program involves no integer variables and should be feasible for fairly large models.
Here is a test with 10^6 observations. Fitting the glm takes longer than solving the LP with glpk.
library(losep)
library(ROI.plugin.glpk)
n <- 1e6
data <- tibble::tibble(
x = as.integer(runif(n) < 0.5),
x2 = rnorm(n),
x3 = rnorm(n),
x4 = rnorm(n),
x5 = rnorm(n),
x6 = rnorm(n),
x7 = rnorm(n),
y = x
)
system.time(
model <- glm(y ~ 1 + x + x2 + x3 + x4 + x5 + x6 + x7,
data = data,
family = "binomial")
)
#> Warning: glm.fit: algorithm did not converge
#> user system elapsed
#> 16.607 4.913 25.590
system.time(
tryCatch(assert_no_separation(model), error = print)
)
#> <simpleError: Separation detected in your model in the following variables:
#> x>
#> user system elapsed
#> 5.895 1.427 9.920
And with verbose output and an additional solver option (presolve):
system.time(
try(
assert_no_separation(model,
solver = "glpk",
presolve = TRUE,
verbose = TRUE
)
)
)
#> <SOLVER MSG> ----
#> GLPK Simplex Optimizer, v4.63
#> 1000000 rows, 8 columns, 7500907 non-zeros
#> Preprocessing...
#> 1000000 rows, 8 columns, 7500907 non-zeros
#> Scaling...
#> A: min|aij| = 1.918e-07 max|aij| = 5.890e+00 ratio = 3.072e+07
#> GM: min|aij| = 5.494e-04 max|aij| = 1.820e+03 ratio = 3.313e+06
#> EQ: min|aij| = 3.300e-07 max|aij| = 1.000e+00 ratio = 3.031e+06
#> Constructing initial basis...
#> Size of triangular part is 1000000
#> 0: obj = -5.043174664e+05 inf = 1.100e+06 (567963)
#> 10: obj = 9.505182388e-11 inf = 8.539e-10 (0)
#> * 22: obj = 5.009070000e+05 inf = 3.358e-10 (0)
#> OPTIMAL LP SOLUTION FOUND
#> <!SOLVER MSG> ----
#> user system elapsed
#> 9.213 1.815 11.524
This is the very first version. If you find any bugs or have further ideas, please let me know. Either by writing an issue, sending a PR or an email. Before you send a PR, please post an issue first.
losep = logistic regression, separation
I am not good at naming things.
1: Konis, K. (2007). Linear programming algorithms for detecting separated data in binary logistic regression models. Ph. D. thesis, University of Oxford.
The author of the method implemented the algorithm in the very useful
package
safeBinaryRegression.
The package overloads the glm
function and adds a test for separation
to it. This package aims at decoupling the test from glm
and
potentially use it with other inputs as well.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.