Detecting separation and infinite estimates in log binomial regression
In detectseparation: Detect and Check for Separation and Infinite Maximum Likelihood Estimates

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 6
)

Introduction

In contrast to logistic regression in log-binomial regression, separation is not the only factor which decides if the maximum likelihood estimate (MLE) has infinite components. As we will see in the example below, for the log-binomial regression model (LBRM) there exist data configurations which are separated but the MLE still exists.

The MLE of the LBRM can be obtained by solving the following optimization problem. $$ \begin{equation} \begin{array}{rl} \underset{\beta}{\textrm{maximize}} & \ell(\beta) = \displaystyle\sum_{i = 1}^n y_i ~ X_{i} \beta + (1 - y_i) ~ \log(1 - \exp(X_{i} \beta)) \ \ \ \ \textrm{s.t.} & X \beta \leq 0. \end{array} \end{equation} $$

From the optimization problem we can already guess that the different behavior with regard to the existence of the MLE is caused by the linear constraint $X \beta \leq 0$.

Let $X^0$ be the submatrix of $X$ obtained by keeping only the rows $I^0 = {i|y_i = 0}$ and $X^1$ the submatrix obtained by keeping only the rows $I^1 = {i|y_i = 1}$. @schwendinger+gruen+hornik:2021 pointed out that the finiteness of the MLE can be checked by solving the following linear optimization problem.

$$ \begin{equation} \begin{array}{rll} \underset{\beta}{\text{maximize}}~~ & - \sum_{i \in I^0} X_{i*} \beta \ \text{subject to}~~ & X^0 \beta \leq 0 \ & X^1 \beta = 0. \end{array} \end{equation} $$ The MLE has only finite components if the solution of this linear program is a zero vector. If the MLE contains infinite components, the linear programming problem is unbounded. The function detect_infinite_estimates() from the detectseparation implements the LP problem described above and can therefore be used to detect infinite components in the MLE of the LBRM.

library("detectseparation")

Example

To show the different effect of separation on the logistic regression model compared to the LBRM consider the following data.

data <- data.frame(a = c(1, 0, 3, 2, 3, 4),
                   b = c(2, 1, 1, 4, 6, 8),
                   y = c(0, 0, 0, 1, 1, 1))

Detect separation

Clearly the data is separated, which can be verified by using the detect_separation method (a detailed explanation of the output can be found in Section 3).

glm(y ~ a + b, data = data, family = binomial("logit"), method = "detect_separation")

Since separation is a property of the data, checking for separation gives the same result for the logistic regression and the LBRM.

glm(y ~ a + b, data = data, family = binomial("log"), method = "detect_separation")

Detect infinite estimates

For logistic regression separation is necessary and sufficient that the MLE contains infinite components.

glm(y ~ a + b, data = data, family = binomial("logit"), method = "detect_infinite_estimates")

However, due to the linear constraint of the LBRM, there exists data allocations where the MLE does exist (i.e., has only finite components), despite the fact that the data is separated.

glm(y ~ a + b, data = data, family = binomial("log"), method = "detect_infinite_estimates")

Fitting the LBRM

Using glm to solve this problem we get the following error message.

fit <- try(glm(y ~ a + b, data = data, family = binomial("log")))

The error message means that we should provide starting values, a simple but reliable approach is to use $(-1, 0, ..., 0)$ as starting value.

Since in this example one of the constraints $X_{i*} \beta \leq 0$ is by design binding, the iteratively re-weighted least squares (IRLS) method used by the glm function has convergence problems. The glm function informs us about the convergence problems by issuing some warnings. The warnings

#> Warning: step size truncated: out of bounds
#> Warning: glm.fit: algorithm stopped at boundary value

tell us that IRLS is not the best option for optimization problems with binding constraints. The warning

#> Warning: glm.fit: algorithm did not converge

tell us that the algorithm did not converge. Practically in most cases this just means the default value for the maximum number of iterations should be increased.

args(glm.control)

Since the default value for the maximum number of iterations is quite low (maxit = 25). However, maxit = 25 is typically high enough for unbounded optimization problems (almost all models supported by glm), but is often to low for the LBRM. For most data sets setting maxit = 10000 when estimating LBRMs is high enough.

formula <- y ~ a + b
start <- c(-1, double(ncol(model.matrix(formula, data = data)) - 1L))
ctrl = glm.control(epsilon = 1e-8, maxit = 10000, trace = FALSE)
suppressWarnings(
  fit <- glm(formula, data = data, family = binomial("log"), start = start, control = ctrl)
)
summary(fit)

We can verify that one of the constraints $X_{i} \beta \leq 0$ is binding (i.e., $X_{i} \beta = 0$ for at least one $i$) by multiplying the coefficients with the model matrix.

print(mm <- drop(model.matrix(formula, data) %*% coef(fit)))
abs(drop(mm)) < 1e-6

Details on the output{#output_details}

Detect separation

glm(y ~ a + b, data = data, family = binomial("logit"), method = "detect_separation")

The output above provides much information:

The output shows, that for the verification oft the separation the linear optimization solver lpsolve was used.
The line Separation: TRUE indicates that the data is separated.
The coefficients at Inf and -Inf indicate that the underlying optimization problem is unbounded and therefore the MLE does not exist for the logistic regression model.

Detect infinite estimates

glm(y ~ a + b, data = data, family = binomial("log"), method = "detect_infinite_estimates")

The output above provides much information:

The output shows, that for the verification oft the separation the linear optimization solver lpsolve was used.
The line Infinite estimates: FALSE indicates that the MLE has only finite components (the MLE exists).
The coefficients are all 0 which again indicates that the MLE has only finite components and therefore underlying optimization problem is bounded.

Choosing starting values

For the LBRM the glm function often requires the users to specify starting values. Valid starting values have to reside in the interior of the feasible region ($X \beta < 0$). There have been different methods suggested for finding valid starting values.

Simple approach

If an intercept is present in the estimation (-1, 0, ..., 0) will always provide valid starting values.

find_start_simple <- function(formula, data) {
  c(-1, double(ncol(model.matrix(formula, data = data)) - 1L))  
}

find_start_simple(formula, data)
max(model.matrix(formula, data = data) %*% find_start_simple(formula, data))

Hot start via Poisson model

@andrade+andrade2018 suggest a hot start method, by using the modified estimation result of a Poisson model with log link as starting values. Again this method only works if an intercept is used.

find_start_poisson <- function(formula, data, delta = 1) {
  b0 <- coef(glm(formula, data, family = poisson(link = "log")))
  mX <- -model.matrix(formula, data = data)[, -1L, drop = FALSE]
  b0[1] <- min(mX %*% b0[-1]) - delta
  b0
}

find_start_poisson(formula, data)
max(model.matrix(formula, data = data) %*% find_start_poisson(formula, data))

Recommendation

One can also solve an LP to find valid start values or think of other strategies. However, for the benchmark examples reported in @schwendinger+gruen+hornik:2021 we found no conclusive evidence that one of these initialization methods outperforms the others. Therefore, my personal favorite is the simple approach (-1, 0, ..., 0).

References

Any scripts or data that you put into this service are public.

detectseparation documentation built on Aug. 27, 2022, 1:08 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

detectseparation
Detect and Check for Separation and Infinite Maximum Likelihood Estimates

Detecting separation and infinite estimates in log binomial regression
In detectseparation: Detect and Check for Separation and Infinite Maximum Likelihood Estimates

Introduction

Example

Detect separation

Detect infinite estimates

Fitting the LBRM

Details on the output{#output_details}

Detect separation

Detect infinite estimates

Choosing starting values

Simple approach

Hot start via Poisson model

Recommendation

References

Try the detectseparation package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

detectseparation Detect and Check for Separation and Infinite Maximum Likelihood Estimates

Detecting separation and infinite estimates in log binomial regression In detectseparation: Detect and Check for Separation and Infinite Maximum Likelihood Estimates

Introduction

Example

Detect separation

Detect infinite estimates

Fitting the LBRM

Details on the output{#output_details}

Detect separation

Detect infinite estimates

Choosing starting values

Simple approach

Hot start via Poisson model

Recommendation

References

Try the detectseparation package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

detectseparation
Detect and Check for Separation and Infinite Maximum Likelihood Estimates

Detecting separation and infinite estimates in log binomial regression
In detectseparation: Detect and Check for Separation and Infinite Maximum Likelihood Estimates