```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  fig.width = 5,
  fig.height = 4
)
library(ggplot2)
library(purrr)
par(mar = c(3, 3, 2, 1), mgp = c(1.5, 0.5, 0))
options(max.print = 20, digits = 3)
```
In this vignette we show how to use functions in the `glmhd` package to estimate the bias and variance of high-dimensional logistic MLE (maximum likelihood estimates). "High dimension" refers to the setting where the number of observations $n$ and the number of variables $p$ are both large.

As a recurring example, we consider a logistic regression model.
```r
# Sample from a logistic model
#   n      - number of observations
#   beta   - coefficient vector
#   R      - Cholesky factor of the covariance matrix (Sigma = R^t R) of the variables
#   adjust - if TRUE, computes the adjusted p-value for H0: beta_1 = 0
sample_logistic <- function(n, beta, R, adjust = FALSE) {
  p <- length(beta) # number of variables
  # Generate data matrix
  X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
  # Sample response
  Y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta)))
  # Logistic regression
  fit <- glm(Y ~ X + 0, family = binomial, x = TRUE, y = TRUE)
  if (!adjust) {
    # Return the p-th fitted coefficient and its standard error estimate
    c(fit$coef[p], summary(fit)$coef[p, 2])
  } else {
    # Return the adjusted p-value for the first coordinate
    adjusted_fit <- adjust_glm(fit, verbose = FALSE, echo = TRUE)
    summary(adjusted_fit)$coef[1, 4]
  }
}
```
We pick $n = 1000$ and $p = 200$, and draw the variables from a multivariate Gaussian with an AR(1) covariance with $\rho = 0.5$. For simplicity, we fix the first half of the variables to be null and give the second half fixed coefficients $\beta_j = 4$.
```r
n <- 1000; p <- 200
R <- chol(toeplitz(0.5^(0:(p - 1))))
beta <- rep(c(0, 4), each = p / 2)
```
We call this function $B = 200$ times, and draw a histogram of the MLE of the $p$th coefficient.
```r
beta_hat <- replicate(B <- 200, sample_logistic(n, beta, R))
```
```r
par(mar = c(3, 3, 3, 1), mgp = c(1.5, 0.5, 0))
# Plot fitted coefficients
hist(beta_hat[1, ], breaks = 15, freq = FALSE, main = "",
     xlab = "Estimated coefficients", cex.lab = 1.5)
mtext(expression(paste("Histogram of ", hat(beta)[p])), line = 1, cex = 1.5)
mtext(expression(paste(beta[p], " = 4 (black line)")), line = 0)
# Theoretical density curve of beta_hat
lines(t <- seq(-10, 25, by = 0.01), dnorm(t, 4 * 1.49, 5.37),
      col = "blue", lwd = 1.5)
abline(v = 4, col = "black", lwd = 2)
```
As we can see, the estimated $\hat{\beta}$ is not centered at the true value $\beta = 4$ (black line). The average $\hat{\beta}$ is at `r mean(beta_hat[1, ])`, which is larger. If we look at a single simulation below, the estimated standard error of $\hat{\beta}_p$ is `r summary(fit)$coef[p, 2]`, which is smaller than the observed standard deviation `r sd(beta_hat[1, ])` from the independent samples.
```r
X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
Y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta)))
fit <- glm(Y ~ X + 0, family = binomial)
summary(fit)$coef[p, 2] # estimated std. from inverse Fisher information
```
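To make these comparisons concrete, the observed inflation and spread can be read off the simulation directly (recall that `sample_logistic` returns the coefficient in the first row of `beta_hat` and its classical standard error estimate in the second):

```r
# Observed inflation and spread across the B repetitions
mean(beta_hat[1, ]) / 4 # average MLE divided by the true coefficient
sd(beta_hat[1, ])       # observed standard deviation of the MLE
mean(beta_hat[2, ])     # average classical std. error estimate
```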
In general, as $n, p \to \infty$ at a fixed ratio $p/n \to \kappa \in (0, 1)$, and if the variables $X$ are multivariate Gaussian, the MLE of any coordinate $\beta_j$ satisfies
$$
\frac{\hat{\beta}_j^{\mathrm{MLE}} - \alpha_\star \beta_j}{\sigma_\star / \tau_j} \stackrel{d}{\longrightarrow} \mathcal{N}(0, 1),
$$
where $\tau^2_j = \mathrm{Var}(x_j\,|\,x_{-j})$ is the conditional variance of the $j$th variable given all the others [1]. The parameters $(\alpha_\star, \sigma_\star)$ can be computed given $\kappa = p/n$ and the signal strength $\gamma = \sqrt{\mathrm{Var}(X^\top \beta)}$. In this example,
$$
\kappa = 0.2, \quad \gamma = 2.18,
$$
and the corresponding $(\alpha_\star, \sigma_\star / \tau_p) = (1.49, 5.37)$. We overlay the theoretical density curve of $\hat{\beta}_p$ on the histogram above (blue curve); it agrees well with the observed histogram overall, except for a few large values of $\hat{\beta}_p$ in the simulations. In practice, the parameter $\gamma$ is typically unknown, so in the next section we will use the `glmhd` package to estimate it.
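For reference, these theoretical constants can be reproduced with the package's `find_param` function, which reappears in the intercept example below. A minimal sketch, assuming `find_param` accepts `beta0 = 0` for a model without an intercept and that, as in that later example, `params[1]` is $\alpha_\star$ and `params[3]` is $\sqrt{\kappa}\,\sigma_\star$:

```r
library(glmhd) # loaded formally in the next section

kappa <- p / n                         # 0.2
gamma <- sqrt(sum((R %*% beta)^2) / n) # signal strength, about 2.18
params <- find_param(kappa = kappa, gamma = gamma, beta0 = 0, verbose = FALSE)
params[1]                              # alpha_star, about 1.49
# tau_p^2 = Var(x_p | x_{-p}) = 1 / (Sigma^{-1})_{pp} for a Gaussian design,
# where Sigma = R^t R / n is the covariance of one row of X
Sigma <- crossprod(R) / n
tau_p <- 1 / sqrt(solve(Sigma)[p, p])
params[3] / sqrt(kappa) / tau_p        # sigma_star / tau_p, about 5.37
```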
The goal of the `glmhd` package is to estimate the inflation and standard deviation of the MLE more accurately than the classical theory. It uses the ProbeFrontier algorithm [2] to estimate the signal strength parameter $\gamma$, and then estimates the inflation and standard deviation of the MLE. Now let's load the library.
```r
library(glmhd)
```
We repeat the previous example and compute the MLE. Note that we specify `x = TRUE` and `y = TRUE` here so that the original data matrix and response are returned in `fit`.
```r
# Fit the MLE again
fit <- glm(Y ~ X + 0, family = binomial, x = TRUE, y = TRUE)
```
Now we call the function `adjust_glm` to compute the adjusted MLE and standard error. The function takes three arguments, but you only need to supply the first one, the fitted `glm` object. Below, the second and third arguments are set to their default values: `verbose` specifies whether progress should be printed, and `echo = TRUE` means that the input `glm` object is also returned in the output. Setting `verbose = TRUE` helps to track the progress of the code and to identify issues if the algorithm fails.
```r
adjusted_fit <- adjust_glm(fit, verbose = FALSE, echo = TRUE)
```
You can access the estimated parameters through `adjusted_fit$param`. In this vector, `adjusted_fit$param["alpha_s"]` is the estimated $\alpha_\star$ and `adjusted_fit$param["sigma_s"]` is the estimated $\sqrt{\kappa}\,\sigma_\star$. For example, you can compare the estimated inflation `r adjusted_fit$param["alpha_s"]` with the observed value `r mean(beta_hat[1, ]) / 4`.
adjusted_fit$param["alpha_s"] # Estimated inflation adjusted_fit$param["sigma_s"] / sqrt(p/n) # estimated sigma_star
You can obtain the adjusted MLE $\hat{\beta}_p^{\mathrm{Adj}} = \hat{\beta}_p^{\mathrm{MLE}} / \alpha_\star$ and the adjusted standard error estimate $\sigma_\star / \tau_j$ as follows.
```r
adjusted_fit$coef_adj[p] # Adjusted MLE
adjusted_fit$std_adj[p]  # Adjusted std.
```
Next, `summary(adjusted_fit)` prints the coefficient table. The first column is the adjusted MLE, and the second column is $\hat{\sigma}_j / \alpha_\star$, because the adjusted MLE satisfies

$$
\frac{\hat{\beta}_j^{\mathrm{Adj}} - \beta_j}{\hat{\sigma}_j / \alpha_\star} \approx \mathcal{N}(0, 1),
$$

where $\hat{\sigma}_j = \sigma_\star / \tau_j$ is the adjusted std. estimate.
```r
print(summary(adjusted_fit), max = 40)
```
The `summary` function prints the adjusted MLE $\hat{\beta}^{\mathrm{Adj}}$ and its std. $\hat{\sigma} / \alpha_\star$, as well as a $z$-statistic $z = \hat{\beta}^{\mathrm{MLE}} / \hat{\sigma}$ and a two-sided p-value $\mathrm{P}(|\mathcal{N}(0,1)| \geq |z|)$. If the variables are multivariate Gaussian and the true model is logistic, then this p-value follows a Uniform[0,1] distribution under the null hypothesis that $\beta_j = 0$.
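One direct use of this normal approximation is an approximate confidence interval. A minimal sketch for $\beta_p$, using only quantities already stored in `adjusted_fit`:

```r
# Approximate 95% CI for beta_p implied by the normal approximation above:
# adjusted MLE +/- z_{0.975} * sigma_hat / alpha_star
adjusted_fit$coef_adj[p] +
  c(-1, 1) * qnorm(0.975) * adjusted_fit$std_adj[p] / adjusted_fit$param["alpha_s"]
```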
The following simulation examines whether the computed p-value is approximately uniform. For simplicity, we consider the first variable, which is null; the sampling function defined at the beginning returns this adjusted p-value when `adjust = TRUE`.
```r
# pval_adj <- replicate(B <- 100, sample_logistic(n, beta, R, adjust = TRUE))
```
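If you do run the simulation above (it refits and adjusts the model $B$ times, so it takes a while), a quick visual check of uniformity is a histogram or a uniform quantile plot of `pval_adj`; a minimal sketch, commented out like the simulation itself:

```r
# Sketch: visual check that the adjusted p-values are roughly Uniform[0, 1]
# (assumes pval_adj has been computed by the simulation above)
# hist(pval_adj, breaks = 20, freq = FALSE, main = "", xlab = "Adjusted p-value")
# abline(h = 1, col = "blue", lty = 2) # Uniform[0, 1] density
# qqplot(qunif(ppoints(length(pval_adj))), pval_adj,
#        xlab = "Uniform quantiles", ylab = "Observed p-values")
# abline(0, 1, col = "red")
```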
You can use the likelihood ratio test (LRT) by calling the function `lrt_glm`. It requires two arguments: the first is a list of `glm` fits and the second is the `param` output from `adjust_glm`. It prints the likelihood ratio statistics and p-values computed according to the rescaled chi-squared distribution. Under the null hypothesis
$$
H_0:\quad \beta_1 = \ldots = \beta_k = 0,
$$
two times the likelihood ratio statistic $\Lambda$ is asymptotically a rescaled chi-squared variable with $k$ degrees of freedom:
$$
2 \Lambda \stackrel{d}{\longrightarrow} \frac{\kappa_\star \sigma_\star^2}{\lambda_\star} \chi^2_k.
$$
```r
f1 <- fit
f2 <- glm(Y ~ X[, -1] + 0, family = binomial, x = TRUE, y = TRUE)
# P-values for the LRT
lrt <- lrt_glm(list(f1, f2), param = adjusted_fit$param)
print(lrt)
```
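The same call pattern applies when testing several coefficients at once. A minimal sketch for $k = 2$, dropping the first two (null) variables; the object names `f3` and `lrt2` are ours:

```r
# Sketch: LRT of H0: beta_1 = beta_2 = 0 (k = 2)
f3 <- glm(Y ~ X[, -(1:2)] + 0, family = binomial, x = TRUE, y = TRUE)
lrt2 <- lrt_glm(list(f1, f3), param = adjusted_fit$param)
print(lrt2)
```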
The previous example concerns a logistic model without an intercept; you can use the same function call for a model with an intercept. This is a separate section because the theoretical justification for this algorithm is still under study. The algorithm implemented in this package is based on the following conjecture:

$$
\frac{\hat{\beta}_j^{\mathrm{MLE}} - \alpha_\star \beta_j}{\sigma_\star / \tau_j} \stackrel{d}{\longrightarrow} \mathcal{N}(0, 1),
$$

and the MLE of the intercept converges to some fixed quantity

$$
\hat{\beta}_0 \stackrel{p}{\longrightarrow} b_\star.
$$

The adjusted MLE is $\hat{\beta}^{\mathrm{Adj}} = \hat{\beta} / \alpha_\star$ and the adjusted std. is $\sigma_\star / \tau_j$.
Consider an example where the intercept term is $\beta_0 = -0.5$. In this example, we can compute $(\alpha_\star, \sigma_\star)$ (they depend on $\kappa = p/n$, $\gamma$, and the true intercept $\beta_0$) and compare them with the estimates from independent samples. Notice that the input argument `beta0` of the `find_param` function is the absolute value of $\beta_0$.
```r
gamma <- sqrt(sum((R %*% beta)^2 / n)) # signal strength
params <- find_param(kappa = p / n, gamma = gamma, beta0 = 0.5, verbose = FALSE)
params[1]               # true alpha_star
params[3] / sqrt(p / n) # true sigma_star
```
```r
sample_logistic_intercept <- function(n, beta, R) {
  p <- length(beta)
  X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
  Y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta + 0.5)))
  fit <- glm(Y ~ X, family = binomial)
  # Returns the p-th fitted coefficient and its standard error
  c(fit$coef[p + 1], summary(fit)$coef[p + 1, 2])
}
```
```r
beta_hat <- replicate(B, sample_logistic_intercept(n, beta, R))
```
```r
par(mar = c(3, 3, 3, 1), mgp = c(1.5, 0.5, 0))
hist(beta_hat[1, ], breaks = 15, freq = FALSE, main = "",
     xlab = "Estimated coefficients", cex.lab = 1.5)
mtext(expression(paste("Histogram of ", hat(beta)[p])), line = 1, cex = 1.5)
mtext(expression(paste(beta[p], " = 4 (black line)")), line = 0)
# Theoretical density curve of beta_hat
lines(t <- seq(-10, 25, by = 0.01), dnorm(t, 4 * 1.5, 5.47),
      col = "blue", lwd = 1.5)
abline(v = 4, col = "black", lwd = 2)
```
The observed inflation in 200 samples is `r mean(beta_hat[1, ]) / 4` and the observed standard deviation is `r sd(beta_hat[1, ])`. The theoretical density curve of $\hat{\beta}_p$ is shown in blue, and it agrees well with the histogram of $\hat{\beta}_p$.
You can use the same function `adjust_glm` to compute the adjusted MLE $\hat{\beta}_j^{\mathrm{Adj}} = \hat{\beta}_j^{\mathrm{MLE}} / \alpha_\star$ and the adjusted standard error $\sigma_\star / \tau_j$ without knowing the true parameters.
```r
X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
Y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta + 0.5)))
fit <- glm(Y ~ X, family = binomial, x = TRUE, y = TRUE)
```
```r
# Adjust MLE coefficients
adjusted_fit <- adjust_glm(fit, verbose = FALSE)
```
`adjusted_fit$intercept` returns the estimated intercept, which you can compare with the MLE `r fit$coef[1]`.
```r
adjusted_fit$intercept
```
You can also find the estimated inflation and standard deviation as before.
```r
# Estimated inflation and std.
adjusted_fit$param["alpha_s"]
adjusted_fit$param["sigma_s"] / sqrt(p / n)
# Adjusted MLE
adjusted_fit$coef_adj[p]
# Adjusted std.
adjusted_fit$std_adj[p]
```
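As a quick sanity check, you can set the estimated inflation next to the theoretical value computed earlier with `find_param`; a minimal sketch reusing `params` from above (the display names are ours):

```r
# Estimated vs. theoretical inflation
c(estimated = unname(adjusted_fit$param["alpha_s"]),
  true      = unname(params[1]))
```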
If you are working with a probit model, you can use the `glm` function with `family = binomial(link = "probit")` and supply the `glm` object as input to `adjust_glm`. It automatically identifies the link function from the `family` object.
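For concreteness, here is a minimal probit sketch under the same simulated design; the object names are ours, and the snippet simply mirrors the logistic workflow above with a probit link:

```r
# Sketch: adjusting a probit MLE, reusing the design matrix X and coefficients beta
Y_probit <- rbinom(n, 1, pnorm(X %*% beta))
fit_probit <- glm(Y_probit ~ X + 0, family = binomial(link = "probit"),
                  x = TRUE, y = TRUE)
adjusted_probit <- adjust_glm(fit_probit, verbose = FALSE)
adjusted_probit$param["alpha_s"] # estimated inflation
```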
[1] Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. *Proceedings of the National Academy of Sciences*, 116(29):14516-14525, 2019.

[2] Qian Zhao, Pragya Sur and Emmanuel J. Candès. The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. arXiv:2001.09351.