In ammeir2/PSAT: Inference after Selection with Aggregate Testing

knitr::opts_chunk$set(echo = TRUE)
library(magrittr)

Installing Package

devtools::install_github("ammeir2/PSAT")

Inference for Normal Means

The PSAT pacakge provides functionalities for performing inference on multivariate normal vectors and generalized linear models that pass a qudratic aggregate tests. Suppose that we observe $y \sim N(\mu, \Sigma)$ and that we are interested in conducting inference on the individual coordinates of $\mu$ only if we can reject a quadratic test of the form:

$$ y^{T} K y > c > 0 $$

for some pre-determined threshold $c$. Then, we must account for the fact that we have selected the model via a data driven method in order to obtain valid p-values, confidence intervals that achieve the desired coverage rate, and efficient point estimates. We begin by generating a dataset by sampling multivariate normal vectors and screening with a Wald test, so $K = \Sigma^{-1}$.

# Tests for wald test ------------------------
set.seed(5)
p <- 8
sparsity <- 2
snr <- 3
mu <- rexp(sparsity) * (1 - 2 * rbinom(sparsity, 1, 0.5))
mu <- mu / sum(abs(mu)) * snr
mu <- c(mu, rep(0, p - sparsity))
# sigma <- cov2cor(clusterGeneration::genPositiveDefMat(p)$Sigma)
sigma <- matrix(0.3, nrow = p, ncol = p)
diag(sigma) <- 1
sqrtSig <- expm::sqrtm(sigma)
testmat <- solve(sigma)

pthreshold <- 0.001
threshold <- qchisq(pthreshold, p, lower.tail = FALSE)
testStat <- 0
invcov <- solve(sigma)
while(testStat < threshold) {
  z <- rnorm(p)
  y <- as.numeric(z %*% sqrtSig) + mu
  testStat <- as.numeric(t(y) %*% testmat %*% y)
}
print(round(y, 2))

Our threshold is r round(threshold, 1) and the realized test statistic is r round(testStat, 2), so we just barely passed the aggregate test. Now, let us use the mvnQuadratic function to conduct inference. The mvnQuadratic function can be used to compute several different types of confidence intervals and p-values. Generally speaking, the confidence interval methods global-null and switch are quite computationally intensive while the polyhedral method is very fast, yet somewhat less powerful when the signal is sparse. Similarly, the p-value methods hybrid and global-null are computationally intensive and the polyhedral method is not. In the following, we also demonstrate an option to provide the inference function with an optional set of contrasts to be tested/estimated as a matrix of dimension $l \times p$ where $l$ is the number of contrasts. If no such matrix is provided, the function will estimated the mean vector of $y$. Specifcally, we test the difference between the first and second coordinates of $y$, as well as the seventh and eighth coordinates.

library(PSAT)
diffCont <- matrix(0, nrow = 2, ncol = p)
diffCont[1, 1] <- 1
diffCont[1, 2] <- -1
diffCont[2, 7] <- 1
diffCont[2, 8] <- -1
contrasts <- rbind(diag(length(y)), diffCont)
control <- psatControl(trueHybrid = TRUE, 
                       rbIters = 5000, 
                       nSamples = 10^4, 
                       truncPmethod = "UMPU")
allfit <- mvnQuadratic(y, sigma, testMat = "wald", 
                       threshold = NULL, pval_threshold = pthreshold,
                       contrasts = contrasts,
                       estimate_type = c("mle", "naive"),
                       pvalue_type = c("hybrid", "polyhedral", "naive"),
                       ci_type = c("switch", "polyhedral", "naive"),
                       confidence_level = .95, verbose = TRUE,
                       control = control)

Notice that when calling mvnQuadratic, we listed several methods for confidence interval and p-value computations. The function will compute all of the requested quantities. A good way to examine the outputs of the function is to plot the estimates and confidence intervals, in the following function call we will provide the plotting function with the ground truth so we can see whether our confidence intervals have successfully identified the true signal.

trueMean <- as.numeric(contrasts %*% mu)
plot(allfit, type = "estimates", true = trueMean)

In the figure, notice that the naive confidence intervals miss one of the true signals. Also, notice that while the polyhedral confidence intervals tend to be narrower than the global-null or regime switching confidence intervals, they fail to detect that the first coordinate of the mean vector is different from zero. These are all typical behaviors of for these inference methods. Also, notice that the selection adjusted MLE has applied some shrinkage to the observed coordinates of the mean vector.

We can also plot the p-values with or without multiplicity adjustment.

noadjust <- plot(allfit, type = "p-values", threshold = c(0.1, 0.01),
                 adjust = "none")
bhadjust <- plot(allfit, type = "p-values", threshold = c(0.1, 0.01),
                 adjust = "BH")
gridExtra::grid.arrange(noadjust, bhadjust, nrow = 2)

Next, let us examine some ways to access the quantities we computed. The simplest way, is perhaps to access them directly through the returned object.

cat("polyhedral pvalues: ", round(allfit$polyPval, 3), "\n")
cat("global-null pvalues: ", round(allfit$nullPval, 3), "\n")
cat("hybrid pvalues: ", round(allfit$hybridPval, 3), "\n")

A second option is to use the getCI and getPval functions.

cat("polyhedral pvalues: ", round(getPval(allfit, type = "polyhedral"), 3), "\n")
print(round(getCI(allfit, type = "polyhedral"), 3))

The getCI function also allows to compute confidence intervals based on a different confidence level than the one used in the original computation

# Original confidence interval:
print(round(getCI(allfit, type = "switch"), 3))
# Some more conservative confidence intervals:
print(round(getCI(allfit, type = "switch", confidence_level = .99), 3))

The mvnQuadratic function can also be used to perform inference for quadratic tests different than the Wald test. In such cases, the conditional MLE cannot be computed in closed and we resort to using a stochastic gradient method. To verify that stochastic gradient method indeed converges to the neighborhood of the MLE, in the next chunk we will perform inference on the same dataset as before but instead of telling the function that we are using a Wald test we will pass the inverse of the covariance matrix as a general test matrix.

testmat <- solve(sigma)
sgdfit <- mvnQuadratic(y, sigma, testMat = testmat, 
                       threshold = NULL, pval_threshold = pthreshold,
                       estimate_type = c("mle", "naive"),
                       pvalue_type = c("polyhedral", "hybrid", "naive"),
                       ci_type = c("polyhedral", "switch", "naive"),
                       confidence_level = .95, verbose = FALSE,
                       control = psatControl(optimMethod = "SGD"))
nmfit <- mvnQuadratic(y, sigma, testMat = testmat, threshold = NULL, 
                      pval_threshold = pthreshold,
                      estimate_type = c("mle", "naive"),
                      pvalue_type = c("polyhedral", "hybrid", "naive"),
                      ci_type = c("polyhedral", "switch", "naive"),
                      confidence_level = .95, verbose = FALSE,
                      control = psatControl(optimMethod = "Nelder-Mead"))

Next, let us evaluate the solution paths of the SGD algorithm:

plot(sgdfit, type = "solution-path")

We can also compare the new result to the old one, as the two maximum likelihood estimates should be roughly the same.

plot(y, coef(sgdfit), col = "red")
points(y, coef(allfit), col = "black", pch = 2)
points(y, coef(nmfit), col = "blue", pch = 3)
abline(a = 0, b = 1)
abline(v = 0, h = 0, lty = 2)
legend("topleft", col = c("black", "red", "blue"), pch = 1:3, 
       legend = c("exact MLE", "SGD MLE", "Nelder-Mead"))

Inference for Generalized Linear Models

The PSAT pacakge can also be used to perform inference in generalized linear models that were selected via an aggregate test. These functionalities are based on the assumption that the distribution of the (naive) maximum likelihood estimator for the regression coefficients is approximately normal. That is: $$ \hat\beta \sim N(\beta, (XWX)^{-1}) $$

where the weight matrix $W$ may depend on the estimated regression coefficients $\hat\beta$ and the type of GLM used. Next, let us generate data from a logistic regression model using an arbitrary matrix $K$ as our test matrix.

# set.seed(1234)
pthreshold <- 0.001
n <- 500
p <- 20
xcov <- cov2cor(clusterGeneration::genPositiveDefMat(p)$Sigma)
X <- as.matrix(scale(mvtnorm::rmvnorm(n, sigma = xcov)))
X <- as.matrix(scale(X))
sparsity <- 3
snr <- 0.5
beta <- rnorm(sparsity)
beta <- beta / sum(abs(beta)) * snr
beta <- c(beta, rep(0, p - sparsity))
eta <- -0.5 + as.numeric(X %*% beta)
probs <- 1 / (1 + exp(-eta))
testmat <- cov2cor(clusterGeneration::genPositiveDefMat(p)$Sigma)
while(TRUE) {
  y <- rbinom(n, 1, probs)
  glmfit <- glm(y ~ X, family = "binomial")
  est <- coef(glmfit)[-1]
  sigma <- vcov(glmfit)[-1, , drop = FALSE][, -1, drop = FALSE]
  threshold <- getTestThreshold(pthreshold, sigma, testmat)
  testStat <- as.numeric(t(est) %*% testmat %*% est)
  if(testStat > threshold) break
}

Performing inference and plotting estimates:

diffCont <- matrix(0, nrow = 1, ncol = ncol(X))
diffCont[1, 1] <- 1
diffCont[1, 2] <- -1
contrasts <- rbind(diag(ncol(X)), diffCont)
rownames(contrasts) <- paste("V", 1:p, sep ="") %>% c("diff")

glmfit <- psatGLM(X, y, test = testmat, family = "binomial",
                  contrasts = contrasts,
                  resid_sd = c("naive", "null"),
                  threshold = NULL, pval_threshold = pthreshold,
                  estimate_type = c("mle", "naive"),
                  pvalue_type = c("hybrid", "polyhedral", "naive"),
                  ci_type = c("switch", "polyhedral", "naive"),
                  confidence_level = .95, verbose = TRUE, 
                  control = psatControl(trueHybrid = TRUE))

trueVal <- as.numeric(contrasts %*% beta)
plot(glmfit, type = "estimates", true = trueVal)
plot(glmfit, type = "p-values", adjust = "none", threshold = c(0.05, 0.05 / p))

Notice that the naive confidence intervals miss three of the true coefficients and that the switching regime confidence intervals manage to identify one of the coefficients as being non-zero. The psatGLM function also provides some standard regression interfaces such as the coef, predict and summary functions. When reporting confidence intervals, pvalues or summaries, the reported quantities will be the first listed in the original function call, unless specified otherwise.

print(round(rbind(coef(glmfit, type = "naive"), coef(glmfit, type = "mle"))[, 1:10], 3))
print(round(predict(glmfit)[1:10], 3))
summary(glmfit, pvalue_type = "hybrid", ci_type = "switch")