knitr::opts_chunk$set(echo = TRUE)
library(magrittr)

Installing Package

devtools::install_github("ammeir2/PSAT")

Inference for Normal Means

The PSAT pacakge provides functionalities for performing inference on multivariate normal vectors and generalized linear models that pass a linear aggregate test. Suppose that we observe $y \sim N_p(\mu, \Sigma)$ and that we wish to perform inference on the individual coordinates of $\mu$ only if we can reject a linear test of the form:

$$ |a^{T} y| > c > 0, \qquad a\in\Re^{p}. $$

for some pre-determined threshold $c$ and vector $a$. Then, we must account for the fact that we have selected the model via a data driven method or in order to obtain valid p-values, confidence intervals that achieve the desire coverage rate and efficient point estimates. Let us begin with generating a dataset by sampling multivariate normal vectors and screening with a contrast $a = (1,\dots,1)$.

# Tests for wald test ------------------------
set.seed(5)
p <- 8
sparsity <- 2
snr <- 3
mu <- rexp(sparsity) 
mu <- mu / sum(abs(mu)) * snr
mu <- c(mu, rep(0, p - sparsity))
sigma <- matrix(0.3, nrow = p, ncol = p)
diag(sigma) <- 1
sqrtSig <- expm::sqrtm(sigma)
testmat <- solve(sigma)

pthreshold <- 0.001
a <- rep(1, p)
contrastSD <- sqrt(as.numeric(t(a) %*% sigma %*% a))
threshold <- qnorm(1 - pthreshold / 2, sd = contrastSD)
testStat <- 0
invcov <- solve(sigma)
while(abs(testStat) < threshold) {
  z <- rnorm(p)
  y <- as.numeric(z %*% sqrtSig) + mu
  testStat <- sum(y * a)
}
print(round(y, 2))

Our threshold is r round(threshold, 1) and the realized test statistic is r round(testStat, 2), so we just barely passed the test. Now, let us use the mvnQuadratic function to perform inference. The mvnQuadratic function can be used to compute several different types of confidence intervals and p-values. Generally speaking, the confidence interval methods global-null and switch are quite computationally intensive while the polyhedral method is very fast, yet somewhat less powerful when the signal is sparse. Similarly, the p-value methods hybrid and global-null are computationally intensive while the polyhedral method is not. In the following, we also demonstrate an option to provide the inference function with an optional set of contrasts to be tested/estimated as a matrix of dimension $l \times p$ where $l$ is the number of contrasts. If no such matrix is provided, the function will estimated the mean vector of $y$. Specifcally, we test the difference between the first and second coordinates of $y$, as well as the seventh and eighth coordinates.

library(PSAT)
diffCont <- matrix(0, nrow = 2, ncol = p)
diffCont[1, 1] <- 1
diffCont[1, 2] <- -1
diffCont[2, 7] <- 1
diffCont[2, 8] <- -1
contrasts <- rbind(diag(length(y)), diffCont)
allfit <- mvnLinear(y, sigma, testVec = a, pval_threshold = pthreshold, 
                    contrasts = contrasts,
                    # test_direction = "two-sided",
                    estimate_type = c("mle", "naive"),
                    pvalue_type = c("hybrid", "polyhedral", "naive"),
                    ci_type = c("switch", "polyhedral", "naive"),
                    confidence_level = .95, verbose = TRUE,
                    control = psatControl(nSamples = 10^4))

Notice that when calling mvnQuadratic, we listed several methods for confidence interval and p-value computations. The function will compute all of the requested quantities. A good way to examine the outputs of the function is to plot the estimates and confidence intervals, in the following function call we will provide the plotting function with the ground truth so we can see whether our confidence intervals have successfully identified the true signal.

trueMean <- as.numeric(contrasts %*% mu)
plot(allfit, type = "estimates", true = trueMean)

In the figure, notice that the naive confidence intervals miss one of the true signals. Also, notice that while the polyhedral confidence intervals tend to be narrower than the global-null or regime switching confidence intervals, they fail to detect that the first coordinate of the mean vector is different from zero. These are all typical behaviors of for these inference methods. Also, notice that the selection adjusted MLE has applied some shrinkage to the observed coordinates of the mean vector.

We can also plot the p-values with or without multiplicity adjustment.

noadjust <- plot(allfit, type = "p-values", threshold = c(0.1, 0.01),
                 adjust = "none")
bhadjust <- plot(allfit, type = "p-values", threshold = c(0.1, 0.01),
                 adjust = "BH")
gridExtra::grid.arrange(noadjust, bhadjust, nrow = 2)

Next, let us examine some ways to access the quantities we computed. The simplest way, is perhaps to access them directly through the returned object.

cat("polyhedral pvalues: ", round(allfit$polyPval, 3), "\n")
cat("hybrid pvalues: ", round(allfit$hybridPval, 3), "\n")

A second option is to use the getCI and getPval functions.

cat("polyhedral pvalues: ", round(getPval(allfit, type = "polyhedral"), 3), "\n")
print(round(getCI(allfit, type = "polyhedral"), 3))

The getCI function also allows to compute confidence intervals based on a different confidence level than the one used in the original computation

# Original confidence interval:
print(round(getCI(allfit, type = "switch"), 3))
# Some more conservative confidence intervals:
print(round(getCI(allfit, type = "switch", confidence_level = .99), 3))

Inference for Generalized Linear Models

The PSAT pacakge can also be used to perform inference in generalized linear models that were selected via an aggregate test. These functionalities are based on the assumption that the distribution of the (naive) maximum likelihood estimator for the regression coefficients is approximately normal. That is: $$ \hat\beta \sim N(\beta, (XWX)^{-1}) $$

where the weight matrix $W$ may depend on the estimated regression coefficients $\hat\beta$ and the type of GLM used. Next, let us generate data from a logistic regression model using an arbitrary matrix $K$ as our test matrix.

set.seed(1234)
pthreshold <- 0.001
n <- 500
p <- 20
xcov <- cov2cor(clusterGeneration::genPositiveDefMat(p)$Sigma)
X <- as.matrix(scale(mvtnorm::rmvnorm(n, sigma = xcov)))
X <- as.matrix(scale(X))
sparsity <- 3
snr <- 0.5
beta <- rnorm(sparsity)
beta <- beta / sum(abs(beta)) * snr
beta <- c(beta, rep(0, p - sparsity))
a <- runif(p)
eta <- -0.5 + as.numeric(X %*% beta)
probs <- 1 / (1 + exp(-eta))
testmat <- cov2cor(clusterGeneration::genPositiveDefMat(p)$Sigma)
while(TRUE) {
  y <- rbinom(n, 1, probs)
  glmfit <- glm(y ~ X, family = "binomial")
  est <- coef(glmfit)[-1]
  sigma <- vcov(glmfit)[-1, , drop = FALSE][, -1, drop = FALSE]
  contrastSD <- sqrt(as.numeric(t(a) %*% sigma %*% a))
  threshold <- qnorm(1 - pthreshold, sd = contrastSD)
  testStat <- sum(est * a)
  if(abs(testStat) > threshold) break
}

Performing inference and plotting estimates:

glmfit <- psatGLM(X, y, test = a, test_direction = "two-sided",
                  family = "binomial",
                  resid_sd = c("naive", "null"),
                  threshold = NULL, pval_threshold = pthreshold,
                  estimate_type = c("mle", "naive"),
                  pvalue_type = c("hybrid", "polyhedral", "naive"),
                  ci_type = c("switch", "polyhedral", "naive"),
                  confidence_level = .95, verbose = TRUE, 
                  control = psatControl(trueHybrid = TRUE))
plot(glmfit, type = "estimates", true = beta)
plot(glmfit, type = "p-values", adjust = "none", threshold = c(0.05, 0.05 / p))

Notice that the naive confidence intervals miss three of the true coefficients and that the switching regime confidence intervals manage to identify one of the coefficients as being non-zero. The psatGLM function also provides some standard regression interfaces such as the coef, predict and summary functions. When reporting confidence intervals, pvalues or summaries, the reported quantities will be the first listed in the original function call, unless specified otherwise.

print(round(rbind(coef(glmfit, type = "naive"), coef(glmfit, type = "mle"))[, 1:10], 3))
print(round(predict(glmfit)[1:10], 3))
summary(glmfit, pvalue_type = "hybrid", ci_type = "switch")


ammeir2/PSAT documentation built on May 27, 2019, 7:40 a.m.