# igg.normalmeans: Normal Means Estimation and Classification with the IGG Prior In IGG: Inverse Gamma-Gamma

## Description

This function provides a fully Bayesian approach for obtaining a sparse estimate of θ = (θ_1, ..., θ_n) in the normal means problem,

X_i = θ_i + ε_i,

where ε_i \sim N(0, σ^2). This is achieved by placing the inverse gamma-gamma (IGG) prior on the individual θ_i's. Variable selection can also be performed by using either thresholding the shrinkage factor in the posterior mean, or by examining the marginal 95 percent credible intervals.

## Usage

 1 2 igg.normalmeans(x, a=NA, b=NA, sigma2=NA, var.select = c("threshold", "intervals"), max.steps=10000, burnin=5000) 

## Arguments

 x an n \times 1 multivariate normal vector. a The parameter for IG(a,1). If not specified (a=NA), defaults to 1/2+1/n. User may specify a value for a between 0 and 1. b The parameter for G(b,1). If not specified (b=NA), defaults to 1/n. User may specify a value for b between 0 and 1. sigma2 The variance parameter. If the user does not specify this (sigma2=NA), the Gibbs sampler will estimate this using Jeffreys prior. If sigma^2 is known or estimated separately (e.g. through empirical Bayes), the user may also specify it. var.select The method of variable selection. threshold selects variables by thresholding the shrinkage factor in the posterior mean. intervals will classify entries of x as either signals (x_i \neq 0) or as noise (x_i = 0) by examining the 95 percent marginal credible intervals. max.steps The total number of iterations to run in the Gibbs sampler. Defaults to 10,000. burnin The number of burn-in iterations for the Gibbs sampler. Defaults to 5,000.

## Details

The function performs sparse estimation of θ = (θ_1, ..., θ_n) in normal means problem. The full model is:

X | θ \sim N_n( θ, σ^2 I_n),

θ_i | (λ_i,ξ_i, σ^2) \sim N(0, σ^2 λ_i ξ_i), i = 1, ..., n,

λ_i \sim IG(a,1), i = 1, ..., n,

ξ_i \sim G(b,1), i = 1, ..., n,

σ^2 \propto 1/σ^2.

If σ^2 is known or estimated separately, the Gibbs sampler does not sample from the full conditional for σ^2.

As described in Bai and Ghosh (2018), the posterior mean is of the form [E(1-κ_i | X_i)] X_i, i = 1, ..., n. The "threshold" method for variable selection is to classify θ_i as signal (θ_i \neq 0) if

E(1 - κ_i | X_i) > 1/2,

and to classify θ_i as noise (θ_i = 0) if

E(1 - κ_i |X_i) ≤q 1/2.

## Value

The function returns a list containing the following components:

 theta.hat The posterior mean estimate of θ. theta.med The posterior median estimate of θ. theta.intervals The lower and upper endpoints of the 95 percent credible intervals for all n components of θ. igg.classifications An n-dimensional binary vector with "1" if the covariate is selected and "0" if it is deemed irrelevant.

## Author(s)

Ray Bai and Malay Ghosh

## References

Bai, R. and Ghosh, M. (2018). "The Inverse Gamma-Gamma Prior for Optimal Posterior Contraction and Multiple Hypothesis Testing." Submitted, arXiv:1711.07635.

## Examples

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 ################################################### ################################################### ## Example on synthetic data. ## ## 5 percent of entries in a sparse vector theta ## ## are set equal to signal value A =7. ## ################################################### ################################################### n <- 100 sparsity.level <- 5 A <- 7 # Initialize theta vector of all zeros theta.true <- rep(0,n) # Set (sparsity.level)% of them to be A q <- floor(n*(sparsity.level/100)) # Pick random indices of theta.true to equal A signal.indices <- sample(1:n, size=q, replace=FALSE) ####################### # Generate true theta # ####################### theta.true[signal.indices] <- A ####################################################### # Generate data X by corrupting theta.true with noise # ####################################################### X <- theta.true + rnorm(n,0,1) ########################## # Run the IGG model on X # ########################## # For optimal performance, should set max.steps=10,000 and burnin=5000. igg.model <- igg.normalmeans(X, var.select="threshold", max.steps=2000, burnin=1000) ################################ # Calculate mean squared error # ################################ igg.mse <- sum((igg.model$theta.med-theta.true)^2)/n igg.mse # To compute misclassification probability and false discovery rate true.classifications <- rep(0,n) signal.indicies <- which(theta.true != 0) true.classifications[signal.indices] <- 1 igg.classifications <- igg.model$igg.classifications false.pos <- length(which(igg.classifications != 0 & true.classifications == 0)) num.pos <- length(which(igg.classifications != 0)) false.neg <- length(which(igg.classifications == 0 & true.classifications != 0)) ######################################################## # Calculate FDR and misclassification probability (MP) # ######################################################## igg.fdr <- false.pos/num.pos igg.fdr igg.mp <- (false.pos+false.neg)/n igg.mp ## Not run: # # ###################################################### ###################################################### ## Prostate cancer data analysis. ## ## Running this code will allow you to reproduce ## ## the results in Section 6 of Bai and Ghosh (2018) ## ###################################################### ###################################################### # Load the data data(singh2002) attach(singh2002) # Only look at the gene expression data. # First 50 rows are the cancer patients, # and the last 52 rows are the control subjects.d prostate.data <- singh2002$x # Perform 2-sample t-test and obtain z-scores n <- ncol(prostate.data) test.stat <- rep(NA,n) z.scores <- rep(NA, n) # Fill in the vectors for(i in 1:n){ test.stat[i] <- t.test(prostate.data[51:102,i], prostate.data[1:50,i])$statistic z.scores[i] <- qnorm(pt(test.stat[i],100)) } ####################################### # Apply IGG model on the z-scores. # # Here sigma2 is known with sigma2= 1 # ####################################### igg.model <- igg.normalmeans(z.scores, sigma2=1, var.select="threshold") ########################################## # How many genes flagged as significant? # ########################################## num.sig <- sum(igg.model$igg.classifications != 0) num.sig ####################################################### # Estimated effect size for 10 most significant genes # ####################################################### most.sig <- c(610,1720,332,364,914,3940,4546,1068,579,4331) igg.model$theta.hat[most.sig] ## End(Not run) 

IGG documentation built on May 2, 2019, 2:04 a.m.