igg.normalmeans: Normal Means Estimation and Classification with the IGG Prior
In IGG: Inverse Gamma-Gamma

Description Usage Arguments Details Value Author(s) References Examples

This function provides a fully Bayesian approach for obtaining a sparse estimate of θ = (θ_1, ..., θ_n) in the normal means problem,

X_i = θ_i + ε_i,

where ε_i \sim N(0, σ^2). This is achieved by placing the inverse gamma-gamma (IGG) prior on the individual θ_i's. Variable selection can also be performed by using either thresholding the shrinkage factor in the posterior mean, or by examining the marginal 95 percent credible intervals.

1 2	igg.normalmeans(x, a=NA, b=NA, sigma2=NA, var.select = c("threshold", "intervals"), max.steps=10000, burnin=5000)

`x`	an n \times 1 multivariate normal vector.
`a`	The parameter for IG(a,1). If not specified (`a=NA`), defaults to 1/2+1/n. User may specify a value for `a` between 0 and 1.
`b`	The parameter for G(b,1). If not specified (`b=NA`), defaults to 1/n. User may specify a value for `b` between 0 and 1.
`sigma2`	The variance parameter. If the user does not specify this (`sigma2=NA`), the Gibbs sampler will estimate this using Jeffreys prior. If sigma^2 is known or estimated separately (e.g. through empirical Bayes), the user may also specify it.
`var.select`	The method of variable selection. `threshold` selects variables by thresholding the shrinkage factor in the posterior mean. `intervals` will classify entries of x as either signals (x_i \neq 0) or as noise (x_i = 0) by examining the 95 percent marginal credible intervals.
`max.steps`	The total number of iterations to run in the Gibbs sampler. Defaults to 10,000.
`burnin`	The number of burn-in iterations for the Gibbs sampler. Defaults to 5,000.

The function performs sparse estimation of θ = (θ_1, ..., θ_n) in normal means problem. The full model is:

X | θ \sim N_n( θ, σ^2 I_n),

θ_i | (λ_i,ξ_i, σ^2) \sim N(0, σ^2 λ_i ξ_i), i = 1, ..., n,

λ_i \sim IG(a,1), i = 1, ..., n,

ξ_i \sim G(b,1), i = 1, ..., n,

σ^2 \propto 1/σ^2.

If σ^2 is known or estimated separately, the Gibbs sampler does not sample from the full conditional for σ^2.

As described in Bai and Ghosh (2018), the posterior mean is of the form [E(1-κ_i | X_i)] X_i, i = 1, ..., n. The "threshold" method for variable selection is to classify θ_i as signal (θ_i \neq 0) if

E(1 - κ_i | X_i) > 1/2,

and to classify θ_i as noise (θ_i = 0) if

E(1 - κ_i |X_i) ≤q 1/2.

The function returns a list containing the following components:

`theta.hat`	The posterior mean estimate of θ.
`theta.med`	The posterior median estimate of θ.
`theta.intervals`	The lower and upper endpoints of the 95 percent credible intervals for all n components of θ.
`igg.classifications`	An n-dimensional binary vector with "1" if the covariate is selected and "0" if it is deemed irrelevant.

Ray Bai and Malay Ghosh

Bai, R. and Ghosh, M. (2018). "The Inverse Gamma-Gamma Prior for Optimal Posterior Contraction and Multiple Hypothesis Testing." Submitted, arXiv:1711.07635.

###################################################
###################################################
## Example on synthetic data.                    ## 
## 5 percent of entries in a sparse vector theta ##
## are set equal to signal value A =7.           ##
###################################################
###################################################

n <- 100
sparsity.level <- 5
A <- 7

# Initialize theta vector of all zeros
theta.true <- rep(0,n)

# Set (sparsity.level)% of them to be A
q <- floor(n*(sparsity.level/100))


# Pick random indices of theta.true to equal A
signal.indices <- sample(1:n, size=q, replace=FALSE)

#######################
# Generate true theta #
#######################
theta.true[signal.indices] <- A

####################################################### 
# Generate data X by corrupting theta.true with noise #
#######################################################
X <- theta.true + rnorm(n,0,1)

##########################
# Run the IGG model on X #
##########################
# For optimal performance, should set max.steps=10,000 and burnin=5000.

igg.model <- igg.normalmeans(X, var.select="threshold", max.steps=2000, burnin=1000)

################################ 
# Calculate mean squared error #
################################
igg.mse <- sum((igg.model$theta.med-theta.true)^2)/n
igg.mse

# To compute misclassification probability and false discovery rate
true.classifications <- rep(0,n)
signal.indicies <- which(theta.true != 0)
true.classifications[signal.indices] <- 1 
igg.classifications <- igg.model$igg.classifications

false.pos <- length(which(igg.classifications != 0 & true.classifications == 0))
num.pos <- length(which(igg.classifications != 0))
false.neg <- length(which(igg.classifications == 0 & true.classifications != 0))

########################################################
# Calculate FDR and misclassification probability (MP) #
########################################################
igg.fdr <- false.pos/num.pos
igg.fdr

igg.mp <- (false.pos+false.neg)/n
igg.mp

## Not run: 
#
#
######################################################
######################################################
## Prostate cancer data analysis.                   ##
## Running this code will allow you to reproduce    ##
## the results in Section 6 of Bai and Ghosh (2018) ##
######################################################
######################################################

# Load the data
data(singh2002)
attach(singh2002)

# Only look at the gene expression data.
# First 50 rows are the cancer patients,
# and the last 52 rows are the control subjects.d

prostate.data <- singh2002$x

# Perform 2-sample t-test and obtain z-scores
n <- ncol(prostate.data)
test.stat <- rep(NA,n)
z.scores <- rep(NA, n)

# Fill in the vectors
for(i in 1:n){
  test.stat[i] <- t.test(prostate.data[51:102,i], 
                        prostate.data[1:50,i])$statistic
  z.scores[i] <- qnorm(pt(test.stat[i],100))
}


#######################################
# Apply IGG model on the z-scores.    #
# Here sigma2 is known with sigma2= 1 #
#######################################

igg.model <- igg.normalmeans(z.scores, sigma2=1, var.select="threshold")

##########################################
# How many genes flagged as significant? #
##########################################
num.sig <- sum(igg.model$igg.classifications != 0)
num.sig

####################################################### 
# Estimated effect size for 10 most significant genes #
#######################################################
most.sig <- c(610,1720,332,364,914,3940,4546,1068,579,4331)
igg.model$theta.hat[most.sig]

## End(Not run)