igg.normalmeans: Normal Means Estimation and Classification with the IGG Prior

Description Usage Arguments Details Value Author(s) References Examples

Description

This function provides a fully Bayesian approach for obtaining a sparse estimate of θ = (θ_1, ..., θ_n) in the normal means problem,

X_i = θ_i + ε_i,

where ε_i \sim N(0, σ^2). This is achieved by placing the inverse gamma-gamma (IGG) prior on the individual θ_i's. Variable selection can also be performed by using either thresholding the shrinkage factor in the posterior mean, or by examining the marginal 95 percent credible intervals.

Usage

1
2
igg.normalmeans(x, a=NA, b=NA, sigma2=NA, var.select = c("threshold", "intervals"), 
max.steps=10000, burnin=5000)

Arguments

x

an n \times 1 multivariate normal vector.

a

The parameter for IG(a,1). If not specified (a=NA), defaults to 1/2+1/n. User may specify a value for a between 0 and 1.

b

The parameter for G(b,1). If not specified (b=NA), defaults to 1/n. User may specify a value for b between 0 and 1.

sigma2

The variance parameter. If the user does not specify this (sigma2=NA), the Gibbs sampler will estimate this using Jeffreys prior. If sigma^2 is known or estimated separately (e.g. through empirical Bayes), the user may also specify it.

var.select

The method of variable selection. threshold selects variables by thresholding the shrinkage factor in the posterior mean. intervals will classify entries of x as either signals (x_i \neq 0) or as noise (x_i = 0) by examining the 95 percent marginal credible intervals.

max.steps

The total number of iterations to run in the Gibbs sampler. Defaults to 10,000.

burnin

The number of burn-in iterations for the Gibbs sampler. Defaults to 5,000.

Details

The function performs sparse estimation of θ = (θ_1, ..., θ_n) in normal means problem. The full model is:

X | θ \sim N_n( θ, σ^2 I_n),

θ_i | (λ_i,ξ_i, σ^2) \sim N(0, σ^2 λ_i ξ_i), i = 1, ..., n,

λ_i \sim IG(a,1), i = 1, ..., n,

ξ_i \sim G(b,1), i = 1, ..., n,

σ^2 \propto 1/σ^2.

If σ^2 is known or estimated separately, the Gibbs sampler does not sample from the full conditional for σ^2.

As described in Bai and Ghosh (2018), the posterior mean is of the form [E(1-κ_i | X_i)] X_i, i = 1, ..., n. The "threshold" method for variable selection is to classify θ_i as signal (θ_i \neq 0) if

E(1 - κ_i | X_i) > 1/2,

and to classify θ_i as noise (θ_i = 0) if

E(1 - κ_i |X_i) ≤q 1/2.

Value

The function returns a list containing the following components:

theta.hat

The posterior mean estimate of θ.

theta.med

The posterior median estimate of θ.

theta.intervals

The lower and upper endpoints of the 95 percent credible intervals for all n components of θ.

igg.classifications

An n-dimensional binary vector with "1" if the covariate is selected and "0" if it is deemed irrelevant.

Author(s)

Ray Bai and Malay Ghosh

References

Bai, R. and Ghosh, M. (2018). "The Inverse Gamma-Gamma Prior for Optimal Posterior Contraction and Multiple Hypothesis Testing." Submitted, arXiv:1711.07635.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
###################################################
###################################################
## Example on synthetic data.                    ## 
## 5 percent of entries in a sparse vector theta ##
## are set equal to signal value A =7.           ##
###################################################
###################################################

n <- 100
sparsity.level <- 5
A <- 7

# Initialize theta vector of all zeros
theta.true <- rep(0,n)

# Set (sparsity.level)% of them to be A
q <- floor(n*(sparsity.level/100))


# Pick random indices of theta.true to equal A
signal.indices <- sample(1:n, size=q, replace=FALSE)

#######################
# Generate true theta #
#######################
theta.true[signal.indices] <- A

####################################################### 
# Generate data X by corrupting theta.true with noise #
#######################################################
X <- theta.true + rnorm(n,0,1)

##########################
# Run the IGG model on X #
##########################
# For optimal performance, should set max.steps=10,000 and burnin=5000.

igg.model <- igg.normalmeans(X, var.select="threshold", max.steps=2000, burnin=1000)

################################ 
# Calculate mean squared error #
################################
igg.mse <- sum((igg.model$theta.med-theta.true)^2)/n
igg.mse

# To compute misclassification probability and false discovery rate
true.classifications <- rep(0,n)
signal.indicies <- which(theta.true != 0)
true.classifications[signal.indices] <- 1 
igg.classifications <- igg.model$igg.classifications

false.pos <- length(which(igg.classifications != 0 & true.classifications == 0))
num.pos <- length(which(igg.classifications != 0))
false.neg <- length(which(igg.classifications == 0 & true.classifications != 0))

########################################################
# Calculate FDR and misclassification probability (MP) #
########################################################
igg.fdr <- false.pos/num.pos
igg.fdr

igg.mp <- (false.pos+false.neg)/n
igg.mp

## Not run: 
#
#
######################################################
######################################################
## Prostate cancer data analysis.                   ##
## Running this code will allow you to reproduce    ##
## the results in Section 6 of Bai and Ghosh (2018) ##
######################################################
######################################################

# Load the data
data(singh2002)
attach(singh2002)

# Only look at the gene expression data.
# First 50 rows are the cancer patients,
# and the last 52 rows are the control subjects.d

prostate.data <- singh2002$x

# Perform 2-sample t-test and obtain z-scores
n <- ncol(prostate.data)
test.stat <- rep(NA,n)
z.scores <- rep(NA, n)

# Fill in the vectors
for(i in 1:n){
  test.stat[i] <- t.test(prostate.data[51:102,i], 
                        prostate.data[1:50,i])$statistic
  z.scores[i] <- qnorm(pt(test.stat[i],100))
}


#######################################
# Apply IGG model on the z-scores.    #
# Here sigma2 is known with sigma2= 1 #
#######################################

igg.model <- igg.normalmeans(z.scores, sigma2=1, var.select="threshold")

##########################################
# How many genes flagged as significant? #
##########################################
num.sig <- sum(igg.model$igg.classifications != 0)
num.sig

####################################################### 
# Estimated effect size for 10 most significant genes #
#######################################################
most.sig <- c(610,1720,332,364,914,3940,4546,1068,579,4331)
igg.model$theta.hat[most.sig]

## End(Not run)

IGG documentation built on May 2, 2019, 2:04 a.m.