Forward simulation for differential expression.

Share:

Description

Forward simulation allows to evaluate the expected utility for sequential designs. Here the utility is the expected number of true discoveries minus a sampling cost. The routine simulates future data either from the prior predictive or using a set of pilot data and a GaGa or normal-normal model fit. At each future time point, it computes a summary statistic that will be used to determine when to stop the experiment.

Usage

1
2
forwsimDiffExpr(fit, x, groups, ngenes, maxBatch, batchSize, fdrmax = 0.05, genelimit, v0thre = 1, B = 100,
Bsummary = 100, trace = TRUE, randomSeed, mc.cores=1)

Arguments

fit

Either GaGa or MiGaGa fit (object of type gagafit, as returned by fitGG) or Normal-Normal fit (type nnfit returned by fitNN).

x

ExpressionSet, exprSet, data frame or matrix containing the gene expression measurements used to fit the model.

groups

If x is of type ExpressionSet or exprSet, groups should be the name of the column in pData(x) with the groups that one wishes to compare. If x is a matrix or a data frame, groups should be a vector indicating to which group each column in x corresponds to.

ngenes

Number of genes to simulate data for. If x is specified this argument is set to nrow(x) and data is simulated from the posterior predictive conditional on x. If x not specified simulation is from the prior predictive.

maxBatch

Maximum number of batches, i.e. the routine simulates batchSize*maxBatch samples per group.

batchSize

Batch size, i.e. number of observations per group to simulate at each time point. Defaults to ncol(x)/length(unique(groups)).

fdrmax

Upper bound on FDR.

genelimit

Only the genelimit genes with the lowest probability of being equally expressed across all groups will be simulated. Setting this limit can significantly increase the computational speed.

v0thre

Only genes with posterior probability of being equally expressed < v0thre will be simulated. Setting this limit can significantly increase the computational speed.

B

Number of forward simulations.

Bsummary

Number of simulations for estimating the summary statistic.

trace

For trace==TRUE iteration progress is displayed.

randomSeed

Integer value used to set random number generator seed. Defaults to as.numeric(Sys.time()) modulus 10^6.

mc.cores

If multicore package is available, mc.cores indicates the number of cores to use for parallel computing. Currently only used when fit is of class nnfit.

Details

To improve computational speed hyper-parameters are not re-estimated as new data is simulated.

Value

A data.frame with the following columns:

simid

Simulation number.

j

Time (sample size).

u

Expected number of true positives if we were to stop experimentation at this time.

fdr

Expected FDR if we were to stop experimentation at this time.

fnr

Expected FNR if we were to stop experimentation at this time.

power

Expected power (as estimated by E(TP)/E(positives)) if we were to stop experimentation at this time.

summary

Summary statistic: increase in expected true positives if we were to obtain one more data batch.

Author(s)

David Rossell.

References

Rossell D., Mueller P. Sequential sample sizes for high-throughput hypothesis testing experiments. http://sites.google.com/site/rosselldavid/home.

Rossell D. GaGa: a simple and flexible hierarchical model for microarray data analysis. Annals of Applied Statistics, 2009, 3, 1035-1051.

See Also

plotForwSim to plot the simulated trajectories, fitGG for fitting a GaGa model, fitNN for fitting a normal-normal model, seqBoundariesGrid for finding the optimal design based on the forwards simulation output. powfindgenes for fixed sample size calculations.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#Simulate data and fit GaGa model
set.seed(1)
x <- simGG(n=20,m=2,p.de=.5,a0=3,nu=.5,balpha=.5,nualpha=25)
gg1 <- fitGG(x,groups=1:2,method='EM')
gg1 <- parest(gg1,x=x,groups=1:2)

#Run forward simulation
fs1 <- forwsimDiffExpr(gg1, x=x, groups=1:2,
maxBatch=2,batchSize=1,fdrmax=0.05, B=100, Bsummary=100, randomSeed=1)

#Expected number of true positives for each sample size
tapply(fs1$u,fs1$time,'mean')

#Expected utility for each sample size
samplingCost <- 0.01
tapply(fs1$u,fs1$time,'mean') - samplingCost*(0:2)

#Optimal sequential design
b0seq <- seq(0,20,length=200); b1seq <- seq(0,40,length=200)
bopt <-seqBoundariesGrid(b0=b0seq,b1=b1seq,forwsim=fs1,samplingCost=samplingCost,powmin=0)
bopt <- bopt$opt

plot(fs1$time,fs1$summary,xlab='Additional batches',ylab='E(newly discovered DE genes)')
abline(bopt['b0'],bopt['b1'])
text(.2,bopt['b0'],'Continue',pos=3)
text(.2,bopt['b0'],'Stop',pos=1)