forwsimDiffExpr: Forward simulation for differential expression.
In gaga: GaGa hierarchical model for high-throughput data analysis

Description Usage Arguments Details Value Author(s) References See Also Examples

Forward simulation allows to evaluate the expected utility for sequential designs. Here the utility is the expected number of true discoveries minus a sampling cost. The routine simulates future data either from the prior predictive or using a set of pilot data and a GaGa or normal-normal model fit. At each future time point, it computes a summary statistic that will be used to determine when to stop the experiment.

1 2	forwsimDiffExpr(fit, x, groups, ngenes, maxBatch, batchSize, fdrmax = 0.05, genelimit, v0thre = 1, B = 100, Bsummary = 100, trace = TRUE, randomSeed, mc.cores=1)

`fit`	Either GaGa or MiGaGa fit (object of type `gagafit`, as returned by `fitGG`) or Normal-Normal fit (type `nnfit` returned by `fitNN`).
`x`	`ExpressionSet`, `exprSet`, data frame or matrix containing the gene expression measurements used to fit the model.
`groups`	If `x` is of type `ExpressionSet` or `exprSet`, `groups` should be the name of the column in `pData(x)` with the groups that one wishes to compare. If `x` is a matrix or a data frame, `groups` should be a vector indicating to which group each column in x corresponds to.
`ngenes`	Number of genes to simulate data for. If `x` is specified this argument is set to `nrow(x)` and data is simulated from the posterior predictive conditional on `x`. If `x` not specified simulation is from the prior predictive.
`maxBatch`	Maximum number of batches, i.e. the routine simulates `batchSize`*`maxBatch` samples per group.
`batchSize`	Batch size, i.e. number of observations per group to simulate at each time point. Defaults to `ncol(x)/length(unique(groups))`.
`fdrmax`	Upper bound on FDR.
`genelimit`	Only the `genelimit` genes with the lowest probability of being equally expressed across all groups will be simulated. Setting this limit can significantly increase the computational speed.
`v0thre`	Only genes with posterior probability of being equally expressed < `v0thre` will be simulated. Setting this limit can significantly increase the computational speed.
`B`	Number of forward simulations.
`Bsummary`	Number of simulations for estimating the summary statistic.
`trace`	For `trace==TRUE` iteration progress is displayed.
`randomSeed`	Integer value used to set random number generator seed. Defaults to `as.numeric(Sys.time())` modulus 10^6.
`mc.cores`	If `multicore` package is available, `mc.cores` indicates the number of cores to use for parallel computing. Currently only used when `fit` is of class `nnfit`.

To improve computational speed hyper-parameters are not re-estimated as new data is simulated.

A data.frame with the following columns:

`simid`	Simulation number.
`j`	Time (sample size).
`u`	Expected number of true positives if we were to stop experimentation at this time.
`fdr`	Expected FDR if we were to stop experimentation at this time.
`fnr`	Expected FNR if we were to stop experimentation at this time.
`power`	Expected power (as estimated by E(TP)/E(positives)) if we were to stop experimentation at this time.
`summary`	Summary statistic: increase in expected true positives if we were to obtain one more data batch.

David Rossell.

Rossell D., Mueller P. Sequential sample sizes for high-throughput hypothesis testing experiments. http://sites.google.com/site/rosselldavid/home.

Rossell D. GaGa: a simple and flexible hierarchical model for microarray data analysis. Annals of Applied Statistics, 2009, 3, 1035-1051.

plotForwSim to plot the simulated trajectories, fitGG for fitting a GaGa model, fitNN for fitting a normal-normal model, seqBoundariesGrid for finding the optimal design based on the forwards simulation output. powfindgenes for fixed sample size calculations.

#Simulate data and fit GaGa model
set.seed(1)
x <- simGG(n=20,m=2,p.de=.5,a0=3,nu=.5,balpha=.5,nualpha=25)
gg1 <- fitGG(x,groups=1:2,method='EM')
gg1 <- parest(gg1,x=x,groups=1:2)

#Run forward simulation
fs1 <- forwsimDiffExpr(gg1, x=x, groups=1:2,
maxBatch=2,batchSize=1,fdrmax=0.05, B=100, Bsummary=100, randomSeed=1)

#Expected number of true positives for each sample size
tapply(fs1$u,fs1$time,'mean')

#Expected utility for each sample size
samplingCost <- 0.01
tapply(fs1$u,fs1$time,'mean') - samplingCost*(0:2)

#Optimal sequential design
b0seq <- seq(0,20,length=200); b1seq <- seq(0,40,length=200)
bopt <-seqBoundariesGrid(b0=b0seq,b1=b1seq,forwsim=fs1,samplingCost=samplingCost,powmin=0)
bopt <- bopt$opt

plot(fs1$time,fs1$summary,xlab='Additional batches',ylab='E(newly discovered DE genes)')
abline(bopt['b0'],bopt['b1'])
text(.2,bopt['b0'],'Continue',pos=3)
text(.2,bopt['b0'],'Stop',pos=1)