Optimal design for genetical genomics experiments

Description

Main function to search and display A- and D- optimal designs for single- or two-channel genetical genomics experiments. Simulated annealing or Metropolis Hastings used to find the best design.

Usage

1
2
3
4
5
6
  designGG ( genotype, nSlides, nTuple, nEnvFactors, nLevels,
                     Level=NULL, bTwoColorArray=TRUE, initial=NULL, weight=1,
                     region=NULL, optimality="A", method="SA", nIterations=3000,
                     n.search=2, endTemp=1e-10, startTemp=1, maxTempStep=0.9,
                     plotScores=TRUE, directory=NULL, fileName=NULL, 
                     envFactorNames=NULL, writingProcess=TRUE )                     

Arguments

genotype

genotype data: a nMarker-by-nRILs matrix with two allels being 0 and 1 (or A and B) or three allels being 0, 0.5 and 1 (or, A, H, and B), where 0.5 (or H) represents heterozygous allele.

nSlides

total number of slides available for the experiment.

nTuple

average number of RILs (or strains) to be assigned onto each condition.
nTuple should be a real number which is larger than 1.
If nTuple < 1, the algorithm will stop and show the message,
warning: "The number of slides is too small to perform the experiment."

nEnvFactors

number of environmental factors, an integer bewteen 1 and 3. When nEnvFactors is 1 and the number of levels for the enviromental factor (nLevels)is 1, there is one condition in the experiment (i.e. no enviromental perturbation) and thus only genetic factor will be considered in the algorithm. When nEnvFactors is 1 and nLevels is larger than 1 or nEnvFactors is larger than 1, all main factor(s) and interacting facotr(s) will be included. Examples: If there is a temperature perturbation, then nEnvFactors is 1; If there is both temperature and drug treatment perturbation, then nEnvFactors is 2.

nLevels

number of levels for each factor, a vector with each component being an integer. The length of it should equal nEnvFactors.

Level

a list which specifies the levels for each factor in the experiment. There are in total nEnvFactors elements in the list and each element corresponds to a certain environmental factor. The element is a vector describing all levels of the environmental factor. default setting for the level of each factor is 1, 2, ...., nlevels[i]. (Here nLevels[i] is the ith element of nLevels, which gives the total number of levels for i environmental factor).

bTwoColorArray

binary variable indicating experiment type:
bTwoColorArray <- TRUE \#for dual channel experiment
bTwoColorArray <- FALSE \#for single channel experiment

initial

the starting design matrix for the algorithm. If specified, this should be a list with 2 matrices:
condition.allocation: allocate RILs (or strains) into different conditional (nrow = nCondition, ncol= nRILs)
array.allocation: pair RILs (or strains) into sldies (nrow = nSlide, ncol = nRILs)
However, the algorithm does not require that a starting matrix is specified. Default = NULL.

weight

a vector with length of variableNumber which is calculated from function variableNumber. Default = 1 (which means the parameters to be estimated are all equally important during optimization). See details below.

region

genome region of biological interest. Default = NULL (which means the entire genome will considered).

optimality

type of optimality, i.e. "A" (A-optimality) or "D" (D-optimality). A-optimality minimizes $Trace((X'X)^-1)$, which corresponds to minimum average variance of the parameter estimates. D-optimality minimizes $det(X'X)^-1$, which corresponds to minimum generalized variance of the parameter estimates.

method

method for searching for an optimal design. "SA" uses simulated annealing. "MH" uses Metropolis Hasting. Default = "SA".

nIterations

number of iterations of the simulated annealing method. Default = 3000.

n.search

number of times for simulated annealing optimaization with different initial design, default = 2. Here it is suggested to be between 1 and 5. It should not to be too large because of the reaching computational burden.

endTemp

ending temperature of simulated annealing process. An important optimization parameter. Default = $1e^-10$.

startTemp

starting temperature of simulated annealing process. Default = 1.

maxTempStep

maximum temperature decreasing step for simulated annealing process. The parameter ensures that the multiplicative cooling factor is not smaller than that. If nIterations is too small, the preferred final temperature (endTemp) may not be reached. See Wit and McClure (2004) for details. Default = 0.9.

plotScores

If TRUE (default) it produces a plot of the optimazation by SA using the function plotAllScores.

directory

It tells where the resulting optimal design tables are to be stored. If NULL (default), it will take currect working directory.

fileName

the final optimal design table(s) in csv format and a plot (in png format) of all scores during SA process (if plotScores = TRUE) will be produced. The users can specify the table and plot name by setting fileName. If NULL (default) it produces files starting with "myDesignGG".

envFactorNames

a vector with names for all environmental factor(s). For example, for the experiment with two environmental factors of temperature and drug treatment: envFactorNames <- c( "Temperature", "Dosage" )
Default = NULL, then the output will use "F1" and "F2" to indicate the environmental factors.

writingProcess

If TRUE, it prints how much computation work has been finished in a file called "processing.txt". Default = TRUE.

Details

Given the genetic information of samples available for the experiment (genotype) and the information about experimental settings (nEnvFactors, nSlides,nLevels etc.), the algorithm searches for an A-optimal or D-optimal (see optimality) using simulated annealing (see method). A plot of the scores at each iterations can also be given using the plotAllScores function.
It also contains a number of the arguments:
region is used to specify the genome region that are of major interest to experimenters.
weight is used to define the weight of genetic and environmental factors, and interaction terms. Prior knowledge about expected effect sizes of interesting factors can also be incorporated as weight parameters for the algorithm. The weight is inversely proportional to the expected effect size of the corresponding parameter. Example parameter settings: Suppose to design an experiment with two environmental factors (F1, F2) and there are two diffferent levels for each environment. The levels are 16 and 24 for F1, and 5 and 10 for F2. Thus the following command can be used:
nEnvFactors <- 2
nLevels <- c ( 2, 2 )
levels <- list ( c(16, 24), c(5, 10) )
The length of parameter weight is dependent on the number of environmental factors:
When nEnvFactor = 0,
weight is 1 as there is only one parameter of interest (genotype).
When nEnvFactor = 1,
weight = c( $w_Q$, $w_F1$, $w_QF1$ )
When nEnvFactor = 2,
weight = c( $w_Q$, $w_F1$, $w_F2$, $w_QF1$, $w_QF2$, $w_F1F2$, $w_QF1F2$)
When nEnvFactor = 3,
weight = c( $w_Q$, $w_F1$, $w_F2$, $w_F2$, $w_QF1$, $w_QF2$, $w_QF3$, $w_F1F2$, $w_F1F3$, $w_F2F3$, $w_QF1F2$, $w_QF1F3$, $w_QF2F3$, $w_QF1F2F3$ )
Here $w_Q$ represents the weight for genotype effect, $w_F1$ represent the weight for F1 effect and $w_QF1$ represent the weight for interaction between genotype and F1 effect, etc.
It should be noted that the simulated annealing algorithm might find a locally and not globally optimal design. Running the optimization process multiple times is recommended. When nSearch > 1, the simulated annealing optimization will be run nSearch times, each run starts with a different initial design and will provide a (near-)optimal design. If the optimization problem is simple, all runs will converge to the same (optimal) design. Otherwise, the best one among all near-optimal designs will be selected as the output of the function. One can run the algorithm multiple times with nSearch = 1 to review a few (near-)optimal designs.

Value

An array design table (arrayDesign.csv) and a condition design table ( conditionDesign.csv) are generated.

Author(s)

Yang Li <yang.li@rug.nl>, Gonzalo Vera <gonzalo.vera.rodriguez@gmail.com>
Rainer Breitling <r.breitling@rug.nl>, Ritsert Jansen <r.c.jansen@rug.nl>

References

Y. Li, M. Swertz, G. Vera, J. Fu, R. Breitling, and R.C. Jansen. designGG: An R-package and Web tool for the optimal design of genetical genomics experiments. BMC Bioinformatics 10:188(2009)
http://gbic.biol.rug.nl/designGG
Y. Li, R. Breitling and R.C. Jansen. Generalizing genetical genomics: the added value from environmental perturbation, Trends Genet (2008) 24:518-524.
E. Wit and J. McClure. Statistics for Microarrays: Design, Analysis and Inference. (2004) Chichester: Wiley.

See Also

initialDesign, designScore, updateDesign, acceptanceProbability,
experimentDesignTable, plotAllScores,
exampleArrayDesignTable,exampleConditionDesignTable,

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  library(designGG)
  #load genotype data
  data(genotype)
  #Example:  single-channel experiment with 2 environmental factors,
  #each with 2 levels, and there will be four samples per condition(nTuple=4).
  optimalDesign <- designGG ( genotype, nSlides=NULL, nTuple=4, nEnvFactors=2,
                        nLevels=c(2,2),Level=list(c(16,24),c(5,10)),  bTwoColorArray=FALSE,
                        initial=NULL, weight=1, region=seq(1,20), optimality="A", 
                        method="SA", nIterations=100, n.search=2, endTemp=1e-10,
                        startTemp=1, maxTempStep=0.9, plotScores=TRUE, 
                        directory=NULL, fileName=NULL, envFactorNames=NULL, 
                        writingProcess=FALSE )
  #Example 2:  dual-channel experiment with 2 environmental factors,
  #each with 2 levels. There are 50 slides available.
  optimalDesign <- designGG ( genotype, nSlides=50, nTuple=NULL, nEnvFactors=2,
                        nLevels=c(2,2),Level=list(c(16,24),c(5,10)),  bTwoColorArray=TRUE,
                        initial=NULL, weight=1, region=seq(1,20), optimality="A", 
                        method="SA", nIterations=100, n.search=2, endTemp=1e-10,
                        startTemp=1, maxTempStep=0.9, plotScores=TRUE, 
                        directory=NULL, fileName=NULL, envFactorNames=NULL, 
                        writingProcess=FALSE )
  #result
  optimalDesign$arrayDesign
  optimalDesign$conditionDesign
  plotAllScores(optimalDesign$plot.obj)
  
  #Use the following commands to see example output tables: 
  data(exampleArrayDesignTable)
  exampleArrayDesignTable
  data(exampleConditionDesignTable)
  exampleConditionDesignTable