# Simulation of microarray data

### Description

The function simulates microarray data for two-group comparison with user supplied parameters such as number of biomarkers (genes or proteins), sample size, biological and experimental (technical) variation, replication, differential expression, and correlation between biomarkers.

### Usage


### Arguments

`nTrain` |
Training set size, i.e., the total number of biological
samples in group 1 ( |

`nGr1` |
Size of group 1. Defaults to |

`nBiom` |
Number of biomarkers (genes, probes or proteins). |

`nRep` |
Number of technical replications. |

`sdW` |
Experimental (technical) variation ( |

`sdB` |
Biological variation ( |

`rhoMax` |
Maximum Pearson's correlation coefficient between
biomarkers. To ensure positive definiteness, allowed values are
restricted between 0 and 0.95 inclusive. If |

`rhoMin` |
Minimum Pearson's correlation coefficient between
biomarkers. To ensure positive definiteness, allowed values are
restricted between 0 and 0.95 inclusive. If |

`nBlock` |
Number of blocks in the block diagonal (Hub-Toeplitz)
correlation matrix. If |

`bsMin` |
Minimum block size. |

`bSizes` |
A vector of length |

`gamma` |
Specifies a correlation structure. If |

`sigma` |
Standard deviation of the normal distribution (before truncation) from which fold changes are generated. See details. |

`diffExpr` |
Logical. Should systematic difference be introduced between the data of the two groups? |

`foldMin` |
Minimum value of fold changes. See details. |

`orderBiom` |
Logical. Should columns (biomarkers) be arranged in order of differential expression? |

`baseExpr` |
A vector of length |
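
The correlation arguments (`rhoMax`, `rhoMin`, `nBlock`, `gamma`) describe a block-diagonal correlation matrix. The following is a minimal base-R sketch of one way such a matrix could be assembled. It is not taken from the `simData` source: the helper names `toeplitz_block` and `block_diag_corr`, the decline formula, and the interpretation of `gamma` as the exponent controlling how quickly correlations fall from `rho_max` to `rho_min` are all assumptions made for illustration.

```r
# Hypothetical helper (not part of the package): one Toeplitz block whose
# off-diagonal correlations decline from rho_max to rho_min with distance.
# ASSUMPTION: gamma is the exponent of the decline (gamma = 1 is linear).
toeplitz_block <- function(size, rho_max, rho_min, gamma = 1) {
  if (size == 1) return(matrix(1, 1, 1))
  if (size == 2) return(toeplitz(c(1, rho_max)))
  k <- 0:(size - 2)
  rho <- rho_max - (k / (size - 2))^gamma * (rho_max - rho_min)
  toeplitz(c(1, rho))
}

# Hypothetical helper: place the blocks along the diagonal, so that
# biomarkers in different blocks are uncorrelated.
block_diag_corr <- function(block_sizes, rho_max, rho_min, gamma = 1) {
  p <- sum(block_sizes)
  R <- matrix(0, p, p)
  start <- 1
  for (s in block_sizes) {
    idx <- start:(start + s - 1)
    R[idx, idx] <- toeplitz_block(s, rho_max, rho_min, gamma)
    start <- start + s
  }
  R
}

# Two blocks of sizes 3 and 2 (illustrative values only)
R <- block_diag_corr(c(3, 2), rho_max = 0.9, rho_min = 0.3)
round(R, 2)
```

This sketch does not check positive definiteness; a matrix built this way should be verified (for example with `eigen`) before being used to generate correlated data.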

### Details

Differential expressions are introduced by adding *zδ* to the data
of group 2, where the *δ* values are generated from a truncated normal
distribution and *z* is randomly selected from `(-1,1)` to
characterise up- or down-regulation.

Assuming that *Y ~ N(μ, σ^2)*, and that *A = [a_1, a_2]* is a subset of
*-Inf < y < Inf*, the conditional distribution of *Y* given *A*
is called the truncated normal distribution:

*f(y, μ, σ) = (1/σ) φ((y-μ)/σ) / (Φ((a_2-μ)/σ) - Φ((a_1-μ)/σ))*

for *a_1 <= y <= a_2*, and 0 otherwise,

where *μ* is the mean of the original normal distribution before truncation,
*σ* is the corresponding standard deviation, *a_2* is the upper truncation point,
*a_1* is the lower truncation point, *φ(x)* is the density of the
standard normal distribution, and *Φ(x)* is the distribution function
of the standard normal distribution. For the `simData` function, we
consider *a_1 = log_2(foldMin)* and *a_2 = Inf*. This ensures that the
biomarkers are differentially expressed by a fold change of
`foldMin` or more.
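
The truncation described above can be illustrated with a short base-R sketch that draws *δ* by inverse-CDF sampling and attaches a random sign *z*. This is not the package's implementation: the helper name `r_trunc_lower` and the values *μ = 0*, *σ = 0.5* and `foldMin = 2` are assumptions chosen only for the illustration.

```r
# Hypothetical helper (not part of the package): sample from a normal
# distribution truncated below at `lower` (a_1 = lower, a_2 = Inf).
r_trunc_lower <- function(n, mu, sigma, lower) {
  p_lower <- pnorm(lower, mean = mu, sd = sigma)  # probability mass below a_1
  u <- runif(n, min = p_lower, max = 1)           # uniform on the retained mass
  qnorm(u, mean = mu, sd = sigma)                 # map back to the normal scale
}

set.seed(1)
n_biom   <- 5    # illustrative number of biomarkers
fold_min <- 2    # illustrative minimum fold change
# ASSUMPTION: mu = 0 and sigma = 0.5 are example values only
delta <- r_trunc_lower(n_biom, mu = 0, sigma = 0.5, lower = log2(fold_min))

z     <- sample(c(-1, 1), n_biom, replace = TRUE)  # up- or down-regulation
shift <- z * delta                                 # amount added to group 2
```

Every element of `delta` is at least `log2(fold_min)` (up to floating-point rounding), so on the log2 scale each biomarker in group 2 is shifted up or down by a fold change of at least `fold_min`.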

### Value

A dataframe of dimension `nTrain` by `nBiom+1`. The first
column is a factor (`class`) representing the group memberships of
the samples.
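
To make the returned shape concrete without calling the package, the sketch below builds a mock data frame with the same layout. It is only an illustration: the variable names, the group labels `1` and `2`, the group-1 size of 5, and the assumption that technical replicates are averaged (so the technical standard deviation shrinks by `sqrt(n_rep)`) are not taken from the `simData` source.

```r
set.seed(42)
n_train <- 10   # total number of biological samples
n_gr1   <- 5    # illustrative size of group 1 (not the documented default)
n_biom  <- 3    # number of biomarkers
n_rep   <- 2    # number of technical replicates
sd_b    <- 1    # biological standard deviation (example value)
sd_w    <- 0.5  # technical standard deviation (example value)

# Biological variation, one value per sample and biomarker
biological <- matrix(rnorm(n_train * n_biom, sd = sd_b), n_train, n_biom)
# ASSUMPTION: replicates are averaged, giving sd sd_w / sqrt(n_rep)
technical <- matrix(rnorm(n_train * n_biom, sd = sd_w / sqrt(n_rep)),
                    n_train, n_biom)
expr <- biological + technical

# First column is the class factor, followed by one column per biomarker
mock <- data.frame(
  class = factor(rep(c(1, 2), times = c(n_gr1, n_train - n_gr1))),
  expr
)
dim(mock)
```

The mock object has `n_train` rows and `n_biom + 1` columns, matching the dimensions described for the value of `simData`.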

### Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal

Maintainer: Mizanur Khondoker mizanur.khondoker@gmail.com.

### References

Khondoker, M. R., Bachmann, T. T., Mewissen, M., Dickinson, P. *et al.* (2010).
Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules.
*Journal of Bioinformatics and Computational Biology*, **8**, 945-965.

### See Also

`classificationError`

### Examples

```
simData(nTrain = 10, nBiom = 3)
```