Simulation of microarray data

Description

The function simulates microarray data for two-group comparison with user supplied parameters such as number of biomarkers (genes or proteins), sample size, biological and experimental (technical) variation, replication, differential expression, and correlation between biomarkers.

Usage

simData(nTrain=100,
        nGr1=floor(nTrain/2),
        nBiom=50,nRep=3,
        sdW=1.0,
        sdB=1.0,
        rhoMax=NULL, rhoMin=NULL, nBlock=NULL,bsMin=3, bSizes=NULL, gamma=NULL,
        sigma=0.1,diffExpr=TRUE,
        foldMin=2,
        orderBiom=TRUE,
        baseExpr=NULL)

Arguments

nTrain

Training set size, i.e., the total number of biological samples in group 1 (nGr1) and group 2.

nGr1

Size of group 1. Defaults to floor(nTrain/2).

nBiom

Number of biomarkers (genes, probes or proteins).

nRep

Number of technical replications.

sdW

Experimental (technical) variation (σ_e) of data in log (base 2) scale.

sdB

Biological variation (σ_b) of data in log (base 2) scale.

rhoMax

Maximum Pearson's correlation coefficient between biomarkers. To ensure positive definiteness, allowed values are restricted between 0 and 0.95 inclusive. If NULL, set to runif(1,min=0.6,max=0.8).

rhoMin

Minimum Pearson's correlation coefficient between biomarkers. To ensure positive definiteness, allowed values are restricted between 0 and 0.95 inclusive. If NULL, set to runif(1,min=0.2,max=0.4).

nBlock

Number of blocks in the block diagonal (Hub-Toeplitz) correlation matrix. If NULL, set to 1 for nBiom<5 and randomly selected from c(1:floor(nBiom/bsMin)) for nBiom>=5.

bsMin

Minimum block size. bsMin=3 by default.

bSizes

A vector of length nBlock representing the block sizes (should sum to nBiom). If NULL, set to c(bs+mod,rep(bs,nBlock-1)), where bs is the integer part of nBiom/nBlock and mod is the remainder after integer division.

gamma

Specifies a correlation structure. If NULL, assumes independence. gamma=0 indicates a single-block exchangeable correlation matrix with constant correlation rho=0.5*(rhoMin+rhoMax). A value greater than zero indicates a block diagonal (Hub-Toeplitz) correlation matrix with decline rate determined by the value of gamma. The decline rate is linear for gamma=1.

sigma

Standard deviation of the normal distribution (before truncation) from which fold changes are generated. See details.

diffExpr

Logical. Should systematic difference be introduced between the data of the two groups?

foldMin

Minimum value of fold changes. See details.

orderBiom

Logical. Should columns (biomarkers) be arranged in order of differential expression?

baseExpr

A vector of length nBiom to be used as base expressions μ. See realBiomarker for details.
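
The defaults described for bSizes and the gamma=0 case can be reproduced in a few lines of base R. This is a minimal sketch of the arithmetic stated above, not the package's internal code; the values chosen for nBiom, nBlock, rhoMin and rhoMax are illustrative assumptions.

```r
# Illustrative inputs (assumed values, chosen only for this sketch).
nBiom  <- 50
nBlock <- 4

# Default block sizes as described for bSizes:
# the first block absorbs the remainder of the integer division.
bs     <- nBiom %/% nBlock              # integer part of nBiom/nBlock
mod    <- nBiom %% nBlock               # remainder after integer division
bSizes <- c(bs + mod, rep(bs, nBlock - 1))

# The block sizes must account for every biomarker.
stopifnot(sum(bSizes) == nBiom)

# gamma = 0: single-block exchangeable correlation matrix
# with constant correlation rho = 0.5 * (rhoMin + rhoMax).
rhoMin <- 0.3
rhoMax <- 0.7
rho    <- 0.5 * (rhoMin + rhoMax)
R      <- matrix(rho, nrow = nBiom, ncol = nBiom)
diag(R) <- 1

# Positive definiteness check: all eigenvalues must be strictly positive.
eigVals <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
isPD    <- all(eigVals > 0)
```

The eigenvalue check illustrates why the correlation bounds stay below 1: a constant correlation of exactly 1 would make the matrix singular.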

Details

Differential expression is introduced by adding zδ to the data of group 2, where the δ values are generated from a truncated normal distribution and z is randomly selected from {-1, 1} to characterise up- or down-regulation.

Assuming that Y ~ N(μ, σ^2) and that A=[a_1,a_2] is a subset of -Inf < y < Inf, the conditional distribution of Y given that Y lies in A is called a truncated normal distribution:

f(y, μ, σ) = (1/σ) φ((y-μ)/σ) / (Φ((a_2-μ)/σ) - Φ((a_1-μ)/σ))

for a_1 <= y <= a_2, and 0 otherwise,

where μ is the mean of the original normal distribution before truncation, σ is the corresponding standard deviation, a_1 is the lower truncation point, a_2 is the upper truncation point, φ(x) is the density of the standard normal distribution, and Φ(x) is the distribution function of the standard normal distribution. For the simData function, we consider a_1=log_2(foldMin) and a_2=Inf. This ensures that the biomarkers are differentially expressed by a fold change of foldMin or more.
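
The truncation scheme can be sketched in base R with inverse-CDF sampling. This is an illustration under stated assumptions, not the code used inside simData: the helper name rtnormLower is hypothetical, and the pre-truncation mean mu is set to log2(foldMin) only as an assumption, because the text above does not state which mean the function uses.

```r
# Hypothetical helper: draw n values from N(mu, sigma^2) truncated to [a1, Inf).
rtnormLower <- function(n, mu, sigma, a1) {
  # Probability mass of the untruncated normal lying below a1.
  pLow <- pnorm(a1, mean = mu, sd = sigma)
  # Uniform draws restricted to the upper tail, mapped back through the quantile function.
  u <- runif(n, min = pLow, max = 1)
  qnorm(u, mean = mu, sd = sigma)
}

set.seed(1)                    # arbitrary seed for reproducibility
nBiom   <- 1000                # illustrative number of biomarkers
foldMin <- 2
sigma   <- 0.1
a1      <- log2(foldMin)       # lower truncation point on the log2 scale

# ASSUMPTION: mu = a1; the documentation does not specify the pre-truncation mean.
delta <- rtnormLower(nBiom, mu = a1, sigma = sigma, a1 = a1)

# z selects up- or down-regulation for each biomarker.
z     <- sample(c(-1, 1), nBiom, replace = TRUE)
shift <- z * delta             # amount added to the group 2 data (log2 scale)

# Every biomarker differs by at least foldMin on the original scale.
minFold <- min(2^abs(shift))
```

Because a_2=Inf, only the lower tail is cut off, so the sampler needs just the single probability pLow rather than both truncation points.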

Value

A data frame of dimension nTrain by nBiom+1. The first column is a factor (class) representing the group memberships of the samples.

Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal
Maintainer: Mizanur Khondoker mizanur.khondoker@gmail.com.

References

Khondoker, M. R., Bachmann, T. T., Mewissen, M., Dickinson, P. et al. (2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.

See Also

classificationError

Examples

simData(nTrain=10,nBiom=3)
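
A slightly longer sketch that inspects the returned object may also be useful. It assumes that simData is provided by the optBiomarker package (the package name is not stated on this page) and runs only if that package is installed; the seed is arbitrary.

```r
# ASSUMPTION: simData() comes from the optBiomarker package.
if (requireNamespace("optBiomarker", quietly = TRUE)) {
  set.seed(123)  # arbitrary seed so the simulated data are reproducible
  dat <- optBiomarker::simData(nTrain = 10, nBiom = 3)

  print(dim(dat))             # documented as nTrain by nBiom+1
  print(is.factor(dat[[1]]))  # first column holds the group memberships
  print(table(dat[[1]]))      # group sizes; nGr1 defaults to floor(nTrain/2)
}
```

Accessing the first column by position rather than by name avoids relying on a column label.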