Simulation of microarray data

Description

Simulates 'n' samples. Each sample has 'ng' genes, of which only 'nsg' are called significant and influence the class labels. The remaining '(ng - nsg)' genes are called balanced. All gene ratios are drawn from a multivariate normal distribution. Blocks of highly correlated genes can optionally be created.

Usage

sim.data(n = 256, ng = 1000, nsg = 100,
		 p.n.ratio = 0.5, 
		 sg.pos.factor= 1, sg.neg.factor= -1,
		 # correlation info:
		 corr = FALSE, corr.factor = 0.8,
		 # block info:
		 blocks = FALSE, n.blocks = 6, nsg.block = 1, ng.block = 5, 
		 seed = 123, ...)

Arguments

n

number of samples; logistic regression works well if n > 200

ng

number of genes

nsg

number of significant genes

p.n.ratio

ratio between positive and negative significant genes (default 0.5)

sg.pos.factor

impact factor of positive significant genes on the classification (default 1)

sg.neg.factor

impact factor of negative significant genes on the classification (default -1)

corr

are the genes correlated with each other? (default FALSE). See Details.

corr.factor

correlation factor for genes, between 0 and 1 (default 0.8)

blocks

are blocks of highly correlated genes allowed? (default FALSE)

n.blocks

number of blocks

nsg.block

number of significant genes per block

ng.block

number of genes per block

seed

seed for the random number generator (default 123)

...

additional argument(s)

Details

If no blocks are defined (n.blocks=0 or blocks=FALSE) and corr=TRUE, a covariance matrix is created for all genes, with correlation decreasing as the distance between gene indices grows: cov(i,j) = cov(j,i) = corr.factor^|i-j|
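The decaying covariance structure can be sketched as follows. This is a minimal illustration, not the package's own code: the helper name `decay.cov` is hypothetical, unit variances and zero means are assumed, and the draw uses `mvrnorm` from MASS (the function listed under See Also).

```r
library(MASS)  # provides mvrnorm

# Hypothetical helper (not part of the package): covariance matrix with
# geometric decay, cov(i, j) = corr.factor^|i - j|, and unit variances.
decay.cov <- function(ng, corr.factor = 0.8) {
  idx <- seq_len(ng)
  corr.factor^abs(outer(idx, idx, "-"))
}

sigma <- decay.cov(ng = 5, corr.factor = 0.8)

set.seed(123)
# mvrnorm returns samples in rows; transpose to get genes in rows
# and samples in columns, matching the layout described under Value.
x <- t(mvrnorm(n = 20, mu = rep(0, 5), Sigma = sigma))
dim(x)  # 5 20
```

Neighbouring genes get correlation 0.8, genes two positions apart 0.64, and so on, so distant genes are nearly uncorrelated.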

Value

x

matrix of simulated data. Genes in rows and samples in columns

y

named vector of class labels

seed

seed used for the simulation

Author(s)

Wiebke Werft, Natalia Becker

See Also

mvrnorm

Examples

my.seed<-123

# 1. simulate 20 samples, with 100 genes in each. Only the first three genes
# (nsg = 3) have an impact on the class labels.
# All genes are assumed to be i.i.d. 
train<-sim.data(n = 20, ng = 100, nsg = 3, corr=FALSE, seed=my.seed )
str(train)  # str() prints the structure itself; wrapping it in print() adds a stray NULL

# 2. change the proportion between positive and negative significant genes 
#(from 0.5 to 0.8)
train<-sim.data(n = 20, ng = 100, nsg = 10, p.n.ratio = 0.8,  seed=my.seed )
rownames(train$x)[1:15]
# [1] "pos1" "pos2" "pos3" "pos4" "pos5" "pos6" "pos7" "pos8" 
# [9] "neg1" "neg2" "bal1" "bal2" "bal3" "bal4" "bal5"

# 3. assume correlation among positive significant genes, 
# negative significant genes and 'balanced' genes separately. 
train<-sim.data(n = 20, ng = 100, nsg = 10, corr=TRUE, seed=my.seed )
#cor(t(train$x[1:15,]))

# 4. add 6 blocks of 5 genes each and only one significant gene per block.
# all genes in the block are correlated with constant correlation factor
#  corr.factor=0.8 		
train<-sim.data(n = 20, ng = 100, nsg = 6, corr=TRUE, corr.factor=0.8,
			 blocks=TRUE, n.blocks=6, nsg.block=1, ng.block=5, seed=my.seed )
str(train)
# first block
#cor(t(train$x[1:5,]))
# second block
#cor(t(train$x[6:10,]))
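
The block structure used in example 4 can be pictured with a small sketch. It is an illustration under stated assumptions, not the package's implementation: the helper `block.cov` is hypothetical, variances are assumed to be 1, and genes in different blocks are assumed to be uncorrelated.

```r
# Hypothetical helper (not part of the package): one block in which every
# pair of genes shares the same correlation corr.factor.
block.cov <- function(ng.block, corr.factor = 0.8) {
  sigma <- matrix(corr.factor, nrow = ng.block, ncol = ng.block)
  diag(sigma) <- 1
  sigma
}

n.blocks <- 6
ng.block <- 5

# The Kronecker product with an identity matrix places one copy of the
# block on each diagonal position and zeros everywhere else.
sigma.all <- kronecker(diag(n.blocks), block.cov(ng.block, corr.factor = 0.8))
dim(sigma.all)  # 30 30
```

Rows 1 to 5 of `sigma.all` correspond to the first block and rows 6 to 10 to the second, mirroring the index ranges used in the commented `cor()` calls above.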