SimulatedDataGenerator: Simulation dataset generator


View source: R/simDataGenerator.R

Description

Generates a simulated dataset. See the simulation studies for details.

Usage

SimulatedDataGenerator(net=NULL,nnode=NULL,maxpernull=0.7,class.label=NULL,
missloc=NULL,missing=c(FALSE,TRUE),missrate=0.1,nonmiss.hub.maxedges=7,
maxnsteps.merge.communties=1000,dist=c("norm","gamma","lognorm"),
plot=c(TRUE,FALSE),nbin=c(20,20,20),rng=1024)

Arguments

net

The adjacency matrix, with 1/0 indicating "connected" or "not directly connected". If not given, a scale-free graph is generated according to the Barabasi-Albert model by applying the BA algorithm in the igraph package.
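
As an illustration of what a net input of this kind can look like, the following minimal sketch builds a Barabasi-Albert graph with igraph and converts it to a binary adjacency matrix. It is not taken from the package source: the choice of sample_pa() with its default settings and the network size of 100 are assumptions made for the example.

```r
library(igraph)

set.seed(1024)
nnode <- 100  # hypothetical network size, chosen for illustration

# Undirected Barabasi-Albert (preferential attachment) graph;
# with the defaults each new node adds one edge, so no multi-edges arise
g <- sample_pa(nnode, directed = FALSE)

# Dense adjacency matrix: 1 = connected, 0 = not directly connected
net <- as_adjacency_matrix(g, sparse = FALSE)

dim(net)          # nnode x nnode
isSymmetric(net)  # undirected graph, so the matrix is symmetric
```

A matrix built this way could be passed as net, with nnode left at its default.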

nnode

Integer. Total number of gene nodes in network.

maxpernull

Float. Maximum proportion of null genes in the network. Used when class.label is not given and has to be generated during the merging process. Default=0.7

class.label

Vector with one entry per node, giving the class indicator (-1, 0, or 1) of each gene node. If not given, class labels are defined with the fast.greedy community detection algorithm, and the resulting communities are then merged sequentially into three based on the number of between-community edges. Highly connected communities are merged first. The largest communities are then assigned class indicator 0 as null genes. The up/down-regulated classes are assigned randomly.
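
The first step described above, community detection followed by counting between-community edges, can be sketched with igraph as follows. This is only an illustration under assumptions: the toy graph is hypothetical, and the package's actual merging rule is not reproduced here.

```r
library(igraph)

set.seed(1024)
g <- sample_pa(100, directed = FALSE)  # hypothetical toy network

# Fast greedy modularity-based community detection
comm <- cluster_fast_greedy(g)
memb <- as.integer(membership(comm))   # community id of each node

# Community id at both ends of every edge
el <- as_edgelist(g, names = FALSE)
a <- memb[el[, 1]]
b <- memb[el[, 2]]

# Count edges between each pair of distinct communities;
# pairs with many connecting edges are natural candidates to merge first
cross <- a != b
pair.counts <- table(pmin(a, b)[cross], pmax(a, b)[cross])

table(memb)   # community sizes
pair.counts   # between-community edge counts
```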

missloc

Vector. Default NULL. If given, the locations of the test statistics that are not observed.

missing

Logical. Default FALSE. If TRUE, the missing locations are generated based on the missing rate.

missrate

A number in the interval (0,1). The missing rate, defined as the proportion of gene nodes without observed test statistics. Values over 20% are not recommended based on biological knowledge. Default=0.1

nonmiss.hub.maxedges

Integer. Based on biological knowledge, hub genes (those with a higher number of neighboring edges) are less likely to be missing gene nodes. This is the cutoff value: only genes with fewer than nonmiss.hub.maxedges neighbors can be assigned as missing genes. Default=7
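
The interplay of missrate and nonmiss.hub.maxedges can be sketched in base R as below. This is a hedged illustration of the rule as documented, not the package's own code; the random toy network and the rounding of the number of missing nodes are assumptions.

```r
set.seed(1024)
nnode <- 100

# Hypothetical toy network: 150 random undirected edges, no self-loops
net <- matrix(0, nnode, nnode)
upper <- which(upper.tri(net))
net[sample(upper, 150)] <- 1
net <- net + t(net)

missrate <- 0.1
nonmiss.hub.maxedges <- 7

# Node degree = number of neighbors
deg <- rowSums(net)

# Only non-hub nodes (fewer neighbors than the cutoff) may be missing
eligible <- which(deg < nonmiss.hub.maxedges)

# Assumed rounding rule; capped by the number of eligible nodes
n.miss <- min(round(missrate * nnode), length(eligible))
missloc <- eligible[sample.int(length(eligible), n.miss)]

sort(missloc)
```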

maxnsteps.merge.communties

Integer. The maximum number of steps used for merging the small communities until three remain. Default=1000

dist

Character. The distribution of DE genes; one of "norm", "gamma", or "lognorm". See the simulation design table for details.

plot

Logical. Default=TRUE. Whether to plot the histogram of the generated test statistics.

nbin

Vector of length 3. Default=c(20,20,20). The number of bins used for plotting the histogram for each class.

rng

Random seed. Default=1024

Details

The function used for simulating test statistics:

Value

A list:

testcov

test statistics, missing observations are coded as NA if any

testcov.fullobs

test statistics when all the observations are fully observed

class.label

z values (class indicators) for each gene

net

simulated network, binary adjacency matrix 1/0 connected or not
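
To show how the elements of the returned list relate to each other, here is a small sketch that works on a hand-made stand-in list. The values are toy numbers invented for illustration and are not package output; only the element names come from the description above.

```r
# Hypothetical stand-in for the returned list (toy values, not package output)
simdata <- list(
  testcov.fullobs = c(0.3, -2.1, 1.8, 0.1, 2.5),
  class.label     = c(0, -1, 1, 0, 1)
)

# testcov equals testcov.fullobs with unobserved nodes coded as NA
simdata$testcov <- simdata$testcov.fullobs
simdata$testcov[c(2, 5)] <- NA  # pretend nodes 2 and 5 are unobserved

# Locations of the missing test statistics
miss.idx <- which(is.na(simdata$testcov))

# Observed test statistics grouped by class indicator
obs <- !is.na(simdata$testcov)
obs.by.class <- split(simdata$testcov[obs], simdata$class.label[obs])

miss.idx
obs.by.class
```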

Examples

## Not run: 
## The simulation settings based on real gene network. (takes time)
data(net)
data(class.label)
data(missloc)
simdata=SimulatedDataGenerator(net=net,class.label=class.label,missloc=missloc,
dist="norm",plot=TRUE,nbin=c(20,20,20),rng=1024)
str(simdata)
## A toy example
simdata=SimulatedDataGenerator(nnode=100,missing=TRUE,missrate=0.1,dist="norm",
plot=TRUE,nbin=c(20,20,10),rng=1024)
str(simdata)

## End(Not run)

BANFF documentation built on May 29, 2017, 11:59 a.m.