sim.mol.data: Simulate molecular data for pathview experiment
In pathview: a tool set for pathway based data integration and visualization


The molecular data simulator generates gene.data or cpd.data with a chosen ID type, number of molecules and sample size, as either continuous or discrete data.

sim.mol.data(mol.type = c("gene", "gene.ko", "cpd")[1], id.type = NULL, species="hsa", discrete = FALSE, nmol = 1000, nexp = 1, rand.seed=100)

`mol.type`	character of length 1, specifying the molecular type, either "gene" (including transcripts, proteins), or "gene.ko" (KEGG ortholog genes, as defined in KEGG ortholog pathways), or "cpd" (including metabolites, glycans, drugs). Note that KEGG ortholog genes are considered "gene" in function `pathview`. Default mol.type="gene".
`id.type`	character of length 1, the molecular ID type. When mol.type="gene", proper ID types include "KEGG" and "ENTREZ" (Entrez Gene). Other ID types are valid only when species="hsa"; check `data(gene.idtype.list); gene.idtype.list` for them. When mol.type="cpd", check `data(cpd.simtypes); cpd.simtypes` for valid ID types. Default id.type=NULL, in which case "Entrez" is assumed for mol.type = "gene" and "KEGG COMPOUND accession" for mol.type = "cpd".
`species`	character, either the KEGG code, scientific name or common name of the target species. This is only effective when mol.type = "gene". Setting species="ko" is equivalent to mol.type="gene.ko". Default species="hsa", equivalent to either "Homo sapiens" (scientific name) or "human" (common name); for this species, gene data id.type has multiple other choices. When other species are specified, gene id.type is limited to "KEGG" and "ENTREZ".
`discrete`	logical, whether to generate discrete or continuous data. Default discrete=FALSE (continuous); when TRUE, the result is a character vector of molecular IDs.
`nmol`	integer, the target number of different molecules. Note that the specified id.type may not have as many different IDs as nmol. In this case, all IDs of id.type are used.
`nexp`	integer, the sample size, i.e. the number of columns in the resulting simulated data.
`rand.seed`	numeric of length 1, the seed number to start the random sampling process. This argument makes the simulation reproducible as long as its value stays the same. Default rand.seed=100.
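As a minimal sketch of how `rand.seed` controls reproducibility, the snippet below draws the same simulated compound data twice with one seed and once with another. It assumes the pathview package is installed; the variable names (`sim.a`, `sim.b`, `sim.c`) and the seed value 7 are illustrative only.

```r
library(pathview)

# Two calls with the same seed should give identical IDs and values
sim.a <- sim.mol.data(mol.type = "cpd", nmol = 50, rand.seed = 100)
sim.b <- sim.mol.data(mol.type = "cpd", nmol = 50, rand.seed = 100)
identical(sim.a, sim.b)

# A different seed gives a different random draw
sim.c <- sim.mol.data(mol.type = "cpd", nmol = 50, rand.seed = 7)
identical(sim.a, sim.c)

# List the ID types that can be passed as id.type
data(gene.idtype.list); gene.idtype.list   # gene ID types (species = "hsa" only)
data(cpd.simtypes); cpd.simtypes           # compound ID types
```

Pinning `rand.seed` this way is useful when a simulated data set has to be regenerated later for comparison.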

This function is written mainly for simulation or experimentation with the pathview package. With the simulated molecular data, you may check whether and how pathview works for molecular data of different types, IDs, formats or sample sizes. You may also generate both gene.data and cpd.data and check pathway-based data integration with pathview.
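The integration check described above might look like the following sketch. It assumes the pathview package is installed; the names `gene.sim`, `cpd.sim` and `run.pathview`, the output suffix, and the pathway ID "00640" are chosen purely for illustration. Because `pathview` downloads pathway files from KEGG and writes image files to the working directory, the call is guarded by a flag that is off by default.

```r
library(pathview)

# Simulate gene and compound data with the default ID types
gene.sim <- sim.mol.data(mol.type = "gene", nmol = 3000)
cpd.sim <- sim.mol.data(mol.type = "cpd", nmol = 3000)

# Set to TRUE to render the pathway (needs network access to KEGG)
run.pathview <- FALSE
if (run.pathview) {
  pv.out <- pathview(gene.data = gene.sim, cpd.data = cpd.sim,
                     pathway.id = "00640", species = "hsa",
                     out.suffix = "sim", kegg.native = TRUE)
}
```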

Either a vector (single sample) or matrix-like data (multiple samples), depending on the value of nexp. A vector is numeric with molecular IDs as names, or a character vector of molecular IDs when discrete=TRUE. Matrix-like data has molecules as rows and samples as columns, with molecular IDs as row names.

The returned data can be used directly as the gene.data or cpd.data input of the main pathview function.
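To see the three result shapes side by side, a short sketch such as the one below can help. It assumes the pathview package is installed; the variable names are illustrative, and KEGG ortholog IDs are used only because they need no species lookup.

```r
library(pathview)

# Single sample: numeric vector named by molecular ID
ko.vec <- sim.mol.data(mol.type = "gene.ko", nmol = 20)
# Multiple samples: matrix with molecules as rows, samples as columns
ko.mat <- sim.mol.data(mol.type = "gene.ko", nmol = 20, nexp = 3)
# Discrete data: character vector of molecular IDs
ko.ids.d <- sim.mol.data(mol.type = "gene.ko", nmol = 20, discrete = TRUE)

str(ko.vec)
dim(ko.mat)
head(rownames(ko.mat))
str(ko.ids.d)
```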

Weijun Luo <luo_weijun@yahoo.com>

Luo, W. and Brouwer, C., Pathview: an R/Bioconductor package for pathway based data integration and visualization. Bioinformatics, 2013, 29(14): 1830-1831, doi: 10.1093/bioinformatics/btt285

node.map the node data mapper function, mol.sum the auxiliary molecular data mapper, id2eg, cpd2kegg etc. the auxiliary molecular ID mappers, pathview the main function.

library(pathview)

#continuous compound data
cpd.data.c=sim.mol.data(mol.type="cpd", nmol=3000)
#discrete compound data
cpd.data.d=sim.mol.data(mol.type="cpd", nmol=3000, discrete=TRUE)
head(cpd.data.c)
head(cpd.data.d)
#continuous compound data named with "CAS Registry Number"
cpd.cas <- sim.mol.data(mol.type = "cpd", id.type = "CAS Registry Number", nmol = 10000)

#gene data with two samples
gene.data.2=sim.mol.data(mol.type="gene", nmol=1000, nexp=2)
head(gene.data.2)

#KEGG ortholog gene data
ko.data=sim.mol.data(mol.type="gene.ko", nmol=5000)