sim_data: Simulation of data sets by controlling the proportion of MCAR...
In imp4p: Imputation for Proteomics


This function simulates data sets similar to MS-based bottom-up proteomic data sets.

sim.data(nb.pept=15000, nb.miss=5000, pi.mcar=0.2, para=3, nb.cond=1, nb.repbio=3, nb.sample=3, m.c=25, sd.c=2, sd.rb=0.5, sd.r=0.2)

`nb.pept`	The number of rows (identified peptides) of the generated data set.
`nb.miss`	The number of missing values to generate in each column.
`pi.mcar`	The proportion of MCAR values in each column.
`para`	Parameter used for simulating MNAR values in columns (see Details).
`nb.cond`	The number of studied biological conditions.
`nb.repbio`	The number of biological samples in each condition.
`nb.sample`	The number of samples coming from each biological sample.
`m.c`	The mean of the average values in each condition.
`sd.c`	The standard deviation of the average values in each condition.
`sd.rb`	The standard deviation of the average values in each biological sample.
`sd.r`	The standard deviation of the values in each row among the samples coming from the same biological sample.

First, the average intensity of a peptide i in a condition is drawn from a Gaussian distribution, m_{cond}\sim N(m.c,sd.c). Second, the effect of a biological sample is drawn as m_{bio}\sim N(0,sd.rb). Finally, the value of peptide i in sample j, which belongs to a specific biological sample and a specific condition, is drawn as x_{ij}\sim N(m_{cond}+m_{bio},sd.r).
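The three-level generation described above can be sketched in base R as follows. This is only an illustration and not the package's actual code: the small sizes, the loop structure, the name `dat_sketch`, and the choice to draw one m_{bio} per peptide and per biological sample are assumptions. The second parameter of each Gaussian is treated as a standard deviation, as the argument descriptions suggest.

```r
set.seed(1)  # only so that this illustration is reproducible

# Hypothetical small sizes; the names mirror the arguments of sim.data
nb.pept <- 5; nb.cond <- 2; nb.repbio <- 3; nb.sample <- 2
m.c <- 25; sd.c <- 2; sd.rb <- 0.5; sd.r <- 0.2

cols <- list()
for (k in seq_len(nb.cond)) {
  # m_cond: one condition-level average per peptide
  m_cond <- rnorm(nb.pept, mean = m.c, sd = sd.c)
  for (b in seq_len(nb.repbio)) {
    # m_bio: effect of one biological sample (assumed drawn per peptide)
    m_bio <- rnorm(nb.pept, mean = 0, sd = sd.rb)
    for (s in seq_len(nb.sample)) {
      # x_ij: values of all peptides in one sample
      cols[[length(cols) + 1]] <- rnorm(nb.pept, mean = m_cond + m_bio, sd = sd.r)
    }
  }
}

# One row per peptide, one column per sample
dat_sketch <- do.call(cbind, cols)
dim(dat_sketch)
```

Under these assumptions the matrix has nb.pept rows and nb.cond * nb.repbio * nb.sample columns (here 5 and 12).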

Next, the MCAR values are generated in each column by random draws without replacement among the row indexes. The MNAR values are then generated among the remaining row indexes by random draws without replacement, according to the following probabilities:

P(x_{ij}\text{ is MNAR})=1-\frac{x_{ij}-\min_i(x_{ij})}{(\max_i(x_{ij})-\min_i(x_{ij}))\times para}

where para adjusts the distribution of MNAR values. If para=0, the MNAR values are uniformly distributed across intensity levels. The higher para is, the more the MNAR values arise at low intensity levels rather than at high intensity levels.
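The two-step generation of missing values can be sketched for a single column as follows. This is a hedged illustration rather than the package's implementation: the column `x`, the names `n_mcar`, `idx_mcar`, `idx_rest`, `p_mnar`, `idx_mnar` and `x_obs` are hypothetical, the number of MCAR values is assumed to be pi.mcar * nb.miss, and the minimum and maximum are assumed to be taken over the rows remaining after the MCAR draw. The sketch also assumes para >= 1 so that the weights given by the formula stay non-negative.

```r
set.seed(2)  # only so that this illustration is reproducible

x <- rnorm(1000, mean = 25, sd = 2)  # one hypothetical complete column
nb.miss <- 300; pi.mcar <- 0.2; para <- 3

# MCAR: uniform draws without replacement among the row indexes
n_mcar <- round(pi.mcar * nb.miss)  # assumed split of the missing values
idx_mcar <- sample(seq_along(x), n_mcar)

# MNAR: weighted draws without replacement among the remaining rows
idx_rest <- setdiff(seq_along(x), idx_mcar)
xr <- x[idx_rest]
p_mnar <- 1 - (xr - min(xr)) / ((max(xr) - min(xr)) * para)
idx_mnar <- sample(idx_rest, nb.miss - n_mcar, prob = p_mnar)

# Observed column: both kinds of missing values become NA
x_obs <- x
x_obs[c(idx_mcar, idx_mnar)] <- NA
```

With these settings, 60 of the 300 missing values are MCAR and 240 are MNAR, and the MNAR weights range from 2/3 (highest intensity) to 1 (lowest intensity).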

`dat.obs`	The simulated data set.
`dat.comp`	The simulated data set without missing values.
`list.MCAR`	The row indexes of the MCAR values in each column of the data set.
`nMCAR`	The number of MCAR values in each sample (after deleting rows with only generated missing values).
`nNA`	The number of missing values in each sample (after deleting rows with only generated missing values).
`conditions`	A vector of factors indicating the biological condition to which each sample belongs.
`repbio`	A vector of factors indicating the biological sample to which each sample belongs.

Quentin Giai Gianetto <quentin2g@yahoo.fr>

## Load the package and simulate a data set
library(imp4p)
res.sim <- sim.data(nb.pept=2000, nb.miss=600)
## Simulated data matrix (with missing values)
data <- res.sim$dat.obs
## Condition to which each sample belongs
cond <- res.sim$conditions
## Biological sample to which each sample belongs
repbio <- res.sim$repbio
## Proportion of generated MCAR values among the missing values of each sample
pi_mcar <- res.sim$nMCAR / res.sim$nNA