# Simulation of data sets by controlling the proportion of MCAR values and the distribution of MNAR values.

### Description

This function simulates data sets similar to MS-based bottom-up proteomic data sets.

### Usage

```
sim.data(nb.pept=2000,nb.miss=600,pi.mcar=0.2,para=10,nb.cond=2,nb.repbio=3,
         nb.sample=5,m.c=25,sd.c=2,sd.rb=0.5,sd.r=0.2)
```

### Arguments

| Argument | Description |
|---|---|
| `nb.pept` | The number of rows (identified peptides) of the generated data set. |
| `nb.miss` | The number of missing values to generate in each column. |
| `pi.mcar` | The proportion of MCAR values in each column. |
| `para` | Parameter of a Beta distribution used for simulating MNAR values in columns (see Details). |
| `nb.cond` | The number of studied biological conditions. |
| `nb.repbio` | The number of biological samples in each condition. |
| `nb.sample` | The number of samples coming from each biological sample. |
| `m.c` | The mean of the average values in each condition. |
| `sd.c` | The standard deviation of the average values in each condition. |
| `sd.rb` | The standard deviation of the average values in each biological sample. |
| `sd.r` | The standard deviation of the values in each row among the samples coming from the same biological sample. |

### Details

First, the average of intensities of a peptide `i` in a condition is generated by a Gaussian distribution *m_{cond}\sim N(m.c,sd.c)*. Second, the effect of a biological sample is generated by *m_{bio}\sim N(0,sd.rb)*. The value of a peptide `i` in the sample `j` belonging to a specific biological sample and a specific condition is finally generated by *x_{ij}\sim N(m_{cond}+m_{bio},sd.r)*.
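The three-level generation described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the function's actual source: the small sizes, the object names (`blocks`, `dat.comp.sketch`, `conditions.sketch`, `repbio.sketch`) and the column order (conditions, then biological samples, then samples) are all chosen here for readability.

```r
set.seed(1)  # only to make this sketch reproducible

# Small, assumed sizes so that the result is easy to inspect
nb.pept <- 5; nb.cond <- 2; nb.repbio <- 3; nb.sample <- 4
m.c <- 25; sd.c <- 2; sd.rb <- 0.5; sd.r <- 0.2

blocks <- list()
for (cond in seq_len(nb.cond)) {
  # One condition-level average per peptide: m_cond ~ N(m.c, sd.c)
  m.cond <- rnorm(nb.pept, mean = m.c, sd = sd.c)
  for (rb in seq_len(nb.repbio)) {
    # One biological-sample effect per peptide: m_bio ~ N(0, sd.rb)
    m.bio <- rnorm(nb.pept, mean = 0, sd = sd.rb)
    # nb.sample columns: x_ij ~ N(m_cond + m_bio, sd.r).
    # The mean vector is recycled down each column, so row i keeps its own mean.
    x <- matrix(rnorm(nb.pept * nb.sample, mean = m.cond + m.bio, sd = sd.r),
                nrow = nb.pept, ncol = nb.sample)
    blocks[[length(blocks) + 1]] <- x
  }
}

# Complete data matrix (no missing values yet)
dat.comp.sketch <- do.call(cbind, blocks)

# Membership of each column (assumed layout, see the lead-in)
conditions.sketch <- factor(rep(seq_len(nb.cond), each = nb.repbio * nb.sample))
repbio.sketch <- factor(rep(seq_len(nb.cond * nb.repbio), each = nb.sample))

dim(dat.comp.sketch)
```

With these sizes the matrix has `nb.pept` rows and `nb.cond * nb.repbio * nb.sample` columns; columns from the same biological sample differ only by the `sd.r` noise.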

Next, the MCAR values are generated in each column by random draws without replacement among the row indices. The MNAR values are then generated among the remaining row indices by random draws without replacement, according to the following probabilities:

*P(x_{ij} is MNAR)=f_{B(1,para)}((x_{ij}-min_i(x_{ij}))/(max_i(x_{ij})-min_i(x_{ij})))/(para)*

where *f_{B(1,para)}* is the density of a Beta distribution with parameters *1* and *para*. Since *f_{B(1,para)}(u)=para(1-u)^{para-1}*, this probability equals *(1-u)^{para-1}*, where *u* is the intensity rescaled to *[0,1]* within the column. If *para=1*, the MNAR values are uniformly distributed across intensity levels. The higher *para* is, the more the MNAR values concentrate at low intensity levels rather than at high ones.
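The two-step removal of values in one column can be sketched as follows. This is an illustration under assumptions, not the function's actual source: it assumes that `pi.mcar` is the share of the `nb.miss` missing values that are MCAR, that the formula's probabilities are used as sampling weights, and the names `n.mcar`, `n.mnar`, `idx.mcar`, `idx.mnar`, `u`, `w` and `x.obs` are hypothetical.

```r
set.seed(2)  # only to make this sketch reproducible

x <- rnorm(200, mean = 25, sd = 2)  # one complete column (assumed values)
nb.miss <- 60; pi.mcar <- 0.2; para <- 10

# Assumption: pi.mcar splits the nb.miss missing values into MCAR and MNAR
n.mcar <- floor(pi.mcar * nb.miss)
n.mnar <- nb.miss - n.mcar

# MCAR: uniform draw without replacement among all row indices
idx.mcar <- sample(seq_along(x), n.mcar)

# MNAR: weighted draw without replacement among the remaining row indices
remaining <- setdiff(seq_along(x), idx.mcar)
u <- (x - min(x)) / (max(x) - min(x))            # intensities rescaled to [0, 1]
w <- dbeta(u, shape1 = 1, shape2 = para) / para  # equals (1 - u)^(para - 1)
idx.mnar <- sample(remaining, n.mnar, prob = w[remaining])

# Observed column: both kinds of missing values become NA
x.obs <- x
x.obs[c(idx.mcar, idx.mnar)] <- NA

# Compare the mean intensity of the removed MNAR values with the column mean
c(mnar = mean(x[idx.mnar]), all = mean(x))
```

Setting `para <- 1` makes every weight equal to 1, so the MNAR draw becomes uniform and indistinguishable from the MCAR draw.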

### Value

| Component | Description |
|---|---|
| `dat.obs` | The simulated data set. |
| `dat.comp` | The simulated data set without missing values. |
| `list.MCAR` | The index of MCAR values among the rows in each column of the data set. |
| `conditions` | A vector of factors indicating the biological condition to which each sample belongs. |
| `repbio` | A vector of factors indicating the biological sample to which each sample belongs. |

### Author(s)

Quentin Giai Gianetto <quentin2g@yahoo.fr>

### Examples

```
## Simulate a data set with the argument values shown in Usage
res.sim <- sim.data(nb.pept=2000,nb.miss=600,pi.mcar=0.2,para=10,nb.cond=2,nb.repbio=3,
                    nb.sample=5,m.c=25,sd.c=2,sd.rb=0.5,sd.r=0.2)
## Simulated data matrix
data <- res.sim$dat.obs
## Vector of conditions of membership for each sample
cond <- res.sim$conditions
## Vector of biological sample of membership for each sample
repbio <- res.sim$repbio
```
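A quick way to sanity-check a simulated data set is to count the missing values per column and tabulate the two membership vectors. The sketch below assumes that `sim.data` is provided by the `imp4p` package (the package name is not stated on this page) and skips the check if that package is not installed; `na.per.col` is a hypothetical name.

```r
if (requireNamespace("imp4p", quietly = TRUE)) {
  res.sim <- imp4p::sim.data(nb.pept = 2000, nb.miss = 600, pi.mcar = 0.2, para = 10,
                             nb.cond = 2, nb.repbio = 3, nb.sample = 5,
                             m.c = 25, sd.c = 2, sd.rb = 0.5, sd.r = 0.2)
  # Number of missing values in each column, to compare with nb.miss
  na.per.col <- colSums(is.na(res.sim$dat.obs))
  print(na.per.col)
  # Number of samples per condition and per biological sample
  print(table(res.sim$conditions))
  print(table(res.sim$repbio))
} else {
  # Assumed package name not available: nothing to check
  na.per.col <- NULL
  message("Package 'imp4p' is not installed; skipping the check.")
}
```

Comparing `na.per.col` with `nb.miss`, and the table sizes with `nb.repbio * nb.sample` and `nb.sample`, shows whether the output matches the argument descriptions.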