Simulation of data sets by controlling the proportion of MCAR values and the distribution of MNAR values.

Description

This function simulates data sets similar to MS-based bottom-up proteomic data sets.

Usage

sim.data(nb.pept=2000,nb.miss=600,pi.mcar=0.2,para=10,nb.cond=2,nb.repbio=3,
nb.sample=5,m.c=25,sd.c=2,sd.rb=0.5,sd.r=0.2)

Arguments

nb.pept

The number of rows (identified peptides) of the generated data set.

nb.miss

The number of missing values to generate in each column.

pi.mcar

The proportion of MCAR values in each column.

para

Parameter of a Beta distribution used for simulating MNAR values in columns (see Details).

nb.cond

The number of studied biological conditions.

nb.repbio

The number of biological samples in each condition.

nb.sample

The number of samples coming from each biological sample.

m.c

The mean of the average values in each condition.

sd.c

The standard deviation of the average values in each condition.

sd.rb

The standard deviation of the average values in each biological sample.

sd.r

The standard deviation of values in each row among the samples coming from the same biological sample.

Details

First, the average intensity of a peptide i in a condition is drawn from a Gaussian distribution m_{cond}\sim N(m.c,sd.c). Second, the effect of a biological sample is drawn from m_{bio}\sim N(0,sd.rb). Finally, the value of a peptide i in the sample j, which belongs to a specific biological sample and a specific condition, is drawn from x_{ij}\sim N(m_{cond}+m_{bio},sd.r).
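The three-level generation described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual code: it assumes that the second parameter of each N(.,.) is a standard deviation (as the argument descriptions suggest) and that m_{cond} and m_{bio} are drawn independently for each peptide. The small sizes are chosen only to keep the output readable.

```r
set.seed(1)

# Hypothetical small settings, named after the documented arguments
nb.pept <- 5; nb.cond <- 2; nb.repbio <- 3; nb.sample <- 2
m.c <- 25; sd.c <- 2; sd.rb <- 0.5; sd.r <- 0.2

blocks <- list()
for (cond in seq_len(nb.cond)) {
  # Condition-level mean of each peptide: m_cond ~ N(m.c, sd.c)
  m.cond <- rnorm(nb.pept, mean = m.c, sd = sd.c)
  for (rb in seq_len(nb.repbio)) {
    # Effect of the biological sample (assumed per peptide): m_bio ~ N(0, sd.rb)
    m.bio <- rnorm(nb.pept, mean = 0, sd = sd.rb)
    # Values in the nb.sample samples: x_ij ~ N(m_cond + m_bio, sd.r).
    # The mean vector is recycled column by column, so row i keeps mean i.
    x <- matrix(rnorm(nb.pept * nb.sample, mean = m.cond + m.bio, sd = sd.r),
                nrow = nb.pept, ncol = nb.sample)
    blocks[[length(blocks) + 1]] <- x
  }
}

# Complete data set: one row per peptide, one column per sample
dat.comp <- do.call(cbind, blocks)
dim(dat.comp)
```

With these settings the matrix has nb.pept rows and nb.cond * nb.repbio * nb.sample columns.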

Next, the MCAR values are generated in each column by random draws without replacement among the row indexes. The MNAR values are then generated among the remaining row indexes by random draws without replacement, according to the following probabilities:

P(x_{ij} is MNAR)=f_{B(1,para)}((x_{ij}-min_i(x_{ij}))/(max_i(x_{ij})-min_i(x_{ij})))/(para)

where f_{B(1,para)} is the density of a Beta distribution with parameters 1 and para. If para=1, the MNAR values are uniformly distributed across intensity levels. The higher para is, the more the MNAR values are concentrated at low intensity levels rather than at high ones.
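The missing-value mechanism for a single column can be sketched as follows. This is a hedged illustration rather than the package's implementation: the column x is simulated here for convenience, and splitting nb.miss into MCAR and MNAR counts with round(pi.mcar * nb.miss) is an assumption. Only base R functions (sample, dbeta, setdiff) are used.

```r
set.seed(2)

# One hypothetical column of complete intensities
x <- rnorm(2000, mean = 25, sd = 2)
nb.miss <- 600; pi.mcar <- 0.2; para <- 10

# Assumed split of the missing values between the two mechanisms
nb.mcar <- round(pi.mcar * nb.miss)
nb.mnar <- nb.miss - nb.mcar

# MCAR: uniform draws without replacement among the row indexes
idx.mcar <- sample(seq_along(x), nb.mcar)
remaining <- setdiff(seq_along(x), idx.mcar)

# MNAR: rescale the column to [0, 1], then weight each row by the
# Beta(1, para) density divided by para, as in the formula above
u <- (x - min(x)) / (max(x) - min(x))
p.mnar <- dbeta(u, shape1 = 1, shape2 = para) / para
idx.mnar <- sample(remaining, nb.mnar, prob = p.mnar[remaining])

# Observed column with both kinds of missing values
x.obs <- x
x.obs[c(idx.mcar, idx.mnar)] <- NA

sum(is.na(x.obs))   # total number of missing values in the column
mean(x[idx.mnar])   # MNAR values sit at lower intensities than mean(x)
```

Since the Beta(1, para) density equals para * (1 - u)^(para - 1), dividing by para gives a weight of (1 - u)^(para - 1), which lies between 0 and 1 and is constant when para=1.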

Value

dat.obs

The simulated data set.

dat.comp

The simulated data set without missing values.

list.MCAR

The index of MCAR values among the rows in each column of the data set.

conditions

A vector of factors indicating the biological condition to which each sample belongs.

repbio

A vector of factors indicating the biological sample to which each sample belongs.

Author(s)

Quentin Giai Gianetto <quentin2g@yahoo.fr>

Examples

## Simulate a data set (the values shown are the defaults given in Usage)
res.sim=sim.data(nb.pept=2000,nb.miss=600,pi.mcar=0.2,para=10,nb.cond=2,nb.repbio=3,
nb.sample=5,m.c=25,sd.c=2,sd.rb=0.5,sd.r=0.2)
## Simulated data matrix (with missing values)
data=res.sim$dat.obs
## Biological condition of each sample
cond=res.sim$conditions
## Biological sample of each sample
repbio=res.sim$repbio
