estim_mix: Estimation of a mixture model of MCAR and MNAR values in each...
In imp4p: Imputation for Proteomics


This function allows estimating a mixture model of MCAR and MNAR values in each column of data sets similar to the ones which can be studied in MS-based quantitative proteomics. Such data matrices contain intensity values of identified peptides.

estim.mix(tab, tab.imp, conditions, x.step.mod=150, x.step.pi=150, nb.rei=200)

`tab`	A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide.
`tab.imp`	A matrix where the missing values of `tab` have been imputed under the assumption that they are all MCAR. For instance, such a matrix can be obtained by using the function `impute.slsa` of this package.
`conditions`	A vector of factors indicating the biological condition to which each column (experimental sample) belongs.
`x.step.mod`	The number of points in the intervals used for estimating the cumulative distribution functions of the mixture model in each column.
`x.step.pi`	The number of points in the intervals used for estimating the proportion of MCAR values in each column.
`nb.rei`	The number of initializations of the minimization algorithm used to estimate the proportion of MCAR values (see Details).

This function aims to estimate the following mixture model in each column:

F_{tot}(x)=π_{na}\times F_{na}(x)+(1-π_{na})\times F_{obs}(x)

F_{na}(x)=π_{mcar}\times F_{tot}(x)+(1-π_{mcar})\times F_{mnar}(x)

where π_{na} is the proportion of missing values, π_{mcar} is the proportion of MCAR values, F_{tot} is the cumulative distribution function (cdf) of the complete values, F_{na} is the cdf of the missing values, F_{obs} is the cdf of the observed values, and F_{mnar} is the cdf of the MNAR values.
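The two equations can be checked numerically on simulated data. The following is a minimal sketch in base R, not code from imp4p: the normal intensities, the logistic censoring mechanism, the 5% MCAR rate and all variable names are assumptions chosen for illustration. It shows that both decompositions hold exactly for empirical cdfs, and that the cdf of the MCAR values stays close to F_{tot}.

```r
# Toy check of the two mixture equations using empirical cdfs.
# Simulation settings are illustrative assumptions, not values used by imp4p.
set.seed(1)
n <- 5000
complete <- rnorm(n, mean = 25, sd = 2)   # hypothetical complete log-intensities

# MCAR: missingness independent of the value
is.mcar <- runif(n) < 0.05
# MNAR: low values are more likely to be missing (assumed logistic censoring)
is.mnar <- !is.mcar & (runif(n) < plogis(-2 * (complete - 22)))
is.miss <- is.mcar | is.mnar

pi.na   <- mean(is.miss)                  # proportion of missing values
pi.mcar <- sum(is.mcar) / sum(is.miss)    # proportion of MCAR values among the missing ones

x <- seq(min(complete), max(complete), length.out = 150)
F.tot  <- ecdf(complete)(x)
F.na   <- ecdf(complete[is.miss])(x)
F.obs  <- ecdf(complete[!is.miss])(x)
F.mcar <- ecdf(complete[is.mcar])(x)      # plays the role of F.tot in the second equation
F.mnar <- ecdf(complete[is.mnar])(x)

# Both decompositions hold for empirical cdfs, up to rounding error
err.tot <- max(abs(F.tot - (pi.na * F.na + (1 - pi.na) * F.obs)))
err.na  <- max(abs(F.na - (pi.mcar * F.mcar + (1 - pi.mcar) * F.mnar)))

# Largest gap between the MCAR cdf and the cdf of the complete values
gap.mcar <- max(abs(F.mcar - F.tot))
```

In real data the complete values are of course unknown, which is why F_{na} and F_{tot} have to be estimated as described in the following steps.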

To estimate this model, the first step consists of computing a rough estimate of F_{na} by assuming that all missing values are MCAR (using the argument tab.imp). This rough estimate is denoted \hat{F}_{na}.

In a second step, the proportion of MCAR values is estimated. To do so, the ratio

\hat{π}(x)=(1-\hat{F}_{na}(x))/(1-\hat{F}_{tot}(x))

is computed for different x, where

\hat{F}_{tot}(x)=π_{na}\times \hat{F}_{na}(x)+(1-π_{na})\times \hat{F}_{obs}(x)

with \hat{F}_{obs} the empirical cdf of the observed values.
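The first two steps can be sketched for a single column in base R. This is only an illustration under stated assumptions: the rough estimate of F_{na} is taken here to be the empirical cdf of the imputed values at the missing positions, the imputation is a naive resampling of observed values standing in for `tab.imp`, and the grid stops at the 95% quantile to keep the denominator away from zero. None of these choices is claimed to match the internals of `estim.mix`.

```r
# Sketch of the ratio pi.hat(x) for one column (all settings are assumptions).
set.seed(2)
n <- 3000
complete <- rnorm(n, mean = 25, sd = 2)
is.miss <- (runif(n) < 0.05) | (runif(n) < plogis(-2 * (complete - 22)))
tab.col <- ifelse(is.miss, NA, complete)        # one column of `tab`

# Stand-in for the matching column of `tab.imp`: missing values are imputed
# as if MCAR, here by resampling the observed values
tab.imp.col <- tab.col
tab.imp.col[is.miss] <- sample(tab.col[!is.miss], sum(is.miss), replace = TRUE)

pi.na     <- mean(is.na(tab.col))               # proportion of missing values
F.na.hat  <- ecdf(tab.imp.col[is.miss])         # rough estimate of F_na
F.obs.hat <- ecdf(tab.col[!is.miss])            # empirical cdf of observed values

# Grid of x values, truncated so that 1 - F.tot.hat stays positive
x <- seq(min(tab.imp.col), unname(quantile(tab.imp.col, 0.95)), length.out = 150)

F.tot.hat <- pi.na * F.na.hat(x) + (1 - pi.na) * F.obs.hat(x)
pi.hat    <- (1 - F.na.hat(x)) / (1 - F.tot.hat)
```

With this naive imputation the imputed and observed values share the same distribution, so `pi.hat` stays close to 1. A row-aware imputation such as `impute.slsa` would generally give a more informative ratio.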

Next, the following minimization is performed:

\min_{1>k>0,a>0,d>0}f(k,a,d)

where

f(k,a,d)=∑_x \frac{1}{s(x)^2}\times [\hat{π}(x)-k-(1-k)\frac{\exp(-a\times [x-lower_x]^d)}{1-\hat{F}_{tot}(x)}]^2

where s(x)^2 is an estimate of the asymptotic variance of \hat{π}(x) and lower_x is an estimate of the minimum of the complete values. The minimization is performed with the function optim using the method "L-BFGS-B". Because the result depends on the initialization, the algorithm can be restarted several times from different initial values with the argument nb.rei; the parameters leading to the lowest minimum are then kept.
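The restart strategy can be sketched as follows. This is a hedged illustration, not the package's implementation: the helper names `trend`, `crit` and `fit.restarts`, the box constraints, the ranges of the random starting points, the constant weights and the synthetic data are all assumptions. Only `optim` with `method = "L-BFGS-B"` comes from the description above.

```r
# Trend fitted to pi.hat: k + (1 - k) * exp(-a * (x - lower.x)^d) / (1 - F.tot(x))
# par = c(k, a, d); hypothetical helper, not an imp4p function
trend <- function(par, x, lower.x, F.tot) {
  par[1] + (1 - par[1]) * exp(-par[2] * (x - lower.x)^par[3]) / (1 - F.tot)
}

# Weighted least-squares criterion f(k, a, d)
crit <- function(par, x, lower.x, F.tot, pi.hat, s2) {
  sum((pi.hat - trend(par, x, lower.x, F.tot))^2 / s2)
}

# Run L-BFGS-B from nb.rei starting points and keep the lowest minimum
fit.restarts <- function(x, lower.x, F.tot, pi.hat, s2, nb.rei = 20) {
  fn <- function(par) crit(par, x, lower.x, F.tot, pi.hat, s2)
  best <- NULL
  for (i in seq_len(nb.rei)) {
    # First start is a fixed default, the others are drawn at random (assumed ranges)
    init <- if (i == 1) c(0.5, 1, 1) else
      c(runif(1, 0.05, 0.95), runif(1, 0.1, 5), runif(1, 0.5, 3))
    res <- tryCatch(
      optim(init, fn, method = "L-BFGS-B",
            lower = c(1e-6, 1e-6, 1e-6), upper = c(1 - 1e-6, 50, 10)),
      error = function(e) NULL)
    if (!is.null(res) && (is.null(best) || res$value < best$value)) best <- res
  }
  best
}

# Synthetic example: noise-free ratios generated with k = 0.3, a = 0.5, d = 1.5
set.seed(3)
x      <- seq(20, 28, length.out = 150)
F.tot  <- pnorm(x, mean = 25, sd = 2)
s2     <- rep(1, length(x))                 # constant weights, for simplicity
pi.hat <- trend(c(0.3, 0.5, 1.5), x, 20, F.tot)

fit   <- fit.restarts(x, 20, F.tot, pi.hat, s2)
k.hat <- fit$par[1]                         # estimate of k
```

Keeping the run with the smallest `value` is what makes the restarts useful: each call to `optim` can only find a local minimum, and which one it reaches depends on the starting point.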

Once k, a and d are estimated, π_{mcar} is estimated by k.
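The link between k and π_{mcar} can be seen by rewriting the second mixture equation in terms of survival functions. The short derivation below follows directly from the formulas above. The interpretation of the exponential term is inferred from them and is not stated explicitly in this documentation.

```latex
1 - F_{na}(x) = \pi_{mcar}\,[1 - F_{tot}(x)] + (1 - \pi_{mcar})\,[1 - F_{mnar}(x)]

\frac{1 - F_{na}(x)}{1 - F_{tot}(x)}
  = \pi_{mcar} + (1 - \pi_{mcar})\,\frac{1 - F_{mnar}(x)}{1 - F_{tot}(x)}
```

Comparing the right-hand side with the fitted trend k + (1-k) exp(-a [x - lower_x]^d) / (1 - F_{tot}(x)), k takes the place of π_{mcar} and exp(-a [x - lower_x]^d) takes the place of 1 - F_{mnar}(x), which has the form of a Weibull-type survival function.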

A list composed of:

`abs.pi`	A numeric matrix containing the intervals used for estimating the ratio `(1-F_na(x))/(1-F_tot(x))` in each column.
`pi.init`	A numeric matrix containing the estimated ratios `(1-F_na(x))/(1-F_tot(x))` where `x` belongs to `abs.pi[,j]` for each sample `j`.
`var.pi.init`	A numeric matrix containing the estimated asymptotic variances of `pi.init`.
`trend.pi.init`	A numeric matrix containing the estimated trend of the model used in the minimization algorithm.
`abs.mod`	A numeric vector containing the interval used for estimating the mixture models in each column.
`pi.na`	A numeric vector containing the proportions of missing values in each column.
`F.na`	A numeric matrix containing the estimated cumulative distribution functions of missing values in each column on the interval `abs.mod`.
`F.tot`	A numeric matrix containing the estimated cumulative distribution functions of complete values in each column on the interval `abs.mod`.
`F.obs`	A numeric matrix containing the estimated cumulative distribution functions of observed values in each column on the interval `abs.mod`.
`pi.mcar`	A numeric vector containing the estimations of the proportion of MCAR values in each column.
`MinRes`	A numeric matrix containing the three parameters of the model used in the minimization algorithm (first three rows), and the value of the minimized function.

Quentin Giai Gianetto <quentin2g@yahoo.fr>

impute.slsa

# Load the package providing sim.data, impute.slsa and estim.mix
library(imp4p)

# Simulating data
res.sim <- sim.data(nb.pept=2000, nb.miss=600)

# Imputation of missing values with an MCAR-devoted algorithm: here the slsa algorithm
dat.slsa <- impute.slsa(tab=res.sim$dat.obs, conditions=res.sim$condition, repbio=res.sim$repbio)

# Estimation of the mixture model
res <- estim.mix(tab=res.sim$dat.obs, tab.imp=dat.slsa, conditions=res.sim$condition)