# estim_mix: Estimation of a mixture model of MCAR and MNAR values in each... In imp4p: Imputation for Proteomics

## Description

This function allows estimating a mixture model of MCAR and MNAR values in each column of data sets similar to the ones which can be studied in MS-based quantitative proteomics. Such data matrices contain intensity values of identified peptides.

## Usage

 1 2 estim.mix(tab, tab.imp, conditions, x.step.mod=150, x.step.pi=150, nb.rei=200) 

## Arguments

 tab A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide. tab.imp A matrix where the missing values of tab have been imputed under the assumption that they are all MCAR. For instance, such a matrix can be obtained by using the function impute.slsa of this package. conditions A vector of factors indicating the biological condition to which each column (experimental sample) belongs. x.step.mod The number of points in the intervals used for estimating the cumulative distribution functions of the mixing model in each column. x.step.pi The number of points in the intervals used for estimating the proportion of MCAR values in each column. nb.rei The number of initializations of the minimization algorithm used to estimate the proportion of MCAR values (see Details).

## Details

This function aims to estimate the following mixture model in each column:

F_{tot}(x)=π_{na}\times F_{na}(x)+(1-π_{na})\times F_{obs}(x)

F_{na}(x)=π_{mcar}\times F_{tot}(x)+(1-π_{mcar})\times F_{mnar}(x)

where π_{na} is the proportion of missing values, π_{mcar} is the proportion of MCAR values, F_{tot} is the cumulative distribution function (cdf) of the complete values, F_{na} is the cdf of the missing values, F_{obs} is the cdf of the observed values, and F_{mnar} is the cdf of the MNAR values.

To estimate this model, a first step consists to compute a rough estimate of F_{na} by assuming that all missing values are MCAR (thanks to the argument tab.imp). This rough estimate is noted \hat{F}_{na}.

In a second step, the proportion of MCAR values is estimated. To do so, the ratio

\hat{π}(x)=(1-\hat{F}_{na}(x))/(1-\hat{F}_{tot}(x))

is computed for different x, where

\hat{F}_{tot}(x)=π_{na}\times \hat{F}_{na}(x)+(1-π_{na})\times \hat{F}_{obs}(x)

with \hat{F}_{obs} the empirical cdf of the observed values.

Next, the following minimization is performed:

\min_{1>k>0,a>0,d>0}f(k,a,d)

where

f(k,a,d)=∑_x \frac{1}{s(x)^2}\times [\hat{π}(x)-k-(1-k)\frac{\exp(-a\times [x-lower_x]^d)}{1-\hat{F}_{tot}(x)}]^2

where s(x)^2 is an estimate of the asymptotic variance of \hat{π}(x), lower_x is an estimate of the minimum of the complete values. To perform this minimization, the function optim with the method "L-BFGS-B" is used. Because it is function of its initialization, it is possible to reinitialize a number of times the minimisation algorithm with the argument nb.rei: the parameters leading to the lowest minimum are next kept.

Once k, a and d are estimated, one can use several methods to estimate π_{mcar}: it is estimated by k;

## Value

A list composed of:

 abs.pi A numeric matrix containing the intervals used for estimating the ratio (1-F_na(x))/(1-F_tot(x)) in each column. pi.init A numeric matrix containing the estimated ratios (1-F_na(x))/(1-F_tot(x)) where x belongs to abs.pi[,j] for each sample j. var.pi.init A numeric matrix containing the estimated asymptotic variances of pi.init. trend.pi.init A numeric matrix containing the estimated trend of the model used in the minimization algorithm. abs.mod A numeric vector containing the interval used for estimating the mixture models in each column. pi.na A numeric vector containing the proportions of missing values in each column. F.na A numeric matrix containing the estimated cumulative distribution functions of missing values in each column on the interval abs.mod. F.tot A numeric matrix containing the estimated cumulative distribution functions of complete values in each column on the interval abs.mod. F.obs A numeric matrix containing the estimated cumulative distribution functions of observed values in each column on the interval abs.mod. pi.mcar A numeric vector containing the estimations of the proportion of MCAR values in each column. MinRes A numeric matrix containing the three parameters of the model used in the minimization algorithm (three first rows), and the value of minimized function.

## Author(s)

Quentin Giai Gianetto <quentin2g@yahoo.fr>

impute.slsa
 1 2 3 4 5 6 7 8 #Simulating data res.sim=sim.data(nb.pept=2000,nb.miss=600); #Imputation of missing values with a MCAR-devoted algorithm: here the slsa algorithm dat.slsa=impute.slsa(tab=res.sim$dat.obs,conditions=res.sim$condition,repbio=res.sim$repbio); #Estimation of the mixture model res=estim.mix(tab=res.sim$dat.obs, tab.imp=dat.slsa, conditions=res.sim\$condition);