# estim_mix: Estimation of a mixture model of MCAR and MNAR values in each... In imp4p: Imputation for Proteomics

## Description

This function allows estimating a mixture model of MCAR and MNAR values in each column of data sets similar to the ones which can be studied in MS-based quantitative proteomics. Such data matrices contain intensity values of identified peptides.

## Usage

 1 2 estim.mix(tab, tab.imp, conditions, x.step.mod=100, x.step.pi=100, nb.rei=50, method=1, gridsize=100) 

## Arguments

 tab A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide. tab.imp A matrix where the missing values of tab have been imputed under the assumption that they are all MCAR. For instance, such a matrix can be obtained by using the function impute.slsa of this package. conditions A vector of factors indicating the biological condition to which each column (experimental sample) belongs. x.step.mod The number of points in the intervals used for estimating the cumulative distribution functions of the mixing model in each column. x.step.pi The number of points in the intervals used for estimating the proportion of MCAR values in each column. nb.rei The number of initializations of the minimization algorithm used to estimate the proportion of MCAR values (see Details). method A numeric value indicating the method to use for estimating the proportion of MCAR values (see Details). gridsize A numeric value indicating the number of possible choices used for estimating the proportion of MCAR values with the method of Patra and Sen (2015) (see Details).

## Details

This function aims to estimate the following mixture model in each column:

F_{tot}(x)=π_{na}\times F_{na}(x)+(1-π_{na})\times F_{obs}(x)

F_{na}(x)=π_{mcar}\times F_{tot}(x)+(1-π_{mcar})\times F_{mnar}(x)

where π_{na} is the proportion of missing values, π_{mcar} is the proportion of MCAR values, F_{tot} is the cumulative distribution function (cdf) of the complete values, F_{na} is the cdf of the missing values, F_{obs} is the cdf of the observed values, and F_{mnar} is the cdf of the MNAR values.

To estimate this model, a first step consists to compute a rough estimate of F_{na} by assuming that all missing values are MCAR (thanks to the argument tab.imp). This rough estimate is noted \hat{F}_{na}.

In a second step, the proportion of MCAR values is estimated. To do so, the ratio

\hat{π}(x)=(1-\hat{F}_{na}(x))/(1-\hat{F}_{tot}(x))

is computed for different x, where

\hat{F}_{tot}(x)=π_{na}\times \hat{F}_{na}(x)+(1-π_{na})\times \hat{F}_{obs}(x)

with \hat{F}_{obs} the empirical cdf of the observed values.

Next, the following minimization is performed:

\min_{1>k>0,a>0,d>0}f(k,a,d)

where

f(k,a,d)=∑_x \frac{1}{s(x)^2}\times [\hat{π}(x)-k-(1-k)\frac{\exp(-a\times [x-lower_x]^d)}{1-\hat{F}_{tot}(x)}]^2

where s(x)^2 is an estimate of the asymptotic variance of \hat{π}(x), lower_x is an estimate of the minimum of the complete values. To perform this minimization, the function optim with the method "L-BFGS-B" is used. Because it is function of its initialization, it is possible to reinitialize a number of times the minimisation algorithm with the argument nb.rei: the parameters leading to the lowest minimum are next kept.

Once k, a and d are estimated, one can use several methods to estimate π_{mcar}: it is estimated

by k if method=1 (default);

by k+(1-k)\times\exp(-a\times [\max(x)-lower_x]^d)/(1-\hat{F}_{tot}(\max(x)))

if method=2;

by estimating a decreasing trend with the Pool-Adjacent-Violators (PAV) algorithm on

k+(1-k)\times\exp(-a\times [x-lower_x]^d)/(1-\hat{F}_{tot}(x))

and keeping the righmost value of this trend if method=3;

by estimating a decreasing trend with the PAV algorithm on \hat{π}(x) and keeping the righmost value of this trend if method=4;

by using the histogram of \hat{π}(x) if method=5;

by using the method of Patra and Sen (2015) adapted to our problematic if method=6.

## Value

A list composed of:

 abs.pi A numeric matrix containing the intervals used for estimating the ratio (1-F_na(x))/(1-F_tot(x)) in each column. pi.init A numeric matrix containing the estimated ratios (1-F_na(x))/(1-F_tot(x)) where x belongs to abs.pi[,j] for each sample j. var.pi.init A numeric matrix containing the estimated asymptotic variances of pi.init. abs.mod A numeric vector containing the interval used for estimating the mixture models in each column. pi.na A numeric vector containing the proportions of missing values in each column. F.na A numeric matrix containing the estimated cumulative distribution functions of missing values in each column on the interval abs.mod. F.tot A numeric matrix containing the estimated cumulative distribution functions of complete values in each column on the interval abs.mod. F.obs A numeric matrix containing the estimated cumulative distribution functions of observed values in each column on the interval abs.mod. F.mnar A numeric matrix containing the estimated cumulative distribution functions of MNAR values in each column on the interval abs.mod. pi.mcar A numeric vector containing the estimations of the proportion of MCAR values in each column.

## Author(s)

Quentin Giai Gianetto <[email protected]>

## References

Patra, R. K., & Sen, B. (2015). Estimation of a two component mixture model with applications to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

impute.slsa

## Examples

 1 2 3 4 5 6 7 8 9 #Simulating data res.sim=sim.data(nb.pept=2000,nb.miss=600,pi.mcar=0.1,para=0.5,nb.cond=2,nb.repbio=3, nb.sample=5,m.c=25,sd.c=2,sd.rb=0.5,sd.r=0.2); #Imputation of missing values with the slsa algorithm dat.slsa=impute.slsa(tab=res.sim$dat.obs,conditions=res.sim$condition,repbio=res.sim$repbio); #Estimation of the mixture model res=estim.mix(tab=res.sim$dat.obs, tab.imp=dat.slsa, conditions=res.sim\$condition); 

imp4p documentation built on Dec. 12, 2017, 1:04 a.m.