mi_mix: Multiple imputation from a matrix of probabilities of being...
In imp4p: Imputation for Proteomics


This function imputes data sets using a multiple imputation strategy. For details, see Giai Gianetto Q. et al. (2020) (doi: 10.1101/2020.05.29.122770).

mi.mix(tab, tab.imp, prob.MCAR, conditions, repbio = NULL, reptech = NULL,
       nb.iter = 3, nknn = 15, weight = 1, selec = "all", siz = 500,
       ind.comp = 1, methodMCAR = "mle", q = 0.95, progress.bar = TRUE,
       details = FALSE, ncp.max = 5, maxiter = 10, ntree = 100,
       variablewise = FALSE, decreasing = FALSE, verbose = FALSE,
       mtry = floor(sqrt(ncol(tab))), replace = TRUE, classwt = NULL,
       cutoff = NULL, strata = NULL, sampsize = NULL, nodesize = NULL,
       maxnodes = NULL, xtrue = NA,
       parallelize = c('no', 'variables', 'forests'), methodMNAR = "igcda",
       q.min = 0.025, q.norm = 3, eps = 0, distribution = "unif",
       param1 = 3, param2 = 1, R.q.min = 1)

`tab`	A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide.
`tab.imp`	A matrix where the missing values of `tab` have been imputed under the assumption that they are all MCAR. For instance, such a matrix can be obtained from the function `impute.slsa` of this package.
`prob.MCAR`	A matrix of probabilities that each missing value is MCAR. For instance, such a matrix can be obtained from the function `prob.mcar.tab` of this package.
`conditions`	A vector of factors indicating the biological condition to which each column (experimental sample) belongs.
`repbio`	A vector of factors indicating the biological replicate to which each column belongs. Default is NULL (no experimental design is considered).
`reptech`	A vector of factors indicating the technical replicate to which each column belongs. Default is NULL (no experimental design is considered).
`nb.iter`	The number of iterations used for the multiple imputation method.
`nknn`	The number of nearest neighbours used in the SLSA algorithm (see `impute.slsa`).
`selec`	A parameter to select a part of the dataset to find nearest neighbours between rows. This can be useful for big data sets (see `impute.slsa`).
`siz`	A parameter to select a part of the dataset to perform imputations with a MCAR-devoted algorithm. This can be useful for big data sets. Note that `siz` needs to be smaller than `selec`.
`weight`	The way of weighting in the algorithm (see `impute.slsa`).
`ind.comp`	If `ind.comp=1`, only nearest neighbours without missing values are selected to fit linear models (see `impute.slsa`). Else, they can contain missing values.
`methodMCAR`	The method used for imputing MCAR data. If `methodMCAR="mle"` (default), then the `impute.mle` function is used (imputation using an EM algorithm). If `methodMCAR="pca"`, then the `impute.PCA` function is used (imputation using Principal Component Analysis). If `methodMCAR="rf"`, then the `impute.RF` function is used (imputation using Random Forest). Else, the `impute.slsa` function is used (imputation using Least Squares on nearest neighbours).
`methodMNAR`	The method used for imputing MNAR data. If `methodMNAR="igcda"` (default), then the `impute.igcda` function is used. Else, the `impute.pa` function is used.
`q`	A quantile value (see `impute.igcda`).
`progress.bar`	If `TRUE`, a progress bar is displayed.
`details`	If `TRUE`, the function gives a list of three values: `imputed.matrix` a matrix with the average of imputed values for each missing value, `sd.imputed.matrix` a matrix with the standard deviations of imputed values for each missing value, `all.imputed.matrices` an array with all the `nb.iter` matrices of imputed values that have been generated.
`ncp.max`	parameter of the `impute.PCA` function.
`maxiter`	parameter of the `impute.RF` function.
`ntree`	parameter of the `impute.RF` function.
`variablewise`	parameter of the `impute.RF` function.
`decreasing`	parameter of the `impute.RF` function.
`verbose`	parameter of the `impute.RF` function.
`mtry`	parameter of the `impute.RF` function.
`replace`	parameter of the `impute.RF` function.
`classwt`	parameter of the `impute.RF` function.
`cutoff`	parameter of the `impute.RF` function.
`strata`	parameter of the `impute.RF` function.
`sampsize`	parameter of the `impute.RF` function.
`nodesize`	parameter of the `impute.RF` function.
`maxnodes`	parameter of the `impute.RF` function.
`xtrue`	parameter of the `impute.RF` function.
`parallelize`	parameter of the `impute.RF` function.
`q.min`	parameter of the `impute.pa` function.
`q.norm`	parameter of the `impute.pa` function.
`eps`	parameter of the `impute.pa` function.
`distribution`	parameter of the `impute.pa` function.
`param1`	parameter of the `impute.pa` function.
`param2`	parameter of the `impute.pa` function.
`R.q.min`	parameter of the `impute.pa` function.
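
The design arguments `conditions` and `repbio` are vectors of factors with one entry per column of `tab`. As a minimal sketch, assuming a hypothetical design of two conditions with three biological replicates each (the labels "A" and "B" and the sample count are illustrative, not taken from the package), they could be built as follows:

```r
# Sketch only: a hypothetical design with 2 conditions x 3 biological replicates.
n.samples <- 6

# One entry per column of `tab`, in column order
conditions <- factor(rep(c("A", "B"), each = 3))
repbio <- factor(seq_len(n.samples))

# Basic consistency checks before calling an imputation function
stopifnot(length(conditions) == n.samples,
          length(repbio) == n.samples)

# Number of samples per condition
table(conditions)
```

In a real analysis, `n.samples` would be `ncol(tab)`, and the order of the factor entries has to follow the column order of the data matrix.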

At each iteration, a matrix indicating the MCAR values is generated by drawing from Bernoulli distributions whose parameters are given by the matrix prob.MCAR. The values drawn as MCAR are then imputed using the matrix tab.imp. For each row containing MNAR values, the other rows are imputed using the function impute.igcda, and the row under consideration is then imputed using one of the MCAR-devoted imputation methods (impute.mle, impute.RF, impute.PCA or impute.slsa). In this way, impute.igcda shifts the correlation structure of the data set towards that of the true values, and the MCAR-devoted method imputes while taking this modified correlation structure into account.
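
The Bernoulli draw at the start of each iteration can be illustrated in base R. This is a minimal sketch on toy inputs, not the internal code of `mi.mix`: the small matrices and the names `is.mcar` and `is.mnar` are hypothetical and used only for illustration.

```r
# Sketch only: toy inputs, not the imp4p internals.
set.seed(1)

# Hypothetical 4 x 3 data matrix with four missing values
tab <- matrix(c(20.1,   NA, 19.8,
                  NA,   NA, 22.4,
                18.7, 18.9,   NA,
                21.0, 21.3, 21.1),
              nrow = 4, byrow = TRUE)

# Hypothetical probabilities that each missing value is MCAR
# (only the entries at missing positions are used)
prob.MCAR <- matrix(0, nrow = 4, ncol = 3)
prob.MCAR[is.na(tab)] <- c(0.9, 0.2, 0.1, 0.7)

# One Bernoulli draw per missing value: 1 = treated as MCAR in this iteration
miss <- which(is.na(tab))
is.mcar <- matrix(0L, nrow = nrow(tab), ncol = ncol(tab))
is.mcar[miss] <- rbinom(length(miss), size = 1, prob = prob.MCAR[miss])

# The remaining missing values are treated as MNAR in this iteration
is.mnar <- is.na(tab) & is.mcar == 0L
```

Repeating the draw gives a different MCAR/MNAR split each time, which is what makes the imputed values vary across the `nb.iter` iterations.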

If details=FALSE (default), the input matrix tab with each missing value replaced by the average of its imputed values. If details=TRUE, a list of three elements: imputed.matrix, a matrix with the average of imputed values for each missing value; sd.imputed.matrix, a matrix with the standard deviations of imputed values for each missing value; and all.imputed.matrices, an array with all the nb.iter matrices of imputed values that have been generated.
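
The relation between the per-iteration matrices and the averaged and standard-deviation matrices can be sketched in base R. This is an illustration on a toy array, assuming the iterations are stored along the third dimension; the actual layout of all.imputed.matrices should be checked with `dim()` on a real result. The names `all.imputed`, `avg.imputed` and `sd.imputed` are hypothetical.

```r
# Sketch only: a toy array standing in for nb.iter imputed matrices.
set.seed(2)
nb.iter <- 3
n.row <- 4
n.col <- 2

# Hypothetical array: rows x columns x iterations
all.imputed <- array(rnorm(n.row * n.col * nb.iter, mean = 20, sd = 1),
                     dim = c(n.row, n.col, nb.iter))

# Entry-wise mean and standard deviation over the iteration dimension
avg.imputed <- apply(all.imputed, c(1, 2), mean)
sd.imputed  <- apply(all.imputed, c(1, 2), sd)
```

A large entry in `sd.imputed` flags a value whose imputation changes a lot from one iteration to the next.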

Quentin Giai Gianetto <quentin2g@yahoo.fr>

Giai Gianetto, Q., Wieczorek, S., Couté, Y., Burger, T. (2020). A peptide-level multiple imputation strategy accounting for the different natures of missing values in proteomics data. bioRxiv 2020.05.29.122770; doi: 10.1101/2020.05.29.122770

# Simulating data
res.sim = sim.data(nb.pept = 5000, nb.miss = 1000)

# Fast imputation of missing values with the impute.rand algorithm
dat.rand = impute.rand(tab = res.sim$dat.obs, conditions = res.sim$condition)

# Estimation of the mixture model
res = estim.mix(tab = res.sim$dat.obs, tab.imp = dat.rand,
                conditions = res.sim$condition)

# Computing the probabilities of being MCAR
born = estim.bound(tab = res.sim$dat.obs, conditions = res.sim$condition)
proba = prob.mcar.tab(tab.u = born$tab.upper, res = res)

# Multiple imputation strategy with 3 iterations
# (can be time-consuming, depending on the data set)
data.mi = mi.mix(tab = res.sim$dat.obs, tab.imp = dat.rand, prob.MCAR = proba,
                 conditions = res.sim$conditions, repbio = res.sim$repbio,
                 nb.iter = 3)