Imputing missing values using an adaptation of the LSimpute algorithm (Bo et al. (2004)) to experimental designs. This algorithm is named "Structured Least Squares Algorithm" (SLSA).

Description

This function is an adaptation of the LSimpute algorithm (Bo et al. (2004)) to experimental designs usually met in MS-based quantitative proteomics.

Usage

impute.slsa(tab, conditions, repbio=NULL, reptech=NULL, nknn=15, selec="all", weight=1, 
ind.comp=1, progress.bar=TRUE)
  

Arguments

tab

A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide.

conditions

A vector of factors indicating the biological condition to which each sample belongs.

repbio

A vector of factors indicating the biological replicate to which each sample belongs. Default is NULL (no experimental design is considered).

reptech

A vector of factors indicating the technical replicate to which each sample belongs. Default is NULL (no experimental design is considered).

nknn

The number of nearest neighbours used in the algorithm (see Details).

selec

A parameter to select a part of the dataset to find nearest neighbours between rows. This can be useful for big data sets (see Details).

weight

The way of weighting in the algorithm (see Details).

ind.comp

If ind.comp=1, only nearest neighbours without missing values are selected to fit linear models (see Details). Otherwise, the selected neighbours may contain missing values.

progress.bar

If TRUE, a progress bar is displayed.

Details

This function imputes the missing values condition by condition. The rows of the input matrix are imputed when they have at least one observed value in the considered condition. For the rows having only missing values in a condition, you can use the impute.pa function.
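The condition-by-condition rule above can be illustrated with a small sketch. The toy matrix and the object name has_obs are made up for the illustration and are not part of the package; the code only flags, for each condition, which rows have at least one observed value and would therefore be candidates for imputation in that condition.

```r
# Toy matrix: 4 peptides (rows) x 4 samples (columns), two conditions.
tab <- matrix(c(10, NA, 12, 11,
                NA, NA,  9, 10,
                 8,  9, NA, NA,
                 7,  7,  7, NA),
              nrow = 4, byrow = TRUE)
conditions <- factor(c("A", "A", "B", "B"))

# For each condition, flag the rows with at least one observed value.
# Rows flagged FALSE contain only missing values in that condition.
has_obs <- sapply(levels(conditions), function(cond) {
  rowSums(!is.na(tab[, conditions == cond, drop = FALSE])) > 0
})
has_obs
```

Rows flagged FALSE for a condition are the ones this function leaves untouched in that condition.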

For each row, a similarity measure between the observed values of this row and those of the other rows is computed. The similarity measure is the absolute pairwise correlation coefficient when at least three side-by-side values are observed, and the inverse of the Euclidean distance between side-by-side observed values otherwise.
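As a rough sketch of this similarity measure, the hypothetical helper row_similarity below (not a function of the package) assumes that "side-by-side" values are the positions observed in both rows. It may differ from the internal implementation, in particular for edge cases such as rows with no shared observation or with zero distance.

```r
# Hypothetical helper, written for illustration only.
row_similarity <- function(x, y) {
  # Positions observed in both rows (assumed meaning of "side-by-side").
  both <- !is.na(x) & !is.na(y)
  if (sum(both) == 0) {
    return(NA_real_)                          # nothing to compare
  }
  if (sum(both) >= 3) {
    abs(cor(x[both], y[both]))                # absolute pairwise correlation
  } else {
    1 / sqrt(sum((x[both] - y[both])^2))      # inverse Euclidean distance
  }
}

# Three shared values: correlation branch.
row_similarity(c(1, 2, 3, NA), c(2, 4, 6, 1))
# Two shared values: inverse-distance branch.
row_similarity(c(1, 2, NA, NA), c(4, 6, 5, NA))
```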

For big data sets, this step can be time consuming, which is why the input parameter selec allows restricting the search to randomly selected rows. If selec="all", all the rows of the data set are considered. If selec is a numeric value, for instance selec=100, only 100 random rows of the data set are used to compute similarity measures with each row containing missing values.
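The effect of selec can be pictured with the following sketch. The helper pick_candidates is hypothetical and only mimics the behaviour described above; whether the package draws a new random subset for each row or reuses one subset is not stated here, so the sketch makes no claim about it.

```r
# Hypothetical helper: indices of the rows used as candidate neighbours.
pick_candidates <- function(n_rows, selec = "all") {
  if (identical(selec, "all")) {
    seq_len(n_rows)                            # every row is a candidate
  } else {
    sample.int(n_rows, min(selec, n_rows))     # random subset, no replacement
  }
}

set.seed(1)
cand_all <- pick_candidates(2000, "all")
cand_100 <- pick_candidates(2000, 100)
length(cand_all)
length(cand_100)
```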

Once similarity measures are computed for a specific row, the nknn rows with the highest similarity measures are used to fit linear models and to predict several estimates for each missing value (see Bo et al. (2004)). If ind.comp=1, only nearest neighbours without missing values in the condition are considered. Unlike the original algorithm, this algorithm takes into account the experimental design specified through the input vectors conditions, repbio and reptech. Note that conditions must have fewer levels than repbio, and repbio must have fewer levels than reptech.
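The constraint on the numbers of levels can be checked before calling the function. The design below (2 conditions, 3 biological replicates per condition, 2 technical replicates per biological replicate) is a made-up example, and labelling replicates uniquely across conditions is an assumption made so that the level counts are strictly increasing.

```r
# Made-up design with 12 samples.
conditions <- factor(rep(c("C1", "C2"), each = 6))   # 2 levels
repbio     <- factor(rep(1:6, each = 2))             # 6 levels
reptech    <- factor(1:12)                           # 12 levels

# The design vectors must have strictly increasing numbers of levels.
design_ok <- nlevels(conditions) < nlevels(repbio) &&
  nlevels(repbio) < nlevels(reptech)
design_ok
```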

In the original algorithm, several predictions of each missing value are obtained from the estimated linear models; they are then weighted according to their similarity measure and summed (see Bo et al. (2004)). In this algorithm, one can use the original weighting function of Bo et al. (2004) by setting weight="o", i.e. (sim^2/(1-sim^2+1e-06))^2 where sim is the similarity measure, or the weighting function sim^weight if weight is a numeric value.
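The two weighting options can be written as a short sketch. The helper slsa_weight and the toy predictions are hypothetical, and normalising the weights so that the estimate is a weighted average is an assumption of this illustration rather than a documented detail of the package.

```r
# Hypothetical helper implementing the two weighting functions.
slsa_weight <- function(sim, weight = 1) {
  if (identical(weight, "o")) {
    (sim^2 / (1 - sim^2 + 1e-06))^2   # original weighting of Bo et al. (2004)
  } else {
    sim^weight                        # power weighting
  }
}

# Toy example: two predictions of one missing value and their similarities.
preds <- c(20, 22)
sims  <- c(0.9, 0.5)

w <- slsa_weight(sims, weight = 2)
# Assumption: weights are normalised to sum to one before summing.
estimate <- sum(w * preds) / sum(w)
estimate
```

Larger values of weight give more influence to the most similar neighbours, and weight="o" does so even more sharply as sim approaches 1.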

Value

The input matrix tab with imputed values instead of missing values.

Author(s)

Quentin Giai Gianetto <quentin2g@yahoo.fr>

References

Bo, T. H., Dysvik, B., & Jonassen, I. (2004). LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic acids research, 32(3), e34-e34.

Examples

# Load the package providing sim.data and impute.slsa
library(imp4p)

# Simulating data
res.sim <- sim.data(nb.pept=2000, nb.miss=600, pi.mcar=0.2, para=10, nb.cond=2, nb.repbio=3,
                    nb.sample=5, m.c=25, sd.c=2, sd.rb=0.5, sd.r=0.2)

# Imputation of missing values with the SLSA algorithm
dat.slsa <- impute.slsa(tab=res.sim$dat.obs, conditions=res.sim$condition, repbio=res.sim$repbio)
