# Imputing missing values using an adaptation of the LSimpute algorithm (Bo et al. (2004)) to experimental designs. This algorithm is named "Structured Least Squares Algorithm" (SLSA).

### Description

This function is an adaptation of the LSimpute algorithm (Bo et al. (2004)) to experimental designs usually met in MS-based quantitative proteomics.

### Usage

```
impute.slsa(tab, conditions, repbio=NULL, reptech=NULL, nknn=15, selec="all", weight=1,
            ind.comp=1, progress.bar=TRUE)
```

### Arguments

`tab`
A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide.

`conditions`
A vector of factors indicating the biological condition to which each sample belongs.

`repbio`
A vector of factors indicating the biological replicate to which each sample belongs. Default is NULL (no experimental design is considered).

`reptech`
A vector of factors indicating the technical replicate to which each sample belongs. Default is NULL (no experimental design is considered).

`nknn`
The number of nearest neighbours used in the algorithm (see Details).

`selec`
A parameter to select a part of the data set in which to search for nearest neighbours between rows. This can be useful for large data sets (see Details).

`weight`
The weighting scheme used in the algorithm (see Details).

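`ind.comp`
If `ind.comp=1`, only nearest neighbours without missing values in the condition are considered (see Details).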

`progress.bar`
If `TRUE`, a progress bar is displayed.

### Details

This function imputes the missing values condition by condition. A row of the input matrix is imputed in a given condition only if it has at least one observed value in that condition. For rows that have only missing values in a condition, you can use the `impute.pa` function.
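The condition-by-condition rule can be illustrated with a small base R sketch. The toy matrix and the names `n_obs` and `imputable` are hypothetical; this shows the rule stated above, not the package's internal code.

```r
# Toy matrix: 4 peptides (rows) x 4 samples (columns), two conditions.
tab <- matrix(c(20.1, 20.3,   NA, 21.0,
                  NA,   NA, 19.5, 19.7,
                22.0,   NA, 22.4, 22.1,
                  NA,   NA,   NA,   NA),
              nrow = 4, byrow = TRUE)
conditions <- factor(c("A", "A", "B", "B"))

# Number of observed values per row, computed separately for each condition.
n_obs <- sapply(levels(conditions), function(cond) {
  rowSums(!is.na(tab[, conditions == cond, drop = FALSE]))
})

# TRUE where a row has at least one observed value in the condition,
# i.e. where the rule above says the row can be imputed in that condition.
imputable <- n_obs >= 1
print(imputable)
```

Rows flagged `FALSE` for a condition are the ones that would be left to `impute.pa`.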

For each row, a similarity measure between its observed values and those of every other row is computed. The measure is the absolute pairwise correlation coefficient when at least three side-by-side values are observed, and the inverse of the Euclidean distance between side-by-side observed values otherwise.
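A minimal sketch of such a similarity measure in base R follows. The function name `similarity` is hypothetical, and "side-by-side values" is read here as the positions observed in both rows, which is an assumption; the package's own implementation may treat edge cases differently.

```r
# Hypothetical helper: similarity between two rows x and y of equal length.
similarity <- function(x, y) {
  both <- !is.na(x) & !is.na(y)  # positions observed in both rows (assumed meaning)
  n <- sum(both)
  if (n == 0) return(NA_real_)   # nothing to compare
  if (n >= 3) {
    # Absolute Pearson correlation on the shared observed values.
    abs(cor(x[both], y[both]))
  } else {
    # Inverse Euclidean distance (Inf if the shared values are identical).
    1 / sqrt(sum((x[both] - y[both])^2))
  }
}

# Three shared values: correlation branch.
s_cor <- similarity(c(1, 2, 3, NA), c(2, 4, 6, 8))
# Two shared values: inverse-distance branch.
s_dist <- similarity(c(1, NA, 3, NA), c(1, 2, 5, NA))
```

Note that the two branches live on different scales: the correlation is bounded by 1, whereas the inverse distance is unbounded.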

For large data sets, this step can be time-consuming, which is why the input parameter `selec` allows a random subset of rows to be selected. If `selec="all"`, all the rows of the data set are considered; if `selec` is a numeric value, for instance `selec=100`, only 100 random rows are selected in the data set for computing similarity measures with each row containing missing values.
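The effect of `selec` can be sketched as follows. The helper `candidate_rows` is hypothetical and only mirrors the description above; whether the package excludes the target row or samples differently is not specified here.

```r
# Hypothetical helper: indices of the rows used as neighbour candidates.
candidate_rows <- function(n_rows, selec = "all") {
  if (identical(selec, "all")) {
    seq_len(n_rows)                                # every row is a candidate
  } else {
    sample.int(n_rows, size = min(selec, n_rows))  # random subset, no replacement
  }
}

set.seed(1)
idx_all <- candidate_rows(2000)         # all 2000 rows
idx_sub <- candidate_rows(2000, 100)    # 100 random rows
```

With `selec=100`, each row containing missing values is compared against 100 candidates rather than the full data set, which reduces the cost of the similarity step at the price of a less exhaustive neighbour search.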

Once the similarity measures are computed for a specific row, the `nknn` rows with the highest similarity measures are used to fit linear models and to predict several estimates for each missing value (see Bo et al. (2004)). If `ind.comp=1`, only nearest neighbours without missing values in the condition are considered. Unlike the original algorithm, however, our algorithm can take into account the experimental design specified through the input vectors `conditions`, `repbio` and `reptech`. Note that `conditions` must have fewer levels than `repbio`, and `repbio` must have fewer levels than `reptech`.
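The constraint on the number of levels can be checked before calling the function. The helper `check_design` below is hypothetical, and the labelling of the toy design (replicates numbered uniquely across conditions) is an assumption made for illustration.

```r
# Hypothetical helper: TRUE if the level counts satisfy
# conditions < repbio < reptech (for the vectors that are supplied).
check_design <- function(conditions, repbio = NULL, reptech = NULL) {
  ok <- TRUE
  if (!is.null(repbio)) {
    ok <- ok && nlevels(factor(conditions)) < nlevels(factor(repbio))
  }
  if (!is.null(repbio) && !is.null(reptech)) {
    ok <- ok && nlevels(factor(repbio)) < nlevels(factor(reptech))
  }
  ok
}

# Toy design: 2 conditions, 6 biological replicates, 12 technical replicates.
conditions <- factor(rep(c("A", "B"), each = 6))
repbio     <- factor(rep(1:6, each = 2))
reptech    <- factor(1:12)

design_ok  <- check_design(conditions, repbio, reptech)
design_bad <- check_design(conditions, repbio = conditions)  # same number of levels
```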

In the original algorithm, several predictions of each missing value are made from the estimated linear models; they are then weighted according to their similarity measure and summed (see Bo et al. (2004)). In our algorithm, one can use the original weighting function of Bo et al. (2004) if `weight="o"`, i.e. `(sim^2/(1-sim^2+1e-06))^2` where `sim` is the similarity measure, or the weighting function `sim^weight` if `weight` is a numeric value.
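The two weighting functions can be compared with a short sketch. The wrapper `sim_weight` is hypothetical; only the two formulas come from the text above.

```r
# Hypothetical wrapper around the two weighting functions described above.
sim_weight <- function(sim, weight = 1) {
  if (identical(weight, "o")) {
    (sim^2 / (1 - sim^2 + 1e-06))^2   # original weighting of Bo et al. (2004)
  } else {
    sim^weight                        # power weighting for a numeric `weight`
  }
}

sim <- c(0.5, 0.9, 0.99)
w_orig  <- sim_weight(sim, "o")   # grows very fast as sim approaches 1
w_power <- sim_weight(sim, 2)     # stays within [0, 1] for sim in [0, 1]
print(rbind(sim = sim, original = w_orig, power2 = w_power))
```

The original function puts almost all the weight on the most similar neighbours, whereas `sim^weight` with a small exponent spreads the weight more evenly.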

### Value

The input matrix `tab` with imputed values instead of missing values.

### Author(s)

Quentin Giai Gianetto <quentin2g@yahoo.fr>

### References

Bo, T. H., Dysvik, B., & Jonassen, I. (2004). LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic acids research, 32(3), e34-e34.

### Examples

```
library(imp4p)

# Simulating data
res.sim <- sim.data(nb.pept=2000, nb.miss=600, pi.mcar=0.2, para=10, nb.cond=2, nb.repbio=3,
                    nb.sample=5, m.c=25, sd.c=2, sd.rb=0.5, sd.r=0.2)

# Imputation of missing values with the SLSA algorithm
dat.slsa <- impute.slsa(tab=res.sim$dat.obs, conditions=res.sim$condition, repbio=res.sim$repbio)
```