lrSVD: Log-ratio SVD algorithm

lrSVD R Documentation

Description

This function implements an iterative algorithm to impute left-censored data (e.g. values below detection limit, rounded zeros) based on the singular value decomposition (SVD) of a compositional data set.

This function can also be used to impute missing data by setting imp.missing = TRUE (see lrSVDplus to treat censored and missing data simultaneously).

Usage

lrSVD(X, label = NULL, dl = NULL, frac = 0.65, ncp = 2, 
         imp.missing=FALSE, beta = 0.5, method = c("ridge", "EM"),
         row.w = NULL, coeff.ridge = 1, threshold = 1e-04, seed = NULL,
         nb.init = 1, max.iter = 1000, z.warning = 0.8, ...)

Arguments

X

Compositional data set (matrix or data.frame class).

label

Unique label (numeric or character) used to denote zeros/unobserved values in X.

dl

Numeric vector or matrix of detection limits/thresholds. These must be given on the same scale as X.

frac

Parameter for initial multiplicative simple replacement of left-censored data (see multRepl) (default = 0.65).

ncp

Number of components for low-rank matrix approximation (default = 2).

imp.missing

If TRUE then unobserved data identified by label are treated as missing data (default = FALSE).

beta

Weighting parameter, balance between the two conditions in objective function (default = 0.5).

method

Parameter estimation method for the iterative algorithm, either "ridge" (default) or "EM".

row.w

Row weights (default = NULL, a vector of 1 for uniform row weights).

coeff.ridge

Regularisation parameter used when method = "ridge" (default = 1).

threshold

Threshold for assessing convergence (default = 1e-04).

seed

Seed for random initialisation of the algorithm (default = NULL, in which case unobserved values are initially imputed by the column mean).

nb.init

Number of random initialisations (default = 1).

max.iter

Maximum number of iterations for the algorithm (default = 1000).

z.warning

Threshold used to delete individual rows or columns that include an excess of zeros/unobserved values (to be specified as a proportion, default z.warning = 0.8).

...

Further arguments.
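As a rough illustration of what the frac and dl arguments control in the initial step, the following base R sketch applies the classical multiplicative simple replacement formula to a single row. The helper mult_replace_row is hypothetical and written only for this illustration; it is not the multRepl implementation and assumes that zeros mark the censored values.

```r
# Hypothetical helper: multiplicative simple replacement for one row.
# Censored parts are set to frac * dl; observed parts are shrunk by a
# common factor so that the row total is preserved.
mult_replace_row <- function(x, dl, frac = 0.65, label = 0) {
  total <- sum(x)                       # row total (closure constant)
  cens <- x == label                    # censored positions
  delta <- frac * dl[cens]              # replacement values
  x_new <- x * (1 - sum(delta) / total) # multiplicative adjustment
  x_new[cens] <- delta
  x_new
}

x <- c(39.73, 26.20, 0.00, 15.22, 6.80, 12.05)  # row closed to 100
r <- mult_replace_row(x, dl = rep(1, 6))
r
```

Because the observed parts are all multiplied by the same factor, the ratios between them are unchanged, which is the property that makes this replacement suitable as a starting point for log-ratio methods.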

Details

This function implements an efficient imputation algorithm particularly suitable for continuous, high-dimensional, wide compositional data sets (more columns than rows), although it is equally applicable to regular data sets. It is based on a low-rank representation of the data set by a principal components (PC) model, as derived by singular value decomposition (SVD) of the data matrix. It extends recent work on principal component imputation and matrix completion methods to the case of censored compositional data (the code builds on the function imputePCA; see the missMDA package for more details).

A preliminary imputation by multiplicative replacement (see multRepl) is conducted to initiate the iterative algorithm in log-ratio coordinates. Two steps, estimation of the latent PC model loadings and imputation of the empty data matrix cells using the model, are then repeated until convergence. Parameter fitting is performed either by a regularisation method (ridge regression in this case) or by the expectation-maximisation (EM) algorithm. Regularisation has been shown to be generally preferable and is set as the default method. The regularisation parameter coeff.ridge is set to 1 by default: values < 1 give results closer to EM estimation, whereas values > 1 give results closer to mean estimation.
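The alternation between model fitting and imputation can be sketched in base R on an ordinary numeric matrix. This is a simplified illustration under stated assumptions, not the lrSVD implementation: the function svd_impute is hypothetical, it works on raw values rather than log-ratio coordinates, it ignores detection limits, and its noise estimate (the mean of the squared discarded singular values) is a crude stand-in chosen for brevity.

```r
# Hypothetical sketch: iterative rank-ncp SVD imputation with optional
# ridge-type shrinkage of the leading singular values.
svd_impute <- function(Z, miss, ncp = 2, coeff.ridge = 1,
                       threshold = 1e-4, max.iter = 1000) {
  # Initial fill: column means of the observed cells
  for (j in seq_len(ncol(Z))) {
    Z[miss[, j], j] <- mean(Z[!miss[, j], j])
  }
  k <- seq_len(ncp)
  for (iter in seq_len(max.iter)) {
    mu <- colMeans(Z)
    s <- svd(sweep(Z, 2, mu))            # SVD of the centred matrix
    d <- s$d[k]
    rest <- s$d[-k]                      # discarded singular values
    sigma2 <- if (length(rest) > 0) mean(rest^2) else 0  # assumed noise level
    # coeff.ridge = 0 gives no shrinkage (EM-like); larger values pull
    # the fitted values towards the column means
    d_shrunk <- pmax(d^2 - coeff.ridge * sigma2, 0) / d
    fit <- s$u[, k, drop = FALSE] %*% diag(d_shrunk, ncp) %*%
      t(s$v[, k, drop = FALSE])
    fit <- sweep(fit, 2, mu, "+")
    change <- sum((fit[miss] - Z[miss])^2)
    Z[miss] <- fit[miss]                 # only unobserved cells are updated
    if (change < threshold) break
  }
  Z
}

set.seed(1)
Z0 <- matrix(rnorm(20 * 2), 20, 2) %*% matrix(rnorm(2 * 5), 2, 5) +
  matrix(rnorm(20 * 5, sd = 0.1), 20, 5)
miss <- matrix(FALSE, 20, 5)
miss[cbind(c(1, 5, 9), c(2, 3, 4))] <- TRUE
Zin <- Z0
Zin[miss] <- NA
Zhat <- svd_impute(Zin, miss, ncp = 2)
```

Observed cells are never modified; only the cells flagged in miss are refreshed from the low-rank fit at each iteration.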

An imputed data set is produced on the same scale as the input data set. If X is not closed to a constant sum, then the results are adjusted to provide a compositionally equivalent data set, expressed in the original scale, which leaves the absolute values of the observed components unaltered.
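The rescaling idea can be illustrated with a single row in base R. The numbers below are hypothetical and the code is only a sketch of the principle, not the package's internal procedure: an imputed composition expressed on a closed scale is multiplied by one common factor so that the observed parts recover their original absolute values.

```r
x_orig <- c(10, 0, 30, 60)                    # non-closed row, second part unobserved
obs <- x_orig != 0                            # observed positions
x_closed <- c(0.099, 0.010, 0.297, 0.594)     # hypothetical imputed row, closed to 1

# One common factor maps the closed row back to the original scale
k <- sum(x_orig[obs]) / sum(x_closed[obs])
x_back <- x_closed * k
x_back
```

Since every part is multiplied by the same factor, all ratios between parts are preserved, so x_back is compositionally equivalent to x_closed while matching x_orig on the observed parts.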

Missing data imputation

When imp.missing = TRUE, unobserved values are treated as general missing data. For this case, the argument label indicates the unique label for missing values and the argument dl is ignored.

Value

A data.frame object containing the imputed compositional data set expressed in the original scale.

References

Palarea-Albaladejo, J, Martín-Fernández, JA, Ruiz-Gazen, A, Thomas-Agnan, C. lrSVD: An efficient imputation algorithm for incomplete high-throughput compositional data. Journal of Chemometrics 2022; 36: e3459.

See Also

zPatterns, lrSVD, lrDA, multRepl, multLN, multKM, cmultRepl, lrSVDplus

Examples

 library(zCompositions)
 
 # Data set closed to 100 (percentages, common dl = 1%)
 X <- matrix(c(26.91,8.08,12.59,31.58,6.45,14.39,
               39.73,26.20,0.00,15.22,6.80,12.05,
               10.76,31.36,7.10,12.74,31.34,6.70,
               10.85,46.40,31.89,10.86,0.00,0.00,
               7.57,11.35,30.24,6.39,13.65,30.80,
               38.09,7.62,23.68,9.70,20.91,0.00,
               27.67,7.15,13.05,32.04,6.54,13.55,
               44.41,15.04,7.95,0.00,10.82,21.78,
               11.50,30.33,6.85,13.92,30.82,6.58,
               19.04,42.59,0.00,38.37,0.00,0.00),byrow=TRUE,ncol=6)
 
 X_lrSVD <- lrSVD(X,label=0,dl=rep(1,6))
 
 # Multiple limits of detection by component
 mdl <- matrix(0,ncol=6,nrow=10)
 mdl[2,] <- rep(1,6)
 mdl[4,] <- rep(0.75,6)
 mdl[6,] <- rep(0.5,6)
 mdl[8,] <- rep(0.5,6)
 mdl[10,] <- c(0,0,1,0,0.8,0.7)
 
 X_lrSVD2 <- lrSVD(X,label=0,dl=mdl)
 
 # Non-closed compositional data set
 data(LPdata) # data (ppm/micrograms per gram)
 dl <- c(2,1,0,0,2,0,6,1,0.6,1,1,0,0,632,10) # limits of detection (0 for no limit)
 LPdata2 <- subset(LPdata,select=-c(Cu,Ni,La))  # select a subset for illustration purposes
 dl2 <- dl[-c(5,7,10)]
 
 LPdata2_lrSVD <- lrSVD(LPdata2,label=0,dl=dl2)
 
 # Treating zeros as general missing data for illustration purposes only
 LPdata2_miss <- lrSVD(LPdata2,label=0,imp.missing=TRUE)

zCompositions documentation built on Aug. 24, 2023, 1:08 a.m.