lrSVD | R Documentation |
This function implements an iterative algorithm to impute left-censored data (e.g. values below detection limit, rounded zeros) based on the singular value decomposition (SVD) of a compositional data set. It is particularly indicated for the case in which the data contain more variables than observations.
This function can also be used to impute missing data instead by setting imp.missing = TRUE (see lrSVDplus to treat censored and missing data simultaneously).
lrSVD(X, label = NULL, dl = NULL, frac = 0.65, ncp = 2,
imp.missing=FALSE, beta = 0.5, method = c("ridge", "EM"),
row.w = NULL, coeff.ridge = 1, threshold = 1e-04, seed = NULL,
nb.init = 1, max.iter = 1000, z.warning = 0.8, z.delete = TRUE,
...)
X |
Compositional data set ( |
label |
Unique label ( |
dl |
Numeric vector or matrix of detection limits/thresholds. These must be given on the same scale as |
frac |
Parameter for initial multiplicative simple replacement of left-censored data (see |
ncp |
Number of components for low-rank matrix approximation (default = 2). |
imp.missing |
If |
beta |
Weighting parameter, balance between the two conditions in objective function (default = 0.5). |
method |
Parameter estimation method for the iterative algorithm ( |
row.w |
Row weights (default = NULL, a vector of 1s for uniform row weights). |
coeff.ridge |
Used when |
threshold |
Threshold for assessing convergence (default = 1e-04). |
seed |
Seed for random initialisation of the algorithm (default |
nb.init |
Number of random initialisations (default = 1). |
max.iter |
Maximum number of iterations for the algorithm (default = 1000). |
z.warning |
Threshold used to identify individual rows or columns including an excess of zeros/unobserved values (to be specified in proportions, default |
z.delete |
Logical value. If set to |
... |
Further arguments. |
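The kind of screening that z.warning describes can be pictured with a few lines of base R. This is only an illustrative sketch, not the package's code: the toy matrix, the variable names and the strict `>` comparison against the threshold are all assumptions.

```r
# Toy data (hypothetical): zeros denote unobserved values
X <- rbind(c( 30, 20, 10, 25, 15, 0),
           c(100,  0,  0,  0,  0, 0),
           c( 40, 10, 20, 20, 10, 0))

z.warning <- 0.8                      # same default as in the usage above

# Proportion of zeros per row and per column
prop_row <- rowMeans(X == 0)
prop_col <- colMeans(X == 0)

# Rows/columns whose proportion of zeros exceeds the threshold
# (assumption: a strict comparison is used)
rows_flagged <- which(prop_row > z.warning)
cols_flagged <- which(prop_col > z.warning)

print(rows_flagged)
print(cols_flagged)
```

In this toy example the second row and the sixth column are flagged; whether such rows and columns are then removed is what z.delete controls.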
This function implements an efficient imputation algorithm particularly suitable for continuous high-dimensional (wide) compositional data sets (more columns than rows), although it is equally applicable to regular data sets. It is based on a low-rank representation of the data set by a principal components (PC) model derived by singular value decomposition (SVD) of the data matrix, extending recent work on principal component imputation and matrix completion methods to the case of censored compositional data (the code builds on the function imputePCA; see the missMDA package for more details). A preliminary imputation by multiplicative replacement (see multRepl) is conducted to initiate the iterative algorithm in log-ratio coordinates. Two steps, estimation of latent PC model loadings and imputation of empty data matrix cells using the model, are repeated iteratively until convergence. Parameter fitting in this context is performed by a regularisation method (ridge regression in this case) or by the expectation-maximisation (EM) algorithm. Regularisation has been shown to be generally preferable and is set as the default method (note that the regularisation parameter coeff.ridge is set to 1 by default; if it is < 1 the result is closer to EM estimation, whereas for values > 1 it is closer to mean estimation).
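The two-step iteration described above can be sketched in base R. This is a deliberately simplified illustration and not the package's implementation: the helper name sketch_lrsvd is hypothetical, the initial replacement is a plain frac * dl substitution rather than multRepl, the noise estimate used for the ridge-type shrinkage is a rough assumption, and the beta-weighted objective is replaced by a crude cap at the detection limit. It assumes dl is a vector with one limit per column.

```r
# Hypothetical helper (not part of the package): simplified low-rank
# imputation of left-censored cells in centred log-ratio (clr) coordinates.
sketch_lrsvd <- function(X, dl, label = 0, frac = 0.65, ncp = 2,
                         coeff.ridge = 1, threshold = 1e-4, max.iter = 1000) {
  X <- as.matrix(X)
  stopifnot(ncp >= 1, ncp < min(dim(X)))
  cens <- X == label                               # cells to impute
  DL <- matrix(dl, nrow(X), ncol(X), byrow = TRUE) # one limit per column
  X[cens] <- frac * DL[cens]                       # crude initial replacement
  k <- seq_len(ncp)
  for (iter in seq_len(max.iter)) {
    L  <- log(X)
    g  <- rowMeans(L)                              # log geometric mean per row
    Z  <- L - g                                    # clr coordinates
    mu <- colMeans(Z)
    s  <- svd(sweep(Z, 2, mu))                     # SVD of the centred matrix
    # Ridge-type shrinkage of the leading singular values
    # (assumption: noise level taken as the mean of the trailing d^2)
    sigma2 <- mean(s$d[-k]^2)
    shr <- pmax(s$d[k]^2 - coeff.ridge * sigma2, 0) / s$d[k]
    fit <- s$u[, k, drop = FALSE] %*% diag(shr, nrow = ncp) %*%
      t(s$v[, k, drop = FALSE])
    fit <- sweep(fit, 2, mu, "+")                  # rank-ncp fit in clr scale
    # Back to the original scale; cap at the detection limit (assumption)
    new <- pmin(exp(fit + g)[cens], DL[cens])
    delta <- sum((log(new) - L[cens])^2)           # change in imputed values
    X[cens] <- new                                 # observed cells untouched
    if (delta < threshold) break
  }
  X
}

# Six rows taken from the first example data set (common dl = 1%)
X <- matrix(c(26.91,  8.08, 12.59, 31.58,  6.45, 14.39,
              39.73, 26.20,  0.00, 15.22,  6.80, 12.05,
              10.76, 31.36,  7.10, 12.74, 31.34,  6.70,
               7.57, 11.35, 30.24,  6.39, 13.65, 30.80,
              38.09,  7.62, 23.68,  9.70, 20.91,  0.00,
              27.67,  7.15, 13.05, 32.04,  6.54, 13.55),
            byrow = TRUE, ncol = 6)
imp <- sketch_lrsvd(X, dl = rep(1, 6))
print(imp[X == 0])
```

With coeff.ridge = 0 the shrinkage vanishes and the fit reduces to a plain truncated SVD, which mirrors the remark that smaller values move the result towards EM-type estimation. Unlike the package, this sketch does not re-close the rows after imputation.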
An imputed data set is produced on the same scale as the input data set. If X is not closed to a constant sum, then the results are adjusted to provide a compositionally equivalent data set, expressed in the original scale, which leaves the absolute values of the observed components unaltered.
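One way to picture this adjustment is the following base R sketch. It is an assumption about how such a rescaling can be done, not the package's code; the helper name rescale_to_original and the toy values are hypothetical. Each row of an imputed, closed composition is multiplied by the factor that maps its observed parts back to their original absolute values.

```r
# Hypothetical helper: rescale an imputed closed composition so that the
# observed parts recover their original absolute values.
rescale_to_original <- function(X_orig, P_imp, label = 0) {
  X_orig <- as.matrix(X_orig)
  P_imp  <- as.matrix(P_imp)
  out <- P_imp
  for (i in seq_len(nrow(X_orig))) {
    obs <- X_orig[i, ] != label                  # observed parts in row i
    # Ratios are constant over observed parts when the imputed composition
    # is compositionally equivalent; the mean guards against rounding noise
    k <- mean(X_orig[i, obs] / P_imp[i, obs])
    out[i, ] <- P_imp[i, ] * k
  }
  out
}

# Toy non-closed row (e.g. ppm) with one censored part
x_orig <- matrix(c(12, 30, 0, 18), nrow = 1)
# Suppose imputation produced this closed composition (rows sum to 1)
p_imp <- matrix(c(12, 30, 0.6, 18) / 60.6, nrow = 1)

x_out <- rescale_to_original(x_orig, p_imp)
print(x_out)
```

The observed values 12, 30 and 18 are returned unchanged, and only the censored part receives a new value on the original scale.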
Missing data imputation
When imp.missing = TRUE, unobserved values are treated as general missing data. For this case, the argument label indicates the unique label for missing values and the argument dl is ignored.
A data.frame object containing the imputed compositional data set expressed in the original scale.
Palarea-Albaladejo, J, Martín-Fernández, JA, Ruiz-Gazen, A, Thomas-Agnan, C. lrSVD: An efficient imputation algorithm for incomplete high-throughput compositional data. Journal of Chemometrics 2022; 36: e3459.
zPatterns, lrSVD, lrDA, multRepl, multLN, multKM, cmultRepl, lrSVDplus
library(zCompositions) # provides lrSVD and the LPdata data set
# Data set closed to 100 (percentages, common dl = 1%)
X <- matrix(c(26.91,8.08,12.59,31.58,6.45,14.39,
39.73,26.20,0.00,15.22,6.80,12.05,
10.76,31.36,7.10,12.74,31.34,6.70,
10.85,46.40,31.89,10.86,0.00,0.00,
7.57,11.35,30.24,6.39,13.65,30.80,
38.09,7.62,23.68,9.70,20.91,0.00,
27.67,7.15,13.05,32.04,6.54,13.55,
44.41,15.04,7.95,0.00,10.82,21.78,
11.50,30.33,6.85,13.92,30.82,6.58,
19.04,42.59,0.00,38.37,0.00,0.00),byrow=TRUE,ncol=6)
X_lrSVD<- lrSVD(X,label=0,dl=rep(1,6))
# Multiple limits of detection by component
mdl <- matrix(0,ncol=6,nrow=10)
mdl[2,] <- rep(1,6)
mdl[4,] <- rep(0.75,6)
mdl[6,] <- rep(0.5,6)
mdl[8,] <- rep(0.5,6)
mdl[10,] <- c(0,0,1,0,0.8,0.7)
X_lrSVD2 <- lrSVD(X,label=0,dl=mdl)
# Non-closed compositional data set
data(LPdata) # data (ppm/micrograms per gram)
dl <- c(2,1,0,0,2,0,6,1,0.6,1,1,0,0,632,10) # limits of detection (0 for no limit)
LPdata2 <- subset(LPdata,select=-c(Cu,Ni,La)) # select a subset for illustration purposes
dl2 <- dl[-c(5,7,10)]
LPdata2_lrSVD <- lrSVD(LPdata2,label=0,dl=dl2)
# Treating zeros as general missing data for illustration purposes only
LPdata2_miss <- lrSVD(LPdata2,label=0,imp.missing=TRUE)