pca_rob: Robust PCA algorithms

View source: R/pca_rob.R

pca_robR Documentation

Robust PCA algorithms

Description

Algorithms implementing a robust PCA of a matrix X.

- pca_cr:

Croux & Ruiz-Gazen (C-R) PCA algorithm using projection pursuit (PP) method (Croux & Ruiz-Gazen 2005). The observations are robustly centered, and projected to specific "PP" directions (see below) of the space spaned by the variables (X columns). The PCA loading vectors are the directions (within the PP directions) that maximize a given "projection index", usually a robust spread measure such as MAD. The 1st loading vector is choosen within the n directions corresponding the observations (rows of X). The next loading vector is choosen in n directions corresponding the rows of the deflated matrix X). And so on.

A possible extension of the algorithm is to randomly simulate additionnal candidate PP directions to the n row observations. In function pca_cr, this is done when argument nsim > 0. In such a case, the function simulates nsim additional PP directions to the n initial ones, as proposed in Hubert et al. (2005): random couples of observations are sampled in X and, for each couple, the direction passes through the two observations of the couple (see functions .simpp.hub in file zfunctions.R).

- pca_sph:

Spherical PCA (see Locantore et al. 1990, Maronna 2005, Daszykowski et al. 2007). The spatial median used for centering matrix X is calculated by function .xmedspa (available in file zfunctions.R). This function uses the fast code of rrcov v.1.4-3 available on R CRAN (Thanks to V. Todorov, 2016).

- pca_rob:

Robust PCA combining outlyingness measures and weighted PCA (WPCA). Rows of matrix X receive weights depending on outlyingness. The algorithm is in the same spirit (but quite different) as the ROBPCA method (Hubert et al. 2005, 2009). It is built in the two following steps. The second step is facultative (argument step2).

Step 1 intends to detect and remove multivariate X-outliers that have potentially bad leverages. First, the Stahel-Donoho outlyingness (Maronna and Yohai, 1995) is computed for each of the rows of matrix X (see outstah). Noting 1 - p.rm a proportion (by default in the function, p.rm = .30, as in Hubert et al. 2005, 2009), the p.rm * n observations that have the highest outlyingness receive a weight w1 = 0 (the other receive a weight w1 = 1). Then, a robust Euclidean outlyingness (see outeucl) is calculated. The (p.rm * n observations that have the highest outlyingness receive a weight w2 = 0 (the other receive a weight w2 = 1). The final weight for Step 1 is w = w1 * w2), which is used for a first weighted PCA.

Step2 (facultative) intends to detect outliers in the robust score space computed in Step 1. The score and orthogonal distances, standardized by given cutoffs (see scordis and odis), of each of the n observations are calculated from this score space. An outlyingness is computed by (.5 * SD.stand^2 + .5 * OD.stand^2)^.5. This is implemented using outsdod. Outlyingness values higher than 1 receive a weight w = 0 (the other receive a weight w = 1), and a new weighted PCA is fitted.

In all the functions, matrix X is centered before the analyses, but X is not column-wise scaled (there is no argument scale available). If a scaling is needed, the user has to scale X before using the functions.

Usage


pca_cr(X, ncomp, obj = mad, nsim = 0)

pca_sph(X, ncomp, weights = NULL)

pca_rob(X, ncomp, 
    nsim = 1500, p.rm = .30, 
    step2 = TRUE, weights = NULL, 
    robust = TRUE, alpha = 0.01)

Arguments

X

A n x p matrix or data frame of variables.

ncomp

The maximal number of PCA scores (= components = latent variables) to be calculated.

obj

For pca_cr. The objective function (projection index) to maximize with the PP scores. Default to obj = mad.

nsim

For pca_cr and pca_rob. Used for the Stahel-Donoho outlyingness calculation (See outstah): a number of randomly simulated PP (projection pursuit) directions considered in addition to the n observations (rows of X).

weights

For pca_sph and pca_rob. A vector of length n defining a priori weights to apply to the observations. Internally, weights are "normalized" to sum to 1. Default to NULL (weights are set to 1 / n).

p.rm

For pca_rob. Proportion p.rm of the data used as hard rejection (outliers) in Step 1 (default to p.rm = .30, i.e. 30pct are rejected).

step2

For pca_rob. If TRUE (default), Step 2 is implemented.

robust

For pca_rob ans step2 (used in outsdod). Logical. If TRUE, the moment estimation of the cutoff is robustified (this is advised in particular after robust PCA or PLS on small data sets containing strong outliers). Default to FALSE.

alpha

For pca_rob and step2 (used in outsdod). Risk I level for defining the cutoff detecting extreme values (see the code).

Value

A list of outputs, such as:

T

The score matrix (n x ncomp).

P

The loadings matrix (p x ncomp).

R

The projection matrix (= P ; p x ncomp).

sv

The singular values (vector of length ncomp).

eigs

The eigenvalues (= sv^2; vector of length ncomp).

xmeans

The centering vector of X (length p).

References

Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B., 2007. Robust statistics in data analysis - A review. Chemometrics and Intelligent Laboratory Systems 85, 203-219. https://doi.org/10.1016/j.chemolab.2006.06.016

Hubert, M., Rousseeuw, P.J., Vanden Branden, K., 2005. ROBPCA: A New Approach to Robust Principal Component Analysis. Technometrics 47, 64-79. https://doi.org/10.1198/004017004000000563

Hubert, M., Rousseeuw, P., Verdonck, T., 2009. Robust PCA for skewed data and its outlier map. Computational Statistics & Data Analysis 53, 2264-2274. https://doi.org/10.1016/j.csda.2008.05.027

N. Locantore, J.S. Marron, D.G. Simpson, N. Tripoli, J.T. Zhang, K.L. Cohen, Robust principal component analysis for functional data, Test 8 (1999) 1–7

Maronna, R.A., Yohai, V.J., 1995. The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330-341. https://doi.org/10.1080/01621459.1995.10476517

Maronna, R., 2005. Principal Components and Orthogonal Regression Based on Robust Scales, Technometrics, 47:3, 264-273, DOI: 10.1198/004017005000000166

Examples


data(datoctane)

## 6 samples (25, 26, and 36–39) are outliers
## (contain added alcohol)
## See Hubert et al. 2005

X <- datoctane$X
plotsp(X)

ncomp <- 2
fm <- pca_cr(X, ncomp = ncomp)
#fm <- pca_sph(X, ncomp = ncomp)
#fm <- pca_rob(X, ncomp = ncomp)
#fm <- pca_eigen(X, ncomp = ncomp)
z <- scordis(fm, robust = TRUE)
sd <- z$dr$d
cutsd <- z$cutoff
z <- odis(fm, X, robust = TRUE)
od <- z$dr$d
cutod <- z$cutoff
plot(sd, od, pch = 16,
  xlab = "SD", ylab = "OD")
abline(h = cutod, v = cutsd, lty = 2)
u <- c(25, 26, 36:39)
points(sd[u], od[u], col = "green", cex = 1.5)


mlesnoff/rnirs documentation built on April 24, 2023, 4:17 a.m.