PcaHubert: ROBPCA - ROBust method for Principal Components Analysis

Description Usage Arguments Details Value Note Author(s) References Examples

Description

The ROBPCA algorithm was proposed by Hubert et al (2005) and stays for 'ROBust method for Principal Components Analysis'. It is resistant to outliers in the data. The robust loadings are computed using projection-pursuit techniques and the MCD method. Therefore ROBPCA can be applied to both low and high-dimensional data sets. In low dimensions, the MCD method is applied.

Usage

1
2
3
4
5
PcaHubert(x, ...)
## Default S3 method:
PcaHubert(x, k = 0, kmax = 10, alpha = 0.75, mcd = TRUE, maxdir=250, scale = FALSE, signflip = TRUE, trace=FALSE, ...)
## S3 method for class 'formula'
PcaHubert(formula, data = NULL, subset, na.action, ...)

Arguments

formula

a formula with no response variable, referring only to numeric variables.

data

an optional data frame (or similar: see model.frame) containing the variables in the formula formula.

subset

an optional vector used to select rows (observations) of the data matrix x.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The default is na.omit.

...

arguments passed to or from other methods.

x

a numeric matrix (or data frame) which provides the data for the principal components analysis.

k

number of principal components to compute. If k is missing, or k = 0, the algorithm itself will determine the number of components by finding such k that l_k/l_1 >= 10.E-3 and Σ_{j=1}^k l_j/Σ_{j=1}^r l_j >= 0.8. It is preferable to investigate the scree plot in order to choose the number of components and then run again. Default is k=0.

kmax

maximal number of principal components to compute. Default is kmax=10. If k is provided, kmax does not need to be specified, unless k is larger than 10.

alpha

this parameter measures the fraction of outliers the algorithm should resist. In MCD alpha controls the size of the subsets over which the determinant is minimized, i.e. alpha*n observations are used for computing the determinant. Allowed values are between 0.5 and 1 and the default is 0.75.

mcd

Logical - when the number of variables is sufficiently small, the loadings are computed as the eigenvectors of the MCD covariance matrix, hence the function CovMcd() is automatically called. The number of principal components is then taken as k = rank(x). Default is mcd=TRUE. If mcd=FALSE, the ROBPCA algorithm is always applied.

maxdir

maximal number of random directions to use for computing the outlyingness of the data points. Default is maxdir=250. If the number n of observations is small all possible n*(n-1)/2 pairs of observations are taken to generate the directions.

scale

a logical value indicating whether the variables should be scaled to have unit variance (only possible if there are no constant variables). As a scale function mad is used but alternatively, a vector of length equal the number of columns of x can be supplied. The value is passed to scale and the result of the scaling is stored in the scale slot. Default is scale = FALSE

signflip

a logical value indicating wheather to try to solve the sign indeterminancy of the loadings - ad hoc approach setting the maximum element in a singular vector to be positive. Default is signflip = FALSE

trace

whether to print intermediate results. Default is trace = FALSE

Details

PcaHubert, serving as a constructor for objects of class PcaHubert-class is a generic function with "formula" and "default" methods. The calculation is done using the ROBPCA method of Hubert et al (2005) which can be described briefly as follows. For details see the relevant references.

Let n denote the number of observations, and p the number of original variables in the input data matrix X. The ROBPCA algorithm finds a robust center M (p x 1) of the data and a loading matrix P which is (p x k) dimensional. Its columns are orthogonal and define a new coordinate system. The scores T, an (n x k) matrix, are the coordinates of the centered observations with respect to the loadings:

T=(X-M)P

The ROBPCA algorithm also yields a robust covariance matrix (often singular) which can be computed as

S=PLP^t

where L is the diagonal matrix with the eigenvalues l_1, …, \l_k.

This is done in the following three main steps:

Step 1: The data are preprocessed by reducing their data space to the subspace spanned by the n observations. This is done by singular value decomposition of the input data matrix. As a result the data are represented using at most n-1=rank(X) without loss of information.

Step 2: In this step for each data point a measure of outlyingness is computed. For this purpose the high-dimensional data points are projected on many univariate directions, each time the univariate MCD estimator of location and scale is computed and the standardized distance to the center is measured. The largest of these distances (over all considered directions) is the outlyingness measure of the data point. The h data points with smallest outlyingness measure are used to compute the covariance matrix Σ_h and to select the number k of principal components to retain. This is done by finding such k that l_k/l_1 >= 10.E-3 and Σ_{j=1}^k l_j/Σ_{j=1}^r l_j >= 0.8 Alternatively the number of principal components k can be specified by the user after inspecting the scree plot.

Step 3: The data points are projected on the k-dimensional subspace spanned by the k eigenvectors corresponding to the largest k eigenvalues of the matrix Σ_h. The location and scatter of the projected data are computed using the reweighted MCD estimator and the eigenvectors of this scatter matrix yield the robust principal components.

Value

An S4 object of class PcaHubert-class which is a subclass of the virtual class PcaRobust-class.

Note

The ROBPCA algorithm is implemented on the bases of the Matlab implementation, available as part of LIBRA, a Matlab Library for Robust Analysis to be found at www.wis.kuleuven.ac.be/stat/robust.html

Author(s)

Valentin Todorov valentin.todorov@chello.at

References

M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005), ROBPCA: a new approach to robust principal components analysis, Technometrics, 47, 64–79.

Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. URL http://www.jstatsoft.org/v32/i03/.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
## PCA of the Hawkins Bradu Kass's Artificial Data
##  using all 4 variables
    data(hbk)
    pca <- PcaHubert(hbk)
    pca

## Compare with the classical PCA
    prcomp(hbk)

## or  
    PcaClassic(hbk)
    
## If you want to print the scores too, use
    print(pca, print.x=TRUE)

## Using the formula interface
    PcaHubert(~., data=hbk)

## To plot the results:

    plot(pca)                    # distance plot
    pca2 <- PcaHubert(hbk, k=2)  
    plot(pca2)                   # PCA diagnostic plot (or outlier map)
    
## Use the standard plots available for for prcomp and princomp
    screeplot(pca)    
    biplot(pca)    
    
## Restore the covraiance matrix     
    py <- PcaHubert(hbk)
    cov.1 <- py@loadings 
    cov.1    

armstrtw/rrcov documentation built on May 10, 2019, 1:43 p.m.