pca: Principal Component Analysis

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/pca.R

Description

pca is used to build and explore a principal component analysis (PCA) model.

Usage

1
2
3
pca(x, ncomp = 15, center = T, scale = F, cv = NULL, exclrows = NULL,
  exclcols = NULL, x.test = NULL, method = "svd", rand = NULL,
  lim.type = "jm", alpha = 0.05, gamma = 0.01, info = "")

Arguments

x

a numerical matrix with calibration data.

ncomp

maximum number of components to calculate.

center

logical, do mean centering of data or not.

scale

logical, do sdandardization of data or not.

cv

number of segments for random cross-validation (1 for full cross-validation).

exclrows

rows to be excluded from calculations (numbers, names or vector with logical values)

exclcols

columns to be excluded from calculations (numbers, names or vector with logical values)

x.test

a numerical matrix with test data.

method

method to compute principal components ('svd', 'nipals').

rand

vector with parameters for randomized PCA methods (if NULL, conventional PCA is used instead)

lim.type

which method to use for calculation of critical limits for residuals (see details)

alpha

significance level for calculating critical limits for T2 and Q residuals.

gamma

significance level for calculating outlier limits for T2 and Q residuals.

info

a short text line with model description.

Details

By default pca uses number of components (ncomp) as a minimum of number of objects - 1, number of variables and default or provided value. Besides that, there is also a parameter for selecting an optimal number of components (ncomp.selected). The optimal number of components is used to build a residuals plot (with Q residuals vs. Hotelling T2 values), calculate confidence limits for Q residuals, as well as for SIMCA classification.

You can provde number, names or logical values to exclode rows or columns from calibration and validation of PCA model. In this case the outcome, e.g. scores and loadings will correspond to the original size of the data, but:

  1. Loadings (and all performance statistics) will be computed without excluded objects and variables

  2. Matrix with loadings will have zero values for the excluded variables and the corresponding columns will be hidden.

  3. Matrix with scores will have score values calculated for the hidden objects but the rows will be hidden.

You can see scores and loadings for hidden rows and columns by using parameter 'show.excluded = T' in plots. If you see other packages to make plots (e.g. ggplot2) you will not be able to distinguish between hidden and normal objects.

By default loadings are computed for the original dataset using either SVD or NIPALS algorithm. However, for datasets with large number of rows (e.g. hyperspectral images), there is a possibility to run algorithms based on random permutations [1, 2]. In this case you have to define parameter rand as a vector with two values: p - oversampling parameter and k - number of iterations. Usually rand = c(15, 0) or rand = c(5, 1) are good options, which give quite precise solution using several times less computational time. It must be noted that statistical limits for residuals will not be computed in this case.

There are several ways to calculate critical limits for Q and T2 residuals. In mdatools you can specify one of the following methods via parameter lim.type: 'jm' - method based on Jackson-Mudholkar approach [3], 'chisq' - method based on chi-square distribution [4] and 'ddrobust' and 'ddmoments' - both related to data driven method proposed by Pomerantsev and Rodionova [5]. The 'ddmoments' is based on method of moments for estimation of distribution parameters while 'ddrobust' is based in robust estimation.

It must be noted that the first two methods calculate limits for Q-residuals only, assuming, that limits for T2 residuals must be computed using Hotelling's T-squared distribution. The methods based on the data driven approach calculate limits for both Q and T2 residuals based on chi-square distribution and parameters estimated from the calibration data.

The critical limits are calculated for a significance level defined by parameter 'alpha'. You can also specify another parameter, 'gamma', which is used to calculate acceptance limit for outliers (shown as dashed line on residuals plot).

You can also recalculate the limits for existent model by using different values for alpha and gamme, without recomputing the model itself. In this case use the following code (it is assumed that you current PCA/SIMCA model is stored in variable m): m = setResLimits(m, alpha, gamma).

In case of PCA the critical limits are just shown on residual plot as lines and can be used for detection of extreme objects (solid line) and outliers (dashed line). When PCA model is used for classification in SIMCA (see simca) the limits are utilized for classification of objects.

Value

Returns an object of pca class with following fields:

ncomp

number of components included to the model.

ncomp.selected

selected (optimal) number of components.

loadings

matrix with loading values (nvar x ncomp).

eigenvals

vector with eigenvalues for all existent components.

expvar

vector with explained variance for each component (in percent).

cumexpvar

vector with cumulative explained variance for each component (in percent).

T2lim

statistical limit for T2 distance.

Qlim

statistical limit for Q residuals.

info

information about the model, provided by user when build the model.

calres

an object of class pcares with PCA results for a calibration data.

testres

an object of class pcares with PCA results for a test data, if it was provided.

cvres

an object of class pcares with PCA results for cross-validation, if this option was chosen.

More details and examples can be found in the Bookdown tutorial.

Author(s)

Sergey Kucheryavskiy ([email protected])

References

1. N. Halko, P.G. Martinsson, J.A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53 (2010) pp. 217-288. 2. S. Kucheryavskiy, Blessing of randomness against the curse of dimensionality, Journal of Chemometrics, 32 (2018), pp. 3. J.E. Jackson, A User's Guide to Principal Components, John Wiley & Sons, New York, NY (1991). 4. A.L. Pomerantsev, Acceptance areas for multivariate classification derived by projection methods, Journal of Chemometrics, 22 (2008) pp. 601-609. 5. A.L. Pomerantsev, O.Ye. Rodionova, Concept and role of extreme objects in PCA/SIMCA, Journal of Chemometrics, 28 (2014) pp. 429-438.

See Also

Methods for pca objects:

plot.pca makes an overview of PCA model with four plots.
summary.pca shows some statistics for the model.
selectCompNum.pca set number of optimal components in the model
setResLimits.pca set critical limits for residuals
predict.pca applies PCA model to a new data.
plotScores.pca shows scores plot.
plotLoadings.pca shows loadings plot.
plotVariance.pca shows explained variance plot.
plotCumVariance.pca shows cumulative explained variance plot.
plotResiduals.pca shows Q vs. T2 residuals plot.

Most of the methods for plotting data are also available for PCA results (pcares) objects. Also check pca.mvreplace, which replaces missing values in a data matrix with approximated using iterative PCA decomposition.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
library(mdatools)
### Examples for PCA class

## 1. Make PCA model for People data with autoscaling
## and full cross-validation

data(people)
model = pca(people, scale = TRUE, cv = 1, info = 'Simple PCA model')
model = selectCompNum(model, 4)
summary(model)
plot(model, show.labels = TRUE)

## 3. Show scores and loadings plots for the model
par(mfrow = c(2, 2))
plotScores(model, comp = c(1, 3), show.labels = TRUE)
plotScores(model, comp = 2, type = 'h', show.labels = TRUE)
plotLoadings(model, comp = c(1, 3), show.labels = TRUE)
plotLoadings(model, comp = c(1, 2), type = 'h', show.labels = TRUE)
par(mfrow = c(1, 1))

## 4. Show residuals and variance plots for the model
par(mfrow = c(2, 2))
plotVariance(model, type = 'h')
plotCumVariance(model, show.labels = TRUE, legend.position = 'bottomright')
plotResiduals(model, show.labels = TRUE)
plotResiduals(model, ncomp = 2, show.labels = TRUE)
par(mfrow = c(1, 1))

mdatools documentation built on July 7, 2018, 1 a.m.