| pca | R Documentation |
pca is used to build and explore a principal component analysis (PCA) model.
pca(
x,
ncomp = min(nrow(x) - 1, ncol(x), 20),
center = TRUE,
scale = FALSE,
exclrows = NULL,
exclcols = NULL,
x.test = NULL,
method = "svd",
rand = NULL,
lim.type = "ddmoments",
alpha = 0.05,
gamma = 0.01,
info = "",
prep = NULL,
do.round = FALSE
)
x |
calibration data (matrix or data frame). |
ncomp |
maximum number of components to calculate. |
center |
logical, do mean centering of data or not. |
scale |
logical, do standardization of data or not. |
exclrows |
rows to be excluded from calculations (numbers, names or vector with logical values). |
exclcols |
columns to be excluded from calculations (numbers, names or vector with logical values). |
x.test |
test data (matrix or data frame). |
method |
method to compute principal components ("svd", "nipals"). |
rand |
vector with parameters for randomized PCA methods (if NULL, conventional PCA is used instead) |
lim.type |
which method to use for calculation of critical limits for residual distances (see details) |
alpha |
significance level for extreme limits for T2 and Q distances. |
gamma |
significance level for outlier limits for T2 and Q distances. |
info |
a short text with model description. |
prep |
optional list with preprocessing methods created using ' |
do.round |
logical, whether to round the degrees of freedom (DoF) for distances. |
If the number of components is not specified, the smallest of the number of objects minus one,
the number of variables in the calibration set, and 20 is used. One can also specify an optimal
number of components once the model is calibrated (ncomp.selected). The optimal number of
components is used to build a residual distance plot, as well as for SIMCA classification.
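The default can be checked with a few lines of base R. This is only an illustration of the expression `min(nrow(x) - 1, ncol(x), 20)` from the usage section; the helper name `default_ncomp` is hypothetical and not part of mdatools.

```r
# hypothetical helper reproducing the default value of ncomp from the usage section
default_ncomp <- function(x) min(nrow(x) - 1, ncol(x), 20)

x_small <- matrix(0, nrow = 10, ncol = 50)    # few objects: limited by nrow - 1
x_wide  <- matrix(0, nrow = 100, ncol = 12)   # few variables: limited by ncol
x_large <- matrix(0, nrow = 100, ncol = 200)  # large data: capped at 20

c(default_ncomp(x_small), default_ncomp(x_wide), default_ncomp(x_large))
```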
If some rows of the calibration set should be excluded from calculations (e.g. because they are
outliers), you can provide row numbers, names, or logical values as parameter exclrows. In
this case they will be completely ignored when the model is calibrated. However, score and residual
distances will be computed for these rows as well and then hidden. You can show them
on corresponding plots by using parameter show.excluded = TRUE.
It is also possible to exclude selected columns from calculations by providing parameter
exclcols in form of column numbers, names or logical values. In this case loading matrix
will have zeros for these columns. This allows computing PCA models for selected variables
without removing them physically from a dataset.
Take into account that if you use other packages to make plots (e.g. ggplot2) you will not be able to distinguish between hidden and normal objects.
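The effect of excluding columns on the loading matrix can be sketched in base R. The code below is an illustration of the idea only (decompose the retained columns, then pad the loadings with zeros); it is an assumption about the general principle, not the code mdatools runs internally, and all variable names are made up for the example.

```r
set.seed(42)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5,
            dimnames = list(NULL, paste0("v", 1:5)))
exclcols <- c(2, 4)   # columns to leave out of the decomposition
ncomp <- 2

# mean center the retained columns only
keep <- setdiff(seq_len(ncol(x)), exclcols)
xc <- scale(x[, keep], center = TRUE, scale = FALSE)

# loadings for the retained columns
v <- svd(xc)$v[, seq_len(ncomp), drop = FALSE]

# full-size loading matrix: rows of the excluded columns stay zero
loadings <- matrix(0, nrow = ncol(x), ncol = ncomp,
                   dimnames = list(colnames(x), paste("Comp", seq_len(ncomp))))
loadings[keep, ] <- v
loadings
```

The dataset keeps its original shape, so scores for new data can still be computed as a product of the full data matrix and the padded loadings: the excluded variables simply contribute nothing.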
By default loadings are computed for the original dataset using either the SVD or the NIPALS
algorithm. However, for datasets with a large number of rows (e.g. hyperspectral images), it is
possible to run randomized algorithms [1, 2]. In this case you have
to define parameter rand as a vector with two values: p - oversampling parameter
and q - number of power iterations. Usually rand = c(15, 0) or rand = c(5, 1)
are good options, which give fairly precise solutions much faster.
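To show what the two values of rand control, here is a minimal base R sketch of a randomized decomposition in the spirit of [1]. It is not the implementation used in mdatools; the function name `rand_pca` and the simulated data are assumptions made for the illustration.

```r
# hypothetical function: randomized range finder followed by SVD of a small matrix
rand_pca <- function(x, ncomp, p = 5, q = 1) {
  x <- scale(x, center = TRUE, scale = FALSE)
  k <- min(ncomp + p, ncol(x))               # number of random directions (oversampled)
  omega <- matrix(rnorm(ncol(x) * k), ncol(x), k)
  Q <- qr.Q(qr(x %*% omega))                 # orthonormal basis for the sampled range
  for (i in seq_len(q)) {                    # power iterations sharpen the basis
    Q <- qr.Q(qr(crossprod(x, Q)))
    Q <- qr.Q(qr(x %*% Q))
  }
  s <- svd(crossprod(Q, x))                  # SVD of a small (k x ncol) matrix
  list(loadings = s$v[, seq_len(ncomp), drop = FALSE], d = s$d[seq_len(ncomp)])
}

# simulated data: three strong components plus weak noise
set.seed(1)
tt <- matrix(rnorm(200 * 3), 200, 3) %*% diag(c(10, 5, 2))
pp <- qr.Q(qr(matrix(rnorm(30 * 3), 30, 3)))
x <- tt %*% t(pp) + matrix(rnorm(200 * 30, sd = 1e-3), 200, 30)

res <- rand_pca(x, ncomp = 3, p = 5, q = 1)
exact <- svd(scale(x, center = TRUE, scale = FALSE))

# relative difference between randomized and conventional singular values
rel_err <- max(abs(res$d - exact$d[1:3]) / exact$d[1:3])
rel_err
```

A larger p makes the random basis more likely to capture the leading components, while each power iteration (q) costs extra passes over the data but helps when the singular values decay slowly.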
There are several ways to calculate critical limits for orthogonal (Q, q) and score (T2, h)
distances. In mdatools you can specify one of the following methods via parameter
lim.type: "jm" - Jackson-Mudholkar approach [3], "chisq" - method based on
chi-square distribution [4], "ddmoments" and "ddrobust" - related to data
driven method proposed in [5]. The "ddmoments" is based on method of moments for
estimation of distribution parameters (also known as "classical" approach) while
"ddrobust" is based on robust estimation.
If lim.type="chisq" or lim.type="jm" is used, only limits for Q-distances are
computed based on corresponding approach, limits for T2-distances are computed using
Hotelling's T-squared distribution. The methods utilizing the data driven approach calculate
limits for combination of the distances based on chi-square distribution and parameters
estimated from the calibration data.
The critical limits are calculated for a significance level defined by parameter 'alpha'.
You can also specify another parameter, 'gamma', which is used to calculate acceptance
limit for outliers (shown as dashed line on residual distance plot).
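The following base R sketch shows how limits of the "ddmoments" kind can be obtained. It follows one reading of the data driven approach in [5] (scaled chi-square distributions with parameters estimated by the method of moments) and is not the mdatools code; the simulated data and all variable names are assumptions.

```r
set.seed(1)
n <- 100
ncomp <- 2
alpha <- 0.05
gamma <- 0.01

# simulated calibration set, mean centered
x <- scale(matrix(rnorm(n * 10), n, 10), center = TRUE, scale = FALSE)

# conventional PCA via SVD
s <- svd(x)
P <- s$v[, seq_len(ncomp), drop = FALSE]
scores <- x %*% P
eigenvals <- s$d[seq_len(ncomp)]^2 / (n - 1)

h <- rowSums(sweep(scores^2, 2, eigenvals, "/"))  # score distance (T2)
q <- rowSums((x - tcrossprod(scores, P))^2)       # orthogonal distance (Q)

# method of moments: scaling factor (mean) and DoF for each distance
h0 <- mean(h); Nh <- 2 * (h0 / sd(h))^2
q0 <- mean(q); Nq <- 2 * (q0 / sd(q))^2

# full distance and its chi-square based limits
f <- Nh * h / h0 + Nq * q / q0
f_ext <- qchisq(1 - alpha, Nh + Nq)            # limit for extreme objects
f_out <- qchisq((1 - gamma)^(1 / n), Nh + Nq)  # limit for outliers

table(cut(f, c(0, f_ext, f_out, Inf), labels = c("normal", "extreme", "outlier")))
```

With alpha = 0.05 roughly five percent of the calibration objects are expected to fall beyond the extreme limit, while the outlier limit depends on the number of objects and is crossed much more rarely.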
You can also recalculate the limits for an existing model by using different values for alpha and
gamma, without recomputing the model itself. In this case use the following code (it is assumed
that your current PCA/SIMCA model is stored in variable m):
m = setDistanceLimits(m, lim.type, alpha, gamma).
In case of PCA the critical limits are just shown on residual plot as lines and can be used for
detection of extreme objects (solid line) and outliers (dashed line). When PCA model is used for
classification in SIMCA (see simca) or DD-SIMCA (ddsimca)
the limits are also employed for classification of objects.
If you provide a list with preprocessing methods, PCA will apply them to the training set before excluding the columns and rows (if specified). The list will be used to train a preprocessing model which becomes a part of the PCA model object. So when you use method 'predict()' the provided dataset will be automatically preprocessed by the preprocessing model.
Any PCA model (with or without preprocessing) developed in this package can be saved as a JSON file
using method writeJSON and then loaded into the interactive web application for
PCA available at https://mda.tools/pca. Likewise, one can develop a model in the app, save it to
a JSON file and then load it into R by using method readJSON. In this case,
however, the model object will not contain calibration/training results, so some of
the plots and statistics will not be available.
Returns an object of pca class with the following fields:
ncomp |
number of components included in the model. |
ncomp.selected |
selected (optimal) number of components. |
loadings |
matrix with loading values (nvar x ncomp). |
eigenvals |
vector with eigenvalues for all existing components. |
expvar |
vector with explained variance for each component (in percent). |
cumexpvar |
vector with cumulative explained variance for each component (in percent). |
T2lim |
statistical limit for T2 distance. |
Qlim |
statistical limit for Q residuals. |
info |
information about the model, provided by user when building the model. |
prep |
trained preprocessing model (if specified) |
calres |
an object of class |
testres |
an object of class |
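The relation between the eigenvalues and the two explained variance fields can be illustrated in base R. This is a sketch under the assumption of mean centered data and a conventional SVD; mdatools may compute these values differently internally, and the variable names only mirror the field names above.

```r
set.seed(7)
x <- scale(matrix(rnorm(30 * 6), 30, 6), center = TRUE, scale = FALSE)
ncomp <- 3

d <- svd(x)$d
eigenvals_all <- d^2 / (nrow(x) - 1)                 # variance of scores per component
expvar <- 100 * eigenvals_all / sum(eigenvals_all)   # explained variance, percent
cumexpvar <- cumsum(expvar)                          # cumulative explained variance, percent

round(cbind(eigenvals = eigenvals_all, expvar, cumexpvar)[seq_len(ncomp), ], 2)
```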
More details and examples can be found in the Bookdown tutorial.
Sergey Kucheryavskiy (svkucheryavski@gmail.com)
1. N. Halko, P.G. Martinsson, J.A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53 (2011) pp. 217-288.
2. S. Kucheryavskiy, Blessing of randomness against the curse of dimensionality, Journal of Chemometrics, 32 (2018).
3. J.E. Jackson, A User's Guide to Principal Components, John Wiley & Sons, New York, NY (1991).
4. A.L. Pomerantsev, Acceptance areas for multivariate classification derived by projection methods, Journal of Chemometrics, 22 (2008) pp. 601-609.
5. A.L. Pomerantsev, O.Ye. Rodionova, Concept and role of extreme objects in PCA/SIMCA, Journal of Chemometrics, 28 (2014) pp. 429-438.
Methods for pca objects:
plot.pca | makes an overview of PCA model with four plots. |
summary.pca | shows some statistics for the model. |
categorize.pca | categorizes data rows as "normal", "extreme" or "outliers". |
selectCompNum.pca | sets the optimal number of components in the model. |
setDistanceLimits.pca | sets critical limits for residuals. |
predict.pca | applies PCA model to new data. |
writeJSON.pca | saves PCA model to a JSON file so it can be used in web-applications. |
pca.readJSON | loads PCA model from a JSON file to pca object. |
Plotting methods for pca objects:
plotScores.pca | shows scores plot. |
plotLoadings.pca | shows loadings plot. |
plotVariance.pca | shows explained variance plot. |
plotCumVariance.pca | shows cumulative explained variance plot. |
plotEigenvalues.pca | shows eigenvalues plot. |
plotDistances.pca | shows plot for residual distances (Q vs. T2). |
plotBiplot.pca | shows bi-plot. |
plotExtremes.pca | shows extreme plot. |
plotT2DoF | plot with degrees of freedom for score distance. |
plotQDoF | plot with degrees of freedom for orthogonal distance. |
plotDistDoF | plot with degrees of freedom for both distances. |
Most of the methods for plotting data are also available for PCA results (pcares)
objects. Also check pca.mvreplace, which replaces missing values in a data matrix
with values approximated using iterative PCA decomposition.
library(mdatools)
### Examples for PCA class
## 1. Make PCA model for People data with autoscaling
data(people)
model = pca(people, scale = TRUE, info = "Simple PCA model")
model = selectCompNum(model, 4)
summary(model)
plot(model, show.labels = TRUE)
## 2. Show scores and loadings plots for the model
par(mfrow = c(2, 2))
plotScores(model, comp = c(1, 3), show.labels = TRUE)
plotScores(model, comp = 2, type = "h", show.labels = TRUE)
plotLoadings(model, comp = c(1, 3), show.labels = TRUE)
plotLoadings(model, comp = c(1, 2), type = "h", show.labels = TRUE)
par(mfrow = c(1, 1))
## 3. Show residual distance and variance plots for the model
par(mfrow = c(2, 2))
plotVariance(model, type = "h")
plotCumVariance(model, show.labels = TRUE, legend.position = "bottomright")
plotDistances(model, show.labels = TRUE)
plotDistances(model, ncomp = 2, show.labels = TRUE)
par(mfrow = c(1, 1))
## 4. Use list with preprocessing methods
# get the data (calibration and test set)
data(simdata)
Xc <- simdata$spectra.c
Xt <- simdata$spectra.t
# create a list with two preprocessing methods
p <- list(
prep("savgol", width = 7, porder = 2, dorder = 2),
prep("norm", type = "snv")
)
# build a PCA model with and without preprocessing
m1 <- pca(Xc, 5, prep = p)
m2 <- pca(Xc, 5)
# apply the models to test set
r1 <- predict(m1, Xt)
r2 <- predict(m2, Xt)
# check scores
par(mfrow = c(1, 2))
plotScores(m1, c(1, 2), res = list(cal = m1$calres, test = r1), main = "With preprocessing")
plotScores(m2, c(1, 2), res = list(cal = m2$calres, test = r2), main = "Without preprocessing")