PCA, PLS, and OPLS regression, classification, and cross-validation with the NIPALS algorithm

```
opls(x, ...)

## S4 method for signature 'ExpressionSet'
opls(x, y = NULL, ...)

## S4 method for signature 'data.frame'
opls(x, ...)

## S4 method for signature 'matrix'
opls(x, y = NULL, predI = NA, orthoI = 0,
  algoC = c("default", "nipals", "svd")[1], crossvalI = 7, log10L = FALSE,
  permI = 20, scaleC = c("none", "center", "pareto", "standard")[4],
  subset = NULL, printL = TRUE, plotL = TRUE, .sinkC = NULL, ...)
```

`x`
Numerical data frame or matrix (observations x variables; NAs are allowed), or an ExpressionSet object with a non-empty assayData slot and, for (O)PLS(-DA), a non-empty phenoData@data slot

`...`
Currently not used.

`y`
Response to be modelled: either 1) 'NULL' for PCA (default), 2) a numerical vector (same length as the number of rows of 'x') for single-response (O)PLS, 3) a numerical matrix (same number of rows as 'x') for multiple-response PLS, 4) a factor (same length as the number of rows of 'x') for (O)PLS-DA, or 5) a character string giving the name of the phenoData@data column to be used, when 'x' is an ExpressionSet object. For convenience, character vectors are also accepted for (O)PLS-DA, as are single-column numerical matrices for (O)PLS and single-column character matrices for (O)PLS-DA. NAs are allowed in numeric responses.

`predI`
Integer: number of components (predictive components in the case of PLS and OPLS) to extract; for OPLS, predI is automatically set to 1; if set to NA [default], autofit is performed: components are extracted, up to a maximum of 10, until (i) PCA case: the variance of the component is less than the mean variance of all components (note that this rule requires all components to be computed and can be quite time-consuming for large datasets), or (ii) PLS case: either R2Y of the component is < 0.01 (N4 rule), or Q2Y of the component is < 0 for more than 100 observations, or < 0.05 otherwise (R1 rule)

`orthoI`
Integer: number of orthogonal components (for OPLS only); when set to 0 [default], PLS is performed; otherwise OPLS is performed; when set to NA, OPLS is performed and the number of orthogonal components is chosen automatically by cross-validation (with a maximum of 9 orthogonal components).

`algoC`
Character: algorithm to use; the default is 'svd' for PCA (when 'x' contains no missing values; 'nipals' otherwise) and 'nipals' for PLS and OPLS; when 'svd' is requested for PCA on an 'x' matrix containing missing values, NAs are set to half the minimum of the non-missing values and a warning is generated

`crossvalI`
Integer: number of cross-validation segments (default is 7); the number of samples (rows of 'x') must be >= crossvalI

`log10L`
Logical: should the 'x' matrix be log10 transformed? Zeros are set to 1 prior to transformation [default = FALSE]

`permI`
Integer: number of random permutations of response labels to estimate R2Y and Q2Y significance by permutation testing [default is 20 for single response models (without train/test partition), and 0 otherwise]

`scaleC`
Character: neither centering nor scaling ('none'), mean-centering only ('center'), mean-centering and pareto scaling ('pareto'), or mean-centering and unit variance scaling ('standard') [default]

`subset`
Integer vector: indices of the observations to be used for training (in a classification scheme); use NULL [default] for no partition of the dataset; use 'odd' to partition the dataset into two sets of equal size (preserving the class proportions)

`printL`
Logical: should information regarding the data set and the model be printed? [default = TRUE]

`plotL`
Logical: should the 'summary' plot be displayed? [default = TRUE]

`.sinkC`
Character: name of the file for R output diversion [default = NULL: no diversion]; diversion of messages is required for the integration into Galaxy
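
The `log10L` and `scaleC` arguments describe a preprocessing step applied to 'x' before the model is built. As a rough illustration of what the four scaling options mean, here is a minimal base-R sketch; the function name `preprocess_x` is hypothetical, the code is not taken from the package, and pareto scaling is assumed to mean division by the square root of the standard deviation.

```r
## Hypothetical helper (not part of the package): illustrates the meaning of
## 'log10L' and 'scaleC' on a numerical matrix that may contain NAs.
preprocess_x <- function(x,
                         log10L = FALSE,
                         scaleC = c("standard", "none", "center", "pareto")) {
  scaleC <- match.arg(scaleC)
  if (log10L) {
    ## zeros are set to 1 so that they map to log10(1) = 0
    x[!is.na(x) & x == 0] <- 1
    x <- log10(x)
  }
  if (scaleC == "none")
    return(x)
  ## mean-centering of each variable (column), ignoring NAs
  x <- sweep(x, 2, colMeans(x, na.rm = TRUE), "-")
  sdVn <- apply(x, 2, sd, na.rm = TRUE)
  switch(scaleC,
         center   = x,
         pareto   = sweep(x, 2, sqrt(sdVn), "/"),  # assumed pareto definition
         standard = sweep(x, 2, sdVn, "/"))
}

xMN <- matrix(c(0, 10, 100, 1, NA, 1000), nrow = 3)
preprocess_x(xMN, log10L = TRUE, scaleC = "none")
preprocess_x(xMN, scaleC = "pareto")
```

Pareto scaling sits between 'center' and 'standard': high-variance variables are shrunk, but less strongly than with unit variance scaling.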

An S4 object of class 'opls' containing the following slots:

typeC Character: model type (PCA, PLS, PLS-DA, OPLS, or OPLS-DA)

descriptionMC Character matrix: Description of the data set (number of samples, variables, etc.)

modelDF Data frame with the model overview (number of components, R2X, R2X(cum), R2Y, R2Y(cum), Q2, Q2(cum), significance, iterations)

summaryDF Data frame with the model summary (cumulated R2X, R2Y and Q2); RMSEE is the square root of the mean squared error between the actual and the predicted responses

subsetVi Integer vector: Indices of observations in the training data set

pcaVarVn PCA: Numerical vector of variances of length: predI

vipVn PLS(-DA): Numerical vector of Variable Importance in Projection; OPLS(-DA): Numerical vector of Variable Importance for Prediction (VIP4,p from Galindo-Prieto et al., 2014)

orthoVipVn OPLS(-DA): Numerical vector of Variable Importance for Orthogonal Modeling (VIP4,o from Galindo-Prieto et al., 2014)

xMeanVn Numerical vector: variable means of the 'x' matrix

xSdVn Numerical vector: variable standard deviations of the 'x' matrix

yMeanVn (O)PLS: Numerical vector: variable means of the 'y' response (transformed into a dummy matrix in case it is of 'character' mode initially)

ySdVn (O)PLS: Numerical vector: variable standard deviations of the 'y' response (transformed into a dummy matrix in case it is of 'character' mode initially)

xZeroVarVi Numerical vector: indices of variables with variance < 2.22e-16 which were excluded from 'x' before building the model

scoreMN Numerical matrix of x scores (T; dimensions: nrow(x) x predI); X = TP' + E; Y = TC' + F

loadingMN Numerical matrix of x loadings (P; dimensions: ncol(x) x predI); X = TP' + E

weightMN (O)PLS: Numerical matrix of x weights (W; same dimensions as loadingMN)

orthoScoreMN OPLS: Numerical matrix of orthogonal scores (Tortho; dimensions: nrow(x) x number of orthogonal components)

orthoLoadingMN OPLS: Numerical matrix of orthogonal loadings (Portho; dimensions: ncol(x) x number of orthogonal components)

orthoWeightMN OPLS: Numerical matrix of orthogonal weights (same dimensions as orthoLoadingMN)

cMN (O)PLS: Numerical matrix of Y weights (C; dimensions: number of responses, or number of classes in case of a qualitative response, x number of predictive components); Y = TC' + F

coMN (O)PLS: Numerical matrix of Y orthogonal weights (dimensions: number of responses, or number of classes in case of a qualitative response with more than 2 classes, x number of orthogonal components)

uMN (O)PLS: Numerical matrix of Y scores (U; same dimensions as scoreMN); Y = UC' + G

weightStarMN Numerical matrix of projections (W*; same dimensions as loadingMN); whereas columns of weightMN are derived from successively deflated matrices, columns of weightStarMN relate to the original 'x' matrix: T = XW*; W* = W(P'W)^-1

suppLs List of additional objects to be used internally by the 'print', 'plot', and 'predict' methods
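
The relations quoted for the score, loading and weight slots (X = TP' + E, and T = XW* with W* equal to W times the inverse of P'W) can be checked numerically on a small single-response PLS fit. The sketch below is a textbook NIPALS PLS1 written in base R under simplifying assumptions (no missing values, mean-centering only); the function name `pls1_nipals` is hypothetical and the list element names merely mirror the slot names, so this is not the package's implementation.

```r
## Hypothetical textbook NIPALS for single-response PLS (no NAs, centering only).
pls1_nipals <- function(x, y, ncomp) {
  x <- sweep(x, 2, colMeans(x), "-")
  y <- y - mean(y)
  W <- P <- matrix(0, ncol(x), ncomp)
  Tm <- matrix(0, nrow(x), ncomp)
  cVn <- numeric(ncomp)
  xk <- x
  yk <- y
  for (h in seq_len(ncomp)) {
    w <- drop(crossprod(xk, yk))
    w <- w / sqrt(sum(w^2))                     # x weights (unit norm)
    tVn <- drop(xk %*% w)                       # x scores
    pVn <- drop(crossprod(xk, tVn)) / sum(tVn^2) # x loadings
    cVn[h] <- sum(yk * tVn) / sum(tVn^2)        # y weight
    xk <- xk - tcrossprod(tVn, pVn)             # deflate x
    yk <- yk - cVn[h] * tVn                     # deflate y
    W[, h] <- w
    P[, h] <- pVn
    Tm[, h] <- tVn
  }
  ## W* maps the original (centered) x directly to the scores
  Wstar <- W %*% solve(crossprod(P, W))
  list(scoreMN = Tm, loadingMN = P, weightMN = W,
       weightStarMN = Wstar, cVn = cVn, xCenteredMN = x)
}

set.seed(42)
xMN <- matrix(rnorm(60), nrow = 20, ncol = 3)
yVn <- drop(xMN %*% c(1, -2, 0.5)) + rnorm(20, sd = 0.1)
fit <- pls1_nipals(xMN, yVn, ncomp = 2)
## largest absolute deviation between X W* and T
max(abs(fit$xCenteredMN %*% fit$weightStarMN - fit$scoreMN))
```

The columns of `weightMN` apply to the successively deflated matrices, which is why the scores cannot be recovered as XW for more than one component, whereas XW* reproduces them from the original centered matrix.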

Etienne Thevenot, etienne.thevenot@cea.fr

Eriksson et al. (2006). Multi- and Megavariate Data Analysis. Umetrics Academy.
Rosipal and Kramer (2006). Overview and recent advances in partial least squares.
Tenenhaus (1990). La regression PLS : theorie et pratique. Technip.
Wehrens (2011). Chemometrics with R. Springer.
Wold et al. (2001). PLS-regression: a basic tool of chemometrics.

```
#### PCA
data(foods) ## see Eriksson et al. (2001); presence of 3 missing values (NA)
head(foods)
foodMN <- as.matrix(foods[, colnames(foods) != "Country"])
rownames(foodMN) <- foods[, "Country"]
head(foodMN)
foo.pca <- opls(foodMN)

#### PLS with a single response
data(cornell) ## see Tenenhaus, 1998
head(cornell)
cornell.pls <- opls(as.matrix(cornell[, grep("x", colnames(cornell))]),
                    cornell[, "y"])
## Complementary graphics
plot(cornell.pls, typeVc = c("outlier", "predict-train", "xy-score", "xy-weight"))

#### PLS with multiple (quantitative) responses
data(lowarp) ## see Eriksson et al. (2001); presence of NAs
head(lowarp)
lowarp.pls <- opls(as.matrix(lowarp[, c("glas", "crtp", "mica", "amtp")]),
                   as.matrix(lowarp[, grepl("^wrp", colnames(lowarp)) |
                                      grepl("^st", colnames(lowarp))]))

#### PLS-DA
data(sacurine)
attach(sacurine)
sacurine.plsda <- opls(dataMatrix, sampleMetadata[, "gender"])

#### OPLS-DA
sacurine.oplsda <- opls(dataMatrix, sampleMetadata[, "gender"], predI = 1, orthoI = NA)
detach(sacurine)
```
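
The PCA example above runs on a matrix containing missing values, the case in which 'nipals' is used instead of 'svd'. To illustrate why NIPALS tolerates NAs, the following base-R sketch extracts a first principal component by alternating regressions that simply skip the missing cells. The function name `nipals_pc1` is hypothetical, the code is a generic textbook variant rather than the package's own routine, and it assumes that 'x' is already centered and that every row and column has at least one observed value.

```r
## Hypothetical sketch: first principal component by NIPALS, skipping NAs.
## Assumes 'x' is centered and has no entirely missing row or column.
nipals_pc1 <- function(x, maxIter = 500, tol = 1e-12) {
  obsMN <- (!is.na(x)) * 1        # 1 where observed, 0 where missing
  x0 <- x
  x0[is.na(x)] <- 0               # missing cells contribute nothing to sums
  tVn <- x0[, which.max(colSums(x0^2))]  # start from the largest column
  for (i in seq_len(maxIter)) {
    ## loadings: regress each column on the scores, over observed rows only
    pVn <- drop(crossprod(x0, tVn) / crossprod(obsMN, tVn^2))
    pVn <- pVn / sqrt(sum(pVn^2))
    ## scores: regress each row on the loadings, over observed columns only
    tNew <- drop((x0 %*% pVn) / (obsMN %*% pVn^2))
    converged <- sum((tNew - tVn)^2) <= tol * sum(tNew^2)
    tVn <- tNew
    if (converged) break
  }
  list(scoreVn = tVn, loadingVn = pVn)
}

set.seed(1)
latentVn <- rnorm(30)
xMN <- outer(latentVn, c(1, 2, -1, 0.5)) + 0.05 * matrix(rnorm(120), 30, 4)
xMN <- sweep(xMN, 2, colMeans(xMN), "-")
xMN[2, 3] <- NA                   # introduce a missing value
pc1 <- nipals_pc1(xMN)
pc1$loadingVn
```

On complete data this iteration converges to the same direction (up to sign) as the first component returned by an SVD-based PCA; with NAs, each regression is simply restricted to the observed entries instead of imputing them.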
