opls: PCA, PLS(-DA), and OPLS(-DA)

Description Usage Arguments Value Author(s) References Examples

Description

PCA, PLS, and OPLS regression, classification, and cross-validation with the NIPALS algorithm

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
## S4 method for signature 'MultiDataSet'
opls(
  x,
  y = NULL,
  fig.pdfC = c("none", "interactive", "myfile.pdf")[2],
  info.txtC = c("none", "interactive", "myfile.txt")[2],
  ...
)

## S4 method for signature 'ExpressionSet'
opls(x, y = NULL, ...)

## S4 method for signature 'data.frame'
opls(x, ...)

## S4 method for signature 'matrix'
opls(
  x,
  y = NULL,
  predI = NA,
  orthoI = 0,
  algoC = c("default", "nipals", "svd")[1],
  crossvalI = 7,
  log10L = FALSE,
  permI = 20,
  scaleC = c("none", "center", "pareto", "standard")[4],
  subset = NULL,
  plotSubC = NA,
  fig.pdfC = c("none", "interactive", "myfile.pdf")[2],
  info.txtC = c("none", "interactive", "myfile.txt")[2],
  printL = TRUE,
  plotL = TRUE,
  .sinkC = NULL,
  ...
)

Arguments

x

Numerical data frame or matrix (observations x variables; NAs are allowed); or ExpressionSet object with non empty exprs, for PCA, and phenoData@data, for (O)PLS(-DA), slots

y

Response to be modelled: Either 1) 'NULL' for PCA (default) or 2) a numerical vector (same length as 'x' row number) for single response (O)PLS, or 3) a numerical matrix (same row number as 'x') for multiple response PLS, 4) a factor (same length as 'x' row number) for (O)PLS-DA, or 5) a character indicating the name of the column of the phenoData@data to be used, when x is an ExpressionSet object. Note that, for convenience, character vectors are also accepted for (O)PLS-DA as well as single column numerical (resp. character) matrix for (O)PLS (respectively (O)PLS-DA). NAs are allowed in numeric responses.

fig.pdfC

Character: File name with '.pdf' extension for the figure; if 'interactive' (default), figures will be displayed interactively; if 'none', no figure will be generated

info.txtC

Character: File name with '.txt' extension for the printed results (call to sink()'); if 'interactive' (default), messages will be printed on the screen; if 'none', no verbose will be generated

...

Currently not used.

predI

Integer: number of components (predictive componenents in case of PLS and OPLS) to extract; for OPLS, predI is (automatically) set to 1; if set to NA [default], autofit is performed: a maximum of 10 components are extracted until (i) PCA case: the variance is less than the mean variance of all components (note that this rule requires all components to be computed and can be quite time-consuming for large datasets) or (ii) PLS case: either R2Y of the component is < 0.01 (N4 rule) or Q2Y is < 0 (for more than 100 observations) or 0.05 otherwise (R1 rule)

orthoI

Integer: number of orthogonal components (for OPLS only); when set to 0 [default], PLS will be performed; otherwise OPLS will be peformed; when set to NA, OPLS is performed and the number of orthogonal components is automatically computed by using the cross-validation (with a maximum of 9 orthogonal components).

algoC

Default algorithm is 'svd' for PCA (in case of no missing values in 'x'; 'nipals' otherwise) and 'nipals' for PLS and OPLS; when asking to use 'svd' for PCA on an 'x' matrix containing missing values, NAs are set to half the minimum of non-missing values and a warning is generated

crossvalI

Integer: number of cross-validation segments (default is 7); The number of samples (rows of 'x') must be at least >= crossvalI

log10L

Should the 'x' matrix be log10 transformed? Zeros are set to 1 prior to transformation

permI

Integer: number of random permutations of response labels to estimate R2Y and Q2Y significance by permutation testing [default is 20 for single response models (without train/test partition), and 0 otherwise]

scaleC

Character: either no centering nor scaling ('none'), mean-centering only ('center'), mean-centering and pareto scaling ('pareto'), or mean-centering and unit variance scaling ('standard') [default]

subset

Integer vector: indices of the observations to be used for training (in a classification scheme); use NULL [default] for no partition of the dataset; use 'odd' for a partition of the dataset in two equal sizes (with respect to the classes proportions)

plotSubC

Character: Graphic subtitle

printL

Logical: deprecated: use the 'info.txtC' argument instead

plotL

Logical: deprecated: use the 'fig.pdfC' argument instead

.sinkC

Character: deprecated: use the 'info.txtC' argument instead

Value

An S4 object of class 'opls' containing the following slots:

Author(s)

Etienne Thevenot, etienne.thevenot@cea.fr

References

Eriksson et al. (2006). Multi- and Megarvariate Data Analysis. Umetrics Academy. Rosipal and Kramer (2006). Overview and recent advances in partial least squares Tenenhaus (1990). La regression PLS : theorie et pratique. Technip. Wehrens (2011). Chemometrics with R. Springer. Wold et al. (2001). PLS-regression: a basic tool of chemometrics

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
## PCA

data(foods) ## see Eriksson et al. (2001); presence of 3 missing values (NA)
head(foods)
foodMN <- as.matrix(foods[, colnames(foods) != "Country"])
rownames(foodMN) <- foods[, "Country"]
head(foodMN)
foo.pca <- opls(foodMN)

## PLS with a single response

data(cornell) ## see Tenenhaus, 1998
head(cornell)
cornell.pls <- opls(as.matrix(cornell[, grep("x", colnames(cornell))]),
                    cornell[, "y"])

## Complementary graphics

plot(cornell.pls, typeVc = c("outlier", "predict-train", "xy-score", "xy-weight"))

#### PLS with multiple (quantitative) responses

data(lowarp) ## see Eriksson et al. (2001); presence of NAs
head(lowarp)
lowarp.pls <- opls(as.matrix(lowarp[, c("glas", "crtp", "mica", "amtp")]),
                   as.matrix(lowarp[, grepl("^wrp", colnames(lowarp)) |
                                      grepl("^st", colnames(lowarp))]))

## PLS-DA

data(sacurine)
attach(sacurine)
sacurine.plsda <- opls(dataMatrix, sampleMetadata[, "gender"])

## OPLS-DA

sacurine.oplsda <- opls(dataMatrix, sampleMetadata[, "gender"], predI = 1, orthoI = NA)

## Application to an ExpressionSet

sacSet <- Biobase::ExpressionSet(assayData = t(dataMatrix), 
                                 phenoData = new("AnnotatedDataFrame", 
                                                 data = sampleMetadata), 
                                 featureData = new("AnnotatedDataFrame", 
                                                   data = variableMetadata),
                                 experimentData = new("MIAME", 
                                                      title = "sacurine"))
                                                      
sacPlsda <- opls(sacSet, "gender")
sacSet <- getEset(sacPlsda)
head(Biobase::pData(sacSet))
head(Biobase::fData(sacSet))

detach(sacurine)

## Application to a MultiDataSet

# Loading the 'NCI60_4arrays' from the 'omicade4' package
data("NCI60_4arrays", package = "omicade4")
# Selecting two of the four datasets
setNamesVc <- c("agilent", "hgu95")
# Creating the MultiDataSet instance
nciMset <- MultiDataSet::createMultiDataSet()
# Adding the two datasets as ExpressionSet instances
for (setC in setNamesVc) {
  # Getting the data
  exprMN <- as.matrix(NCI60_4arrays[[setC]])
  pdataDF <- data.frame(row.names = colnames(exprMN),
                        cancer = substr(colnames(exprMN), 1, 2),
                        stringsAsFactors = FALSE)
  fdataDF <- data.frame(row.names = rownames(exprMN),
                        name = rownames(exprMN),
                        stringsAsFactors = FALSE)
  # Building the ExpressionSet
  eset <- Biobase::ExpressionSet(assayData = exprMN,
                                 phenoData = new("AnnotatedDataFrame",
                                                 data = pdataDF),
                                 featureData = new("AnnotatedDataFrame",
                                                   data = fdataDF),
                                 experimentData = new("MIAME",
                                                      title = setC))
  # Adding to the MultiDataSet
  nciMset <- MultiDataSet::add_eset(nciMset, eset, dataset.type = setC,
                                    GRanges = NA, warnings = FALSE)
}
# Summary of the MultiDataSet
nciMset
# Principal Component Analysis of each data set
nciPca <- ropls::opls(nciMset)
# Coloring the Score plot according to cancer types
ropls::plot(nciPca, y = "cancer", typeVc = "x-score")
# Getting the updated MultiDataSet (now including scores and loadings)
nciMset <- ropls::getMset(nciPca)
# Restricting to the 'ME' and 'LE' cancer types
sampleNamesVc <- Biobase::sampleNames(nciMset[["agilent"]])
cancerTypeVc <- Biobase::pData(nciMset[["agilent"]])[, "cancer"]
nciMset <- nciMset[sampleNamesVc[cancerTypeVc %in% c("ME", "LE")], ]
# Building PLS-DA models for the cancer type, and getting back the updated MultiDataSet
nciPlsda <- ropls::opls(nciMset, "cancer", predI = 2)
nciMset <- ropls::getMset(nciPlsda)
# Viewing the new variable metadata (including VIP and coefficients)
lapply(Biobase::fData(nciMset), head)

ropls documentation built on Nov. 8, 2020, 7:46 p.m.