Description

This function uses repeated cross-validation to tune hyperparameters such as the number of features to select and, optionally, the number of components to extract.
Arguments

X: numeric matrix of predictors with the rows as individual observations. Missing values (NAs) are allowed.

Y: numeric matrix of response(s) with the rows as individual observations matching X.

test.keepX: numeric vector of the different numbers of variables to test from the X data set.

test.keepY: numeric vector of the different numbers of variables to test from the Y data set. Defaults to ncol(Y).

ncomp: positive integer. The number of components to include in the model. Defaults to 2.

validation: character. The kind of (internal) validation to use, matching one of "Mfold" or "loo" (leave-one-out).

nrepeat: positive integer. The number of times the cross-validation process should be repeated.

folds: positive integer. The number of folds in the Mfold cross-validation.

mode: character string indicating the type of PLS algorithm to use. One of "regression", "canonical", "invariant" or "classic".

measure: one of c('cor', 'RSS') indicating the tuning measure. See details.

BPPARAM: a BiocParallelParam object indicating the type of parallelisation. See the BiocParallel package documentation for examples.

progressBar: logical. If TRUE, a progress bar is displayed while the cross-validation runs.

limQ2: Q2 threshold for recommending the optimal ncomp.

...: optional parameters passed on to internal functions.
Value

A list that contains:

cor.pred: the correlation of predicted vs actual components from X (t) and Y (u) for each component.

RSS.pred: the residual sum of squares of predicted vs actual components from X (t) and Y (u) for each component.

choice.keepX: the number of variables selected for X (optimal keepX) on each component.

choice.keepY: the number of variables selected for Y (optimal keepY) on each component.

choice.ncomp: the optimal number of components for the model fitted with the selected keepX and keepY values.

call: the function call, including the parameters used.
Details

During cross-validation (CV), the data are randomly split into M subgroups (folds). M-1 subgroups are then used to train submodels, which are used to compute prediction accuracy statistics for the held-out (test) data. Each subgroup is used as the test data exactly once. If validation = "loo", leave-one-out CV is used, where each group consists of exactly one sample and hence M == N, where N is the number of samples.

The cross-validation process is repeated nrepeat times and the accuracy measures are averaged across repeats. If validation = "loo", the process does not need to be repeated, as there is only one way to split N samples into N groups, and hence nrepeat is forced to be 1.
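The splitting scheme above can be sketched as follows. This is an illustrative Python sketch of the generic M-fold vs leave-one-out logic, not the mixOmics R implementation; `make_folds` is a hypothetical helper name.

```python
import numpy as np

def make_folds(n_samples, folds=None, validation="Mfold", seed=0):
    # Hypothetical sketch of the CV splitting described above.
    if validation == "loo":
        # leave-one-out: each fold holds exactly one sample, so M == N
        return [np.array([i]) for i in range(n_samples)]
    # shuffle the sample indices, then cut them into M roughly equal folds
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, folds)

folds5 = make_folds(10, folds=5)          # M-fold: 5 folds of 2 samples
loo = make_folds(10, validation="loo")    # LOO: 10 singleton folds
```

Every sample index lands in exactly one fold, so each sample is held out exactly once, as described above.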
Two measures of accuracy are available: correlation (cor) and the residual sum of squares (RSS). For cor, the parameters that maximise the correlation between the predicted and the actual components are chosen. The RSS measure tries to predict the held-out data by matrix reconstruction and seeks to minimise the error between actual and predicted values. For mode = 'canonical', the X matrix is used to calculate the RSS, while for other modes the Y matrix is used. This measure gives more weight to large errors and is thus sensitive to outliers. It also intrinsically selects fewer features on the Y block than measure = 'cor'.
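The two measures can be sketched on a single component as follows. This is a minimal Python illustration of the comparison between predicted and actual component scores, under the assumption that predictions are available as plain vectors; `tuning_measures` is a hypothetical name, not a mixOmics function.

```python
import numpy as np

def tuning_measures(actual, predicted):
    # 'cor': correlation between predicted and actual component scores;
    # higher is better, so it is maximised during tuning.
    cor = np.corrcoef(actual, predicted)[0, 1]
    # 'RSS': residual sum of squares of the reconstruction; lower is
    # better, so it is minimised. Squaring gives more weight to large
    # errors, hence the sensitivity to outliers noted above.
    rss = np.sum((actual - predicted) ** 2)
    return cor, rss

t = np.array([1.0, 2.0, 3.0, 4.0])
cor, rss = tuning_measures(t, t)  # a perfect prediction
```

A perfect prediction gives a correlation of 1 and an RSS of 0; any error inflates the RSS quadratically.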
The optimisation process is data-driven and similar to the process detailed in Rohart et al. (2016), where one-sided t-tests assess whether there is a gain in performance when incrementing the number of features or components in the model. However, here the whole provided grid is assessed through pairwise comparisons, as the performance criteria do not always change linearly with the number of added features or components.
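The gain-in-performance test can be sketched as follows. This is an illustrative Python sketch of a one-sided paired t-test over per-fold performance values; the helper name, the made-up data, and the 0.05 significance level are assumptions, not the exact mixOmics procedure.

```python
from scipy.stats import ttest_rel

def keep_larger_model(perf_larger, perf_smaller, alpha=0.05):
    # One-sided paired t-test on per-fold performance values: is the
    # larger model (more features or components) significantly better
    # than the smaller one? If not, the smaller model is retained.
    t_stat, p_val = ttest_rel(perf_larger, perf_smaller,
                              alternative="greater")
    return bool(p_val < alpha)

# per-fold correlations for candidate keepX values (made-up data)
small = [0.60, 0.61, 0.62, 0.59, 0.60]
clear_gain = [0.80, 0.82, 0.85, 0.81, 0.83]
no_gain = [0.60, 0.62, 0.61, 0.58, 0.61]
```

Here `keep_larger_model(clear_gain, small)` accepts the larger model, while `keep_larger_model(no_gain, small)` rejects it, so the cheaper model is kept when the extra features bring no significant gain.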
See also ?perf for more details.
Author(s)

Kim-Anh Lê Cao, Al J Abadi, Benoit Gautier, Francois Bartolo, Florian Rohart.
References

mixOmics article:
Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752
PLS and PLS citeria for PLS regression: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
Chavent, Marie and Patouille, Brigitte (2003). Calcul des coefficients de regression et du PRESS en regression PLS1. Modulad n, 30 1-11. (this is the formula we use to calculate the Q2 in perf.pls and perf.spls)
Mevik, B.-H., Cederkvist, H. R. (2004). Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Journal of Chemometrics 18(9), 422-429.
sparse PLS regression mode:
Lê Cao, K. A., Rossouw D., Robert-Granie, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.
One-sided t-tests (suppl material):
Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ 4:e1845.
See Also

splsda, predict.splsda and http://www.mixOmics.org for more details.
Examples

## Not run:
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
set.seed(42)
tune.res <- tune.spls(X, Y, ncomp = 3,
                      test.keepX = c(5, 10, 15),
                      test.keepY = c(3, 6, 8), measure = "cor",
                      folds = 5, nrepeat = 3, progressBar = TRUE)
tune.res$choice.ncomp
tune.res$choice.keepX
tune.res$choice.keepY
# plot the results
plot(tune.res)
## End(Not run)