tune.splsda | R Documentation |
Computes M-fold or Leave-One-Out Cross-Validation scores on a user-input
grid to determine optimal values for the parameters in splsda.
tune.splsda(
X,
Y,
ncomp = 1,
test.keepX = NULL,
already.tested.X,
scale = TRUE,
logratio = c("none", "CLR"),
max.iter = 100,
tol = 1e-06,
near.zero.var = FALSE,
multilevel = NULL,
validation = "Mfold",
folds = 10,
nrepeat = 1,
signif.threshold = 0.01,
dist = "max.dist",
measure = "BER",
auc = FALSE,
progressBar = FALSE,
light.output = TRUE,
BPPARAM = SerialParam(),
seed = NULL
)
X |
numeric matrix of predictors. |
Y |
|
ncomp |
the number of components to include in the model. |
test.keepX |
numeric vector for the different number of variables to
test from the |
already.tested.X |
Optional, if |
scale |
Logical. If scale = TRUE, each block is standardized to zero means and unit variances (default: TRUE) |
logratio |
one of ('none','CLR'). Defaults to 'none'. |
max.iter |
integer, the maximum number of iterations. |
tol |
Convergence stopping value. |
near.zero.var |
Logical, see the internal |
multilevel |
Design matrix for multilevel analysis (for repeated measurements) that indicates the repeated measures on each individual, i.e. the individuals ID. See Details. |
validation |
character. What kind of (internal) validation to use,
matching one of |
folds |
the folds in the Mfold cross-validation. See Details. |
nrepeat |
Number of times the Cross-Validation process is repeated. |
signif.threshold |
numeric between 0 and 1 indicating the significance threshold required for improvement in error rate of the components. Defaults to 0.01. |
dist |
distance metric to use for |
measure |
Three misclassification measures are available: overall
misclassification error |
auc |
if |
progressBar |
by default set to |
light.output |
if set to FALSE, the prediction/classification of each
sample for each of |
BPPARAM |
A BiocParallelParam object indicating the type
of parallelisation. See examples in |
seed |
set a number here if you want the function to give reproducible outputs. Not recommended during exploratory analysis. Note if RNGseed is set in 'BPPARAM', this will be overwritten by 'seed'. |
This tuning function should be used to tune the parameters in the
splsda function (number of components and number of variables in
keepX to select).
For a sPLS-DA, M-fold or LOO cross-validation is performed with stratified subsampling where all classes are represented in each fold.
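The idea of stratified fold assignment can be illustrated in base R. This is a minimal sketch, not the code used inside tune.splsda; the helper name `stratified_folds` is hypothetical and the round-robin scheme is an assumption made for illustration.

```r
# Hypothetical helper (not part of mixOmics): assign each sample to one of
# M folds, class by class, so every class is spread across the folds.
stratified_folds <- function(y, M) {
  y <- as.factor(y)
  fold <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    # shuffle the samples of this class (guard: sample() on a single
    # number would sample from 1:idx), then deal them out round-robin
    if (length(idx) > 1) idx <- sample(idx)
    fold[idx] <- rep_len(seq_len(M), length(idx))
  }
  fold
}

set.seed(1)
y <- factor(rep(c("A", "B", "C"), times = c(12, 9, 6)))
fold <- stratified_folds(y, M = 3)
table(y, fold)  # every class appears in every fold
```

With class sizes that are multiples of M, as here, each class contributes the same number of samples to every fold.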
If validation = "loo", leave-one-out cross-validation is performed.
By default folds is set to the number of unique individuals.
The function outputs the optimal number of components that achieves the best
performance based on the overall error rate or BER. The assessment is
data-driven and similar to the process detailed in (Rohart et al., 2016),
where one-sided t-tests assess whether there is a gain in performance when
adding a component to the model. Our experience has shown that in most cases
the optimal number of components is the number of categories in Y - 1,
but it is worth tuning a few extra components to check (see our website
and case studies for more details).
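The one-sided t-test idea can be sketched in base R as follows. This is a simplified illustration, not the internal mixOmics procedure: the helper `choose_ncomp`, the layout of `err` (one row per repeat, one column per component) and the error values are all assumptions.

```r
# Hypothetical helper: err is an nrepeat x ncomp matrix of error rates.
# A later component replaces the current best only if its error rate is
# significantly lower according to a one-sided t-test.
choose_ncomp <- function(err, signif.threshold = 0.01) {
  best <- 1
  for (k in seq_len(ncol(err))[-1]) {
    # H1: mean error with `best` components > mean error with k components
    p <- t.test(err[, best], err[, k], alternative = "greater")$p.value
    if (!is.na(p) && p < signif.threshold) best <- k
  }
  best
}

# Made-up error rates over 5 repeats for 3 components
err <- cbind(comp1 = c(0.40, 0.42, 0.38, 0.41, 0.39),
             comp2 = c(0.25, 0.27, 0.23, 0.26, 0.24),
             comp3 = c(0.25, 0.26, 0.24, 0.27, 0.23))
choose_ncomp(err)  # component 3 brings no significant gain over component 2
```

In this toy input the second component clearly lowers the error, while the third has the same mean error as the second, so the sketch stops at two components.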
For sPLS-DA multilevel one-factor analysis, M-fold or LOO cross-validation is performed where all repeated measurements of one sample are in the same fold. Note that logratio transform and the multilevel analysis are performed internally and independently on the training and test set.
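Keeping all repeated measurements of an individual in the same fold amounts to drawing folds over individuals rather than over rows. A minimal base R sketch, assuming a hypothetical helper `grouped_folds` and made-up subject IDs (this is not the mixOmics implementation):

```r
# Hypothetical helper: draw folds over individuals, then map them back to
# rows, so all measurements of one individual share a fold.
grouped_folds <- function(id, M) {
  ids <- as.character(unique(id))
  fold_of_id <- rep_len(seq_len(M), length(ids))
  names(fold_of_id) <- sample(ids)        # random individual-to-fold mapping
  unname(fold_of_id[as.character(id)])    # look up the fold of each row
}

set.seed(1)
id <- rep(paste0("subj", 1:6), each = 4)  # 6 individuals, 4 measurements each
fold <- grouped_folds(id, M = 3)
table(id, fold)  # each individual occupies exactly one fold
```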
For a sPLS-DA multilevel two-factor analysis, the correlation between
components from the within-subject variation of X and the cond matrix
is computed on the whole data set. We cannot obtain a cross-validation
error rate as for the sPLS-DA one-factor analysis because of the
difficulty of decomposing and predicting the within matrices within
each fold.
For a sPLS two-factor analysis a sPLS canonical mode is run, and the correlation between components from the within-subject variation of X and Y is computed on the whole data set.
If validation = "Mfold", M-fold cross-validation is performed. The
number of folds to generate is set with folds.
If auc = TRUE and there are more than 2 categories in Y, the
Area Under the Curve is averaged using one-vs-all comparison. Note however
that the AUC criterion may not be particularly insightful, as the prediction
threshold we use in sPLS-DA differs from an AUC threshold (sPLS-DA relies on
prediction distances for predictions; see ?predict.splsda and the
supplemental material of the mixOmics article (Rohart et al. 2017) for more
details). If you want the AUC criterion to be insightful, you should use
measure==AUC, as this will output the number of variables that
maximises the AUC; in this case there is no prediction threshold from
sPLS-DA (dist is not used). If measure==AUC, we do not output
SD as this measure can be a mean (over nrepeat) of means (over the
categories).
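One-vs-all averaging of the AUC can be sketched in base R. This is an illustration under assumptions, not the mixOmics code: the helpers `auc_binary` and `auc_one_vs_all` are hypothetical, the AUC is computed with the rank (Mann-Whitney) formula, and the score matrix is made up.

```r
# AUC of a score for a binary outcome via the rank (Mann-Whitney) formula
auc_binary <- function(score, is_pos) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical helper: scores is an n x K matrix with one named column per
# class; each class is compared against all the others, then AUCs are averaged
auc_one_vs_all <- function(scores, y) {
  y <- as.factor(y)
  per_class <- sapply(levels(y),
                      function(cls) auc_binary(scores[, cls], y == cls))
  mean(per_class)
}

y <- factor(c("A", "A", "B", "B", "C", "C"))
scores <- cbind(A = c(0.9, 0.8, 0.1, 0.2, 0.3, 0.1),
                B = c(0.1, 0.3, 0.7, 0.2, 0.4, 0.1),
                C = c(0.2, 0.1, 0.3, 0.1, 0.8, 0.9))
auc_one_vs_all(scores, y)  # mean of the per-class AUCs 1, 0.75 and 1
```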
BER is appropriate in case of an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. BER is less biased towards majority classes during the performance assessment.
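The difference between the overall error rate and BER on unbalanced classes can be shown with a small base R sketch. It assumes the common definition of BER as the mean of the per-class error rates; the helper `ber` and the toy labels are hypothetical.

```r
# Hypothetical helper: balanced error rate as the mean of the per-class
# error rates, so each class counts equally regardless of its size
ber <- function(truth, pred) {
  wrong <- as.character(pred) != as.character(truth)
  per_class <- tapply(wrong, as.factor(truth), mean)
  mean(per_class)
}

# Unbalanced toy data: a classifier that always predicts the majority class
truth <- c(rep("A", 90), rep("B", 10))
pred  <- rep("A", 100)

overall <- mean(pred != truth)  # 0.1: looks good, driven by class A
balanced <- ber(truth, pred)    # 0.5: class B is always misclassified
```

The overall error rate rewards the classifier for getting the majority class right, whereas BER exposes that the minority class is never predicted correctly.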
More details about the prediction distances can be found in ?predict and the
supplemental material of the mixOmics article (Rohart et al. 2017).
If test.keepX is set to NULL, the perf() function will be run internally,
which performs cross-validation to identify the optimal number of components
and distance measure. Running tuning initially using test.keepX = NULL speeds
up the parameter tuning workflow, as a lower ncomp value can then be used for
variable selection tuning.
Depending on the type of analysis performed, a list that contains:
error.rate |
returns the prediction error for each |
choice.keepX |
returns the number of variables selected (optimal keepX) on each component. |
choice.ncomp |
returns the optimal number of
components for the model fitted with |
error.rate.class |
returns the error rate for each level of |
If test.keepX = NULL, produces a matrix of classification
error rate estimates. The dimensions correspond to the components in the
model and to the prediction method used, respectively. Note that error rates
reported for any component include the performance of the model on earlier
components for the specified keepX parameters (e.g. the error rate
reported for component 3 for keepX = 20 already includes the fitted
model on components 1 and 2 for keepX = 20).
predict |
Prediction values for each sample, each |
class |
Predicted class for each sample, each |
auc |
AUC mean and standard deviation if the number of categories in
|
cor.value |
only if multilevel analysis with 2 factors: correlation between latent variables. |
Kim-Anh Lê Cao, Benoit Gautier, Francois Bartolo, Florian Rohart, Al J Abadi
mixOmics article:
Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752
splsda, predict.splsda and
http://www.mixOmics.org for more details.
## First example: analysis with sPLS-DA
data(breast.tumors)
X = breast.tumors$gene.exp
Y = as.factor(breast.tumors$sample$treatment)
# first tune on components only
tune = tune.splsda(X, Y, ncomp = 5, logratio = "none",
                   nrepeat = 10, folds = 10,
                   test.keepX = NULL,
                   dist = "all",
                   progressBar = TRUE,
                   seed = 20) # set for reproducibility of example only
plot(tune) # optimal distance = centroids.dist
tune$choice.ncomp # optimal component number = 3
# then tune optimal keepX for each component
tune = tune.splsda(X, Y, ncomp = 3, logratio = "none",
                   nrepeat = 10, folds = 10,
                   test.keepX = c(5, 10, 15), dist = "centroids.dist",
                   progressBar = TRUE,
                   seed = 20)
plot(tune)
tune$choice.keepX # optimal number of variables to keep c(15, 5, 15)
## With already tested variables:
tune = tune.splsda(X, Y, ncomp = 3, logratio = "none",
                   nrepeat = 10, folds = 10,
                   test.keepX = c(5, 10, 15), already.tested.X = c(5, 10),
                   dist = "centroids.dist",
                   progressBar = TRUE,
                   seed = 20)
plot(tune)
## Second example: multilevel one-factor analysis with sPLS-DA
data(vac18)
X = vac18$genes
Y = vac18$stimulation
# sample indicates the repeated measurements
design = data.frame(sample = vac18$sample)
# tune on components
tune = tune.splsda(X, Y = Y, ncomp = 5, nrepeat = 10, logratio = "none",
                   test.keepX = NULL, folds = 10, dist = "max.dist",
                   multilevel = design)
plot(tune)
# tune on variables
tune = tune.splsda(X, Y = Y, ncomp = 3, nrepeat = 10, logratio = "none",
                   test.keepX = c(5, 50, 100), folds = 10, dist = "max.dist",
                   multilevel = design)
plot(tune)