tune.splsda: Tuning functions for sPLS-DA method

Description

Computes M-fold or Leave-One-Out Cross-Validation scores on a user-input grid to determine optimal values for the parameters in splsda.

Usage

tune.splsda(
  X,
  Y,
  ncomp = 1,
  test.keepX = NULL,
  already.tested.X,
  scale = TRUE,
  logratio = c("none", "CLR"),
  max.iter = 100,
  tol = 1e-06,
  near.zero.var = FALSE,
  multilevel = NULL,
  validation = "Mfold",
  folds = 10,
  nrepeat = 1,
  signif.threshold = 0.01,
  dist = "max.dist",
  measure = "BER",
  auc = FALSE,
  progressBar = FALSE,
  light.output = TRUE,
  BPPARAM = SerialParam(),
  seed = NULL
)

Arguments

X

numeric matrix of predictors. NAs are allowed.

Y

a factor or a class vector for the discrete outcome.

ncomp

the number of components to include in the model.

test.keepX

numeric vector for the different number of variables to test from the X data set. If set to NULL, tuning will be performed on ncomp using all variables in the X data set.

already.tested.X

Optional, if ncomp > 1. A numeric vector indicating the number of variables to select from the X data set on the first components.

scale

Logical. If scale = TRUE, each block is standardized to zero means and unit variances (default: TRUE)

logratio

one of 'none' or 'CLR'. Defaults to 'none'.

max.iter

integer, the maximum number of iterations.

tol

Convergence stopping value.

near.zero.var

Logical, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default value is FALSE

multilevel

Design matrix for multilevel analysis (for repeated measurements) that indicates the repeated measures on each individual, i.e. the individuals ID. See Details.

validation

character. What kind of (internal) validation to use, matching one of "Mfold" or "loo" (short for 'leave-one-out'). Default is "Mfold".

folds

the folds in the Mfold cross-validation. See Details.

nrepeat

Number of times the Cross-Validation process is repeated.

signif.threshold

numeric between 0 and 1 indicating the significance threshold required for improvement in error rate of the components. Defaults to 0.01.

dist

distance metric to use for splsda to estimate the classification error rate; should be one of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details). If test.keepX = NULL, multiple distances or "all" can be supplied.

measure

Three misclassification measures are available: the overall misclassification error (overall), the Balanced Error Rate (BER) or the Area Under the Curve (AUC). Only used when test.keepX is not NULL.

auc

if TRUE, calculate the Area Under the Curve (AUC) performance of the model based on the optimisation criterion specified in measure.

progressBar

by default set to FALSE; set to TRUE to output the progress bar of the computation.

light.output

if set to FALSE, the prediction/classification of each sample for each of test.keepX and each comp is returned.

BPPARAM

A BiocParallelParam object indicating the type of parallelisation. See examples in ?tune.spca; a short parallel sketch is also given at the end of the Examples below.

seed

set a number here if you want the function to give reproducible outputs. Not recommended during exploratory analysis. Note if RNGseed is set in 'BPPARAM', this will be overwritten by 'seed'.

Details

This tuning function should be used to tune the parameters in the splsda function (number of components and number of variables in keepX to select).

For a sPLS-DA, M-fold or LOO cross-validation is performed with stratified subsampling where all classes are represented in each fold.

If validation = "loo", leave-one-out cross-validation is performed. By default folds is set to the number of unique individuals. A short leave-one-out sketch is included in the Examples below.

The function outputs the optimal number of components that achieve the best performance based on the overall error rate or BER. The assessment is data-driven and similar to the process detailed in (Rohart et al., 2016), where one-sided t-tests assess whether there is a gain in performance when adding a component to the model. Our experience has shown that in most cases, the optimal number of components is the number of categories in Y - 1, but it is worth tuning a few extra components to check (see our website and case studies for more details).

For sPLS-DA multilevel one-factor analysis, M-fold or LOO cross-validation is performed where all repeated measurements of one sample are in the same fold. Note that logratio transform and the multilevel analysis are performed internally and independently on the training and test set.

For a sPLS-DA multilevel two-factor analysis, the correlation between components from the within-subject variation of X and the cond matrix is computed on the whole data set. The reason why we cannot obtain a cross-validation error rate, as for the sPLS-DA one-factor analysis, is the difficulty of decomposing and predicting the within matrices within each fold.

For a sPLS two-factor analysis a sPLS canonical mode is run, and the correlation between components from the within-subject variation of X and Y is computed on the whole data set.

If validation = "Mfold", M-fold cross-validation is performed. How many folds to generate is selected by specifying the number of folds in folds.

If auc = TRUE and there are more than 2 categories in Y, the Area Under the Curve is averaged using one-vs-all comparisons. Note however that the AUC criterion may not be particularly insightful, as the prediction threshold we use in sPLS-DA differs from an AUC threshold (sPLS-DA relies on prediction distances for predictions, see ?predict.splsda for more details and the supplemental material of the mixOmics article (Rohart et al. 2017)). If you want the AUC criterion to be insightful, you should use measure = "AUC", as this will output the number of variables that maximises the AUC; in this case there is no prediction threshold from sPLS-DA (dist is not used). If measure = "AUC", we do not output the standard deviation, as this measure can be a mean (over nrepeat) of means (over the categories).
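
For illustration, a minimal sketch of tuning on the AUC criterion, using the breast.tumors data from the Examples below (argument values are indicative only):

data(breast.tumors)
tune.auc = tune.splsda(X = breast.tumors$gene.exp,
                       Y = as.factor(breast.tumors$sample$treatment),
                       ncomp = 3, test.keepX = c(5, 10, 15),
                       folds = 10, nrepeat = 5,
                       measure = "AUC", auc = TRUE)
tune.auc$choice.keepX # keepX values that maximise the AUC on each component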

BER is appropriate in case of an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. BER is less biased towards majority classes during the performance assessment.
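
As a plain-R illustration (not a mixOmics function), the BER of a set of predictions can be computed from a confusion matrix as follows:

ber = function(truth, predicted) {
    lev = levels(factor(truth))
    conf = table(factor(truth, levels = lev), factor(predicted, levels = lev))
    per.class.error = 1 - diag(conf) / rowSums(conf) # proportion misclassified within each class
    mean(per.class.error)                            # averaged over classes, so class sizes do not dominate
}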

More details about the prediction distances are provided in ?predict and in the supplemental material of the mixOmics article (Rohart et al. 2017).

If test.keepX is set to NULL, the perf() function will be run internally, which performs cross-validation to identify the optimal number of components and distance measure. Running the tuning initially with test.keepX = NULL speeds up the parameter tuning workflow, as a lower ncomp value can then be used for the subsequent variable selection tuning.

Value

Depending on the type of analysis performed, a list that contains:

error.rate

returns the prediction error for each test.keepX on each component, averaged across all repeats and subsampling folds. Standard deviation is also output. All error rates are also available as a list.

choice.keepX

returns the number of variables selected (optimal keepX) on each component.

choice.ncomp

returns the optimal number of components for the model fitted with $choice.keepX

error.rate.class

returns the error rate for each level of Y and for each component computed with the optimal keepX

If test.keepX = NULL, produces a matrix of classification error rate estimation. The dimensions correspond to the components in the model and to the prediction method used, respectively. Note that error rates reported for any component include the performance of the model on earlier components for the specified keepX parameters (e.g. the error rate reported for component 3 with keepX = 20 already includes the fitted model on components 1 and 2 for keepX = 20).

predict

Prediction values for each sample, each test.keepX, each comp and each repeat. Only if light.output=FALSE

class

Predicted class for each sample, each test.keepX, each comp and each repeat. Only if light.output=FALSE

auc

AUC mean and standard deviation if the number of categories in Y is greater than 2, see details above. Only if auc = TRUE

cor.value

only if multilevel analysis with 2 factors: correlation between latent variables.

Author(s)

Kim-Anh Lê Cao, Benoit Gautier, Francois Bartolo, Florian Rohart, Al J Abadi

References

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752.

See Also

splsda, predict.splsda and http://www.mixOmics.org for more details.

Examples

## First example: analysis with sPLS-DA
data(breast.tumors)
X = breast.tumors$gene.exp
Y = as.factor(breast.tumors$sample$treatment)

# first tune on components only
tune = tune.splsda(X, Y, ncomp = 5, logratio = "none",
                   nrepeat = 10, folds = 10,
                   test.keepX = NULL, 
                   dist = "all",
                   progressBar = TRUE,
                   seed = 20) # set for reproducibility of example only
plot(tune) # optimal distance = centroids.dist
tune$choice.ncomp # optimal component number = 3

# then tune optimal keepX for each component
tune = tune.splsda(X, Y, ncomp = 3, logratio = "none",
                   nrepeat = 10, folds = 10, 
                   test.keepX = c(5, 10, 15), dist = "centroids.dist",
                   progressBar = TRUE,
                   seed = 20)

plot(tune)
tune$choice.keepX # optimal number of variables to keep c(15, 5, 15)

## With already tested variables:
tune = tune.splsda(X, Y, ncomp = 3, logratio = "none",
                   nrepeat = 10, folds = 10, 
                   test.keepX = c(5, 10, 15), already.tested.X = c(5, 10),
                   dist = "centroids.dist",
                   progressBar = TRUE,
                   seed = 20)
plot(tune)
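
## Leave-one-out cross-validation (a minimal sketch, shown for illustration only;
## with validation = "loo", folds is set internally to the number of samples
## and repeating the procedure is not needed)
tune = tune.splsda(X, Y, ncomp = 3, logratio = "none",
                   validation = "loo",
                   test.keepX = c(5, 10, 15), dist = "centroids.dist",
                   progressBar = TRUE)
plot(tune)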

## Second example: multilevel one-factor analysis with sPLS-DA

data(vac18)
X = vac18$genes
Y = vac18$stimulation
# sample indicates the repeated measurements
design = data.frame(sample = vac18$sample)

# tune on components
tune = tune.splsda(X, Y = Y, ncomp = 5, nrepeat = 10, logratio = "none",
                   test.keepX = NULL, folds = 10, dist = "max.dist", multilevel = design)

plot(tune)

# tune on variables
tune = tune.splsda(X, Y = Y, ncomp = 3, nrepeat = 10, logratio = "none",
                   test.keepX = c(5, 50, 100), folds = 10, dist = "max.dist", multilevel = design)

plot(tune)
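
## Parallel computation (a minimal sketch; requires the BiocParallel package,
## which also provides the SerialParam default). The repeated cross-validation
## of the multilevel tuning above is distributed over 2 workers; seed keeps
## the output reproducible across runs.
library(BiocParallel)
tune = tune.splsda(X, Y = Y, ncomp = 3, nrepeat = 10, logratio = "none",
                   test.keepX = c(5, 50, 100), folds = 10, dist = "max.dist",
                   multilevel = design,
                   BPPARAM = SnowParam(workers = 2), seed = 42)
plot(tune)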
