tune.mint.splsda: Estimate the parameters of mint.splsda method

tune.mint.splsdaR Documentation

Estimate the parameters of mint.splsda method

Description

Computes Leave-One-Group-Out-Cross-Validation (LOGOCV) scores on a user-input grid to determine optimal values for the parameters in mint.splsda.

Usage

tune.mint.splsda(
  X,
  Y,
  ncomp = 1,
  study,
  test.keepX = NULL,
  already.tested.X,
  scale = TRUE,
  tol = 1e-06,
  max.iter = 100,
  near.zero.var = FALSE,
  signif.threshold = 0.01,
  dist = c("max.dist", "centroids.dist", "mahalanobis.dist"),
  measure = c("BER", "overall"),
  auc = FALSE,
  progressBar = FALSE,
  light.output = TRUE
)

Arguments

X

numeric matrix of predictors. NAs are allowed.

Y

Outcome. Numeric vector or matrix of responses (for multi-response models)

ncomp

Number of components to include in the model (see Details). Default to 1

study

grouping factor indicating which samples are from the same study

test.keepX

numeric vector for the different number of variables to test from the X data set. If set to NULL only number of components will be tuned.

already.tested.X

if ncomp > 1 Numeric vector indicating the number of variables to select from the X data set on the firsts components

scale

Logical. If scale = TRUE, each block is standardized to zero means and unit variances (default: TRUE)

tol

Convergence stopping value.

max.iter

integer, the maximum number of iterations.

near.zero.var

Logical, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default value is FALSE

signif.threshold

numeric between 0 and 1 indicating the significance threshold required for improvement in error rate of the components. Default to 0.01.

dist

only applies to an object inheriting from "plsda" or "splsda" to evaluate the classification performance of the model. Should be a subset of "max.dist", "centroids.dist", "mahalanobis.dist". Default is "all". See predict.

measure

Two misclassification measure are available: overall misclassification error overall or the Balanced Error Rate BER. Only used when test.keepX = NULL.

auc

if TRUE calculate the Area Under the Curve (AUC) performance of the model.

progressBar

by default set to TRUE to output the progress bar of the computation.

light.output

if set to FALSE, the prediction/classification of each sample for each of test.keepX and each comp is returned.

Details

This function performs a Leave-One-Group-Out-Cross-Validation (LOGOCV), where each of study is left out once.

When test.keepX is not NULL, all component 1:\code{ncomp} are tuned to identify number of variables to keep, except the first ones for which a already.tested.X is provided. See examples below.

The function outputs the optimal number of components that achieve the best performance based on the overall error rate or BER. The assessment is data-driven and similar to the process detailed in (Rohart et al., 2016), where one-sided t-tests assess whether there is a gain in performance when adding a component to the model. Our experience has shown that in most case, the optimal number of components is the number of categories in Y - 1, but it is worth tuning a few extra components to check (see our website and case studies for more details).

BER is appropriate in case of an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. BER is less biased towards majority classes during the performance assessment.

More details about the prediction distances in ?predict and the supplemental material of the mixOmics article (Rohart et al. 2017).

Value

The returned value is a list with components:

error.rate

returns the prediction error for each test.keepX on each component, averaged across all repeats and subsampling folds. Standard deviation is also output. All error rates are also available as a list.

choice.keepX

returns the number of variables selected (optimal keepX) on each component.

choice.ncomp

returns the optimal number of components for the model fitted with $choice.keepX

error.rate.class

returns the error rate for each level of Y and for each component computed with the optimal keepX

predict

Prediction values for each sample, each test.keepX and each comp.

class

Predicted class for each sample, each test.keepX and each comp.

If test.keepX = NULL, returns:

study.specific.error

A list that gives BER, overall error rate and error rate per class, for each study

global.error

A list that gives BER, overall error rate and error rate per class for all samples

predict

A list of length ncomp that produces the predicted values of each sample for each class

class

A list which gives the predicted class of each sample for each dist and each of the ncomp components. Directly obtained from the predict output.

auc

AUC values

auc.study

AUC values for each study in mint models

.

Author(s)

Florian Rohart, Al J Abadi

References

Rohart F, Eslami A, Matigian, N, Bougeard S, Lê Cao K-A (2017). MINT: A multivariate integrative approach to identify a reproducible biomarker signature across multiple experiments and platforms. BMC Bioinformatics 18:128.

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

See Also

mint.splsda and http://www.mixOmics.org for more details.

Examples

# set up data
data(stemcells)
data <- stemcells$gene
type.id <- stemcells$celltype
exp <- stemcells$study

# tune number of components
tune_res <- tune.mint.splsda(X = data,Y = type.id, ncomp=5, 
                             near.zero.var=FALSE,
                             study=exp,
                             test.keepX = NULL)

plot(tune_res)
tune_res$choice.ncomp # 1 component

## tune number of variables to keep
tune_res <- tune.mint.splsda(X = data,Y = type.id, ncomp = 1, 
                             near.zero.var = FALSE,
                             study=exp,
                             test.keepX=seq(1,10,1))

plot(tune_res)
tune_res$choice.keepX # 9 variables to keep on component 1

## only tune component 3 and keeping 10 genes on comp1
tune_res <- tune.mint.splsda(X = data, Y = type.id, ncomp = 2, study = exp,
                             already.tested.X = c(9),
                             test.keepX = seq(1,10,1))
plot(tune_res)
tune_res$choice.keepX # 10 variables to keep on comp2

mixOmicsTeam/mixOmics documentation built on Dec. 3, 2024, 11:15 p.m.