tune: Wrapper function to tune pls-derived methods.

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

This function uses repeated cross-validation to tune hyperparameters such as the number of features to select and possibly the number of components to extract.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
tune(
  method = c("spls", "splsda", "mint.splsda", "rcc", "pca", "spca"),
  X,
  Y,
  multilevel = NULL,
  ncomp,
  study,
  test.keepX = c(5, 10, 15),
  test.keepY = NULL,
  already.tested.X,
  already.tested.Y,
  mode = c("regression", "canonical", "invariant", "classic"),
  nrepeat = 1,
  grid1 = seq(0.001, 1, length = 5),
  grid2 = seq(0.001, 1, length = 5),
  validation = "Mfold",
  folds = 10,
  dist = "max.dist",
  measure = ifelse(method == "spls", "cor", "BER"),
  auc = FALSE,
  progressBar = FALSE,
  near.zero.var = FALSE,
  logratio = c("none", "CLR"),
  center = TRUE,
  scale = TRUE,
  max.iter = 100,
  tol = 1e-09,
  light.output = TRUE,
  cpus = 1
)

Arguments

method

This parameter is used to pass all other argument to the suitable function. method has to be one of the following: "spls", "splsda", "mint.splsda", "rcc", "pca", "spca" or "pls".

X

numeric matrix of predictors. NAs are allowed.

Y

Either a factor or a class vector for the discrete outcome, or a numeric vector or matrix of continuous responses (for multi-response models).

multilevel

Design matrix for multilevel analysis (for repeated measurements) that indicates the repeated measures on each individual, i.e. the individuals ID. See Details.

ncomp

the number of components to include in the model.

study

grouping factor indicating which samples are from the same study

test.keepX

numeric vector for the different number of variables to test from the X data set

test.keepY

If method = 'spls', numeric vector for the different number of variables to test from the Y data set

already.tested.X

Optional, if ncomp > 1 A numeric vector indicating the number of variables to select from the X data set on the firsts components.

already.tested.Y

if method = 'spls' and if(ncomp > 1) numeric vector indicating the number of variables to select from the Y data set on the first components

mode

character string. What type of algorithm to use, (partially) matching one of "regression", "canonical", "invariant" or "classic". See Details.

nrepeat

Number of times the Cross-Validation process is repeated.

grid1, grid2

vector numeric defining the values of lambda1 and lambda2 at which cross-validation score should be computed. Defaults to grid1=grid2=seq(0.001, 1, length=5).

validation

character. What kind of (internal) validation to use, matching one of "Mfold" or "loo" (see below). Default is "Mfold".

folds

the folds in the Mfold cross-validation. See Details.

dist

distance metric to estimate the classification error rate, should be a subset of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details).

measure

The tuning measure used for different methods. See details.

auc

if TRUE calculate the Area Under the Curve (AUC) performance of the model.

progressBar

by default set to TRUE to output the progress bar of the computation.

near.zero.var

Logical, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default value is FALSE

logratio

one of ('none','CLR'). Default to 'none'

center

a logical value indicating whether the variables should be shifted to be zero centered. Alternately, a vector of length equal the number of columns of X can be supplied. The value is passed to scale.

scale

a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is FALSE for consistency with prcomp function, but in general scaling is advisable. Alternatively, a vector of length equal the number of columns of X can be supplied. The value is passed to scale.

max.iter

Integer, the maximum number of iterations.

tol

Numeric, convergence tolerance criteria.

light.output

if set to FALSE, the prediction/classification of each sample for each of test.keepX and each comp is returned.

cpus

Integer, number of cores to use for parallel processing. Currently only available for method = "spls"

Details

See the help file corresponding to the corresponding method, e.g. tune.splsda for further details. Note that only the arguments used in the tune function corresponding to method are passed on.

More details about the prediction distances in ?predict and the supplemental material of the mixOmics article (Rohart et al. 2017). More details about the PLS modes are in ?pls.

Value

Depending on the type of analysis performed and the input arguments, a list that may contain:

error.rate

returns the prediction error for each test.keepX on each component, averaged across all repeats and subsampling folds. Standard deviation is also output. All error rates are also available as a list.

choice.keepX

returns the number of variables selected (optimal keepX) on each component.

choice.ncomp

For supervised models; returns the optimal number of components for the model for each prediction distance using one-sided t-tests that test for a significant difference in the mean error rate (gain in prediction) when components are added to the model. See more details in Rohart et al 2017 Suppl. For more than one block, an optimal ncomp is returned for each prediction framework.

error.rate.class

returns the error rate for each level of Y and for each component computed with the optimal keepX

predict

Prediction values for each sample, each test.keepX, each comp and each repeat. Only if light.output=FALSE

class

Predicted class for each sample, each test.keepX, each comp and each repeat. Only if light.output=FALSE

auc

AUC mean and standard deviation if the number of categories in Y is greater than 2, see details above. Only if auc = TRUE

cor.value

only if multilevel analysis with 2 factors: correlation between latent variables.

Author(s)

Florian Rohart, Francois Bartolo, Kim-Anh Lê Cao, Al J Abadi

References

Singh A., Shannon C., Gautier B., Rohart F., Vacher M., Tebbutt S. and Lê Cao K.A. (2019), DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 3055–3062.

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

MINT:

Rohart F, Eslami A, Matigian, N, Bougeard S, Lê Cao K-A (2017). MINT: A multivariate integrative approach to identify a reproducible biomarker signature across multiple experiments and platforms. BMC Bioinformatics 18:128.

PLS and PLS citeria for PLS regression: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.

Chavent, Marie and Patouille, Brigitte (2003). Calcul des coefficients de regression et du PRESS en regression PLS1. Modulad n, 30 1-11. (this is the formula we use to calculate the Q2 in perf.pls and perf.spls)

Mevik, B.-H., Cederkvist, H. R. (2004). Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Journal of Chemometrics 18(9), 422-429.

sparse PLS regression mode:

Lê Cao, K. A., Rossouw D., Robert-Granie, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.

One-sided t-tests (suppl material):

Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ 4:e1845.

See Also

tune.rcc, tune.mint.splsda, tune.pca, tune.splsda, tune.splslevel and http://www.mixOmics.org for more details.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
## sPLS-DA
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- as.factor(breast.tumors$sample$treatment)
tune= tune(method = "splsda", X, Y, ncomp=1, nrepeat=10, logratio="none",
test.keepX = c(5, 10, 15), folds=10, dist="max.dist", progressBar = TRUE)

plot(tune)

## Not run: 
## mint.splsda

data(stemcells)
data = stemcells$gene
type.id = stemcells$celltype
exp = stemcells$study

out = tune(method="mint.splsda", X=data,Y=type.id, ncomp=2, study=exp, test.keepX=seq(1,10,1))
out$choice.keepX

plot(out)

## End(Not run)

mixOmics documentation built on April 15, 2021, 6:01 p.m.