rgcca_cv: Tune RGCCA parameters in 'supervised' mode with...


rgcca_cv    R Documentation

Tune RGCCA parameters in 'supervised' mode with cross-validation

Description

Tune the sparsity coefficient (if the model is sparse) or tau (otherwise) in a supervised approach by estimating the predictive quality of the models through cross-validation. To this end, the samples are divided into k folds; the model is tested on each fold in turn and trained on the others. For small datasets (<30 samples), it is recommended to use as many folds as there are individuals (leave-one-out; loo).
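
The fold splitting described above can be sketched in base R. This is a minimal illustration, not the package's internal code: make_folds is a hypothetical helper, and the sample size n is chosen only for the example.

```r
# Hypothetical helper (not part of RGCCA): assign n samples to k folds at random.
make_folds <- function(n, k, seed = 1) {
  set.seed(seed)
  # Repeat the fold labels 1..k until there are n labels, then shuffle them.
  sample(rep(seq_len(k), length.out = n))
}

n <- 47  # number of samples, chosen for illustration
folds <- make_folds(n, k = 5)

# Each fold serves once as the test set; the other folds form the training set.
for (f in sort(unique(folds))) {
  test_idx <- which(folds == f)
  train_idx <- which(folds != f)
  cat("fold", f, ": train =", length(train_idx), ", test =", length(test_idx), "\n")
}

# Leave-one-out is the special case k = n: every sample is its own fold.
loo_folds <- make_folds(n, k = n)
```

With fewer than 30 samples, the leave-one-out variant keeps each training set as large as possible, which is presumably why it is recommended above.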

Usage

rgcca_cv(
  blocks,
  method = "rgcca",
  response = NULL,
  par_type = "tau",
  par_value = NULL,
  par_length = 10,
  validation = "kfold",
  prediction_model = "lm",
  k = 5,
  n_run = 1,
  n_cores = 1,
  quiet = TRUE,
  superblock = FALSE,
  scale = TRUE,
  scale_block = TRUE,
  tol = 1e-08,
  scheme = "factorial",
  NA_method = "nipals",
  rgcca_res = NULL,
  tau = 1,
  ncomp = 1,
  sparsity = 1,
  init = "svd",
  bias = TRUE,
  verbose = TRUE,
  n_iter_max = 1000,
  comp_orth = TRUE,
  metric = NULL,
  ...
)

Arguments

blocks

A list that contains the J blocks of variables X1, X2, ..., XJ. Block Xj is a matrix of dimension n x p_j where n is the number of observations and p_j the number of variables.

method

A character string indicating the multi-block component method to consider. See available_methods for the list of the available methods.

response

Numerical value giving the position of the response block. When the response argument is supplied, the supervised mode is automatically activated.

par_type

A character giving the parameter to tune among "sparsity" or "tau".

par_value

A matrix (n x p, with p the number of blocks and n the number of combinations to be tested), a vector (of length p) or a numeric value giving sets of penalties (tau for RGCCA, sparsity for SGCCA) to be tested, one row per combination. By default, it takes 10 sets between the minimum values (0 for RGCCA and 1/sqrt(ncol) for SGCCA) and 1.

par_length

An integer indicating the number of sets of parameters to be tested (if par_value = NULL). The parameters are uniformly distributed.
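
As a rough illustration of what a uniformly spaced default grid could look like, the sketch below builds one candidate set per row, ranging from the minimum value to 1. make_grid is a hypothetical helper, the block sizes are assumed, and whether the package builds its default grid exactly this way is an assumption.

```r
# Number of variables per block (assumed here: three blocks with 3, 2 and 3 columns).
n_cols <- c(agriculture = 3, industry = 2, politic = 3)
par_length <- 10

# Hypothetical helper: one column per block, one row per candidate combination.
make_grid <- function(min_values, par_length) {
  sapply(min_values, function(lo) seq(lo, 1, length.out = par_length))
}

# tau ranges over [0, 1] for every block.
tau_grid <- make_grid(rep(0, length(n_cols)), par_length)
colnames(tau_grid) <- names(n_cols)

# sparsity ranges over [1/sqrt(p_j), 1], so the lower bound depends on the block.
sparsity_grid <- make_grid(1 / sqrt(n_cols), par_length)

print(round(sparsity_grid, 3))
```

A matrix shaped like these (one row per combination, one column per block) matches the layout described for par_value.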

validation

A character for the type of validation among "loo", "kfold".

prediction_model

A character giving the function used to compare the trained and the tested models.

k

An integer giving the number of folds (if validation = 'kfold').

n_run

An integer giving the number of cross-validations to be run (if validation = 'kfold').

n_cores

Number of cores for parallelization.

quiet

Logical value indicating if warning messages are reported.

superblock

Boolean indicating the presence of a superblock (deflation strategy must be adapted when a superblock is used).

scale

Logical value indicating if blocks are standardized.

scale_block

Value indicating if each block is divided by a constant value. If TRUE or "inertia", each block is divided by the sum of eigenvalues of its empirical covariance matrix. If "lambda1", each block is divided by the square root of the highest eigenvalue of its empirical covariance matrix. Otherwise the blocks are not scaled. If standardization is applied (scale = TRUE), the block scaling is applied on the result of the standardization.
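
The two scaling constants described above can be computed by hand on a toy block. This sketch follows the wording of the description literally and uses base R's cov (which divides by n - 1); the package's own computation may differ, for instance when bias = TRUE.

```r
set.seed(42)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)  # toy block: 50 observations, 4 variables

# Standardization (scale = TRUE): zero mean and unit variance per column.
X_std <- scale(X)

# Eigenvalues of the empirical covariance matrix of the standardized block.
eig <- eigen(cov(X_std), symmetric = TRUE, only.values = TRUE)$values

inertia <- sum(eig)        # constant for TRUE or "inertia", per the description above
lambda1 <- sqrt(max(eig))  # constant for "lambda1", per the description above

# Block scaling is applied on the result of the standardization.
X_inertia <- X_std / inertia
X_lambda1 <- X_std / lambda1
```

After standardization the sum of eigenvalues equals the number of variables (here 4), so the "inertia" option mainly compensates for differences in block size.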

tol

The stopping value for the convergence of the algorithm.

scheme

Character string or a function giving the scheme function for covariance maximization among "horst" (the identity function), "factorial" (the squared values), "centroid" (the absolute values). The scheme function can be any continuously differentiable convex function and it is possible to design the scheme function explicitly (e.g. function(x) x^4) as an argument of the rgcca function. See (Tenenhaus et al, 2017) for details.
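
To make the three named schemes concrete, the sketch below writes each one as an R function and applies it to a few made-up covariance values. The list name schemes and the label quartic are illustrative only; the package selects the named schemes by character string.

```r
# The named schemes written out as plain functions, plus the custom example from the text.
schemes <- list(
  horst = function(x) x,          # identity
  factorial = function(x) x^2,    # squared values
  centroid = function(x) abs(x),  # absolute values
  quartic = function(x) x^4       # custom scheme, as in function(x) x^4
)

# Made-up covariances between pairs of block components, for illustration only.
cov_values <- c(-0.8, -0.1, 0.3, 0.9)

# One column per scheme: how each scheme weighs the same covariances.
scheme_table <- sapply(schemes, function(g) g(cov_values))
print(scheme_table)
```

Only the identity keeps the sign of a covariance; the other three reward strong links regardless of sign, and higher powers put more emphasis on the largest ones.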

NA_method

Character string corresponding to the method used for handling missing values ("nipals", "complete") (default: "nipals").

  • "complete" corresponds to performing RGCCA on the fully observed observations (observations with missing values are removed).

  • "nipals" corresponds to performing the RGCCA algorithm on available data (NIPALS-type algorithm).

rgcca_res

A fitted RGCCA object (see rgcca).

tau

Either a 1 x J vector or a max(ncomp) x J matrix containing the values of the regularization parameters (default: tau = 1, for each block and each dimension). The regularization parameters vary from 0 (maximizing the correlation) to 1 (maximizing the covariance). If tau = "optimal" the regularization parameters are estimated for each block and each dimension using the Schafer and Strimmer (2005) analytical formula. If tau is a 1 x J vector, tau[j] is identical across the dimensions of block Xj. If tau is a matrix, tau[k, j] is associated with Xjk (kth residual matrix for block j). The regularization parameters can also be estimated using rgcca_permutation or rgcca_cv.

ncomp

Vector of length J indicating the number of block components for each block.

sparsity

Either a 1 x J vector or a max(ncomp) x J matrix encoding the L1 constraints applied to the outer weight vectors. The amount of sparsity varies between 1/sqrt(p_j) and 1 (larger values of sparsity correspond to less penalization). If sparsity is a vector, L1-penalties are the same for all the weights corresponding to the same block but different components:

for all h, ||a_{j,h}||_1 ≤ c_1[j] sqrt(p_j),

with p_j the number of variables of X_j. If sparsity is a matrix, each row h defines the constraints applied to the weights corresponding to component h:

for all h, ||a_{j,h}||_1 ≤ c_1[h,j] sqrt(p_j).

It can be estimated by using rgcca_permutation.
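
The range of the sparsity parameter can be checked numerically. The sketch below assumes weight vectors with unit L2 norm (an assumption, not stated in this page) and uses two hypothetical helpers, l1_ratio and satisfies, to show why the bound runs from 1/sqrt(p_j) to 1.

```r
p <- 9  # number of variables in the block, chosen for illustration

# Two extreme weight vectors, both with unit L2 norm (assumed constraint).
dense <- rep(1, p) / sqrt(p)   # all weights equal: no sparsity at all
sparse <- c(1, rep(0, p - 1))  # a single nonzero weight: maximal sparsity

# Hypothetical helper: smallest sparsity value that a weight vector satisfies.
l1_ratio <- function(a) sum(abs(a)) / sqrt(length(a))

# Hypothetical helper: does a satisfy the L1 constraint for a given sparsity value?
satisfies <- function(a, sparsity) {
  sum(abs(a)) <= sparsity * sqrt(length(a)) + 1e-12
}

ratio_dense <- l1_ratio(dense)    # equals 1
ratio_sparse <- l1_ratio(sparse)  # equals 1/sqrt(p)

ok_dense_at_0.8 <- satisfies(dense, 0.8)    # the dense vector violates sparsity = 0.8
ok_sparse_at_0.8 <- satisfies(sparse, 0.8)  # the sparse vector satisfies it
```

Lowering sparsity therefore shrinks the set of admissible weight vectors toward those with few nonzero entries.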

init

Character string giving the type of initialization to use in the algorithm. It could be either by Singular Value Decomposition ("svd") or by random initialization ("random") (default: "svd").

bias

A logical value for biased (1/n) or unbiased (1/(n-1)) estimator of the var/cov (default: bias = TRUE).
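
The difference between the two estimators is a single rescaling factor, as this small base R example shows. The data vector is made up for the illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values with mean 5
n <- length(x)

var_unbiased <- var(x)                    # base R divides by n - 1
var_biased <- var_unbiased * (n - 1) / n  # the 1/n estimator (bias = TRUE)

cat("unbiased:", var_unbiased, " biased:", var_biased, "\n")
```

The gap between the two shrinks as n grows, so the choice matters mostly for small samples.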

verbose

Logical value indicating if the progress of the algorithm is reported while computing.

n_iter_max

Integer giving the algorithm's maximum number of iterations.

comp_orth

Logical value indicating if the deflation should lead to orthogonal components or orthogonal weights.

metric

A character giving the metric to report.

...

Additional parameters to be passed to the model fitted on top of RGCCA.

Details

At each round of cross-validation, for each variable, a predictive model of the first RGCCA component of each block (calculated on the training set) is constructed. Then the Root Mean Square Error (RMSE) or the accuracy of the model is computed on the testing dataset. Finally, the metrics are averaged over the folds. The best combination of parameters is the one for which the average RMSE on the testing datasets is the lowest or the accuracy is the highest.
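
The train, predict and score cycle described above can be sketched with base R alone. This is a simplified stand-in, not the package's code: comp is simulated rather than computed by RGCCA, there is a single response variable, and lm plays the role of the prediction model.

```r
set.seed(123)
n <- 40
comp <- rnorm(n)                    # stand-in for a block component
y <- 2 * comp + rnorm(n, sd = 0.5)  # stand-in for a response variable
dat <- data.frame(y = y, comp = comp)

k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

# One RMSE per fold: fit on k - 1 folds, evaluate on the held-out fold.
rmse_per_fold <- vapply(seq_len(k), function(f) {
  train <- dat[folds != f, ]
  test <- dat[folds == f, ]
  fit <- lm(y ~ comp, data = train)
  pred <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))
}, numeric(1))

# The cross-validated score is the average over the folds.
mean_rmse <- mean(rmse_per_fold)
```

In the tuning setting, this cycle would be repeated for every candidate parameter combination, and the combination with the lowest average RMSE would be retained.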

Value

cv

A matrix giving the root-mean-square error (RMSE) between the predicted R/SGCCA and the observed R/SGCCA for each combination and each prediction (n_prediction = n_samples for validation = 'loo'; n_prediction = 'k' * 'n_run' for validation = 'kfold').

call

A list of the input parameters.

bestpenalties

Penalties giving the best RMSE for each block (for regression) or the lowest proportion of wrong predictions (for classification).

penalties

A matrix giving, for each block, the penalty combinations (tau or sparsity).

stats

A data.frame containing the set of parameter values, and the mean, standard deviation, median, 1st and 3rd quartiles of the associated cross-validated scores.
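
A table like stats can be derived from a score matrix such as cv with a few row-wise summaries. The numbers below are made up for illustration and are not output of rgcca_cv; the column names are also assumptions.

```r
# Made-up RMSE scores: one row per parameter value, one column per prediction.
cv <- rbind(
  c(0.92, 1.05, 0.98, 1.10),
  c(0.81, 0.79, 0.88, 0.84),
  c(0.95, 0.90, 1.02, 0.97)
)
penalties <- c(0.6, 0.75, 0.8)

# Row-wise summaries of the cross-validated scores.
stats <- data.frame(
  penalty = penalties,
  mean = apply(cv, 1, mean),
  sd = apply(cv, 1, sd),
  median = apply(cv, 1, median),
  Q1 = apply(cv, 1, quantile, probs = 0.25),
  Q3 = apply(cv, 1, quantile, probs = 0.75)
)

# For RMSE, the best parameter value is the one with the lowest mean score.
best <- penalties[which.min(stats$mean)]
print(stats)
```

For an accuracy-type metric the selection would use which.max instead, since higher is better.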

Examples

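library(RGCCA)

# Load the Russett data and split its columns into three blocks.
data("Russett")
blocks <- list(
  agriculture = Russett[, seq(3)],
  industry = Russett[, 4:5],
  politic = Russett[, 6:8]
)

# Tune the sparsity parameter by cross-validation, with block 3 as the response.
res <- rgcca_cv(blocks,
  response = 3, method = "rgcca",
  par_type = "sparsity",
  par_value = c(0.6, 0.75, 0.8),
  n_run = 2, n_cores = 1
)
plot(res)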
## Not run: 
rgcca_cv(blocks,
  response = 3, par_type = "tau",
  par_value = c(0.6, 0.75, 0.8),
  n_run = 2, n_cores = 1
)$bestpenalties

rgcca_cv(blocks,
  response = 3, par_type = "sparsity",
  par_value = 0.8, n_run = 2, n_cores = 1
)

rgcca_cv(blocks,
  response = 3, par_type = "tau",
  par_value = 0.8, n_run = 2, n_cores = 1
)

## End(Not run)


Tenenhaus/RGCCA documentation built on March 16, 2023, 2:04 p.m.