rgcca_cv | R Documentation |
Tune the sparsity coefficient (if the model is sparse) or tau (otherwise) in a supervised approach by estimating the predictive quality of the models through cross-validation. For this purpose, the samples are divided into k folds; the model is tested on each fold in turn and trained on the others. For small datasets (<30 samples), it is recommended to use as many folds as there are individuals (leave-one-out; loo).
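The fold splitting described above can be sketched in base R. This is an illustrative sketch only: `make_folds` is a hypothetical helper, not a function of the package, and the sample size is arbitrary.

```r
# Hypothetical helper (not part of the package): assign n samples to k folds.
make_folds <- function(n, k) {
  # Shuffle the sample indices, then deal them into k groups of near-equal size.
  shuffled <- sample(n)
  split(shuffled, rep(seq_len(k), length.out = n))
}

set.seed(1)
n <- 47                                # arbitrary number of samples
folds_kfold <- make_folds(n, k = 5)    # 5-fold split
folds_loo   <- make_folds(n, k = n)    # leave-one-out: one sample per fold

# Each fold serves once as the test set; the other samples form the training set.
for (test_idx in folds_kfold) {
  train_idx <- setdiff(seq_len(n), test_idx)
  # a model would be fitted on train_idx and evaluated on test_idx here
}
```

With k equal to the number of samples, every fold contains a single individual, which is the leave-one-out case recommended for small datasets.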
rgcca_cv(
  blocks,
  method = "rgcca",
  response = NULL,
  par_type = "tau",
  par_value = NULL,
  par_length = 10,
  validation = "kfold",
  prediction_model = "lm",
  k = 5,
  n_run = 1,
  n_cores = 1,
  quiet = TRUE,
  superblock = FALSE,
  scale = TRUE,
  scale_block = TRUE,
  tol = 1e-08,
  scheme = "factorial",
  NA_method = "nipals",
  rgcca_res = NULL,
  tau = 1,
  ncomp = 1,
  sparsity = 1,
  init = "svd",
  bias = TRUE,
  verbose = TRUE,
  n_iter_max = 1000,
  comp_orth = TRUE,
  metric = NULL,
  ...
)
blocks |
A list that contains the J blocks of variables X1, X2, ..., XJ. Block Xj is a matrix of dimension n x p_j where n is the number of observations and p_j the number of variables. |
method |
A character string indicating the multi-block component method to consider. See available_methods for the list of the available methods. |
response |
Numerical value giving the position of the response block. When the response argument is set, the supervised mode is automatically activated. |
par_type |
A character giving the parameter to tune among "sparsity" or "tau". |
par_value |
A matrix (n x p, with p the number of blocks and n the number of combinations to be tested), a vector (of length p) or a numeric value giving sets of penalties (tau for RGCCA, sparsity for SGCCA) to be tested, one row per combination. By default, it takes 10 sets between the minimum values (0 for RGCCA and $1/sqrt(ncol)$ for SGCCA) and 1. |
par_length |
An integer indicating the number of sets of parameters to be tested (if par_value = NULL). The parameters are uniformly distributed. |
validation |
A character for the type of validation among "loo", "kfold". |
prediction_model |
A character giving the function used to compare the trained and the tested models. |
k |
An integer giving the number of folds (if validation = 'kfold'). |
n_run |
An integer giving the number of cross-validations to be run (if validation = 'kfold'). |
n_cores |
Number of cores for parallelization. |
quiet |
Logical value indicating if warning messages are reported. |
superblock |
Boolean indicating the presence of a superblock (deflation strategy must be adapted when a superblock is used). |
scale |
Logical value indicating if blocks are standardized. |
scale_block |
Value indicating if each block is divided by a constant value. If TRUE or "inertia", each block is divided by the sum of eigenvalues of its empirical covariance matrix. If "lambda1", each block is divided by the square root of the highest eigenvalue of its empirical covariance matrix. Otherwise the blocks are not scaled. If standardization is applied (scale = TRUE), the block scaling is applied on the result of the standardization. |
tol |
The stopping value for the convergence of the algorithm. |
scheme |
Character string or a function giving the scheme function for covariance maximization among "horst" (the identity function), "factorial" (the squared values), "centroid" (the absolute values). The scheme function can be any continuously differentiable convex function, and it is possible to design the scheme function explicitly (e.g. function(x) x^4) as an argument of the rgcca function. See (Tenenhaus et al, 2017) for details. |
NA_method |
Character string giving the method used for handling missing values, either "nipals" or "complete" (default: "nipals"). |
rgcca_res |
A fitted RGCCA object (see |
tau |
Either a 1 x J vector or a max(ncomp) x J matrix containing the values of the regularization parameters (default: tau = 1, for each block and each dimension). The regularization parameters vary from 0 (maximizing the correlation) to 1 (maximizing the covariance). If tau = "optimal", the regularization parameters are estimated for each block and each dimension using the Schafer and Strimmer (2005) analytical formula. If tau is a 1 x J vector, tau[j] is identical across the dimensions of block Xj. If tau is a matrix, tau[k, j] is associated with Xjk (kth residual matrix for block j). The regularization parameters can also be estimated using rgcca_permutation or rgcca_cv. |
ncomp |
Vector of length J indicating the number of block components for each block. |
sparsity |
Either a 1 x J vector or a max(ncomp) x J matrix encoding the L1 constraints applied to the outer weight vectors. The amount of sparsity varies between $1/\sqrt{p_j}$ and 1 (larger values of sparsity correspond to less penalization). If sparsity is a vector, the L1 penalties are the same for all the weights corresponding to the same block but different components: for all h, $\|a_{j,h}\|_{1} \le c_1[j] \sqrt{p_j}$, with $p_j$ the number of variables of $X_j$. If sparsity is a matrix, each row h defines the constraints applied to the weights corresponding to component h: for all h, $\|a_{j,h}\|_{1} \le c_1[h,j] \sqrt{p_j}$. It can be estimated by using rgcca_permutation. |
init |
Character string giving the type of initialization to use in the algorithm. It can be either Singular Value Decomposition ("svd") or random initialization ("random") (default: "svd"). |
bias |
A logical value selecting the biased (1/n) or unbiased (1/(n-1)) estimator of the variance/covariance (default: bias = TRUE). |
verbose |
Logical value indicating if the progress of the algorithm is reported while computing. |
n_iter_max |
Integer giving the algorithm's maximum number of iterations. |
comp_orth |
Logical value indicating if the deflation should lead to orthogonal components or orthogonal weights. |
metric |
A character giving the metric to report. |
... |
Additional parameters to be passed to the model fitted on top of RGCCA. |
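As an illustration of the default behaviour described for par_value and par_length, the sketch below builds a grid of candidate penalties that are uniformly spaced between the stated minimum (0 for tau, 1/sqrt(p_j) for sparsity) and 1. `make_grid` is a hypothetical helper written for this example; the ordering and exact values used internally by rgcca_cv may differ.

```r
# Hypothetical helper (not part of the package): grid of candidate penalties.
# p holds the number of variables of each block (one entry per block).
make_grid <- function(p, par_type = c("tau", "sparsity"), par_length = 10) {
  par_type <- match.arg(par_type)
  # Lower bound per block: 0 for tau, 1 / sqrt(p_j) for sparsity.
  lower <- if (par_type == "tau") rep(0, length(p)) else 1 / sqrt(p)
  # Evenly spaced values from the lower bound to 1, computed block by block.
  values <- vapply(
    lower,
    function(lo) seq(lo, 1, length.out = par_length),
    numeric(par_length)
  )
  # One row per candidate combination, one column per block.
  matrix(values, nrow = par_length, dimnames = list(NULL, names(p)))
}

# Block sizes matching a 3-block setting with 3, 2 and 3 variables.
p <- c(agriculture = 3, industry = 2, politic = 3)
grid_sparsity <- make_grid(p, "sparsity", par_length = 4)
grid_tau      <- make_grid(p, "tau", par_length = 10)
```

Each row of such a matrix corresponds to one combination in the sense of par_value: one penalty per block, tested together.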
At each round of cross-validation, for each variable, a predictive model of the first RGCCA component of each block (calculated on the training set) is constructed. Then the Root Mean Square Error (RMSE) or the accuracy of the model is computed on the testing dataset. Finally, the metrics are averaged over the different folds. The best combination of parameters is the one for which the average RMSE on the testing datasets is the lowest or the accuracy is the highest.
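The scoring step can be sketched with base R on simulated data. This is not the package's implementation: `c1` and `c2` are toy stand-ins for block components, `y` stands in for one response variable, and the linear model mirrors prediction_model = "lm".

```r
set.seed(42)
n <- 40
# Toy stand-ins: c1 and c2 play the role of block components, y of a response.
dat <- data.frame(c1 = rnorm(n), c2 = rnorm(n))
dat$y <- 2 * dat$c1 - dat$c2 + rnorm(n, sd = 0.5)

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold labels

rmse_per_fold <- vapply(seq_len(k), function(f) {
  train <- dat[fold_id != f, ]
  test  <- dat[fold_id == f, ]
  fit   <- lm(y ~ c1 + c2, data = train)   # model trained on k - 1 folds
  pred  <- predict(fit, newdata = test)    # predictions on the held-out fold
  sqrt(mean((test$y - pred)^2))            # RMSE on the held-out fold
}, numeric(1))

# Average over folds: the score attached to one candidate parameter set.
mean_rmse <- mean(rmse_per_fold)
```

Repeating this for every candidate set of penalties and keeping the one with the lowest average RMSE (or the highest accuracy for classification) corresponds to the selection rule stated above.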
cv |
A matrix giving the root-mean-square error (RMSE) between the predicted R/SGCCA and the observed R/SGCCA for each combination and each prediction (n_prediction = n_samples for validation = 'loo'; n_prediction = 'k' * 'n_run' for validation = 'kfold'). |
call |
A list of the input parameters |
bestpenalties |
Penalties giving the best RMSE for each block (for regression) or the lowest proportion of wrong predictions (for classification) |
penalties |
A matrix giving, for each block, the penalty combinations (tau or sparsity) |
stats |
A data.frame containing the set of parameter values, and the mean, standard deviation, median, 1st and 3rd quartiles of the associated cross-validated scores. |
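A summary table of this kind could be derived from a score matrix as sketched below. The layout is an assumption made for illustration: one row of `cv` per parameter combination and one column per prediction, with hypothetical column names; the actual structure returned by rgcca_cv may differ.

```r
set.seed(7)
# Toy score matrix: 3 candidate parameter sets (rows) x 10 predictions (columns).
cv <- matrix(runif(30, min = 0.5, max = 1.5), nrow = 3)
penalties <- data.frame(tau = c(0.6, 0.75, 0.8))

# Per-combination summaries of the cross-validated scores (assumed column names).
stats <- cbind(
  penalties,
  mean   = rowMeans(cv),
  sd     = apply(cv, 1, sd),
  median = apply(cv, 1, median),
  Q1     = apply(cv, 1, quantile, probs = 0.25),
  Q3     = apply(cv, 1, quantile, probs = 0.75)
)

# For a regression metric such as RMSE, the lowest mean score is preferred.
best <- stats[which.min(stats$mean), ]
```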
data("Russett")
blocks <- list(
  agriculture = Russett[, seq(3)],
  industry = Russett[, 4:5],
  politic = Russett[, 6:8]
)

res <- rgcca_cv(blocks,
  response = 3, method = "rgcca", par_type = "sparsity",
  par_value = c(0.6, 0.75, 0.8), n_run = 2, n_cores = 1
)
plot(res)

## Not run:
rgcca_cv(blocks,
  response = 3, par_type = "tau",
  par_value = c(0.6, 0.75, 0.8), n_run = 2, n_cores = 1
)$bestpenalties

rgcca_cv(blocks,
  response = 3, par_type = "sparsity",
  par_value = 0.8, n_run = 2, n_cores = 1
)

rgcca_cv(blocks,
  response = 3, par_type = "tau",
  par_value = 0.8, n_run = 2, n_cores = 1
)
## End(Not run)