tune.spca: Tune number of selected variables for spca
In mixOmics: Omics Data Integration Project


This function performs sparse PCA and optimises the number of variables to keep on each component using repeated cross-validation.

tune.spca(
  X,
  ncomp = 2,
  nrepeat = 1,
  folds,
  test.keepX,
  center = TRUE,
  scale = TRUE,
  BPPARAM = SerialParam()
)

`X`	a numeric matrix (or data frame) which provides the data for the sparse principal components analysis. It should not contain missing values.
`ncomp`	Integer. If the data are complete, `ncomp` sets the number of components and associated eigenvalues to display from the `pcasvd` algorithm. If the data have missing values, `ncomp` gives the number of components to keep when reconstituting the data with the NIPALS algorithm. If `NULL`, the function sets `ncomp = min(nrow(X), ncol(X))`.
`nrepeat`	Number of times the Cross-Validation process is repeated.
`folds`	Number of folds in 'Mfold' cross-validation. See details.
`test.keepX`	Numeric vector of the candidate numbers of variables to test from the `X` data set.
`center`	(Default=TRUE) Logical, whether the variables should be shifted to be zero centered. Only set to FALSE if data have already been centered. Alternatively, a vector of length equal to the number of columns of `X` can be supplied. The value is passed to `scale`. If the data contain missing values, columns should be centered for reliable results.
`scale`	(Default=TRUE) Logical indicating whether the variables should be scaled to have unit variance before the analysis takes place.
`BPPARAM`	A BiocParallelParam object indicating the type of parallelisation. See examples.

Essentially, for the first component, and for a grid of the number of variables to select (keepX), a number of repeats and folds, the data are split into training and test sets, and the extracted components are compared against those from a spca model fitted on all the data to ascertain the optimal keepX. In order to keep at least 3 samples in each test set for reliable scaling of the test data for comparison, folds must be <= floor(nrow(X)/3).
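The comparison described above can be illustrated with a minimal base R sketch. It is not the mixOmics implementation: ordinary PCA (`stats::prcomp`) stands in for sparse PCA, the test samples are projected with the training centering and scaling, and the simulated matrix `X` and the names `fold_id` and `fold_cor` are hypothetical. It only shows the idea of correlating a held-out component with the component from a model fitted on all the data, together with the bound on `folds`.

```r
# Sketch only: ordinary PCA (stats::prcomp) stands in for sparse PCA,
# and the simulated matrix X is hypothetical.
set.seed(1)
n <- 30
p <- 8
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("V", seq_len(p))))
# Add a shared signal to the first three variables so component 1 is stable.
X[, 1:3] <- X[, 1:3] + rnorm(n, sd = 3)

# Upper bound on folds: at least 3 samples must remain in each test set.
max_folds <- floor(nrow(X) / 3)
folds <- 3
stopifnot(folds <= max_folds)

# Reference: component 1 from a model fitted on all samples.
full_fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Randomly assign each sample to one fold.
fold_id <- sample(rep(seq_len(folds), length.out = nrow(X)))

# For each fold: fit on the training samples, project the test samples,
# and correlate their scores with the full-data scores of the same samples.
fold_cor <- vapply(seq_len(folds), function(k) {
  test <- fold_id == k
  train_fit <- prcomp(X[!test, , drop = FALSE], center = TRUE, scale. = TRUE)
  test_scores <- predict(train_fit, newdata = X[test, , drop = FALSE])[, 1]
  # Absolute value because the sign of a component is arbitrary.
  abs(cor(test_scores, full_fit$x[test, 1]))
}, numeric(1))

mean(fold_cor)
```

A higher mean correlation indicates that the component recovered from the training folds agrees with the one from the full data. In `tune.spca` the same kind of score is compared across the values in `test.keepX`.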

The number of selected variables for the following components will then be sequentially optimised. If the number of observations is small (e.g. < 30), it is recommended to use Leave-One-Out Cross-Validation, which can be achieved by setting folds = nrow(X).

A tune.spca object containing:

call: The function call
choice.keepX: The selected number of variables on each component
cor.comp: The correlations between the components from the cross-validated studies and those from the study which used all of the data in training.
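
As a hedged usage sketch, the elements listed above could be read from the returned object and the tuned `keepX` passed on to a final sparse PCA fit. This assumes the mixOmics package is installed and that `choice.keepX` holds one value per component. The object names `tuned`, `keepX` and `final.spca` are illustrative only.

```r
library(mixOmics)

data("nutrimouse")
set.seed(42)

# Tune the number of variables to keep on each of two components.
tuned <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = 5,
    folds = 3,
    test.keepX = seq(5, 15, 5)
)

# Selected number of variables on each component.
keepX <- tuned$choice.keepX
keepX

# Correlations between the cross-validated and full-data components.
tuned$cor.comp

# Fit a final sparse PCA model using the tuned values (assumed workflow).
final.spca <- spca(nutrimouse$lipid, ncomp = 2, keepX = keepX)
final.spca
```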

data("nutrimouse")
set.seed(42)
nrepeat <- 5
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5)
)
tune.spca.res
plot(tune.spca.res)
## Not run: 
## parallel processing using BiocParallel on repeats with more workers (cpus)
## You can use BiocParallel::MulticoreParam() on non-Windows machines
## for faster computation
BPPARAM <- BiocParallel::SnowParam(workers = max(parallel::detectCores()-1, 2))
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5),
    BPPARAM = BPPARAM
)
plot(tune.spca.res)

## End(Not run)