tune.spca: Tune number of selected variables for spca


tune.spca    R Documentation

Tune number of selected variables for spca

Description

This function performs sparse PCA and optimises the number of variables to keep on each component using repeated cross-validation.

Usage

tune.spca(
  X,
  ncomp = 2,
  nrepeat = 1,
  folds,
  test.keepX,
  center = TRUE,
  scale = TRUE,
  BPPARAM = SerialParam(),
  seed = NULL
)

Arguments

X

a numeric matrix (or data frame) which provides the data for the sparse principal components analysis. It should not contain missing values.

ncomp

Integer. If the data are complete, ncomp sets the number of components and associated eigenvalues to display from the pcasvd algorithm; if the data have missing values, ncomp gives the number of components kept to reconstitute the data using the NIPALS algorithm. If NULL, the function sets ncomp = min(nrow(X), ncol(X)).

nrepeat

Number of times the Cross-Validation process is repeated.

folds

Number of folds in 'Mfold' cross-validation. See details.

test.keepX

Numeric vector of the different numbers of variables to test from the X data set.
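As an illustration, a test.keepX grid can be built with seq() and trimmed to the number of columns of X. This is only a sketch: it assumes that candidate values larger than ncol(X) are not meaningful, and the helper make_keepx_grid and the simulated matrix X are hypothetical names that are not part of mixOmics.

```r
# Hypothetical helper (not part of mixOmics): build a grid of candidate
# keepX values and keep only those between 1 and ncol(X).
make_keepx_grid <- function(X, from = 5, to = 15, by = 5) {
  grid <- seq(from, to, by)
  grid[grid >= 1 & grid <= ncol(X)]
}

# Simulated data for illustration only: 20 samples, 12 variables
set.seed(1)
X <- matrix(rnorm(20 * 12), nrow = 20, ncol = 12)

# Candidate 15 is dropped because X has only 12 columns
keepx_grid <- make_keepx_grid(X)
keepx_grid
```

The resulting vector could then be supplied as test.keepX.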

center

(Default=TRUE) Logical, whether the variables should be shifted to be zero centered. Only set to FALSE if the data have already been centered. Alternatively, a vector of length equal to the number of columns of X can be supplied. The value is passed to scale. If the data contain missing values, columns should be centered for reliable results.

scale

(Default=TRUE) Logical indicating whether the variables should be scaled to have unit variance before the analysis takes place.

BPPARAM

A BiocParallelParam object indicating the type of parallelisation. See examples.

seed

Set a number here if you want the function to give reproducible outputs. Not recommended during exploratory analysis. Note that if RNGseed is set in 'BPPARAM', it will be overwritten by 'seed'.

Details

In essence, for the first component and for each candidate number of variables in the grid (keepX), the data are repeatedly split into training and test sets over the given number of repeats and folds. The components extracted from each split are compared against those from an spca model fitted on all the data to determine the optimal keepX. To keep at least 3 samples in each test set, so that the test data can be reliably scaled for comparison, folds must be <= floor(nrow(X)/3).
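The bound on folds stated above can be checked before calling the function. The sketch below uses base R only; max_folds is a hypothetical helper and X is simulated data, neither of which comes from mixOmics.

```r
# Hypothetical helper: largest number of folds that leaves at least
# 3 samples in each test set, i.e. floor(nrow(X) / 3).
max_folds <- function(X) floor(nrow(X) / 3)

# Simulated data for illustration only: 40 samples, 21 variables
set.seed(1)
X <- matrix(rnorm(40 * 21), nrow = 40, ncol = 21)

max_folds(X)  # 13, since floor(40 / 3) = 13

# Cap a requested number of folds at the allowed maximum
requested_folds <- 5
folds <- min(requested_folds, max_folds(X))
folds
```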

The number of selected variables for the following components is then optimised sequentially. If the number of observations is small (e.g. < 30), Leave-One-Out Cross-Validation is recommended; it can be achieved by setting folds = nrow(X).
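The idea of comparing cross-validated components with those from a model fitted on all the data can be sketched in base R. This sketch is not the mixOmics implementation: it swaps in ordinary PCA via prcomp() for sparse PCA, considers only the first component and a single repeat, and uses simulated data. All object names are hypothetical.

```r
set.seed(42)

# Simulated data (illustration only): 30 samples, 8 variables driven by
# one shared latent factor plus noise
n <- 30
p <- 8
latent <- rnorm(n)
X <- outer(latent, runif(p, 0.5, 1.5)) + matrix(rnorm(n * p, sd = 0.3), n, p)
colnames(X) <- paste0("V", seq_len(p))

# Reference: first-component scores from a model fitted on all samples
full_fit <- prcomp(X, center = TRUE, scale. = TRUE)
full_scores <- full_fit$x[, 1]

# Randomly assign each sample to one of `folds` folds
folds <- 3
fold_id <- sample(rep(seq_len(folds), length.out = n))

cv_cor <- vapply(seq_len(folds), function(k) {
  test <- fold_id == k
  # Fit on the training samples only
  train_fit <- prcomp(X[!test, , drop = FALSE], center = TRUE, scale. = TRUE)
  # Project the held-out samples onto the training component
  test_scores <- predict(train_fit, newdata = X[test, , drop = FALSE])[, 1]
  # Absolute correlation, because the sign of a component is arbitrary
  abs(cor(test_scores, full_scores[test]))
}, numeric(1))

# Average agreement between cross-validated and full-data components
mean(cv_cor)
```

In the tuning procedure described above, a comparison of this kind would be repeated for each candidate keepX and each repeat, which is why the returned object reports correlations in cor.comp.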

Value

A tune.spca object containing:

call

The function call

choice.keepX

The selected number of variables on each component

cor.comp

The correlations between the components from the cross-validated studies and those from the study which used all of the data in training.

Examples

library(mixOmics)
data("nutrimouse")
nrepeat <- 5
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5),
    seed = 42
)
tune.spca.res
plot(tune.spca.res)
## Not run: 
## parallel processing using BiocParallel on repeats with more workers (cpus)
# Check if the environment variable exists (during R CMD check) and limit cores accordingly
max_cores <- if (Sys.getenv("_R_CHECK_LIMIT_CORES_") != "") 2 else max(1, parallel::detectCores() - 1, na.rm = TRUE)
# Setup the parallel backend with the appropriate number of workers
BPPARAM <- BiocParallel::MulticoreParam(workers = max_cores)
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5),
    BPPARAM = BPPARAM
)
plot(tune.spca.res)

## End(Not run)

mixOmicsTeam/mixOmics documentation built on Dec. 3, 2024, 11:15 p.m.