tune.spca: Tune number of selected variables for spca

Description

This function performs sparse PCA and optimises the number of variables to keep on each component using repeated cross-validation.

Usage

tune.spca(
  X,
  ncomp = 2,
  nrepeat = 1,
  folds,
  test.keepX,
  center = TRUE,
  scale = TRUE,
  BPPARAM = SerialParam()
)

Arguments

X

a numeric matrix (or data frame) which provides the data for the sparse principal components analysis. It should not contain missing values.

ncomp

Integer. If the data are complete, ncomp sets the number of components and associated eigenvalues to display from the pcasvd algorithm; if the data have missing values, ncomp gives the number of components kept to reconstitute the data using the NIPALS algorithm. If NULL, the function sets ncomp = min(nrow(X), ncol(X)).

nrepeat

Number of times the Cross-Validation process is repeated.

folds

Number of folds in 'Mfold' cross-validation. See details.

test.keepX

Numeric vector of the candidate numbers of variables to test from the X data set.

center

(Default=TRUE) Logical, whether the variables should be shifted to be zero centered. Only set to FALSE if the data have already been centered. Alternatively, a vector of length equal to the number of columns of X can be supplied. The value is passed to scale. If the data contain missing values, columns should be centered for reliable results.

scale

(Default=TRUE) Logical indicating whether the variables should be scaled to have unit variance before the analysis takes place.

BPPARAM

A BiocParallelParam object indicating the type of parallelisation. See examples.

Details

For the first component, and for each candidate number of variables to select (keepX) in the grid, the data are split into training and test sets over the specified repeats and folds. The components extracted from each training set are compared against those from an spca model fitted on all the data to determine the optimal keepX. To keep at least 3 samples in each test set, which is needed for reliable scaling of the test data for comparison, folds must be <= floor(nrow(X)/3).
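The comparison described above can be illustrated with a rough base-R sketch. It is not the mixOmics implementation: it uses ordinary PCA (prcomp) in place of sparse PCA, simulated data, a single repeat, and only the first component. The names fold_id, full_load and cors are hypothetical and chosen for illustration.

```r
set.seed(1)

# Simulated data (hypothetical): 30 samples, 8 variables,
# with a shared signal added to the first 3 variables
X <- matrix(rnorm(30 * 8), nrow = 30)
X[, 1:3] <- X[, 1:3] + rnorm(30, sd = 3)

# Assign each sample to one of 3 folds at random
folds <- 3
fold_id <- sample(rep(seq_len(folds), length.out = nrow(X)))

# Reference loadings of component 1 from a model fitted on all samples
full_load <- prcomp(X, center = TRUE, scale. = TRUE)$rotation[, 1]

cors <- vapply(seq_len(folds), function(k) {
  train <- X[fold_id != k, , drop = FALSE]
  # Test samples are scaled on their own, hence the need for enough of them
  test <- scale(X[fold_id == k, , drop = FALSE])
  train_load <- prcomp(train, center = TRUE, scale. = TRUE)$rotation[, 1]
  # Absolute correlation, since the sign of a component is arbitrary
  as.numeric(abs(cor(test %*% train_load, test %*% full_load)))
}, numeric(1))

# Average agreement between fold models and the full-data model
mean(cors)
```

In a tuning setting, a score of this kind would be computed for each candidate keepX and the values compared across candidates.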

The number of selected variables for the following components is then optimised sequentially. If the number of observations is small (e.g. < 30), Leave-One-Out Cross-Validation is recommended; it can be achieved by setting folds = nrow(X).
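The fold bound folds <= floor(nrow(X)/3) stated in Details can be checked before calling the function. The following is a minimal sketch on a simulated matrix; check_folds() is a hypothetical helper and not part of mixOmics.

```r
# Simulated stand-in for X (hypothetical data): 40 samples by 21 variables
set.seed(42)
X <- matrix(rnorm(40 * 21), nrow = 40, ncol = 21)

# Largest number of folds that leaves at least 3 samples in each test set
max_folds <- floor(nrow(X) / 3)

# Hypothetical helper: TRUE if the requested folds respects the bound
check_folds <- function(folds, X) {
  folds <= floor(nrow(X) / 3)
}

max_folds
check_folds(3, X)
check_folds(20, X)
```

With 40 samples the bound is 13, so folds = 3 satisfies it and folds = 20 does not.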

Value

A tune.spca object containing:

call

The function call

choice.keepX

The selected number of variables on each component

cor.comp

The correlations between the components from the cross-validated studies and those from the study which used all of the data in training.

Examples

data("nutrimouse")
set.seed(42)
nrepeat <- 5
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5)
)
tune.spca.res
plot(tune.spca.res)
## Not run: 
## parallel processing using BiocParallel on repeats with more workers (cpus)
## You can use BiocParallel::MulticoreParam() on non-Windows machines
## for faster computation
BPPARAM <- BiocParallel::SnowParam(workers = max(parallel::detectCores()-1, 2))
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5),
    BPPARAM = BPPARAM
)
plot(tune.spca.res)

## End(Not run)

mixOmics documentation built on April 15, 2021, 6:01 p.m.