Sparse Contrastive Principal Component Analysis
Description
Given target and background data frames or matrices, scPCA will perform the
sparse contrastive principal component analysis (scPCA) of the target data
for a given number of eigenvectors, a vector of real-valued contrast
parameters, and a vector of sparsity-inducing penalty terms.
If instead you wish to perform contrastive principal component analysis
(cPCA), set the penalties argument to 0. So long as the n_centers parameter
is larger than one, the automated hyperparameter tuning heuristic described
in Boileau et al. (2020) is used. Otherwise, the semi-automated approach of
Abid et al. (2018) is used to select the appropriate hyperparameter.
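To make the role of the contrast parameter concrete, the base R sketch below computes plain cPCA for a single contrast value by eigendecomposing the target covariance minus the contrast-scaled background covariance, following the idea in Abid et al. It is an illustration only and not the package's internal implementation: the helper name cpca_sketch and the simulated matrices are assumptions made for this example.

```r
# Minimal cPCA sketch for one contrast value (hypothetical helper, base R only)
cpca_sketch <- function(target, background, contrast, n_eigen = 2) {
  # center each data set's features to mean zero
  target_c <- scale(as.matrix(target), center = TRUE, scale = FALSE)
  background_c <- scale(as.matrix(background), center = TRUE, scale = FALSE)

  # contrastive covariance: target covariance minus scaled background covariance
  contrast_cov <- cov(target_c) - contrast * cov(background_c)

  # the leading eigenvectors of this symmetric matrix serve as the loadings
  eig <- eigen(contrast_cov, symmetric = TRUE)
  rotation <- eig$vectors[, seq_len(n_eigen), drop = FALSE]

  list(rotation = rotation, x = target_c %*% rotation, contrast = contrast)
}

# simulated data, used only to exercise the sketch
set.seed(1)
sim_target <- matrix(rnorm(100 * 5), nrow = 100)
sim_background <- matrix(rnorm(80 * 5), nrow = 80)

sketch_fit <- cpca_sketch(sim_target, sim_background, contrast = 1)
dim(sketch_fit$rotation)  # 5 features by 2 components
dim(sketch_fit$x)         # 100 observations by 2 components
```

A contrast of 0 reduces this sketch to ordinary PCA of the target data, while larger values remove more of the variation shared with the background.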
Usage
scPCA(
  target,
  background,
  center = TRUE,
  scale = FALSE,
  n_eigen = 2,
  cv = NULL,
  alg = c("iterative", "var_proj", "rand_var_proj"),
  contrasts = exp(seq(log(0.1), log(1000), length.out = 40)),
  penalties = seq(0.05, 1, length.out = 20),
  clust_method = c("kmeans", "pam", "hclust"),
  n_centers = NULL,
  max_iter = 10,
  linkage_method = "complete",
  n_medoids = 8,
  parallel = FALSE,
  clusters = NULL,
  eigdecomp_tol = 1e-10,
  eigdecomp_iter = 1000,
  scaled_matrix = FALSE
)
Arguments
target |
The target (experimental) data set, in a standard format such
as a data.frame or matrix. dgCMatrix and
DelayedMatrix objects are also supported.
|
background |
The background data set, in a standard format such as a
data.frame or matrix. The features must match the features of
the target data set. dgCMatrix and DelayedMatrix objects are
also supported.
|
center |
A logical indicating whether the target and background
data sets' features should be centered to mean zero.
|
scale |
A logical indicating whether the target and background
data sets' features should be scaled to unit variance.
|
n_eigen |
A numeric indicating the number of eigenvectors (or
(sparse) contrastive components) to be computed. Two eigenvectors are
computed by default.
|
cv |
A numeric indicating the number of cross-validation folds
to use in choosing the optimal contrastive and penalization parameters from
the grids of contrasts and penalties. Cross-validation
is expected to improve the robustness and generalization of the choice of
these parameters, at the cost of increased computation time.
The default is therefore NULL, corresponding to no cross-validation.
|
alg |
A character indicating the sparse PCA algorithm used to
sparsify the contrastive loadings. Currently supports iterative for
the Zou et al. (2006) implementation, var_proj
for the non-randomized Erichson et al. (2018)
solution, and rand_var_proj for the randomized
Erichson et al. (2018) implementation. Defaults to
iterative.
|
contrasts |
A numeric vector of the contrastive parameters. Each
element must be a unique, non-negative real number. By default, 40
logarithmically spaced values between 0.1 and 1000 are used. If a single
value is provided and penalties is set to 0, then n_centers,
clust_method, max_iter, linkage_method,
n_medoids, and parallel can be safely ignored.
|
penalties |
A numeric vector of the L1 penalty terms on the
loadings. The default is to use 20 equidistant values between 0.05 and 1.
If penalties is set to 0, then cPCA is performed in place of scPCA.
See the contrasts and n_centers arguments for more information.
|
clust_method |
A character specifying the clustering method to
use for choosing the optimal contrastive parameter. Currently, this is
limited to k-means, partitioning around medoids (PAM), or
hierarchical clustering. The default is k-means clustering.
|
n_centers |
A numeric giving the number of centers to use in the
clustering algorithm. If set to 1, cPCA, as first proposed by
Abid et al. (2018), is performed, regardless of
what the penalties argument is set to.
|
max_iter |
A numeric giving the maximum number of iterations to
be used in k-means clustering. Defaults to 10.
|
linkage_method |
A character specifying the agglomerative
linkage method to be used if clust_method = "hclust". The options
are ward.D2, single, complete, average,
mcquitty, median, and centroid. The default is
complete.
|
n_medoids |
A numeric indicating the number of medoids to
consider if n_centers is set to 1 and contrasts is a vector of
length 2 or more. The default is 8 medoids.
|
parallel |
A logical indicating whether to invoke parallel
processing via the BiocParallel infrastructure. The default is
FALSE for sequential evaluation.
|
clusters |
A numeric vector of cluster labels for observations in
the target data. Defaults to NULL, but is otherwise used to
identify the optimal set of hyperparameters when fitting scPCA and the
automated version of cPCA. If a vector is provided, the
n_centers, clust_method, max_iter,
linkage_method, and n_medoids arguments can be safely ignored.
|
eigdecomp_tol |
A numeric providing the level of precision used by
eigendecomposition calculations. Defaults to 1e-10.
|
eigdecomp_iter |
A numeric indicating the maximum number of
iterations performed by eigendecomposition calculations. Defaults to
1000.
|
scaled_matrix |
A logical indicating whether to output a
ScaledMatrix object. The centering and scaling
procedure is delayed until later, permitting more efficient matrix
multiplication and row or column sums downstream. However, this comes at
the cost of numerical precision. Defaults to FALSE.
|
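The default hyperparameter grids described in the arguments above can be rebuilt in base R to see how large the search is. The sketch below only reconstructs the grids from the default argument values; it assumes each contrast is paired with each penalty during tuning, and the variable names are illustrative.

```r
# default contrast grid: 40 values, evenly spaced on the log scale
default_contrasts <- exp(seq(log(0.1), log(1000), length.out = 40))
range(default_contrasts)

# default penalty grid: 20 equidistant values between 0.05 and 1
default_penalties <- seq(0.05, 1, length.out = 20)
range(default_penalties)

# candidate (contrast, penalty) pairs, assuming a full grid search
candidate_grid <- expand.grid(
  contrast = default_contrasts,
  penalty = default_penalties
)
nrow(candidate_grid)  # 40 * 20 = 800 candidate pairs
```

Because the number of candidate pairs grows multiplicatively, shortening either grid (as the examples in this page do) is a simple way to reduce run time.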
Value
A list containing the following components:
- rotation: The matrix of variable loadings if n_centers is larger than
  one. Otherwise, a list of rotation matrices is returned, one for each
  medoid. The number of medoids is specified by n_medoids.
- x: The rotated data, centered and scaled if requested, multiplied by the
  rotation matrix if n_centers is larger than one. Otherwise, a list of
  rotated data matrices is returned, one for each medoid. The number of
  medoids is specified by n_medoids.
- contrast: The optimal contrastive parameter.
- penalty: The optimal L1 penalty term.
- center: A logical indicating whether the target dataset was centered.
- scale: A logical indicating whether the target dataset was scaled.
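As a hedged illustration of how the returned list might be inspected, the sketch below fits cPCA with automated tuning and reads back the components listed above. It assumes the scPCA package is installed and that the toy_df and background_df data sets used in the examples are available once the package is attached; the object name cpca_fit is illustrative, and the block is skipped when the package is missing.

```r
# run only when the scPCA package is available
if (requireNamespace("scPCA", quietly = TRUE)) {
  library(scPCA)

  cpca_fit <- scPCA(
    target = toy_df[, 1:30],
    background = background_df,
    contrasts = exp(seq(log(0.1), log(100), length.out = 5)),
    penalties = 0,
    n_centers = 4
  )

  # n_centers is larger than one, so rotation and x are single matrices
  dim(cpca_fit$rotation)  # features by components
  dim(cpca_fit$x)         # observations by components

  # hyperparameters selected by the tuning heuristic
  cpca_fit$contrast
  cpca_fit$penalty
}
```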
References
\insertAllCited
Examples
# attach the package, which provides the toy_df and background_df data sets
library(scPCA)

# perform cPCA on the simulated data set
scPCA(
  target = toy_df[, 1:30],
  background = background_df,
  contrasts = exp(seq(log(0.1), log(100), length.out = 5)),
  penalties = 0,
  n_centers = 4
)

# perform scPCA on the simulated data set
scPCA(
  target = toy_df[, 1:30],
  background = background_df,
  contrasts = exp(seq(log(0.1), log(100), length.out = 5)),
  penalties = seq(0.1, 1, length.out = 3),
  n_centers = 4
)

# perform cPCA on the simulated data set with known clusters
# (column 31 of toy_df holds the cluster labels)
scPCA(
  target = toy_df[, 1:30],
  background = background_df,
  contrasts = exp(seq(log(0.1), log(100), length.out = 5)),
  penalties = 0,
  clusters = toy_df[, 31]
)

# cPCA as implemented in Abid et al.
scPCA(
  target = toy_df[, 1:30],
  background = background_df,
  contrasts = exp(seq(log(0.1), log(100), length.out = 10)),
  penalties = 0,
  n_centers = 1
)