cPCAplus: Contrastive PCA++ (cPCA++) Performs Contrastive PCA++...
In bbuchsbaum/multivarious: Extensible Data Structures for Multivariate Analysis

cPCAplus

R Documentation

Contrastive PCA++ (cPCA++) Performs Contrastive PCA++ (cPCA++) to find directions that capture variation enriched in a "foreground" dataset relative to a "background" dataset. This implementation follows the cPCA++ approach which directly solves the generalized eigenvalue problem Rf v = lambda Rb v, where Rf and Rb are the covariance matrices of the foreground and background data, centered using the background mean.

Description

Contrastive PCA++ (cPCA++) Performs Contrastive PCA++ (cPCA++) to find directions that capture variation enriched in a "foreground" dataset relative to a "background" dataset. This implementation follows the cPCA++ approach which directly solves the generalized eigenvalue problem Rf v = lambda Rb v, where Rf and Rb are the covariance matrices of the foreground and background data, centered using the background mean.

Usage

cPCAplus(
  X_f,
  X_b,
  ncomp = NULL,
  center_background = TRUE,
  lambda = 0,
  method = c("geigen", "primme", "sdiag", "corpcor"),
  strategy = c("auto", "feature", "sample"),
  ...
)

Arguments

`X_f`	A numeric matrix representing the foreground dataset (samples x features).
`X_b`	A numeric matrix representing the background dataset (samples x features). `X_f` and `X_b` must have the same number of features (columns).
`ncomp`	Integer. The number of contrastive components to compute. Defaults to `min(ncol(X_f))`, but will be capped by the effective rank based on sample sizes.
`center_background`	Logical. If TRUE (default), both `X_f` and `X_b` are centered using the column means of `X_b`. If FALSE, it assumes data is already appropriately centered.
`lambda`	Shrinkage intensity for covariance estimation (0 <= lambda <= 1). Defaults to 0 (no shrinkage). Uses `corpcor::cov.shrink`. Can help stabilize results if `Rb` is ill-conditioned or singular.
`method`	A character string specifying the primary computation method. Options include: `"geigen"` (Default): Use `geneig` from the `geigen` package. `"primme"`: Use `geneig` with the PRIMME library backend (requires special `geigen` build). `"sdiag"`: Use `geneig` with a spectral decomposition method. `"corpcor"`: Use a corpcor-based whitening approach followed by standard PCA.
`strategy`	Controls the GEVD approach when `method` is not `"corpcor"`. Options include: `"auto"` (Default): Chooses based on dimensions (feature vs. sample space). `"feature"`: Forces direct computation via `⁠p x p⁠` covariance matrices. `"sample"`: Forces sample-space computation via SVD and a smaller GEVD (efficient for large `p`).
`...`	Additional arguments passed to the underlying computation functions (`geigen::geneig` or `irlba::irlba` based on `method` and `strategy`).

Details

Preprocessing: Following the cPCA++ paper, if center_background = TRUE, both X_f and X_b are centered by subtracting the column means calculated only from the background data X_b. This is crucial for isolating variance specific to X_f.

Core Algorithm (methods "geigen", "primme", "sdiag", strategy="feature"):

Center X_f and X_b using the mean of X_b.
Compute potentially shrunk p \times p covariance matrices Rf (from centered X_f) and Rb (from centered X_b) using corpcor::cov.shrink.
Solve the generalized eigenvalue problem ⁠Rf v = lambda Rb v⁠ for the top ncomp eigenvectors v using geigen::geneig. These eigenvectors are the contrastive principal components (loadings).
Compute scores by projecting the centered foreground data onto the eigenvectors: S = X_f_centered %*% v.

Core Algorithm (Large-D / Sample Space Strategy, strategy="sample"): When p \gg n, forming p \times p matrices Rf and Rb is infeasible. The "sample" strategy follows cPCA++ §3.2:

Center X_f and X_b using the mean of X_b.
Compute the SVD of centered X_b = Ub Sb Vb^T (using irlba for efficiency).
Project centered X_f into the background's principal subspace: Zf = X_f_centered %*% Vb.
Form small r \times r matrices: Rf_small = cov(Zf) and Rb_small = (1/(n_b-1)) * Sb^2.
Solve the small r \times r GEVD: ⁠Rf_small w = lambda Rb_small w⁠ using geigen::geneig.
Lift eigenvectors back to feature space: v = Vb %*% w.
Compute scores: S = X_f_centered %*% v.

Alternative Algorithm (method "corpcor"):

Center X_f and X_b using the mean of X_b.
Compute Rb and its inverse square root Rb_inv_sqrt.
Whiten the foreground data: X_f_whitened = X_f_centered %*% Rb_inv_sqrt.
Perform standard PCA (stats::prcomp) on X_f_whitened.
The returned v and s are the loadings and scores in the whitened space. The loadings are not the generalized eigenvectors v. A specific class corpcor_pca is added to signal this.

Value

A bi_projector-like object with classes c("cPCAplus", "<method_class>", "bi_projector") containing:

v: Loadings matrix (features x ncomp). Interpretation depends on method (see Details).
s: Scores matrix (samples_f x ncomp).
sdev: Vector (length ncomp). Standard deviations (sqrt of generalized eigenvalues for geigen methods, PCA std devs for corpcor).
values: Vector (length ncomp). Generalized eigenvalues (for geigen methods) or PCA eigenvalues (for corpcor).
strategy: The strategy used ("feature" or "sample") if method was not "corpcor".
preproc: The initialized preprocessor object used.
method: The computation method used.
ncomp: The number of components computed.
nfeatures: The number of features.

References

Salloum, R., Kuo, C. C. J. (2022). cPCA++: An efficient method for contrastive feature learning. Pattern Recognition, 124, 108378. (Algorithm 1)

Examples

# Simulate data where foreground has extra variance in first few dimensions
set.seed(123)
n_f <- 100
n_b <- 150
n_features <- 50

# Background: standard normal noise
X_b <- matrix(rnorm(n_b * n_features), nrow=n_b, ncol=n_features)
colnames(X_b) <- paste0("Feat_", 1:n_features)

# Foreground: background noise + extra variance in first 5 features
X_f_signal <- matrix(rnorm(n_f * 5, mean=0, sd=2), nrow=n_f, ncol=5)
X_f_noise <- matrix(rnorm(n_f * (n_features-5)), nrow=n_f, ncol=n_features-5)
X_f <- cbind(X_f_signal, X_f_noise) + matrix(rnorm(n_f * n_features), nrow=n_f, ncol=n_features)
colnames(X_f) <- paste0("Feat_", 1:n_features)
rownames(X_f) <- paste0("SampleF_", 1:n_f)

# Apply cPCA++ (requires geigen and corpcor packages)
# install.packages(c("geigen", "corpcor"))
if (requireNamespace("geigen", quietly = TRUE) && requireNamespace("corpcor", quietly = TRUE)) {
  # Assuming helper constructors like bi_projector are available
  # library(multivarious) 

  res_cpca_plus <- cPCAplus(X_f, X_b, ncomp = 5, method = "geigen")

  # Scores for the foreground data (samples x components)
  print(head(res_cpca_plus$s))

  # Loadings (contrastive directions) (features x components)
  print(head(res_cpca_plus$v))

  # Plot scores
  plot(res_cpca_plus$s[, 1], res_cpca_plus$s[, 2],
       xlab = "Contrastive Component 1", ylab = "Contrastive Component 2",
       main = "cPCA++ Scores (geigen method)")

  # Example with corpcor method
  res_cpca_corp <- cPCAplus(X_f, X_b, ncomp = 5, method = "corpcor")
  print(head(res_cpca_corp$s)) # Scores in whitened space
  print(head(res_cpca_corp$v)) # Loadings in whitened space
}

bbuchsbaum/multivarious documentation built on July 16, 2025, 11:04 p.m.