cPCAplus: Contrastive PCA++ (cPCA++) Performs Contrastive PCA++...

View source: R/cPCA.R

cPCAplusR Documentation

Contrastive PCA++ (cPCA++) Performs Contrastive PCA++ (cPCA++) to find directions that capture variation enriched in a "foreground" dataset relative to a "background" dataset. This implementation follows the cPCA++ approach which directly solves the generalized eigenvalue problem Rf v = lambda Rb v, where Rf and Rb are the covariance matrices of the foreground and background data, centered using the background mean.

Description

Contrastive PCA++ (cPCA++) Performs Contrastive PCA++ (cPCA++) to find directions that capture variation enriched in a "foreground" dataset relative to a "background" dataset. This implementation follows the cPCA++ approach which directly solves the generalized eigenvalue problem Rf v = lambda Rb v, where Rf and Rb are the covariance matrices of the foreground and background data, centered using the background mean.

Usage

cPCAplus(
  X_f,
  X_b,
  ncomp = NULL,
  center_background = TRUE,
  lambda = 0,
  method = c("geigen", "primme", "sdiag", "corpcor"),
  strategy = c("auto", "feature", "sample"),
  ...
)

Arguments

X_f

A numeric matrix representing the foreground dataset (samples x features).

X_b

A numeric matrix representing the background dataset (samples x features). X_f and X_b must have the same number of features (columns).

ncomp

Integer. The number of contrastive components to compute. Defaults to min(ncol(X_f)), but will be capped by the effective rank based on sample sizes.

center_background

Logical. If TRUE (default), both X_f and X_b are centered using the column means of X_b. If FALSE, it assumes data is already appropriately centered.

lambda

Shrinkage intensity for covariance estimation (0 <= lambda <= 1). Defaults to 0 (no shrinkage). Uses corpcor::cov.shrink. Can help stabilize results if Rb is ill-conditioned or singular.

method

A character string specifying the primary computation method. Options include:

  • "geigen" (Default): Use geneig from the geigen package.

  • "primme": Use geneig with the PRIMME library backend (requires special geigen build).

  • "sdiag": Use geneig with a spectral decomposition method.

  • "corpcor": Use a corpcor-based whitening approach followed by standard PCA.

strategy

Controls the GEVD approach when method is not "corpcor". Options include:

  • "auto" (Default): Chooses based on dimensions (feature vs. sample space).

  • "feature": Forces direct computation via ⁠p x p⁠ covariance matrices.

  • "sample": Forces sample-space computation via SVD and a smaller GEVD (efficient for large p).

...

Additional arguments passed to the underlying computation functions (geigen::geneig or irlba::irlba based on method and strategy).

Details

Preprocessing: Following the cPCA++ paper, if center_background = TRUE, both X_f and X_b are centered by subtracting the column means calculated only from the background data X_b. This is crucial for isolating variance specific to X_f.

Core Algorithm (methods "geigen", "primme", "sdiag", strategy="feature"):

  1. Center X_f and X_b using the mean of X_b.

  2. Compute potentially shrunk p \times p covariance matrices Rf (from centered X_f) and Rb (from centered X_b) using corpcor::cov.shrink.

  3. Solve the generalized eigenvalue problem ⁠Rf v = lambda Rb v⁠ for the top ncomp eigenvectors v using geigen::geneig. These eigenvectors are the contrastive principal components (loadings).

  4. Compute scores by projecting the centered foreground data onto the eigenvectors: S = X_f_centered %*% v.

Core Algorithm (Large-D / Sample Space Strategy, strategy="sample"): When p \gg n, forming p \times p matrices Rf and Rb is infeasible. The "sample" strategy follows cPCA++ §3.2:

  1. Center X_f and X_b using the mean of X_b.

  2. Compute the SVD of centered X_b = Ub Sb Vb^T (using irlba for efficiency).

  3. Project centered X_f into the background's principal subspace: Zf = X_f_centered %*% Vb.

  4. Form small r \times r matrices: Rf_small = cov(Zf) and Rb_small = (1/(n_b-1)) * Sb^2.

  5. Solve the small r \times r GEVD: ⁠Rf_small w = lambda Rb_small w⁠ using geigen::geneig.

  6. Lift eigenvectors back to feature space: v = Vb %*% w.

  7. Compute scores: S = X_f_centered %*% v.

Alternative Algorithm (method "corpcor"):

  1. Center X_f and X_b using the mean of X_b.

  2. Compute Rb and its inverse square root Rb_inv_sqrt.

  3. Whiten the foreground data: X_f_whitened = X_f_centered %*% Rb_inv_sqrt.

  4. Perform standard PCA (stats::prcomp) on X_f_whitened.

  5. The returned v and s are the loadings and scores in the whitened space. The loadings are not the generalized eigenvectors v. A specific class corpcor_pca is added to signal this.

Value

A bi_projector-like object with classes c("cPCAplus", "<method_class>", "bi_projector") containing:

v

Loadings matrix (features x ncomp). Interpretation depends on method (see Details).

s

Scores matrix (samples_f x ncomp).

sdev

Vector (length ncomp). Standard deviations (sqrt of generalized eigenvalues for geigen methods, PCA std devs for corpcor).

values

Vector (length ncomp). Generalized eigenvalues (for geigen methods) or PCA eigenvalues (for corpcor).

strategy

The strategy used ("feature" or "sample") if method was not "corpcor".

preproc

The initialized preprocessor object used.

method

The computation method used.

ncomp

The number of components computed.

nfeatures

The number of features.

References

Salloum, R., Kuo, C. C. J. (2022). cPCA++: An efficient method for contrastive feature learning. Pattern Recognition, 124, 108378. (Algorithm 1)

Examples

# Simulate data where foreground has extra variance in first few dimensions
set.seed(123)
n_f <- 100
n_b <- 150
n_features <- 50

# Background: standard normal noise
X_b <- matrix(rnorm(n_b * n_features), nrow=n_b, ncol=n_features)
colnames(X_b) <- paste0("Feat_", 1:n_features)

# Foreground: background noise + extra variance in first 5 features
X_f_signal <- matrix(rnorm(n_f * 5, mean=0, sd=2), nrow=n_f, ncol=5)
X_f_noise <- matrix(rnorm(n_f * (n_features-5)), nrow=n_f, ncol=n_features-5)
X_f <- cbind(X_f_signal, X_f_noise) + matrix(rnorm(n_f * n_features), nrow=n_f, ncol=n_features)
colnames(X_f) <- paste0("Feat_", 1:n_features)
rownames(X_f) <- paste0("SampleF_", 1:n_f)

# Apply cPCA++ (requires geigen and corpcor packages)
# install.packages(c("geigen", "corpcor"))
if (requireNamespace("geigen", quietly = TRUE) && requireNamespace("corpcor", quietly = TRUE)) {
  # Assuming helper constructors like bi_projector are available
  # library(multivarious) 

  res_cpca_plus <- cPCAplus(X_f, X_b, ncomp = 5, method = "geigen")

  # Scores for the foreground data (samples x components)
  print(head(res_cpca_plus$s))

  # Loadings (contrastive directions) (features x components)
  print(head(res_cpca_plus$v))

  # Plot scores
  plot(res_cpca_plus$s[, 1], res_cpca_plus$s[, 2],
       xlab = "Contrastive Component 1", ylab = "Contrastive Component 2",
       main = "cPCA++ Scores (geigen method)")

  # Example with corpcor method
  res_cpca_corp <- cPCAplus(X_f, X_b, ncomp = 5, method = "corpcor")
  print(head(res_cpca_corp$s)) # Scores in whitened space
  print(head(res_cpca_corp$v)) # Loadings in whitened space
}


bbuchsbaum/multivarious documentation built on July 16, 2025, 11:04 p.m.