R/cPCA.R

#' Contrastive PCA++ (cPCA++)
#'
#' Performs Contrastive PCA++ (cPCA++) to find directions that capture variation
#' enriched in a "foreground" dataset relative to a "background" dataset.
#' This implementation follows the cPCA++ approach which directly solves the
#' generalized eigenvalue problem Rf v = lambda Rb v, where Rf and Rb are
#' the covariance matrices of the foreground and background data, centered
#' using the *background mean*.
#'
#' @references
#' Salloum, R., Kuo, C. C. J. (2022). cPCA++: An efficient method for contrastive feature learning. Pattern Recognition, 124, 108378. (Algorithm 1)
#'
#' @param X_f A numeric matrix representing the foreground dataset (samples x features).
#' @param X_b A numeric matrix representing the background dataset (samples x features).
#'        `X_f` and `X_b` must have the same number of features (columns).
#' @param ncomp Integer. The number of contrastive components to compute. Defaults to
#'        `min(ncol(X_f), nrow(X_f), nrow(X_b))`, i.e. capped by the effective rank implied
#'        by the feature and sample dimensions.
#' @param center_background Logical. If TRUE (default), both `X_f` and `X_b` are centered using the
#'        column means of `X_b`. If FALSE, it assumes data is already appropriately centered.
#' @param lambda Shrinkage intensity for covariance estimation (0 <= lambda <= 1).
#'        Defaults to 0 (no shrinkage). Uses `corpcor::cov.shrink`. Can help stabilize
#'        results if `Rb` is ill-conditioned or singular.
#' @param method A character string specifying the primary computation method. Options include:
#'    - `"geigen"` (Default): Solve the generalized eigenvalue problem with `geigen::geigen`.
#'    - `"primme"`, `"sdiag"`: Accepted for compatibility; in the current implementation both
#'      are routed through the same `geigen::geigen` solver.
#'    - `"corpcor"`: Use a corpcor-based whitening approach followed by standard PCA.
#' @param strategy Controls the GEVD approach when `method` is not `"corpcor"`. Options include:
#'    - `"auto"` (Default): Chooses based on dimensions (feature vs. sample space).
#'    - `"feature"`: Forces direct computation via `p x p` covariance matrices.
#'    - `"sample"`: Forces sample-space computation via SVD and a smaller GEVD (efficient for large `p`).
#' @param ... Additional arguments passed to the underlying solver
#'    (currently forwarded to `geigen::geigen` in the feature-space path).
#'
#' @details
#' **Preprocessing:** Following the cPCA++ paper, if `center_background = TRUE`, both `X_f` and `X_b`
#' are centered by subtracting the column means calculated *only* from the background data `X_b`.
#' This is crucial for isolating variance specific to `X_f`.
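#'
#' For example (a minimal sketch; `Xf_c` and `Xb_c` are reused in the snippets below):
#'
#' ```r
#' mu_b <- colMeans(X_b)            # background mean only
#' Xf_c <- sweep(X_f, 2, mu_b, "-") # foreground centered with the background mean
#' Xb_c <- sweep(X_b, 2, mu_b, "-")
#' ```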
#'  
#' **Core Algorithm (methods "geigen", "primme", "sdiag", strategy="feature"):**
#' 1. Center `X_f` and `X_b` using the mean of `X_b`.
#' 2. Compute potentially shrunk \eqn{p \times p} covariance matrices `Rf` (from centered `X_f`) and `Rb` (from centered `X_b`) using `corpcor::cov.shrink`.
#' 3. Solve the generalized eigenvalue problem `Rf v = lambda Rb v` for the top `ncomp` eigenvectors `v` using `geigen::geigen`. These eigenvectors are the contrastive principal components (loadings).
#' 4. Compute scores by projecting the centered foreground data onto the eigenvectors: `S = X_f_centered %*% v`.
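#'
#' A minimal sketch of steps 2-4, reusing `Xf_c` and `Xb_c` from the snippet above
#' (illustrative only; `k` denotes the number of components):
#'
#' ```r
#' Rf <- unclass(corpcor::cov.shrink(Xf_c, lambda = 0, verbose = FALSE))
#' Rb <- unclass(corpcor::cov.shrink(Xb_c, lambda = 0, verbose = FALSE))
#' g  <- geigen::geigen(Rf, Rb, symmetric = TRUE)
#' v  <- g$vectors[, order(g$values, decreasing = TRUE)[1:k], drop = FALSE]
#' S  <- Xf_c %*% v
#' ```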
#'
#' **Core Algorithm (Large-D / Sample Space Strategy, strategy="sample"):**
#' When \eqn{p \gg n}, forming \eqn{p \times p} matrices `Rf` and `Rb` is infeasible. The "sample" strategy follows cPCA++ §3.2:
#' 1. Center `X_f` and `X_b` using the mean of `X_b`.
#' 2. Compute the SVD of centered \eqn{X_b = Ub Sb Vb^T} (using `irlba` for efficiency).
#' 3. Project centered `X_f` into the background's principal subspace: `Zf = X_f_centered %*% Vb`.
#' 4. Form small \eqn{r \times r} matrices: `Rf_small = cov(Zf)` and `Rb_small = (1/(n_b-1)) * Sb^2`.
#' 5. Solve the small \eqn{r \times r} GEVD: `Rf_small w = lambda Rb_small w` using `geigen::geigen`.
#' 6. Lift eigenvectors back to feature space: `v = Vb %*% w`.
#' 7. Compute scores: `S = X_f_centered %*% v`.
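#'
#' A condensed sketch of steps 2-7 (assuming `irlba` is available; `r` is the
#' background rank and `k` the number of components):
#'
#' ```r
#' sv   <- irlba::irlba(Xb_c, nv = r, nu = 0)         # thin SVD of centered background
#' Zf   <- Xf_c %*% sv$v                              # foreground in background subspace
#' Rf_s <- crossprod(Zf) / (nrow(Xf_c) - 1)           # r x r
#' Rb_s <- diag(sv$d^2 / (nrow(Xb_c) - 1), nrow = r)  # r x r
#' g    <- geigen::geigen(Rf_s, Rb_s, symmetric = TRUE)
#' w    <- g$vectors[, order(g$values, decreasing = TRUE)[1:k], drop = FALSE]
#' v    <- sv$v %*% w                                 # lift to feature space
#' S    <- Xf_c %*% v
#' ```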
#'
#' **Alternative Algorithm (method "corpcor"):**
#' 1. Center `X_f` and `X_b` using the mean of `X_b`.
#' 2. Compute `Rb` and its inverse square root `Rb_inv_sqrt` via an eigendecomposition.
#' 3. Whiten the foreground data: `X_f_whitened = X_f_centered %*% Rb_inv_sqrt`.
#' 4. Perform standard PCA (`stats::prcomp`) on `X_f_whitened`.
#' 5. Back-transform the PCA rotation to the original feature space, `v = Rb_inv_sqrt %*% rotation`, and recompute scores as `S = X_f_centered %*% v`. The reported `sdev` and `values` come from the PCA on the *whitened* data, not from a generalized eigendecomposition; the class `corpcor_pca` is added to signal this.
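#'
#' A minimal sketch of the whitening route (with a hypothetical small eigenvalue
#' tolerance `tol` and `k` components):
#'
#' ```r
#' Rb  <- unclass(corpcor::cov.shrink(Xb_c, lambda = 0, verbose = FALSE))
#' e   <- eigen(Rb, symmetric = TRUE)
#' d   <- ifelse(e$values > tol, 1 / sqrt(e$values), 0)
#' Rb_inv_sqrt <- e$vectors %*% diag(d) %*% t(e$vectors)
#' pc  <- stats::prcomp(Xf_c %*% Rb_inv_sqrt, center = FALSE, rank. = k)
#' v   <- Rb_inv_sqrt %*% pc$rotation  # back-transform loadings
#' S   <- Xf_c %*% v
#' ```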
#'
#' @return A `bi_projector`-like object with classes `c("cPCAplus", "<method_class>", "bi_projector")` containing:
#' \describe{
#'   \item{v}{Loadings matrix (features x ncomp). Interpretation depends on `method` (see Details).}
#'   \item{s}{Scores matrix (samples_f x ncomp).}
#'   \item{sdev}{Vector (length ncomp). Standard deviations (square roots of the generalized eigenvalues for the GEVD methods; PCA standard deviations for `"corpcor"`).}
#'   \item{values}{Vector (length ncomp). Generalized eigenvalues (GEVD methods) or PCA eigenvalues (`"corpcor"`).}
#'   \item{preproc}{The initialized preprocessor object used for centering.}
#'   \item{method_used}{A list recording the computation `method`, the `strategy` actually used ("feature" or "sample", GEVD methods only), `lambda`, and `ncomp`.}
#' }
#'
#' @examples
#' # Simulate data where foreground has extra variance in first few dimensions
#' set.seed(123)
#' n_f <- 100
#' n_b <- 150
#' n_features <- 50
#'
#' # Background: standard normal noise
#' X_b <- matrix(rnorm(n_b * n_features), nrow=n_b, ncol=n_features)
#' colnames(X_b) <- paste0("Feat_", 1:n_features)
#'
#' # Foreground: background noise + extra variance in first 5 features
#' X_f_signal <- matrix(rnorm(n_f * 5, mean=0, sd=2), nrow=n_f, ncol=5)
#' X_f_noise <- matrix(rnorm(n_f * (n_features-5)), nrow=n_f, ncol=n_features-5)
#' X_f <- cbind(X_f_signal, X_f_noise) + matrix(rnorm(n_f * n_features), nrow=n_f, ncol=n_features)
#' colnames(X_f) <- paste0("Feat_", 1:n_features)
#' rownames(X_f) <- paste0("SampleF_", 1:n_f)
#'
#' # Apply cPCA++ (requires geigen and corpcor packages)
#' # install.packages(c("geigen", "corpcor"))
#' if (requireNamespace("geigen", quietly = TRUE) && requireNamespace("corpcor", quietly = TRUE)) {
#'   # Assuming helper constructors like bi_projector are available
#'   # library(multivarious) 
#'
#'   res_cpca_plus <- cPCAplus(X_f, X_b, ncomp = 5, method = "geigen")
#'
#'   # Scores for the foreground data (samples x components)
#'   print(head(res_cpca_plus$s))
#'
#'   # Loadings (contrastive directions) (features x components)
#'   print(head(res_cpca_plus$v))
#'
#'   # Plot scores
#'   plot(res_cpca_plus$s[, 1], res_cpca_plus$s[, 2],
#'        xlab = "Contrastive Component 1", ylab = "Contrastive Component 2",
#'        main = "cPCA++ Scores (geigen method)")
#'
#'   # Example with corpcor method (whitening + PCA)
#'   res_cpca_corp <- cPCAplus(X_f, X_b, ncomp = 5, method = "corpcor")
#'   print(head(res_cpca_corp$s)) # Scores (centered X_f projected onto back-transformed loadings)
#'   print(head(res_cpca_corp$v)) # Loadings back-transformed to the original feature space
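#'
#'   # Force the sample-space strategy (intended for p >> n; uses irlba)
#'   if (requireNamespace("irlba", quietly = TRUE)) {
#'     res_cpca_samp <- cPCAplus(X_f, X_b, ncomp = 5, strategy = "sample")
#'     print(head(res_cpca_samp$v))
#'   }
#'
#'   # The result behaves like a multivarious bi_projector, so new samples can
#'   # be projected (assuming the multivarious generics are available):
#'   # project(res_cpca_plus, X_f[1:5, ])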
#' }
#'
#' @importFrom stats prcomp
#' @importFrom corpcor cov.shrink
#' @importFrom geigen geigen
#' @export
cPCAplus <- function(X_f, X_b, ncomp = NULL,
                     center_background = TRUE,
                     lambda = 0,
                     method = c("geigen", "primme", "sdiag", "corpcor"),
                     strategy = c("auto", "feature", "sample"),
                     ...) {

  # --- Input Validation --- 
  method <- match.arg(method)
  strategy <- match.arg(strategy)
  # Use chk or assertions for validation
  stopifnot(is.matrix(X_f), is.numeric(X_f))
  stopifnot(is.matrix(X_b), is.numeric(X_b))
  if (ncol(X_f) != ncol(X_b)) {
    stop("Foreground matrix X_f (", ncol(X_f), " features) and background matrix X_b (",
         ncol(X_b), " features) must have the same number of columns.")
  }
  if (is.null(ncomp)) {
    ncomp <- min(ncol(X_f), nrow(X_f), nrow(X_b))
    message("ncomp not provided, defaulting to min(p, n_f, n_b) = ", ncomp)
  }
  stopifnot(is.numeric(ncomp), length(ncomp) == 1, ncomp > 0, ncomp == floor(ncomp))
  stopifnot(ncomp <= ncol(X_f))
  stopifnot(is.logical(center_background), length(center_background) == 1)
  stopifnot(is.numeric(lambda), length(lambda) == 1, lambda >= 0, lambda <= 1)

  # Preserve original row/col names
  rn_f <- rownames(X_f)
  cn <- colnames(X_f)
  if (is.null(cn) && !is.null(colnames(X_b))) cn <- colnames(X_b)

  # --- Preprocessing --- 
  # Build a finalized preprocessor so the centering can be replayed on new data
  if (center_background) {
    mean_b <- colMeans(X_b, na.rm = TRUE)
    if(any(is.na(mean_b))) stop("NA values encountered in background mean calculation.")
    X_f_centered <- sweep(X_f, 2, mean_b, "-")
    X_b_centered <- sweep(X_b, 2, mean_b, "-")
    # Create a finalized preprocessor using the calculated means
    proc <- prep(center(cmeans = mean_b))
  } else {
    X_f_centered <- X_f
    X_b_centered <- X_b
    # Create a finalized identity preprocessor
    proc <- prep(pass())
  }

  # --- Core Computation --- 
  if (method == "corpcor") {
    if (!requireNamespace("corpcor", quietly = TRUE)) {
            stop("Package 'corpcor' needed for method='corpcor'. Please install it.", call. = FALSE)
        }

    # Compute Rb using cov.shrink on already centered data
    Rb <- corpcor::cov.shrink(X_b_centered, lambda = lambda, verbose = FALSE)

    # Compute Rb^(-1/2) using eigenvalue decomposition
    eig_Rb <- eigen(Rb, symmetric = TRUE)
    eig_vals_Rb_inv_sqrt <- ifelse(eig_Rb$values > sqrt(.Machine$double.eps), 1 / sqrt(eig_Rb$values), 0)
    # Check for potential issues
    if (sum(eig_vals_Rb_inv_sqrt > 0) < ncomp) {
        warning("Rank of background covariance (after thresholding) is less than ncomp. Results may be unreliable.")
    }
    Rb_inv_sqrt <- eig_Rb$vectors %*% diag(eig_vals_Rb_inv_sqrt, nrow=length(eig_vals_Rb_inv_sqrt)) %*% t(eig_Rb$vectors)

    # Whiten foreground data
    X_f_whitened <- X_f_centered %*% Rb_inv_sqrt

    # Perform standard PCA on the whitened data
    # Using prcomp for robustness and standard output
    pca_res <- tryCatch({
        stats::prcomp(X_f_whitened, center = FALSE, scale. = FALSE, rank. = ncomp)
        }, error = function(e){ 
            stop("Error during stats::prcomp on whitened data: ", e$message)
        })

    # Adjust ncomp if prcomp returned fewer components
    actual_ncomp <- min(ncomp, ncol(pca_res$rotation))
    if (actual_ncomp < ncomp) {
        warning("prcomp returned fewer components (", actual_ncomp, ") than requested (", ncomp, ").")
        ncomp <- actual_ncomp
    }
    if (ncomp == 0) stop("PCA on whitened data yielded zero components.")

    # Results are in the whitened space
    eigenvectors_whitened <- pca_res$rotation[, 1:ncomp, drop = FALSE]
    eigenvalues_pca <- (pca_res$sdev[1:ncomp])^2
    scores_pca <- pca_res$x[, 1:ncomp, drop = FALSE]

    # Back-transform loadings to original space
    v_orig <- Rb_inv_sqrt %*% eigenvectors_whitened

    # Recalculate scores by projecting original centered data onto back-transformed loadings
    scores <- X_f_centered %*% v_orig

    # Create structure using bi_projector constructor
    # Note: values/sdev are from PCA on whitened data, not generalized eigenvalues
     projector <- bi_projector(
            v = v_orig,                  # Loadings in original space
            s = scores,                  # Scores = projection onto original space loadings
            sdev = pca_res$sdev[1:ncomp], # Sdev from PCA on whitened data
            values = eigenvalues_pca,    # Eigenvalues from PCA on whitened data
            preproc = proc,
            classes = c("cPCAplus", "corpcor_pca", "bi_projector"),
            method_used = list(type = "cPCAplus", method = method, lambda=lambda, ncomp=ncomp) # Store extra info
        )

  } else {
    # --- Methods using geigen --- 
    if (!requireNamespace("geigen", quietly = TRUE)) {
            stop("Package 'geigen' needed for methods 'geigen', 'primme', 'sdiag'. Please install it.", call. = FALSE)
        }
    if (!requireNamespace("corpcor", quietly = TRUE)) {
            # Required for covariance estimation
            stop("Package 'corpcor' needed for covariance estimation. Please install it.", call. = FALSE)
        }
        
    # Determine strategy: feature space (direct covariance) or sample space (large D)
    p <- ncol(X_f_centered)
    n_f <- nrow(X_f_centered)
    n_b <- nrow(X_b_centered)
    use_sample_space <- FALSE
    effective_strategy <- strategy

    if (strategy == "auto") {
      # Heuristic: p > 5 * max(n_f, n_b)
      if (p > 5 * max(n_f, n_b)) {
        message("Large dimension detected (p > 5*max(n_f, n_b)). Switching to sample-space GEVD strategy.")
        use_sample_space <- TRUE
        effective_strategy <- "sample"
      } else {
        effective_strategy <- "feature"
      }
    } else if (strategy == "sample") {
      use_sample_space <- TRUE
    } # else strategy == "feature" -> use_sample_space remains FALSE

    if (lambda != 0 && use_sample_space) {
        warning("Shrinkage (lambda != 0) is specified but the sample-space strategy is selected. Shrinkage currently only applied to feature-space strategy.")
        # lambda is ignored in sample-space path computation below
    }

    # --- Compute Eigen solution based on strategy --- 
    if (use_sample_space) {
        # --- Sample Space Strategy (Large D) --- 
        message("Using sample-space strategy...")
        if (!requireNamespace("irlba", quietly = TRUE))
            stop("Package 'irlba' needed for large-D sample-space strategy. Install it.", call.=FALSE)

        # 1. Background SVD (thin)
        # Oversample slightly for stability
        r_target <- min(n_b - 1, p, ncomp + 15)
        if (r_target <= 0) stop("Target rank for background SVD is non-positive.")

        svdb <- tryCatch({
            irlba::irlba(X_b_centered, nv = r_target, nu = 0)
           }, error = function(e){ 
               stop("Error during irlba SVD of background matrix X_b: ", e$message)
           })
        
        V_b  <- svdb$v            # p  x r
        # Use adjusted degrees of freedom for covariance estimate
        Sigma2 <- (svdb$d^2) / max(1, n_b - 1)   # length r 
        actual_rank_b <- length(Sigma2)

        # 2. Small matrices
        Rb_small <- diag(Sigma2, nrow = actual_rank_b)
        
        # Foreground projected into background subspace
        Zf  <- X_f_centered %*% V_b            # n_f x r
        # Use adjusted degrees of freedom for covariance estimate
        Rf_small <- crossprod(Zf) / max(1, n_f - 1)  # r x r

        # 3. Solve GEVD in r x r space
        ncomp_geigen <- min(ncomp, actual_rank_b)
        if (ncomp_geigen < ncomp) {
            warning("Reduced ncomp from ", ncomp, " to ", ncomp_geigen, " due to background SVD rank.")
            ncomp <- ncomp_geigen
        }
        if (ncomp == 0) stop("Cannot compute components; ncomp is zero after rank adjustment.")

        geig_small <- tryCatch({
            geigen::geigen(Rf_small, Rb_small, symmetric = TRUE)
           }, error = function(e) {
               stop("Error during small GEVD using geigen: ", e$message)
           })
        
        # Ensure enough valid eigenvalues/vectors returned
        k_avail_small <- length(geig_small$values)
        actual_ncomp_small <- min(ncomp, k_avail_small)
        if (actual_ncomp_small < ncomp) {
             warning("Small GEVD returned fewer eigenvalues (", k_avail_small, ") than requested (", ncomp, "). Reducing ncomp to ", actual_ncomp_small, ".")
             ncomp <- actual_ncomp_small
        }
        if (ncomp == 0) stop("Small GEVD returned zero components.")

        # geigen (LAPACK) returns eigenvalues of the symmetric problem in
        # ascending order, so explicitly select the top-ncomp by magnitude
        sel_small <- order(abs(geig_small$values), decreasing = TRUE)[seq_len(ncomp)]
        eigenvalues_raw <- geig_small$values[sel_small]
        w <- geig_small$vectors[, sel_small, drop = FALSE] # r x k
        
        # 4. Lift eigenvectors back to feature space
        v_raw <- V_b %*% w                     # p x k
        
        # Ensure real
        if (is.complex(v_raw)) {
            warning("Complex eigenvectors encountered after lifting, taking the real part.")
            v_raw <- Re(v_raw)
        }
        eigenvectors_ordered <- v_raw # Keep intermediate name
        eigenvalues_ordered <- eigenvalues_raw

    } else {
        # --- Feature Space Strategy (Standard/Small D) --- 
        message("Using feature-space strategy...")
        # Compute covariance matrices using shrunk estimates on centered data
        Rf_obj <- corpcor::cov.shrink(X_f_centered, lambda = lambda, verbose = FALSE)
        Rb_obj <- corpcor::cov.shrink(X_b_centered, lambda = lambda, verbose = FALSE)
        
        # Strip the "shrinkage" S3 class from the cov.shrink output so that
        # geigen receives plain numeric matrices
        Rf <- unclass(Rf_obj)
        Rb <- unclass(Rb_obj)

        # Solve the generalized eigenvalue problem: Rf v = lambda Rb v
        geigen_res <- tryCatch({
            geigen::geigen(A = Rf, B = Rb, ...)
            }, error = function(e) {
               stop("Error during geigen::geigen: ", e$message) 
            })

        # Extract results
        k_avail <- length(geigen_res$values)
        actual_ncomp <- min(ncomp, k_avail)
         if (actual_ncomp < ncomp) {
            warning("geigen returned fewer eigenvalues (", k_avail, ") than requested (", ncomp, "). Reducing ncomp to ", actual_ncomp, ".")
            ncomp <- actual_ncomp
        }
        if (ncomp == 0) stop("Generalized eigenvalue decomposition returned zero components.")

        # geigen does not return eigenvalues in decreasing order, so
        # explicitly select the top-ncomp by magnitude
        sel <- order(abs(geigen_res$values), decreasing = TRUE)[seq_len(ncomp)]
        eigenvalues_raw <- geigen_res$values[sel]
        eigenvectors_raw <- geigen_res$vectors[, sel, drop = FALSE]

        # Ensure eigenvectors are real
        if (is.complex(eigenvectors_raw)) {
            warning("Complex eigenvectors encountered, taking the real part.")
            eigenvectors_raw <- Re(eigenvectors_raw)
        }
        eigenvectors_ordered <- eigenvectors_raw # Keep intermediate name
        eigenvalues_ordered <- eigenvalues_raw
    }

    # --- Post-process results common to both GEVD strategies --- 
    
    # Order by decreasing eigenvalue magnitude
    ord <- order(abs(eigenvalues_ordered), decreasing = TRUE)
    eigenvalues <- abs(eigenvalues_ordered[ord]) # Use abs value, should be positive
    eigenvectors <- eigenvectors_ordered[, ord, drop = FALSE]

    # Apply sign flip for consistency
    if (is.numeric(eigenvectors) && nrow(eigenvectors) > 0) { # Check if eigenvectors were successfully extracted
        signs <- apply(eigenvectors, 2, function(v) {
            if(all(is.na(v) | v == 0)) return(1) # Handle all zero/NA vectors
            # Flip based on element with largest absolute value
            sign_val <- base::sign(v[which.max(abs(v))])
            # Ensure sign is not 0 if max abs value is 0 (should not happen with check above)
            if (sign_val == 0) 1 else sign_val
            })
        signs[signs == 0] <- 1 # Handle cases where max element is 0
        eigenvectors <- sweep(eigenvectors, 2, signs, FUN = "*")
    } else {
        warning("Could not apply sign flip: eigenvectors are not numeric or empty.")
    }

    # Compute scores: projection of centered foreground data onto eigenvectors
    scores <- X_f_centered %*% eigenvectors

    # Create a bi_projector instance
    projector <- bi_projector(
            v = eigenvectors,         # Generalized eigenvectors (Loadings)
            s = scores,               # Scores = Projection onto loadings
            sdev = sqrt(pmax(eigenvalues, 0)), # Ensure non-negative before sqrt
            values = eigenvalues,     # Generalized eigenvalues
            preproc = proc,
            classes = c("cPCAplus", paste0(effective_strategy, "_pca"), "bi_projector"),
            method_used = list(type = "cPCAplus", method = method, strategy = effective_strategy, lambda=lambda, ncomp=ncomp) 
        )
  }

  # --- Final Touches --- 
  # Assign names
  colnames(projector$v) <- paste0("cPC", 1:ncomp)
  colnames(projector$s) <- paste0("cPC", 1:ncomp)
  if (!is.null(cn)) {
      rownames(projector$v) <- cn
  }
  if (!is.null(rn_f)) {
      rownames(projector$s) <- rn_f
  }

  return(projector)
}