bootstrap_pca: Fast, Exact Bootstrap for PCA Results from 'pca' function

View source: R/bootstrap.R

bootstrap_pcaR Documentation

Fast, Exact Bootstrap for PCA Results from pca function

Description

Performs bootstrap resampling for Principal Component Analysis (PCA) based on the method described by Fisher et al. (2016), optimized for high-dimensional data (p >> n). This version is specifically adapted to work with the output object generated by the provided pca function (which returns a bi_projector object of class 'pca').

Usage

bootstrap_pca(
  x,
  nboot = 100,
  k = NULL,
  parallel = FALSE,
  cores = NULL,
  seed = NULL,
  epsilon = 1e-15,
  ...
)

Arguments

x

An object of class 'pca' as returned by the provided pca function. It's expected to contain loadings (v), scores (s), singular values (sdev), left singular vectors (u), and pre-processing info (preproc).

nboot

The number of bootstrap resamples to perform. Must be a positive integer (default: 100).

k

The number of principal components to bootstrap (default: all components available in the fitted PCA model x). Must be less than or equal to the number of components in x.

parallel

Logical flag indicating whether to use parallel processing via the future framework (default: FALSE). Requires the future.apply package and a configured future backend (e.g., future::plan(future::multisession)).

cores

The number of cores to use for parallel processing if parallel = TRUE (default: future::availableCores()). This is used if no future plan is set.

seed

An integer value for the random number generator seed for reproducibility (default: NULL, no seed is set).

epsilon

A small positive value added to standard deviations before division to prevent division by zero or instability (default: 1e-15).

...

Additional arguments (currently ignored).

Details

This function implements the fast bootstrap PCA algorithm proposed by Fisher et al. (2016), adapted for the output structure of the provided pca function. The pca function returns an object containing:

  • v: Loadings (coefficients, p x k) - equivalent to V in SVD Y = U D V'. Note the transpose difference from prcomp.

  • s: Scores (n x k) - calculated as U %*% D.

  • sdev: Singular values (vector of length k) - equivalent to d.

  • u: Left singular vectors (n x k).

The bootstrap algorithm works by resampling the subjects (rows) and recomputing the SVD on a low-dimensional representation. Specifically, it computes the SVD of the resampled matrix ⁠D U' P^b⁠, where ⁠Y = U D V'⁠ is the SVD of the original (pre-processed) data, and P^b is a resampling matrix operating on the subjects (columns of U').

The SVD of the resampled low-dimensional matrix is ⁠svd(D U' P^b) = A^b S^b (R^b)'⁠. The bootstrap principal components (loadings) are then calculated as ⁠V^b = V A^b⁠, and the bootstrap scores are ⁠Scores^b = R^b S^b⁠.

Z-scores are provided as mean / sd.

Important Note: The algorithm assumes the data Y used for the original SVD (⁠Y = U D V'⁠) was appropriately centered (or pre-processed according to x$preproc). The bootstrap samples are generated based on the components derived from this pre-processed data.

Value

A list object of class bootstrap_pca_result containing:

E_Vb

Matrix (p x k) of the estimated bootstrap means of the principal components (loadings V^b = coefficients).

sd_Vb

Matrix (p x k) of the estimated bootstrap standard deviations of the principal components (loadings V^b).

z_loadings

Matrix (p x k) of the bootstrap Z-scores for the loadings, calculated as E_Vb / sd_Vb.

E_Scores

Matrix (n x k) of the estimated bootstrap means of the principal component scores (S^b).

sd_Scores

Matrix (n x k) of the estimated bootstrap standard deviations of the principal component scores (S^b).

z_scores

Matrix (n x k) of the bootstrap Z-scores for the scores, calculated as E_Scores / sd_Scores.

E_Ab

Matrix (k x k) of the estimated bootstrap means of the internal rotation matrices A^b.

Ab_array

Array (k x k x nboot) containing all the bootstrap rotation matrices A^b.

Scores_array

Array (n x k x nboot) containing all the bootstrap score matrices (S^b, with NAs for non-sampled subjects).

nboot

The number of bootstrap samples used (successful ones).

k

The number of components bootstrapped.

call

The matched call to the function.

References

Fisher, Aaron, Brian Caffo, Brian Schwartz, and Vadim Zipunnikov. 2016. "Fast, Exact Bootstrap Principal Component Analysis for P > 1 Million." Journal of the American Statistical Association 111 (514): 846–60. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/01621459.2015.1062383")}.

Examples

# --- Assuming the pca and svd_wrapper functions provided are loaded ---
# Also assuming helper functions like center(), pass(), prep(), init_transform()
# are defined and available. Create simplified mocks if needed:
center <- function() structure(list(scale=FALSE, center=TRUE), class=c("center", "prepper"))
pass <- function() structure(list(scale=FALSE, center=FALSE), class=c("pass", "prepper"))
prep <- function(preproc) {
   xfun <- function(X) scale(X, center = preproc$center, scale = preproc$scale)
   # Add proper reverse logic if needed for projection
   return(structure(list(transform = xfun, reverse = function(X) X), class="prep"))
}
init_transform <- function(proc, X) proc$transform(X)
bi_projector <- function(v, s, sdev, u, preproc, classes, method) {
    structure(list(v=v, s=s, sdev=sdev, u=u, preproc=preproc, method=method),
              class=c(classes, "bi_projector", "projector"))
}
coefficients.projector <- function(x,...) x$v
scores.projector <- function(x,...) x$s
sdev.projector <- function(x,...) x$sdev
# --- End Mock Helpers ---

# Simulate data (p=50, n=20)
set.seed(123)
p_dim <- 50
n_obs <- 20
Y_mat <- matrix(rnorm(p_dim * n_obs), nrow = p_dim, ncol = n_obs)
# Transpose for pca function input (n x p)
X_mat <- t(Y_mat)

# Perform PCA using the provided pca function
# Use center() pre-processing
pca_res <- pca(X_mat, ncomp = 5, preproc = center(), method = "fast")

# Run bootstrap on the pca result
boot_res <- bootstrap_pca(pca_res, nboot = 50, k = 5, seed = 456)

# Explore results
print(dim(boot_res$z_loadings)) # p x k Z-scores for loadings (coefficients)
print(dim(boot_res$z_scores))   # n x k Z-scores for scores

# Plot std dev for first loading vector (V)
plot(boot_res$sd_Vb[, 1], ylab = "Bootstrap SD", main = "Loading Vector 1 Variability")

# Plot std dev for first score vector (S)
plot(boot_res$sd_Scores[, 1], ylab = "Bootstrap SD", main = "Score Vector 1 Variability")

## Not run: 
# Example with parallel processing (requires future & future.apply)
# library(future)
# library(future.apply)
# plan(multisession) # Set up parallel backend
boot_res_parallel <- bootstrap_pca(pca_res, nboot = 100, k = 5, seed = 456,
                                 parallel = TRUE)

## End(Not run)

bbuchsbaum/multivarious documentation built on July 16, 2025, 11:04 p.m.