knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The R package pcpr
implements Principal Component Pursuit (PCP), a robust
dimensionality reduction technique, for pattern recognition tailored to
environmental health data. The statistical methodology and computational details
are provided in Gibson et al. (2022).
You can install the latest official CRAN release of pcpr
with:
install.packages("pcpr")
The development version of pcpr
can be installed from GitHub with:
# install.packages("pak") pak::pak("Columbia-PRIME/pcpr")
pcpr
can then be loaded and attached in your current R session as usual with
library(pcpr)
Extensive documentation is available on our pkgdown website and offline within R.
You can see the pcpr
reference manual in R with:
help("pcpr")
A number of vignettes are available from within R. They can be browsed using:
browseVignettes("pcpr")
We recommend reading the vignettes in the following order:
vignette("theory-crash-course")
vignette("pcp-quickstart")
vignette("pcp-applied")
Have a bug to report or question to ask? Open an issue on our GitHub.
PCP algorithms model an observed exposure matrix $D$ as the sum of three underlying ground-truth matrices:
a low-rank matrix $L_0$ encoding consistent patterns of exposure, a sparse matrix $S_0$ isolating unique or outlying exposure events (that cannot be explained by the consistent exposure patterns), and dense noise $Z_0$.
The models in pcpr
seek to decompose an observed data matrix D
into estimated
low-rank and sparse components L
and S
for use in downstream environmental health
analyses. The functions in pcpr
are outfitted with three environmental health
(EH)-specific extensions making pcpr
particularly powerful for EH research:
L
matrixThe methods in pcpr
have already been applied in many environmental health studies. Several are listed below:
Please cite use of pcpr
with:
Chillrud L, Benavides J, Gibson E, Zhang J, Yan J, Wright J, Goldsmith J, Kioumourtzoglou M (2025). pcpr: Principal Component Pursuit for Environmental Epidemiology. R package version 1.0.0, https://columbia-prime.github.io/pcpr/, https://github.com/Columbia-PRIME/pcpr.
@Manual{, title = {pcpr: Principal Component Pursuit for Environmental Epidemiology}, author = {Lawrence G. Chillrud and Jaime Benavides and Elizabeth A. Gibson and Junhui Zhang and Jingkai Yan and John N. Wright and Jeff Goldsmith and Marianthi-Anna Kioumourtzoglou}, year = {2025}, note = {R package version 1.0.0, https://github.com/Columbia-PRIME/pcpr}, url = {https://columbia-prime.github.io/pcpr/}, }
Please also cite Gibson et al. (2022).
This work was supported by NIEHS PRIME R01 ES028805.
Special thanks to Sophie Calhoun for designing pcpr
's logo!
# In the below example, we simulate a simple mixtures model and run PCP, # comparing it's performance to that of PCA. For an in depth example with # simulated data, see vignette("pcp-quickstart"). For more realistic # PCP usage, check out vignette("pcp-applied"). # Simulate an environmental mixture data <- sim_data( n = 100, p = 10, r = 3, sparse_nonzero_idxs = seq(1, 1000, 101), sigma = 0.05 ) D <- data$D # Observed matrix L_0 <- data$L # Ground truth low-rank matrix S_0 <- data$S # Ground truth sparse matrix Z_0 <- data$Z # Ground truth noise matrix # Simulate a limit of detection for each chemical in mixture lod_info <- sim_lod(D, q = 0.1) D_lod <- lod_info$D_tilde lod <- lod_info$lod # Simulate missing observations corrupted_data <- sim_na(D_lod, perc = 0.05) D_tilde <- corrupted_data$D_tilde # Finish simulating LOD by imputing values < LOD with LOD/sqrt(2) lod_root2 <- matrix( lod / sqrt(2), nrow = nrow(D_tilde), ncol = ncol(D_tilde), byrow = TRUE ) lod_idxs <- which(lod_info$tilde_mask == 1) D_tilde[lod_idxs] <- lod_root2[lod_idxs] # Run grid search to obtain optimal r, eta parameters # (Not shown here to save space, see vignette("pcp-quickstart") # for full example which obtains r = 3, eta = 0.224) r_star <- 3 eta_star <- 0.224 # Run non-convex PCP to estimate L, S from D_tilde pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod) # Clean up sparse matrix pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4) # Benchmark with PCA's attempt at recovering L D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE)) L_pca <- proj_rank_r(D_imputed, r = r_star) # Evaluate PCP ground truth data.frame( "Obs_rel_err" = norm(L_0 - D_imputed, "F") / norm(L_0, "F"), "PCA_L_rel_err" = norm(L_0 - L_pca, "F") / norm(L_0, "F"), "PCP_L_rel_err" = norm(L_0 - pcp_model$L, "F") / norm(L_0, "F"), "PCP_S_rel_err" = norm(S_0 - pcp_model$S, "F") / norm(S_0, "F"), "PCP_L_rank" = matrix_rank(pcp_model$L), "PCP_S_sparsity" = sparsity(pcp_model$S) )
Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Principal component pursuit for pattern identification in environmental mixtures." Environmental Health Perspectives 130, no. 11 (2022): 117008.
Tao, Rachel H., Lawrence G. Chillrud, Yanelli Nunez, Sebastian T. Rowland, Amelia K. Boehme, Jingkai Yan, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Applying principal component pursuit to investigate the association between source-specific fine particulate matter and myocardial infarction hospitalizations in New York City." Environmental Epidemiology 7 (2), (2023).
Wu, Haotian, Vrinda Kalia, Katherine E. Manz, Lawrence Chillrud, Nathalie Hoffmann Dishon, Gabriela L. Jackson, Christian K. Dye, Raoul Orvieto, Adva Aizer, Hagai Levine, Marianthi-Anna Kioumourtzoglou, Kurt D. Pennell, Andrea A. Baccarelli, and Ronit Machtinger. "Exposome Profiling of Environmental Pollutants in Seminal Plasma and Novel Associations with Semen Parameters." Environmental Science & Technology, 58 (31), (2024): 13594-13604.
Benavides, Jaime, Sabah Usmani, Vijay Kumar, and Marianthi-Anna Kioumourtzoglou. "Development of a community severance index for urban areas in the United States: A case study in New York City." Environment International, 185, (2024): 108526.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.