R/root_pcp.R

#' Square root principal component pursuit (convex PCP)
#'
#' @description
#' `root_pcp()` implements the convex PCP algorithm "Square root principal
#' component pursuit" as described in
#' [Zhang et al. (2021)](https://proceedings.neurips.cc/paper/2021/hash/f65854da4622c1f1ad4ffeb361d7703c-Abstract.html),
#' outfitted with environmental health (EH)-specific extensions as described
#' in Gibson et al. (2022).
#'
#' Given an observed data matrix `D`, and regularization parameters `lambda` and
#' `mu`, `root_pcp()` aims to find the best low-rank and sparse estimates `L`
#' and `S`. The `L` matrix encodes latent patterns that govern the observed
#' data. The `S` matrix captures any extreme events in the data unexplained by
#' the underlying patterns in `L`.
#'
#' Being convex, `root_pcp()` determines the rank `r`, or number of latent
#' patterns in the data, autonomously during its optimization. As such, the
#' user does not need to specify the desired rank `r` of the output `L` matrix
#' as in the non-convex PCP model [rrmc()].
#'
#' Experimentally, the `root_pcp()` approach to PCP modeling is best suited to
#' datasets governed by well-defined underlying patterns, characterized by
#' quickly decaying singular values. This is typical
#' of imaging and video data, but uncommon for EH data. For observed data with a
#' complex low rank structure (slowly decaying singular values), like EH data,
#' [rrmc()] may offer a better model estimate.
#'
#' Three EH-specific extensions are currently supported by `root_pcp()`:
#' 1. The model can handle missing values in the input data matrix `D`;
#' 2. The model can also handle measurements that fall below the limit of
#'    detection (LOD), if provided `LOD` information by the user; and
#' 3. The model is also equipped with an optional non-negativity constraint
#'    on the low-rank `L` matrix, ensuring that all output values in `L` are
#'    \eqn{\geq 0}.
#'
#' @section The objective function:
#' `root_pcp()` optimizes the following objective function:
#' \deqn{\min_{L, S} ||L||_* + \lambda ||S||_1 + \mu ||L + S - D||_F}
#' The first term is the nuclear norm of the `L` matrix, incentivizing `L` to be
#' low-rank. The second term is the \eqn{\ell_1} norm of the `S` matrix,
#' encouraging `S` to be sparse. The third term is the Frobenius norm
#' applied to the model's noise, ensuring that the estimated low-rank and sparse
#' models `L` and `S` together have high fidelity to the observed data `D`.
#' The objective is neither smooth nor differentiable; it is, however, convex
#' and separable. As such, it is optimized using the Alternating Direction
#' Method of Multipliers (ADMM) algorithm (Boyd et al. 2011; Gao et al. 2020).
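#'
#' As a point of reference, the objective (ignoring the missingness and LOD
#' extensions described below) can be evaluated directly in R; a minimal
#' sketch, where `pcp_objective()` is a hypothetical helper written only for
#' illustration:
#' ```r
#' pcp_objective <- function(D, L, S, lambda, mu) {
#'   sum(svd(L)$d) +                      # nuclear norm of L
#'     lambda * sum(abs(S)) +             # l1 norm of S
#'     mu * norm(L + S - D, type = "F")   # Frobenius norm of the noise
#' }
#' ```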
#'
#' @section The `lambda` and `mu` parameters:
#' * `lambda` controls the sparsity of `root_pcp()`'s output `S` matrix;
#'   larger values of `lambda` penalize non-zero entries in `S` more
#'   stringently, driving the recovery of sparser `S` matrices. Therefore,
#'   if you a priori expect few outlying events in your model, you might
#'   expect a grid search to recover relatively larger `lambda` values, and
#'   vice-versa.
#' * `mu` adjusts `root_pcp()`'s sensitivity to noise; larger values of `mu`
#'   penalize errors between the predicted model and the observed data (i.e.
#'   noise), more severely. Environmental data subject to higher noise levels
#'   therefore require a `root_pcp()` model equipped with smaller `mu` values
#'   (since higher noise means a greater discrepancy between the observed
#'   mixture and the true underlying low-rank and sparse model). In virtually
#'   noise-free settings (e.g. simulations), larger values of `mu` would be
#'   appropriate.
#'
#' The default values of `lambda` and `mu` offer _theoretical_ guarantees
#' of optimal estimation performance, and stable recovery of `L` and `S`. By
#' "stable", we mean `root_pcp()`'s reconstruction error is, in the worst case,
#' proportional to the magnitude of the noise corrupting the observed data
#' (\eqn{||Z||_F}), often outperforming this upper bound.
#' Candès et al. (2011) obtained the guarantee for `lambda`, while
#' [Zhang et al. (2021)](https://proceedings.neurips.cc/paper/2021/hash/f65854da4622c1f1ad4ffeb361d7703c-Abstract.html)
#' obtained the result for `mu`.
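#'
#' For intuition, the defaults applied when `lambda` and `mu` are left `NULL`
#' reduce to simple functions of the data's dimensions (see also
#' [get_pcp_defaults()]); a sketch mirroring the calculation in the source
#' below:
#' ```r
#' D <- sim_data(r = 2, sigma = 0.1)$D
#' n <- nrow(D)
#' p <- ncol(D)
#' lambda_default <- 1 / sqrt(max(n, p))
#' mu_default <- sqrt(min(n, p) / 2)
#' ```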
#'
#' @section Environmental health specific extensions:
#' We refer interested readers to
#' Gibson et al. (2022) for the complete details regarding the EH-specific
#' extensions.
#'
#' **Missing value functionality:** PCP assumes that the same data generating
#' mechanisms govern both the missing and the observed entries in `D`. Because
#' PCP primarily seeks accurate estimation of _patterns_ rather than
#' individual _observations_, this assumption is reasonable, though it may not
#' be justified in some edge cases. Missing values in `D` are therefore
#' reconstructed in the recovered low-rank `L` matrix according to the
#' underlying patterns in `L`. There are three corollaries to keep in mind
#' regarding the quality of recovered missing observations:
#' 1. Recovery of missing entries in `D` relies on accurate estimation of
#'    `L`;
#' 2. The fewer observations there are in `D`, the harder it is to accurately
#'    reconstruct `L` (therefore estimation of _both_ unobserved _and_ observed
#'    measurements in `L` degrades); and
#' 3. Greater proportions of missingness in `D` artificially drive up the
#'    sparsity of the estimated `S` matrix. This is because it is not possible
#'    to recover a sparse event in `S` when the corresponding entry in `D` is
#'    unobserved. By definition, sparse events in `S` cannot be explained by
#'    the consistent patterns in `L`. Practically, if 20% of the entries in `D`
#'    are missing, then at least 20% of the entries in `S` will be 0 (see the
#'    sketch below).
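#'
#' A brief sketch of the last corollary, using [sim_data()] and a hypothetical
#' 20% missingness mask for illustration (the `lambda` and `mu` values are
#' borrowed from the example below):
#' ```r
#' data <- sim_data(r = 2, sigma = 0.1)
#' D_miss <- data$D
#' D_miss[sample(length(D_miss), size = round(0.2 * length(D_miss)))] <- NA
#' fit <- root_pcp(D_miss, lambda = 0.225, mu = 3.04)
#' mean(fit$S == 0)  # expected to be at least ~0.2 per the note above
#' ```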
#'
#' **Handling measurements below the limit of detection:** When equipped with
#' LOD information, PCP treats any estimations of values known to be below the
#' LOD as equally valid if their approximations fall between 0 and the LOD. Over
#' the course of optimization, observations below the LOD are pushed into this
#' known range \eqn{[0, LOD]} using penalties from above and below: should a
#' \eqn{< LOD} estimate be \eqn{< 0}, it is stringently penalized, since
#' measured observations cannot be negative. On the other hand, if a \eqn{< LOD}
#' estimate is \eqn{>} the LOD, it is also heavily penalized: less so than when
#' \eqn{< 0}, but more so than observations known to be above the LOD, because
#' we have prior information that these observations must be below LOD.
#' Observations known to be above the LOD are penalized as usual, using the
#' Frobenius norm in the above objective function.
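#'
#' A minimal sketch of the per-entry penalty applied to a single measurement
#' known to be below the LOD, mirroring the internal `loss_lod()` helper at the
#' bottom of this file (`below_lod_penalty()` is a hypothetical standalone
#' function written only for illustration):
#' ```r
#' below_lod_penalty <- function(x, lod) {
#'   if (x < 0) {
#'     x^2 / 2            # estimate is negative: penalized from below
#'   } else if (x > lod) {
#'     (x - lod)^2 / 2    # estimate exceeds the LOD: penalized from above
#'   } else {
#'     0                  # estimate within [0, LOD]: no penalty
#'   }
#' }
#' ```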
#'
#' Gibson et al. (2022) demonstrate that
#' in experimental settings with up to 50% of the data corrupted below the LOD,
#' PCP with the LOD extension boasts superior accuracy of recovered `L` models
#' compared to PCA coupled with \eqn{LOD / \sqrt{2}} imputation. PCP even
#' outperforms PCA in low-noise scenarios with as much as 75% of the data
#' corrupted below the LOD. The few situations in which PCA bettered PCP were
#' those pathological cases in which `D` was characterized by extreme noise and
#' huge proportions (i.e., 75%) of observations falling below the LOD.
#'
#' **The non-negativity constraint on `L`:** To enhance interpretability of
#' PCP-rendered solutions, there is an optional non-negativity constraint
#' that can be imposed on the `L` matrix to ensure all estimated values
#' within it are \eqn{\geq 0}. This prevents researchers from having to deal
#' with negative observation values and questions surrounding their meaning
#' and utility. Non-negative `L` models also allow for seamless use of methods
#' such as non-negative matrix factorization to extract non-negative patterns.
#' The non-negativity constraint is incorporated in the ADMM splitting technique
#' via the introduction of an additional optimization variable and corresponding
#' constraint.
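#'
#' Toggling the constraint is a one-argument change; a brief usage sketch
#' (with `lambda` and `mu` values borrowed from the example below):
#' ```r
#' data <- sim_data(r = 2, sigma = 0.1)
#' fit_nn <- root_pcp(data$D, lambda = 0.225, mu = 3.04, non_negative = TRUE)
#' min(fit_nn$L) >= 0  # TRUE: every entry of L is constrained to be >= 0
#' fit_free <- root_pcp(data$D, lambda = 0.225, mu = 3.04, non_negative = FALSE)
#' min(fit_free$L)     # may be negative without the constraint
#' ```
#'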
#' @inheritParams rrmc
#' @param lambda,mu (Optional) A pair of doubles each in the range `[0, Inf)`
#'   regularizing `S` and `L`. `lambda` controls the sparsity of the output
#'   `S` matrix; larger values penalize non-zero entries in `S` more
#'   stringently, driving the recovery of sparser `S` matrices. `mu` adjusts the
#'   model's sensitivity to noise; larger values will penalize errors between
#'   the predicted model and the observed data more severely. It is highly
#'   recommended the user tunes both of these parameters using
#'   [grid_search_cv()] for each unique data matrix `D`. By default, both
#'   `lambda` and `mu` are `NULL`, in which case the theoretically optimal
#'   values are used, calculated according to [get_pcp_defaults()].
#' @param non_negative (Optional) A logical indicating whether or not the
#'   non-negativity constraint should be used to constrain the output `L`
#'   matrix to have all entries \eqn{\geq 0}. By default, `non_negative = TRUE`.
#' @param max_iter (Optional) An integer specifying the maximum number of
#'   iterations to allow PCP before giving up on meeting PCP's convergence
#'   criteria. By default, `max_iter = 10000`, suitable for most problems.
#' @param verbose (Optional) A logical indicating whether or not to print
#'   information in real time over the course of PCP's optimization. By
#'   default, `verbose = FALSE`.
#'
#' @returns A list containing:
#'   * `L`: The rank-`r` low-rank matrix encoding the `r`-many latent patterns
#'     governing the observed input data matrix `D`. `dim(L)` will be the same
#'     as `dim(D)`. To explicitly obtain the underlying patterns, `L` can be
#'     used as the input to any matrix factorization technique of choice, e.g.
#'     PCA, factor analysis, or non-negative matrix factorization (see the
#'     sketch below).
#'   * `S`: The sparse matrix containing the rare outlying or extreme
#'     observations in `D` that are not explained by the underlying patterns in
#'     the corresponding `L` matrix. `dim(S)` will be the same as `dim(D)`.
#'     Most entries in `S` are `0`, while non-zero entries identify the extreme
#'     outlying observations in `D`.
#'   * `num_iter`: The number of iterations taken. If
#'     `num_iter == max_iter`, then `root_pcp()` most likely did not converge
#'     (see `converged` below).
#'   * `objective`: A vector containing the values of `root_pcp()`'s objective
#'     function over the course of optimization.
#'   * `converged`: A logical indicating whether the convergence criteria were
#'     met before `max_iter` was reached.
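#'
#'   For example, one possible workflow for recovering the latent patterns from
#'   the returned `L` uses the singular value decomposition as the
#'   factorization of choice (a sketch, with the rank threshold borrowed from
#'   the example below):
#'   ```r
#'   fit <- root_pcp(sim_data(r = 2, sigma = 0.1)$D, lambda = 0.225, mu = 3.04)
#'   r_hat <- matrix_rank(fit$L, 5e-2)
#'   patterns <- svd(fit$L)$v[, 1:r_hat, drop = FALSE]  # pattern loadings
#'   ```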
#'
#' @seealso [rrmc()]
#' @examples
#' #### -------Simple simulated PCP problem-------####
#' # First we will simulate a simple dataset with the sim_data() function.
#' # The dataset will be a 100x10 matrix comprised of:
#' # 1. A rank-2 component as the ground truth L matrix;
#' # 2. A ground truth sparse component S w/outliers along the diagonal; and
#' # 3. A dense Gaussian noise component
#' data <- sim_data(r = 2, sigma = 0.1)
#' # Best practice is to conduct a grid search with grid_search_cv() function,
#' # but we skip that here for brevity.
#' pcp_model <- root_pcp(data$D, lambda = 0.225, mu = 3.04)
#' data.frame(
#'   "Estimated_L_rank" = matrix_rank(pcp_model$L, 5e-2),
#'   "Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
#'   "PCA_error" = norm(data$L - proj_rank_r(data$D, r = 2), "F") / norm(data$L, "F"),
#'   "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
#'   "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
#' )
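#' # The returned diagnostics can be used to confirm the model converged and to
#' # inspect the optimization trajectory:
#' pcp_model$converged
#' pcp_model$num_iter
#' plot(pcp_model$objective[seq_len(pcp_model$num_iter)], type = "l")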
#' @references Zhang, Junhui, Jingkai Yan, and John Wright.
#'   "Square root principal component pursuit: tuning-free noisy robust matrix
#'   recovery." Advances in Neural Information Processing Systems 34 (2021):
#'   29464-29475. [available
#'   [here](https://proceedings.neurips.cc/paper/2021/hash/f65854da4622c1f1ad4ffeb361d7703c-Abstract.html)]
#' @references Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence
#'   Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff
#'   Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou.
#'   "Principal component pursuit for pattern identification in
#'   environmental mixtures." Environmental Health Perspectives 130, no.
#'   11 (2022): 117008.
#' @references Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan
#'   Eckstein. "Distributed optimization and statistical learning via the
#'   alternating direction method of multipliers." Foundations and Trends in
#'   Machine learning 3, no. 1 (2011): 1-122.
#' @references Gao, Wenbo, Donald Goldfarb, and Frank E. Curtis. "ADMM for
#'   multiaffine constrained optimization." Optimization Methods and Software
#'   35, no. 2 (2020): 257-303.
#' @references Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright.
#'   "Robust principal component analysis?." Journal of the ACM (JACM)
#'   58, no. 3 (2011): 1-37.
#' @export
root_pcp <- function(D, lambda = NULL, mu = NULL, LOD = -Inf, non_negative = TRUE, max_iter = 10000, verbose = FALSE) {
  # 1. Initialize variables, argument assertions:
  # 1a. Matrix dimensions:
  checkmate::assert_matrix(D)
  n <- nrow(D)
  p <- ncol(D)
  if (is.null(lambda)) lambda <- 1 / sqrt(max(n, p))
  if (is.null(mu)) mu <- sqrt(min(n, p) / 2)
  checkmate::qassert(lambda, rules = "N1[0,)")
  checkmate::qassert(mu, rules = "N1[0,)")
  checkmate::qassert(LOD, rules = "n+")
  if (checkmate::test_numeric(LOD, min.len = 2) && !checkmate::test_matrix(LOD)) {
    checkmate::assert_true(ncol(D) == length(LOD))
  }
  if (checkmate::test_matrix(LOD)) {
    checkmate::assert_true(all(dim(D) == dim(LOD)))
  }
  checkmate::qassert(non_negative, rules = "B1")
  checkmate::qassert(max_iter, rules = "X1[1,)")
  checkmate::qassert(verbose, rules = "B1")
  # 1b. Process LOD input (convert LOD vector to matrix if needed, so it multiplies correctly):
  if (is.vector(LOD) && length(LOD) != 1) LOD <- matrix(LOD, nrow = n, ncol = p, byrow = TRUE)
  # 1c. Omega = mask of observed values, for handling missingness later:
  Omega <- !is.na(D)
  D[!Omega] <- 0
  # 1d. mask_above/below_LOD = entries that are observed and above/below LOD:
  mask_above_LOD <- Omega & (D >= LOD)
  mask_below_LOD <- Omega & (D < LOD)
  # 1e. Hard-coded optimization parameters:
  eps_abs <- 1e-06 # Epsilon absolute, for stopping criteria
  eps_rel <- 1e-06 # Epsilon relative, for stopping criteria
  rho <- 0.1 # Augmented Lagrangian coefficient (rate)
  # 1f. L matrix primal variables (L3 carries non-negativity constraint in primal):
  L1 <- matrix(0, nrow = n, ncol = p)
  L2 <- matrix(0, nrow = n, ncol = p)
  L3 <- matrix(0, nrow = n, ncol = p)
  # 1g. S matrix primal variables:
  S1 <- matrix(0, nrow = n, ncol = p)
  S2 <- matrix(0, nrow = n, ncol = p)
  # 1h. Noise/error matrix primal variable:
  Z <- matrix(0, nrow = n, ncol = p)
  # 1i. Dual variables (noise matrix; Y4 carries non-negativity constraint in dual):
  Y1 <- matrix(0, nrow = n, ncol = p)
  Y2 <- matrix(0, nrow = n, ncol = p)
  Y3 <- matrix(0, nrow = n, ncol = p)
  Y4 <- matrix(0, nrow = n, ncol = p)
  # 1j. Flag identifying convergence; vector of objective values for user
  converged <- FALSE
  objective <- vector("numeric", length = max_iter)
  # 2. ADMM-splitting iterations:
  for (i in 1:max_iter) {
    # 2a. Update 1st primal variables (L1, S1):
    nuc <- prox_nuclear((L2 + L3 - (Y1 + Y4) / rho) / 2, c = 1 / 2 / rho)
    L1 <- nuc[[1]]
    L1_nuclear_norm <- nuc[[2]]
    S1 <- prox_l1(S2 - Y2 / rho, c = lambda / rho)
    # 2b. Update 2nd primal variables (L2, L3, S2):
    L2_old <- L2
    S2_old <- S2
    L3_old <- L3
    temp <- L2 + S2 - Y3 / rho
    temp_D <- D * mask_above_LOD +
      temp * (mask_below_LOD & (temp >= 0) & (temp <= LOD)) +
      ifelse(mask_below_LOD & (temp > LOD), yes = LOD, no = 0)
    Z <- prox_frobenius(temp - temp_D, c = mu / rho) + temp_D
    if (non_negative) L3 <- pmax(L1 + Y4 / rho, 0)
    L2_obs <- Omega / 3 * (2 * L1 - S1 + Z + (2 * Y1 - Y2 + Y3) / rho)
    L2_unobs <- (1 - Omega) * (L1 + Y1 / rho)
    L2 <- L2_obs + L2_unobs
    S2_obs <- Omega / 3 * (2 * S1 - L1 + Z + (2 * Y2 - Y1 + Y3) / rho)
    S2_unobs <- (1 - Omega) * (S1 + Y2 / rho)
    S2 <- S2_obs + S2_unobs
    # 2c. Update dual variables (Y1, Y2, Y3, Y4):
    Y1 <- Y1 + rho * (L1 - L2)
    Y2 <- Y2 + rho * (S1 - S2)
    Y3 <- Y3 + rho * Omega * (Z - (L2 + S2))
    if (non_negative) Y4 <- Y4 + rho * (L1 - L3)
    # 2d. Calculate primal & dual residuals:
    res_primal <- sqrt(
      norm(L1 - L2, type = "F")^2 +
        non_negative * norm(L1 - L3, type = "F")^2 +
        norm(S1 - S2, type = "F")^2 +
        norm(Omega * (Z - L2 - S2), type = "F")^2
    )
    res_dual <- rho * sqrt(
      norm(L2 + L3 - L2_old - L3_old, type = "F")^2 +
        norm(S2 - S2_old, type = "F")^2 +
        norm(Omega * (L2 - L2_old + S2 - S2_old), type = "F")^2
    )
    # 2e. Update rho:
    if (res_primal > 10 * res_dual) {
      rho <- 2 * rho
    } else if (res_dual > 10 * res_primal) {
      rho <- rho / 2
    }
    # 2f. Calculate objective:
    objective[i] <- L1_nuclear_norm + lambda * sum(abs(S1)) + mu * sqrt(loss_lod(D, L2 + S2, LOD))
    if (verbose) cat(paste0("Iteration: ", i, " Obj: ", round(objective[i], 5)))
    # 2g. Update & check stopping criteria:
    thresh_primal <- eps_abs * sqrt((5 + non_negative) * n * p) +
      eps_rel * max(
        sqrt((1 + non_negative) * norm(L1, type = "F")^2 + norm(S1, type = "F")^2 + norm(Z, type = "F")^2),
        sqrt(norm(L2, type = "F")^2 + norm(L3, type = "F")^2 + norm(S2, type = "F")^2 + norm(Omega * (L2 + S2), type = "F")^2)
      )
    thresh_dual <- eps_abs * sqrt((3 + non_negative) * n * p) +
      eps_rel * sqrt(norm(Y1 + Y4, type = "F")^2 + norm(Y2, type = "F")^2 + norm(Y3, type = "F")^2)
    if (res_primal < thresh_primal && res_dual < thresh_dual) {
      converged <- TRUE
      if (verbose) cat(paste0("Converged in ", i, " iterations."))
      break
    }
  }
  # 3. Wrap up & return:
  if (!converged && verbose) warning(paste0("Maximum iterations reached (", max_iter, "). PCP did not converge."))
  L_final <- (L1 + L2 + L3) / (2 + non_negative)
  S_final <- (S1 + S2) / 2
  if (non_negative) L_final[L_final < 0] <- 0
  list(L = L_final, S = S_final, num_iter = i, objective = objective, converged = converged)
}

#' Loss function for PCP's limit of detection penalty
#'
#' `loss_lod()` computes the loss on the squared error term `L + S - D`,
#' incorporating the LOD-specific penalty terms.
#'
#' @param D The original data matrix.
#' @param X The predicted value of `L + S` at the current iteration.
#' @param LOD The LOD as a matrix, where `dim(LOD) == dim(D)`.
#'
#' @returns Scalar value used to calculate loss in objective function.
#'
#' @seealso [root_pcp()]
#' @keywords internal
#' @noRd
loss_lod <- function(D, X, LOD) {
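  # Per-entry penalties: entries with D >= 0 incur the usual squared error
  # (X - D); entries flagged as below the LOD (D < 0) are penalized only when
  # the estimate X falls outside [0, LOD], by (X - LOD) when X > LOD and by X
  # when X < 0.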
  X_lod <- (X - D) * (D >= 0) +
    (X - LOD) * ((D < 0) & (X > LOD)) +
    X * ((D < 0) & (X < 0))
  sum(X_lod^2) / 2
}