View source: R/grid_search_cv.R

grid_search_cv (R Documentation)
grid_search_cv() conducts a Monte Carlo style cross-validated grid search of PCP parameters for a given data matrix D, PCP function pcp_fn, and grid of parameter settings to search through grid. The run time of the grid search can be sped up using bespoke parallelization settings. The call to grid_search_cv() can be wrapped in a call to progressr::with_progress() for progress bar updates. See the sections below for details.

Usage
grid_search_cv(
D,
pcp_fn,
grid,
...,
parallel_strategy = "sequential",
num_workers = 1,
perc_test = 0.05,
num_runs = 100,
return_all_tests = FALSE,
verbose = TRUE
)
Arguments

D: The input data matrix (can contain NA values).

pcp_fn: The PCP function to use when grid searching. Must be either rrmc or root_pcp.

grid: A data.frame containing the grid of parameter settings to search through, one row per unique setting.

...: Any parameters required by pcp_fn that should be held constant throughout the grid search (see constant_params in the Value section).

parallel_strategy: (Optional) The parallelization strategy used when conducting the grid search (to be passed on to the future::plan() function). By default, parallel_strategy = "sequential".

num_workers: (Optional) An integer specifying the number of workers to use when parallelizing the grid search, to be passed on to future::plan(). By default, num_workers = 1.

perc_test: (Optional) The fraction of entries of D that will be randomly corrupted as NA missing values to form each run's test set. By default, perc_test = 0.05.

num_runs: (Optional) The number of times to test a given parameter setting. By default, num_runs = 100.

return_all_tests: (Optional) A logical indicating if you would like the output from all the calls made to pcp_fn throughout the grid search returned in list format. By default, return_all_tests = FALSE.

verbose: (Optional) A logical indicating if you would like verbose output displayed or not. By default, verbose = TRUE.
Value

A list containing:

all_stats: A data.frame containing the statistics of every run comprising the grid search. These statistics include the parameter settings for the run, along with the run_num (used as the seed for the corruption step, step 1 in the grid search procedure), the relative error for the run rel_err, the rank of the recovered L matrix L_rank, the sparsity of the recovered S matrix S_sparsity, the number of iterations PCP took to reach convergence (for root_pcp() only), and the error status run_error of the PCP run (NA if no error, otherwise a character string).

summary_stats: A data.frame containing a summary of the information in all_stats, made by column-wise averaging the results in all_stats.

metadata: A character string containing the metadata associated with the grid search instance.

If return_all_tests = TRUE, then the following are also returned as part of the list:

L_mats: A list containing all the L matrices returned from PCP throughout the grid search, so length(L_mats) == nrow(all_stats). Row i in all_stats corresponds to L_mats[[i]].

S_mats: A list containing all the S matrices returned from PCP throughout the grid search, so length(S_mats) == nrow(all_stats). Row i in all_stats corresponds to S_mats[[i]].

test_mats: A list of length num_runs containing all the corrupted test matrices (and their masks) used throughout the grid search. Note: all_stats$run_num[i] corresponds to test_mats[[i]].

original_mat: The original data matrix D.

constant_params: A copy of the constant parameters that were originally passed to the grid search (for record keeping).
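For instance, here is a minimal sketch of pulling out the best run's recovered matrices when return_all_tests = TRUE (the element and column names follow the Value description above; the eta and r values are illustrative only):

# Sketch: retrieve the L and S matrices from the best-performing run
# (assumes return_all_tests = TRUE; element names as documented above)
gs <- grid_search_cv(
  sim_data()$D, rrmc, data.frame("eta" = 0.35),
  r = 3, num_runs = 2, return_all_tests = TRUE
)
i_best <- which.min(gs$all_stats$rel_err)  # row i pairs with L_mats[[i]]
L_best <- gs$L_mats[[i_best]]
S_best <- gs$S_mats[[i_best]]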
Details

Each hyperparameter setting is cross-validated by:

1. Randomly corrupting perc_test percent of the entries in D as missing (i.e. NA values), yielding D_tilde. Done via sim_na().

2. Running the PCP function pcp_fn on D_tilde, yielding estimates L and S.

3. Recording the relative recovery error of L compared with the input data matrix D for only those values that were imputed as missing during the corruption step (step 1 above). Mathematically, calculate ||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F, where P_{\Omega^c} selects only those entries where is.na(D_tilde) == TRUE.

4. Repeating steps 1-3 a total of num_runs times, where each "run" has a unique random seed from 1 to num_runs associated with it.

Performance statistics can then be calculated for each run, and then summarized across all runs for average model performance statistics. A minimal sketch of a single run is shown after this list.
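To make the procedure concrete, here is a minimal sketch of one run performed by hand (it assumes sim_na() takes a perc argument and returns the corrupted matrix in an element named D_tilde, and that rrmc() returns its estimates as L and S; consult those functions' documentation for the exact names):

# Sketch of one cross-validation run (steps 1-3 above); argument and
# element names for sim_na() / rrmc() are assumptions -- check their docs
library(pcpr)
D <- sim_data()$D
set.seed(1)                                # this run's run_num
corrupted <- sim_na(D, perc = 0.05)        # step 1: corrupt 5% of D as NA
D_tilde <- corrupted$D_tilde               # assumed element name
fit <- rrmc(D_tilde, r = 3, eta = 0.35)    # step 2: run the PCP function
mask <- is.na(D_tilde) & !is.na(D)         # entries corrupted in step 1
rel_err <- sqrt(sum((D - fit$L)[mask]^2)) / sqrt(sum(D[mask]^2))  # step 3
rel_err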
perc_test and num_runs
Experimentally, this grid search procedure retrieves the best performing PCP parameter settings when perc_test is relatively low, e.g. perc_test = 0.05 (5%), and num_runs is relatively high, e.g. num_runs = 100.

The larger perc_test is, the more the test set turns into a matrix completion problem, rather than the desired matrix decomposition problem. To better resemble the actual problem PCP will face at inference time, perc_test should therefore be kept relatively low.

Choosing a reasonable value for num_runs depends on the need to keep perc_test relatively low. Ideally, num_runs is large enough that many (if not all) of the entries in D are likely to eventually be tested. Note that since test set entries are chosen randomly for all runs 1 through num_runs, in the pathological worst case the same exact test set could be drawn each time. In the best case, a different test set is obtained each run, providing balanced coverage of D. Viewed another way, the smaller num_runs is, the more susceptible the results are to overfitting to the relatively few selected test sets. The sketch below gives a rough sense of the coverage one can expect.
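As a back-of-the-envelope check (an illustration, not part of the package): if test entries are drawn independently and uniformly on each run, the chance that a given entry of D is never held out is (1 - perc_test)^num_runs:

# Probability an entry of D is never tested across all runs,
# assuming independent uniform draws of the test set each run
perc_test <- 0.05
num_runs <- 100
(1 - perc_test)^num_runs  # ~0.006, so ~99.4% of entries are tested at least once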
Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics in summary_stats. Below are a few suggestions for how to interpret the summary_stats table (a short inspection sketch follows this list):

- Generally speaking, the first thing a user will want to inspect is the rel_err statistic, capturing the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lower rel_err means the PCP model was better able to recover the held-out test set, so, in general, the best parameter settings are those with the lowest rel_err. Having said this, it is important to take this statistic with a grain of salt: because no ground truth L matrix exists, the rel_err measurement is forced to rely on the comparison between the noisy observed data matrix D and the estimated low-rank model L. The rel_err metric is therefore an "apples to oranges" relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously low rel_err values (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).

- For grid searches using root_pcp() as the PCP model, parameter settings that fail to converge can be discarded. Generally, fewer root_pcp() iterations (num_iter) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increase root_pcp()'s max_iter argument to reach convergence. rrmc() does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.

- Parameter settings with unreasonable sparsity or rank measurements can also be discarded. Here, "unreasonable" means the reported metrics flagrantly contradict prior assumptions, knowledge, or work. For instance, most air pollution datasets contain a number of extreme exposure events, so PCP solutions returning S models with 100% sparsity have obviously been regularized too heavily; solutions with lower sparsities should be preferred. Note that the reported sparsity and rank measurements are estimates heavily dependent on the thresh set by the sparsity() and matrix_rank() functions. E.g. the actual average matrix rank may be much higher or lower when a threshold that better accounts for the relative scale of the singular values is used; likewise for the sparsity estimates. Also, recall that the given value for perc_test artificially sets a sparsity floor, since the missing entries in the test set cannot be recovered in the S matrix. E.g. if perc_test = 0.05, then no parameter setting will have an estimated sparsity lower than 5%.
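Putting these suggestions together, here is a hypothetical inspection sketch (it assumes the summary_stats columns named above, and that S_sparsity is reported on a 0-100 percent scale as in the discussion above):

# Sketch: rank candidate parameter settings by mean relative error, after
# dropping degenerate fits (e.g., fully sparse S estimates); the 0-100
# sparsity scale is an assumption -- adjust if values are fractions
ranked <- gs$summary_stats[order(gs$summary_stats$rel_err), ]
ranked <- subset(ranked, S_sparsity < 100)  # discard 100% sparse solutions
head(ranked)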
See also: sim_na(), sparsity(), matrix_rank(), get_pcp_defaults()

Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S with outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
data$D,
rrmc,
data.frame("eta" = 0.35),
r = 3,
num_runs = 2
)
gs$summary_stats
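# As noted in the description, the call can be parallelized and wrapped in
# progressr::with_progress() for a progress bar. The sketch below assumes
# the future and progressr packages are installed, and that
# parallel_strategy accepts future-style strategy names such as
# "multisession"; the grid values are illustrative only.
library(progressr)
with_progress({
  gs_par <- grid_search_cv(
    data$D,
    rrmc,
    data.frame("eta" = c(0.1, 0.35)),
    r = 3,
    parallel_strategy = "multisession",
    num_workers = 2,
    num_runs = 4
  )
})
gs_par$summary_stats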