View source: R/grid_search_cv.R

grid_search_cv (R Documentation)
grid_search_cv() conducts a Monte Carlo style cross-validated grid search of PCP parameters for a given data matrix D, PCP function pcp_fn, and grid of parameter settings to search through grid. The run time of the grid search can be sped up using bespoke parallelization settings. The call to grid_search_cv() can be wrapped in a call to progressr::with_progress() for progress bar updates. See the sections below for details.

Usage
grid_search_cv(
D,
pcp_fn,
grid,
...,
parallel_strategy = "sequential",
num_workers = 1,
perc_test = 0.05,
num_runs = 100,
return_all_tests = FALSE,
verbose = TRUE
)
Arguments

D: The input data matrix (can contain NA values).

pcp_fn: The PCP function to use when grid searching. Must be either rrmc or root_pcp.

grid: A data.frame containing the grid of parameter settings to search through, one row per unique setting.

...: Any parameters required by pcp_fn that should be held constant throughout the grid search (see constant_params in the Value section).

parallel_strategy: (Optional) The parallelization strategy used when conducting the grid search (to be passed on to the future::plan() function). By default, parallel_strategy = "sequential".

num_workers: (Optional) An integer specifying the number of workers to use when parallelizing the grid search, to be passed on to future::plan(). By default, num_workers = 1.

perc_test: (Optional) The fraction of entries of D that will be randomly corrupted as NA missing values to form each run's test set. By default, perc_test = 0.05.

num_runs: (Optional) The number of times to test a given parameter setting. By default, num_runs = 100.

return_all_tests: (Optional) A logical indicating if you would like the output from all the calls made to pcp_fn throughout the grid search returned in list format. By default, return_all_tests = FALSE.

verbose: (Optional) A logical indicating if you would like verbose output displayed or not. By default, verbose = TRUE.
Value

A list containing:

all_stats: A data.frame containing the statistics of every run comprising the grid search. These statistics include the parameter settings for the run, along with the run_num (used as the seed for the corruption step, step 1 in the grid search procedure), the relative error for the run rel_err, the rank of the recovered L matrix L_rank, the sparsity of the recovered S matrix S_sparsity, the number of iterations PCP took to reach convergence (for root_pcp() only), and the error status run_error of the PCP run (NA if no error, otherwise a character string).

summary_stats: A data.frame containing a summary of the information in all_stats, made by column-wise averaging the results in all_stats.

metadata: A character string containing the metadata associated with the grid search instance.

If return_all_tests = TRUE, then the following are also returned as part of the list:

L_mats: A list containing all the L matrices returned from PCP throughout the grid search, so length(L_mats) == nrow(all_stats). Row i in all_stats corresponds to L_mats[[i]].

S_mats: A list containing all the S matrices returned from PCP throughout the grid search, so length(S_mats) == nrow(all_stats). Row i in all_stats corresponds to S_mats[[i]].

test_mats: A list of length num_runs containing all the corrupted test matrices (and their masks) used throughout the grid search. Note: all_stats$run_num[i] corresponds to test_mats[[i]].

original_mat: The original data matrix D.

constant_params: A copy of the constant parameters that were originally passed to the grid search (for record keeping).
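For instance, here is a minimal sketch of pulling out the best run's recovered matrices when return_all_tests = TRUE (the element and column names follow the Value description above; the eta and r values are illustrative only):

# Sketch: retrieve the L and S matrices from the best-performing run
# (assumes return_all_tests = TRUE; element names as documented above)
gs <- grid_search_cv(
  sim_data()$D, rrmc, data.frame("eta" = 0.35),
  r = 3, num_runs = 2, return_all_tests = TRUE
)
i_best <- which.min(gs$all_stats$rel_err)  # row i pairs with L_mats[[i]]
L_best <- gs$L_mats[[i_best]]
S_best <- gs$S_mats[[i_best]]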
Details

Each hyperparameter setting is cross-validated by:

1. Randomly corrupting perc_test percent of the entries in D as missing (i.e. NA values), yielding D_tilde. Done via sim_na().

2. Running the PCP function pcp_fn on D_tilde, yielding estimates L and S.

3. Recording the relative recovery error of L compared with the input data matrix D for only those values that were imputed as missing during the corruption step (step 1 above). Mathematically, calculate ||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F, where P_{\Omega^c} selects only those entries where is.na(D_tilde) == TRUE.

4. Repeating steps 1-3 a total of num_runs times, where each "run" has a unique random seed from 1 to num_runs associated with it.

Performance statistics can then be calculated for each run, and then summarized across all runs for average model performance statistics. A minimal sketch of a single run is shown after this list.
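To make the procedure concrete, here is a minimal sketch of one run performed by hand (it assumes sim_na() takes a perc argument and returns the corrupted matrix in an element named D_tilde, and that rrmc() returns its estimates as L and S; consult those functions' documentation for the exact names):

# Sketch of one cross-validation run (steps 1-3 above); argument and
# element names for sim_na() / rrmc() are assumptions -- check their docs
library(pcpr)
D <- sim_data()$D
set.seed(1)                                # this run's run_num
corrupted <- sim_na(D, perc = 0.05)        # step 1: corrupt 5% of D as NA
D_tilde <- corrupted$D_tilde               # assumed element name
fit <- rrmc(D_tilde, r = 3, eta = 0.35)    # step 2: run the PCP function
mask <- is.na(D_tilde) & !is.na(D)         # entries corrupted in step 1
rel_err <- sqrt(sum((D - fit$L)[mask]^2)) / sqrt(sum(D[mask]^2))  # step 3
rel_err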
perc_test and num_runs
Experimentally, this grid search procedure retrieves the best performing PCP parameter settings when perc_test is relatively low, e.g. perc_test = 0.05 (5%), and num_runs is relatively high, e.g. num_runs = 100.

The larger perc_test is, the more the test set turns into a matrix completion problem, rather than the desired matrix decomposition problem. To better resemble the actual problem PCP will face at inference time, perc_test should therefore be kept relatively low.

Choosing a reasonable value for num_runs depends on the need to keep perc_test relatively low. Ideally, num_runs is large enough that many (if not all) of the entries in D are likely to eventually be tested. Note that since test set entries are chosen randomly for all runs 1 through num_runs, in the pathological worst case the same exact test set could be drawn each time. In the best case, a different test set is obtained each run, providing balanced coverage of D. Viewed another way, the smaller num_runs is, the more susceptible the results are to overfitting to the relatively few selected test sets. The sketch below gives a rough sense of the coverage one can expect.
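As a back-of-the-envelope check (an illustration, not part of the package): if test entries are drawn independently and uniformly on each run, the chance that a given entry of D is never held out is (1 - perc_test)^num_runs:

# Probability an entry of D is never tested across all runs,
# assuming independent uniform draws of the test set each run
perc_test <- 0.05
num_runs <- 100
(1 - perc_test)^num_runs  # ~0.006, so ~99.4% of entries are tested at least once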
Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics in summary_stats. Below are a few suggestions for how to interpret the summary_stats table (a short inspection sketch follows this list):

- Generally speaking, the first thing a user will want to inspect is the rel_err statistic, capturing the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lower rel_err means the PCP model was better able to recover the held-out test set, so, in general, the best parameter settings are those with the lowest rel_err. Having said this, it is important to take this statistic with a grain of salt: because no ground truth L matrix exists, the rel_err measurement is forced to rely on the comparison between the noisy observed data matrix D and the estimated low-rank model L. The rel_err metric is therefore an "apples to oranges" relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously low rel_err values (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).

- For grid searches using root_pcp() as the PCP model, parameter settings that fail to converge can be discarded. Generally, fewer root_pcp() iterations (num_iter) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increase root_pcp()'s max_iter argument to reach convergence. rrmc() does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.

- Parameter settings with unreasonable sparsity or rank measurements can also be discarded. Here, "unreasonable" means the reported metrics flagrantly contradict prior assumptions, knowledge, or work. For instance, most air pollution datasets contain a number of extreme exposure events, so PCP solutions returning S models with 100% sparsity have obviously been regularized too heavily; solutions with lower sparsities should be preferred. Note that the reported sparsity and rank measurements are estimates heavily dependent on the thresh set by the sparsity() and matrix_rank() functions. E.g. the actual average matrix rank may be much higher or lower when a threshold that better accounts for the relative scale of the singular values is used; likewise for the sparsity estimates. Also, recall that the given value for perc_test artificially sets a sparsity floor, since the missing entries in the test set cannot be recovered in the S matrix. E.g. if perc_test = 0.05, then no parameter setting will have an estimated sparsity lower than 5%.
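Putting these suggestions together, here is a hypothetical inspection sketch (it assumes the summary_stats columns named above, and that S_sparsity is reported on a 0-100 percent scale as in the discussion above):

# Sketch: rank candidate parameter settings by mean relative error, after
# dropping degenerate fits (e.g., fully sparse S estimates); the 0-100
# sparsity scale is an assumption -- adjust if values are fractions
ranked <- gs$summary_stats[order(gs$summary_stats$rel_err), ]
ranked <- subset(ranked, S_sparsity < 100)  # discard 100% sparse solutions
head(ranked)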
See also: sim_na(), sparsity(), matrix_rank(), get_pcp_defaults()

Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S with outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
data$D,
rrmc,
data.frame("eta" = 0.35),
r = 3,
num_runs = 2
)
gs$summary_stats
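# As noted in the description, the call can be parallelized and wrapped in
# progressr::with_progress() for a progress bar. The sketch below assumes
# the future and progressr packages are installed, and that
# parallel_strategy accepts future-style strategy names such as
# "multisession"; the grid values are illustrative only.
library(progressr)
with_progress({
  gs_par <- grid_search_cv(
    data$D,
    rrmc,
    data.frame("eta" = c(0.1, 0.35)),
    r = 3,
    parallel_strategy = "multisession",
    num_workers = 2,
    num_runs = 4
  )
})
gs_par$summary_stats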