vanilla_search: Conducts a cross-validated grid search of the parameters for...

vanilla_search    R Documentation

Conducts a cross-validated grid search of the parameters for Principal Component Pursuit (PCP).

Description

vanilla_search conducts a cross-validated grid search of PCP parameters, given a data matrix mat, a PCP function pcp_func, and a grid of parameter settings to search through, grid. See the Methods section below for more details.

Usage

vanilla_search(
  mat,
  pcp_func,
  grid,
  scale_func = NULL,
  parallel_approach = "multisession",
  cores = parallel::detectCores(logical = F),
  perc_test = 0.15,
  runs = 1,
  conserve_memory = FALSE,
  verbose = TRUE,
  save_as = NULL,
  ...
)

Arguments

mat

The data matrix to conduct the grid search on.

pcp_func

The PCP function to use when grid searching. Note: the PCP function passed must be able to handle missing (NA) values. For example: root_pcp_na.

grid

A data.frame with dimension N x P containing the N-many settings of P-many parameters to try. The columns of grid should be named exactly as they are in the function header of pcp_func. For example, if pcp_func = root_pcp_noncvx_na, then the columns of grid should be named "lambda", "mu", and "r". An optional additional column named "rel_err" can be included that contains the mean relative error already recorded for each row's parameter setting, with rows (settings) that have not yet been tried left as NA. In this way, you can run a grid search in which the relative errors of some parameter settings are already known, and only the unexplored parts of the grid are evaluated.
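
As a minimal sketch of the grid layout described above, the following base R snippet builds a grid for a PCP function whose header names its parameters lambda and mu, and adds the optional rel_err column. The parameter values and the single pre-filled error are placeholders chosen only for illustration, not recommended settings or real results.

```r
# Candidate values for each parameter (placeholder numbers, for illustration only)
lambdas <- c(0.1, 0.2)
mus <- c(1, 2, 3)

# One row per combination; column names must match pcp_func's argument names
grid <- expand.grid(lambda = lambdas, mu = mus)

# Optional: record errors that are already known, leaving untried settings as NA
grid$rel_err <- NA_real_
grid$rel_err[1] <- 0.12  # hypothetical value standing in for a previously recorded error
```

Rows whose rel_err is NA are the settings that still need to be searched.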

scale_func

(Optional) The function used to scale the input mat by column. By default, scale_func = NULL, and no scaling is performed.

parallel_approach

(Optional) The computational approach used when conducting the gridsearch (to be passed on to the future package's plan function). Must be one of: "sequential", "multisession", "multicore". By default, parallel_approach = "multisession", which does parallelization via sockets (in separate R sessions) and works on any operating system. If parallel_approach = "sequential" then the search will be conducted in serial. The option parallel_approach = "multicore" is not supported on Windows machines, nor in RStudio (must be run from the command line) but is faster than the "multisession" approach since it runs separate forked R processes.

cores

(Optional) The number of cores to use when parallelizing the grid search. By default, cores = parallel::detectCores(logical = F), which is the number of physical CPUs available on the machine.
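
A small sketch of how one might choose a value for cores before calling vanilla_search. It relies only on parallel::detectCores, which can return NA when the count cannot be determined; the fallback to a single core and the choice to leave one core free are assumptions made for this example, not behavior of vanilla_search itself.

```r
# Physical core count, as used by the default; may be NA on some platforms
physical <- parallel::detectCores(logical = FALSE)

# Assumed policy for this sketch: fall back to 1 core if unknown,
# otherwise leave one core free for the rest of the system
cores <- if (is.na(physical)) 1L else max(1L, as.integer(physical) - 1L)
```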

perc_test

(Optional) The fraction of entries of mat that will be randomly set to NA (missing) to form the test set. Can be anything in the range [0, 1). By default, perc_test = 0.15.

runs

(Optional) The number of times to test a given parameter setting. By default, runs = 1.

conserve_memory

(Optional) A logical indicating if you only care about the actual statistics of the gridsearch and would therefore like to conserve memory when running the gridsearch. If set to TRUE, then only statistics on the parameters tested will be returned. By default, conserve_memory = FALSE, in which case additional objects saving the outputs of all runs of pcp_func will also be returned.

verbose

(Optional) A logical indicating if you would like verbose output displayed or not. By default, verbose = TRUE.

save_as

(Optional) A character containing the root of the file path used to save the output to. Importantly, this should not end in any file extension, since this character will be used to save both the resulting [save_as].rds and [save_as]_README.txt files. By default, save_as = NULL, in which case the gridsearch is not saved to any file.
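
To illustrate the naming convention described above, this sketch derives the two output paths from an extension-free root. The root "pcp_search" and the use of tempdir() are assumptions made for the example; the final lines only read the results back if a previous search actually wrote them.

```r
# Root of the output path: note there is no file extension
save_as <- file.path(tempdir(), "pcp_search")

# Files that would be written for this root, per the description above
rds_path <- paste0(save_as, ".rds")
readme_path <- paste0(save_as, "_README.txt")

# Load saved results later, if a search was run with this save_as
if (file.exists(rds_path)) {
  saved_results <- readRDS(rds_path)
}
```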

...

Any parameters required by pcp_func that could not be specified in grid. Importantly, these parameters are therefore kept constant (not involved in the grid search). The best example is the LOD parameter for those PCP functions that require the LOD argument.

Value

A list containing the following:

all_stats

A data.frame containing the statistics of every run comprising the grid search. These statistics include the parameter settings for the run, along with the run number (used as the seed in the corruption step outlined in step 1 of the Methods section), the relative error for the run rel_err, the rank of the recovered L matrix L_rank, the sparsity of the recovered S matrix S_sparsity, the number of iterations PCP took to reach convergence, and the error status run_error of the PCP run (NA if no error, otherwise a character).

summary_stats

A data.frame containing a summary of the information in all_stats. Designed to be passed directly to print_gs.

L_mats

A list containing all the L matrices returned from PCP throughout the gridsearch. Therefore, length(L_mats) == nrow(all_stats). Row i in all_stats corresponds to L_mats[[i]]. Only returned when conserve_memory = FALSE.

S_mats

A list containing all the S matrices returned from PCP throughout the gridsearch. Therefore, length(S_mats) == nrow(all_stats). Row i in all_stats corresponds to S_mats[[i]]. Only returned when conserve_memory = FALSE.

test_mats

A list of length runs containing all the corrupted test mats (and their masks) used throughout the gridsearch. Note: all_stats$run[i] corresponds to test_mats[[i]]. Only returned when conserve_memory = FALSE.

original_mat

The original data matrix mat after it was column scaled by scale_func. Only returned when conserve_memory = FALSE.

constant_params

A copy of the constant parameters that were originally passed to the gridsearch (for record keeping).

Methods

Each hyperparameter setting is cross-validated by:

  1. Randomly corrupting a fraction perc_test of the entries in mat as missing (i.e. NA values), yielding corrupted_mat. Done via the corrupt_mat_randomly function.

  2. Running the PCP function (pcp_func) on corrupted_mat, giving L_hat and S_hat.

  3. Recording the relative recovery error of L_hat compared with the input data matrix mat, for only those values that were set to missing during the corruption step, i.e. ||P_OmegaComplement(mat - L_hat)||_F / ||P_OmegaComplement(mat)||_F, where P_OmegaComplement keeps only the held-out (test set) entries.

  4. Repeating steps 1-3 for a total of runs many times.

  5. Reporting the mean of the runs-many runs for each parameter setting.
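
The five steps above can be sketched in base R as follows. This is an illustrative approximation under stated assumptions, not the package's implementation: toy_pcp is a hypothetical stand-in for a real PCP function (mean imputation followed by a truncated SVD), cv_rel_err is a hypothetical helper name, and the corruption is done inline rather than via corrupt_mat_randomly.

```r
# Hypothetical stand-in for a PCP function (NOT part of pcpr): fills NAs with
# column means, then uses a rank-r truncated SVD as L and the residual as S.
toy_pcp <- function(D, r = 1) {
  mask <- is.na(D)
  D_filled <- D
  D_filled[mask] <- colMeans(D, na.rm = TRUE)[col(D)[mask]]
  sv <- svd(D_filled)
  k <- seq_len(r)
  L <- sv$u[, k, drop = FALSE] %*% diag(sv$d[k], nrow = r) %*% t(sv$v[, k, drop = FALSE])
  S <- D_filled - L
  S[mask] <- 0
  list(L = L, S = S)
}

# Hypothetical helper: cross-validated mean relative error for one setting.
# Extra arguments in ... are forwarded to pcp_func as the parameter setting.
cv_rel_err <- function(mat, pcp_func, ..., perc_test = 0.15, runs = 1) {
  errs <- vapply(seq_len(runs), function(run) {
    set.seed(run)  # the run number doubles as the corruption seed
    test_idx <- sample(length(mat), size = floor(perc_test * length(mat)))
    corrupted_mat <- mat
    corrupted_mat[test_idx] <- NA                 # step 1: hold out test entries
    fit <- pcp_func(corrupted_mat, ...)           # step 2: run PCP on corrupted data
    num <- sqrt(sum((mat[test_idx] - fit$L[test_idx])^2))
    den <- sqrt(sum(mat[test_idx]^2))
    num / den                                     # step 3: error on held-out entries
  }, numeric(1))                                  # step 4: repeated runs-many times
  mean(errs)                                      # step 5: mean over runs
}

# Usage on a simulated rank-3 matrix, searching over the toy parameter r
set.seed(42)
mat <- matrix(rnorm(50 * 3), 50, 3) %*% matrix(rnorm(3 * 10), 3, 10)
grid <- data.frame(r = 1:4)
grid$rel_err <- vapply(
  grid$r,
  function(r) cv_rel_err(mat, toy_pcp, r = r, perc_test = 0.15, runs = 5),
  numeric(1)
)
```

Placing ... before perc_test and runs in the helper forces those two arguments to be matched by their full names, so a forwarded parameter such as r cannot be partially matched to runs.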

See Also

Older versions of PCP's gridsearch (not recommended): grid_search_cv, random_search_cv, bayes_search_cv, and print_gs

Examples


library(pcpr) # since we will be passing vanilla_search a PCP function

# simulate a data matrix:

n <- 50
p <- 10
data <- sim_data(sim_seed = 1, nrow = n, ncol = p, rank = 3, sigma = 0, add_sparse = FALSE)
mat <- data$M

# pick parameter settings of lambda and mu to try:

lambdas <- c(1/sqrt(n), 1.25/sqrt(n), 1.5/sqrt(n))
mus <- c(sqrt(p/2), sqrt(p/1.5), sqrt(p/1.25))
param_grid <- expand.grid(lambda = lambdas, mu = mus)

# run the grid search:

# (argument names follow the Usage section: grid and perc_test)
search_results <- vanilla_search(mat, pcp_func = root_pcp_na, grid = param_grid, cores = 4, perc_test = 0.2, runs = 20, verbose = TRUE, save_as = NULL)

# visualize the output:

print_gs2(search_results$summary_stats)

Columbia-PRIME/PCPhelpers documentation built on April 24, 2022, 7:57 p.m.