random_search_cv: Conducts a cross-validated randomized search of the...

View source: R/random_search_cv.R

random_search_cv    R Documentation

Conducts a cross-validated randomized search of the parameters for Principal Component Pursuit (PCP).

Description

random_search_cv conducts a cross-validated randomized search of the parameters for a given PCP function, given a data matrix mat and parameter settings to search through. See the Methods section below for more details.

Usage

random_search_cv(
  mat,
  pcp_func,
  grid_df,
  n_evals,
  cores = NULL,
  perc_b = 0.2,
  runs = 100,
  seed = NULL,
  progress_bar = TRUE,
  file = NULL,
  ...
)

Arguments

mat

The data matrix to conduct the grid search on.

pcp_func

The PCP function to use when grid searching. Note: the PCP function passed must be able to handle missing NA values. For example: root_pcp_na.

grid_df

A data frame with dimension N x P containing the N-many settings of P-many parameters to try. The columns of grid_df should be named exactly as they are in the function header of pcp_func. For example, if pcp_func = root_pcp_noncvx_na, then the columns of grid_df should be named "lambda", "mu", and "r" (assuming you want to search all 3 parameters; if one of those parameters is constant, pass it as a free argument to this function instead of giving it its own column in grid_df. See ... below). An optional additional column named "value" can hold the mean relative error already recorded for each row's parameter setting, with rows (settings) that have not been tried left as NA. This lets you run a search in which the relative errors of some parameter settings are already known and you want to explore the remaining parts of the grid. For example: conduct a Bayesian grid search examining 10 of 50 settings, then search again over another 10 settings while including the information learned from the first run.
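As a minimal sketch of the optional "value" column, the grid below is built in base R with two of its nine settings already scored. The two error values are made-up placeholders for illustration, not results from any real run.

```r
# Hypothetical 3 x 3 grid over lambda and mu (values chosen for illustration).
n <- 50
p <- 10
grid_df <- expand.grid(
  lambda = c(1, 1.25, 1.5) / sqrt(n),
  mu = sqrt(p / c(2, 1.5, 1.25))
)

# Mark every setting as untried, then fill in the settings scored earlier.
# The two error values below are placeholders, not real results.
grid_df$value <- NA_real_
grid_df$value[c(1, 5)] <- c(0.12, 0.09)

# Settings that have not been tried yet:
untried <- grid_df[is.na(grid_df$value), ]
nrow(untried)
```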

n_evals

The number of parameter settings in grid_df you would like to evaluate.

cores

The number of cores to use when parallelizing the search. If cores = 1, the search will be conducted sequentially. If cores > 1, then the search will be parallelized. By default, cores = NULL, which uses the maximum number of cores available on your machine. For optimal performance, cores should usually be set to half that.
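One way to follow the "half the available cores" advice is sketched below. It relies only on the parallel package that ships with R; the variable names are illustrative.

```r
# detectCores() can return NA on some platforms, so fall back to one core.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Use half of the available cores, but never fewer than one.
cores <- max(1L, available %/% 2L)
cores
```

The resulting value can then be supplied as the cores argument.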

perc_b

The proportion of entries of the matrix mat (expressed as a fraction between 0 and 1) that will be randomly set to NA missing values. By default, perc_b = 0.2, i.e. 20% of the entries.

runs

The number of times to test a given parameter setting. By default, runs = 100.

seed

The seed used when randomly selecting parameter settings to evaluate. By default, seed = NULL, so no seed is set and the selection can differ between calls. For reproducible results, set seed to some whole number.
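The interplay of seed and n_evals can be pictured with the base R sketch below. It assumes the settings are drawn uniformly without replacement from the rows of grid_df; that is an assumption made for illustration, not a description of the package's internal code.

```r
# Hypothetical grid with 9 settings.
grid_df <- expand.grid(lambda = c(0.1, 0.2, 0.3), mu = c(1, 2, 3))
n_evals <- 4
seed <- 1

# Fixing the seed makes the random selection repeatable.
set.seed(seed)
first_pick <- sample(nrow(grid_df), n_evals)

set.seed(seed)
second_pick <- sample(nrow(grid_df), n_evals)

# The same seed selects the same rows both times.
identical(first_pick, second_pick)
grid_df[first_pick, ]
```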

progress_bar

An optional logical indicating if you would like a progress bar displayed or not. By default, progress_bar = TRUE.

file

An optional character string giving the file path where the output will be saved. Should end in ".Rda". When file = NULL, the output is not saved. By default, file = NULL.

...

Any parameters required by pcp_func that were not specified in grid_df, and therefore are kept constant (not involved in the grid search). An example could be the LOD parameter for those PCP functions that require the LOD argument.
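The sketch below illustrates the general idea of holding an argument constant through ... while other parameters come from a row of grid_df. Both toy_pcp and run_setting are hypothetical helpers written for this illustration; they are not part of PCPhelpers or pcpr.

```r
# Hypothetical stand-in for a PCP function; it only echoes its parameters.
toy_pcp <- function(mat, lambda, mu, LOD) {
  list(lambda = lambda, mu = mu, LOD = LOD)
}

# Hypothetical wrapper that forwards constants captured by ... to func,
# alongside the parameters taken from one row of the grid.
run_setting <- function(mat, func, setting, ...) {
  constants <- list(...)
  do.call(func, c(list(mat = mat), as.list(setting), constants))
}

mat <- matrix(1:6, nrow = 2)
grid_df <- data.frame(lambda = c(0.1, 0.2), mu = c(1, 2))

# LOD is not a column of grid_df, so it stays the same for every setting.
out <- run_setting(mat, toy_pcp, grid_df[2, ], LOD = 0.5)
str(out)
```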

Value

A list containing the following:

raw

A data.frame containing the raw statistics of each run comprising the search. These statistics include the parameter settings for the run, the random seed used for the corruption step outlined in step 1 of the Methods section below, the relative error for the run, the rank of the recovered L matrix, the sparsity of the recovered S matrix, and the number of iterations PCP took to reach convergence (a value of 20,000 means PCP did not converge, as of PCPhelpers v. 0.3.1).

formatted

A data.frame containing the summary of the search, formatted so that it can be passed directly to print_gs.

constants

A list containing those arguments initially passed as constant values when calling random_search_cv.

Methods

Each hyperparameter setting is cross-validated by:

  1. Randomly corrupting perc_b percent of the entries in mat as missing (i.e. NA values), yielding corrupted_mat. Done via the corrupt_mat_randomly function.

  2. Running a PCP function (pcp_func) on corrupted_mat, giving L_hat and S_hat.

  3. Recording the relative recovery errors of L_hat + S_hat compared with the raw input data matrix for only those values that were imputed as missing during the corruption step.

  4. Repeating steps 1-3 for a total of runs many times.

  5. Reporting the mean of the runs-many runs for each parameter setting.
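
The steps above can be sketched in base R for a single parameter setting. This is an illustration under stated assumptions, not the package's implementation: toy_pcp is a hypothetical stand-in that fills missing entries with column means, cv_one_setting is a hypothetical helper, and the relative error is assumed to be the ratio of Euclidean norms over the hidden entries.

```r
# Hypothetical stand-in for pcp_func: fills NAs with column means and returns
# a zero sparse component. A real PCP function would go here instead.
toy_pcp <- function(mat) {
  L <- mat
  for (j in seq_len(ncol(L))) {
    L[is.na(L[, j]), j] <- mean(L[, j], na.rm = TRUE)
  }
  list(L = L, S = matrix(0, nrow(mat), ncol(mat)))
}

cv_one_setting <- function(mat, pcp_func, perc_b = 0.2, runs = 5, seed = 1) {
  set.seed(seed)
  errs <- numeric(runs)
  for (i in seq_len(runs)) {
    # Step 1: hide a random perc_b fraction of the entries.
    hidden <- sample(length(mat), floor(perc_b * length(mat)))
    corrupted_mat <- mat
    corrupted_mat[hidden] <- NA
    # Step 2: run the PCP function on the corrupted matrix.
    fit <- pcp_func(corrupted_mat)
    # Step 3: relative error on the hidden entries only (assumed formula).
    recovered <- (fit$L + fit$S)[hidden]
    errs[i] <- sqrt(sum((mat[hidden] - recovered)^2)) / sqrt(sum(mat[hidden]^2))
  }
  # Steps 4 and 5: the loop repeats the run; report the mean over all runs.
  mean(errs)
}

set.seed(42)
mat <- matrix(rnorm(50 * 10, mean = 5), nrow = 50)
err <- cv_one_setting(mat, toy_pcp)
err
```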

See Also

grid_search_cv, bayes_search_cv, and print_gs

Examples


library(PCPhelpers) # provides random_search_cv and print_gs
library(pcpr) # since we will be passing random_search_cv a PCP function

# simulate a data matrix:

n <- 50
p <- 10
data <- sim_data(sim_seed = 1, nrow = n, ncol = p, rank = 3, sigma=0, add_sparse = FALSE)
mat <- data$M

# pick parameter settings of lambda and mu to try:

lambdas <- c(1/sqrt(n), 1.25/sqrt(n), 1.5/sqrt(n))
mus <- c(sqrt(p/2), sqrt(p/1.5), sqrt(p/1.25))
param_grid <- expand.grid(lambda = lambdas, mu = mus)

# run the randomized search:

param_grid.out <- random_search_cv(
  mat,
  pcp_func = root_pcp_na,
  grid_df = param_grid,
  n_evals = 4,
  cores = 4,
  perc_b = 0.2,
  runs = 20,
  seed = 1,
  progress_bar = TRUE,
  file = NULL
)

# visualize the output:

print_gs(param_grid.out$formatted)

Columbia-PRIME/PCPhelpers documentation built on April 24, 2022, 7:57 p.m.