#' Conducts a cross-validated bayesian search of the parameters for Principle Component Pursuit (PCP).
#'
#' \code{bayes_search_cv} conducts a cross-validated bayesian search of the
#' parameters for a given PCP function, given a data matrix \code{mat} and parameter settings to search through.
#' Briefly, a bayesian grid search fits a gaussian process (GP) prior to evaluated parameter settings and then
#' uses that prior to calculate a conditional distribution. Given the calculated conditional distribution, an acquisition function
#' is then used to pick the next parameter setting for evaluation, essentially conducting a "smart" search of the parameter grid.
#' The hope is to avoid a brute-force search of the entire parameter grid and instead only evaluate a fraction of the possible parameter settings
#' yet still find the optimal parameter settings.
#' See the \strong{Methods} section below for details on the parameter evaluation step.
#'
#' @param mat The data matrix to conduct the grid search on.
#' @param pcp_func The PCP function to use when grid searching. \emph{Note: the PCP function passed must be able to handle missing \code{NA} values.} For example: \code{root_pcp_na}.
#' @param grid_df A dataframe with dimension N x P containing the N-many settings of P-many parameters to try. \emph{The columns of grid_df should be named exactly as they are in the function
#' header of} \code{pcp_func}. For example, if \code{pcp_func = root_pcp_noncvx_na}, then the columns of \code{grid_df} should be named "lambda", "mu", and "r"
#' (assuming you want to search all 3 parameters; if one of those parameters is constant, instead of giving it its own column in \code{grid_df}, you can simply pass it as a free argument to this method. See \code{...} below).
#' An optional additional column named "value" can be included that contains the mean relative errors recorded by that row's parameter setting, with those rows (settings) that have not been tried left as \code{NA}.
#' In this way, you can perform a grid search in which you already know the relative errors of some parameter settings, but would like to expand your knowledge of the unexplored parts of the grid further.
#' Ex: conduct a a bayesian grid search, examining 10/50 settings. Then search again, looking at another 10 settings, but including the information learned from the first run.
#' @param init_evals The number of parameter settings in \code{grid_df} you would like to randomly evaluate for the initialization step.
#' @param bayes_evals The number of parameter settings you would like the bayesian grid search to evaluate.
#' @param cores The number of cores to use when parallelizing the grid search. If \code{cores = 1}, the search will be conducted sequentially. If \code{cores > 1}, then the search will be parallelized.
#' By default, \code{cores =} the maximum available cores on your machine. For optimal performance, \code{cores} should usually be set to half that.
#' @param acquisition_function The acquisition function to guide the bayesian grid search's picks. Must be one of \code{"ei", "poi", "cb"}. By default \code{acquisition_function = "ei"}. It is recommended to try the default before changing acquisition function.
#' @param perc_b The percent of entries of the matrix \code{mat} that will be randomly imputed as \code{NA} missing values. By default, \code{perc_b = 0.2}.
#' @param runs The number of times to test a given parameter setting. By default, \code{runs = 100}.
#' @param seed The seed used for the initialization step (essentially a small random grid search). By default, \code{seed = NULL} to simulate randomness. For reproducible results, set \code{seed} to some whole number.
#' @param verbose An optional logical indicating if you would like a progress bar displayed or not. By default, \code{verbose = TRUE}.
#' @param file An optional character containing the file path used to save the output in. Should end in "\code{.Rda}". When \code{file = NULL}, the output is not saved. By default, \code{file = NULL}.
#' @param ... \emph{Any parameters required by \code{pcp_func} that were not specified in \code{grid_df}, and therefore are kept constant (not involved in the grid search).
#' An example could be the \code{LOD} parameter for those PCP functions that require the \code{LOD} argument.}
#'
#' @section Methods:
#' Each hyperparameter setting of is cross-validated by:
#' \enumerate{
#' \item Randomly corrupting \code{perc_b} percent of the entries in \code{mat} as missing (i.e. \code{NA} values), yielding \code{corrupted_mat}.
#' Done via the \code{\link{corrupt_mat_randomly}} function.
#' \item Running a PCP function (\code{pcp_func}) on \code{corrupted_mat}, giving \code{L_hat} and \code{S_hat}.
#' \item Recording the relative recovery errors of \code{L_hat + S_hat} compared with the raw input data matrix for only those values that were imputed as missing during the corruption step.
#' \item Repeating steps 1-3 for a total of \code{runs} many times.
#' \item Reporting the mean of the \code{runs}-many runs for each parameter setting.
#' }
#'
#' @return A list containing the following:
#' \describe{
#' \item{\code{raw}}{a \code{data.frame} containing the raw statistics of each run comprising the grid search.
#' These statistics include the parameter settings for the run,
#' the random \code{seed} used for the corruption step outlined in step 1 of the \strong{Methods} section below,
#' the relative error for the run, the rank of the recovered L matrix, the sparsity of the recovered S matrix,
#' and the number of iterations PCP took to reach convergence (20,000 = Did not converge as of PCPhelpers v. 0.3.1).}
#' \item{\code{formatted}}{A \code{data.frame} containing the summary of the grid search.
#' Made to easily pass on to \code{\link{print_gs}}.}
#' \item{\code{bayes.picks}}{A \code{data.frame} summarizing those parameter settings that the bayesian grid search chose to evaluate.}
#' \item{\code{constants}}{A list containing those arguments initially passed as constant values when calling \code{bayes_search_cv}.}
#' }
#' @seealso \code{\link{grid_search_cv}}, \code{\link{random_search_cv}}, and \code{\link{print_gs}}
#' @examples
#'
#' library(pcpr) # since we will be passing grid_search_cv a PCP function
#'
#' # simulate a data matrix:
#'
#' n <- 50
#' p <- 10
#' data <- sim_data(sim_seed = 1, nrow = n, ncol = p, rank = 3, sigma=0, add_sparse = FALSE)
#' mat <- data$M
#'
#' # pick parameter settings of lambda and mu to try:
#'
#' lambdas <- c(1/sqrt(n), 1.25/sqrt(n), 1.5/sqrt(n))
#' mus <- c(sqrt(p/2), sqrt(p/1.5), sqrt(p/1.25))
#' param_grid <- expand.grid(lambda = lambdas, mu = mus)
#'
#' # run the grid search:
#'
#' param_grid.out <- bayes_search_cv(mat, pcp_func = root_pcp_na, grid_df = param_grid, init_evals = 3, bayes_evals = 3, cores = 4, acquisition_function = "ei", perc_b = 0.2, runs = 20, seed = 1, verbose = TRUE, file = NULL)
#'
#' # visualize the output:
#'
#' print_gs(param_grid.out$formatted)
#' @export
#' @importFrom magrittr %>%
#' @importFrom foreach foreach %do%
bayes_search_cv <- function(
mat,
pcp_func,
grid_df,
init_evals,
bayes_evals,
cores = NULL,
acquisition_function = "ei",
perc_b = 0.2,
runs = 100,
seed = NULL,
verbose = TRUE,
file = NULL,
...)
{
if (acquisition_function == "ei") {
acquire <- expected_improvement
} else if (acquisition_function == "poi") {
acquire <- probability_improvement
} else if (acquisition_function == "cb") {
acquire <- lower_confidence_bound
} else {
acquire <- NULL
stop("acquisition_function must be one of: 'ei', 'poi', or 'cb'")
}
if (verbose) print("Conducting bayesian search...")
# initialization
metrics <- c("value", "L.rank", "S.sparsity", "iterations")
for (metric in metrics) {
if (!metric %in% names(grid_df)) {
grid_df[, metric] <- as.numeric(NA)
}
}
evaled_count <- sum(!is.na(grid_df$value)*1)
if (evaled_count < init_evals) {
if (verbose) print("Running the initial random search step 1/2:")
cv.init <- random_search_cv(mat=mat, pcp_func=pcp_func, grid_df=grid_df, n_evals=init_evals - evaled_count, cores = cores, perc_b=perc_b, runs=runs, seed=seed, progress_bar = verbose, file = NULL, ...)
grid_df <- cv.init$formatted
history.init <- cv.init$raw
} else {
if (verbose) print("No need to run the initial random search step 1/2. init_evals already made.")
history.init <- data.frame()
}
kernel <- list(
cov = list(type = "exponential", power = 1.95),
matern = list(type = "matern", nu = 5/2)
)
metrics <- c("value", "L.rank", "S.sparsity", "iterations")
param_names <- grid_df %>% dplyr::select(!tidyselect::all_of(metrics)) %>% colnames()
X <- as.matrix(grid_df[, param_names])
X.scaled <- as.data.frame(apply(X, 2, GPfit::scale_norm))
#### grid search ####
points_to_eval_count <- sum(is.na(grid_df$value)*1)
if (bayes_evals > points_to_eval_count) {
bayes_evals <- points_to_eval_count
}
if (verbose) print("Running the final bayes search step 2/2:")
bayes.picks <- data.frame()
history.bayes <- foreach(i = iterators::icount(bayes_evals), .combine = rbind) %do% {
if (verbose) print(paste("Conducting bayes eval:", i, "/", bayes_evals))
# fit GP with all the points we have so far
points_evaled <- which(!is.na(grid_df$value))
X.evaled <- as.matrix(X.scaled[points_evaled,])
Y.evaled <- as.numeric(grid_df[points_evaled,]$value)
fit <- GPfit::GP_fit(
X = X.evaled,
Y = Y.evaled,
corr = kernel$cov
)
# look at all the points we have yet to evaluate
points_to_eval.scaled <- X.scaled[-points_evaled, ]
points_to_eval <- grid_df[-points_evaled, ]
pred <- GPfit::predict.GP(fit, xnew = points_to_eval.scaled)
gp.mu <- pred$Y_hat
gp.sigma <- sqrt(pred$MSE)
idx.best <- which.min(grid_df$value)
y.best <- as.numeric(grid_df[idx.best, c("value")])
# predict and try another point (acquire = acquisition_function).
x.next <- acquire(gp.mu, gp.sigma, y.best, points_to_eval)
x.next.r <- grid_search_cv(mat=mat, pcp_func=pcp_func, grid_df=x.next, cores = cores, perc_b=perc_b, runs=runs, progress_bar = verbose, file = NULL, ...)
bayes.picks <- rbind(bayes.picks, x.next.r$formatted)
x.next.r.formatted <- x.next.r$formatted %>% tidyr::unite(param_setting, param_names)
grid_df.formatted <- grid_df %>% tidyr::unite(param_setting, param_names)
grid_df.formatted[match(x.next.r.formatted$param_setting, grid_df.formatted$param_setting), ] <- x.next.r.formatted
grid_df <- grid_df.formatted %>% tidyr::separate(param_setting, into = param_names, sep = "_", convert = TRUE)
x.next.r$raw
}
history <- rbind(history.init, history.bayes)
# save
if (!is.null(file)) save(history, grid_df, bayes.picks, file = file)
# return
list(raw = history, formatted = grid_df, bayes.picks = bayes.picks, constants = list(...))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.