mize: Numerical Optimization


View source: R/mize.R


Numerical optimization including conjugate gradient (CG), Broyden-Fletcher-Goldfarb-Shanno (BFGS) and limited-memory BFGS (L-BFGS) methods.


mize(par, fg, method = "L-BFGS", norm_direction = FALSE, memory = 5,
  scale_hess = TRUE, cg_update = "PR+", preconditioner = "",
  tn_init = 0, tn_exit = "curvature", nest_q = 0,
  nest_convex_approx = FALSE, nest_burn_in = 0, step_up = 1.1,
  step_up_fun = "*", step_down = NULL, dbd_weight = 0.1,
  line_search = "More-Thuente", c1 = 1e-04, c2 = NULL,
  step0 = NULL, step_next_init = NULL, try_newton_step = NULL,
  ls_max_fn = 20, ls_max_gr = Inf, ls_max_fg = Inf,
  ls_max_alpha_mult = Inf, ls_max_alpha = Inf, ls_safe_cubic = FALSE,
  strong_curvature = NULL, approx_armijo = NULL, mom_type = NULL,
  mom_schedule = NULL, mom_init = NULL, mom_final = NULL,
  mom_switch_iter = NULL, mom_linear_weight = FALSE,
  use_init_mom = FALSE, restart = NULL, restart_wait = 10,
  max_iter = 100, max_fn = Inf, max_gr = Inf, max_fg = Inf,
  abs_tol = sqrt(.Machine$double.eps), rel_tol = abs_tol,
  grad_tol = NULL, ginf_tol = NULL,
  step_tol = sqrt(.Machine$double.eps), check_conv_every = 1,
  log_every = check_conv_every, verbose = FALSE,
  store_progress = FALSE)



Arguments

par: Initial values for the function to be optimized over.


fg: Function and gradient list. See 'Details'.


method: Optimization method. See 'Details'.


norm_direction: If TRUE, the steepest descent direction is normalized to unit length. Useful for adaptive step size methods where the previous step size is used to initialize the next iteration.


memory: The number of updates to store if using the L-BFGS method. Ignored otherwise. Must be a positive integer.


scale_hess: If TRUE, the approximation to the inverse Hessian is scaled according to the method described by Nocedal and Wright (approximating an eigenvalue). Applies only to the methods BFGS (where the scaling is applied only during the first step) and L-BFGS (where the scaling is applied during every iteration). Ignored otherwise.


cg_update: Type of update to use for the "CG" method. For details see the "CG" subsection of the 'Optimization Methods' section. Ignored if method is not "CG".


preconditioner: Type of preconditioner to use in Truncated Newton. Leave blank or set to "L-BFGS" to use a limited memory BFGS preconditioner. Use the memory parameter to control the number of updates to store. Applies only if method = "TN" or "CG"; ignored otherwise.


tn_init: Type of initialization to use in the inner loop of Truncated Newton. Use 0 for the zero vector (the usual TN initialization), or "previous" to use the final result from the previous iteration, as suggested by Martens (2010). Applies only if method = "TN"; ignored otherwise.


tn_exit: Type of exit criterion to use when terminating the inner CG loop of the Truncated Newton method. Either "curvature" to use the standard negative curvature test, or "strong" to use the modified "strong" curvature test in TNPACK (Xie and Schlick, 1999). Applies only if method = "TN"; ignored otherwise.


nest_q: Strong convexity parameter for the NAG momentum term. Must take a value between 0 (strongly convex) and 1 (zero momentum). Applies only to the NAG method or a momentum method with a Nesterov momentum schedule. Does nothing if nest_convex_approx is TRUE.


nest_convex_approx: If TRUE, use an approximation due to Sutskever for calculating the momentum parameter in the NAG method. Applies only to the NAG method or a momentum method with a Nesterov momentum schedule.


nest_burn_in: Number of iterations to wait before using a non-zero momentum. Applies only to the NAG method or a momentum method with a Nesterov momentum schedule.


step_up: Value by which to increase the step size for the "bold" step size method or the "DBD" method.


step_up_fun: Operator to use when combining the current step size with step_up. Can be one of "*" (to multiply the current step size by step_up) or "+" (to add).


step_down: Multiplier by which to reduce the step size if using the "DBD" method or the "bold" line search method. Should be a positive value less than 1. Also optional for use with the "back" line search method.


dbd_weight: Weighting parameter used by the "DBD" method only, and only if no momentum scheme is provided. Must be a value between 0 and 1.


line_search: Type of line search to use. See 'Details'.


c1: Sufficient decrease parameter for Wolfe-type line searches. Should be a value between 0 and 1.


c2: Sufficient curvature parameter for Wolfe-type line searches. Should be a value between c1 and 1.


step0: Initial value for the line search on the first step. See 'Details'.


step_next_init: For Wolfe-type line searches only, how to initialize the line search on iterations after the first. See 'Details'.


try_newton_step: For Wolfe-type line searches only, whether to try the Newton step size of 1 as the initial step size whenever step_next_init suggests a step size > 1. Defaults to TRUE for quasi-Newton methods such as BFGS and L-BFGS, FALSE otherwise.


ls_max_fn: Maximum number of function evaluations allowed during a line search.


ls_max_gr: Maximum number of gradient evaluations allowed during a line search.


ls_max_fg: Maximum number of function or gradient evaluations allowed during a line search.


ls_max_alpha_mult: The maximum value that can be attained by the ratio of the initial guess for alpha for the current line search to the final value of alpha of the previous line search. Used to stop line searches diverging due to very large initial guesses. Applies only to Wolfe-type line searches.


ls_max_alpha: Maximum value of alpha allowed during the line search. Applies only to line_search = "More-Thuente".


ls_safe_cubic: (Optional). If TRUE, check that cubic interpolation in the Wolfe line search does not produce too small a value, using the method of Xie and Schlick (2002). Applies only to line_search = "More-Thuente".


strong_curvature: (Optional). If TRUE, use the strong curvature condition in the Wolfe line search. See the 'Line Search' section for details.


approx_armijo: (Optional). If TRUE, use the approximate Armijo condition in the Wolfe line search. See the 'Line Search' section for details.


mom_type: Momentum type, either "classical" or "nesterov". See 'Details'.


mom_schedule: Momentum schedule. See 'Details'.


mom_init: Initial momentum value.


mom_final: Final momentum value.


mom_switch_iter: For mom_schedule "switch" only, the iteration at which mom_init is changed to mom_final.


mom_linear_weight: If TRUE, the gradient contribution to the update is weighted using the momentum contribution.


use_init_mom: If TRUE, the momentum coefficient on the first iteration is non-zero. Otherwise, it is zero. Applies only if using a momentum schedule.


restart: Momentum restart type. Can be one of "fn", "gr" or "speed". See 'Details'. Ignored if no momentum scheme is being used.


restart_wait: Number of iterations to wait between restarts. Ignored if restart is NULL.


max_iter: Maximum number of iterations to optimize for. Defaults to 100. See the 'Convergence' section for details.


max_fn: Maximum number of function evaluations. See the 'Convergence' section for details.


max_gr: Maximum number of gradient evaluations. See the 'Convergence' section for details.


max_fg: Maximum number of function or gradient evaluations. See the 'Convergence' section for details.


abs_tol: Absolute tolerance for comparing two function evaluations. See the 'Convergence' section for details.


rel_tol: Relative tolerance for comparing two function evaluations. See the 'Convergence' section for details.


grad_tol: Absolute tolerance for the length (l2-norm) of the gradient vector. See the 'Convergence' section for details.


ginf_tol: Absolute tolerance for the infinity norm (maximum absolute component) of the gradient vector. See the 'Convergence' section for details.


step_tol: Absolute tolerance for the size of the parameter update. See the 'Convergence' section for details.


check_conv_every: Positive integer indicating how often to check convergence. Default is 1, i.e. every iteration. See the 'Convergence' section for details.


log_every: Positive integer indicating how often to log convergence results to the console. Ignored if verbose is FALSE. If not an integer multiple of check_conv_every, it will be set to check_conv_every.


verbose: If TRUE, log information about the progress of the optimization to the console.


store_progress: If TRUE, store information about the progress of the optimization in a data frame and include it as part of the return value.


Details

The function to be optimized should be passed as a list to the fg parameter. This should consist of:

fn: a function returning the scalar value of the function to be optimized, given a vector of parameters.

gr: a function returning the gradient vector of the function, given a vector of parameters.

fg: (optional) a function returning a list with components fn and gr, the function and gradient values at the given parameters.

The fg function is optional, but for some methods (e.g. line search methods based on the Wolfe criteria), both the function and gradient values are needed for the same parameter value. Calculating them in the same function can save time if there is a lot of shared work.
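As a minimal sketch (the quadratic function and the name quad_fg are illustrative, not part of mize), an fg list might look like:

```r
# Illustrative fg list for f(x) = sum(x^2), following the fn/gr/fg
# structure described above; fg reuses the gradient as shared work.
quad_fg <- list(
  fn = function(x) sum(x * x),   # function value at x
  gr = function(x) 2 * x,        # gradient at x
  fg = function(x) {
    g <- 2 * x                   # shared intermediate: the gradient
    list(fn = sum(g * x) / 2,    # reuses g: sum(2x * x) / 2 = sum(x^2)
         gr = g)
  }
)
```

Such a list could then be passed as the fg argument, e.g. mize(c(1, 2), quad_fg).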


Value

A list with components:

par: the optimized parameters, corresponding to the lowest function value found during the optimization.

terminate: a two-item list describing why the optimization stopped, consisting of what and val. See the 'Convergence' section for details.

counts: the number of function and gradient evaluations carried out.

progress: a data frame storing the progress of the optimization (present only if store_progress is TRUE).

Optimization Methods

The method parameter specifies the optimization method:

"SD": steepest descent.

"BFGS": the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method.

"L-BFGS": the limited-memory BFGS method (the default). Use the memory parameter to control the number of updates to store.

"CG": conjugate gradient. Use the cg_update parameter to choose the update formula (e.g. "PR+" or "FR").

"TN": Truncated Newton.

"NAG": Nesterov Accelerated Gradient.

"DBD": Delta-Bar-Delta, an adaptive step size method.

"MOM": steepest descent with momentum. See the 'Momentum' section.

For more details on gradient-based optimization in general, and the BFGS, L-BFGS and CG methods, see Nocedal and Wright.

Line Search

The line_search parameter determines the line search to be carried out:

"More-Thuente": the method of More' and Thuente (1994). The default.

"Rasmussen": a Wolfe line search.

"Schmidt": the Wolfe line search used in Schmidt's minFunc (2005).

"Hager-Zhang": the line search from CG_DESCENT (Hager and Zhang, 2006).

"bold": the bold driver method.

"back": backtracking line search.

If using one of the methods "BFGS", "L-BFGS", "CG" or "NAG", one of the Wolfe line searches ("Rasmussen", "More-Thuente", "Schmidt" or "Hager-Zhang") should be used, otherwise very poor performance is likely to be encountered. The following parameters can be used to control the line search: c1, c2, step0, step_next_init, try_newton_step, strong_curvature, approx_armijo, ls_safe_cubic and the ls_max_* parameters (ls_max_fn, ls_max_gr, ls_max_fg, ls_max_alpha_mult and ls_max_alpha).

For the Wolfe line searches, the methods of "Rasmussen", "Schmidt" and "More-Thuente" default to using the strong curvature condition and the exact Armijo condition to terminate the line search (i.e. Strong Wolfe conditions). The default step size initialization methods use the Rasmussen method for the first iteration and quadratic interpolation for subsequent iterations.
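As an illustration of the strong Wolfe conditions referred to above (the function names here are my own, and mize's internal implementation may differ):

```r
# Sufficient decrease (exact Armijo) condition: f(alpha) must lie below
# the line f0 + c1 * alpha * d0, where d0 is the (negative) directional
# derivative along the search direction at alpha = 0.
armijo_ok <- function(f_alpha, f0, d0, alpha, c1 = 1e-4) {
  f_alpha <= f0 + c1 * alpha * d0
}

# Strong curvature condition: the directional derivative at alpha must
# be small in magnitude relative to d0.
strong_curvature_ok <- function(d_alpha, d0, c2 = 0.9) {
  abs(d_alpha) <= -c2 * d0
}

# Example with f(x) = x^2 at x = 1, search direction p = -2 (i.e. -grad):
# d0 = grad(1) * p = -4; at alpha = 0.25, x = 0.5, f = 0.25 and
# d_alpha = grad(0.5) * p = -2. Both conditions hold, so alpha = 0.25
# satisfies the strong Wolfe conditions here.
armijo_ok(0.25, f0 = 1, d0 = -4, alpha = 0.25)   # TRUE
strong_curvature_ok(-2, d0 = -4)                 # TRUE
```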

The "Hager-Zhang" Wolfe line search method defaults to the standard curvature condition and the approximate Armijo condition (i.e. approximate Wolfe conditions). The default step size initialization methods are those used by Hager and Zhang (2006) in the description of CG_DESCENT.
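The approximate Armijo condition of Hager and Zhang replaces the test on function values with one on directional derivatives, which avoids numerical cancellation when f(alpha) and f(0) are nearly equal. A sketch (the function name is mine):

```r
# Approximate Armijo condition (Hager-Zhang): accept alpha when the
# directional derivative at alpha satisfies
#   d_alpha <= (2 * c1 - 1) * d0,
# where d0 is the directional derivative at alpha = 0 and c1 is the
# sufficient decrease parameter.
approx_armijo_ok <- function(d_alpha, d0, c1 = 1e-4) {
  d_alpha <= (2 * c1 - 1) * d0
}
```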

If "DBD" is used as the optimization method, the line_search parameter is ignored, because this method controls both the direction of the search and the step size simultaneously. The following parameters can be used to control the step size: step_up, step_up_fun, step_down and dbd_weight.

The "bold driver" line search also uses the step_up and step_down parameters with similar meanings to their use with the "DBD" method: the backtracking portion reduces the step size by a factor of step_down. Once a satisfactory step size has been found, the line search for the next iteration is initialized by multiplying the previously found step size by step_up.
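The step size rule just described can be sketched as follows (a simplification; mize's actual bookkeeping may differ):

```r
# Bold-driver style step size adaptation: shrink the step by step_down
# while the step is unsatisfactory; once a step succeeds, grow the
# initial step for the next iteration by step_up.
next_step <- function(step, ok, step_up = 1.1, step_down = 0.5) {
  if (ok) step * step_up else step * step_down
}
```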


Momentum

For method "MOM", momentum schemes can be accessed through the momentum arguments: mom_type, mom_schedule, mom_init, mom_final, mom_switch_iter, mom_linear_weight and use_init_mom.

The restart parameter provides a way to restart the momentum if the optimization appears not to be making progress, inspired by the methods of O'Donoghue and Candes (2013) and Su and co-workers (2014). There are three strategies:

"fn": restart when the function value increases between iterations.

"gr": restart when the update is no longer a descent direction with respect to the current gradient.

"speed": restart when the size of the update decreases compared to the previous iteration.

The effect of the restart is to "forget" any previous momentum update vector, and, for those momentum schemes that change with iteration number, to effectively reset the iteration number back to zero. If the mom_type is "nesterov", the gradient-based restart is not available. The restart_wait parameter controls how many iterations to wait after a restart, before allowing another restart. Must be a positive integer. Default is 10, as used by Su and co-workers (2014). Setting this too low could cause premature convergence. These methods were developed specifically for the NAG method, but can be employed with any momentum type and schedule.

If method "MOM" is specified with no other momentum arguments, the momentum scheme will default to a constant value of 0.9.
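Schematically, the classical momentum update combines the previous update with a gradient step (a textbook sketch, not mize's internal code):

```r
# Classical momentum: the new update vector is the momentum-weighted
# previous update minus a step along the current gradient. A restart
# "forgets" prev_update, i.e. resets it to zero.
momentum_update <- function(prev_update, grad, step, mu = 0.9) {
  mu * prev_update - step * grad
}
```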


Convergence

There are several ways for the optimization to terminate. The type of termination is communicated by a two-item list terminate in the return value, consisting of what, a short string describing what caused the termination, and val, the value of the termination criterion that caused termination.

The following parameters control the various stopping criteria: max_iter, max_fn, max_gr, max_fg, abs_tol, rel_tol, grad_tol, ginf_tol and step_tol.

Convergence is checked between specific iterations. How often is determined by the check_conv_every parameter, which specifies the number of iterations between each check. By default, convergence is checked every iteration.

Be aware that if abs_tol or rel_tol are non-NULL, this requires the function to have been evaluated at the current position at the end of each iteration. If the function at that position has not been calculated, it will be calculated and will contribute to the total reported in the counts list in the return value. The calculated function value is cached for use by the optimizer in the next iteration, so if the optimizer would have needed to calculate the function anyway (e.g. use of the strong Wolfe line search methods), there is no significant cost accrued by calculating it earlier for convergence calculations. However, for methods that don't use the function value at that location, this could represent a lot of extra function evaluations. On the other hand, not checking convergence could result in a lot of extra unnecessary iterations. Similarly, if grad_tol or ginf_tol is non-NULL, then the gradient will be calculated if needed.
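As a sketch of the kind of function-value test that abs_tol and rel_tol control (the exact formula mize uses is an assumption here):

```r
# Convergence test on successive function values: converged when the
# change is below abs_tol, or below rel_tol relative to the magnitude
# of the two values.
fn_converged <- function(f_old, f_new,
                         abs_tol = sqrt(.Machine$double.eps),
                         rel_tol = abs_tol) {
  d <- abs(f_new - f_old)
  d < abs_tol || d < rel_tol * (abs(f_old) + abs(f_new)) / 2
}
```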

If extra function or gradient evaluations are an issue, set check_conv_every to a higher value, but be aware that this can cause the convergence limits to be exceeded by a greater amount.

Note also that if the verbose parameter is TRUE, then a summary of the results so far will be logged to the console whenever a convergence check is carried out. If the store_progress parameter is TRUE, then the same information will be returned as a data frame in the return value. For a long optimization this could be a lot of data, so by default it is not stored.

The optimization can also terminate if an iteration generates a non-finite (i.e. Inf or NaN) gradient or function value. Some, but not all, line searches will try to recover from a non-finite function value by reducing the step size, but a non-finite gradient calculation during the gradient descent portion of the optimization is considered catastrophic by mize, and it will give up. Termination under non-finite gradient or function conditions results in terminate$what being "gr_inf" or "fn_inf" respectively. Unlike the convergence criteria, these error conditions are detected and cause termination even on iterations where a convergence check would not otherwise be carried out.

The value of par in the return value should be the parameters corresponding to the lowest function value calculated during the optimization. As discussed above, however, determining this requires a function evaluation at the end of each iteration, which happens only if the optimization method calculates it as part of its own operation or if a convergence check is carried out during that iteration. Therefore, if your method does not carry out function evaluations and check_conv_every is set so large that no convergence calculation occurs before max_iter is reached, the returned value of par is the last value encountered.


References

Gilbert, J. C., & Nocedal, J. (1992). Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization, 2(1), 21-42.

Hager, W. W., & Zhang, H. (2005). A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16(1), 170-192.

Hager, W. W., & Zhang, H. (2006). Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software (TOMS), 32(1), 113-137.

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural networks, 1(4), 295-307.

Janet, J. A., Scoggins, S. M., Schultz, S. M., Snyder, W. E., White, M. W., & Sutton, J. C. (1998, May). Shocking: An approach to stabilize backprop training with greedy adaptive learning rates. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. (Vol. 3, pp. 2218-2223). IEEE.

Martens, J. (2010, June). Deep learning via Hessian-free optimization. In Proceedings of the International Conference on Machine Learning. (Vol. 27, pp. 735-742).

Moré, J. J., & Thuente, D. J. (1994). Line search algorithms with guaranteed sufficient decrease. ACM Transactions on Mathematical Software (TOMS), 20(3), 286-307.

Nocedal, J., Sartenaer, A., & Zhu, C. (2002). On the behavior of the gradient norm in the steepest descent method. Computational Optimization and Applications, 22(1), 5-35.

Nocedal, J., & Wright, S. (2006). Numerical optimization. Springer Science & Business Media.

O'Donoghue, B., & Candes, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3), 715-732.

Schmidt, M. (2005). minFunc: unconstrained differentiable multivariate optimization in Matlab. http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html

Su, W., Boyd, S., & Candes, E. (2014). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Advances in Neural Information Processing Systems (pp. 2510-2518).

Sutskever, I. (2013). Training recurrent neural networks (Doctoral dissertation, University of Toronto).

Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 1139-1147).

Xie, D., & Schlick, T. (1999). Remark on Algorithm 702 - The updated truncated Newton minimization package. ACM Transactions on Mathematical Software (TOMS), 25(1), 108-122.

Xie, D., & Schlick, T. (2002). A more lenient stopping rule for line search algorithms. Optimization Methods and Software, 17(4), 683-700.


Examples

# Define the function and gradient list and a starting point
rosenbrock_fg <- list(
  fn = function(x) { 100 * (x[2] - x[1] * x[1]) ^ 2 + (1 - x[1]) ^ 2  },
  gr = function(x) { c( -400 * x[1] * (x[2] - x[1] * x[1]) - 2 * (1 - x[1]),
                         200 *        (x[2] - x[1] * x[1])) })
rb0 <- c(-1.2, 1)

# Minimize using L-BFGS
res <- mize(rb0, rosenbrock_fg, method = "L-BFGS")

# Conjugate gradient with Fletcher-Reeves update, tight Wolfe line search
res <- mize(rb0, rosenbrock_fg, method = "CG", cg_update = "FR", c2 = 0.1)

# Steepest descent with constant momentum = 0.9
res <- mize(rb0, rosenbrock_fg, method = "MOM", mom_schedule = 0.9)

# Steepest descent with constant momentum in the Nesterov style as described
# in papers by Sutskever and Bengio
res <- mize(rb0, rosenbrock_fg, method = "MOM", mom_type = "nesterov",
             mom_schedule = 0.9)

# Nesterov momentum with adaptive restart comparing function values
res <- mize(rb0, rosenbrock_fg, method = "MOM", mom_type = "nesterov",
             mom_schedule = 0.9, restart = "fn")

jlmelville/mizer documentation built on Sept. 23, 2018, 10:50 p.m.