optimisers: optimisation methods
In greta-dev/greta: Simple and Scalable Statistical Modelling in R

Description

Functions to set up optimisers (which find parameters that maximise the joint density of a model) and change their tuning parameters, for use in opt(). For details of the algorithms and how to tune them, see the TensorFlow optimiser docs or the TensorFlow Probability optimiser docs.

Usage

nelder_mead(
  objective_function = NULL,
  initial_vertex = NULL,
  step_sizes = NULL,
  func_tolerance = 1e-08,
  position_tolerance = 1e-08,
  reflection = NULL,
  expansion = NULL,
  contraction = NULL,
  shrinkage = NULL
)

bfgs(
  value_and_gradients_function = NULL,
  initial_position = NULL,
  tolerance = 1e-08,
  x_tolerance = 0L,
  f_relative_tolerance = 0L,
  initial_inverse_hessian_estimate = NULL,
  stopping_condition = NULL,
  validate_args = TRUE,
  max_line_search_iterations = 50L,
  f_absolute_tolerance = 0L
)

powell()

momentum()

cg()

newton_cg()

l_bfgs_b()

tnc()

cobyla()

slsqp()

gradient_descent(learning_rate = 0.01, momentum = 0, nesterov = FALSE)

adadelta(learning_rate = 0.001, rho = 1, epsilon = 1e-08)

adagrad(learning_rate = 0.8, initial_accumulator_value = 0.1, epsilon = 1e-08)

adagrad_da(
  learning_rate = 0.8,
  global_step = 1L,
  initial_gradient_squared_accumulator_value = 0.1,
  l1_regularization_strength = 0,
  l2_regularization_strength = 0
)

adam(
  learning_rate = 0.1,
  beta_1 = 0.9,
  beta_2 = 0.999,
  amsgrad = FALSE,
  epsilon = 1e-08
)

adamax(learning_rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07)

ftrl(
  learning_rate = 1,
  learning_rate_power = -0.5,
  initial_accumulator_value = 0.1,
  l1_regularization_strength = 0,
  l2_regularization_strength = 0,
  l2_shrinkage_regularization_strength = 0,
  beta = 0
)

proximal_gradient_descent(
  learning_rate = 0.01,
  l1_regularization_strength = 0,
  l2_regularization_strength = 0
)

proximal_adagrad(
  learning_rate = 1,
  initial_accumulator_value = 0.1,
  l1_regularization_strength = 0,
  l2_regularization_strength = 0
)

nadam(learning_rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07)

rms_prop(
  learning_rate = 0.1,
  rho = 0.9,
  momentum = 0,
  epsilon = 1e-10,
  centered = FALSE
)

Arguments

`objective_function`	The function to be minimized. It accepts a point as a real Tensor and returns a Tensor of real dtype containing the value of the function at that point. If `batch_evaluate_objective` is TRUE, the function may be evaluated on a Tensor of shape `[n+1] + s`, where n is the dimension of the problem and s is the shape of a single point in the domain (so n is the size of a Tensor representing a single point). In this case, the expected return value is a Tensor of shape `[n+1]`. Note that this method does not support univariate functions, so the problem dimension n must be strictly greater than 1.
`initial_vertex`	Tensor of real dtype and any shape that can be consumed by the `objective_function`. A single point in the domain that will be used to construct an axis-aligned initial simplex.
`step_sizes`	Tensor of real dtype and shape broadcasting compatible with `initial_vertex`. Supplies the simplex scale along each axis.
`func_tolerance`	A single numeric value. The algorithm stops if the absolute difference between the largest and the smallest function value on the vertices of the simplex is below this number. Default is 1e-08.
`position_tolerance`	A single numeric value. The algorithm stops if the largest absolute difference between the coordinates of the vertices is below this threshold. Default is 1e-08.
`reflection`	(Optional) Positive scalar Tensor of same dtype as `initial_vertex`. This parameter controls the scaling of the reflected vertex. See Press et al. (2007) for details. If not specified, uses the dimension-dependent prescription of Gao and Han (2012), doi:10.1007/s10589-010-9329-3.
`expansion`	(Optional) Positive scalar Tensor of same dtype as `initial_vertex`. Should be greater than 1 and greater than `reflection`. This parameter controls the expanded scaling of a reflected vertex. See Press et al. (2007) for details. If not specified, uses the dimension-dependent prescription of Gao and Han (2012), doi:10.1007/s10589-010-9329-3.
`contraction`	(Optional) Positive scalar Tensor of same dtype as `initial_vertex`. Must be between 0 and 1. This parameter controls the contraction of the reflected vertex when the objective function at the reflected point fails to show sufficient decrease. See Press et al. (2007) for details. If not specified, uses the dimension-dependent prescription of Gao and Han (2012), doi:10.1007/s10589-010-9329-3.
`shrinkage`	(Optional) Positive scalar Tensor of same dtype as `initial_vertex`. Must be between 0 and 1. This parameter is the scale by which the simplex is shrunk around the best point when the other steps fail to produce improvements. See Press et al. (2007) for details. If not specified, uses the dimension-dependent prescription of Gao and Han (2012), doi:10.1007/s10589-010-9329-3.
`value_and_gradients_function`	The function to be minimized. It accepts a point as a real Tensor and returns a tuple of Tensors of real dtype containing the value of the function and its gradient at that point. The input should be of shape `[..., n]`, where n is the size of the domain of input points, and all others are batching dimensions. The first component of the return value should be a real Tensor of matching shape `[...]`. The second component (the gradient) should also be of shape `[..., n]`, like the input value to the function.
`initial_position`	Real Tensor of shape `[..., n]`. The starting point, or points when using batching dimensions, of the search procedure. At these points the function value and the gradient norm should be finite.
`tolerance`	Scalar Tensor of real dtype. Specifies the gradient tolerance for the procedure. If the supremum norm of the gradient vector is below this number, the algorithm is stopped. Default is 1e-08.
`x_tolerance`	Scalar Tensor of real dtype. If the absolute change in the position between one iteration and the next is smaller than this number, the algorithm is stopped. Default is 0L.
`f_relative_tolerance`	Scalar Tensor of real dtype. If the relative change in the objective value between one iteration and the next is smaller than this value, the algorithm is stopped. Default is 0L.
`initial_inverse_hessian_estimate`	Optional Tensor of the same dtype as the components of the output of the `value_and_gradients_function`. Specifies the starting estimate for the inverse of the Hessian at the initial point. If specified, the shape should be broadcastable to shape `[..., n, n]`; e.g. if a single `[n, n]` matrix is provided, it will be automatically broadcast to all batches. Alternatively, one can also specify a different Hessian estimate for each batch member. For the correctness of the algorithm, it is required that this parameter be symmetric and positive definite. If not specified, the identity matrix is used as the starting estimate for the inverse Hessian.
`stopping_condition`	(Optional) A function that takes as input two Boolean tensors of shape `[...]`, and returns a Boolean scalar tensor. The input tensors are `converged` and `failed`, indicating the current status of each respective batch member; the return value states whether the algorithm should stop. The default is `tfp$optimizer$converged_all`, which only stops when all batch members have either converged or failed. An alternative is `tfp$optimizer$converged_any`, which stops as soon as one batch member has converged, or when all have failed.
`validate_args`	Logical, default TRUE. When TRUE, optimizer parameters are checked for validity, at a possible cost to runtime performance. When FALSE, invalid inputs may silently produce incorrect outputs.
`max_line_search_iterations`	Integer. The maximum number of iterations for the Hager-Zhang line search algorithm. Default is 50L.
`f_absolute_tolerance`	Scalar Tensor of real dtype. If the absolute change in the objective value between one iteration and the next is smaller than this value, the algorithm is stopped. Default is 0L.
`learning_rate`	the size of steps (in parameter space) towards the optimal value. The default differs between optimisers (see Usage); it is 0.01 for gradient_descent().
`momentum`	hyperparameter that accelerates gradient descent in the relevant direction and dampens oscillations. Defaults to 0, which is vanilla gradient descent.
`nesterov`	Whether to apply Nesterov momentum. Defaults to FALSE.
`rho`	the decay rate
`epsilon`	a small constant used to condition gradient updates
`initial_accumulator_value`	initial value of the 'accumulator' used to tune the algorithm
`global_step`	the current training step number
`initial_gradient_squared_accumulator_value`	initial value of the accumulators used to tune the algorithm
`l1_regularization_strength`	L1 regularisation coefficient (must be 0 or greater)
`l2_regularization_strength`	L2 regularisation coefficient (must be 0 or greater)
`beta_1`	exponential decay rate for the 1st moment estimates
`beta_2`	exponential decay rate for the 2nd moment estimates
`amsgrad`	Boolean. Whether to apply the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond". Defaults to FALSE.
`learning_rate_power`	power on the learning rate, must be 0 or less
`l2_shrinkage_regularization_strength`	A float value, must be greater than or equal to zero. This differs from `l2_regularization_strength` above in that the latter is a stabilization penalty, whereas this L2 shrinkage is a magnitude penalty. When the input is sparse, shrinkage will only happen on the active weights.
`beta`	A float value, representing the beta value from the paper by McMahan et al. (2013). Defaults to 0.
`centered`	Boolean. If TRUE, gradients are normalized by the estimated variance of the gradient; if FALSE, by the uncentered second moment. Setting this to TRUE may help with training, but is slightly more expensive in terms of computation and memory. Defaults to FALSE.
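
As a rough illustration of how `learning_rate`, `momentum` and `nesterov` interact, the sketch below runs gradient descent on a simple quadratic in plain R. It follows the commonly described momentum update rule and is not greta's or TensorFlow's implementation; `sketch_gd()` and all of its argument names other than the three above are hypothetical and exist only for this example.

```r
# Illustrative only: plain-R gradient descent with optional (Nesterov) momentum.
# `sketch_gd` is a hypothetical helper, not part of greta or TensorFlow.
sketch_gd <- function(grad, start, learning_rate = 0.01, momentum = 0,
                      nesterov = FALSE, n_steps = 1000) {
  w <- start
  velocity <- rep(0, length(start))
  for (i in seq_len(n_steps)) {
    g <- grad(w)
    # accumulate a decaying sum of past gradient steps
    velocity <- momentum * velocity - learning_rate * g
    if (nesterov) {
      # Nesterov variant: look ahead along the updated velocity
      w <- w + momentum * velocity - learning_rate * g
    } else {
      # with momentum = 0 this reduces to vanilla gradient descent
      w <- w + velocity
    }
  }
  w
}

# minimise f(w) = sum((w - target)^2), whose gradient is 2 * (w - target)
target <- c(1, -2)
grad_f <- function(w) 2 * (w - target)

plain <- sketch_gd(grad_f, c(0, 0), learning_rate = 0.1)
heavy <- sketch_gd(grad_f, c(0, 0), learning_rate = 0.1, momentum = 0.9)
nest <- sketch_gd(grad_f, c(0, 0), learning_rate = 0.1, momentum = 0.9,
                  nesterov = TRUE)
```

All three runs should end close to `target`; on harder problems the choice of `learning_rate` and `momentum` matters much more than it does on this well-conditioned quadratic.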

Details

The optimisers powell(), cg(), newton_cg(), l_bfgs_b(), tnc(), cobyla(), and slsqp() are now defunct. They will error when called in greta 0.5.0. They were removed because they are no longer available in TensorFlow 2.0. Note that the optimiser momentum() has been replaced with gradient_descent().
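
A minimal sketch of moving code from momentum() to gradient_descent(), assuming greta and a working TensorFlow installation are available. The `learning_rate` and `momentum` values are illustrative choices, not the defaults of the old momentum() optimiser, and the model is a toy example.

```r
library(greta)

# toy model: estimate the mean and standard deviation of some data
x <- rnorm(100, -2, 1.2)
mu <- variable()
sigma <- variable(lower = 0)
distribution(x) <- normal(mu, sigma)
m <- model(mu, sigma)

# previously: opt(m, optimiser = momentum())
# now: gradient descent with a non-zero momentum term (values are assumptions)
opt_res <- opt(
  m,
  optimiser = gradient_descent(learning_rate = 0.01, momentum = 0.9)
)
opt_res$par
```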

Value

An optimiser object that can be passed to opt().

Note

adagrad_da(): this optimiser isn't supported in TF2, so proceed with caution. See the TF docs on AdagradDAOptimizer for more detail.

proximal_gradient_descent(): this optimiser isn't supported in TF2, so proceed with caution. See the TF docs on ProximalGradientDescentOptimizer for more detail.

proximal_adagrad(): this optimiser isn't supported in TF2, so proceed with caution. See the TF docs on ProximalAdagradOptimizer for more detail.

Examples

## Not run: 
library(greta)

# use optimisation to find the mean and sd of some data
x <- rnorm(100, -2, 1.2)
mu <- variable()
sd <- variable(lower = 0)
distribution(x) <- normal(mu, sd)
m <- model(mu, sd)

# configure optimisers & parameters via 'optimiser' argument to opt
opt_res <- opt(m, optimiser = bfgs())

# compare results with the analytic solution
opt_res$par
c(mean(x), sd(x))

## End(Not run)
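
The tuning parameters listed under Arguments are set when the optimiser is created. The sketch below, which assumes greta and TensorFlow are installed, fits the same kind of toy model with two differently configured optimisers. The `learning_rate` value and the `max_iterations` argument to opt() are illustrative assumptions, as are the `iterations` and `convergence` elements read from the result.

```r
library(greta)

# toy model: estimate the mean and standard deviation of some data
x <- rnorm(100, -2, 1.2)
mu <- variable()
sigma <- variable(lower = 0)
distribution(x) <- normal(mu, sigma)
m <- model(mu, sigma)

# optimisers to compare; tuning parameters go in the constructor call
optimisers <- list(
  bfgs = bfgs(),
  adam = adam(learning_rate = 0.05)
)

# run each optimiser on the same model
results <- lapply(
  optimisers,
  function(o) opt(m, optimiser = o, max_iterations = 500)
)

# summarise estimates and convergence information side by side
lapply(results, function(res) {
  list(par = res$par, iterations = res$iterations, convergence = res$convergence)
})
```

First-order optimisers such as adam() typically need more iterations than bfgs() on a small, smooth problem like this one, so checking the convergence information before comparing estimates is worthwhile.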

greta-dev/greta documentation built on June 10, 2025, 1:47 p.m.