optimizer_muon    R Documentation
Note that this optimizer should not be used in the following layers:
- Embedding layer
- Final output fully connected layer
- Any 0- or 1-D variables
These should all be optimized using AdamW.
The Muon optimizer can use either the Muon update step or the AdamW update step for each variable, based on the following rules:
- For any variable that isn't 2D, 3D or 4D, the AdamW step will be used. This is not configurable.
- If the argument exclude_embeddings (defaults to TRUE) is set to TRUE, the AdamW step will be used for embedding layers.
- For any variable with a name that matches an expression listed in the argument exclude_layers (a list), the AdamW step will be used.
- Any other variable uses the Muon step.
Typically, you only need to pass the name of your densely-connected
output layer to exclude_layers, e.g.
exclude_layers = "output_dense".
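For illustration, a minimal sketch of constructing the optimizer and compiling a model with it. The model object and the "output_dense" layer name are hypothetical; substitute the names from your own model.

opt <- optimizer_muon(
  learning_rate = 0.001,
  exclude_layers = list("output_dense"),  # final dense output layer falls back to AdamW
  exclude_embeddings = TRUE               # embedding layers also fall back to AdamW
)
model |> compile(
  optimizer = opt,
  loss = "sparse_categorical_crossentropy",
  metrics = "accuracy"
)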
optimizer_muon(
learning_rate = 0.001,
adam_beta_1 = 0.9,
adam_beta_2 = 0.999,
epsilon = 1e-07,
weight_decay = 0.1,
clipnorm = NULL,
clipvalue = NULL,
global_clipnorm = NULL,
use_ema = FALSE,
ema_momentum = 0.99,
ema_overwrite_frequency = NULL,
loss_scale_factor = NULL,
gradient_accumulation_steps = NULL,
name = "muon",
exclude_layers = NULL,
exclude_embeddings = TRUE,
muon_a = 3.4445,
muon_b = -4.775,
muon_c = 2.0315,
adam_lr_ratio = 0.1,
momentum = 0.95,
ns_steps = 6L,
nesterov = TRUE,
...
)
learning_rate
A float, a learning rate schedule instance, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defaults to 0.001.
adam_beta_1
A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use. The exponential decay rate for the 1st moment estimates. Defaults to 0.9.
adam_beta_2
A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use. The exponential decay rate for the 2nd moment estimates. Defaults to 0.999.
epsilon
A small constant for numerical stability. This is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. It is used as in AdamW. Defaults to 1e-07.
weight_decay
Float. If set, weight decay is applied.
clipnorm
Float. If set, the gradient of each weight is individually clipped so that its norm is no higher than this value.
clipvalue
Float. If set, the gradient of each weight is clipped to be no higher than this value.
global_clipnorm
Float. If set, the gradient of all weights is clipped so that their global norm is no higher than this value.
use_ema
Boolean, defaults to FALSE. If TRUE, an exponential moving average (EMA) of the model's weights is computed as training progresses, and the weights can be periodically overwritten with their moving average.
ema_momentum
Float, defaults to 0.99. Only used if use_ema = TRUE; the momentum used when computing the EMA of the model's weights.
ema_overwrite_frequency
Int or NULL, defaults to NULL. Only used if use_ema = TRUE. Every ema_overwrite_frequency iterations, the model variables are overwritten with their moving average; if NULL, the variables are not overwritten during training.
loss_scale_factor
Float or NULL, defaults to NULL. If a float, the loss is multiplied by this scale factor before computing gradients, and the gradients are multiplied by its inverse before updating the variables. Useful for preventing underflow during mixed-precision training.
gradient_accumulation_steps
Int or NULL, defaults to NULL. If an int, model and optimizer variables are not updated at every step; instead they are updated every gradient_accumulation_steps steps, using the average of the gradients accumulated since the last update.
name
String, name for the object.
exclude_layers
List of strings, keywords of layer names to exclude. All layers with keywords in their path will use AdamW.
exclude_embeddings
Boolean value, defaults to TRUE. If TRUE, embedding layers are optimized with AdamW.
muon_a
Float, parameter a of the Muon algorithm. It is recommended to use the default value.
muon_b
Float, parameter b of the Muon algorithm. It is recommended to use the default value.
muon_c
Float, parameter c of the Muon algorithm. It is recommended to use the default value.
adam_lr_ratio
Float, the ratio of the learning rate used for the AdamW step to the main learning rate. It is recommended to set it to 0.1.
momentum
Float, momentum used by internal SGD.
ns_steps
Integer, number of Newton-Schulz iterations to run.
nesterov
Boolean, whether to use Nesterov-style momentum.
...
For forward/backward compatibility.
Returns an Optimizer instance.
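The muon_a, muon_b, muon_c and ns_steps arguments parameterize the Newton-Schulz iteration that Muon uses to approximately orthogonalize each 2D gradient before applying the update. The following plain-R sketch is only meant to illustrate that iteration; it is not the library's internal code, which additionally handles transposes, reshaping of higher-dimensional variables, and backend tensors.

newton_schulz <- function(G, a = 3.4445, b = -4.775, c = 2.0315, steps = 6L) {
  # Normalize so the iteration converges, then repeat the cubic update
  # X <- a*X + (b*(X X^T) + c*(X X^T)^2) X for `steps` iterations.
  X <- G / (norm(G, type = "F") + 1e-7)
  for (i in seq_len(steps)) {
    A <- X %*% t(X)
    X <- a * X + (b * A + c * (A %*% A)) %*% X
  }
  X  # approximately orthogonalized update direction
}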
Other optimizers:
optimizer_adadelta()
optimizer_adafactor()
optimizer_adagrad()
optimizer_adam()
optimizer_adam_w()
optimizer_adamax()
optimizer_ftrl()
optimizer_lamb()
optimizer_lion()
optimizer_loss_scale()
optimizer_nadam()
optimizer_rmsprop()
optimizer_sgd()