Description

MADGRAD is a general-purpose optimizer that can be used in place of SGD or Adam, and may converge faster and generalize better. It is currently GPU-only. Typically, the same learning-rate schedule used for SGD or Adam can be reused, but the overall learning rate is not comparable to either method and should be determined by a hyperparameter sweep.

Usage

```
optim_madgrad(params, lr = 0.01, momentum = 0.9, weight_decay = 0, eps = 1e-06)
```

Arguments

`params`
(list) List of parameters to optimize.

`lr`
(float) Learning rate (default: 1e-2).

`momentum`
(float) Momentum value in the range [0, 1) (default: 0.9).

`weight_decay`
(float) Weight decay, i.e. an L2 penalty (default: 0).

`eps`
(float) Term added to the denominator outside of the root operation to improve numerical stability (default: 1e-6).

Details

MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam's beta1 should also work here.

On sparse problems, both `weight_decay` and `momentum` should be set to 0 (sparse gradients are not yet supported in the R implementation).

Value

An optimizer object implementing the `step` and `zero_grad` methods.

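Examples

A minimal usage sketch, assuming the `torch` R package is installed alongside this optimizer; the quadratic objective and the step count are illustrative choices, not prescribed by the package:

```r
library(torch)

# A single parameter tensor; minimizing y = x^2 drives x toward 0.
x <- torch_randn(1, requires_grad = TRUE)
opt <- optim_madgrad(params = list(x), lr = 0.01)

for (i in 1:100) {
  opt$zero_grad()   # clear gradients accumulated in the previous step
  y <- x^2
  y$backward()      # compute d(y)/d(x)
  opt$step()        # apply the MADGRAD update to x
}
```

The same loop structure works with any torch module: pass `model$parameters` as `params`, and call `zero_grad()` / `step()` around each backward pass.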
