MADGRAD is a general-purpose optimizer that can be used in place of SGD or Adam, and may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.

optim_madgrad(params, lr = 0.01, momentum = 0.9, weight_decay = 0, eps = 1e-06)
| Argument | Description |
| --- | --- |
| `params` | (list): List of parameters to optimize. |
| `lr` | (float): Learning rate (default: 1e-2). |
| `momentum` | (float): Momentum value in the range [0,1) (default: 0.9). |
| `weight_decay` | (float): Weight decay, i.e. an L2 penalty (default: 0). |
| `eps` | (float): Term added to the denominator outside of the root operation to improve numerical stability (default: 1e-6). |

MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam's beta1 should work here also.

On sparse problems, both `weight_decay` and `momentum` should be set to 0 (sparse problems are not yet supported in the R implementation).
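Because the learning rate is not comparable to the values used for SGD or Adam, a small sweep is one way to pick it. The following is a minimal sketch under stated assumptions: the helper `final_loss()`, the toy data, the candidate grid, and the CPU fallback are all illustrative and not part of the package. It ranks candidates by training loss only to stay short; a real sweep would normally compare validation loss.

```r
library(torch)

# Hypothetical helper (not part of torch): trains a tiny linear model with
# the given learning rate and returns the final training loss.
final_loss <- function(lr, steps = 50) {
  torch_manual_seed(1)  # same initialisation and data for every candidate

  # Assumption: use the GPU when available, otherwise try the CPU.
  device <- if (cuda_is_available()) "cuda" else "cpu"

  # Toy regression data: the target is the row sum of the inputs.
  x <- torch_randn(128, 5, device = device)
  y <- x$sum(dim = 2, keepdim = TRUE)

  model <- nn_linear(5, 1)
  model$to(device = device)

  optimizer <- optim_madgrad(model$parameters, lr = lr)

  for (i in seq_len(steps)) {
    optimizer$zero_grad()
    loss <- nnf_mse_loss(model(x), y)
    loss$backward()
    optimizer$step()
  }
  loss$item()
}

if (torch_is_installed()) {
  lrs <- c(1e-4, 1e-3, 1e-2, 1e-1)  # illustrative grid, not a recommendation
  losses <- sapply(lrs, final_loss)
  print(data.frame(lr = lrs, final_loss = losses))
  best_lr <- lrs[which.min(losses)]
}
```

The same loop can be extended to `momentum` and `weight_decay`, starting from the defaults given in the arguments above.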

An optimizer object implementing the `step` and `zero_grad` methods.
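A minimal usage sketch, assuming a toy regression problem: the model, the random data, the number of steps, and the CPU fallback are illustrative choices, not part of the documented interface. It shows the returned object being driven through `zero_grad` and `step` in an ordinary training loop.

```r
library(torch)

if (torch_is_installed()) {
  torch_manual_seed(1)

  # Assumption: use the GPU when available, otherwise try the CPU.
  device <- if (cuda_is_available()) "cuda" else "cpu"

  # Illustrative random data: 64 samples, 10 features, 1 target.
  x <- torch_randn(64, 10, device = device)
  y <- torch_randn(64, 1, device = device)

  model <- nn_linear(10, 1)
  model$to(device = device)

  # Arguments spelled out with their documented defaults.
  optimizer <- optim_madgrad(
    model$parameters,
    lr = 0.01,
    momentum = 0.9,
    weight_decay = 0,
    eps = 1e-6
  )

  for (i in seq_len(20)) {
    optimizer$zero_grad()              # clear gradients from the last step
    loss <- nnf_mse_loss(model(x), y)  # forward pass and loss
    loss$backward()                    # compute gradients
    optimizer$step()                   # apply the MADGRAD update
  }

  print(loss$item())
}
```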

