An updater with adaptive step sizes. Adagrad gives each weight its own effective learning rate, depending on how much that parameter has moved so far.
Following Senior et al. ("An empirical study of learning rates in deep neural networks for speech recognition"), the squared gradients are initialized at K instead of 0. By default, K == 0.1.
learning.rate
the learning rate (set to one in the original paper)
squared.grad
a matrix summing the squared gradients over all previous updates
delta
the delta matrix (see updater)
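
A minimal sketch of a single Adagrad step under the conventions above. The adagrad_step function and its arguments are illustrative only, not part of the package's API:

adagrad_step <- function(gradient, squared.grad = NULL, learning.rate = 1, K = 0.1) {
  # Initialize the squared-gradient accumulator at K rather than 0 (Senior et al.)
  if (is.null(squared.grad)) {
    squared.grad <- matrix(K, nrow = nrow(gradient), ncol = ncol(gradient))
  }
  # Sum the squared gradients over all previous updates
  squared.grad <- squared.grad + gradient^2
  # Per-parameter step: the global learning rate scaled down by the root of the accumulator
  delta <- -learning.rate * gradient / sqrt(squared.grad)
  list(delta = delta, squared.grad = squared.grad)
}

# Parameters that have accumulated larger squared gradients receive smaller steps:
g <- matrix(c(0.5, -0.2, 0.1, 0.3), nrow = 2)
step <- adagrad_step(g)
step$delta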