An updater with adaptive step sizes, like adagrad.
Adadelta modifies adagrad (see adagrad.updater) by decaying the
squared gradients and multiplying by an extra term to keep the units
of the update consistent; the resulting update rule is sketched after the
argument list below. Some evidence indicates that adadelta is more robust than adagrad.
See Zeiler (2012), "ADADELTA: An Adaptive Learning Rate Method", http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
rho
a decay rate (e.g. 0.95) that controls how long the updater "remembers" the squared magnitudes of previous gradients and updates. A larger rho (closer to 1) retains information from more steps in the past.
epsilon
a small constant (e.g. 1e-6) to prevent numerical instability when dividing by small numbers.
squared.grad
a matrix accumulating the squared gradients from all previous updates, decayed according to rho (an exponentially weighted moving average).
delta
the delta matrix (see updater).
squared.delta
a matrix accumulating the squared deltas from all previous updates, decayed according to rho.
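The following is a minimal sketch, in R, of one adadelta step as described above. The function name adadelta.step and the argument grad (the current gradient matrix) are hypothetical and only illustrate how rho, epsilon, squared.grad, and squared.delta interact; the package's actual implementation may differ.

adadelta.step <- function(grad, squared.grad, squared.delta,
                          rho = 0.95, epsilon = 1e-6) {
  # decay the running average of squared gradients
  squared.grad <- rho * squared.grad + (1 - rho) * grad^2
  # scale the gradient by the RMS of past deltas over the RMS of gradients,
  # which keeps the units of the update consistent
  delta <- -sqrt(squared.delta + epsilon) / sqrt(squared.grad + epsilon) * grad
  # decay the running average of squared deltas
  squared.delta <- rho * squared.delta + (1 - rho) * delta^2
  list(delta = delta, squared.grad = squared.grad, squared.delta = squared.delta)
}

# Hypothetical usage: start both accumulators at zero, then on each iteration
#   step <- adadelta.step(grad, squared.grad, squared.delta)
#   W <- W + step$delta
#   squared.grad <- step$squared.grad; squared.delta <- step$squared.delta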