An updater with adaptive step sizes, like adagrad.
Adadelta modifies adagrad (see adagrad.updater) by decaying the
squared gradients and multiplying by an extra term to keep the units
consistent. Some evidence indicates that adadelta is more robust than adagrad.
See Zeiler (2012), "ADADELTA: An Adaptive Learning Rate Method": http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
rho: a rate (e.g. 0.95) that controls how long the updater "remembers" the squared magnitude of previous updates. Larger rho (closer to 1) retains information from more steps in the past.
epsilon: a small constant (e.g. 1e-6) to prevent numerical instability when dividing by small numbers.
squared.grad: a matrix summing the squared gradients over all previous updates, decayed according to rho.
delta: the delta matrix (see updater).
squared.delta: a matrix summing the squared deltas over all previous updates, decayed according to rho.
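As an illustration, below is a minimal sketch of one adadelta step in R that combines the quantities described above. It assumes plain matrices for the gradient and the state terms; adadelta_step is a hypothetical name and may not match the package's actual adadelta.updater interface.

# One adadelta step (sketch), following Zeiler (2012).
adadelta_step <- function(grad, squared.grad, squared.delta,
                          rho = 0.95, epsilon = 1e-6) {
  # Decay the running average of squared gradients (adagrad accumulates
  # without decay; adadelta "forgets" old gradients at rate rho).
  squared.grad <- rho * squared.grad + (1 - rho) * grad^2

  # RMS of previous deltas over RMS of gradients: the extra term that
  # keeps the units of the update consistent with the parameters.
  delta <- -sqrt(squared.delta + epsilon) / sqrt(squared.grad + epsilon) * grad

  # Decay the running average of squared deltas for the next step.
  squared.delta <- rho * squared.delta + (1 - rho) * delta^2

  list(delta = delta, squared.grad = squared.grad, squared.delta = squared.delta)
}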