An updater with adaptive step sizes, like adagrad.
Adadelta modifies adagrad (see adagrad.updater) by decaying the
squared gradients and multiplying by an extra term to keep the units
of the update consistent; the resulting update rule is sketched after the
argument list below. Some evidence indicates that adadelta is more robust than adagrad.
See Zeiler (2012), "ADADELTA: An Adaptive Learning Rate Method", http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
rho
a decay rate (e.g. 0.95) that controls how long the updater "remembers" the squared magnitudes of previous gradients and updates. A larger rho (closer to 1) retains information from more steps in the past.
epsilon
a small constant (e.g. 1e-6) to prevent numerical instability when dividing by small numbers.
squared.grad
a matrix accumulating the squared gradients from all previous updates, decayed according to rho (an exponentially weighted moving average).
delta
the delta matrix (see updater).
squared.delta
a matrix accumulating the squared deltas from all previous updates, decayed according to rho.
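The following is a minimal sketch, in R, of one adadelta step as described above. The function name adadelta.step and the argument grad (the current gradient matrix) are hypothetical and only illustrate how rho, epsilon, squared.grad, and squared.delta interact; the package's actual implementation may differ.

adadelta.step <- function(grad, squared.grad, squared.delta,
                          rho = 0.95, epsilon = 1e-6) {
  # decay the running average of squared gradients
  squared.grad <- rho * squared.grad + (1 - rho) * grad^2
  # scale the gradient by the RMS of past deltas over the RMS of gradients,
  # which keeps the units of the update consistent
  delta <- -sqrt(squared.delta + epsilon) / sqrt(squared.grad + epsilon) * grad
  # decay the running average of squared deltas
  squared.delta <- rho * squared.delta + (1 - rho) * delta^2
  list(delta = delta, squared.grad = squared.grad, squared.delta = squared.delta)
}

# Hypothetical usage: start both accumulators at zero, then on each iteration
#   step <- adadelta.step(grad, squared.grad, squared.delta)
#   W <- W + step$delta
#   squared.grad <- step$squared.grad; squared.delta <- step$squared.delta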