Understanding AdaMW optimizer

Standard gradient descent updates the parameters ($\theta$) with a constant learning rate ($\alpha$) in order to minimise the loss $L$, stepping each parameter by the learning rate times the gradient of the loss w.r.t. that parameter:

$$\theta = \theta - \alpha\dfrac{\partial L}{\partial\theta}$$
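
To make the update rule concrete, here is a minimal sketch in Python/NumPy; the helper name `sgd_step` and the toy loss $L(\theta) = \theta^2$ are only illustrative, not part of any particular library.

```python
import numpy as np

def sgd_step(theta, grad, alpha):
    # Plain gradient descent: theta <- theta - alpha * dL/dtheta
    return theta - alpha * grad

# Toy example: minimise L(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
for _ in range(100):
    grad = 2 * theta
    theta = sgd_step(theta, grad, alpha=0.1)

print(theta)  # close to the minimum at 0
```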

However, how well this works depends heavily on the value of $\alpha$. A low value can lead to slow convergence, whereas a high value can overshoot and miss the minimum.

To avoid this, we have to keep track of the learning rate and change it, either on a schedule or as needed. This usually involves some trial and error.
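
As a rough illustration of what such a manual schedule can look like, here is a simple step-decay sketch; the function name `step_decay` and the decay hyperparameters are arbitrary choices for illustration, not the only way to schedule the learning rate.

```python
def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the base learning rate every `epochs_per_drop` epochs.
    return alpha0 * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch))  # 0.1, 0.05, 0.025, 0.0125
```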

To overcome this, the AdaM optimizer was introduced. AdaM adaptively scales the learning rate $\alpha$ for each parameter $\theta$, which enables faster convergence during training compared to plain gradient descent.
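
For reference, below is a minimal sketch of a single AdaM update step, assuming the usual default hyperparameters from the original paper (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the helper name `adam_step` is illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running estimates of the gradient mean (m) and uncentred variance (v).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size alpha / (sqrt(v_hat) + eps).
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise L(theta) = theta^2 again.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.05)

print(theta)  # close to the minimum at 0
```

The key point is the division by $\sqrt{\hat{v}}$: it gives each parameter its own effective step size, so a single global $\alpha$ no longer has to suit every parameter at once.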

Table of Contents