Understanding AdaMW optimizer

Standard gradient descent updates the parameters ($\theta$) with a constant learning rate ($\alpha$) in order to minimise the loss $L$, stepping each parameter by the learning rate times the gradient of the loss w.r.t. that parameter:

$$\theta = \theta - \alpha\dfrac{\partial L}{\partial\theta}$$
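
To make the update rule concrete, here is a minimal sketch in Python/NumPy; the helper name `sgd_step` and the toy loss $L(\theta) = \theta^2$ are only illustrative, not part of any particular library.

```python
import numpy as np

def sgd_step(theta, grad, alpha):
    # Plain gradient descent: theta <- theta - alpha * dL/dtheta
    return theta - alpha * grad

# Toy example: minimise L(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
for _ in range(100):
    grad = 2 * theta
    theta = sgd_step(theta, grad, alpha=0.1)

print(theta)  # close to the minimum at 0
```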

However, how well this works depends heavily on the value of $\alpha$. A low value can lead to slow convergence, whereas a high value can overshoot and miss the minimum.

To avoid this, we have to keep track of the learning rate and change it, either on a schedule or as needed. This usually involves some trial and error.
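
As a rough illustration of what such a manual schedule can look like, here is a simple step-decay sketch; the function name `step_decay` and the decay hyperparameters are arbitrary choices for illustration, not the only way to schedule the learning rate.

```python
def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the base learning rate every `epochs_per_drop` epochs.
    return alpha0 * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch))  # 0.1, 0.05, 0.025, 0.0125
```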

To overcome this, the AdaM optimizer was introduced. AdaM adaptively scales the learning rate $\alpha$ for each parameter $\theta$, which enables faster convergence during training compared to plain gradient descent.
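
For reference, below is a minimal sketch of a single AdaM update step, assuming the usual default hyperparameters from the original paper (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the helper name `adam_step` is illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running estimates of the gradient mean (m) and uncentred variance (v).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size alpha / (sqrt(v_hat) + eps).
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise L(theta) = theta^2 again.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.05)

print(theta)  # close to the minimum at 0
```

The key point is the division by $\sqrt{\hat{v}}$: it gives each parameter its own effective step size, so a single global $\alpha$ no longer has to suit every parameter at once.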

Table of Contents