Adam (Kingma & Ba, 2014)

TL;DR

Adam is stochastic gradient descent with adaptive per-parameter step sizes, computed from running averages of the gradient (first moment, like momentum) and the squared gradient (second moment, like RMSProp). It works well out of the box on a wide range of deep-learning tasks, which is why so many papers train with Adam (or its variant AdamW) and rarely revisit the choice.

The defining update, for a parameter \(\theta\) with stochastic gradient \(g_t\) at step \(t\):

\[\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2}, \\ \hat{m}_t &= m_t / (1 - \beta_1^{t}), \\ \hat{v}_t &= v_t / (1 - \beta_2^{t}), \\ \theta_t &= \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon). \end{aligned}\]

The defaults that the paper recommends — \(\alpha = 10^{-3}\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\) — work for most problems with no tuning.

Problem & Motivation

Plain SGD applies the same learning rate to every parameter:

\[\theta_{t} = \theta_{t-1} - \alpha\, g_t.\]

That’s a problem when different parameters have wildly different gradient scales (input embeddings vs output logits, dense vs sparse features). You either pick a learning rate that’s safe for the largest gradients (and crawl on the small ones) or one that’s aggressive on small gradients (and explode on the large ones). Real networks don’t have a single “right” learning rate — they have a different right rate per parameter.
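A minimal sketch of that trade-off, assuming a toy quadratic loss with two parameters whose curvatures differ by a factor of 10,000. The loss, the function name `sgd_run`, and the constants are illustrative choices, not from the paper.

```python
def sgd_run(lr, steps=100):
    """Plain SGD on f(a, b) = 0.5 * (100 * a**2 + 0.01 * b**2), starting at (1, 1)."""
    a, b = 1.0, 1.0
    for _ in range(steps):
        grad_a = 100.0 * a   # steep direction: large gradients
        grad_b = 0.01 * b    # shallow direction: tiny gradients
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# A rate that is safe for the steep parameter barely moves the shallow one.
a_safe, b_safe = sgd_run(lr=0.005)
# A rate large enough to move the shallow parameter makes the steep one diverge.
a_big, b_big = sgd_run(lr=1.0)

print(f"lr=0.005: a={a_safe:.3g}, b={b_safe:.3g}")
print(f"lr=1.0:   a={a_big:.3g}, b={b_big:.3g}")
```

Both parameters have their minimum at 0, so the printed values show directly how far each one still is from the optimum under each learning rate.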

The pre-Adam landscape:

SGD with momentum (Polyak, 1964; Nesterov, 1983): adds a velocity term to dampen oscillations along high-curvature directions. It reduces noise but doesn’t address per-parameter scaling.
AdaGrad (Duchi, Hazan & Singer, 2011): scales each step by the inverse square root of the sum of past squared gradients. This solves per-parameter scaling, but the denominator grows monotonically, so the effective learning rate eventually shrinks toward zero and progress stalls.
RMSProp (Hinton, 2012, unpublished but widely circulated): replaces AdaGrad’s growing sum with an exponential moving average. The effective learning rate stops shrinking and the method works well, but it has no momentum term.
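The difference between AdaGrad's growing sum and RMSProp's moving average can be seen with a constant gradient. The sketch below is illustrative only: the helper `effective_rates`, the base rate `lr=0.1`, and the decay `0.9` are assumptions chosen for the demonstration.

```python
import math

def effective_rates(steps, lr=0.1, decay=0.9, eps=1e-8):
    """Effective step size lr / sqrt(accumulator) under a constant gradient of 1."""
    g = 1.0
    adagrad_sum = 0.0   # AdaGrad: running sum of squared gradients
    rms_avg = 0.0       # RMSProp: exponential moving average of squared gradients
    for _ in range(steps):
        adagrad_sum += g * g
        rms_avg = decay * rms_avg + (1.0 - decay) * g * g
    return lr / (math.sqrt(adagrad_sum) + eps), lr / (math.sqrt(rms_avg) + eps)

for steps in (10, 1000, 100000):
    ada, rms = effective_rates(steps)
    print(f"{steps:>6} steps: AdaGrad {ada:.5f}, RMSProp {rms:.5f}")
```

AdaGrad's effective rate falls like one over the square root of the step count, while RMSProp's settles at the base rate once the moving average has warmed up.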

Adam’s pitch: combine RMSProp’s adaptive denominator with momentum-style first-moment smoothing, plus a careful bias correction that fixes a subtle initialization issue.

The Algorithm, Slowly

Initialize \(m_0 = 0\), \(v_0 = 0\), \(t = 0\). Then for each step:

Compute the stochastic gradient \(g_t = \nabla_{\theta} f_t(\theta_{t-1})\).
Update the first moment estimate:

\[m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t.\]

This is an exponential moving average of past gradients with decay \(\beta_1\). Setting \(\beta_1 = 0\) removes the momentum and leaves RMSProp with bias correction; with the default \(\beta_1 = 0.9\), the effective averaging window is about \(1/(1 - \beta_1) = 10\) gradients.

Update the second moment estimate:

\[v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2}.\]

(Element-wise square of the gradient.) With \(\beta_2 = 0.999\), \(v_t\) averages roughly the last \(\sim 1000\) squared gradients: a smooth estimate of each parameter's uncentered second moment \(\mathbb{E}[g^2]\) (not the variance, since the mean is not subtracted).

Bias-correct. Because \(m_0 = v_0 = 0\), the early estimates are biased toward zero. Multiply by \(1/(1 - \beta^{t})\) to undo that bias:

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}.\]

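For \(t = 1\) and \(\beta_1 = 0.9\), the correction multiplies by \(1/(1 - 0.9) = 10\), a huge factor. It fades quickly for the first moment: by \(t \approx 50\) the factor is within 1% of \(1\). The second moment, with \(\beta_2 = 0.999\), is slower: its factor is still about \(10.5\) at \(t = 100\) and only comes within 1% of \(1\) after roughly 4,600 steps, so the adjustment matters for the first few thousand steps.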

Update the parameter:

\[\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.\]

The step has magnitude \(\alpha\) times the signal-to-noise ratio of the gradient: large where the gradient is consistent (high \(|\hat{m}|\), low \(\hat{v}\)), small where it’s noisy. \(\epsilon\) is a tiny constant preventing division by zero.
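The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration on lists of floats, not a reference implementation: the function name `adam_step`, the toy quadratic, and the learning rate `0.01` used in the demo are assumptions made for the example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first moment (EMA of g)
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second moment (EMA of g^2)
        m_hat = m_i / (1.0 - beta1 ** t)               # bias-corrected first moment
        v_hat = v_i / (1.0 - beta2 ** t)               # bias-corrected second moment
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy usage: minimize f(x, y) = x^2 + 10 * y^2, whose gradient is (2x, 20y).
theta = [3.0, -2.0]
m, v = [0.0, 0.0], [0.0, 0.0]
for t in range(1, 5001):
    grad = [2.0 * theta[0], 20.0 * theta[1]]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
print(theta)
```

The moments `m` and `v` are carried between calls by the caller, which mirrors how optimizer state is usually kept per parameter.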

Why Each Piece Matters

Why a momentum-style first moment

Stochastic gradients are noisy. A running mean smooths out the noise and points consistently toward a useful descent direction.
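A small simulation makes the smoothing visible. It assumes a constant true gradient of 1 corrupted by unit-variance Gaussian noise; that noise model and the variable names are illustrative assumptions. For independent noise, an exponential moving average with decay \(\beta_1\) reduces the standard deviation by a factor of \(\sqrt{(1 - \beta_1)/(1 + \beta_1)}\), about 0.23 at \(\beta_1 = 0.9\).

```python
import random
import statistics

random.seed(0)
beta1 = 0.9
true_grad = 1.0

raw, smoothed = [], []
m = 0.0
for t in range(1, 2001):
    g = true_grad + random.gauss(0.0, 1.0)   # noisy gradient sample
    m = beta1 * m + (1.0 - beta1) * g        # exponential moving average
    raw.append(g)
    smoothed.append(m / (1.0 - beta1 ** t))  # bias-corrected, as in Adam

# Skip the first 100 steps so the average has warmed up.
print("raw std:     ", statistics.stdev(raw[100:]))
print("smoothed std:", statistics.stdev(smoothed[100:]))
print("smoothed mean:", statistics.mean(smoothed[100:]))
```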

Why RMSProp’s second-moment denominator

Different parameters have different gradient scales. Dividing by the per-parameter RMS gradient normalizes steps so that every parameter moves at roughly the same effective rate.
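As a rough check of that normalization, the sketch below feeds two constant gradient streams of very different size through the moment updates and compares the resulting update sizes. The helper `adam_update_size` and the example magnitudes are hypothetical, chosen only for illustration.

```python
import math

def adam_update_size(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update for one parameter after a non-empty sequence of gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

large = adam_update_size([100.0] * 50)   # parameter with large gradients
small = adam_update_size([0.01] * 50)    # parameter with tiny gradients
print("Adam updates:", large, small)
print("SGD updates: ", 1e-3 * 100.0, 1e-3 * 0.01)
```

The two Adam updates come out nearly identical even though the gradients differ by four orders of magnitude, whereas the plain SGD updates inherit that full spread.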

Why bias correction

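If you skip the bias correction, both moment estimates start out biased toward zero: \(m_1 = (1 - \beta_1)\, g_1\) and \(v_1 = (1 - \beta_2)\, g_1^{2}\). Because \(\beta_2\) is closer to 1 than \(\beta_1\), the denominator shrinks more than the numerator, and the first step comes out as \((1 - \beta_1)/\sqrt{1 - \beta_2} \approx 3.2\) times \(\alpha\) at the defaults, so the earliest updates are miscalibrated and too large. Bias correction makes the first step approximately \(\alpha \cdot g_1 / |g_1| = \alpha \operatorname{sign}(g_1)\): magnitude \(\alpha\), well calibrated from the start.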
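The arithmetic is easy to reproduce. The sketch below prints the correction factors \(1/(1 - \beta^{t})\) at a few step counts and then compares the very first update with and without correction. Note that \(m_1\) alone is ten times smaller than \(g_1\), but the uncorrected \(\sqrt{v_1}\) shrinks even more. The sample gradient `g1 = 0.5` is an arbitrary illustrative value and \(\epsilon\) is ignored.

```python
import math

beta1, beta2 = 0.9, 0.999

# Correction factors 1 / (1 - beta^t) at a few step counts.
for t in (1, 10, 100, 1000, 5000):
    c1 = 1.0 / (1.0 - beta1 ** t)
    c2 = 1.0 / (1.0 - beta2 ** t)
    print(f"t={t:>4}: first-moment factor {c1:8.3f}, second-moment factor {c2:8.3f}")

# First update for a single gradient g1, in units of alpha (epsilon ignored).
g1 = 0.5
m1 = (1.0 - beta1) * g1
v1 = (1.0 - beta2) * g1 ** 2
uncorrected = m1 / math.sqrt(v1)
corrected = (m1 / (1.0 - beta1)) / math.sqrt(v1 / (1.0 - beta2))
print(f"first step without correction: {uncorrected:.3f} * alpha")
print(f"first step with correction:    {corrected:.3f} * alpha")
```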

Why \(\sqrt{\hat{v}_t}\) rather than \(\hat{v}_t\)

The denominator \(\sqrt{\hat{v}_t}\) estimates \(\sqrt{\mathbb{E}[g^{2}]}\), the RMS gradient, so the step scales like \(\operatorname{mean}(g)/\sqrt{\mathbb{E}[g^{2}]}\), which behaves like a signal-to-noise ratio and is typically bounded by about 1 in magnitude. Without the square root, the ratio would scale like the inverse of the gradient magnitude, so steps would grow as gradients shrink, which is too aggressive.
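The scale argument can be checked directly. Suppose every gradient is multiplied by a constant \(c > 0\) (a hypothetical rescaling of the loss, introduced here only for illustration). The first moment is linear in the gradients and the second moment is linear in their squares, so

\[\hat{m}_t \to c\,\hat{m}_t, \qquad \hat{v}_t \to c^{2}\,\hat{v}_t, \qquad \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} \to \frac{c\,\hat{m}_t}{\sqrt{c^{2}\,\hat{v}_t}} = \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}, \qquad \frac{\hat{m}_t}{\hat{v}_t} \to \frac{1}{c} \cdot \frac{\hat{m}_t}{\hat{v}_t}.\]

Ignoring \(\epsilon\), the square-root form leaves the update unchanged, while the form without the square root would take steps 100 times larger if the gradients became 100 times smaller.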

Convergence Analysis

The paper proves an \(O(\sqrt{T})\) regret bound for online convex optimization, i.e. average regret of \(O(1/\sqrt{T})\), the same rate as standard online subgradient descent. The proof carries the caveats shared by most analyses of adaptive methods: convexity is assumed, gradients are assumed bounded, and the analysis uses the schedule \(\alpha_t = \alpha/\sqrt{t}\), which is not what is used in practice.
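For reference, the quantity bounded in that analysis is the regret against the best fixed parameter in hindsight. The symbols \(R(T)\) and \(\theta^{*}\) are introduced here and are not defined elsewhere in this summary:

\[R(T) = \sum_{t=1}^{T} \bigl[ f_t(\theta_t) - f_t(\theta^{*}) \bigr], \qquad \theta^{*} = \arg\min_{\theta} \sum_{t=1}^{T} f_t(\theta).\]

A bound of \(R(T) = O(\sqrt{T})\) gives \(R(T)/T = O(1/\sqrt{T})\), so the average regret vanishes as \(T \to \infty\).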

In 2018, Reddi, Kale & Kumar pointed out a flaw in the original convergence proof and constructed a simple convex counter-example on which Adam fails to converge. They proposed AMSGrad as a fix: keep the exponential average \(v_t\), but divide by the running maximum of \(v_t\) over all steps so far instead of its current value, so the denominator never decreases. AMSGrad comes with a corrected convergence guarantee and is a drop-in replacement.
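A minimal single-parameter sketch of the AMSGrad idea, written without bias correction as a simplifying assumption. The function name `amsgrad_step` and the gradient sequence are illustrative and not taken from either paper.

```python
import math

def amsgrad_step(theta, g, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update for a single scalar parameter (no bias correction)."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    v_max = max(v_max, v)                       # denominator can never shrink
    theta = theta - lr * m / (math.sqrt(v_max) + eps)
    return theta, m, v, v_max

# A rare large gradient followed by many small ones: v decays, v_max does not.
theta, m, v, v_max = 0.0, 0.0, 0.0, 0.0
for g in [10.0] + [0.1] * 2000:
    theta, m, v, v_max = amsgrad_step(theta, g, m, v, v_max)
print(f"v = {v:.4f}, v_max = {v_max:.4f}")
```

In plain Adam the denominator would follow `v` back down after the large gradient is forgotten; keeping `v_max` is what prevents the effective learning rate from increasing again.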

In practice, vanilla Adam works fine on most deep-learning tasks and the AMSGrad fix is rarely necessary. The variant most widely used today is AdamW (decoupled weight decay), which addresses how regularization interacts with the adaptive step rather than the convergence proof.

Empirical Results From the Paper

The 2014 paper compares Adam to SGD with momentum, AdaGrad, RMSProp, and others on:

MNIST logistic regression and multilayer perceptron — Adam matches the best baseline, with much less hyperparameter tuning.
CIFAR-10 convnet — Adam achieves competitive test error.
IMDB bag-of-words classifier — Adam trains noticeably faster than SGD.

The results aren’t earth-shattering on any single benchmark; the contribution is consistent good performance everywhere, with default hyperparameters that generalize across tasks.

Practical Choices

The defaults work for most things, but a few situations need tweaks:

Situation	What to change
Training transformers	Use Adam with warmup: linearly increase \(\alpha\) from 0 over roughly the first 1% of steps, then decay. This is close to universal in BERT/GPT-style training.
Want decoupled weight decay	Use AdamW (Loshchilov & Hutter, 2017): apply weight decay directly to \(\theta\) rather than through the gradient. This tends to generalize better than L2 regularization folded into Adam.
Vision (ResNet, etc.)	SGD with momentum still tends to give slightly better generalization than Adam. The community split is roughly: NLP/transformers → Adam(W); CNN classification → SGD+momentum.
Sparse gradients	Adam copes with sparsity naturally because \(\hat{v}\) is tracked per parameter. For very large embedding tables, SparseAdam (as in PyTorch) is a lazy variant that only updates the moments for entries that appear in the gradient.
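The warmup row can be sketched as a small schedule function. The table only says "then decay", so the linear decay to zero used here is an assumption, as are the function name `warmup_then_linear_decay` and its parameters.

```python
def warmup_then_linear_decay(step, total_steps, peak_lr=1e-3, warmup_frac=0.01):
    """Learning rate at a 0-based step: linear warmup, then linear decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up; step 0 uses peak_lr / warmup_steps rather than exactly 0,
        # so that the first update is not wasted.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

total = 10_000
for s in (0, 50, 99, 100, 5_000, 9_999):
    print(s, warmup_then_linear_decay(s, total))
```

The returned value would be passed as \(\alpha\) to the optimizer at each step; cosine or inverse-square-root decay are common alternatives to the linear tail.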

Common pitfall: Adam’s default \(\alpha = 10^{-3}\) is not universal. It is often too large for big models and for fine-tuning pretrained networks, where \(10^{-4}\) or even \(10^{-5}\) is common, so the learning rate is still the first hyperparameter worth tuning.

Modern Context

AdamW (Loshchilov & Hutter, 2017): decouple weight decay from gradient. Default for transformer training today.
LAMB / LARS: layer-wise adaptive variants for very large batch training.
Lion (Chen et al., 2023): “evolved sign momentum.” Simpler than Adam (no second moment), often comparable performance, half the optimizer state.
Sophia (Liu et al., 2023): second-order method using a Hessian diagonal estimate, reported by its authors to be faster than Adam on language-model pretraining.
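To make the "decoupled" in AdamW concrete, here is a hedged scalar sketch that contrasts it with L2 regularization folded into Adam's gradient. The function names are hypothetical, and scaling the decay term by the learning rate is one common convention assumed here.

```python
import math

def adam_l2_step(theta, g, m, v, t, wd, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    g = g + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    step = (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return theta - lr * step, m, v

def adamw_step(theta, g, m, v, t, wd, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW-style update: decay acts on theta directly, outside the adaptive ratio."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    step = (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return theta - lr * step - lr * wd * theta, m, v

# With a zero loss gradient, only weight decay acts on theta.
l2_theta = adam_l2_step(1.0, 0.0, 0.0, 0.0, 1, wd=0.01)[0]
w_theta = adamw_step(1.0, 0.0, 0.0, 0.0, 1, wd=0.01)[0]
print("Adam + L2:", l2_theta)   # decay is rescaled by the adaptive ratio
print("AdamW:    ", w_theta)    # decay shrinks theta by lr * wd * theta
```

In the L2 version the decay passes through the normalization, so its strength no longer depends on `wd` in the intended way; in the decoupled version the shrinkage is proportional to `wd` regardless of gradient statistics.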

Adam is past its first decade and still the default optimizer most papers reach for. That’s a remarkable run for any algorithm in deep learning.

Reading the Paper

The paper is dense but short. The sections worth careful attention:

§2: the algorithm itself — derive equations (1)–(4) on paper; understand each term.
§3: convergence analysis — useful for the regret-bound argument and the assumptions; skip the proof on first read.
§4: empirical results — the most relevant plot is Figure 2 (MNIST MLP convergence speed).
§6.1–§6.2: discussion of bias correction and the relationship to AdaGrad / RMSProp.

For the modern follow-ups, read AdamW (Loshchilov & Hutter, 2017) and the AMSGrad note (Reddi et al., 2018).

Takeaways

Adam = momentum + RMSProp + bias correction. Per-parameter adaptive step sizes come from running estimates of the gradient's first and second moments.
The defaults work. \(\alpha = 10^{-3}\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\) is the recipe for thousands of papers.
AdamW for transformer training; SGD+momentum still wins slightly on CNN classification.
The original convergence proof had a flaw; AMSGrad is the variant with a corrected guarantee, though it is rarely needed in practice.