TL;DR

Replace the smooth saturating activation \(\sigma(z) = 1/(1 + e^{-z})\) (or \(\tanh\)) with the brutally simple

\[\operatorname{ReLU}(z) = \max(0, z),\]

and deep networks become substantially easier to train: faster convergence, no vanishing gradient through active units, and a built-in bias toward sparsity. This 2010 paper by Nair and Hinton introduced rectified linear units as a replacement for binary stochastic hidden units in restricted Boltzmann machines and showed that they learn better features. Within a few years ReLU became the default non-linearity of the deep-learning era.


Problem & Motivation

Pre-2010 deep networks were almost universally trained with logistic or tanh activations. Both saturate: outside a narrow band around the origin, their derivatives go to zero. That kills the gradient signal, and a stack of such layers makes the vanishing-gradient problem so severe that anything past 3–4 layers was widely believed to be untrainable. Pre-training tricks (RBMs, denoising autoencoders) were invented largely to compensate.

Two more issues with sigmoid/tanh:

  • Centering. Sigmoid outputs lie in \((0, 1)\) and are never centered at zero, which biases gradient updates and slows training (tanh is zero-centered and avoids this particular issue).
  • Computation. Each forward pass through \(N\) sigmoid neurons does \(N\) calls to exp — non-trivial on the GPUs of the era.

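Nair and Hinton observed that an infinite set of “binary stochastic units” (the Bernoulli units commonly used in RBMs) that share the same weights but have progressively shifted biases behaves, in total, like a single continuous-valued unit. Its expected output is approximately the softplus \(\log(1 + e^{z})\), which is in turn well approximated by \(\max(0, z)\); the paper uses a noisy version of this rectifier during RBM training. The paper’s pitch is empirical: this unit is not just a convenient approximation to binary stochastic units, it also yields more accurate models.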
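The approximation behind this observation can be checked numerically. The sketch below assumes the construction described in the paper: many copies of one sigmoid unit with tied weights and biases offset by -0.5, -1.5, -2.5, and so on. The function names and the choice of 200 copies are illustrative, not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def stacked_binary_units(z: float, copies: int = 200) -> float:
    """Expected total activity of `copies` tied sigmoid units whose
    biases are shifted by -0.5, -1.5, -2.5, ... (an assumed finite
    truncation of the infinite set)."""
    return sum(sigmoid(z - i + 0.5) for i in range(1, copies + 1))


def softplus(z: float) -> float:
    """log(1 + e^z), computed in a numerically stable way."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def relu(z: float) -> float:
    return max(0.0, z)


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(
        f"z={z:+.1f}  stacked={stacked_binary_units(z):.3f}  "
        f"softplus={softplus(z):.3f}  relu={relu(z):.3f}"
    )
```

The stacked sum and the softplus should agree closely everywhere, while the rectifier differs from both mainly near \(z = 0\).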


The Definition

For a pre-activation \(z = \mathbf{w}^{\top}\mathbf{x} + b\), the rectified linear unit outputs

\[y = \max(0, z), \qquad \frac{\mathrm{d}y}{\mathrm{d}z} = \begin{cases} 1, & z > 0, \\ 0, & z < 0. \end{cases}\]

Two regimes:

  • Active (\(z > 0\)): the unit is a plain identity. Gradient flows through unchanged. No saturation.
  • Inactive (\(z < 0\)): the unit is exactly zero. No gradient, no contribution.

The point at \(z = 0\) is non-differentiable, but in practice the subgradient is taken as \(0\) (or \(1\), or anything in between — it doesn’t matter for stochastic optimization).
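As a minimal sketch of the definition, the forward value and the derivative can be written in a few lines of plain Python. The helper names and the `at_zero` parameter are illustrative choices; the default of \(0\) at the kink is one common convention, not the only one.

```python
from typing import Sequence


def relu(z: float) -> float:
    """Rectified linear unit: identity for z > 0, zero otherwise."""
    return z if z > 0.0 else 0.0


def relu_grad(z: float, at_zero: float = 0.0) -> float:
    """Derivative of ReLU. At z == 0 any value in [0, 1] is a valid
    subgradient; `at_zero` picks one (0 here by assumption)."""
    if z > 0.0:
        return 1.0
    if z < 0.0:
        return 0.0
    return at_zero


def unit_output(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Output of a single rectified linear unit, y = max(0, w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return relu(z)


# One active and one inactive example for the same unit.
print(unit_output([1.0, -2.0], [3.0, 1.0], 0.5))   # pre-activation is positive
print(unit_output([1.0, -2.0], [1.0, 3.0], 0.5))   # pre-activation is negative
```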


Why It Helps Training

1. Non-vanishing gradient in the active half

For a sigmoid, the maximum derivative is \(0.25\) at \(z = 0\), so the activation derivatives along a path through \(L\) sigmoid layers contribute a factor of at most \(0.25^{L}\) to the backpropagated gradient (the weight matrices contribute their own factors on top). For ReLU, the derivative on the active half is exactly \(1\), so a path through \(L\) active ReLUs adds no attenuation from the non-linearity at all. This single property is most of why deep networks become trainable.
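The arithmetic can be made concrete with a short sketch that multiplies the activation derivatives along one path. It deliberately ignores the weights, and the depth of 10 and the pre-activation values are arbitrary illustrative choices.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    return 1.0 if z > 0.0 else 0.0


def path_factor(pre_activations: Sequence[float],
                grad_fn: Callable[[float], float]) -> float:
    """Product of activation derivatives along one path (weights ignored)."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


DEPTH = 10
# Best case for the sigmoid: every unit sits exactly at z = 0.
best_case_sigmoid = path_factor([0.0] * DEPTH, sigmoid_grad)
# Any path on which every ReLU is active.
active_relu = path_factor([1.5] * DEPTH, relu_grad)

print(f"sigmoid, best case over {DEPTH} layers: {best_case_sigmoid:.3e}")
print(f"ReLU, all units active over {DEPTH} layers: {active_relu:.1f}")
```

A single inactive unit on the path sets the ReLU factor to zero, which is the other side of the same property.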

2. Sparse activations

When pre-activations are roughly symmetric around zero (for example, just after a zero-mean random initialization), about half the units are inactive for any given input. Sparsity has been argued to be a useful inductive bias for representations: independent features fire selectively, rather than every unit responding (mildly) to every input. Sigmoid networks are dense by construction (every unit's output is always non-zero); ReLU networks are naturally sparse.
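A quick simulation illustrates the claim, assuming pre-activations drawn from a standard normal distribution (an assumption made for this sketch; real networks drift away from it during training).

```python
import math
import random
from typing import List


def sample_pre_activations(num_units: int = 10_000, seed: int = 0) -> List[float]:
    """Draw pre-activations from N(0, 1) (an assumed, idealized distribution)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_units)]


def fraction_zero(outputs: List[float]) -> float:
    """Share of units whose output is exactly zero."""
    return sum(1 for y in outputs if y == 0.0) / len(outputs)


zs = sample_pre_activations()
relu_out = [max(0.0, z) for z in zs]
sigmoid_out = [1.0 / (1.0 + math.exp(-z)) for z in zs]

print(f"ReLU units at exactly zero:    {fraction_zero(relu_out):.3f}")
print(f"sigmoid units at exactly zero: {fraction_zero(sigmoid_out):.3f}")
```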

3. Cheaper compute

max(0, z) is one comparison and one mask. No exponentials. On modern hardware this isn’t huge, but in 2010 it shaved real wall-clock time off training.

4. Linear in the active region

Every active unit is the identity and every inactive unit outputs zero, so for a fixed activation pattern the whole network is an affine function of its input. The network is therefore piecewise linear, with at most as many pieces as there are realizable activation patterns. This is often argued to make optimization better behaved than with the curved response of a sigmoid network.
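The piecewise-linear structure is easy to see on a toy network. The weights below are hypothetical values chosen for illustration: one scalar input, three hidden ReLUs, one linear output.

```python
from typing import List, Tuple

# Hypothetical weights for a 1-input, 3-hidden-unit, 1-output network.
HIDDEN_W = [1.0, -1.0, 0.5]
HIDDEN_B = [0.0, 1.0, -1.0]
OUT_W = [1.0, 2.0, -3.0]


def hidden_pre(x: float) -> List[float]:
    return [w * x + b for w, b in zip(HIDDEN_W, HIDDEN_B)]


def activation_pattern(x: float) -> Tuple[bool, ...]:
    """Which hidden units are active for this input."""
    return tuple(z > 0.0 for z in hidden_pre(x))


def network(x: float) -> float:
    return sum(v * max(0.0, z) for v, z in zip(OUT_W, hidden_pre(x)))


def is_affine_between(a: float, b: float) -> bool:
    """Midpoint test: an affine function satisfies f((a+b)/2) == (f(a)+f(b))/2."""
    mid = network((a + b) / 2.0)
    return abs(mid - (network(a) + network(b)) / 2.0) < 1e-9


# Same activation pattern on both ends: the network is affine in between.
print(activation_pattern(0.2) == activation_pattern(0.8), is_affine_between(0.2, 0.8))
# Different patterns: a kink lies in between and the midpoint test fails.
print(activation_pattern(0.5) == activation_pattern(1.5), is_affine_between(0.5, 1.5))

# Patterns actually realized on a grid, versus the 2**3 conceivable ones.
patterns = {activation_pattern(k / 10.0) for k in range(-50, 51)}
print(len(patterns), "realized patterns out of", 2 ** len(HIDDEN_W))
```

Only some of the conceivable patterns are realized here, which is why the number of linear pieces is an upper bound rather than an exact count.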


The “Dying ReLU” Problem

The flat zero on the left is not free. If a unit’s weights and bias are pushed to a point where its pre-activation is negative for every training input (e.g. by a large gradient step early in training, or by an unfortunate weight initialization), it stays there: no input produces a positive activation, so no gradient ever flows back to update its weights. The unit is dead.
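A minimal sketch of the mechanism, using a single unit and a tiny made-up dataset. The weights, biases, and inputs are hypothetical, chosen only so that one configuration is alive and the other is dead.

```python
from typing import List, Sequence, Tuple


def relu_unit_grads(w: Sequence[float], b: float, x: Sequence[float],
                    upstream: float = 1.0) -> Tuple[List[float], float]:
    """Gradients of y = max(0, w.x + b) with respect to w and b,
    scaled by the gradient arriving from the layer above."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    gate = 1.0 if z > 0.0 else 0.0  # derivative of ReLU at z
    return [upstream * gate * xi for xi in x], upstream * gate


def total_grad_magnitude(w: Sequence[float], b: float,
                         inputs: Sequence[Sequence[float]]) -> float:
    """Sum of absolute weight and bias gradients over a whole dataset."""
    total = 0.0
    for x in inputs:
        grad_w, grad_b = relu_unit_grads(w, b, x)
        total += sum(abs(g) for g in grad_w) + abs(grad_b)
    return total


INPUTS = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2], [1.0, 1.0]]
WEIGHTS = [0.5, -0.3]

healthy = total_grad_magnitude(WEIGHTS, 0.1, INPUTS)   # active on some inputs
dead = total_grad_magnitude(WEIGHTS, -5.0, INPUTS)     # negative on every input

print(f"healthy unit, total gradient magnitude: {healthy:.2f}")
print(f"dead unit, total gradient magnitude:    {dead:.2f}")
```

With the strongly negative bias, every per-example gradient is zero, so no gradient-based update can move the unit back into the active regime on this data.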

In practice this can affect a large share of units in poorly tuned networks (informal estimates run to roughly 10–40 %). Mitigations:

  • Better initialization (He et al., 2015, also known as Kaiming initialization): weight variance is scaled to keep ReLU activations from collapsing.
  • Lower learning rates early in training.
  • Leaky ReLU: \(\max(\alpha z, z)\) with a small slope such as \(\alpha = 0.01\), so the negative half has nonzero gradient and units cannot die.
  • Parametric ReLU (PReLU): make the negative-side slope a learnable parameter per channel.
  • ELU / GELU / Swish: smooth non-linearities that handle the negative half differently. GELU in particular is a common default in transformer architectures.
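For reference, the variants listed above can be written as scalar functions. This is a plain-Python sketch using the standard definitions; the default parameter values (\(\alpha = 0.01\) for Leaky ReLU, \(\alpha = 1\) for ELU, \(\beta = 1\) for Swish) are conventional choices assumed here, and GELU is given in its exact erf form rather than the tanh approximation.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """max(alpha * z, z) for 0 < alpha < 1: small fixed slope on the negative side."""
    return z if z > 0.0 else alpha * z


def prelu(z: float, alpha: float) -> float:
    """Same shape as Leaky ReLU, but alpha is meant to be learned."""
    return z if z > 0.0 else alpha * z


def elu(z: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: saturates smoothly to -alpha for very negative z."""
    return z if z > 0.0 else alpha * math.expm1(z)


def gelu(z: float) -> float:
    """z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z: float, beta: float = 1.0) -> float:
    """z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(
        f"z={z:+.1f}  leaky={leaky_relu(z):+.4f}  elu={elu(z):+.4f}  "
        f"gelu={gelu(z):+.4f}  swish={swish(z):+.4f}"
    )
```

All of these return a non-zero value, and therefore a non-zero gradient signal, for moderately negative inputs, which is what distinguishes them from plain ReLU.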


Empirical Results in the Paper

Headline numbers from the 2010 paper:

  • NORB / Labeled Faces in the Wild. Rectified linear units in an RBM-pretrained model match or beat the same architecture with binary stochastic units, on object recognition (NORB) and face verification (LFW). The gap is several percentage points on the harder task.
  • MNIST. Follow-up work by Glorot et al. (2011), rather than the 2010 paper itself, reported that rectifier networks reach competitive error rates (~1.4 % on permutation-invariant MNIST) without the elaborate pre-training pipelines that sigmoid networks needed.

The numbers are unremarkable today, but the pattern — “deeper is fine, just use ReLU” — is the seed of everything that followed. AlexNet (2012), VGG, ResNet, transformers all use ReLU or one of its descendants.


Modern Context

  • GELU / Swish dominate transformer-era models (BERT, GPT, ViT). Smoother than ReLU, slightly better empirics, more compute.
  • Batch normalization plus ReLU is the canonical CNN combination (layer normalization plays the same role in transformers). Normalization tames the variance of pre-activations, making dying ReLU much rarer.
  • Initialization theory. He initialization (also called Kaiming initialization; the two names refer to the same scheme) was derived specifically for ReLU and keeps each layer’s activation variance approximately preserved.
  • Activation function search. Ramachandran et al. (2017) ran an automated search over activation functions and found that Swish (\(z \cdot \sigma(\beta z)\)) outperforms ReLU on ImageNet by a small margin.
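The variance-preservation point about He initialization can be illustrated with a small simulation. The sketch assumes fully connected layers with zero biases and Gaussian weights of standard deviation \(\sqrt{2 / n_{\text{in}}}\), and compares them with a \(\sqrt{1 / n_{\text{in}}}\) baseline; the width, depth, seed, and function names are arbitrary illustrative choices.

```python
import math
import random
from typing import Callable, List


def relu(z: float) -> float:
    return z if z > 0.0 else 0.0


def mean_square(values: List[float]) -> float:
    return sum(v * v for v in values) / len(values)


def he_std(fan_in: int) -> float:
    """He (Kaiming) scaling: weight variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def naive_std(fan_in: int) -> float:
    """Baseline scaling with variance 1 / fan_in, ignoring the ReLU."""
    return math.sqrt(1.0 / fan_in)


def signal_ratio(weight_std: Callable[[int], float], width: int = 128,
                 depth: int = 6, seed: int = 0) -> float:
    """Mean-square activation after `depth` ReLU layers divided by the
    mean square of the random input (biases are assumed to be zero)."""
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = mean_square(activations)
    for _ in range(depth):
        std = weight_std(width)
        activations = [
            relu(sum(rng.gauss(0.0, std) * a for a in activations))
            for _ in range(width)
        ]
    return mean_square(activations) / start


he_ratio = signal_ratio(he_std)
naive_ratio = signal_ratio(naive_std)
print(f"He scaling:    output/input mean square = {he_ratio:.3f}")
print(f"naive scaling: output/input mean square = {naive_ratio:.5f}")
```

Under the naive scaling each ReLU layer roughly halves the signal, so the ratio shrinks geometrically with depth, while the factor of 2 in the He variance compensates for the half of the units that are zeroed out.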

For a fresh project today: use GELU in a transformer, ReLU or Leaky ReLU in a CNN. Either choice is defensible.


Reading the Paper

The 2010 paper is short and conversational. The key sections:

  • §3: the derivation showing rectified linear units approximate the sum of an infinite series of binary units. Worth reading once for the connection to RBMs.
  • §4: empirical results on NORB and Labeled Faces in the Wild.
  • §5: discussion of sparsity and the comparison to noisy rectified linear units.

For modern context, follow up with He et al., 2015 (“Delving Deep into Rectifiers”) for the initialization analysis, and Hendrycks & Gimpel, 2016 (“Gaussian Error Linear Units (GELUs)”).


Takeaways

  • \(\max(0, z)\) is the simplest non-trivial non-linearity; identity in the active half makes gradients flow without vanishing.
  • ReLU enables deep networks without pre-training — it is the unsung hero of the post-2012 deep-learning revolution.
  • Dead units are the price; mitigated by careful initialization, normalization, or leaky variants.
  • Modern descendants (GELU, Swish, Mish) are slight refinements; plain ReLU remains a solid default for most new networks.