Dropout (Srivastava et al., 2014)

TL;DR

A big neural network has so many parameters that it can memorize the training set even when there is no real signal. Dropout counters this: on every forward pass during training, it randomly turns off a fraction of the hidden units, so that they output zero and no gradient flows through them. The network is thereby forced to learn features that remain useful even when neighbouring units go missing, and at test time we recover the full network at almost no extra cost.

That paragraph is what every student should walk away with. The rest of this note unpacks the math and walks through two small interactive demos so you can build intuition before reading the paper.

Problem & Motivation

A neural network with many parameters is a powerful function approximator. The flip side is that its capacity often exceeds what the training data can support. The result is overfitting: low training error, high test error.

Classical defences:

Early stopping — stop training when validation error rises.
L\(_2\) weight decay — add \(\frac{\lambda}{2}\|\mathbf{w}\|^2\) to the loss.
Data augmentation — generate more training points by transforming what you already have.
Model averaging — train several models, average their predictions. Reduces variance, but training and storing \(K\) networks is \(K\) times the cost.

Srivastava and co-authors point out a deeper issue particular to deep nets: co-adaptation. Neurons learn to rely on the exact presence of certain other neurons. A unit might encode “feature \(A\) and feature \(B\) are both present” only because some downstream unit needs the conjunction. If \(A\) or \(B\) misfires, the whole circuit collapses. Co-adapted detectors are great on training data and brittle outside of it.

Dropout’s premise is that forcing every unit to function in the absence of any other unit breaks this fragile co-adaptation. At the same time, training one network with dropout and using it whole at test time approximates averaging an exponentially large ensemble of “thinned” networks, without the expense of ever instantiating them.

Setting & Notation

A standard feed-forward layer computes

\[\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\, \mathbf{y}^{(\ell)} + \mathbf{b}^{(\ell+1)}, \qquad \mathbf{y}^{(\ell+1)} = f(\mathbf{z}^{(\ell+1)}),\]

with \(\mathbf{y}^{(\ell)}\) the activations of layer \(\ell\) and \(f\) a non-linearity (sigmoid, ReLU, …).

Dropout introduces a binary mask \(\mathbf{r}^{(\ell)} \in \{0,1\}^{n_\ell}\), one entry per unit, drawn independently each forward pass:

\[r_j^{(\ell)} \sim \operatorname{Bernoulli}(p_\ell), \qquad \widetilde{\mathbf{y}}^{(\ell)} = \mathbf{r}^{(\ell)} \odot \mathbf{y}^{(\ell)}. \tag{1}\]

Here \(\odot\) is element-wise multiplication, and \(p_\ell\) is the keep probability for layer \(\ell\) (often called the retention rate; the drop rate is \(1 - p_\ell\)). The thinned activations \(\widetilde{\mathbf{y}}^{(\ell)}\) are what the next layer’s weighted sum sees. Backpropagation runs exactly as before, except gradient updates only flow through units that survived the mask.

Convention: \(p_\ell = 0.5\) is typical for hidden layers; \(p_\ell\) closer to \(1\) (often \(0.8\)) is used for input layers because raw inputs are precious.
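As a rough illustration of equation (1), the mask can be sampled and applied in a few lines of plain Python. The function names, the example activations, and the fixed seed are assumptions made for this sketch; they do not come from the paper.

```python
import random

def sample_mask(n_units, keep_prob, rng):
    """Draw one binary mask r with independent entries r_j ~ Bernoulli(keep_prob)."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]

def apply_mask(activations, mask):
    """Element-wise product of mask and activations, as in equation (1)."""
    return [r * y for r, y in zip(mask, activations)]

rng = random.Random(0)                  # fixed seed so the run is repeatable
y = [0.3, 1.2, 0.0, 2.5, 0.7, 1.1]      # hypothetical layer activations
r = sample_mask(len(y), keep_prob=0.5, rng=rng)
y_thinned = apply_mask(y, r)

print("mask     :", r)
print("thinned y:", y_thinned)
```

Calling sample_mask again gives a fresh mask, which is what happens on every forward pass during training.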

Interactive Demo 1 — Sampling a Thinned Network

A small fully-connected network: 4 input units → 6 hidden units → 3 output units. Move the drop rate slider, then press Sample mask (or Auto-play) and watch the network re-thin itself. Each sample is a different sub-network that the optimizer has to make work.

[Interactive demo: a drop-rate slider (default 0.50) with live readouts for hidden units kept (out of 6), number of samples drawn, and the last random seed.]

A few things to notice as you play:

Each sample is not the same network. The optimizer is being asked to make the expected loss small over a distribution of subnetworks.
Cranking the drop rate up to 0.9 makes the surviving network so tiny it can barely fit the data — too much regularization. Crank it down to 0.0 and you recover the original network with no regularization.
The exponentially many subnetworks share weights — that’s why the cost stays at a single network’s worth of memory.
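For readers without access to the interactive version, the observations above can be reproduced offline with a short simulation. This is only a sketch of what the demo does to its 6 hidden units, not the demo's actual code; the helper names, sample count, and seed are assumptions.

```python
import random

N_HIDDEN = 6  # hidden units in the 4 -> 6 -> 3 demo network

def sample_hidden_mask(drop_rate, rng):
    """Sample one mask over the hidden layer; 1 means the unit is kept."""
    keep_prob = 1.0 - drop_rate
    return tuple(1 if rng.random() < keep_prob else 0 for _ in range(N_HIDDEN))

def summarize(drop_rate, n_samples, seed=0):
    """Return (mean number of hidden units kept, number of distinct masks seen)."""
    rng = random.Random(seed)
    masks = [sample_hidden_mask(drop_rate, rng) for _ in range(n_samples)]
    mean_kept = sum(sum(m) for m in masks) / n_samples
    return mean_kept, len(set(masks))

for drop_rate in (0.0, 0.5, 0.9):
    mean_kept, distinct = summarize(drop_rate, n_samples=2000)
    print(f"drop rate {drop_rate:.1f}: mean hidden kept {mean_kept:.2f} / {N_HIDDEN}, "
          f"{distinct} distinct subnetworks seen (at most {2 ** N_HIDDEN})")
```

At drop rate 0.0 every sample is the same full network; at intermediate rates the samples spread over many of the 64 possible hidden-layer masks, all of which share the same weights.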

What Dropout Computes — Math

Forward pass with dropout

For one mask realization \(\mathbf{r}\), the layer-\(\ell\) computation becomes

\[\widetilde{\mathbf{y}}^{(\ell)} = \mathbf{r}^{(\ell)} \odot \mathbf{y}^{(\ell)}, \qquad \mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\, \widetilde{\mathbf{y}}^{(\ell)} + \mathbf{b}^{(\ell+1)}. \tag{2}\]

If you back-propagate through (2), the gradient with respect to \(W_{ij}^{(\ell+1)}\) picks up a factor \(r_j^{(\ell)}\). Concretely, units that were dropped contribute zero gradient for that step. They are momentarily invisible.
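The factor \(r_j^{(\ell)}\) in the gradient can be checked on a tiny layer. The sketch below assumes a 3-unit source layer, a 2-unit target layer, and a made-up upstream gradient delta (standing in for the loss gradient with respect to \(\mathbf{z}^{(\ell+1)}\)); all values are illustrative.

```python
def forward(W, b, y, r):
    """Equation (2): thin the activations, then apply the affine map."""
    y_thin = [rj * yj for rj, yj in zip(r, y)]
    z = [sum(w_ij * yt for w_ij, yt in zip(row, y_thin)) + b_i
         for row, b_i in zip(W, b)]
    return y_thin, z

def weight_gradient(delta, y_thin):
    """dLoss/dW_ij = delta_i * r_j * y_j, where delta_i = dLoss/dz_i."""
    return [[d_i * yt for yt in y_thin] for d_i in delta]

W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.5]]
b = [0.1, -0.2]
y = [1.0, 2.0, 3.0]
r = [1, 0, 1]            # unit 1 (the middle one) is dropped on this pass
delta = [1.0, -1.0]      # assumed upstream gradient, for illustration only

y_thin, z = forward(W, b, y, r)
grad_W = weight_gradient(delta, y_thin)

print("z      :", z)
print("grad_W :", grad_W)  # the column of the dropped unit is all zeros
```

Every weight leaving the dropped unit receives a zero gradient for this step, while the other columns are updated as usual.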

Test time — the weight-scaling rule

At test time we want a single deterministic prediction. There are two options, which agree in expectation for a linear layer and approximately otherwise:

Monte Carlo estimate. Average the network’s output over many random masks: \(\widehat{y} = \tfrac{1}{T}\sum_{t=1}^{T} y(\mathbf{x};\, \mathbf{r}^{(t)})\). Accurate as \(T \to \infty\) but expensive.
Weight scaling. Use the full network with no dropout, but multiply each weight by the keep probability \(p_\ell\) of its source layer — equivalently, multiply each layer’s activations by \(p_\ell\) at test time.

Why are they equivalent (in expectation, for a linear layer)? Because

\[\mathbb{E}_{\mathbf{r}}\!\left[\mathbf{W}^{(\ell+1)}\, (\mathbf{r}^{(\ell)} \odot \mathbf{y}^{(\ell)})\right] = \mathbf{W}^{(\ell+1)} \left(\mathbb{E}_{\mathbf{r}}[\mathbf{r}^{(\ell)}] \odot \mathbf{y}^{(\ell)}\right) = p_\ell\, \mathbf{W}^{(\ell+1)}\, \mathbf{y}^{(\ell)}. \tag{3}\]

For non-linear activations, this is no longer exact — but Srivastava et al. show empirically that the weight-scaling rule is an excellent approximation across many architectures and datasets, with the bonus that test-time inference becomes a single forward pass through the original network.
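For a layer small enough to enumerate every mask, both claims can be checked exactly rather than by sampling. The following sketch assumes a single output unit with 4 inputs and made-up weights; it compares the exact expectation over all \(2^4\) masks with the weight-scaled output, first for an identity activation and then for a sigmoid.

```python
import itertools
import math

def mask_probability(mask, keep_prob):
    """Probability of one mask under independent Bernoulli(keep_prob) entries."""
    k = sum(mask)
    return keep_prob ** k * (1 - keep_prob) ** (len(mask) - k)

def expected_output(w, y, keep_prob, f):
    """Exact E_r[f(w . (r * y))], computed by enumerating all 2^n masks."""
    total = 0.0
    for mask in itertools.product((0, 1), repeat=len(y)):
        pre = sum(wj * rj * yj for wj, rj, yj in zip(w, mask, y))
        total += mask_probability(mask, keep_prob) * f(pre)
    return total

def weight_scaled_output(w, y, keep_prob, f):
    """Single deterministic pass with every weight multiplied by keep_prob."""
    return f(sum(keep_prob * wj * yj for wj, yj in zip(w, y)))

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.0, -1.0, 0.5, 1.5]   # illustrative weights
y = [1.0, 0.5, 2.0, 1.0]    # illustrative activations
p = 0.5

for name, f in (("identity", identity), ("sigmoid", sigmoid)):
    exact = expected_output(w, y, p, f)
    scaled = weight_scaled_output(w, y, p, f)
    print(f"{name:8s} exact mean = {exact:.4f}   weight-scaled = {scaled:.4f}")
```

With the identity activation the two numbers coincide, as equation (3) predicts. With the sigmoid they differ, because the expectation no longer passes through the non-linearity; how large the gap is depends on the weights and inputs chosen.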

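In modern frameworks (PyTorch’s nn.Dropout, TensorFlow’s keras.layers.Dropout), the convention is to scale during training instead: divide kept activations by \(p_\ell\) so the expected activation matches the no-dropout case. This is called inverted dropout, and it lets test-time code run with no special handling at all: in PyTorch, model.eval() simply disables the random masking. Note that both APIs are parameterized by the drop probability \(1 - p_\ell\) (the p argument of nn.Dropout, the rate argument of keras.layers.Dropout), not by the keep probability used in this note.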
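A framework-free sketch of inverted dropout makes the bookkeeping concrete. This is not how PyTorch or Keras implement it internally; the function name, the training flag, and the example values are assumptions chosen for illustration.

```python
import random

def inverted_dropout(activations, keep_prob, training, rng):
    """Inverted dropout: scale kept units by 1/keep_prob while training.

    At evaluation time the activations pass through unchanged, so no
    test-time weight scaling is needed.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        return list(activations)
    return [y / keep_prob if rng.random() < keep_prob else 0.0
            for y in activations]

rng = random.Random(0)
y = [1.0, 2.0, 3.0, 4.0]
keep_prob = 0.8

# Average many training-mode passes: the mean should approach y itself.
n_draws = 20000
mean = [0.0] * len(y)
for _ in range(n_draws):
    out = inverted_dropout(y, keep_prob, training=True, rng=rng)
    mean = [m + o / n_draws for m, o in zip(mean, out)]

print("eval-mode output   :", inverted_dropout(y, keep_prob, training=False, rng=rng))
print("mean training pass :", [round(m, 3) for m in mean])
```

Because each kept activation is divided by keep_prob, its expectation is keep_prob times y / keep_prob, which equals y, the value the evaluation-mode pass returns.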

Interactive Demo 2 — Train Average Converges to Test

Equation (3) says: if you average the training-time outputs over enough mask samples, you get the deterministic test-time output. This demo shows that convergence happening in real time.

The setup is intentionally tiny — one weight, one input, one Bernoulli mask. With \(w = 2\), \(x = 1\), keep probability \(p\), the train-time output is

\[y^{\text{train}} = r \cdot w \cdot x, \qquad r \sim \operatorname{Bernoulli}(p),\]

so each sample is either \(0\) (the mask dropped this unit, prob \(1-p\)) or \(w \cdot x = 2\) (kept, prob \(p\)). The deterministic test-time output is \(p \cdot w \cdot x = 2p\) — the dashed grey line.

Click Sample to add one mask draw; click Auto-play to add ~10 per second. The blue dots are individual draws (each lands on exactly \(0\) or \(2\)); the red curve is the running average so far. Watch it settle onto the dashed line.

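[Interactive demo: a drop-rate slider (default 0.50) with live readouts for the number of samples \(T\), the running mean, and the test target (1.000 at the default setting).]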

What you should see while playing:

Every blue dot lands at exactly \(0\) or \(2\). Individual mask draws are extreme — there’s no in-between.
The red running-mean curve starts noisy (one or two samples gives huge swings) and tightens as more samples come in. Statistically, the std-dev of the running mean shrinks as \(1/\sqrt{T}\).
That curve converges to the dashed grey line, which is the test-time scaled value \(p \cdot w \cdot x\). Click Reset and try a different drop rate — the line jumps to the new \(p \cdot w \cdot x\) and the convergence story repeats.

That’s equation (3) in pictures: averaging over the random masks during training is, in expectation, the same as the deterministic weight-scaled forward pass at test time — which is why deployed networks pay no extra cost for using dropout during training.
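The same experiment can be run without the interactive page. The sketch below follows the setup described above (\(w = 2\), \(x = 1\), one Bernoulli mask per sample); the function name, seed, and sample count are assumptions.

```python
import random

def running_means(w, x, keep_prob, n_samples, seed=0):
    """Running average of y_train = r * w * x with r ~ Bernoulli(keep_prob)."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for t in range(1, n_samples + 1):
        r = 1 if rng.random() < keep_prob else 0
        total += r * w * x          # each draw is either 0 or w * x
        means.append(total / t)
    return means

w, x, p = 2.0, 1.0, 0.5
target = p * w * x                  # deterministic weight-scaled output
means = running_means(w, x, p, n_samples=5000)

for t in (1, 10, 100, 1000, 5000):
    print(f"T = {t:5d}   running mean = {means[t - 1]:.4f}   "
          f"gap to target = {abs(means[t - 1] - target):.4f}")
```

The gap to the target shrinks roughly like \(1/\sqrt{T}\) on average, although any single run will wobble around that trend.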

The Ensemble View

A network with \(N\) hidden units has \(2^N\) possible binary masks, i.e. \(2^N\) different “thinned” subnetworks. Training with dropout can be viewed as training all \(2^N\) of them at once with shared weights: each step samples one subnetwork, with probability equal to the product of its mask’s Bernoulli probabilities, and updates only that subnetwork’s weights. Most subnetworks are never sampled at all; weight sharing is what lets them benefit from the updates anyway.

At test time, exact ensembling would require summing the predictions of all \(2^N\) subnetworks — infeasible. The weight-scaling rule (Section “Test time” above) gives a closed-form approximation that is exact for linear models and remarkably close in practice for non-linear ones.

This ensemble interpretation is why dropout is often described as “model averaging without the cost”. The catch is that the ensemble members are highly correlated (they all share the same weight matrices), so the variance reduction is smaller than an ensemble of independent models would give. But it is free relative to actually training and storing \(K\) networks.
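Written out, and assuming for simplicity a single keep probability \(p\) shared by all \(N\) hidden units, the sampling distribution over subnetworks and the ensemble prediction it defines are

\[P(\mathbf{r}) = \prod_{j=1}^{N} p^{\,r_j}(1-p)^{1-r_j} = p^{k}(1-p)^{N-k}, \qquad k = \sum_{j=1}^{N} r_j,\]

\[\sum_{\mathbf{r} \in \{0,1\}^N} P(\mathbf{r}) = \sum_{k=0}^{N} \binom{N}{k}\, p^{k}(1-p)^{N-k} = 1, \qquad \bar{y}(\mathbf{x}) = \sum_{\mathbf{r} \in \{0,1\}^N} P(\mathbf{r})\; y(\mathbf{x};\, \mathbf{r}).\]

With \(p = 0.5\) every one of the \(2^N\) subnetworks is equally likely, \(P(\mathbf{r}) = 2^{-N}\). The Monte Carlo estimate from the test-time section samples the sum \(\bar{y}\); the weight-scaling rule replaces it with one forward pass.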

Hyperparameters & Practical Choices

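Typical settings, drawn largely from the paper’s recommendations, which still hold up (the batch-normalization row reflects later practice, since BN postdates the paper):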

Layer	Typical keep \(p\)	Notes
Input	\(0.8\)	Inputs are precious; drop too many and the network can’t learn.
Hidden	\(0.5\)	The default. \(0.5\) is also the maximum-entropy choice: each mask bit \(r_j\) then has its largest possible variance, \(p(1-p) = 0.25\).
Convolutional feature maps	\(0.7\)–\(0.9\)	Conv layers already share parameters and are less over-parameterized; less aggressive dropout.
Batch-normalized layers	(often skip)	BN already provides a strong regularizing signal; combining the two can hurt.

Other practical points:

Train longer. Networks with dropout typically need 2–3× more epochs to converge, because the optimizer is solving a noisier problem.
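Higher learning rates / momentum. Dropout adds noise to the gradients, and noisy gradients partly cancel one another; a larger learning rate or higher momentum compensates. Srivastava et al. recommend at least \(10\times\) the learning rate of the no-dropout setup.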
Max-norm weight constraint. Constrain \(\|\mathbf{w}_j\|_2 \leq c\) per unit. Together with high LR + dropout, gives the configuration the paper found most effective.
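The max-norm constraint is usually enforced as a projection applied after each gradient step. Here is a minimal sketch, assuming each row of the weight matrix holds the incoming weights of one unit and using an illustrative bound c = 3.5; the helper names are hypothetical.

```python
import math

def max_norm_project(w, c):
    """Rescale w onto the L2 ball of radius c if it lies outside it."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= c:
        return list(w)              # already feasible: leave unchanged
    scale = c / norm
    return [wi * scale for wi in w]

def apply_max_norm(W, c):
    """Project the incoming weight vector (row) of every unit separately."""
    return [max_norm_project(row, c) for row in W]

W = [[1.0, 2.0, 2.0],     # norm 3  -> inside the ball, untouched
     [3.0, 4.0, 12.0]]    # norm 13 -> rescaled to norm c
c = 3.5                   # assumed bound, chosen for illustration

W_projected = apply_max_norm(W, c)
for row in W_projected:
    print(row, "norm =", round(math.sqrt(sum(wi * wi for wi in row)), 6))
```

The projection changes only the length of an oversized weight vector, not its direction, which lets large learning rates be used without the weights blowing up.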

Empirical Results From the Paper

A summary of headline numbers from the original 2014 paper:

MNIST. A standard fully-connected network drops from \(\sim 1.6\%\) test error (no dropout) to \(\sim 1.05\%\) with dropout. The best reported figure, \(\sim 0.79\%\), comes from a deep Boltzmann machine fine-tuned with dropout, roughly half the no-dropout error.
CIFAR-10. Convnet test error drops from \(\sim 14\%\) to \(\sim 12.6\%\) with dropout, and to \(\sim 11.9\%\) with dropout + max-norm constraint.
Speech (TIMIT). Phone error rates drop \(\sim 5\%\) relative.
Document classification (Reuters). Similar magnitude improvements.

The paper’s contribution is a consistent gain across very different domains with one tiny code change.

Why It Works — The Co-Adaptation Argument

A unit’s gradient depends on the other units in its layer being where the unit “expects” them. When 50 % of those neighbours might be missing, a feature that worked only because of a specific neighbour’s contribution is no longer reliable. The optimizer is pushed toward features that are useful on their own.

In visualizations of first-layer filters from a fully-connected network trained without dropout, the filters tend to look like noisy combinations, with different units detecting overlapping features. With dropout, the filters become much more localized and interpretable: edges, blobs, corners. Each unit has been trained to do its own job.

Quantitative evidence in the paper: the correlation between activations of different units, measured on test data, drops sharply when dropout is used. Co-adaptation has been broken.

Limitations & When It Doesn’t Help

Dropout is not a silver bullet:

Recurrent networks. Naive dropout applied to recurrent connections destroys long-range dependencies. Specialized variants (variational dropout, zoneout, recurrent dropout with shared masks) work better. In modern transformers, by contrast, dropout is straightforward (applied within the FFN sub-layer of each block).
Over-regularization. Combined with strong L\(_2\), label smoothing, and modern data augmentation, dropout can over-regularize and hurt performance. The current best practice for image classification is much less dropout than the 2014 recommendations would suggest.
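Batch normalization and layer normalization. BN provides a stochastic-noise regularization of its own, because mini-batch statistics fluctuate from step to step. On well-tuned BN networks, additional dropout often makes no difference or hurts. Layer normalization computes its statistics per example and injects no such noise, so this caveat does not carry over to it directly.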
Very deep networks. Replacing dropout with stochastic depth (skipping whole layers randomly) tends to do better. Same idea, larger granularity.

Modern Context

Dropout’s intellectual descendants are everywhere:

DropConnect (Wan et al., 2013) — drop weights instead of units. Slightly more flexible; rarely beats vanilla dropout in practice.
Stochastic depth (Huang et al., 2016) — skip whole residual blocks randomly during training. Standard in modern CNNs.
Data augmentation as dropout — masking input pixels (Cutout) or feature-map regions (DropBlock) is dropout applied to the input or low layers.
Attention dropout in transformers — drop entries of the attention probability matrix. Standard in BERT, GPT, ViT.
MC Dropout for uncertainty (Gal & Ghahramani, 2016) — keep dropout active at test time and compute predictive variance over Monte Carlo passes. A cheap probabilistic interpretation of an otherwise deterministic network.
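To make the last item concrete, here is a toy sketch of the Monte Carlo procedure: keep dropout active at prediction time, run many stochastic passes, and report the mean and variance of the outputs. The tiny one-input network, its weights, and the helper names are invented for illustration, and the sketch omits the modelling details of Gal and Ghahramani's derivation.

```python
import math
import random

def stochastic_forward(x, w_hidden, w_out, keep_prob, rng):
    """One pass of a 1-input, tanh-hidden, linear-output net with inverted
    dropout left switched on for the hidden layer."""
    hidden = [math.tanh(w * x) for w in w_hidden]
    hidden = [h / keep_prob if rng.random() < keep_prob else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def mc_dropout_predict(x, w_hidden, w_out, keep_prob, n_passes, seed=0):
    """Predictive mean and sample variance over n_passes stochastic passes."""
    rng = random.Random(seed)
    outs = [stochastic_forward(x, w_hidden, w_out, keep_prob, rng)
            for _ in range(n_passes)]
    mean = sum(outs) / n_passes
    var = sum((o - mean) ** 2 for o in outs) / (n_passes - 1)
    return mean, var

w_hidden = [0.5, -1.0, 1.5, 2.0]   # made-up weights
w_out = [1.0, 0.5, -0.5, 2.0]
x = 1.0

mean, var = mc_dropout_predict(x, w_hidden, w_out, keep_prob=0.5, n_passes=20000)
deterministic = sum(wo * math.tanh(wh * x) for wo, wh in zip(w_out, w_hidden))

print(f"MC mean = {mean:.3f}, MC variance = {var:.3f}, "
      f"deterministic output = {deterministic:.3f}")
```

Because the output layer here is linear, the Monte Carlo mean approaches the deterministic output; the variance is the extra quantity that a single deterministic pass does not provide.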

Reading the Paper

After this note, the paper itself is much more approachable. The sections most worth reading carefully:

§2: motivation through the lens of overfitting and model averaging.
§4: the formal description of the dropout model (matches the “Setting & Notation” section above).
§5 and Appendix A: training tricks (max-norm, learning rates), which matter more than the paper’s prose suggests.
§6: experimental results — useful as a benchmark of what to expect from dropout in different domains.
§7.1: the visualizations of features learned with vs. without dropout — striking and the most convincing piece of evidence for the co-adaptation theory.

Takeaways

Dropout is randomness as regularization. Drop a fraction of units each forward pass; force the network to perform without any one of them.
Train ↔ test scaling. Either inverted-dropout during training (modern default) or weight-scaling at test (original paper). Equivalent in expectation; inverted dropout is ergonomically better.
Ensemble interpretation. \(2^N\) thinned subnetworks share weights and are trained simultaneously; the weight-scaling rule approximates their average in a single forward pass (exactly so for a linear layer).
Hyperparameter rule of thumb: \(p = 0.5\) for hidden, \(p = 0.8\) for input, less aggressive on convolutional layers; train longer with a larger LR and a max-norm constraint on weights.
Modern context: still useful, but on networks already heavily regularized by BatchNorm and modern augmentations, additional dropout is less impactful and sometimes harmful. Stochastic depth and attention dropout are its modern descendants.