Back-Propagation (Rumelhart, Hinton & Williams, 1986)

TL;DR

A neural network is just a stack of simple “units” that each take a weighted sum of their inputs, squash it through a non-linear function, and pass the result to the next layer. To train such a network we want to find the weights that make its output match a target. Back-propagation is the algorithm that says: “Take the error you see at the output, and pass it backwards through the network using the chain rule, so every weight learns how much it personally contributed to the mistake.”

That’s it. The rest of this note slowly builds up the math behind that one sentence — assuming you are not fluent in calculus.

Problem & Motivation

A single neuron (the perceptron, 1958) can only separate inputs with a straight line. It cannot learn XOR:

\(x_1\)	\(x_2\)	\(y\)
0	0	0
0	1	1
1	0	1
1	1	0

No single line cuts the four points correctly. To fix this we add an intermediate (“hidden”) layer of neurons — but now we face the credit-assignment problem: when the network produces the wrong answer, which hidden weight is at fault, and by how much?

Backprop answers that question. It is the gradient of the loss with respect to every weight, computed efficiently by re-using intermediate results.

Math Prerequisites

If the next four pages feel like review, skip ahead. Otherwise, read slowly.

1. Derivatives — what does \(\dfrac{df}{dx}\) mean?

If \(f(x)\) is a function, then \(\dfrac{df}{dx}\) at a point \(x\) is the slope of \(f\) at that point. It tells you: “if I nudge \(x\) by a tiny amount \(\Delta x\), how much does \(f\) change?”

\[f(x + \Delta x) \;\approx\; f(x) + \frac{df}{dx}\,\Delta x.\]

Example. If \(f(x) = x^2\), then \(\dfrac{df}{dx} = 2x\). At \(x=3\) the slope is \(6\), so nudging \(x\) from \(3\) to \(3.01\) changes \(f\) by roughly \(6 \times 0.01 = 0.06\).

2. Partial derivatives — when there are many inputs

Suppose \(E\) depends on many weights \(w_1, w_2, \dots, w_n\). The partial derivative \(\dfrac{\partial E}{\partial w_i}\) asks: “if I nudge only \(w_i\) and hold everything else fixed, how much does \(E\) change?” The notation \(\partial\) (instead of \(d\)) is just a reminder that other variables exist but we are pretending they’re constants for this calculation.

Example. If \(E = w_1^2 + 3 w_1 w_2\), then

\[\frac{\partial E}{\partial w_1} = 2 w_1 + 3 w_2, \qquad \frac{\partial E}{\partial w_2} = 3 w_1.\]

3. The chain rule — composed functions

The chain rule says: “if \(y\) depends on \(u\), and \(u\) depends on \(x\), then a nudge in \(x\) ripples through to \(y\) via \(u\).”

\[\frac{dy}{dx} \;=\; \frac{dy}{du} \cdot \frac{du}{dx}.\]

You multiply the slopes along the path from \(x\) to \(y\).

Concrete example. Let \(y = (3x+2)^2\). Set \(u = 3x+2\), so \(y = u^2\). Then \(\dfrac{dy}{du} = 2u\) and \(\dfrac{du}{dx} = 3\), giving

\[\frac{dy}{dx} = 2u \cdot 3 = 6(3x+2).\]

When the path branches (one \(x\) feeds many \(u\)’s, all of which feed \(y\)), you add up the contribution from each branch. This branching is exactly what happens in a neural network — and exactly why backprop has the form it does.

4. Gradient descent — using the slope to improve

To minimise \(E(w)\) we repeatedly take small steps opposite to the slope:

\[w \;\leftarrow\; w \;-\; \eta\, \frac{\partial E}{\partial w}.\]

The number \(\eta\) (eta, Greek “h”) is the learning rate — how big a step. If the slope is positive, \(E\) is going up as \(w\) increases, so we move \(w\) down. That’s the minus sign.

5. The sigmoid activation and its derivative

The 1986 paper uses the logistic sigmoid:

\[\sigma(z) \;=\; \frac{1}{1 + e^{-z}}.\]

It squashes any real number into the range \((0, 1)\). Its derivative has a beautiful property — it can be written entirely in terms of \(\sigma(z)\) itself:

\[\sigma'(z) \;=\; \sigma(z)\,\bigl(1 - \sigma(z)\bigr).\]

Why this is convenient. During backprop we’ll need \(\sigma'(z)\). We already have \(\sigma(z)\) from the forward pass — no recomputation needed.

Building Up: From One Neuron to a Network

We will build the algorithm in three stages, each adding one layer of complication. By stage 3 you will have re-derived the algorithm yourself.

Stage 1: A single neuron

The simplest possible network is one neuron with \(n\) inputs:

\[a \;=\; \sum_{i=1}^{n} w_i\, x_i, \qquad o \;=\; \sigma(a),\]

where \(x_i\) is the \(i\)-th input value and \(w_i\) is its weight. We have a target \(t\) and use the squared error:

\[E \;=\; \tfrac{1}{2}\,(t - o)^2.\]

(The \(\tfrac{1}{2}\) is just so the \(2\) from differentiating cancels — it doesn’t change where the minimum is.)

Goal: find \(\dfrac{\partial E}{\partial w_i}\), so we can update \(w_i\).

The dependency chain is

\[w_i \;\longrightarrow\; a \;\longrightarrow\; o \;\longrightarrow\; E.\]

Apply the chain rule along this chain — multiplying the slopes:

\[\frac{\partial E}{\partial w_i} \;=\; \underbrace{\frac{\partial E}{\partial o}}_{\text{step 1}} \cdot \underbrace{\frac{\partial o}{\partial a}}_{\text{step 2}} \cdot \underbrace{\frac{\partial a}{\partial w_i}}_{\text{step 3}}.\]

Compute each piece:

Step 1. \(E = \tfrac{1}{2}(t - o)^2\), so \(\dfrac{\partial E}{\partial o} = -(t - o)\).
Step 2. \(o = \sigma(a)\), so \(\dfrac{\partial o}{\partial a} = \sigma'(a) = o(1-o)\).
Step 3. \(a = \sum_j w_j x_j\), so \(\dfrac{\partial a}{\partial w_i} = x_i\).

Multiplying:

\[\frac{\partial E}{\partial w_i} \;=\; -(t - o)\, \cdot \, o(1-o)\, \cdot \, x_i.\]

This single formula already contains the whole idea of backprop. Read it left-to-right:

How wrong was the output? \(-(t-o)\)
How sensitive is the output to its raw input? \(o(1-o)\)
How much did this particular weight feed into that raw input? \(x_i\)

Multiply them, and you have weight \(w_i\)’s share of the blame.

The update rule (gradient descent) is then

\[w_i \;\leftarrow\; w_i \;-\; \eta\, \frac{\partial E}{\partial w_i} \;=\; w_i \;+\; \eta\, (t-o)\, o(1-o)\, x_i.\]

Stage 1.5: A worked numerical example

Let’s run the actual numbers through one full forward-and-backward pass. To keep the arithmetic clean we use a linear neuron (no sigmoid) — the chain-rule structure is identical, just with one fewer factor. This walkthrough follows the Backprop Explainer (Bertucci & Kahng, 2021), which has a beautiful interactive version of exactly this example.

Forward (gray + blue pulse): the value x = 2.1 flows into the neuron, becomes ŷ = 2.1, then enters the loss. Backward (red, dashed): the gradient ∂E/∂ŷ = −3.8 emerges from the loss, gets multiplied by x at the neuron, and arrives at the input as ∂E/∂w = −7.98 — the chain rule, animated.

Setup.

Network: \(\hat{y} = w x + b\) (one linear neuron with weight and bias)
One training example: \(x = 2.1\), target \(y = 4\)
Initial parameters: \(w = 1\), \(b = 0\)
Loss: squared error \(E = (\hat{y} - y)^2\) (no \(\tfrac{1}{2}\) this time, to match the explainer)
Learning rate: \(\eta = 0.01\)

Step 1 — Forward pass

\[\hat{y} = w\,x + b = (1)(2.1) + 0 = 2.1\] \[E = (\hat{y} - y)^2 = (2.1 - 4)^2 = (-1.9)^2 = 3.61.\]

So the prediction is \(2.1\), the target is \(4\), and the current loss is \(3.61\). Clearly the prediction is too low.

Step 2 — Backward pass: compute the gradient

We want \(\dfrac{\partial E}{\partial w}\) and \(\dfrac{\partial E}{\partial b}\). The dependency chain is

\[w,\, b \;\longrightarrow\; \hat{y} \;\longrightarrow\; E.\]

(2a) How does \(E\) change with \(\hat{y}\)? For \(E = (\hat{y} - y)^2\), the derivative (using the chain rule on the square) is

\[\frac{\partial E}{\partial \hat{y}} = 2(\hat{y} - y) = 2(2.1 - 4) = 2(-1.9) = -3.8.\]

The minus sign says: if we increase \(\hat{y}\), the loss goes down — which makes sense, since we under-shot the target.

(2b) How does \(\hat{y}\) change with \(w\) and \(b\)? For \(\hat{y} = wx + b\),

\[\frac{\partial \hat{y}}{\partial w} = x = 2.1, \qquad \frac{\partial \hat{y}}{\partial b} = 1.\]

(2c) Chain them together.

\[\frac{\partial E}{\partial w} \;=\; \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} \;=\; (-3.8)(2.1) \;=\; -7.98.\] \[\frac{\partial E}{\partial b} \;=\; \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b} \;=\; (-3.8)(1) \;=\; -3.8.\]

Plain-English read of these two numbers: “the loss is currently decreasing fastest if we make \(w\) larger (because \(\partial E/\partial w\) is negative — moving \(w\) in the negative-of-negative direction lowers \(E\)). Same story for \(b\).”

Step 3 — Gradient-descent update

Step in the direction that lowers the loss, scaled by the learning rate:

\[w \;\leftarrow\; w - \eta\, \frac{\partial E}{\partial w} \;=\; 1 - (0.01)(-7.98) \;=\; 1 + 0.0798 \;=\; 1.0798.\] \[b \;\leftarrow\; b - \eta\, \frac{\partial E}{\partial b} \;=\; 0 - (0.01)(-3.8) \;=\; 0 + 0.038 \;=\; 0.038.\]

Both parameters increased — exactly what the gradient told us to do.

Step 4 — Verify the loss actually went down

Run a second forward pass with the new parameters:

\[\hat{y}_{\text{new}} = (1.0798)(2.1) + 0.038 = 2.2676 + 0.038 = 2.3056.\] \[E_{\text{new}} = (2.3056 - 4)^2 = (-1.6944)^2 \approx 2.87.\]

Loss dropped from \(3.61\) to \(2.87\) in one step — a decrease of about \(0.74\). Iterate this process thousands of times over many training examples, and the loss approaches zero.

Try it yourself: gradient descent on \(E(w)\)

The plot below shows the loss curve for our example (\(E(w) = (2.1\,w + 0 - 4)^2\) with \(b\) fixed at \(0\)). Drag the slider to move \(w\) and watch \(\hat y\), \(E\), and the gradient update. Click Step to apply one gradient-descent update — exactly the math from Step 3 above. Click Auto-play to watch the ball roll down the curve.

w = 1.0000

ŷ = w·x + b = 2.1000

E = (ŷ − y)² = 3.6100

∂E/∂w = −7.9800

step # 0

Why we used a linear neuron here. With sigmoid we’d multiply by an extra factor \(\sigma'(a) = o(1-o)\) between Step 2a and 2b. The arithmetic gets uglier (try it: \(\sigma(2.1) \approx 0.891\), \(\sigma'(2.1) \approx 0.097\) → tiny gradients) but the procedure is identical. This is one reason modern networks prefer ReLU: its derivative is just \(1\) for \(z > 0\), so error signals don’t shrink layer by layer.

Stage 2: One hidden layer (the credit-assignment fix)

Now add a hidden layer. Inputs \(x_i\) feed hidden units \(h_j\), which feed the output \(o\). Let

\(v_{ij}\) = weight from input \(x_i\) to hidden unit \(h_j\),
\(w_j\) = weight from hidden unit \(h_j\) to the single output.

Forward pass:

\[a_j \;=\; \sum_i v_{ij}\, x_i, \qquad h_j \;=\; \sigma(a_j),\] \[b \;=\; \sum_j w_j\, h_j, \qquad o \;=\; \sigma(b), \qquad E \;=\; \tfrac{1}{2}(t-o)^2.\]

The output weights \(w_j\) are easy — same as Stage 1 with \(h_j\) playing the role of input:

\[\frac{\partial E}{\partial w_j} \;=\; -(t-o)\, o(1-o)\, h_j.\]

The hard case is \(v_{ij}\). Its dependency chain branches:

\[v_{ij} \;\longrightarrow\; a_j \;\longrightarrow\; h_j \;\longrightarrow\; b \;\longrightarrow\; o \;\longrightarrow\; E.\]

Apply the chain rule along the whole path — five steps multiplied:

\[\frac{\partial E}{\partial v_{ij}} \;=\; \underbrace{\frac{\partial E}{\partial o}}_{=\,-(t-o)} \cdot \underbrace{\frac{\partial o}{\partial b}}_{=\,o(1-o)} \cdot \underbrace{\frac{\partial b}{\partial h_j}}_{=\,w_j} \cdot \underbrace{\frac{\partial h_j}{\partial a_j}}_{=\,h_j(1-h_j)} \cdot \underbrace{\frac{\partial a_j}{\partial v_{ij}}}_{=\,x_i}.\]

Result:

\[\frac{\partial E}{\partial v_{ij}} \;=\; -\,(t-o)\, o(1-o)\, w_j\, h_j(1-h_j)\, x_i.\]

Look at the structure. The first three factors \(-(t-o)\, o(1-o)\, w_j\) are exactly the error signal as seen at hidden unit \(j\). They are the error that arrived at \(o\), multiplied by how strongly \(h_j\) was connected to \(o\). The last two factors \(h_j(1-h_j)\, x_i\) are the local slope of \(h_j\) and the input feeding \(v_{ij}\).

This is the back-propagation of error: the output error gets “transmitted” backward through \(w_j\) to give an effective error at \(h_j\), and that effective error then determines \(v_{ij}\)’s gradient.

Walk through it: a 1-hidden-layer network, one step at a time

Time to make Stage 2 concrete. Drag the slider below to advance through every step of one forward + backward pass on the smallest non-trivial network: one input, two ReLU hidden units, one output. Numbers chosen so every gradient is an integer or a half — no calculator needed.

We use ReLU \(\text{ReLU}(z) = \max(0, z)\) instead of sigmoid for clean arithmetic. Its derivative is the simplest possible: \(\text{ReLU}'(z) = 1\) if \(z > 0\), else \(0\).

Step 0 / 17 Initial state

Drag the slider →

Stage 3: Many layers — the general recursion

To avoid the bookkeeping nightmare of writing out every chain, we define the error signal at every unit:

\[\delta_j^{(\ell)} \;\equiv\; \frac{\partial E}{\partial a_j^{(\ell)}}.\]

Read this as: “how much would the loss change if I nudged the raw input \(a_j^{(\ell)}\) of unit \(j\) in layer \(\ell\) by a tiny amount?”

Once we have \(\delta_j^{(\ell)}\) for every unit, the gradient with respect to any weight is just one multiplication:

\[\boxed{\;\; \frac{\partial E}{\partial w_{ij}^{(\ell)}} \;=\; \delta_j^{(\ell)} \, o_i^{(\ell-1)} \;\;}\]

Plain-English reading: “weight \(w_{ij}^{(\ell)}\)’s share of the blame = (error signal at the unit it feeds) × (the value flowing in from the previous layer).”

The remaining task is to compute \(\delta\) at every unit. We do it in two steps.

Output layer (start of the recursion). The error signal is just the direct loss derivative, and the chain rule gives

\[\delta_j^{(L)} \;=\; \frac{\partial E}{\partial o_j^{(L)}} \cdot \sigma'\!\left(a_j^{(L)}\right) \;=\; -\bigl(t_j - o_j^{(L)}\bigr)\, \sigma'\!\left(a_j^{(L)}\right).\]

Hidden layer (the recursion). Hidden unit \(j\) in layer \(\ell\) feeds every unit \(k\) in layer \(\ell+1\). The chain rule with branches says we sum the contributions from each branch:

\[\boxed{\;\; \delta_j^{(\ell)} \;=\; \sigma'\!\left(a_j^{(\ell)}\right)\, \sum_k w_{jk}^{(\ell+1)}\, \delta_k^{(\ell+1)} \;\;}\]

Plain-English reading: “the error signal at hidden unit \(j\) is the weighted sum of the error signals from the next layer (gathered through the weights \(w_{jk}^{(\ell+1)}\) that connect \(j\) to those next-layer units), multiplied by the local slope \(\sigma'\) of \(j\) itself.”

Notice that this lets us compute \(\delta\) layer by layer from the output back toward the input — that is the “back-propagation” in the name.

The full algorithm

Putting it all together, training proceeds by repeating:

Forward pass. Push \(x\) through the network, recording every \(a^{(\ell)}\) and \(o^{(\ell)}\).
Output error. Compute \(\delta^{(L)}\) from the targets.
Backward pass. For \(\ell = L-1, L-2, \dots, 1\), apply the recursion to obtain \(\delta^{(\ell)}\).
Weight gradients. For every weight, \(\dfrac{\partial E}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)}\, o_i^{(\ell-1)}\).
Update. \(w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta\, \delta_j^{(\ell)}\, o_i^{(\ell-1)}\).
Repeat over training examples until convergence.

The paper additionally adds a momentum term, which carries a fraction of the previous step into the next:

\[\Delta w_{ij}^{(\ell)}(t) \;=\; -\eta\, \delta_j^{(\ell)}\, o_i^{(\ell-1)} \;+\; \alpha\, \Delta w_{ij}^{(\ell)}(t-1).\]

Intuitively: if we keep stepping in roughly the same direction, momentum accelerates us; if we keep oscillating back and forth, the previous step cancels part of the new step and we stop wobbling.

Key Results in the Paper

XOR. A 2-2-1 network learns XOR — the smallest demonstration that hidden units can build a representation a perceptron cannot.
Symmetry detection. A network with two hidden units learns to detect left/right mirror symmetry; the trained weights are interpretable feature detectors.
Family-tree relations. Trained on relations among two artificial families, the network generalizes to held-out (person, relation, person) triples, and the hidden units develop semantic features (nationality, generation, branch) that were never given as supervision. This is the seed of the modern idea of “learned representations.”

Why It Matters / Limitations

Why it matters. Backprop is the workhorse of every modern deep network. CNNs, RNNs, Transformers, diffusion models — all of them run the same chain-rule recursion at scale. Once you understand the math above, you understand the core of how every deep model on the planet is trained.
Limitations of the original.
- Vanishing / exploding gradients. When you multiply many sigmoids’ derivatives (\(\sigma'(z) \le 1/4\) everywhere), the error signal shrinks exponentially with depth. Tiny networks dodge this; deep networks needed ReLU + careful initialization + normalization.
- Local minima concerns. The 1986 paper worried about getting stuck in bad local minima. Modern theory says local minima are rarely the obstacle in high dimensions; saddle points and plateaus matter more — and SGD’s stochasticity helps escape both.
- Plain SGD on sigmoids. Replaced in modern practice by ReLU activations, adaptive optimizers (Adam), batch / layer normalization, and skip connections. The algorithm (backprop) is unchanged; everything around it has been upgraded.

Reading the Original

Original PDF (Nature, vol. 323, pp. 533–536, 1986)