Introduction & Motivation
Backpropagation (backward propagation of errors) is the cornerstone algorithm for training neural networks. First introduced in the 1970s and popularized by Rumelhart, Hinton, and Williams in 1986, it provides an efficient method for computing gradients in multi-layer networks.
At its heart, backpropagation answers a fundamental question: How much does each weight in a neural network contribute to the overall error? By answering this question, we can adjust weights to minimize error and improve predictions.
Backpropagation is an algorithm for computing the gradient of a loss function with respect to the weights of a neural network. It applies the chain rule of calculus systematically, propagating error signals backward through the network from output to input.
Why Do We Need Backpropagation?
Neural networks learn by adjusting their weights to minimize a loss function. This requires knowing the gradient—the direction and magnitude of change that will reduce the loss. For a network with millions of parameters, computing these gradients efficiently is crucial.
Computing each of the \(W\) partial derivatives independently (for example, by finite differences) costs a full forward pass per weight, for \(O(W^2)\) operations in total. Backpropagation computes all of them in a single backward sweep, reducing this to \(O(W)\) and making deep learning computationally feasible.
The Learning Process Overview
- Forward Pass: Input data flows through the network, layer by layer, producing an output prediction.
- Loss Computation: The prediction is compared to the true label using a loss function, producing a scalar error value.
- Backward Pass: Gradients are computed by propagating the error backward through the network using the chain rule.
- Weight Update: Weights are adjusted in the direction that minimizes the loss (gradient descent).
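These four steps map directly onto code. Below is a minimal, self-contained sketch using a single linear neuron and squared-error loss (the numbers are made up for illustration):

# One neuron, one example: y_hat = w*x + b, loss = 0.5*(y_hat - y)^2
w, b, lr = 0.0, 0.0, 0.1
x, y = 2.0, 3.0                       # a made-up training example

for step in range(100):
    y_hat = w * x + b                 # 1. forward pass
    loss = 0.5 * (y_hat - y) ** 2     # 2. loss computation
    dw = (y_hat - y) * x              # 3. backward pass (chain rule)
    db = (y_hat - y)
    w -= lr * dw                      # 4. weight update (gradient descent)
    b -= lr * db

print(w * x + b)                      # prediction approaches the target 3.0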
Forward Propagation
Before understanding backpropagation, we must first understand how data flows forward through a neural network. This forward pass transforms inputs into outputs through a series of linear transformations and non-linear activations.
Neuron Computation
A single neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function:
\[z = \mathbf{w}^T \mathbf{x} + b, \quad a = \sigma(z)\]
\(z\) — Pre-activation (weighted sum)
- The raw output before applying the activation function
- Example: if \(\mathbf{x} = [0.5, 0.3]\), \(\mathbf{w} = [0.4, 0.6]\), \(b = 0.1\)
- Then \(z = (0.4 \times 0.5) + (0.6 \times 0.3) + 0.1 = 0.2 + 0.18 + 0.1 = 0.48\)
\(\sigma(z)\) — Activation function
- Introduces non-linearity (otherwise the network is just linear!)
- Squashes values to a specific range (e.g., sigmoid: 0 to 1)
\(a\) — Activation (output)
- The final output of the neuron after activation
- This becomes the input to neurons in the next layer
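To make this concrete, here is the example above as a few lines of NumPy (sigmoid is assumed as the activation):

import numpy as np

x = np.array([0.5, 0.3])        # inputs
w = np.array([0.4, 0.6])        # weights
b = 0.1                         # bias

z = w @ x + b                   # pre-activation: 0.48
a = 1 / (1 + np.exp(-z))        # sigmoid activation: ~0.6177
print(z, a)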
Layer-wise Computation
For efficiency, we compute all neurons in a layer at once using matrix operations:
For layer \(l\) with \(n^{[l]}\) neurons receiving input from layer \(l-1\) with \(n^{[l-1]}\) neurons:
\[\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}, \quad \mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\]
\(\mathbf{W}^{[l]}\) — Weight matrix: \(n^{[l]} \times n^{[l-1]}\)
- Row \(i\): weights for neuron \(i\) in this layer
- Column \(j\): weights connecting from neuron \(j\) in previous layer
- \(W_{ij}^{[l]}\): weight from neuron \(j\) in layer \(l-1\) to neuron \(i\) in layer \(l\)
\(\mathbf{a}^{[l-1]}\) — Input from previous layer: \(n^{[l-1]} \times 1\)
\(\mathbf{b}^{[l]}\) — Bias vector: \(n^{[l]} \times 1\)
\(\mathbf{z}^{[l]}, \mathbf{a}^{[l]}\) — Output vectors: \(n^{[l]} \times 1\)
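In NumPy the whole layer is a single matrix-vector product. A small sketch with assumed sizes \(n^{[l-1]} = 3\) and \(n^{[l]} = 2\):

import numpy as np

n_prev, n_curr = 3, 2
W = np.random.randn(n_curr, n_prev)     # W^[l]:    (2, 3)
b = np.zeros((n_curr, 1))               # b^[l]:    (2, 1)
a_prev = np.random.randn(n_prev, 1)     # a^[l-1]:  (3, 1)

z = W @ a_prev + b                      # z^[l]:    (2, 1)
a = 1 / (1 + np.exp(-z))                # a^[l]:    (2, 1)
print(z.shape, a.shape)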
Activation Functions
Sigmoid:
\[\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))\]
ReLU:
\[\text{ReLU}(z) = \max(0, z), \quad \text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}\]
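Both activations and their derivatives are one-liners in NumPy; a short sketch:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)              # peaks at 0.25 when z = 0

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)    # 1 for positive inputs, 0 otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), sigmoid_prime(z))
print(relu(z), relu_prime(z))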
Loss Functions
The loss function quantifies how far our predictions are from the true values. It's the objective we're trying to minimize during training.
Mean Squared Error (MSE)
\[\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\]
Binary Cross-Entropy
\[\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]\]
Categorical Cross-Entropy
\[\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c\]
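In NumPy the three losses look like this (a sketch, vectorized over a batch and clipped for numerical stability):

import numpy as np

def mse(y, y_hat):
    return np.mean(0.5 * (y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot labels, y_hat: per-row probability distributions
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))

y = np.array([1.0, 0.0])
y_hat = np.array([0.9, 0.2])
print(mse(y, y_hat), binary_cross_entropy(y, y_hat))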
The Chain Rule
The chain rule is the mathematical foundation of backpropagation. It allows us to compute derivatives of composite functions.
If \(y = f(u)\) and \(u = g(x)\), then: \[\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\]
Computational Graph
Consider a simple computation: \(f = (x + y) \cdot z\)
Each node only needs to know its local gradient—how its output changes with respect to its inputs. The chain rule combines these local gradients to compute global gradients.
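Here is that graph in code, with arbitrarily chosen inputs, computing each local gradient and chaining them:

# Forward and backward through f = (x + y) * z, with made-up inputs.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: each node applies its local gradient, chained together
df_dq = z            # local gradient of f = q*z w.r.t. q  -> -4
df_dz = q            # local gradient of f = q*z w.r.t. z  ->  3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4
print(df_dx, df_dy, df_dz)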
The Backpropagation Algorithm
Now we combine everything: forward propagation, loss functions, and the chain rule to derive backpropagation—the elegant algorithm that makes training deep networks possible.
The Key Quantity: δ (Delta) — The Error Signal
The delta (δ) is the most important concept in backpropagation. It represents the error signal at each layer—specifically, how much the total loss would change if we slightly changed the pre-activation value at that layer.
For layer \(l\), the delta is defined as the partial derivative of the loss with respect to the pre-activation:
\[\boldsymbol{\delta}^{[l]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}}\]
🔍 Breaking Down Delta — What Each Part Means
Let's dissect the delta formula piece by piece:
\(\boldsymbol{\delta}^{[l]}\) — The error signal vector at layer \(l\)
- This is a vector with one value per neuron in layer \(l\)
- Each element tells us: "How much would the loss change if this neuron's pre-activation changed?"
\(\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}}\) — Partial derivative of loss w.r.t. pre-activation
- \(\mathcal{L}\) is the scalar loss value (a single number, e.g., an MSE of 0.5)
- \(\mathbf{z}^{[l]}\) is the weighted sum before activation: \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\)
- This derivative captures the sensitivity of loss to changes in \(\mathbf{z}\)
We define delta at \(\mathbf{z}\) (pre-activation) rather than \(\mathbf{a}\) (post-activation) because weights directly affect \(\mathbf{z}\). This makes computing weight gradients simpler: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \boldsymbol{\delta} \cdot \mathbf{a}^T\). If we used \(\mathbf{a}\), we'd need an extra chain rule step every time.
🎯 Why Delta Matters — The Bridge to Gradients
Once we have \(\boldsymbol{\delta}^{[l]}\), computing the gradients we need for learning becomes trivial:
Weight Gradient:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\]
↳ "Error at this layer" × "Input to this layer" = "How to adjust weights"
Bias Gradient:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}\]
↳ The bias gradient is simply delta itself (since \(\frac{\partial z}{\partial b} = 1\))
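In code, both gradients fall out immediately once delta is known; a small sketch with made-up values:

import numpy as np

delta = np.array([[0.1], [-0.2]])          # delta^[l]:  (2, 1)
a_prev = np.array([[0.5], [0.3], [0.8]])   # a^[l-1]:    (3, 1)

dW = delta @ a_prev.T    # outer product, shape (2, 3), matches W^[l]
db = delta               # shape (2, 1), matches b^[l]
print(dW)
print(db)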
The Backpropagation Equations — Step by Step
Step 1: Output Layer Error (δ at layer L)
At the output layer, we can compute delta directly from the loss function:
\[\boldsymbol{\delta}^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \odot \sigma'(\mathbf{z}^{[L]})\]
\(\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}}\) — How loss changes with output activation
- For MSE loss \(\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\): this equals \((\hat{y} - y)\)
- For Binary Cross-Entropy: this equals \(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
\(\sigma'(\mathbf{z}^{[L]})\) — Derivative of activation function
- For Sigmoid: \(\sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1-a)\)
- For ReLU: \(\sigma'(z) = 1\) if \(z > 0\), else \(0\)
\(\odot\) — Element-wise (Hadamard) product
- Multiplies corresponding elements: \([a,b] \odot [c,d] = [ac, bd]\)
- Each neuron's error is scaled by its activation's sensitivity
Step 2: Propagate Error Backward (δ for hidden layers)
This is where the "back" in backpropagation happens. We propagate the error signal from layer \(l+1\) back to layer \(l\):
\[\boldsymbol{\delta}^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \right) \odot \sigma'(\mathbf{z}^{[l]})\]
\((\mathbf{W}^{[l+1]})^T\) — Transposed weight matrix
- The transpose "reverses" the forward connections
- In forward pass: layer \(l\) → layer \(l+1\) via \(\mathbf{W}^{[l+1]}\)
- In backward pass: layer \(l+1\) → layer \(l\) via \((\mathbf{W}^{[l+1]})^T\)
\((\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}\) — Weighted error distribution
- Each neuron in layer \(l\) contributed to multiple neurons in layer \(l+1\)
- This term sums up all error contributions, weighted by connection strength
- Strong weights → more blame assigned to that neuron
\(\odot \sigma'(\mathbf{z}^{[l]})\) — Local gradient scaling
- Scales the propagated error by the local activation sensitivity
- If activation was saturated (\(\sigma' \approx 0\)), gradient vanishes
- This is why ReLU helps: \(\sigma'(z) = 1\) for positive inputs
Think of delta as "blame assignment." The output layer knows the final error. It passes blame backward through the network: "You contributed this much weight to my input, so you get this much blame." Each layer receives blame proportional to its contribution, scaled by how much it could have changed things (the activation derivative).
Complete Algorithm Summary
- Forward Pass: Compute all \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\) and \(\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\)
- Output Error: Compute \(\boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}} \mathcal{L} \odot \sigma'(\mathbf{z}^{[L]})\)
- Backpropagate: For \(l = L-1, \ldots, 1\): compute \(\boldsymbol{\delta}^{[l]} = ((\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}) \odot \sigma'(\mathbf{z}^{[l]})\)
- Compute Gradients: \(\nabla_{\mathbf{W}^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\) and \(\nabla_{\mathbf{b}^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]}\)
- Update Weights: \(\mathbf{W}^{[l]} \leftarrow \mathbf{W}^{[l]} - \eta \nabla_{\mathbf{W}^{[l]}} \mathcal{L}\)
Let's derive the backpropagation equation using the chain rule:
We want \(\frac{\partial \mathcal{L}}{\partial z_j^{[l]}}\) for neuron \(j\) in layer \(l\).
The pre-activation \(z_j^{[l]}\) affects the loss through all neurons in layer \(l+1\):
\[\frac{\partial \mathcal{L}}{\partial z_j^{[l]}} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{[l+1]}} \cdot \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}}\]
Now, \(z_k^{[l+1]} = \sum_j W_{kj}^{[l+1]} a_j^{[l]} + b_k^{[l+1]}\) and \(a_j^{[l]} = \sigma(z_j^{[l]})\), so:
\[\frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = W_{kj}^{[l+1]} \cdot \sigma'(z_j^{[l]})\]
Substituting back:
\[\delta_j^{[l]} = \sum_k \delta_k^{[l+1]} \cdot W_{kj}^{[l+1]} \cdot \sigma'(z_j^{[l]})\]
In matrix form (pulling out the common \(\sigma'\) term):
\[\boldsymbol{\delta}^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \right) \odot \sigma'(\mathbf{z}^{[l]})\]
Worked Example
Let's work through a complete example with a simple 2-layer network. We'll compute every value step by step so you can see exactly how backpropagation works.
Network Setup
Architecture: 2 inputs → 2 hidden neurons → 1 output
Activation: Sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\)
Loss: Mean Squared Error \(\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\)
Input: \(\mathbf{x} = [0.5, 0.3]^T\), Target: \(y = 1\)
Layer 1 weights and biases:
\[\mathbf{W}^{[1]} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}\]
Layer 2 weights and bias:
\[\mathbf{W}^{[2]} = \begin{bmatrix} 0.5 & 0.6 \end{bmatrix}, \quad b^{[2]} = 0.1\]
Forward Pass — Computing Activations
Hidden Layer (Layer 1)
Apply: \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}\)
\[z_1^{[1]} = (0.1 \times 0.5) + (0.2 \times 0.3) + 0.1 = 0.05 + 0.06 + 0.1 = \mathbf{0.21}\]
\[z_2^{[1]} = (0.3 \times 0.5) + (0.4 \times 0.3) + 0.1 = 0.15 + 0.12 + 0.1 = \mathbf{0.37}\]
Apply: \(\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})\)
\[a_1^{[1]} = \sigma(0.21) = \frac{1}{1 + e^{-0.21}} = \frac{1}{1.8106} = \mathbf{0.5523}\]
\[a_2^{[1]} = \sigma(0.37) = \frac{1}{1 + e^{-0.37}} = \frac{1}{1.6907} = \mathbf{0.5914}\]
Output Layer (Layer 2)
Apply: \(z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]}\)
\[z^{[2]} = (0.5 \times 0.5523) + (0.6 \times 0.5914) + 0.1\]
\[= 0.2762 + 0.3548 + 0.1 = \mathbf{0.731}\]
Apply sigmoid to get prediction:
\[\hat{y} = \sigma(0.731) = \frac{1}{1 + e^{-0.731}} = \mathbf{0.6750}\]
Backward Pass — Computing Deltas & Gradients
Output Layer Delta (δ²)
Formula: \(\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial a^{[2]}} \cdot \sigma'(z^{[2]})\)
Part 1: Loss derivative w.r.t. output
\[\frac{\partial \mathcal{L}}{\partial a^{[2]}} = \frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}(y-\hat{y})^2\right] = -(y - \hat{y}) = -(1 - 0.6750) = \mathbf{-0.3250}\]
Part 2: Sigmoid derivative
\[\sigma'(z^{[2]}) = \sigma(z^{[2]})(1 - \sigma(z^{[2]})) = 0.6750 \times (1 - 0.6750) = 0.6750 \times 0.3250 = \mathbf{0.2194}\]
Final: Multiply together
\[\delta^{[2]} = (-0.3250) \times 0.2194 = \mathbf{-0.0713}\]
Hidden Layer Delta (δ¹)
Formula: \(\boldsymbol{\delta}^{[1]} = ((\mathbf{W}^{[2]})^T \delta^{[2]}) \odot \sigma'(\mathbf{z}^{[1]})\)
Part 1: Propagate error backward
\[(\mathbf{W}^{[2]})^T \delta^{[2]} = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix} \times (-0.0713) = \begin{bmatrix} -0.0356 \\ -0.0428 \end{bmatrix}\]
Part 2: Compute sigmoid derivatives for each hidden neuron
\[\sigma'(z_1^{[1]}) = a_1^{[1]}(1-a_1^{[1]}) = 0.5523 \times 0.4477 = \mathbf{0.2472}\]
\[\sigma'(z_2^{[1]}) = a_2^{[1]}(1-a_2^{[1]}) = 0.5914 \times 0.4086 = \mathbf{0.2417}\]
Final: Element-wise multiply
\[\boldsymbol{\delta}^{[1]} = \begin{bmatrix} -0.0356 \\ -0.0428 \end{bmatrix} \odot \begin{bmatrix} 0.2472 \\ 0.2417 \end{bmatrix} = \begin{bmatrix} \mathbf{-0.0088} \\ \mathbf{-0.0103} \end{bmatrix}\]
Computing Weight Gradients
Formula: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} (\mathbf{a}^{[1]})^T\)
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = -0.0713 \times [0.5523, 0.5914] = [\mathbf{-0.0394}, \mathbf{-0.0422}]\]
\[\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} = \mathbf{-0.0713}\]
Formula: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \boldsymbol{\delta}^{[1]} \mathbf{x}^T\)
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \begin{bmatrix} -0.0088 \\ -0.0103 \end{bmatrix} \times [0.5, 0.3] = \begin{bmatrix} -0.0044 & -0.0026 \\ -0.0052 & -0.0031 \end{bmatrix}\]
Weight Update (with η = 0.5)
Formula: \(\mathbf{W}_{new} = \mathbf{W}_{old} - \eta \cdot \nabla \mathcal{L}\)
\[W^{[2]}_{0,new} = 0.5 - 0.5 \times (-0.0394) = 0.5 + 0.0197 = \mathbf{0.5197}\]
\[W^{[2]}_{1,new} = 0.6 - 0.5 \times (-0.0422) = 0.6 + 0.0211 = \mathbf{0.6211}\]
Notice: the gradients are negative, so gradient descent increases these weights, pushing the output closer to the target of 1.
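A few lines of NumPy reproduce every number above (values in comments are rounded to four decimals):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Setup from the worked example
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([[0.1], [0.1]])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([[0.1]])
x = np.array([[0.5], [0.3]])
y = 1.0

# Forward pass
z1 = W1 @ x + b1              # [[0.21], [0.37]]
a1 = sigmoid(z1)              # [[0.5523], [0.5914]]
z2 = W2 @ a1 + b2             # [[0.731]]
y_hat = sigmoid(z2)           # [[0.6750]]

# Backward pass
delta2 = (y_hat - y) * y_hat * (1 - y_hat)    # [[-0.0713]]
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # [[-0.0088], [-0.0103]]

dW2 = delta2 @ a1.T           # [[-0.0394, -0.0422]]
dW1 = delta1 @ x.T
W2_new = W2 - 0.5 * dW2       # [[0.5197, 0.6211]]
print(z2, y_hat, delta2, dW2, W2_new, sep="\n")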
Implementation
Here's a complete Python implementation of backpropagation from scratch:
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes):
        """Initialize weights with He-style scaling: sqrt(2 / fan_in)"""
        self.L = len(layer_sizes) - 1
        self.weights = {}
        self.biases = {}
        for l in range(1, self.L + 1):
            self.weights[l] = np.random.randn(
                layer_sizes[l], layer_sizes[l-1]
            ) * np.sqrt(2 / layer_sizes[l-1])
            self.biases[l] = np.zeros((layer_sizes[l], 1))

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        """Forward pass: cache z and a at every layer for the backward pass"""
        self.cache = {'a': {0: X}, 'z': {}}
        a = X
        for l in range(1, self.L + 1):
            z = self.weights[l] @ a + self.biases[l]
            a = self.sigmoid(z)
            self.cache['z'][l] = z
            self.cache['a'][l] = a
        return a

    def backward(self, y_true):
        """Backpropagation: compute deltas layer by layer, then gradients"""
        m = y_true.shape[1]
        gradients = {'dW': {}, 'db': {}}
        # Output layer: delta = dL/da * sigma'(z) for MSE loss
        a_L = self.cache['a'][self.L]
        z_L = self.cache['z'][self.L]
        delta = (a_L - y_true) * self.sigmoid_derivative(z_L)
        gradients['dW'][self.L] = delta @ self.cache['a'][self.L-1].T / m
        gradients['db'][self.L] = np.sum(delta, axis=1, keepdims=True) / m
        # Hidden layers: propagate delta backward through W^T
        for l in range(self.L - 1, 0, -1):
            delta = (self.weights[l+1].T @ delta) * \
                self.sigmoid_derivative(self.cache['z'][l])
            gradients['dW'][l] = delta @ self.cache['a'][l-1].T / m
            gradients['db'][l] = np.sum(delta, axis=1, keepdims=True) / m
        return gradients

    def train(self, X, y, epochs, lr):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = np.mean((y_pred - y) ** 2) / 2
            grads = self.backward(y)
            for l in range(1, self.L + 1):
                self.weights[l] -= lr * grads['dW'][l]
                self.biases[l] -= lr * grads['db'][l]
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.6f}")

# Example: XOR problem
X = np.array([[0,0,1,1], [0,1,0,1]])
y = np.array([[0,1,1,0]])
nn = NeuralNetwork([2, 4, 1])
nn.train(X, y, epochs=5000, lr=1.0)
print(nn.forward(X))
Advanced Topics
The Vanishing Gradient Problem
When gradients are multiplied through many layers, they can shrink exponentially (vanish) or grow exponentially (explode). Common remedies:
- ReLU activation: Preserves gradient magnitude
- Residual connections: Skip connections for gradient flow
- Batch normalization: Prevents saturation
- Careful initialization: Xavier or He initialization
Gradient Descent Variants
Backpropagation supplies the gradients; the optimizer decides how to use them. Plain SGD applies them directly, while variants such as momentum, RMSProp, and Adam accumulate gradient history to adapt the step size and direction.
Convolutional Layers
Backprop through convolutions is often implemented with the "im2col" trick: unroll the convolution into a matrix multiplication, then apply standard matrix-multiply backprop.
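A minimal 1-D sketch of the idea (a size-2 kernel sliding over 4 inputs; all numbers made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # input signal
k = np.array([0.5, -1.0])             # kernel: 3 valid positions

# "im2col": each row is one receptive field of the input
cols = np.stack([x[i:i+2] for i in range(3)])   # shape (3, 2)
y = cols @ k        # identical to the valid convolution of x with k

# Backward: given upstream dL/dy, the kernel gradient is cols^T @ dL/dy
dy = np.ones(3)     # pretend upstream gradient
dk = cols.T @ dy    # gradient w.r.t. the kernel
print(y, dk)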
Batch Normalization
Requires computing gradients through the normalization statistics. The gradient flows through both the normalized values and the learned scale/shift parameters.
Dropout
During training, gradients only flow through the units that were not dropped (with activations and gradients scaled by \(1/p\), where \(p\) is the keep probability). During inference, no dropout is applied.
Residual Connections
Skip connections allow gradients to flow directly: \(\frac{\partial}{\partial x}(x + F(x)) = 1 + \frac{\partial F}{\partial x}\), preventing vanishing gradients.
Practice & Debug
Gradient Checking
Always verify your gradients numerically! Gradient checking compares your analytical gradients to numerical approximations.
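A minimal central-difference checker (a sketch; the loss f and gradient grad_f below are stand-ins for your own network's):

import numpy as np

def grad_check(f, grad_f, theta, eps=1e-5):
    """Compare grad_f(theta) against a central-difference estimate of f's gradient."""
    num = np.zeros_like(theta)
    for i in range(theta.size):
        t_plus = theta.copy();  t_plus.flat[i] += eps
        t_minus = theta.copy(); t_minus.flat[i] -= eps
        num.flat[i] = (f(t_plus) - f(t_minus)) / (2 * eps)
    ana = grad_f(theta)
    # Relative error around 1e-7 is healthy; above ~1e-3 usually means a bug
    return np.max(np.abs(num - ana)) / (np.max(np.abs(num) + np.abs(ana)) + 1e-12)

# Example: f(w) = 0.5 * ||w||^2 has gradient w
w = np.random.randn(4)
print(grad_check(lambda t: 0.5 * np.sum(t ** 2), lambda t: t, w))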
Common Bugs & Debugging Tips
- Shape mismatches: Always verify tensor dimensions
- Forgetting bias gradients: Remember \(\nabla_b = \delta\)
- Wrong averaging: Divide by batch size in the right place
- Numerical instability: Clip values, use stable softmax
- Gradient not flowing: Check for dead ReLUs, vanishing gradients
Resources
Further Reading
- Rumelhart et al. (1986) - Original backprop paper
- Neural Networks and Deep Learning - Michael Nielsen's online book
- CS231n - Stanford's CNN course notes
- Deep Learning Book - Goodfellow, Bengio, Courville
Framework Documentation
- PyTorch (torch.autograd), TensorFlow (tf.GradientTape), and JAX (jax.grad) all implement backpropagation as reverse-mode automatic differentiation; see their official docs for details.
Key Takeaways
- Backpropagation efficiently computes gradients using the chain rule
- The error term δ captures each neuron's contribution to loss
- Gradients flow backward, scaled by weights and activation derivatives
- Modern frameworks handle backprop automatically via autodiff
- Always verify with gradient checking during development
Summary
Backpropagation is one of the most important algorithms in machine learning. Understanding it deeply will make you a better practitioner and give you intuition for debugging training issues and designing architectures.
Keep practicing, implement it from scratch, and don't be afraid to dive into the math. The concepts here form the foundation of all modern deep learning.