Introduction & Motivation
Backpropagation (backward propagation of errors) is the cornerstone algorithm for training neural networks. First introduced in the 1970s and popularized by Rumelhart, Hinton, and Williams in 1986, it provides an efficient method for computing gradients in multi-layer networks.
At its heart, backpropagation answers a fundamental question: How much does each weight in a neural network contribute to the overall error? By answering this question, we can adjust weights to minimize error and improve predictions.
Backpropagation is an algorithm for computing the gradient of a loss function with respect to the weights of a neural network. It applies the chain rule of calculus systematically, propagating error signals backward through the network from output to input.
Why Do We Need Backpropagation?
Neural networks learn by adjusting their weights to minimize a loss function. This requires knowing the gradient—the direction and magnitude of change that will reduce the loss. For a network with millions of parameters, computing these gradients efficiently is crucial.
Computing each of the \(W\) partial derivatives independently (for example, by finite differences) costs a full forward pass per weight, for \(O(W^2)\) operations in total. Backpropagation computes all of them in a single backward sweep, reducing this to \(O(W)\) and making deep learning computationally feasible.
The Learning Process Overview
- Forward Pass: Input data flows through the network, layer by layer, producing an output prediction.
- Loss Computation: The prediction is compared to the true label using a loss function, producing a scalar error value.
- Backward Pass: Gradients are computed by propagating the error backward through the network using the chain rule.
- Weight Update: Weights are adjusted in the direction that minimizes the loss (gradient descent).
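These four steps map directly onto code. Below is a minimal, self-contained sketch using a single linear neuron and squared-error loss (the numbers are made up for illustration):

# One neuron, one example: y_hat = w*x + b, loss = 0.5*(y_hat - y)^2
w, b, lr = 0.0, 0.0, 0.1
x, y = 2.0, 3.0                       # a made-up training example

for step in range(100):
    y_hat = w * x + b                 # 1. forward pass
    loss = 0.5 * (y_hat - y) ** 2     # 2. loss computation
    dw = (y_hat - y) * x              # 3. backward pass (chain rule)
    db = (y_hat - y)
    w -= lr * dw                      # 4. weight update (gradient descent)
    b -= lr * db

print(w * x + b)                      # prediction approaches the target 3.0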
Forward Propagation
Before understanding backpropagation, we must first understand how data flows forward through a neural network. This forward pass transforms inputs into outputs through a series of linear transformations and non-linear activations.
Neuron Computation
A single neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function:
\[z = \mathbf{w}^T \mathbf{x} + b, \quad a = \sigma(z)\]
\(z\) — Pre-activation (weighted sum)
- The raw output before applying the activation function
- Example: if \(\mathbf{x} = [0.5, 0.3]\), \(\mathbf{w} = [0.4, 0.6]\), \(b = 0.1\)
- Then \(z = (0.4 \times 0.5) + (0.6 \times 0.3) + 0.1 = 0.2 + 0.18 + 0.1 = 0.48\)
\(\sigma(z)\) — Activation function
- Introduces non-linearity (otherwise the network is just linear!)
- Squashes values to a specific range (e.g., sigmoid: 0 to 1)
\(a\) — Activation (output)
- The final output of the neuron after activation
- This becomes the input to neurons in the next layer
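To make this concrete, here is the example above as a few lines of NumPy (sigmoid is assumed as the activation):

import numpy as np

x = np.array([0.5, 0.3])        # inputs
w = np.array([0.4, 0.6])        # weights
b = 0.1                         # bias

z = w @ x + b                   # pre-activation: 0.48
a = 1 / (1 + np.exp(-z))        # sigmoid activation: ~0.6177
print(z, a)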
Layer-wise Computation
For efficiency, we compute all neurons in a layer at once using matrix operations:
For layer \(l\) with \(n^{[l]}\) neurons receiving input from layer \(l-1\) with \(n^{[l-1]}\) neurons:
\[\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}, \quad \mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\]
\(\mathbf{W}^{[l]}\) — Weight matrix: \(n^{[l]} \times n^{[l-1]}\)
- Row \(i\): weights for neuron \(i\) in this layer
- Column \(j\): weights connecting from neuron \(j\) in previous layer
- \(W_{ij}^{[l]}\): weight from neuron \(j\) in layer \(l-1\) to neuron \(i\) in layer \(l\)
\(\mathbf{a}^{[l-1]}\) — Input from previous layer: \(n^{[l-1]} \times 1\)
\(\mathbf{b}^{[l]}\) — Bias vector: \(n^{[l]} \times 1\)
\(\mathbf{z}^{[l]}, \mathbf{a}^{[l]}\) — Output vectors: \(n^{[l]} \times 1\)
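In NumPy the whole layer is a single matrix-vector product. A small sketch with assumed sizes \(n^{[l-1]} = 3\) and \(n^{[l]} = 2\):

import numpy as np

n_prev, n_curr = 3, 2
W = np.random.randn(n_curr, n_prev)     # W^[l]:    (2, 3)
b = np.zeros((n_curr, 1))               # b^[l]:    (2, 1)
a_prev = np.random.randn(n_prev, 1)     # a^[l-1]:  (3, 1)

z = W @ a_prev + b                      # z^[l]:    (2, 1)
a = 1 / (1 + np.exp(-z))                # a^[l]:    (2, 1)
print(z.shape, a.shape)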
Activation Functions
Sigmoid:
\[\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))\]
ReLU:
\[\text{ReLU}(z) = \max(0, z), \quad \text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}\]
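Both activations and their derivatives are one-liners in NumPy; a short sketch:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)              # peaks at 0.25 when z = 0

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)    # 1 for positive inputs, 0 otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), sigmoid_prime(z))
print(relu(z), relu_prime(z))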
Loss Functions
The loss function quantifies how far our predictions are from the true values. It's the objective we're trying to minimize during training.
Mean Squared Error (MSE)
\[\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\]
Binary Cross-Entropy
\[\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]\]
Categorical Cross-Entropy
\[\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c\]
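In NumPy the three losses look like this (a sketch, vectorized over a batch and clipped for numerical stability):

import numpy as np

def mse(y, y_hat):
    return np.mean(0.5 * (y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot labels, y_hat: per-row probability distributions
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))

y = np.array([1.0, 0.0])
y_hat = np.array([0.9, 0.2])
print(mse(y, y_hat), binary_cross_entropy(y, y_hat))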
The Chain Rule
The chain rule is the mathematical foundation of backpropagation. It allows us to compute derivatives of composite functions.
If \(y = f(u)\) and \(u = g(x)\), then: \[\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\]
Computational Graph
Consider a simple computation: \(f = (x + y) \cdot z\)
Each node only needs to know its local gradient—how its output changes with respect to its inputs. The chain rule combines these local gradients to compute global gradients.
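Here is that graph in code, with arbitrarily chosen inputs, computing each local gradient and chaining them:

# Forward and backward through f = (x + y) * z, with made-up inputs.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: each node applies its local gradient, chained together
df_dq = z            # local gradient of f = q*z w.r.t. q  -> -4
df_dz = q            # local gradient of f = q*z w.r.t. z  ->  3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4
print(df_dx, df_dy, df_dz)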
The Backpropagation Algorithm
Now we combine everything: forward propagation, loss functions, and the chain rule to derive backpropagation—the elegant algorithm that makes training deep networks possible.
The Key Quantity: δ (Delta) — The Error Signal
The delta (δ) is the most important concept in backpropagation. It represents the error signal at each layer—specifically, how much the total loss would change if we slightly changed the pre-activation value at that layer.
For layer \(l\), the delta is defined as the partial derivative of the loss with respect to the pre-activation:
\[\boldsymbol{\delta}^{[l]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}}\]
🔍 Breaking Down Delta — What Each Part Means
Let's dissect the delta formula piece by piece:
\(\boldsymbol{\delta}^{[l]}\) — The error signal vector at layer \(l\)
- This is a vector with one value per neuron in layer \(l\)
- Each element tells us: "How much would the loss change if this neuron's pre-activation changed?"
\(\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}}\) — Partial derivative of loss w.r.t. pre-activation
- \(\mathcal{L}\) is the scalar loss value (a single number, e.g., an MSE of 0.5)
- \(\mathbf{z}^{[l]}\) is the weighted sum before activation: \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\)
- This derivative captures the sensitivity of loss to changes in \(\mathbf{z}\)
We define delta at \(\mathbf{z}\) (pre-activation) rather than \(\mathbf{a}\) (post-activation) because weights directly affect \(\mathbf{z}\). This makes computing weight gradients simpler: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \boldsymbol{\delta} \cdot \mathbf{a}^T\). If we used \(\mathbf{a}\), we'd need an extra chain rule step every time.
🎯 Why Delta Matters — The Bridge to Gradients
Once we have \(\boldsymbol{\delta}^{[l]}\), computing the gradients we need for learning becomes trivial:
Weight Gradient:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\]
↳ "Error at this layer" × "Input to this layer" = "How to adjust weights"
Bias Gradient:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}\]
↳ The bias gradient is simply delta itself (since \(\frac{\partial z}{\partial b} = 1\))
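In code, both gradients fall out immediately once delta is known; a small sketch with made-up values:

import numpy as np

delta = np.array([[0.1], [-0.2]])          # delta^[l]:  (2, 1)
a_prev = np.array([[0.5], [0.3], [0.8]])   # a^[l-1]:    (3, 1)

dW = delta @ a_prev.T    # outer product, shape (2, 3), matches W^[l]
db = delta               # shape (2, 1), matches b^[l]
print(dW)
print(db)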
The Backpropagation Equations — Step by Step
Step 1: Output Layer Error (δ at layer L)
At the output layer, we can compute delta directly from the loss function:
\[\boldsymbol{\delta}^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \odot \sigma'(\mathbf{z}^{[L]})\]
\(\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}}\) — How loss changes with output activation
- For MSE loss \(\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\): this equals \((\hat{y} - y)\)
- For Binary Cross-Entropy: this equals \(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
\(\sigma'(\mathbf{z}^{[L]})\) — Derivative of activation function
- For Sigmoid: \(\sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1-a)\)
- For ReLU: \(\sigma'(z) = 1\) if \(z > 0\), else \(0\)
\(\odot\) — Element-wise (Hadamard) product
- Multiplies corresponding elements: \([a,b] \odot [c,d] = [ac, bd]\)
- Each neuron's error is scaled by its activation's sensitivity
Step 2: Propagate Error Backward (δ for hidden layers)
This is where the "back" in backpropagation happens. We propagate the error signal from layer \(l+1\) back to layer \(l\):
\[\boldsymbol{\delta}^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \right) \odot \sigma'(\mathbf{z}^{[l]})\]
\((\mathbf{W}^{[l+1]})^T\) — Transposed weight matrix
- The transpose "reverses" the forward connections
- In forward pass: layer \(l\) → layer \(l+1\) via \(\mathbf{W}^{[l+1]}\)
- In backward pass: layer \(l+1\) → layer \(l\) via \((\mathbf{W}^{[l+1]})^T\)
\((\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}\) — Weighted error distribution
- Each neuron in layer \(l\) contributed to multiple neurons in layer \(l+1\)
- This term sums up all error contributions, weighted by connection strength
- Strong weights → more blame assigned to that neuron
\(\odot \sigma'(\mathbf{z}^{[l]})\) — Local gradient scaling
- Scales the propagated error by the local activation sensitivity
- If activation was saturated (\(\sigma' \approx 0\)), gradient vanishes
- This is why ReLU helps: \(\sigma'(z) = 1\) for positive inputs
Think of delta as "blame assignment." The output layer knows the final error. It passes blame backward through the network: "You contributed this much weight to my input, so you get this much blame." Each layer receives blame proportional to its contribution, scaled by how much it could have changed things (the activation derivative).
Complete Algorithm Summary
- Forward Pass: Compute all \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\) and \(\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\)
- Output Error: Compute \(\boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}} \mathcal{L} \odot \sigma'(\mathbf{z}^{[L]})\)
- Backpropagate: For \(l = L-1, \ldots, 1\): compute \(\boldsymbol{\delta}^{[l]} = ((\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}) \odot \sigma'(\mathbf{z}^{[l]})\)
- Compute Gradients: \(\nabla_{\mathbf{W}^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\) and \(\nabla_{\mathbf{b}^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]}\)
- Update Weights: \(\mathbf{W}^{[l]} \leftarrow \mathbf{W}^{[l]} - \eta \nabla_{\mathbf{W}^{[l]}} \mathcal{L}\)
Let's derive the backpropagation equation using the chain rule:
We want \(\frac{\partial \mathcal{L}}{\partial z_j^{[l]}}\) for neuron \(j\) in layer \(l\).
The pre-activation \(z_j^{[l]}\) affects the loss through all neurons in layer \(l+1\):
\[\frac{\partial \mathcal{L}}{\partial z_j^{[l]}} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{[l+1]}} \cdot \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}}\]
Now, \(z_k^{[l+1]} = \sum_j W_{kj}^{[l+1]} a_j^{[l]} + b_k^{[l+1]}\) and \(a_j^{[l]} = \sigma(z_j^{[l]})\), so:
\[\frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = W_{kj}^{[l+1]} \cdot \sigma'(z_j^{[l]})\]
Substituting back:
\[\delta_j^{[l]} = \sum_k \delta_k^{[l+1]} \cdot W_{kj}^{[l+1]} \cdot \sigma'(z_j^{[l]})\]
In matrix form (pulling out the common \(\sigma'\) term):
\[\boldsymbol{\delta}^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \right) \odot \sigma'(\mathbf{z}^{[l]})\]
Worked Example
Let's work through a complete example with a simple 2-layer network. We'll compute every value step by step so you can see exactly how backpropagation works.
Network Setup
Architecture: 2 inputs → 2 hidden neurons → 1 output
Activation: Sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\)
Loss: Mean Squared Error \(\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\)
Input: \(\mathbf{x} = [0.5, 0.3]^T\), Target: \(y = 1\)
Layer 1 weights and biases:
\[\mathbf{W}^{[1]} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}\]
Layer 2 weights and bias:
\[\mathbf{W}^{[2]} = \begin{bmatrix} 0.5 & 0.6 \end{bmatrix}, \quad b^{[2]} = 0.1\]
Forward Pass — Computing Activations
Hidden Layer (Layer 1)
Apply: \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}\)
\[z_1^{[1]} = (0.1 \times 0.5) + (0.2 \times 0.3) + 0.1 = 0.05 + 0.06 + 0.1 = \mathbf{0.21}\]
\[z_2^{[1]} = (0.3 \times 0.5) + (0.4 \times 0.3) + 0.1 = 0.15 + 0.12 + 0.1 = \mathbf{0.37}\]
Apply: \(\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})\)
\[a_1^{[1]} = \sigma(0.21) = \frac{1}{1 + e^{-0.21}} = \frac{1}{1.8106} = \mathbf{0.5523}\]
\[a_2^{[1]} = \sigma(0.37) = \frac{1}{1 + e^{-0.37}} = \frac{1}{1.6907} = \mathbf{0.5914}\]
Output Layer (Layer 2)
Apply: \(z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]}\)
\[z^{[2]} = (0.5 \times 0.5523) + (0.6 \times 0.5914) + 0.1\]
\[= 0.2762 + 0.3548 + 0.1 = \mathbf{0.731}\]
Apply sigmoid to get prediction:
\[\hat{y} = \sigma(0.731) = \frac{1}{1 + e^{-0.731}} = \mathbf{0.6750}\]
Backward Pass — Computing Deltas & Gradients
Output Layer Delta (δ²)
Formula: \(\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial a^{[2]}} \cdot \sigma'(z^{[2]})\)
Part 1: Loss derivative w.r.t. output
\[\frac{\partial \mathcal{L}}{\partial a^{[2]}} = \frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}(y-\hat{y})^2\right] = -(y - \hat{y}) = -(1 - 0.6750) = \mathbf{-0.3250}\]
Part 2: Sigmoid derivative
\[\sigma'(z^{[2]}) = \sigma(z^{[2]})(1 - \sigma(z^{[2]})) = 0.6750 \times (1 - 0.6750) = 0.6750 \times 0.3250 = \mathbf{0.2194}\]
Final: Multiply together
\[\delta^{[2]} = (-0.3250) \times 0.2194 = \mathbf{-0.0713}\]
Hidden Layer Delta (δ¹)
Formula: \(\boldsymbol{\delta}^{[1]} = ((\mathbf{W}^{[2]})^T \delta^{[2]}) \odot \sigma'(\mathbf{z}^{[1]})\)
Part 1: Propagate error backward
\[(\mathbf{W}^{[2]})^T \delta^{[2]} = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix} \times (-0.0713) = \begin{bmatrix} -0.0356 \\ -0.0428 \end{bmatrix}\]
Part 2: Compute sigmoid derivatives for each hidden neuron
\[\sigma'(z_1^{[1]}) = a_1^{[1]}(1-a_1^{[1]}) = 0.5523 \times 0.4477 = \mathbf{0.2472}\]
\[\sigma'(z_2^{[1]}) = a_2^{[1]}(1-a_2^{[1]}) = 0.5914 \times 0.4086 = \mathbf{0.2417}\]
Final: Element-wise multiply
\[\boldsymbol{\delta}^{[1]} = \begin{bmatrix} -0.0356 \\ -0.0428 \end{bmatrix} \odot \begin{bmatrix} 0.2472 \\ 0.2417 \end{bmatrix} = \begin{bmatrix} \mathbf{-0.0088} \\ \mathbf{-0.0103} \end{bmatrix}\]
Computing Weight Gradients
Formula: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} (\mathbf{a}^{[1]})^T\)
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = -0.0713 \times [0.5523, 0.5914] = [\mathbf{-0.0394}, \mathbf{-0.0422}]\]
\[\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} = \mathbf{-0.0713}\]
Formula: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \boldsymbol{\delta}^{[1]} \mathbf{x}^T\)
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \begin{bmatrix} -0.0088 \\ -0.0103 \end{bmatrix} \times [0.5, 0.3] = \begin{bmatrix} -0.0044 & -0.0026 \\ -0.0052 & -0.0031 \end{bmatrix}\]
Weight Update (with η = 0.5)
Formula: \(\mathbf{W}_{new} = \mathbf{W}_{old} - \eta \cdot \nabla \mathcal{L}\)
\[W^{[2]}_{0,new} = 0.5 - 0.5 \times (-0.0394) = 0.5 + 0.0197 = \mathbf{0.5197}\]
\[W^{[2]}_{1,new} = 0.6 - 0.5 \times (-0.0422) = 0.6 + 0.0211 = \mathbf{0.6211}\]
Notice: the gradients are negative, so gradient descent increases these weights, pushing the output closer to the target of 1.
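A few lines of NumPy reproduce every number above (values in comments are rounded to four decimals):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Setup from the worked example
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([[0.1], [0.1]])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([[0.1]])
x = np.array([[0.5], [0.3]])
y = 1.0

# Forward pass
z1 = W1 @ x + b1              # [[0.21], [0.37]]
a1 = sigmoid(z1)              # [[0.5523], [0.5914]]
z2 = W2 @ a1 + b2             # [[0.731]]
y_hat = sigmoid(z2)           # [[0.6750]]

# Backward pass
delta2 = (y_hat - y) * y_hat * (1 - y_hat)    # [[-0.0713]]
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # [[-0.0088], [-0.0103]]

dW2 = delta2 @ a1.T           # [[-0.0394, -0.0422]]
dW1 = delta1 @ x.T
W2_new = W2 - 0.5 * dW2       # [[0.5197, 0.6211]]
print(z2, y_hat, delta2, dW2, W2_new, sep="\n")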
Implementation
Here's a complete Python implementation of backpropagation from scratch:
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes):
        """Initialize weights with He-style scaling: sqrt(2 / fan_in)"""
        self.L = len(layer_sizes) - 1
        self.weights = {}
        self.biases = {}
        for l in range(1, self.L + 1):
            self.weights[l] = np.random.randn(
                layer_sizes[l], layer_sizes[l-1]
            ) * np.sqrt(2 / layer_sizes[l-1])
            self.biases[l] = np.zeros((layer_sizes[l], 1))

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        """Forward pass: cache z and a at every layer for the backward pass"""
        self.cache = {'a': {0: X}, 'z': {}}
        a = X
        for l in range(1, self.L + 1):
            z = self.weights[l] @ a + self.biases[l]
            a = self.sigmoid(z)
            self.cache['z'][l] = z
            self.cache['a'][l] = a
        return a

    def backward(self, y_true):
        """Backpropagation: compute deltas layer by layer, then gradients"""
        m = y_true.shape[1]
        gradients = {'dW': {}, 'db': {}}
        # Output layer: delta = dL/da * sigma'(z) for MSE loss
        a_L = self.cache['a'][self.L]
        z_L = self.cache['z'][self.L]
        delta = (a_L - y_true) * self.sigmoid_derivative(z_L)
        gradients['dW'][self.L] = delta @ self.cache['a'][self.L-1].T / m
        gradients['db'][self.L] = np.sum(delta, axis=1, keepdims=True) / m
        # Hidden layers: propagate delta backward through W^T
        for l in range(self.L - 1, 0, -1):
            delta = (self.weights[l+1].T @ delta) * \
                self.sigmoid_derivative(self.cache['z'][l])
            gradients['dW'][l] = delta @ self.cache['a'][l-1].T / m
            gradients['db'][l] = np.sum(delta, axis=1, keepdims=True) / m
        return gradients

    def train(self, X, y, epochs, lr):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = np.mean((y_pred - y) ** 2) / 2
            grads = self.backward(y)
            for l in range(1, self.L + 1):
                self.weights[l] -= lr * grads['dW'][l]
                self.biases[l] -= lr * grads['db'][l]
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.6f}")

# Example: XOR problem
X = np.array([[0,0,1,1], [0,1,0,1]])
y = np.array([[0,1,1,0]])
nn = NeuralNetwork([2, 4, 1])
nn.train(X, y, epochs=5000, lr=1.0)
print(nn.forward(X))
Advanced Topics
The Vanishing Gradient Problem
When gradients are multiplied through many layers, they can shrink exponentially (vanish) or grow exponentially (explode). Common remedies:
- ReLU activation: Preserves gradient magnitude
- Residual connections: Skip connections for gradient flow
- Batch normalization: Prevents saturation
- Careful initialization: Xavier or He initialization
Gradient Descent Variants
Backpropagation supplies the gradients; the optimizer decides how to use them. Plain SGD applies them directly, while variants such as momentum, RMSProp, and Adam accumulate gradient history to adapt the step size and direction.
Convolutional Layers
Backprop through convolutions is often implemented with the "im2col" trick: unroll the convolution into a matrix multiplication, then apply standard matrix-multiply backprop.
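A minimal 1-D sketch of the idea (a size-2 kernel sliding over 4 inputs; all numbers made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # input signal
k = np.array([0.5, -1.0])             # kernel: 3 valid positions

# "im2col": each row is one receptive field of the input
cols = np.stack([x[i:i+2] for i in range(3)])   # shape (3, 2)
y = cols @ k        # identical to the valid convolution of x with k

# Backward: given upstream dL/dy, the kernel gradient is cols^T @ dL/dy
dy = np.ones(3)     # pretend upstream gradient
dk = cols.T @ dy    # gradient w.r.t. the kernel
print(y, dk)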
Batch Normalization
Requires computing gradients through the normalization statistics. The gradient flows through both the normalized values and the learned scale/shift parameters.
Dropout
During training, gradients only flow through the units that were not dropped (with activations and gradients scaled by \(1/p\), where \(p\) is the keep probability). During inference, no dropout is applied.
Residual Connections
Skip connections allow gradients to flow directly: \(\frac{\partial}{\partial x}(x + F(x)) = 1 + \frac{\partial F}{\partial x}\), preventing vanishing gradients.
Practice & Debug
Gradient Checking
Always verify your gradients numerically! Gradient checking compares your analytical gradients to numerical approximations.
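A minimal central-difference checker (a sketch; the loss f and gradient grad_f below are stand-ins for your own network's):

import numpy as np

def grad_check(f, grad_f, theta, eps=1e-5):
    """Compare grad_f(theta) against a central-difference estimate of f's gradient."""
    num = np.zeros_like(theta)
    for i in range(theta.size):
        t_plus = theta.copy();  t_plus.flat[i] += eps
        t_minus = theta.copy(); t_minus.flat[i] -= eps
        num.flat[i] = (f(t_plus) - f(t_minus)) / (2 * eps)
    ana = grad_f(theta)
    # Relative error around 1e-7 is healthy; above ~1e-3 usually means a bug
    return np.max(np.abs(num - ana)) / (np.max(np.abs(num) + np.abs(ana)) + 1e-12)

# Example: f(w) = 0.5 * ||w||^2 has gradient w
w = np.random.randn(4)
print(grad_check(lambda t: 0.5 * np.sum(t ** 2), lambda t: t, w))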
Common Bugs & Debugging Tips
- Shape mismatches: Always verify tensor dimensions
- Forgetting bias gradients: Remember \(\nabla_b = \delta\)
- Wrong averaging: Divide by batch size in the right place
- Numerical instability: Clip values, use stable softmax
- Gradient not flowing: Check for dead ReLUs, vanishing gradients
Resources
Further Reading
- Rumelhart et al. (1986) - Original backprop paper
- Neural Networks and Deep Learning - Michael Nielsen's online book
- CS231n - Stanford's CNN course notes
- Deep Learning Book - Goodfellow, Bengio, Courville
Framework Documentation
- PyTorch (torch.autograd), TensorFlow (tf.GradientTape), and JAX (jax.grad) all implement backpropagation as reverse-mode automatic differentiation; see their official docs for details.
Key Takeaways
- Backpropagation efficiently computes gradients using the chain rule
- The error term δ captures each neuron's contribution to loss
- Gradients flow backward, scaled by weights and activation derivatives
- Modern frameworks handle backprop automatically via autodiff
- Always verify with gradient checking during development
Summary
Backpropagation is one of the most important algorithms in machine learning. Understanding it deeply will make you a better practitioner and give you intuition for debugging training issues and designing architectures.
Keep practicing, implement it from scratch, and don't be afraid to dive into the math. The concepts here form the foundation of all modern deep learning.