Deep Learning Fundamentals

Understanding Backpropagation

A comprehensive guide to the algorithm that revolutionized neural networks. Master the theory, mathematics, and intuition behind gradient computation.

01

Introduction & Motivation

Backpropagation (backward propagation of errors) is the cornerstone algorithm for training neural networks. First introduced in the 1970s and popularized by Rumelhart, Hinton, and Williams in 1986, it provides an efficient method for computing gradients in multi-layer networks.

At its heart, backpropagation answers a fundamental question: How much does each weight in a neural network contribute to the overall error? By answering this question, we can adjust weights to minimize error and improve predictions.

Definition: Backpropagation

Backpropagation is an algorithm for computing the gradient of a loss function with respect to the weights of a neural network. It applies the chain rule of calculus systematically, propagating error signals backward through the network from output to input.

Why Do We Need Backpropagation?

Neural networks learn by adjusting their weights to minimize a loss function. This requires knowing the gradient—the direction and magnitude of change that will reduce the loss. For a network with millions of parameters, computing these gradients efficiently is crucial.

💡
Key Insight

Without backpropagation, computing gradients would require \(O(W^2)\) operations (where \(W\) is the number of weights). Backpropagation reduces this to \(O(W)\), making deep learning computationally feasible.

The Learning Process Overview

  1. Forward Pass: Input data flows through the network, layer by layer, producing an output prediction.
  2. Loss Computation: The prediction is compared to the true label using a loss function, producing a scalar error value.
  3. Backward Pass: Gradients are computed by propagating the error backward through the network using the chain rule.
  4. Weight Update: Weights are adjusted in the direction that minimizes the loss (gradient descent).
[Animated figure: data flows forward through the network, then gradients flow backward.]
02

Forward Propagation

Before understanding backpropagation, we must first understand how data flows forward through a neural network. This forward pass transforms inputs into outputs through a series of linear transformations and non-linear activations.

Neuron Computation

A single neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function:

Single Neuron \[z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b\] \[a = \sigma(z)\]
Formula Breakdown — Single Neuron

\(z\) — Pre-activation (weighted sum)

  • The raw output before applying the activation function
  • Example: if \(\mathbf{x} = [0.5, 0.3]\), \(\mathbf{w} = [0.4, 0.6]\), \(b = 0.1\)
  • Then \(z = (0.4 \times 0.5) + (0.6 \times 0.3) + 0.1 = 0.2 + 0.18 + 0.1 = 0.48\)

\(\sigma(z)\) — Activation function

  • Introduces non-linearity (otherwise the network is just linear!)
  • Squashes values to a specific range (e.g., sigmoid: 0 to 1)

\(a\) — Activation (output)

  • The final output of the neuron after activation
  • This becomes the input to neurons in the next layer
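To make this concrete, here is a minimal NumPy sketch of the single-neuron computation, reusing the example numbers from the breakdown above (the variable names are ours):

Python
import numpy as np

x = np.array([0.5, 0.3])   # inputs
w = np.array([0.4, 0.6])   # weights
b = 0.1                    # bias

z = w @ x + b              # pre-activation: 0.2 + 0.18 + 0.1 = 0.48
a = 1 / (1 + np.exp(-z))   # sigmoid activation: ~0.6177

print(z, a)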

Layer-wise Computation

For efficiency, we compute all neurons in a layer at once using matrix operations:

Layer Computation \[\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\] \[\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\]
Formula Breakdown — Matrix Dimensions

For layer \(l\) with \(n^{[l]}\) neurons receiving input from layer \(l-1\) with \(n^{[l-1]}\) neurons:

\(\mathbf{W}^{[l]}\) — Weight matrix: \(n^{[l]} \times n^{[l-1]}\)

  • Row \(i\): weights for neuron \(i\) in this layer
  • Column \(j\): weights connecting from neuron \(j\) in previous layer
  • \(W_{ij}^{[l]}\): weight from neuron \(j\) in layer \(l-1\) to neuron \(i\) in layer \(l\)

\(\mathbf{a}^{[l-1]}\) — Input from previous layer: \(n^{[l-1]} \times 1\)

\(\mathbf{b}^{[l]}\) — Bias vector: \(n^{[l]} \times 1\)

\(\mathbf{z}^{[l]}, \mathbf{a}^{[l]}\) — Output vectors: \(n^{[l]} \times 1\)
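A quick NumPy sketch of the layer computation, with the shapes annotated; the layer sizes here are chosen arbitrarily for illustration:

Python
import numpy as np

n_prev, n_curr = 3, 2                 # n^[l-1] = 3, n^[l] = 2
a_prev = np.random.rand(n_prev, 1)    # a^[l-1]: (3, 1)
W = np.random.randn(n_curr, n_prev)   # W^[l]:   (2, 3)
b = np.zeros((n_curr, 1))             # b^[l]:   (2, 1)

z = W @ a_prev + b                    # z^[l]:   (2, 1)
a = 1 / (1 + np.exp(-z))              # a^[l]:   (2, 1)

print(z.shape, a.shape)               # (2, 1) (2, 1)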

Activation Functions

Common Activation Functions

Sigmoid:

\[\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))\]

ReLU:

\[\text{ReLU}(z) = \max(0, z), \quad \text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}\]
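Both functions and their derivatives are short enough to code directly; a small sketch (the helper names are ours):

Python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)            # peaks at 0.25 when z = 0

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)  # 1 for positive z, else 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), sigmoid_prime(z))
print(relu(z), relu_prime(z))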


03

Loss Functions

The loss function quantifies how far our predictions are from the true values. It's the objective we're trying to minimize during training.

Mean Squared Error (MSE)

MSE Loss \[\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] \[\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)\]

Binary Cross-Entropy

Binary Cross-Entropy \[\mathcal{L}_{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]\]

Categorical Cross-Entropy

Categorical Cross-Entropy \[\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)\]

Here \(C\) is the number of classes and \(\mathbf{y}\) is a one-hot label vector, so only the term for the true class contributes to the sum.
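A hedged NumPy sketch of these losses and the MSE gradient (the function names are ours, and the clipping constant is a common numerical guard, not part of the formulas):

Python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mse_grad(y, y_hat):
    return 2 * (y_hat - y) / y.size       # dL/dy_hat per element

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(mse(y, y_hat))                  # ~0.0467
print(mse_grad(y, y_hat))             # points away from the targets
print(binary_cross_entropy(y, y_hat)) # ~0.2284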
04

The Chain Rule

The chain rule is the mathematical foundation of backpropagation. It allows us to compute derivatives of composite functions.

The Chain Rule

If \(y = f(u)\) and \(u = g(x)\), then: \[\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\]

Computational Graph

Consider a simple computation: \(f = (x + y) \cdot z\)

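Here is the same computation worked by hand (the concrete numbers are chosen only for illustration). Let \(q = x + y\), so \(f = q \cdot z\). The local gradients are:

\[\frac{\partial f}{\partial q} = z, \quad \frac{\partial f}{\partial z} = q, \quad \frac{\partial q}{\partial x} = 1, \quad \frac{\partial q}{\partial y} = 1\]

Chaining them gives \(\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z\), and likewise \(\frac{\partial f}{\partial y} = z\). For example, with \(x = 1\), \(y = 2\), \(z = 3\): \(q = 3\), \(f = 9\), \(\partial f/\partial x = \partial f/\partial y = 3\), and \(\partial f/\partial z = q = 3\).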
🔗
The Power of Local Computation

Each node only needs to know its local gradient—how its output changes with respect to its inputs. The chain rule combines these local gradients to compute global gradients.

Multivariate Chain Rule \[\frac{\partial \mathcal{L}}{\partial x} = \sum_{i} \frac{\partial \mathcal{L}}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}\]
05

The Backpropagation Algorithm

Now we combine everything: forward propagation, loss functions, and the chain rule to derive backpropagation—the elegant algorithm that makes training deep networks possible.

The Key Quantity: δ (Delta) — The Error Signal

The delta (δ) is the most important concept in backpropagation. It represents the error signal at each layer—specifically, how much the total loss would change if we slightly changed the pre-activation value at that layer.

Definition: Delta (δ) — The Error Term

For layer \(l\), the delta is defined as the partial derivative of the loss with respect to the pre-activation:

Error Term Definition \[\boldsymbol{\delta}^{[l]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}}\]

🔍 Breaking Down Delta — What Each Part Means

Formula Breakdown

Let's dissect the delta formula piece by piece:

\(\boldsymbol{\delta}^{[l]}\) — The error signal vector at layer \(l\)

  • This is a vector with one value per neuron in layer \(l\)
  • Each element tells us: "How much would the loss change if this neuron's pre-activation changed?"

\(\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}}\) — Partial derivative of loss w.r.t. pre-activation

  • \(\mathcal{L}\) is the scalar loss value (e.g., 0.5 for MSE)
  • \(\mathbf{z}^{[l]}\) is the weighted sum before activation: \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\)
  • This derivative captures the sensitivity of loss to changes in \(\mathbf{z}\)
💡
Why Delta at z (not a)?

We define delta at \(\mathbf{z}\) (pre-activation) rather than \(\mathbf{a}\) (post-activation) because weights directly affect \(\mathbf{z}\). This makes computing weight gradients simpler: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \boldsymbol{\delta} \cdot \mathbf{a}^T\). If we used \(\mathbf{a}\), we'd need an extra chain rule step every time.

🎯 Why Delta Matters — The Bridge to Gradients

Once we have \(\boldsymbol{\delta}^{[l]}\), computing the gradients we need for learning becomes trivial:

From Delta to Gradients

Weight Gradient:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\]

↳ "Error at this layer" × "Input to this layer" = "How to adjust weights"

Bias Gradient:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}\]

↳ The bias gradient is simply delta itself (since \(\frac{\partial z}{\partial b} = 1\))
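In code, each gradient is a single line once delta is known. A sketch with illustrative shapes, assuming column-vector activations (all values here are made up):

Python
import numpy as np

delta  = np.array([[-0.2], [0.1]])        # delta^[l]: (2, 1)
a_prev = np.array([[0.5], [0.3], [0.8]])  # a^[l-1]:   (3, 1)

dW = delta @ a_prev.T   # (2, 3) — same shape as W^[l]
db = delta              # (2, 1) — same shape as b^[l]
print(dW.shape, db.shape)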

The Backpropagation Equations — Step by Step

Step 1: Output Layer Error (δ at layer L)

At the output layer, we can compute delta directly from the loss function:

Output Layer Delta \[\boldsymbol{\delta}^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \odot \sigma'(\mathbf{z}^{[L]})\]
Formula Breakdown — Output Delta

\(\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}}\) — How loss changes with output activation

  • For MSE loss \(\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\): this equals \((\hat{y} - y)\)
  • For Cross-Entropy: this equals \(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)

\(\sigma'(\mathbf{z}^{[L]})\) — Derivative of activation function

  • For Sigmoid: \(\sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1-a)\)
  • For ReLU: \(\sigma'(z) = 1\) if \(z > 0\), else \(0\)

\(\odot\) — Element-wise (Hadamard) product

  • Multiplies corresponding elements: \([a,b] \odot [c,d] = [ac, bd]\)
  • Each neuron's error is scaled by its activation's sensitivity

Step 2: Propagate Error Backward (δ for hidden layers)

This is where the "back" in backpropagation happens. We propagate the error signal from layer \(l+1\) back to layer \(l\):

Error Propagation Equation \[\boldsymbol{\delta}^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \right) \odot \sigma'(\mathbf{z}^{[l]})\]
Formula Breakdown — Error Propagation

\((\mathbf{W}^{[l+1]})^T\) — Transposed weight matrix

  • The transpose "reverses" the forward connections
  • In forward pass: layer \(l\) → layer \(l+1\) via \(\mathbf{W}^{[l+1]}\)
  • In backward pass: layer \(l+1\) → layer \(l\) via \((\mathbf{W}^{[l+1]})^T\)

\((\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}\) — Weighted error distribution

  • Each neuron in layer \(l\) contributed to multiple neurons in layer \(l+1\)
  • This term sums up all error contributions, weighted by connection strength
  • Strong weights → more blame assigned to that neuron

\(\odot \sigma'(\mathbf{z}^{[l]})\) — Local gradient scaling

  • Scales the propagated error by the local activation sensitivity
  • If activation was saturated (\(\sigma' \approx 0\)), gradient vanishes
  • This is why ReLU helps: \(\sigma'(z) = 1\) for positive inputs
🎯
The Beautiful Intuition

Think of delta as "blame assignment." The output layer knows the final error. It passes blame backward through the network: "You contributed this much weight to my input, so you get this much blame." Each layer receives blame proportional to its contribution, scaled by how much it could have changed things (the activation derivative).

Complete Algorithm Summary

  1. Forward Pass: Compute all \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\) and \(\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\)
  2. Output Error: Compute \(\boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}} \mathcal{L} \odot \sigma'(\mathbf{z}^{[L]})\)
  3. Backpropagate: For \(l = L-1, \ldots, 1\): compute \(\boldsymbol{\delta}^{[l]} = ((\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}) \odot \sigma'(\mathbf{z}^{[l]})\)
  4. Compute Gradients: \(\nabla_{\mathbf{W}^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\) and \(\nabla_{\mathbf{b}^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]}\)
  5. Update Weights: \(\mathbf{W}^{[l]} \leftarrow \mathbf{W}^{[l]} - \eta \nabla_{\mathbf{W}^{[l]}} \mathcal{L}\)

Let's derive the backpropagation equation using the chain rule:

Derivation

We want \(\frac{\partial \mathcal{L}}{\partial z_j^{[l]}}\) for neuron \(j\) in layer \(l\).

The pre-activation \(z_j^{[l]}\) affects the loss through all neurons in layer \(l+1\):

\[\frac{\partial \mathcal{L}}{\partial z_j^{[l]}} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{[l+1]}} \cdot \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}}\]

Now, \(z_k^{[l+1]} = \sum_j W_{kj}^{[l+1]} a_j^{[l]} + b_k^{[l+1]}\) and \(a_j^{[l]} = \sigma(z_j^{[l]})\), so:

\[\frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = W_{kj}^{[l+1]} \cdot \sigma'(z_j^{[l]})\]

Substituting back:

\[\delta_j^{[l]} = \sum_k \delta_k^{[l+1]} \cdot W_{kj}^{[l+1]} \cdot \sigma'(z_j^{[l]})\]

In matrix form (pulling out the common \(\sigma'\) term):

\[\boldsymbol{\delta}^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \right) \odot \sigma'(\mathbf{z}^{[l]})\]
06

Worked Example

Let's work through a complete example with a simple 2-layer network. We'll compute every value step by step so you can see exactly how backpropagation works.

Network Setup

Architecture: 2 inputs → 2 hidden neurons → 1 output
Activation: Sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\)
Loss: Mean Squared Error \(\mathcal{L} = \frac{1}{2}(y - \hat{y})^2\)

Initial Values

Input: \(\mathbf{x} = [0.5, 0.3]^T\), Target: \(y = 1\)

Layer 1 weights and biases:

\[\mathbf{W}^{[1]} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}\]

Layer 2 weights and bias:

\[\mathbf{W}^{[2]} = \begin{bmatrix} 0.5 & 0.6 \end{bmatrix}, \quad b^{[2]} = 0.1\]

Forward Pass — Computing Activations

Hidden Layer (Layer 1)

Step-by-Step: z¹ Calculation

Apply: \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}\)

\[z_1^{[1]} = (0.1 \times 0.5) + (0.2 \times 0.3) + 0.1 = 0.05 + 0.06 + 0.1 = \mathbf{0.21}\] \[z_2^{[1]} = (0.3 \times 0.5) + (0.4 \times 0.3) + 0.1 = 0.15 + 0.12 + 0.1 = \mathbf{0.37}\]
Step-by-Step: a¹ Calculation (Sigmoid)

Apply: \(\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})\)

\[a_1^{[1]} = \sigma(0.21) = \frac{1}{1 + e^{-0.21}} = \frac{1}{1.8106} = \mathbf{0.5523}\] \[a_2^{[1]} = \sigma(0.37) = \frac{1}{1 + e^{-0.37}} = \frac{1}{1.6907} = \mathbf{0.5914}\]

Output Layer (Layer 2)

Step-by-Step: z² and ŷ Calculation

Apply: \(z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]}\)

\[z^{[2]} = (0.5 \times 0.5523) + (0.6 \times 0.5914) + 0.1\] \[= 0.2762 + 0.3548 + 0.1 = \mathbf{0.7310}\]

Apply sigmoid to get prediction:

\[\hat{y} = \sigma(0.7310) = \frac{1}{1 + e^{-0.7310}} = \mathbf{0.6750}\]
Loss Calculation \[\mathcal{L} = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(1 - 0.6750)^2 = \frac{1}{2}(0.3250)^2 = \mathbf{0.0528}\]

Backward Pass — Computing Deltas & Gradients

Output Layer Delta (δ²)

Step-by-Step: δ² Computation

Formula: \(\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial a^{[2]}} \cdot \sigma'(z^{[2]})\)

Part 1: Loss derivative w.r.t. output

\[\frac{\partial \mathcal{L}}{\partial a^{[2]}} = \frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}(y-\hat{y})^2\right] = -(y - \hat{y}) = -(1 - 0.6750) = \mathbf{-0.3250}\]

Part 2: Sigmoid derivative

\[\sigma'(z^{[2]}) = \sigma(z^{[2]})(1 - \sigma(z^{[2]})) = 0.6750 \times (1 - 0.6750) = 0.6750 \times 0.3250 = \mathbf{0.2194}\]

Final: Multiply together

\[\delta^{[2]} = (-0.3250) \times 0.2194 = \mathbf{-0.0713}\]

Hidden Layer Delta (δ¹)

Step-by-Step: δ¹ Computation

Formula: \(\boldsymbol{\delta}^{[1]} = ((\mathbf{W}^{[2]})^T \delta^{[2]}) \odot \sigma'(\mathbf{z}^{[1]})\)

Part 1: Propagate error backward

\[(\mathbf{W}^{[2]})^T \delta^{[2]} = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix} \times (-0.0713) = \begin{bmatrix} -0.0356 \\ -0.0428 \end{bmatrix}\]

Part 2: Compute sigmoid derivatives for each hidden neuron

\[\sigma'(z_1^{[1]}) = a_1^{[1]}(1-a_1^{[1]}) = 0.5523 \times 0.4477 = \mathbf{0.2472}\] \[\sigma'(z_2^{[1]}) = a_2^{[1]}(1-a_2^{[1]}) = 0.5914 \times 0.4086 = \mathbf{0.2417}\]

Final: Element-wise multiply

\[\boldsymbol{\delta}^{[1]} = \begin{bmatrix} -0.0356 \\ -0.0428 \end{bmatrix} \odot \begin{bmatrix} 0.2472 \\ 0.2417 \end{bmatrix} = \begin{bmatrix} \mathbf{-0.0088} \\ \mathbf{-0.0103} \end{bmatrix}\]

Computing Weight Gradients

Layer 2 Gradients

Formula: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} (\mathbf{a}^{[1]})^T\)

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = -0.0713 \times [0.5523, 0.5914] = [\mathbf{-0.0394}, \mathbf{-0.0422}]\] \[\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} = \mathbf{-0.0713}\]
Layer 1 Gradients

Formula: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \boldsymbol{\delta}^{[1]} \mathbf{x}^T\)

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \begin{bmatrix} -0.0088 \\ -0.0103 \end{bmatrix} \times [0.5, 0.3] = \begin{bmatrix} -0.0044 & -0.0026 \\ -0.0052 & -0.0031 \end{bmatrix}\]

Weight Update (with η = 0.5)

Updated Weights

Formula: \(\mathbf{W}_{new} = \mathbf{W}_{old} - \eta \cdot \nabla \mathcal{L}\)

\[W^{[2]}_{0,new} = 0.5 - 0.5 \times (-0.0394) = 0.5 + 0.0197 = \mathbf{0.5197}\] \[W^{[2]}_{1,new} = 0.6 - 0.5 \times (-0.0422) = 0.6 + 0.0211 = \mathbf{0.6211}\]

Notice that the gradients are negative (the loss decreases as these weights increase), so the update increases the weights, pushing the output closer to the target of 1.
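As a sanity check, this short NumPy script reproduces every number in the worked example (up to rounding):

Python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x  = np.array([[0.5], [0.3]])            # input; target y = 1
y  = 1.0
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([[0.1], [0.1]])
W2 = np.array([[0.5, 0.6]])
b2 = 0.1

# Forward pass
z1 = W1 @ x + b1                         # [[0.21], [0.37]]
a1 = sigmoid(z1)                         # ~[0.5523, 0.5914]
z2 = W2 @ a1 + b2                        # ~0.7310
y_hat = sigmoid(z2)                      # ~0.6750
loss = 0.5 * (y - y_hat) ** 2            # ~0.0528

# Backward pass
delta2 = -(y - y_hat) * y_hat * (1 - y_hat)   # ~ -0.0713
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # ~ [-0.0088, -0.0103]

dW2, db2 = delta2 @ a1.T, delta2              # ~ [-0.0394, -0.0422]
dW1, db1 = delta1 @ x.T, delta1

print(z2.item(), y_hat.item(), loss.item())
print(dW2)
print(dW1)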

07

Implementation

Here's a complete Python implementation of backpropagation from scratch:

Python
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes):
        """Initialize with Xavier initialization"""
        self.L = len(layer_sizes) - 1
        self.weights = {}
        self.biases = {}
        
        for l in range(1, self.L + 1):
            self.weights[l] = np.random.randn(
                layer_sizes[l], layer_sizes[l-1]
            ) * np.sqrt(2 / layer_sizes[l-1])
            self.biases[l] = np.zeros((layer_sizes[l], 1))
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def forward(self, X):
        """Forward pass"""
        self.cache = {'a': {0: X}, 'z': {}}
        a = X
        for l in range(1, self.L + 1):
            z = self.weights[l] @ a + self.biases[l]
            a = self.sigmoid(z)
            self.cache['z'][l] = z
            self.cache['a'][l] = a
        return a
    
    def backward(self, y_true):
        """Backpropagation"""
        m = y_true.shape[1]
        gradients = {'dW': {}, 'db': {}}
        
        # Output layer
        a_L = self.cache['a'][self.L]
        z_L = self.cache['z'][self.L]
        delta = (a_L - y_true) * self.sigmoid_derivative(z_L)
        
        gradients['dW'][self.L] = delta @ self.cache['a'][self.L-1].T / m
        gradients['db'][self.L] = np.sum(delta, axis=1, keepdims=True) / m
        
        # Hidden layers
        for l in range(self.L - 1, 0, -1):
            delta = (self.weights[l+1].T @ delta) * \
                    self.sigmoid_derivative(self.cache['z'][l])
            gradients['dW'][l] = delta @ self.cache['a'][l-1].T / m
            gradients['db'][l] = np.sum(delta, axis=1, keepdims=True) / m
        
        return gradients
    
    def train(self, X, y, epochs, lr):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = np.mean((y_pred - y) ** 2) / 2
            grads = self.backward(y)
            
            for l in range(1, self.L + 1):
                self.weights[l] -= lr * grads['dW'][l]
                self.biases[l] -= lr * grads['db'][l]
            
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.6f}")

# Example: XOR problem
X = np.array([[0,0,1,1], [0,1,0,1]])
y = np.array([[0,1,1,0]])

nn = NeuralNetwork([2, 4, 1])
nn.train(X, y, epochs=5000, lr=1.0)
print(nn.forward(X))
08

Advanced Topics

The Vanishing Gradient Problem

When gradients are multiplied through many layers, they can become exponentially small (vanish) or exponentially large (explode). With sigmoid activations, for example, \(\sigma'(z) \le 0.25\), so after 10 layers the backpropagated signal can shrink by a factor of up to \(0.25^{10} \approx 10^{-6}\).

🔧
Solutions to Vanishing Gradients
  • ReLU activation: Preserves gradient magnitude
  • Residual connections: Skip connections for gradient flow
  • Batch normalization: Prevents saturation
  • Careful initialization: Xavier or He initialization

Gradient Descent Variants

Adam Optimizer \[\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \nabla \mathcal{L}\] \[\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) (\nabla \mathcal{L})^2\] \[\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}\] \[\theta_{t+1} = \theta_t - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\]

where the hatted quantities are bias-corrected estimates of the first and second moments.
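A minimal sketch of a single Adam update in NumPy; the hyperparameter defaults follow common practice, and all names here are ours:

Python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (element-wise)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize 0.5 * ||theta - target||^2
theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 4):                           # t starts at 1
    grad = theta - np.array([1.0, 2.0, 3.0])    # gradient of the toy loss
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)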

Convolutional Layers

Backprop through convolutions is often implemented with the "im2col" trick: the convolution is unrolled into a matrix multiplication, and standard backprop is then applied to that matrix product.

Batch Normalization

Requires computing gradients through the normalization statistics. The gradient flows through both the normalized values and the learned scale/shift parameters.

Dropout

During training, gradients flow only through the units that were not dropped; with inverted dropout, the surviving activations (and hence their gradients) are scaled by \(1/p\), where \(p\) is the keep probability. During inference, no dropout is applied.

Residual Connections

Skip connections allow gradients to flow directly: \(\frac{\partial}{\partial x}(x + F(x)) = 1 + \frac{\partial F}{\partial x}\), preventing vanishing gradients.

Common Derivatives \[\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}\] \[\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\] \[\frac{\partial}{\partial \mathbf{X}}(\mathbf{a}^T\mathbf{X}\mathbf{b}) = \mathbf{a}\mathbf{b}^T\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{A}\mathbf{X}^T) = \mathbf{A}\]
09

Practice & Debug

Gradient Checking

Always verify your gradients numerically! Gradient checking compares your analytical gradients to numerical approximations.
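A generic central-difference check works for any scalar loss. A sketch (the function names and tolerance are ours; the test function is a toy example):

Python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of df/dx for scalar-valued f."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + h; f_plus = f(x)    # f(x + h) at this coordinate
        x[idx] = orig - h; f_minus = f(x)   # f(x - h) at this coordinate
        x[idx] = orig                       # restore
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# Example: the gradient of f(x) = sum(x^2) is 2x
x = np.random.randn(2, 3)
analytic = 2 * x
numeric = numerical_gradient(lambda x: np.sum(x ** 2), x)
rel_err = np.abs(analytic - numeric).max() / (np.abs(analytic).max() + 1e-12)
print("max relative error:", rel_err)  # should be ~1e-9 or smaller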

Common Bugs & Debugging Tips

🐛
Common Backprop Bugs
  • Shape mismatches: Always verify tensor dimensions
  • Forgetting bias gradients: Remember \(\nabla_b = \delta\)
  • Wrong averaging: Divide by batch size in the right place
  • Numerical instability: Clip values, use stable softmax
  • Gradient not flowing: Check for dead ReLUs, vanishing gradients
10

Resources


Key Takeaways
  • Backpropagation efficiently computes gradients using the chain rule
  • The error term δ captures each neuron's contribution to loss
  • Gradients flow backward, scaled by weights and activation derivatives
  • Modern frameworks handle backprop automatically via autodiff
  • Always verify with gradient checking during development

Summary

Backpropagation is one of the most important algorithms in machine learning. Understanding it deeply will make you a better practitioner and give you intuition for debugging training issues and designing architectures.

Keep practicing, implement it from scratch, and don't be afraid to dive into the math. The concepts here form the foundation of all modern deep learning.