Abraham Dada

Zero To One: A Conceptual Intro To Deep Learning In Python

Published: 18th December 2024

1. Foundational Mathematical Concepts

Deep learning builds on several core areas of mathematics. Linear algebra, calculus, probability, and optimisation form the "languages" of machine learning, providing the notation and tools to understand and design neural networks.

Linear Algebra

Linear algebra is fundamental for representing data and computations in deep learning. Vectors (1D arrays) and matrices (2D arrays) are used to represent inputs, outputs, weights, and transformations. For example, the computation in a neural network layer is essentially a matrix-vector multiplication: inputs (vector) multiplied by weights (matrix) to produce outputs. Key concepts include:

Overall, linear algebra provides the framework for describing neural network calculations. It "cannot be overemphasised how fundamental linear algebra is to deep learning" – concepts like singular value decomposition or eigenvalues underpin advanced techniques, but at minimum one should be comfortable with basic matrix math and notations.
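
As a minimal illustration (not tied to any particular framework), a single dense layer is just a matrix-vector product plus a bias:

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # input vector with 3 features
W = np.random.randn(4, 3)        # weight matrix: 4 output neurons x 3 inputs
b = np.zeros(4)                  # bias vector

y = W @ x + b                    # the layer's output: matrix-vector product plus bias
print(y.shape)                   # (4,)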

Calculus and Backpropagation

Calculus – especially differential calculus – is the tool that allows neural networks to learn. Neural networks learn by updating parameters (weights) in the direction that reduces the loss (error). This requires computing gradients (derivatives) of the loss with respect to each parameter, which is done via backpropagation (backward propagation of errors). Key points include:

In summary, calculus allows us to optimise neural networks. The network's training is essentially an iterative calculus exercise: compute gradients via chain rule and update weights opposite to the gradient (this is gradient descent, covered below). As one source puts it, "the chain rule is applied extensively by the backpropagation algorithm in order to calculate the error gradient of the loss function with respect to each weight".
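
As a tiny worked example of the chain rule in this setting: for a single weight \(w\), input \(x\), target \(y\) and loss \(L = (\sigma(wx) - y)^2\), differentiating through the composition gives \(\partial L/\partial w = 2(\sigma(wx) - y)\,\sigma'(wx)\,x\), and gradient descent then updates \(w := w - \eta\,\partial L/\partial w\). Backpropagation applies exactly this kind of decomposition, layer by layer, to every weight in the network.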

Probability and Statistics

Probability theory provides the framework for reasoning about uncertainty, which is central to machine learning. Neural networks often output probabilities (for classification), and learning algorithms may assume certain data distributions. Important aspects include:

In practice, a deep learning practitioner should grasp that probability is about modelling and handling uncertainty. It underpins evaluation metrics (e.g. accuracy is essentially estimating a probability of correct classification) and advanced techniques like Bayesian neural networks or dropout (which Gal & Ghahramani showed can be interpreted as approximate Bayesian inference).
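
For example, classification networks typically turn raw scores (logits) into a probability distribution with the softmax function; a minimal NumPy sketch:

import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.66, 0.24, 0.10] -- sums to 1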

Optimisation Theory

Optimisation is the process of finding the best parameters (weights) for the neural network by minimising the loss function. Deep learning relies on iterative, gradient-based optimisation algorithms since direct analytic solutions are usually impossible for complex models. Key ideas:

A solid grasp of basic optimisation ensures you understand why your network is or isn't learning. For example, if training loss is not decreasing, it could be an optimisation issue (learning rate, getting stuck in a plateau) rather than model capacity. Concepts like the learning curve (plot of loss vs. epochs) and early stopping (stopping when validation loss stops improving to avoid overfitting) also come from optimisation and generalisation theory.
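
As a minimal illustration of iterative, gradient-based optimisation, here is gradient descent minimising the one-dimensional function \(f(w) = (w - 3)^2\), whose gradient is \(2(w - 3)\):

w, lr = 0.0, 0.1
for step in range(50):
    grad = 2 * (w - 3)   # gradient of f at the current w
    w -= lr * grad       # step in the direction that decreases f
print(w)                 # converges towards the minimum at w = 3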

2. Neural Network Fundamentals

With the mathematical foundation in place, we can delve into the fundamentals of neural networks. A neural network is essentially a function approximator composed of many simple, connected processing units (neurons) arranged in layers. Key concepts include the basic neuron (perceptron), activation functions, network architecture (feedforward layers), loss functions, and the training process (gradient descent/backpropagation).

Perceptrons and Artificial Neurons

The perceptron is the simplest neural network unit, originally proposed by Frank Rosenblatt in 1957. In the context of neural networks, a perceptron is an artificial neuron that computes a weighted sum of its inputs and applies a non-linear activation function (originally a step function) to produce an output. In simple terms, a perceptron takes several binary inputs, weights them, sums them, and outputs a 0 or 1 depending on whether the sum exceeds a threshold.

Key points about perceptrons (and neurons in general):

A single perceptron is limited, but it's the building block of larger networks. Multi-Layer Perceptrons (MLPs) are just networks of perceptrons (artificial neurons) stacked in layers. By combining many such units, we obtain the capacity to learn complex functions.
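
A minimal sketch of a single perceptron with a step activation; the weights and threshold shown happen to implement a logical AND of two binary inputs:

import numpy as np

def perceptron(x, w, b):
    # weighted sum of inputs followed by a step (threshold) activation
    return 1 if np.dot(w, x) + b > 0 else 0

w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # only (1, 1) produces 1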

Activation Functions and Non-Linearity

Activation functions define the output of a neuron given its input sum. Without an activation function (or using only a linear function), a network of any depth would collapse to an equivalent single-layer linear model (because composition of linear functions is linear). Thus, non-linear activation functions are essential to enable neural networks to model complex, non-linear relationships. Some common activations:

In essence, the activation function introduces non-linearity, which is why a neural network with even one hidden layer (and non-linear activations) can approximate complex functions. In fact, with non-linear activations like sigmoid or ReLU, a network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units (the Universal Approximation Theorem). Activation functions are chosen based on the task (e.g. softmax for classification probabilities, no activation or linear for a regression output) and practical considerations (ReLU for deep hidden layers). Modern networks often use ReLU by default in hidden layers because it's simple and effective.
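
For reference, minimal NumPy versions of a few common activations (sigmoid, tanh, ReLU and leaky ReLU):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))            # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1), zero-centred

def relu(x):
    return np.maximum(0, x)                # passes positives, zeroes out negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope avoids "dead" units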

Feedforward Networks (Multi-Layer Perceptrons)

A feedforward neural network (also called a fully-connected network or multilayer perceptron) is the archetypal neural network where information flows in one direction from input to output. These networks consist of an input layer, one or more hidden layers of neurons, and an output layer, with each layer fully connected to the next.

In a feedforward network:

A simple example is a network for classifying images of digits (like MNIST): input layer might have 784 neurons (for 28×28 pixel values), one or more hidden layers (say 128 neurons each with ReLU activations), and an output layer of 10 neurons (one per digit class, using softmax to output probabilities). During a forward pass, the data "feeds forward" through each layer's linear combination and activation to produce an output.

Important properties:

Feedforward networks are the foundation of deep learning – more specialised architectures (CNNs, RNNs, etc.) build upon or modify the feedforward structure to handle specific data types. But understanding an MLP – inputs flowing through weighted sums and activations to produce outputs – is key to understanding all neural networks.

Loss Functions

A loss function (also called cost function or objective function) quantifies how well the neural network is performing by comparing the network's outputs to the true target values. The loss function guides the training: the optimizer tweaks weights to minimise this loss. Choosing the right loss function depends on the task:

A good loss function is differentiable (so we can compute gradients). It should also align well with the metric we care about. Sometimes the metric of interest (e.g. accuracy) is not differentiable, so we train with a surrogate loss (cross-entropy is a standard differentiable surrogate for accuracy).

It's important to understand that the loss drives training: "a loss function measures the difference between a model's predicted outputs and the actual target values". Lower loss means better model performance on the training data. During training we monitor the loss, and also measure the loss (or related metrics) on validation data to ensure the model is learning to generalise, not just fit the training set (more on this in the evaluation section).
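
As a sketch, here are the two most common losses in NumPy: mean squared error for regression, and categorical cross-entropy for classification (where probs holds predicted class probabilities and y_true holds integer class labels):

import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(probs, y_true):
    eps = 1e-12                                      # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]   # probability assigned to the true class
    return -np.mean(np.log(picked + eps))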

Gradient Descent and Backpropagation

We touched on this under calculus and optimisation, but to summarise the practical process: gradient descent with backpropagation is the algorithm that trains the neural network by minimising the loss. The steps in one iteration (for example, one mini-batch of data) are:

  1. Forward Pass: Compute the outputs of the network for the given input batch. This involves applying each layer's weights and activation in sequence (feedforward).
  2. Compute Loss: Compare the outputs to the true labels and compute the loss using the chosen loss function.
  3. Backward Pass (Backpropagation): Compute gradients of the loss with respect to each weight in the network. Backprop starts at the output layer and applies the chain rule to propagate gradients backwards through the network layers. Each weight w gets a gradient \(\partial L/\partial w\) indicating how the loss would change if w were increased slightly.
  4. Gradient Descent Step: Update each weight by a small amount in the opposite direction of its gradient: \(w := w - \eta (\partial L/\partial w)\). Here \(\eta\) is the learning rate. This step hopefully reduces the loss slightly.
  5. Repeat for many iterations (over many batches, for multiple epochs) until the model converges (or other stopping criteria).

Backpropagation is essentially the bookkeeping method to efficiently calculate all those partial derivatives. It leverages the layered structure of the network to compute gradients from the output back to the input, reusing intermediate results (this is much faster than naively perturbing each weight to see its effect). As one resource succinctly states, "PyTorch deposits the gradients of the loss w.r.t. each parameter" when you call loss.backward() – this is an implementation of backprop. Then calling the optimizer's step (optimizer.step()) will adjust the weights using those gradients.

Variants: In practice, we often use stochastic or mini-batch gradient descent, meaning each update uses a subset of the training data. This introduces randomness (hence "stochastic") which can help escape shallow local minima. Many improvements like Momentum, Adam, etc., modify how the gradient is used for updates (momentum adds a fraction of previous update, Adam adapts per-weight learning rates, etc.), but they still rely on gradients from backprop.

Summary: Gradient descent + backprop is what "learning" means in a neural network. It is an iterative process of incremental improvement: each step nudges the weights to slightly reduce the error. Over many iterations, if all goes well, the network ends up in a state that produces very low loss on the training data (i.e., it has learned to approximate the desired function). Understanding this process is crucial for debugging training (e.g., if loss is not decreasing, something is wrong with gradients, learning rate, or model capacity).

3. Deep Learning from Scratch in Python (NumPy only)

To really cement understanding, it's helpful to build a simple neural network from scratch in Python, without using high-level frameworks. By using only NumPy (or even pure Python for simplicity), you can appreciate what the libraries are doing under the hood. Here's a step-by-step outline to implement a basic neural network training loop from scratch:

  1. Define the Network Architecture: Decide the number of layers, neurons, and activation functions. For example, a small network with 2 inputs, 1 hidden layer of 3 neurons (ReLU activation), and 1 output neuron (sigmoid activation for binary classification).
  2. Initialise Weights and Biases: Create NumPy arrays for weights and biases of each layer. A common practice is to initialise with small random values (e.g. Gaussian with mean 0 and small stddev) so that symmetry is broken and neurons don't all produce the same output. For our example, weight matrices shapes would be (2×3) for input-to-hidden and (3×1) for hidden-to-output, plus bias vectors of length 3 and 1 respectively.
  3. Forward Pass (Prediction): Implement a function to take an input array and compute the output. Using our example:
    • Compute hidden layer pre-activation: h = np.dot(x, W1) + b1 (x is 1×2, W1 is 2×3, result 1×3).
    • Apply activation: h_act = np.maximum(0, h) if ReLU.
    • Compute output pre-activation: o = np.dot(h_act, W2) + b2 (h_act is 1×3, W2 is 3×1, result 1×1).
    • Output activation: y_pred = sigmoid(o) for final probability.
  4. Loss Calculation: Compute the loss for the output. For instance, use mean squared error or binary cross-entropy depending on the task. If doing a simple regression, MSE might be fine; for binary classification, use cross-entropy.
  5. Backward Pass (Manual Gradient Computation): Using calculus, derive the gradients of the loss w.r.t. each parameter. This is the trickiest part to do manually:
    • Compute gradient at output: e.g. with binary cross-entropy and sigmoid output, \(\frac{\partial L}{\partial o} = y_{\text{pred}} - y_{\text{true}}\) (for MSE it would be \(2(y_{\text{pred}} - y_{\text{true}})\) times derivative of sigmoid).
    • Backpropagate to hidden-output weights: \(\frac{\partial L}{\partial W2} = h_{\text{act}}^T \cdot \frac{\partial L}{\partial o}\). For bias2: it's just \(\partial L / \partial o\) (since bias adds directly to o).
    • Backpropagate to hidden layer: use W2 to distribute gradient to hidden neurons. For ReLU activation, the gradient through ReLU is passed only for neurons that were active (for which \(h > 0\)); neurons with \(h \le 0\) have zero gradient (ReLU's derivative is 0 when inactive). Compute \(\frac{\partial L}{\partial h} = \frac{\partial L}{\partial o} \cdot W2^T\), then set those entries to zero where \(h \le 0\) (ReLU backprop).
    • Backpropagate to input-hidden weights: \(\frac{\partial L}{\partial W1} = x^T \cdot \frac{\partial L}{\partial h}\). And bias1 gradient is just \(\partial L / \partial h\) (for each hidden neuron).
    • This chain of derivatives is an application of the chain rule – exactly what backprop does. Our manual steps mimic an automated backpropagation.
  6. Weight Update: Once gradients are computed, update each parameter: W -= learning_rate * dW (and similarly for biases). Use a small learning rate (e.g. 0.01) and ensure to subtract the gradient (to go in the descent direction).
  7. Loop Training: Loop over many epochs. For each epoch, optionally shuffle your training data and iterate through it (for large data, use mini-batches). Compute forward pass, loss, backprop gradients, update weights. Monitor the loss.
  8. Evaluation: After training, test the network on some held-out data to see if it generalised.

Even this simple 2-layer network requires careful coding of gradients. Many beginners get the signs or shapes wrong, so it's useful to test gradient computations with numerical checks. But successfully coding a network from scratch is enlightening: you see that a neural network is just a bunch of multiplications, additions, and function evaluations, nothing mystical.

For instance, a Real Python tutorial builds a neural network from scratch and demonstrates manually applying the chain rule and parameter updates. It shows how backpropagation is essentially book-keeping of partial derivatives. By doing it yourself, you appreciate what frameworks like PyTorch or TensorFlow are automating for you.

Here is a simplified example of forward and backward passes for our small network:

import numpy as np

# Define sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Forward pass
def forward(x, W1, b1, W2, b2):
    # Hidden layer
    z1 = np.dot(x, W1) + b1
    a1 = np.maximum(0, z1)  # ReLU activation
    
    # Output layer
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)
    
    return z1, a1, z2, a2

# Compute loss (binary cross-entropy)
def compute_loss(y_pred, y_true):
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)  # clip to avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Backward pass
def backward(x, y, z1, a1, z2, a2, W1, W2):
    m = x.shape[0]  # Batch size
    
    # Output layer gradients
    dz2 = a2 - y  # Gradient of loss w.r.t. z2
    dW2 = np.dot(a1.T, dz2) / m
    db2 = np.sum(dz2, axis=0) / m
    
    # Hidden layer gradients
    dz1 = np.dot(dz2, W2.T)
    dz1[z1 <= 0] = 0  # ReLU gradient (zero for inactive neurons)
    dW1 = np.dot(x.T, dz1) / m
    db1 = np.sum(dz1, axis=0) / m
    
    return dW1, db1, dW2, db2

# Update parameters
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate):
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    return W1, b1, W2, b2

# Training loop (simplified)
def train(X, Y, hidden_size, learning_rate, epochs):
    input_size = X.shape[1]
    output_size = 1
    
    # Initialize weights
    W1 = np.random.randn(input_size, hidden_size) * 0.01
    b1 = np.zeros((1, hidden_size))
    W2 = np.random.randn(hidden_size, output_size) * 0.01
    b2 = np.zeros((1, output_size))
    
    for epoch in range(epochs):
        # Forward pass
        z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
        
        # Compute loss
        loss = compute_loss(a2, Y)
        
        # Backward pass
        dW1, db1, dW2, db2 = backward(X, Y, z1, a1, z2, a2, W1, W2)
        
        # Update parameters
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate)
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss}")
    
    return W1, b1, W2, b2
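
Given the functions above, a simple way to verify the analytic gradients is a central-difference check on a single weight; the sketch below assumes X, Y and the initialised parameters (W1, b1, W2, b2) are in scope:

def numerical_grad(i, j, eps=1e-5):
    # perturb W1[i, j] up and down and measure the change in the loss
    original = W1[i, j]
    W1[i, j] = original + eps
    loss_plus = compute_loss(forward(X, W1, b1, W2, b2)[3], Y)
    W1[i, j] = original - eps
    loss_minus = compute_loss(forward(X, W1, b1, W2, b2)[3], Y)
    W1[i, j] = original   # restore the weight
    return (loss_plus - loss_minus) / (2 * eps)

# Compare with the analytic gradient dW1[i, j] from backward(); they should agree closely.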

NumPy vs Pure Python: Using NumPy for linear algebra is important for efficiency. A pure Python loop to sum over neurons would be very slow. NumPy operates in C under the hood, making it much faster. Even our scratch implementation relies on NumPy's dot for matrix multiplication. This highlights why deep learning libraries are so necessary – they are heavily optimised (often using GPU computations) to handle the large linear algebra operations in neural nets.

After completing a from-scratch implementation, you should have a solid grasp of how forward and backward passes work. At that point, you're ready to appreciate higher-level frameworks which simplify these steps while providing additional functionality.

4. Implementing Neural Networks with PyTorch

While learning from-scratch is valuable, in practice we use frameworks like PyTorch to build and train deep learning models efficiently. PyTorch provides automatic differentiation (so you don't have to manually code backprop) and many utilities for model building, data loading, etc. In this section, we'll cover how to implement neural networks with PyTorch, including data pipelines, model definition, training loops, optimisers, regularisation, and saving/loading models.

Data Pipelines: Datasets and DataLoaders

Real-world data often does not come in neat NumPy arrays ready for training. PyTorch provides abstractions to streamline data handling:

Using these abstractions greatly eases the training loop. As the PyTorch tutorial states, "The Dataset and DataLoader classes encapsulate the process of pulling your data from storage and exposing it to your training loop in batches." For example, if you have images on disk, a Dataset might load an image and its label in __getitem__, and the DataLoader will take care of calling this and bundling results into batches (and shuffling order each epoch, etc.).
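
A minimal custom Dataset sketch (here wrapping in-memory arrays X and y, which are assumed to exist; an image dataset would instead open the file inside __getitem__):

import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)          # number of samples

    def __getitem__(self, idx):
        # DataLoader calls this per index and collates the results into batches
        return self.features[idx], self.labels[idx]

loader = DataLoader(ArrayDataset(X, y), batch_size=32, shuffle=True)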

Example: Suppose we want to train on the MNIST digit dataset:

import torch
from torchvision import datasets, transforms

train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

This gives us an iterator train_loader that yields 64 images (as tensors) and labels at a time, randomly shuffled each epoch.

Defining Model Architecture (nn.Module)

PyTorch models are usually defined by subclassing torch.nn.Module. This base class provides a lot of functionality, but fundamentally, you need to define two things in your subclass:

  1. __init__ Constructor: Set up the layers of the network.
  2. forward Method: Define how to compute the output from input by using those layers.

PyTorch's torch.nn module provides many building blocks (layers, activations, etc.) to use in your model. For example, nn.Linear for a fully connected layer, nn.Conv2d for a convolutional layer, nn.ReLU for activation, etc.

"Every module in PyTorch subclasses nn.Module. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily." In code, a simple model might look like:

import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        # Define layers
        self.fc1 = nn.Linear(784, 128)    # fully connected: 784->128
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)     # fully connected: 128->10 (for 10 classes)
    
    def forward(self, x):
        # Forward pass: note we don't call backward here, PyTorch autograd will handle it
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

In the above:

PyTorch, by design, allows the forward method to be written with normal Python control flow (loops, ifs, etc.), which makes it very flexible (this is part of its "define-by-run" dynamic graph approach).
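
For example, nothing stops forward from containing an ordinary Python loop; the (purely illustrative) module below applies the same layer a variable number of times, and autograd still tracks everything:

import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 128)

    def forward(self, x, num_repeats=3):
        for _ in range(num_repeats):       # the graph is built dynamically on each call
            x = torch.relu(self.fc(x))
        return x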

Using an nn.Sequential is an even quicker way for simple stack of layers:

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

This avoids writing an explicit class; however, for anything non-linear in flow or with multiple inputs, a custom nn.Module subclass is clearer.

In summary, to define a model:

PyTorch will automatically create the computation graph as you perform operations in forward. You never call forward directly; instead you call the model on an input like outputs = model(inputs) – under the hood, __call__ is defined to wrap around forward and handle bookkeeping. The gradient graph is built dynamically, so next we can use it for training.

Training Loop: Forward, Loss, Backward, Optimise

Once the model is defined and data loader prepared, the training loop in PyTorch goes through these steps for each batch:

model = SimpleNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        # 1. Forward pass
        outputs = model(inputs)              # outputs shape [batch_size, 10]
        loss = loss_fn(outputs, labels)      # compute loss for this batch

        # 2. Backward pass
        optimizer.zero_grad()               # reset gradients from previous step
        loss.backward()                     # compute gradients (dLoss/dWeights)
        # Now, model.parameters() have their .grad attribute set

        # 3. Update weights
        optimizer.step()                    # adjust weights by gradients
    # (Optional) compute validation loss, accuracy, etc.

A few important details:

PyTorch gives flexibility: you could manually loop over model.parameters() and update them, but using optimizer is cleaner and allows using more complex rules (Adam, etc.).

Batch vs Epoch: Typically we loop batches inside an epoch loop. After each epoch, you might shuffle the data or adjust learning rate, etc. It's also common to compute validation metrics at epoch boundaries.

Loss and Metrics: We use nn.CrossEntropyLoss above, which expects raw logits and true class indices, and it computes softmax + cross-entropy internally. If using a different output/target scheme, choose the appropriate loss_fn (PyTorch has many in nn module). It's also common to print or log the loss every few iterations, and track metrics like accuracy on the side.
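
For instance, a typical accuracy calculation over a validation loader (val_loader here is an assumed validation DataLoader) might look like:

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in val_loader:
        preds = model(inputs).argmax(dim=1)        # class with the highest logit
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Validation accuracy: {correct / total:.3f}")
model.train()                                      # back to training mode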

To ensure correctness, one might add debug prints:

print(f"Epoch {epoch}, Loss: {loss.item()}")

.item() gives the Python float of a 0-dim tensor.

The training loop in PyTorch is explicit (unlike some frameworks that hide it), which makes it flexible. You can add custom behaviour (gradient clipping, learning rate scheduling, etc.) within this loop as needed.

Optimisers and Regularisation in PyTorch

Optimisers: PyTorch's torch.optim package provides many optimisation algorithms:

Switching optimisers is as simple as using a different optim.X class and passing model.parameters(). The rest of the loop remains the same.
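
For example (with illustrative hyperparameters):

# SGD with momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# or Adam with its usual default learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)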

Regularisation: Neural networks can easily overfit, so we use regularisation techniques to encourage simpler models:

Example (Weight Decay and Dropout):

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
# Inside model, suppose we define self.dropout = nn.Dropout(0.5) and use it in forward

This sets L2 weight decay, and the model itself has dropout layers. Each training iteration, weight decay will nudge weights to smaller values, and dropout will randomly drop units, both discouraging overfitting.

Gradient Clipping: Another regularisation (or stabilisation) trick for some models (especially RNNs) is to clip gradients if they get too large (to avoid drastic updates that could blow up the model).
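
In PyTorch this is a one-liner between the backward pass and the optimiser step; max_norm=1.0 below is an arbitrary example value:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale gradients if their norm exceeds 1.0
optimizer.step()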

PyTorch makes it straightforward to add these techniques, and many can be combined. Regularisation is crucial when training deep networks, as it combats overfitting and can improve the model's ability to generalise to unseen data (see more in the evaluation section).

Saving and Loading Models

Training a model can be time-consuming, so you will want to save the trained model to disk, and later load it for inference or to resume training. PyTorch provides utilities for this:

torch.save(model.state_dict(), "model_weights.pth")

This creates a file (which is actually a serialised PyTorch tensor dict under the hood).

model = SimpleNet()  # must match architecture
model.load_state_dict(torch.load("model_weights.pth"))
model.eval()  # set to evaluation mode

Setting eval() is important for certain layers like dropout or batchnorm so that they behave in inference mode.

torch.save({
    'epoch': current_epoch,
    'model_state': model.state_dict(),
    'optim_state': optimizer.state_dict()
}, "checkpoint.pth")

and later load it and use optimizer.load_state_dict(...) similarly.
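
Resuming might then look like this (using the same dictionary keys as above):

checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optim_state'])
start_epoch = checkpoint['epoch'] + 1   # continue training from the next epoch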

Using these tools, you can train a model for hours/days, save the weights, and later use the model in production or in a separate script without retraining. For deployment, often we save the weights and load them in a lighter script that just does inference on new data.

As a quick example:

# After training:
torch.save(model.state_dict(), "mymodel.pth")

# In inference script:
model = SimpleNet()
model.load_state_dict(torch.load("mymodel.pth"))
model.eval()
# Now model can be used to predict

The .eval() call sets the model to evaluation mode (affecting dropout, batchnorm as mentioned). If you were to continue training after loading, you'd use model.train() to put it back in training mode.

5. Core Neural Network Architectures

Over time, deep learning practitioners have developed specialised architectures tailored to different types of data and problems. Here we introduce some core architectures beyond the basic feedforward (dense) network: Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and the closely related LSTMs. We'll also set the stage for more advanced models in the next section.

Multilayer Perceptrons (MLPs)

We've already used the term MLP to describe a feedforward network with one or more hidden layers. To reiterate:

MLPs are considered "vanilla" neural networks and often serve as a baseline. For example, in the early days of deep learning on MNIST, an MLP with one hidden layer of 500 neurons was a reasonable model achieving ~98% accuracy. But for more complex image tasks, CNNs dramatically outperform MLPs by leveraging spatial structure.

One can think of an MLP as learning hierarchical representations: the first hidden layer might detect simple features of the input, the second layer builds on those features to detect more complex patterns, and so on. However, in practice, MLPs with many layers are hard to train (due to issues like vanishing gradients). Modern usage of very deep networks relies on architectural innovations (like skip connections in ResNets) which are beyond plain MLP.

Nonetheless, understanding MLPs is the first step. Any deep network's backbone might contain fully-connected layers at some point (e.g., the last layers of a CNN or transformer are often MLPs). And as mentioned, a sufficiently large MLP can approximate any function theoretically – it's just that other architectures do it more efficiently for specific domains.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks are specialised for grid-structured data like images (2D grids of pixels) or audio spectrograms (2D time-frequency grids). The key idea is to use convolutional layers instead of fully connected layers for early processing, which exploit local spatial coherence:

A CNN for image classification might look like: Input -> [Conv2d -> ReLU -> Conv2d -> ReLU -> Pool] -> [Conv2d -> ReLU -> Pool] -> [Fully Connected -> ReLU -> Fully Connected -> Softmax]. Famous CNN architectures (for more advanced study) include LeNet-5 (one of the first), AlexNet (which kickstarted deep learning for vision in 2012), VGG, ResNet (introduced skip connections, enabling very deep networks), etc.

In summary, "a convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (kernel) optimisation". It leverages local connectivity and parameter sharing. CNNs have been extremely successful in computer vision tasks – image classification, object detection, segmentation – and even for other data like audio and text (where 1D or 2D convolutions can apply).

Here's a simple CNN in PyTorch for MNIST classification:

import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # First convolutional layer: 1 input channel, 32 output channels, 3x3 kernel
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        # Max pooling layer: 2x2 kernel with stride 2
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Second convolutional layer: 32 input channels, 64 output channels, 3x3 kernel
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 7 * 7, 128)  # After 2 pooling layers, 28x28 -> 7x7
        self.fc2 = nn.Linear(128, 10)
        
    def forward(self, x):
        # First conv block
        x = self.pool(F.relu(self.conv1(x)))  # Conv -> ReLU -> Pool
        # Second conv block
        x = self.pool(F.relu(self.conv2(x)))  # Conv -> ReLU -> Pool
        # Flatten the output for the fully connected layer
        x = x.view(-1, 64 * 7 * 7)
        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks are designed to handle sequential data, such as time series, text, or sequences of events. Unlike feedforward networks that assume all inputs are independent, RNNs introduce loops (recurrence) in the network that allow information to persist from one step of the sequence to the next.

In an RNN, we process one element of the sequence at a time, and the network maintains a hidden state that carries information about previous elements. Conceptually:

Because of this recurrence, RNNs can, in principle, retain memory of arbitrarily long sequences (though in practice vanilla RNNs struggle with long-term dependencies due to vanishing gradients).
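
Concretely, the standard (vanilla) RNN update can be written as \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\), where \(x_t\) is the input at step \(t\), \(h_{t-1}\) is the previous hidden state, and the same weight matrices are reused at every time step; an output can then be read from the hidden state, e.g. \(y_t = W_{hy} h_t + b_y\).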

Important points:

In PyTorch, you can use nn.RNN, nn.LSTM, or nn.GRU layers which encapsulate the recurrence. These can process an entire sequence (optionally with packing for variable lengths, etc.) and yield the outputs and final hidden state.

RNNs (and LSTMs/GRUs) were the go-to solution for sequence learning tasks (NLP, speech) until the advent of Transformers (discussed later). They are still useful in certain settings, especially where sequence lengths aren't too long or when streaming data in real-time (where you process one step at a time and need to carry state).

Here's a simple RNN for sequence classification in PyTorch:

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        # RNN layer: input_size -> hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # Fully connected output layer: hidden_size -> output_size
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        # Initialize hidden state with zeros
        batch_size = x.size(0)
        h0 = torch.zeros(1, batch_size, self.hidden_size).to(x.device)
        
        # RNN forward pass
        # out shape: batch_size x seq_length x hidden_size
        # h_n shape: 1 x batch_size x hidden_size
        out, h_n = self.rnn(x, h0)
        
        # Use the final hidden state for classification
        h_n = h_n.squeeze(0)  # Remove the leading (num_layers * num_directions) dimension, which is 1 here
        out = self.fc(h_n)
        return out

Long Short-Term Memory (LSTM) Networks

LSTM networks are a special kind of RNN that can learn long-term dependencies more effectively than plain RNNs. Introduced by Hochreiter & Schmidhuber (1997), LSTMs combat the vanishing gradient problem with an internal architecture that explicitly controls the flow of information.

An LSTM unit has gates that regulate the hidden state:

Additionally, the LSTM maintains a separate cell state \(C_t\) that runs through time with only linear interactions (additions and multiplications with gate values), which helps preserve long-term information. The combination of gates allows the LSTM to retain information over long periods (hundreds of time steps) if necessary, by setting the forget gate near 1 and the input gate near 0 for those steps, thus carrying the cell state through unchanged.
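
In the standard formulation, the gates are computed from the current input \(x_t\) and previous hidden state \(h_{t-1}\): the forget gate \(f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\), the input gate \(i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\), and the output gate \(o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\). The cell state is updated as \(C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C [h_{t-1}, x_t] + b_C)\), and the new hidden state is \(h_t = o_t \odot \tanh(C_t)\).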

In practice, LSTMs significantly improved the ability of RNNs to capture long-term dependencies. For example, in language, LSTMs can remember context from many words earlier (like gender of a subject to use correct pronoun later, etc.) better than vanilla RNNs.

Key points:

lstm = nn.LSTM(input_size=..., hidden_size=..., num_layers=..., batch_first=True)
output_sequence, (h_n, c_n) = lstm(input_sequence)

The output_sequence contains outputs at each time step (unless you only want final output), and (h_n, c_n) are the final hidden and cell states.

LSTMs (and GRUs) were the state-of-the-art for language tasks like translation, until the Transformer networks arrived, which we'll cover next. But even today, LSTMs find use in certain niche areas or where data is limited and one wants a proven architecture.

Summary: If you have sequential data:

Here's an example of an LSTM for text classification in PyTorch:

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(LSTMClassifier, self).__init__()
        # Embedding layer: vocab_size -> embedding_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # LSTM layer: embedding_dim -> hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # Dropout for regularisation
        self.dropout = nn.Dropout(0.5)
        # Fully connected output layer: hidden_dim -> output_dim
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        # text shape: batch_size x seq_length
        
        # Get embeddings
        embedded = self.embedding(text)  # batch_size x seq_length x embedding_dim
        
        # Run LSTM
        # out shape: batch_size x seq_length x hidden_dim
        # hidden shape: (1 x batch_size x hidden_dim, 1 x batch_size x hidden_dim)
        out, (hidden, cell) = self.lstm(embedded)
        
        # Use the final hidden state
        hidden = hidden.squeeze(0)  # batch_size x hidden_dim
        
        # Apply dropout
        hidden = self.dropout(hidden)
        
        # Final prediction
        output = self.fc(hidden)
        
        return output

6. Advanced Models: Transformers and Autoencoders

Moving beyond the "core" architectures, we have some advanced models that have risen to prominence. Here we introduce two: Transformers – which have revolutionised sequence processing (especially in NLP) – and Autoencoders – a framework for unsupervised representation learning and data compression. (Other advanced models include GANs, variational autoencoders, graph neural networks, etc., but those are beyond our current scope.)

Transformers

Transformers are a type of deep learning model introduced in 2017 ("Attention is All You Need" by Vaswani et al.) that rely on a mechanism called self-attention to process sequences. Unlike RNNs, Transformers do not process data sequentially step-by-step; instead, they attend to all elements of the sequence in parallel, which allows for much better parallelisation and for capturing long-range dependencies without the vanishing gradient issues of RNNs.

Key ideas in Transformers:

Transformers were first applied in NLP for translation, but now form the basis of almost all state-of-the-art language models (BERT, GPT, etc. are Transformer-based). They have also been applied to images (Vision Transformers), audio, and more.

Why are they exciting?

In fact, "in recent years, Transformers, which rely on self-attention mechanisms instead of recurrence, have become the dominant architecture for many sequence-processing tasks… due to their superior handling of long-range dependencies and greater parallelizability." This statement from the RNN context emphasizes how Transformers have overtaken LSTMs/GRUs for NLP tasks. For example, GPT-3 (a Transformer model with 175 billion parameters) can generate text with impressive coherence, and vision transformers are competitive with CNNs for image recognition.

For a Python/PyTorch practitioner, using Transformers can mean either implementing the architecture from scratch (complex, but libraries like PyTorch provide nn.Transformer module), or more commonly using pre-trained models via libraries like HuggingFace Transformers, which abstract away the details.

Transformers are an advanced topic, but even as a near-expert, one should at least conceptually understand attention: each output is a weighted combination of inputs, where the weights are dynamically computed from the inputs themselves. This is very different from the fixed connectivity of CNNs or sequential memory of RNNs.
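
A minimal sketch of scaled dot-product attention, the core operation (real Transformers add learned query/key/value projections, multiple heads, residual connections and more):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: tensors of shape (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise similarity between positions
    weights = F.softmax(scores, dim=-1)             # attention weights, each row sums to 1
    return weights @ V                              # each output is a weighted mix of the values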

Autoencoders

Autoencoders are neural networks trained to reproduce their input at the output. They consist of two parts:

The goal of an autoencoder is to learn a compressed representation of the data – essentially performing data compression and reconstruction. By training the network to output exactly what was input, we force it to learn which aspects of the input are most salient (especially if the bottleneck dimension is much smaller than the input). As one article put it: "Autoencoders are ingenious neural network architectures that offer data compression and reconstruction capabilities. They consist of an encoder to compress data and a decoder to reconstruct it back to its original form."

Why are autoencoders useful?

Use cases:

In code, an autoencoder might look like:

import torch.nn as nn

# Define encoder
encoder = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 32)   # compress to 32 dims
)
# Define decoder
decoder = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 784),
    nn.Sigmoid()        # using sigmoid to output pixel values 0-1
)
mse_loss = nn.MSELoss()

# Training: use input itself as target
for x in data_loader:
    z = encoder(x)
    x_recon = decoder(z)
    loss = mse_loss(x_recon, x)   # reconstruction error
    ...                           # backward pass and optimiser step as usual

Because 32 is much smaller than 784, the bottleneck forces the network to learn a compressed representation.

There are many variants of autoencoders (sparse autoencoders, variational autoencoders which impose a distribution on latent space, convolutional autoencoders for images, sequence autoencoders, etc.). The fundamental concept remains: encoder-decoder with a bottleneck, trained by reconstructing inputs.

Autoencoders highlight an important concept in deep learning: unsupervised learning of representations. Not all deep learning is about classification or prediction; some is about learning how to represent data in a more efficient or useful way.

7. Model Evaluation and Validation

Training a deep learning model is half the battle – we also need to evaluate how well it generalises to new data and ensure it's actually solving the intended problem. In this section, we discuss evaluation metrics, the use of validation sets, diagnosing overfitting vs. underfitting, and a brief note on interpretability of deep learning models.

Evaluation Metrics

The metric is the quantitative measure of performance you care about, which might differ from the loss function. Common metrics include:

It's important to choose the right metric for your problem domain – one that reflects what success actually means. For instance, accuracy might be high for a model that always predicts the majority class, while recall for the minority class would be terrible – so looking at precision/recall would be necessary.
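
A sketch of these metrics for a binary classifier, computed directly from predictions and labels (NumPy arrays of 0s and 1s):

import numpy as np

def binary_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1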

Often, during training, we monitor validation metrics epoch by epoch to see how the model is improving on unseen data.

Validation and Test Sets

To properly evaluate generalisation, we split our data:

Sometimes only train/test are mentioned, and one can treat test as validation when developing (tuning hyperparams), but then you'd need another fresh set for a true test. In practice, many use k-fold cross-validation for smaller datasets to get more robust estimates.

When training, if you see the training loss going down but validation loss/metric stops improving or gets worse, that's a sign of overfitting (more below). You might then stop training (early stopping) or adjust regularisation.

Overfitting vs Underfitting

Overfitting and Underfitting are two failure modes:

Ideally, we want a model that is just right (a good fit): training error reasonably low and validation error close to training error.

How to combat:

Often, a training curve can help: if training loss and validation loss are both decreasing and level off together, you are probably at a good fit. If training loss keeps going down but validation loss decreases and then rises, you are overfitting beyond that point (stop before it rises, or use regularisation). If training loss itself stays high, you are underfitting or need to train longer.
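
A minimal early-stopping sketch that implements this "stop when validation loss stops improving" idea (train_one_epoch and evaluate are assumed helper functions; a patience of 5 is an arbitrary choice):

best_val_loss = float('inf')
patience, patience_counter = 5, 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model.pth")   # keep the best weights so far
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break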

Jeremy Howard's rule of thumb: if training accuracy is much higher than validation accuracy, you are overfitting; if both are low, you are underfitting. Comparing validation and training loss is also informative. A Fast.ai forum note puts it this way: validation loss should be slightly higher than training loss; if it is much higher, you are likely overfitting; if it is actually lower than training loss, something else is at play, such as regularisation (dropout is only active during training), randomness in training, or an error in how the metrics are calculated.

Interpretability and Explainability

Deep learning models, especially large ones, are often treated as "black boxes" – they make predictions without easily understood reasoning. Model interpretability is about trying to understand why a model made a certain decision or what patterns it has learned.

While interpretability is a broad field of its own (Explainable AI, or XAI), some key approaches for neural networks include:

Interpretability is important in sensitive applications (medicine, finance, law) where one needs to justify decisions. For example, a neural network might predict someone's loan should be denied – interpretability tools might help identify that the decision was mostly influenced by, say, the person's debt-income ratio and credit history length, which makes sense, versus spurious things.

It's worth noting that full interpretability of very large deep models is an open challenge – these techniques provide some insights, but there's ongoing research. However, as a practitioner, being aware of these tools is valuable. Even simple measures like looking at confusion matrices (to see which classes are confused with which) or per-class accuracy can give insight into where the model might be failing or if it has biases.

In summary, while a typical deep learning workflow might focus on improving accuracy or loss, a near-expert should also:

8. Model Deployment Workflows

After training a successful model, the next step is often deployment – making the model available in a production environment so that it can start making predictions on new data (e.g., a web service serving users). Deployment involves considerations beyond model training, including how to integrate with applications, how to scale, and how to optimise for inference speed. We'll discuss a few common deployment scenarios in Python: serving a model through an API (Flask/FastAPI), deploying to the cloud with containers, and using ONNX/TorchScript for optimised, cross-platform inference.

Deploying with Flask/FastAPI (Building an API)

One straightforward way to deploy a model is to wrap it in a web service API. In Python, Flask and FastAPI are popular frameworks for building web servers. The idea:

Flask example (for a simple MNIST classifier):

from flask import Flask, request, jsonify
import torch
from model import MyMNISTModel

app = Flask(__name__)
model = MyMNISTModel()
model.load_state_dict(torch.load("mnist_model.pth"))
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()  # expecting a JSON with an image array perhaps
    image = torch.tensor(data['image']).view(1, 1, 28, 28)  # reshape to model input
    with torch.no_grad():
        output = model(image)
        prob = torch.softmax(output, dim=1)
        pred_class = int(torch.argmax(prob, dim=1))
    return jsonify({'predicted_class': pred_class, 'probabilities': prob.numpy().tolist()})

This is a simplistic example. In practice, you'd probably base64-encode images or use multipart form data, etc. But the pattern is the same: POST request -> parse input -> run model -> return result.

FastAPI is a newer framework that's very well-suited for building APIs (with automatic docs, async support). The code is similar (with decorators for routes).
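
A rough FastAPI equivalent of the Flask example above (same assumed MyMNISTModel and weight file; the request body schema is purely illustrative):

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from model import MyMNISTModel

app = FastAPI()
model = MyMNISTModel()
model.load_state_dict(torch.load("mnist_model.pth"))
model.eval()

class ImageRequest(BaseModel):
    image: list   # nested list of 28x28 pixel values

@app.post("/predict")
def predict(req: ImageRequest):
    x = torch.tensor(req.image, dtype=torch.float32).view(1, 1, 28, 28)
    with torch.no_grad():
        prob = torch.softmax(model(x), dim=1)
    return {"predicted_class": int(prob.argmax(dim=1))}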

Important considerations:

Cloud Deployment and Containerisation (AWS, GCP, Azure with Docker)

For scaling out to many users or integrating into a larger system, you often deploy the model as a microservice in the cloud. Common approach: use Docker containers to package the application (including the model and environment), then run that container on cloud infrastructure or Kubernetes.

Docker containerisation:

For example, a simple Dockerfile might look like:

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
ENTRYPOINT ["python", "app.py"]

This would run your app.py which contains the Flask app.

Cloud platforms:

The common theme: containerisation makes it easy to ensure the same environment in development and production. You containerise your model server and then use cloud orchestration to handle it. In a Kubernetes environment, you'd perhaps have a deployment with X replicas of your model server, behind a load balancer (service).

Scaling and Load: In cloud, you can scale horizontally by increasing container replicas if traffic increases. Cloud monitoring can watch CPU/GPU utilisation or request latency and scale accordingly.

GPU or CPU: On cloud, you can deploy on GPU machines if model inference needs it (e.g. heavy CNN on images in real-time). But GPUs are costly, so often an optimised CPU inference (or smaller model) is used if possible.

Latency vs Throughput: If serving must be low-latency (e.g. real-time user requests), you'd design for that (maybe keep model in memory, avoid too large batches). If high throughput offline predictions, you might batch requests.

Container registries and CI/CD: Typically, you integrate with CI/CD so that when you have a new model version, a pipeline builds a new Docker image, pushes to a registry, and deploys to your cluster.

ONNX and TorchScript for Cross-Platform Inference

Sometimes you want to deploy outside of a Python environment – e.g., in a C++ server, on mobile devices, or simply to optimise inference by removing Python overhead. Two approaches in the PyTorch ecosystem for this are ONNX (Open Neural Network Exchange) and TorchScript.

ONNX (Open Neural Network Exchange):

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"], output_names=["output"])

This traces or scripts the model and produces an ONNX file.

import onnxruntime as ort
ort_session = ort.InferenceSession("model.onnx")
outputs = ort_session.run(None, {"input": input_array})

In essence, using ONNX allows performance and portability: you can reduce latency and deploy across platforms with ONNX Runtime. It decouples the model from the original training code.

TorchScript:

scripted_model = torch.jit.script(model)  # or torch.jit.trace(model, dummy_input)
scripted_model.save("model.pt")
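
The saved module can later be reloaded without the original Python class definition (in Python via torch.jit.load, or from C++ via libtorch):

loaded = torch.jit.load("model.pt")
loaded.eval()
output = loaded(example_input)   # example_input: a tensor of the expected shape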

Choosing ONNX vs TorchScript:

Mobile Deployment: ONNX can be used via ONNX Runtime mobile or converted to CoreML for iOS, etc. TorchScript can be used via PyTorch Mobile. These allow models to run on smartphones, IoT devices.

Optimisations: Both ONNX and TorchScript models can be further optimised:

In conclusion, ONNX and TorchScript are valuable tools to bridge the gap from research (Python notebooks) to production environments where you need efficiency and compatibility. They enable cross-platform AI model inference – meaning your model can run in environments that don't have a Python interpreter or even PyTorch installed, and often run faster due to compiled execution.

9. Python Language Features for Deep Learning

Writing good deep learning code isn't just about neural network libraries – leveraging core Python features can make your code more modular, readable, and efficient. We'll highlight a few Python-specific techniques and idioms especially useful in machine learning projects: decorators, generators, context managers, and other helpful patterns.

Using Decorators in ML Code

Decorators are functions that wrap other functions or methods to extend their behaviour without explicitly modifying them. In a machine learning context, decorators can be handy for cross-cutting concerns such as logging, timing, caching, or ensuring pre/post-conditions:

For instance, to time any function:

import time
from functools import wraps

def timeit(func):
    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end-start:.3f} seconds")
        return result
    return wrapper

@timeit
def train_one_epoch(...):
    # training code

This would print how long each epoch takes, which helps identify performance bottlenecks.

Another use: Decorating model evaluation function to log metrics to a file for later analysis, or decorate a prediction function in an API to log the inputs and outputs (for monitoring).

Decorators keep the core logic clean (no need to sprinkle logging code everywhere), enabling separation of concerns. Common decorator use cases include logging, enforcing access control, caching results, and measuring execution time – all of which can apply in ML pipelines.

Generators for Data Pipelines

Generators (functions using yield) allow you to create iterators in a memory-efficient way. In deep learning, they are particularly useful for data loading and preprocessing:

Python's generator syntax is convenient:

def data_generator(file_path):
    with open(file_path) as f:
        for line in f:
            # process line to feature and label
            yield feature, label

This yields one sample per iteration without storing the entire file in memory.

Generators are also used under the hood in many frameworks (PyTorch's DataLoader uses multiple worker processes that each yield data). You can also manually create a generator and use it for training:

gen = data_generator("data.txt")
for epoch in range(num_epochs):
    for x, y in gen:
        ... # train on x, y

But note: once a generator is exhausted, it doesn't automatically restart. You'd create a new one per epoch or turn it into a cycling generator.
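
The simplest fix is to create a fresh generator at the start of each epoch:

for epoch in range(num_epochs):
    for x, y in data_generator("data.txt"):   # a new generator each epoch
        ...  # train on x, y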

A neat trick: generator expressions (like list comprehensions but with parentheses) can create simple generators:

gen = (transform(x) for x in range(1000))

This would yield transform(0), transform(1), ..., transform(999) lazily.

Benefits of generators:

One must be careful with generators in multi-threaded or multi-process contexts (like DataLoader). But in pure Python loops, they integrate seamlessly with the for syntax.

In summary, generators allow streaming large datasets or continuous data without running into memory limits and with clear, modular processing steps.

Context Managers in Deep Learning

Context managers (the with statement) are a Python feature that helps with resource management by ensuring setup and teardown code executes reliably. They are very useful in deep learning code:

import time
class Timer:
    def __enter__(self):
        self.start = time.time()
        return self  # allow `with Timer() as t:` usage
    def __exit__(self, exc_type, exc_val, exc_tb):
        print(f"Elapsed: {time.time()-self.start:.2f}s")
# usage:
with Timer():
    model(inputs)  # times this forward pass

PyTorch's context managers (no_grad, autocast) are heavily used:

with torch.no_grad():
    for x, y in val_loader:
        out = model(x)
        # compute accuracy...

This ensures that operations inside the block are not tracked for gradient computation, which makes evaluation faster and uses less memory.

Another scenario: using contextlib.contextmanager decorator you can easily create context managers. For instance, if you have an object that needs to be initialised and cleaned (like maybe a database connection for logging results), you could wrap connect/disconnect in a context manager.
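
For example, a small (illustrative) context manager that temporarily puts a model into evaluation mode and restores training mode afterwards:

from contextlib import contextmanager

@contextmanager
def evaluating(model):
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        if was_training:
            model.train()   # restore training mode even if an exception occurred

# usage:
with evaluating(model), torch.no_grad():
    preds = model(inputs)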

Why use context managers? They make code cleaner and less error-prone by handling the cleanup automatically even if errors occur. For example, if you manually disable gradient tracking and forget to re-enable it, subsequent training would silently break; using the with statement guarantees it is re-enabled.

In deep learning experiments, context managers help manage GPU memory (freeing cache, etc.), profiling (PyTorch has torch.autograd.profiler.profile() as a context manager), and any resource that must be properly closed.

Other Useful Python Features

A few additional Python features/patterns often helpful:

In building modular and scalable code, these language features help separate concerns:

For example, if you have a training loop and you want to add functionality to record the time of each epoch and maybe pause if system is overloaded – rather than peppering the loop with checks, you might use a decorator or context manager that wraps the epoch.

Finally, writing clean Pythonic code makes it easier for others (and future you) to understand and maintain. Using these features appropriately leads to code that is idiomatic and less error-prone.

Conclusion

This guide has covered a broad range from the theoretical underpinnings of deep learning to the practical skills needed to implement, train, evaluate, and deploy neural networks in Python. To progress from a beginner to having a good understanding of deep learning: