The Building Block: Perceptron

Every neural network is built from simple units called neurons. Each neuron takes weighted inputs, sums them with a bias, and passes the result through an activation function that introduces non-linearity.

Single neuron
\[ z = \sum_{i=1}^{n} w_i x_i + b, \quad a = f(z) \]
where \(w_i\) are learnable weights, \(b\) is a bias, \(f\) is the activation function, and \(a\) is the output passed to the next layer.
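For concreteness, a single neuron can be written in a few lines of NumPy. This is a minimal sketch: the input, weight, and bias values are made up for illustration, and ReLU is assumed as the activation \(f\).

```python
import numpy as np

def relu(z):
    """ReLU activation: element-wise max(0, z)."""
    return np.maximum(0.0, z)

def neuron(x, w, b, f=relu):
    """Single neuron: z = w . x + b, a = f(z)."""
    z = np.dot(w, x) + b      # weighted sum plus bias
    return f(z)               # activation passed to the next layer

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.4, 0.1, -0.6])    # learnable weights
b = 0.2                           # bias
print(neuron(x, w, b))
```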
Fig 1. A feedforward network with two hidden layers. Each arrow is a weighted connection; each node applies an activation function.
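To make the figure concrete, the sketch below runs a forward pass through a network with two hidden layers. The layer sizes, random weights, and the choice of ReLU hidden activations are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Each layer computes a = f(W @ a_prev + b); the last layer is left linear."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)   # hidden layers
    W, b = params[-1]
    return W @ a + b          # output layer

# Assumed architecture: 4 inputs -> 8 -> 8 -> 1 output
sizes = [4, 8, 8, 1]
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)        # one example input
print(forward(x, params))
```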

Activation Functions

Without non-linear activations, a deep network collapses to a single linear transformation regardless of depth. Activations introduce the non-linearity that lets networks model complex patterns.
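The collapse of purely linear layers is easy to verify: composing two linear maps gives a single linear map with weight \(W_2 W_1\). A quick NumPy check, with random matrices chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

# Two stacked linear layers with no activation...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer with weight W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True
```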

| Function | Formula | Range | Best used for |
| --- | --- | --- | --- |
| ReLU | \(\max(0, z)\) | [0, ∞) | Hidden layers (default choice) |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | (0, 1) | Binary classification output |
| Tanh | \(\tanh(z)\) | (−1, 1) | Hidden layers, zero-centred |
| Softmax | \(\frac{e^{z_k}}{\sum_j e^{z_j}}\) | (0, 1), sums to 1 | Multi-class output |
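The four activations in the table map directly to code. A minimal NumPy sketch (the softmax subtracts the maximum logit first, a standard trick for numerical stability):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # range [0, inf)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # range (0, 1)

def tanh(z):
    return np.tanh(z)                  # range (-1, 1), zero-centred

def softmax(z):
    z = z - np.max(z)                  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()                 # entries in (0, 1), summing to 1

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                 # e.g. probabilities over 3 classes
```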

Backpropagation & Gradient Descent

Training means finding weights that minimise a loss function via gradient descent. Backpropagation efficiently computes the gradient of the loss with respect to every weight using the chain rule of calculus.

Gradient descent weight update
\[ w_{t+1} = w_t - \eta \cdot \frac{\partial \mathcal{L}}{\partial w_t} \]
\(\eta\) is the learning rate and \(\mathcal{L}\) is the loss function (typically MSE for regression, cross-entropy for classification). Adaptive optimisers such as Adam maintain a per-parameter effective step size, which usually speeds up convergence.
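The update rule can be seen working on a deliberately tiny problem: one weight, one data point, and a squared-error loss, so the gradient can be written by hand. The values below are assumptions for illustration only.

```python
# Fit y = w * x to a single point (x, y) = (2, 3) by gradient descent
x, y = 2.0, 3.0
w = 0.0          # initial weight
eta = 0.1        # learning rate

for step in range(20):
    y_hat = w * x                   # forward pass
    loss = 0.5 * (y_hat - y) ** 2   # squared-error loss
    grad = (y_hat - y) * x          # dL/dw by the chain rule
    w = w - eta * grad              # gradient descent update
print(w)                            # approaches y / x = 1.5
```

Backpropagation applies this same chain-rule computation of \(\partial \mathcal{L} / \partial w\) to every weight in a multi-layer network, reusing intermediate results layer by layer.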

⚠️ Vanishing gradients occur in very deep networks when gradients become too small to drive learning. ReLU activations and batch normalisation largely mitigate this in modern architectures.

Key Hyperparameters

| Hyperparameter | Typical range | Effect |
| --- | --- | --- |
| Learning rate | 1e-4 to 1e-2 | Too high = divergence; too low = slow convergence |
| Batch size | 32–512 | Larger = more stable gradients; smaller = more noise (often helpful) |
| Hidden layers | 1–10+ | Depth enables learning hierarchical features |
| Dropout rate | 0.1–0.5 | Regularisation; helps prevent overfitting |
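To show where these hyperparameters appear in practice, here is a minimal mini-batch training sketch in plain NumPy. The synthetic dataset, one-hidden-layer architecture, and the specific values are all assumptions for illustration; dropout is implemented as a simple inverted-dropout mask applied only during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from the table above (illustrative values)
learning_rate = 1e-3
batch_size = 64
hidden_units = 32
dropout_rate = 0.2

# Synthetic regression data: y = X @ true_w + noise
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=(10, 1))
y = X @ true_w + 0.1 * rng.normal(size=(1024, 1))

# One hidden layer: 10 -> hidden_units -> 1
W1 = rng.normal(scale=0.1, size=(10, hidden_units)); b1 = np.zeros(hidden_units)
W2 = rng.normal(scale=0.1, size=(hidden_units, 1));  b2 = np.zeros(1)

for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]

        # Forward pass: ReLU hidden layer with inverted dropout
        h = np.maximum(0.0, xb @ W1 + b1)
        mask = (rng.random(h.shape) > dropout_rate) / (1.0 - dropout_rate)
        hd = h * mask
        pred = hd @ W2 + b2
        err = pred - yb                              # dL/dpred for 0.5 * squared error

        # Backpropagation: chain rule, layer by layer
        dW2 = hd.T @ err / len(xb); db2 = err.mean(axis=0)
        dh = (err @ W2.T) * mask * (h > 0)
        dW1 = xb.T @ dh / len(xb);  db1 = dh.mean(axis=0)

        # Gradient descent updates
        W2 -= learning_rate * dW2; b2 -= learning_rate * db2
        W1 -= learning_rate * dW1; b1 -= learning_rate * db1

    print(epoch, float(np.mean(err ** 2)))           # MSE on the last mini-batch
```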