The Building Block: Perceptron
Every neural network is built from simple units called neurons, or perceptrons. Each neuron takes weighted inputs, sums them with a bias, and passes the result through an activation function that introduces non-linearity: \(a = f(\mathbf{w}^\top \mathbf{x} + b)\).
Fig 1. A feedforward network with 2 hidden layers. Each arrow is a weighted connection; nodes apply an activation function.
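To make that definition concrete, here is a minimal NumPy sketch of a single neuron; the input, weight, and bias values and the choice of ReLU are illustrative assumptions, not values taken from Fig 1.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z)."""
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=relu):
    """One neuron: weighted sum of the inputs plus a bias, then a non-linearity."""
    z = np.dot(w, x) + b      # weighted sum with bias
    return activation(z)      # activation introduces the non-linearity

# Illustrative values (assumptions, not taken from the figure)
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.7])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))           # a single scalar activation
```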
Activation Functions
Without non-linear activations, a deep network collapses to a single linear transformation regardless of depth. Activations introduce the non-linearity that lets networks model complex patterns.
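A quick numerical check of that claim, as a sketch: two stacked linear layers with no activation between them are equivalent to a single linear layer whose weight matrix is the product of the two. The matrix sizes below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # first "layer", no activation
W2 = rng.normal(size=(2, 3))   # second "layer", no activation
x = rng.normal(size=4)

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...are indistinguishable from one linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True: the extra depth added nothing
```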
| Function | Formula | Range | Best used for |
|---|---|---|---|
| ReLU | \(\max(0, z)\) | [0, ∞) | Hidden layers (default choice) |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | (0, 1) | Binary classification output |
| Tanh | \(\tanh(z)\) | (−1, 1) | Hidden layers, zero-centred |
| Softmax | \(\frac{e^{z_k}}{\sum_j e^{z_j}}\) | (0,1), sums to 1 | Multi-class output |
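The four functions in the table are short enough to write out directly. This is a plain NumPy sketch; the max-subtraction in softmax is a standard numerical-stability trick, not part of the formula above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtracting the max keeps exp() from overflowing; the output still sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.5, 3.0])
for f in (relu, sigmoid, tanh, softmax):
    print(f.__name__, f(z))
```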
Backpropagation & Gradient Descent
Training means finding weights that minimise a loss function via gradient descent. Backpropagation efficiently computes the gradient of the loss with respect to every weight using the chain rule of calculus.
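As a sketch of both ideas together, the loop below trains a one-hidden-layer network on the XOR toy problem with hand-written backpropagation. The layer sizes, learning rate, activations, and step count are assumptions made for the example, not a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, a classic non-linearly-separable problem (illustrative choice).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer (tanh) and a sigmoid output; sizes are illustrative.
W1 = rng.normal(0, 1, (2, 4));  b1 = np.zeros((1, 4))
W2 = rng.normal(0, 1, (4, 1));  b2 = np.zeros((1, 1))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Binary cross-entropy loss, averaged over the batch
    loss = -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

    # Backward pass: apply the chain rule layer by layer
    dz2 = (a2 - y) / len(X)             # dL/dz2 for sigmoid + cross-entropy
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)  # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

print("final loss:", loss)
print("predictions:", a2.round(3).ravel())
```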
⚠️ Vanishing gradients occur in very deep networks when gradients shrink layer by layer until they are too small to drive learning. ReLU activations and batch normalisation largely mitigate this in modern architectures.
Key Hyperparameters
| Hyperparameter | Typical range | Effect |
|---|---|---|
| Learning rate | 1e-4 to 1e-2 | Too high = divergence; too low = slow convergence |
| Batch size | 32–512 | Larger = more stable gradients; smaller = noisier gradients, which can act as a mild regulariser |
| Hidden layers | 1–10+ | Depth enables learning hierarchical features |
| Dropout rate | 0.1–0.5 | Regularisation; randomly drops units during training to reduce overfitting |
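To show where each hyperparameter actually plugs in, here is a PyTorch sketch of a single training step on random data. The specific values, the 20 input features, and the 2-class output are arbitrary assumptions chosen from the ranges above, not recommended settings.

```python
import torch
from torch import nn

# Illustrative hyperparameter choices (assumptions, not tuned values)
learning_rate = 1e-3
batch_size = 64
hidden_layers = [128, 64]   # two hidden layers
dropout_rate = 0.3

# Build a small MLP that reflects those choices.
layers, in_features = [], 20                  # 20 input features: arbitrary assumption
for width in hidden_layers:
    layers += [nn.Linear(in_features, width), nn.ReLU(), nn.Dropout(dropout_rate)]
    in_features = width
layers.append(nn.Linear(in_features, 2))      # 2-class output
model = nn.Sequential(*layers)

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

# One training step on random data, just to show where each hyperparameter is used.
x = torch.randn(batch_size, 20)
targets = torch.randint(0, 2, (batch_size,))
loss = loss_fn(model(x), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```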