The Building Block: Perceptron

Every neural network is built from simple units called neurons. Each neuron takes weighted inputs, sums them with a bias, and passes the result through an activation function that introduces non-linearity.

Single neuron
\[ z = \sum_{i=1}^{n} w_i x_i + b, \quad a = f(z) \]
where \(w_i\) are learnable weights, \(b\) is a bias, \(f\) is the activation function, and \(a\) is the output passed to the next layer.
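For concreteness, a single neuron can be written in a few lines of NumPy. This is a minimal sketch: the input, weight, and bias values are made up for illustration, and ReLU is assumed as the activation \(f\).

```python
import numpy as np

def relu(z):
    """ReLU activation: element-wise max(0, z)."""
    return np.maximum(0.0, z)

def neuron(x, w, b, f=relu):
    """Single neuron: z = w . x + b, a = f(z)."""
    z = np.dot(w, x) + b      # weighted sum plus bias
    return f(z)               # activation passed to the next layer

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.4, 0.1, -0.6])    # learnable weights
b = 0.2                           # bias
print(neuron(x, w, b))
```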
Fig 1. A feedforward network with two hidden layers. Each arrow is a weighted connection; each node applies an activation function.
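To make the figure concrete, the sketch below runs a forward pass through a network with two hidden layers. The layer sizes, random weights, and the choice of ReLU hidden activations are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Each layer computes a = f(W @ a_prev + b); the last layer is left linear."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)   # hidden layers
    W, b = params[-1]
    return W @ a + b          # output layer

# Assumed architecture: 4 inputs -> 8 -> 8 -> 1 output
sizes = [4, 8, 8, 1]
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)        # one example input
print(forward(x, params))
```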

Activation Functions

Without non-linear activations, a deep network collapses to a single linear transformation regardless of depth. Activations introduce the non-linearity that lets networks model complex patterns.
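The collapse of purely linear layers is easy to verify: composing two linear maps gives a single linear map with weight \(W_2 W_1\). A quick NumPy check, with random matrices chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

# Two stacked linear layers with no activation...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer with weight W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True
```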

| Function | Formula | Range | Best used for |
| --- | --- | --- | --- |
| ReLU | \(\max(0, z)\) | [0, ∞) | Hidden layers (default choice) |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | (0, 1) | Binary classification output |
| Tanh | \(\tanh(z)\) | (−1, 1) | Hidden layers, zero-centred |
| Softmax | \(\frac{e^{z_k}}{\sum_j e^{z_j}}\) | (0, 1), sums to 1 | Multi-class output |
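The four activations in the table map directly to code. A minimal NumPy sketch (the softmax subtracts the maximum logit first, a standard trick for numerical stability):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # range [0, inf)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # range (0, 1)

def tanh(z):
    return np.tanh(z)                  # range (-1, 1), zero-centred

def softmax(z):
    z = z - np.max(z)                  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()                 # entries in (0, 1), summing to 1

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                 # e.g. probabilities over 3 classes
```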

Backpropagation & Gradient Descent

Training means finding weights that minimise a loss function via gradient descent. Backpropagation efficiently computes the gradient of the loss with respect to every weight using the chain rule of calculus.

Gradient descent weight update
\[ w_{t+1} = w_t - \eta \cdot \frac{\partial \mathcal{L}}{\partial w_t} \]
\(\eta\) is the learning rate and \(\mathcal{L}\) is the loss function (typically MSE for regression, cross-entropy for classification). Adaptive optimisers such as Adam maintain a per-parameter effective step size, which usually speeds up convergence.
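The update rule can be seen working on a deliberately tiny problem: one weight, one data point, and a squared-error loss, so the gradient can be written by hand. The values below are assumptions for illustration only.

```python
# Fit y = w * x to a single point (x, y) = (2, 3) by gradient descent
x, y = 2.0, 3.0
w = 0.0          # initial weight
eta = 0.1        # learning rate

for step in range(20):
    y_hat = w * x                   # forward pass
    loss = 0.5 * (y_hat - y) ** 2   # squared-error loss
    grad = (y_hat - y) * x          # dL/dw by the chain rule
    w = w - eta * grad              # gradient descent update
print(w)                            # approaches y / x = 1.5
```

Backpropagation applies this same chain-rule computation of \(\partial \mathcal{L} / \partial w\) to every weight in a multi-layer network, reusing intermediate results layer by layer.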

⚠️ Vanishing gradients occur in very deep networks when gradients become too small to drive learning. ReLU activations and batch normalisation largely mitigate this in modern architectures.

Key Hyperparameters

| Hyperparameter | Typical range | Effect |
| --- | --- | --- |
| Learning rate | 1e-4 to 1e-2 | Too high = divergence; too low = slow convergence |
| Batch size | 32–512 | Larger = more stable gradients; smaller = more noise (often helpful) |
| Hidden layers | 1–10+ | Depth enables learning hierarchical features |
| Dropout rate | 0.1–0.5 | Regularisation; helps prevent overfitting |
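To show where these hyperparameters appear in practice, here is a minimal mini-batch training sketch in plain NumPy. The synthetic dataset, one-hidden-layer architecture, and the specific values are all assumptions for illustration; dropout is implemented as a simple inverted-dropout mask applied only during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from the table above (illustrative values)
learning_rate = 1e-3
batch_size = 64
hidden_units = 32
dropout_rate = 0.2

# Synthetic regression data: y = X @ true_w + noise
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=(10, 1))
y = X @ true_w + 0.1 * rng.normal(size=(1024, 1))

# One hidden layer: 10 -> hidden_units -> 1
W1 = rng.normal(scale=0.1, size=(10, hidden_units)); b1 = np.zeros(hidden_units)
W2 = rng.normal(scale=0.1, size=(hidden_units, 1));  b2 = np.zeros(1)

for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]

        # Forward pass: ReLU hidden layer with inverted dropout
        h = np.maximum(0.0, xb @ W1 + b1)
        mask = (rng.random(h.shape) > dropout_rate) / (1.0 - dropout_rate)
        hd = h * mask
        pred = hd @ W2 + b2
        err = pred - yb                              # dL/dpred for 0.5 * squared error

        # Backpropagation: chain rule, layer by layer
        dW2 = hd.T @ err / len(xb); db2 = err.mean(axis=0)
        dh = (err @ W2.T) * mask * (h > 0)
        dW1 = xb.T @ dh / len(xb);  db1 = dh.mean(axis=0)

        # Gradient descent updates
        W2 -= learning_rate * dW2; b2 -= learning_rate * db2
        W1 -= learning_rate * dW1; b1 -= learning_rate * db1

    print(epoch, float(np.mean(err ** 2)))           # MSE on the last mini-batch
```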