What is Clustering?

Clustering is an unsupervised learning task — there are no labels. The goal is to group data points so that points within a cluster are more similar to each other than to those in other clusters. Common applications: customer segmentation, anomaly detection, document grouping, and data exploration before modelling.

K-Means

K-Means is the most widely used clustering algorithm. It partitions n points into K clusters by iteratively assigning each point to its nearest centroid, then recomputing centroids until convergence.

K-Means objective (minimise inertia)
\[ J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]
Minimise total within-cluster sum of squared distances to the centroid \(\boldsymbol{\mu}_k\). Converges to a local minimum — run multiple times with K-Means++ initialisation for robustness.

Fig 1. K-Means with K=3. Points are assigned to nearest centroid; centroids are recomputed until stable.

⚠️ K-Means assumes spherical, similarly-sized clusters and requires specifying K. Use the elbow method (plot inertia vs K) or silhouette score to select K when the true grouping is unknown.
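A minimal sketch of this workflow using scikit-learn (assumed available) on synthetic data: fit K-Means with K-Means++ initialisation for a range of K, and compare inertia (for the elbow method) alongside the silhouette score.

```python
# Sketch: selecting K via inertia (elbow) and silhouette score.
# Data is synthetic; all parameter values are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 6):
    # n_init restarts guard against bad local minima; k-means++ seeds centroids.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    print(f"K={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X, labels):.3f}")
```

Inertia always decreases as K grows, which is why the elbow (the point of diminishing returns) rather than the minimum is the signal; the silhouette score, by contrast, typically peaks at the best K.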

DBSCAN & Hierarchical Clustering

DBSCAN (Density-Based Spatial Clustering) doesn't require specifying K and discovers arbitrarily shaped clusters. It labels sparse points as noise, making it excellent for anomaly detection. Hierarchical clustering builds a dendrogram of nested groupings — useful when the hierarchy itself is informative or you're unsure of the right number of clusters, since you can cut the dendrogram at any level post hoc.

| Algorithm | Cluster shape | Requires K? | Handles noise? | Best for |
|---|---|---|---|---|
| K-Means | Spherical | Yes | No | Customer segmentation, fast iteration |
| DBSCAN | Arbitrary | No | Yes | Anomaly detection, spatial data |
| Hierarchical | Any | No (post-hoc) | Partially | Taxonomy, exploratory analysis |
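The contrast with K-Means can be sketched on a dataset of two interlocking half-moons, a shape K-Means cannot separate. This assumes scikit-learn; the `eps` and `min_samples` values are illustrative, not tuned.

```python
# Sketch: DBSCAN finds arbitrarily shaped clusters and flags noise;
# hierarchical clustering is cut post hoc at a chosen number of clusters.
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
# DBSCAN labels noise points -1; exclude them when counting clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")

# Hierarchical clustering: the dendrogram is built bottom-up, then cut.
agg = AgglomerativeClustering(n_clusters=2).fit(X)
print(f"Hierarchical: {len(set(agg.labels_))} clusters")
```

Note that DBSCAN infers the number of clusters from density alone, whereas the agglomerative model here is told where to cut the dendrogram.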

Evaluating Cluster Quality

Silhouette Score
\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]
where \(a(i)\) = mean distance to other points in the same cluster, \(b(i)\) = mean distance to points in the nearest other cluster. Score ranges from -1 to 1; higher is better. A score near 0 means overlapping clusters.
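The definition above can be checked by hand on a toy dataset and compared against scikit-learn's per-point implementation (the points and labels here are chosen purely for illustration).

```python
# Sketch: hand-compute s(i) for one point and compare with scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the point x = 0.0:
a = 1.0                    # a(i): mean distance to others in its own cluster
b = (10.0 + 11.0) / 2      # b(i): mean distance to the nearest other cluster
s0 = (b - a) / max(a, b)   # (10.5 - 1.0) / 10.5 ≈ 0.905

print(round(s0, 3), round(silhouette_samples(X, labels)[0], 3))
```

With clusters this well separated, b(i) dominates a(i) and the score approaches 1; had the clusters overlapped, a(i) and b(i) would be comparable and the score would sit near 0, matching the interpretation above.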