What is Clustering?

Clustering is an unsupervised learning task — there are no labels. The goal is to group data points so that points within a cluster are more similar to each other than to those in other clusters. Common applications: customer segmentation, anomaly detection, document grouping, and data exploration before modelling.

K-Means

K-Means is the most widely used clustering algorithm. It partitions n points into K clusters by iteratively assigning each point to its nearest centroid, then recomputing centroids until convergence.

K-Means objective (minimise inertia)
\[ J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]
Minimise total within-cluster sum of squared distances to the centroid \(\boldsymbol{\mu}_k\). Converges to a local minimum — run multiple times with K-Means++ initialisation for robustness.

Fig 1. K-Means with K=3. Points are assigned to nearest centroid; centroids are recomputed until stable.

⚠️ K-Means assumes spherical, similarly-sized clusters and requires specifying K. Use the elbow method (plot inertia vs K) or silhouette score to select K when the true grouping is unknown.
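A minimal sketch of this workflow using scikit-learn (assumed available) on synthetic data: fit K-Means with K-Means++ initialisation for a range of K, and compare inertia (for the elbow method) alongside the silhouette score.

```python
# Sketch: selecting K via inertia (elbow) and silhouette score.
# Data is synthetic; all parameter values are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 6):
    # n_init restarts guard against bad local minima; k-means++ seeds centroids.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    print(f"K={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X, labels):.3f}")
```

Inertia always decreases as K grows, which is why the elbow (the point of diminishing returns) rather than the minimum is the signal; the silhouette score, by contrast, typically peaks at the best K.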

DBSCAN & Hierarchical Clustering

DBSCAN (Density-Based Spatial Clustering) doesn't require specifying K and discovers arbitrarily shaped clusters. It labels sparse points as noise, making it excellent for anomaly detection. Hierarchical clustering builds a dendrogram of nested groupings — useful when the hierarchy itself is informative or you're unsure of the right number of clusters, since you can cut the dendrogram at any level post hoc.

| Algorithm | Cluster shape | Requires K? | Handles noise? | Best for |
|---|---|---|---|---|
| K-Means | Spherical | Yes | No | Customer segmentation, fast iteration |
| DBSCAN | Arbitrary | No | Yes | Anomaly detection, spatial data |
| Hierarchical | Any | No (post-hoc) | Partially | Taxonomy, exploratory analysis |
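The contrast with K-Means can be sketched on a dataset of two interlocking half-moons, a shape K-Means cannot separate. This assumes scikit-learn; the `eps` and `min_samples` values are illustrative, not tuned.

```python
# Sketch: DBSCAN finds arbitrarily shaped clusters and flags noise;
# hierarchical clustering is cut post hoc at a chosen number of clusters.
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
# DBSCAN labels noise points -1; exclude them when counting clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")

# Hierarchical clustering: the dendrogram is built bottom-up, then cut.
agg = AgglomerativeClustering(n_clusters=2).fit(X)
print(f"Hierarchical: {len(set(agg.labels_))} clusters")
```

Note that DBSCAN infers the number of clusters from density alone, whereas the agglomerative model here is told where to cut the dendrogram.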

Evaluating Cluster Quality

Silhouette Score
\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]
where \(a(i)\) = mean distance to other points in the same cluster, \(b(i)\) = mean distance to points in the nearest other cluster. Score ranges from -1 to 1; higher is better. A score near 0 means overlapping clusters.
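The definition above can be checked by hand on a toy dataset and compared against scikit-learn's per-point implementation (the points and labels here are chosen purely for illustration).

```python
# Sketch: hand-compute s(i) for one point and compare with scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the point x = 0.0:
a = 1.0                    # a(i): mean distance to others in its own cluster
b = (10.0 + 11.0) / 2      # b(i): mean distance to the nearest other cluster
s0 = (b - a) / max(a, b)   # (10.5 - 1.0) / 10.5 ≈ 0.905

print(round(s0, 3), round(silhouette_samples(X, labels)[0], 3))
```

With clusters this well separated, b(i) dominates a(i) and the score approaches 1; had the clusters overlapped, a(i) and b(i) would be comparable and the score would sit near 0, matching the interpretation above.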