What is Clustering?
Clustering is an unsupervised learning task — there are no labels. The goal is to group data points so that points within a cluster are more similar to each other than to those in other clusters. Common applications: customer segmentation, anomaly detection, document grouping, and data exploration before modelling.
K-Means
K-Means is the most widely used clustering algorithm. It partitions n points into K clusters by alternating two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it. The loop repeats until assignments stop changing.
Fig 1. K-Means with K=3. Points are assigned to nearest centroid; centroids are recomputed until stable.
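Here is a minimal sketch of that loop using scikit-learn. The synthetic blob data and the parameter values are illustrative choices, not part of the figure above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data: 300 points around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# K-Means: assign each point to its nearest centroid, recompute
# centroids as cluster means, repeat until assignments stabilise.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # final centroid coordinates, shape (3, 2)
print(labels[:10])          # cluster index (0-2) for the first 10 points
```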
⚠️ K-Means assumes spherical, similarly-sized clusters and requires specifying K. Use the elbow method (plot inertia vs K) or silhouette score to select K when the true grouping is unknown.
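One way to put that advice into practice is to sweep candidate values of K and inspect both metrics. A sketch, assuming the same synthetic data as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Look for an "elbow" where inertia stops dropping sharply,
# and a peak in the silhouette score.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}  inertia={km.inertia_:8.1f}  silhouette={sil:.3f}")
```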
DBSCAN & Hierarchical Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) doesn't require specifying K and discovers arbitrarily shaped clusters, though it does require tuning a neighbourhood radius (eps) and a density threshold (min_samples). It labels sparse points as noise, making it excellent for anomaly detection. Hierarchical clustering builds a dendrogram, a tree of nested merges (agglomerative) or splits (divisive); cutting the tree at different depths yields different numbers of clusters, which is useful when the hierarchy itself is informative or you're unsure of the right number of clusters.
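To see both side by side, here is a sketch on a non-spherical dataset that K-Means handles poorly. The two-moons data, eps, min_samples, and linkage values are illustrative assumptions, not prescriptions:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrarily shaped clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# DBSCAN: no K needed, but eps (neighbourhood radius) and
# min_samples (density threshold) must be tuned. Label -1 marks noise.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("DBSCAN labels:", set(db_labels))  # clusters 0, 1, plus -1 for any noise

# Agglomerative clustering: repeatedly merges the closest pair of
# clusters; cutting the dendrogram at a chosen depth yields the clusters.
hc_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
print("Hierarchical labels:", set(hc_labels))
```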
| Algorithm | Cluster shape | Requires K? | Handles noise? | Best for |
|---|---|---|---|---|
| K-Means | Spherical | Yes | No | Customer segmentation, fast iteration |
| DBSCAN | Arbitrary | No | Yes | Anomaly detection, spatial data |
| Hierarchical | Arbitrary | No (cut dendrogram post-hoc) | Partially | Taxonomies, exploratory analysis |