The Curse of Dimensionality

As the number of features grows, data becomes increasingly sparse: distances between points lose meaning, and models require exponentially more data to generalise. Dimensionality reduction addresses this by finding compact representations that preserve the structure we care about.
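The loss of meaning in distances can be made concrete with a small experiment. The sketch below assumes points drawn uniformly from the unit hypercube (an illustrative choice, not something specified above) and measures how much the farthest neighbour of a query point differs from the nearest one. The helper name `distance_contrast` is hypothetical.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative gap (max - min) / min between distances from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# As the dimension grows, nearest and farthest neighbours become almost equally far.
for d in (2, 10, 100, 1000):
    print(f"{d:>4} dims: relative contrast = {distance_contrast(500, d):.2f}")
```

The contrast should shrink sharply as the dimension increases, which is one way of seeing why nearest-neighbour reasoning degrades in high-dimensional feature spaces.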

📌 Beyond modelling benefits, dimensionality reduction is indispensable for visual exploration: reducing to 2D lets you spot cluster structure, outliers, and patterns before building any model.

Principal Component Analysis (PCA)

PCA finds a new set of orthogonal axes (principal components) that capture maximum variance. The first component points in the direction of greatest variance; each subsequent component is orthogonal to all previous ones and captures the largest share of the remaining variance.

PCA: eigenvector decomposition
\[ \Sigma = \frac{1}{n} X^T X, \quad \Sigma \mathbf{v}_k = \lambda_k \mathbf{v}_k \]
Here \(X\) is the \(n \times d\) data matrix with each feature mean-centred, so \(\Sigma\) is the covariance matrix. The principal components \(\mathbf{v}_k\) are the eigenvectors of \(\Sigma\), sorted by decreasing eigenvalue \(\lambda_k\). Each eigenvalue equals the variance captured by that component.

Fig 1. PCA identifies the directions of greatest variance (PC1, PC2). Projecting onto PC1 alone retains most of the variance.
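The eigenvector decomposition described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `pca_eig` and the toy 2D data are assumptions, and the covariance uses the \(1/n\) normalisation from the formula (many libraries divide by \(n - 1\) instead).

```python
import numpy as np

def pca_eig(X: np.ndarray):
    """Return (eigenvalues, components), sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = Xc.T @ Xc / Xc.shape[0]           # Sigma = (1/n) X^T X
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort to descending
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Hypothetical correlated 2D data, used only for illustration
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

eigvals, components = pca_eig(X)
scores = (X - X.mean(axis=0)) @ components  # coordinates along PC1, PC2

print("eigenvalues:       ", eigvals)
print("variance of scores:", scores.var(axis=0))
```

The two printed vectors should agree, which is the statement that each eigenvalue equals the variance captured by its component.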

Explained Variance Ratio

Plot cumulative explained variance as you add components. A common heuristic is to retain enough components to explain 95% of the variance. If 3 components capture 95% of the variance in 100-feature data, you have achieved roughly 33× compression with minimal information loss.
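The 95% heuristic can be sketched as follows. The synthetic data is an assumption chosen to mirror the example: 100 features generated from 3 hidden factors plus a little noise. The helper `components_for_variance` is a hypothetical name, not a library function.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest k whose top-k components explain at least `threshold` of the variance."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / Xc.shape[0])[::-1]  # descending order
    eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negative round-off
    cumulative = np.cumsum(eigvals) / eigvals.sum()  # cumulative explained variance ratio
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigvals)), cumulative

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))   # 3 hidden factors (illustrative assumption)
mixing = rng.normal(size=(3, 100))    # spread them across 100 features
X = latent @ mixing + 0.05 * rng.normal(size=(1000, 100))

k, cumulative = components_for_variance(X)
print(f"components needed for 95% of variance: {k}")
print(f"compression: {X.shape[1] / k:.0f}x")
```

Plotting `cumulative` against the component index gives the cumulative explained variance curve; real data rarely has a cut-off as clean as this toy example.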

t-SNE and UMAP: Non-Linear Methods

PCA is linear: it can only capture linear structure. t-SNE and UMAP use non-linear mappings that are far better at preserving local cluster structure. They are standard tools for visualising high-dimensional data such as embeddings.
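A small example shows what "only linear structure" means in practice. The sketch below builds two concentric rings (an assumed toy dataset) and projects them onto PC1: the inner ring lands inside the range of the outer ring, so no threshold on PC1 separates them. The radius is used here as a hand-picked non-linear feature that does separate them; it is only a stand-in for illustration and is not how t-SNE or UMAP work internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """n noisy points on a circle of the given radius (toy data)."""
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    pts = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    return pts + rng.normal(scale=0.05, size=(n, 2))

inner, outer = ring(1.0, 300), ring(3.0, 300)
X = np.vstack([inner, outer])
Xc = X - X.mean(axis=0)

# PC1 is the eigenvector of the covariance matrix with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
pc1 = eigvecs[:, -1]
proj = Xc @ pc1
proj_inner, proj_outer = proj[:300], proj[300:]

# A non-linear feature: distance from the centre
radius = np.linalg.norm(Xc, axis=1)

print("PC1 range, inner ring:", proj_inner.min(), proj_inner.max())
print("PC1 range, outer ring:", proj_outer.min(), proj_outer.max())
print("radius gap between rings:", radius[300:].min() - radius[:300].max())
```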

| Method | Type | Global structure? | Speed | Best used for |
|--------|------|-------------------|-------|---------------|
| PCA | Linear | Yes | Fast | Pre-processing, variance exploration, feature engineering |
| t-SNE | Non-linear | No (local only) | Slow | Visualisation of cluster structure |
| UMAP | Non-linear | Better than t-SNE | Fast | Visualisation + can be used for downstream tasks |

⚠️ t-SNE and UMAP are best treated as visualisation tools: the axes are not interpretable and distances between clusters in the 2D plot are not meaningful. Prefer PCA components over t-SNE or UMAP projections as features in downstream models.
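One practical reason PCA components work well as features is that a fitted PCA is a fixed linear map, so the same projection can be applied unchanged to data the model has not seen. The sketch below illustrates this with a hypothetical `SimplePCA` class written for this example (it is not a library class), using random data as a placeholder.

```python
import numpy as np

class SimplePCA:
    """Hypothetical minimal PCA with a reusable transform (illustration only)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])
        order = np.argsort(eigvals)[::-1][: self.n_components]
        self.components_ = eigvecs[:, order]  # columns are the top principal components
        return self

    def transform(self, X):
        # Same centring and same axes for any data, seen or unseen
        return (X - self.mean_) @ self.components_

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))  # placeholder training data
X_new = rng.normal(size=(5, 10))      # placeholder unseen data

pca = SimplePCA(n_components=3).fit(X_train)
train_features = pca.transform(X_train)
new_features = pca.transform(X_new)
print(train_features.shape, new_features.shape)
```

Standard t-SNE, by contrast, optimises 2D positions only for the points it was given, so it has no equivalent fixed map to apply to new rows.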
