Dimensionality Reduction — Your Data Clinic Resources

The Curse of Dimensionality

As the number of features grows, data becomes increasingly sparse — distances between points lose meaning, and models require exponentially more data to generalise. Dimensionality reduction addresses this by finding compact representations that preserve the structure we care about.
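The loss of meaning in distances can be seen numerically. The sketch below is only an illustration under assumed conditions (points drawn uniformly from the unit hypercube; the helper name `distance_contrast` is made up for this example). It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np


def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative contrast (max - min) / min of distances from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))  # assumed: uniform data
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# Same number of points, increasing number of features.
contrasts = {d: distance_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension grows, the contrast shrinks towards zero: the nearest and farthest neighbours end up at almost the same distance, which is why distance-based methods struggle in high dimensions.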

📌 Beyond modelling benefits, dimensionality reduction is indispensable for visual exploration: reducing to 2D lets you spot cluster structure, outliers, and patterns before building any model.

Principal Component Analysis (PCA)

PCA finds a new set of orthogonal axes (principal components) that capture maximum variance. The first component points in the direction of greatest variance; each subsequent component is orthogonal to all previous ones and captures the largest share of the variance that remains.

PCA — eigenvector decomposition

\[ \Sigma = \frac{1}{n} X^T X, \quad \Sigma \mathbf{v}_k = \lambda_k \mathbf{v}_k \]

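Here \(X\) is the \(n \times d\) data matrix with every column mean-centred; without centring, \(\frac{1}{n} X^T X\) is not the covariance matrix. The principal components \(\mathbf{v}_k\) are the eigenvectors of the covariance matrix \(\Sigma\), sorted by decreasing eigenvalue \(\lambda_k\). Each eigenvalue equals the variance of the data along that component.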
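The decomposition can be sketched directly in NumPy. This is a minimal illustration rather than a production implementation: the function name `pca_eig` and the synthetic data are assumptions made for the example, and the \(\frac{1}{n}\) normalisation follows the formula above (some libraries divide by \(n - 1\) instead).

```python
import numpy as np


def pca_eig(X: np.ndarray, n_components: int):
    """PCA via eigendecomposition of the covariance matrix."""
    X_centred = X - X.mean(axis=0)           # centre each feature
    n = X_centred.shape[0]
    cov = X_centred.T @ X_centred / n        # Sigma = (1/n) X^T X
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]   # columns are the v_k
    scores = X_centred @ components          # data projected onto the v_k
    return components, eigvals, scores


# Hypothetical correlated 2D data: the second feature follows the first.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = np.hstack([latent, 0.5 * latent]) + 0.1 * rng.normal(size=(500, 2))

components, eigvals, scores = pca_eig(X, n_components=2)
print("eigenvalues:", eigvals)
print("variance of projected data:", scores.var(axis=0))
```

The two printed vectors agree, which is the statement that each eigenvalue equals the variance along its component.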

Fig 1. PCA principal components on a 2D scatter plot. PCA identifies the directions of greatest variance (PC1, PC2). In this example, projecting onto PC1 alone retains most of the variance.

Explained Variance Ratio

Plot the cumulative explained variance as you add components. A common heuristic: retain enough components to explain 95% of the variance. If 3 components capture 95% of the variance in 100-feature data, you've achieved roughly 33× compression (100 / 3) with minimal information loss.
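The 95% heuristic can be sketched as follows. The data here are synthetic and purely illustrative: 100 features generated from 3 hidden factors plus a little noise. The helper name `components_for_variance` is an assumption of this example.

```python
import numpy as np


def components_for_variance(X: np.ndarray, threshold: float = 0.95):
    """Smallest number of components whose cumulative variance ratio >= threshold."""
    X_centred = X - X.mean(axis=0)
    # Squared singular values are proportional to the covariance eigenvalues.
    singular_values = np.linalg.svd(X_centred, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratio)
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(cumulative))
    return k, cumulative


# Hypothetical data: 100 observed features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 100))

k, cumulative = components_for_variance(X)
print(f"components needed for 95% of variance: {k}")
print(f"compression factor: {X.shape[1] / k:.1f}x")
```

Plotting `cumulative` against the component index gives the curve described above; the chosen `k` is where it first crosses the threshold.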

t-SNE and UMAP: Non-Linear Methods

PCA is linear: it can only find linear structure. t-SNE and UMAP use non-linear mappings, which are far better at preserving local cluster structure. They are the standard tools for visualising high-dimensional data such as embeddings.

Method	Type	Global structure?	Speed	Best used for
PCA	Linear	Yes	Fast	Pre-processing, variance exploration, feature engineering
t-SNE	Non-linear	No (local only)	Slow	Visualisation of cluster structure
UMAP	Non-linear	Better than t-SNE	Fast	Visualisation + can be used for downstream tasks

⚠️ Treat t-SNE and UMAP primarily as visualisation tools: the axes are not interpretable, and distances between clusters in the 2D plot are not meaningful. For features in downstream models, prefer PCA components. t-SNE projections should not be used as features; UMAP embeddings can be, but only with care.
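A minimal t-SNE sketch for visualisation, assuming scikit-learn is installed. The dataset (a 500-image subset of scikit-learn's bundled digits), the PCA step down to 30 dimensions, and the perplexity of 30 are all illustrative choices rather than recommendations from this resource. UMAP lives in the separate `umap-learn` package and is not shown here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small subset of 8x8 digit images (64 features each) to keep the run short.
digits = load_digits()
X, y = digits.data[:500], digits.target[:500]

# Optional pre-processing: compress with PCA before the non-linear step.
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Non-linear projection to 2D, intended for plotting only.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

print(X_2d.shape)
```

Scatter-plotting `X_2d[:, 0]` against `X_2d[:, 1]`, coloured by `y`, shows the cluster structure. The result depends on the perplexity and on the random seed, which is one more reason to read the plot qualitatively.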
