What is Linear Regression?
Linear regression is a statistical modelling technique used to quantify and understand the relationship between one or more independent (explanatory) variables and a dependent (target) variable. It answers questions like: "If our marketing spend increases by £10,000, how does first-day viewership change?"
The key assumption is that this relationship can be approximated by a straight line — or a hyperplane when many variables are involved.
Statistical models like linear regression prioritise inference — understanding the relationships between variables, testing hypotheses, and quantifying uncertainty. Machine learning models prioritise prediction accuracy. The two are complementary: regression is often your first stop because it's interpretable and its outputs (coefficients, p-values) are directly actionable.
Fig 1. The line of best fit (teal) minimises the sum of squared residuals (gold dashed lines).
The Model Equation
Simple Linear Regression
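In the usual notation, simple linear regression models the target with a single predictor:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where \(\beta_0\) is the intercept, \(\beta_1\) the slope, and \(\varepsilon\) the error term.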
Multiple Linear Regression
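With \(p\) predictors the line becomes a hyperplane:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \]
Each coefficient \(\beta_j\) gives the expected change in \(y\) for a one-unit change in \(x_j\), holding the other predictors constant.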
Ordinary Least Squares (OLS)
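OLS chooses the coefficients that minimise the sum of squared residuals:
\[ \min_{\beta_0, \dots, \beta_p} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \]
which is exactly the quantity visualised by the gold dashed lines in Fig 1.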
The Four Core Assumptions
1. Linearity: The relationship between predictors and target must be approximately linear. Test: pairplots. Fix: log or polynomial transformations.
2. No multicollinearity: Independent variables should not be highly correlated. Test: correlation heatmap or VIF > 5–10 (see the diagnostic sketch after Fig 2). Fix: remove or merge correlated predictors.
3. Homoscedasticity: Residuals should have constant variance. Test: residuals vs. fitted plot, which should show no funnel shape. Fix: transform the dependent variable.
4. Normality of residuals: Errors should follow a normal distribution (required for valid inference). Test: histogram of residuals, which should be roughly bell-shaped.
Fig 2. Evenly spread residuals (left) = good. Fan shape (right) = heteroskedasticity.
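Both checks can be scripted. A minimal diagnostic sketch, assuming a pandas DataFrame `df` with numeric predictor columns and a hypothetical target column `viewership`:

```python
# Minimal diagnostic sketch. Assumes a pandas DataFrame `df` with numeric
# predictor columns and a hypothetical target column "viewership".
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df.drop(columns=["viewership"]))
model = sm.OLS(df["viewership"], X).fit()

# Multicollinearity: VIF above roughly 5-10 flags a problematic predictor.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))

# Homoscedasticity: residuals vs. fitted values should show no funnel shape.
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```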
Evaluation Metrics
| Metric | Formula | Interpretation | Note |
|---|---|---|---|
| R² | \(1 - SSR/SST\) | % of variance explained | Higher is better |
| Adj. R² | \(1 - \frac{(1-R^2)(n-1)}{n-p-1}\) | R² penalised for extra variables | Use when comparing models |
| MAE | \(\frac{1}{n}\sum|y_i - \hat{y}_i|\) | Average absolute error | Robust to outliers |
| RMSE | \(\sqrt{\frac{1}{n}\sum(y_i-\hat{y}_i)^2}\) | Typical error magnitude, in the target's units | Penalises large errors |
⚠️ RMSE is sensitive to outliers. Because errors are squared, a single large residual can dominate. If your data has influential outliers, MAE gives a more representative picture of typical error.
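All four metrics are straightforward to compute. A minimal sketch, where `y_true`, `y_pred`, and `n_predictors` are illustrative names:

```python
# Minimal metric sketch. `y_true`, `y_pred`, and `n_predictors` are illustrative names.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, n_predictors):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)                        # 1 - SSR/SST
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    mae = mean_absolute_error(y_true, y_pred)            # robust to outliers
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalises large errors
    return {"R2": r2, "Adj. R2": adj_r2, "MAE": mae, "RMSE": rmse}
```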
Statistical Significance: p-values & Coefficients
📌 Practical insight: In our experience, removing variables with p > 0.05 rarely hurts predictive performance while improving interpretability significantly. In a real-world OTT viewership model, dropping 14 insignificant variables moved Adj. R² by only 0.1 percentage point.
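A hedged sketch of how such a check might look with statsmodels; the DataFrame `df` and its column names are hypothetical:

```python
# Hedged sketch: inspect coefficients and p-values with statsmodels.
# `df` and its column names are hypothetical.
import statsmodels.api as sm

X = sm.add_constant(df[["marketing_spend", "genre_score", "release_month"]])
model = sm.OLS(df["first_day_viewership"], X).fit()

print(model.summary())  # coefficients, standard errors, p-values, confidence intervals

# Predictors whose p-value exceeds 0.05 are candidates for removal.
weak = model.pvalues[model.pvalues > 0.05].index.tolist()
print("Candidates for removal:", weak)
```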
Bias–Variance Tradeoff
Fig 3. Simple models have high bias; complex models have high variance. The sweet spot minimises total error.
| Model Type | Bias | Variance | Symptom |
|---|---|---|---|
| Too simple (underfitting) | High | Low | Poor train AND test performance |
| Well-calibrated | Low | Low | Good performance on both |
| Too complex (overfitting) | Low | High | Great train, poor test performance |
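The tradeoff is easy to demonstrate on synthetic data: as polynomial degree grows, training error keeps falling while test error eventually climbs. An illustrative sketch:

```python
# Illustrative sketch on synthetic data: low-degree models underfit (high bias),
# very high-degree models overfit (high variance).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))
y = np.sin(2 * X.ravel()) + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree={degree:>2}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```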
Model Validation Strategies
K-Fold Cross-Validation
Fig 4. Each fold serves as validation once. Final score = average across all 5.
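K-fold CV trains on K−1 folds and validates on the remaining one, rotating until every fold has been held out once. A minimal 5-fold sketch with scikit-learn, assuming feature matrix `X` and target `y` are already prepared:

```python
# Minimal 5-fold cross-validation sketch; assumes X and y are already prepared.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_root_mean_squared_error", cv=cv)
print("RMSE per fold:", -scores)
print("Mean RMSE:    ", -scores.mean())
```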
Leave-One-Out Cross-Validation (LOOCV)
A special case where K = n. Every data point serves as the validation set exactly once. LOOCV gives the most thorough use of training data and the most stable estimates — at the cost of compute. In a real-world viewership model, LOOCV produced the best RMSE (0.040 vs 0.049 for the base model).
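Because LOOCV is just K-fold with K = n, scikit-learn expresses it directly (same `X` and `y` as above):

```python
# LOOCV sketch: every observation is held out exactly once (same X and y as above).
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_absolute_error", cv=LeaveOneOut())
print("LOOCV MAE:", -scores.mean())
```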
Bootstrapping
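Bootstrapping resamples the dataset with replacement, refits the model on each resample, and uses the spread of the resulting estimates to judge how stable the fitted coefficients are. A minimal sketch, assuming NumPy arrays `X` (2-D) and `y` (1-D):

```python
# Minimal bootstrap sketch: resample rows with replacement, refit, and use the
# spread of coefficient estimates as an uncertainty measure. Assumes NumPy
# arrays X (2-D) and y (1-D).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = len(y)
coef_samples = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)              # sample indices with replacement
    model = LinearRegression().fit(X[idx], y[idx])
    coef_samples.append(model.coef_)

coef_samples = np.asarray(coef_samples)
# 95% bootstrap confidence interval for each coefficient
print(np.percentile(coef_samples, [2.5, 97.5], axis=0))
```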
Variable Selection
📌 Key finding: Removing statistically insignificant variables (p > 0.05) rarely hurts predictive performance. A lean model with 5–6 strong predictors often matches a bloated 20-variable model while being far more interpretable and less prone to overfitting.
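One common way to operationalise this is backward elimination by p-value: refit, drop the least significant predictor, and repeat until everything left clears the threshold. A hedged sketch with statsmodels, assuming `X` is a DataFrame of predictors and `y` the target:

```python
# Hedged sketch of backward elimination by p-value. Assumes a DataFrame X of
# predictors and a Series y as the target (threshold of 0.05 as in the text).
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvals = model.pvalues.drop("const")       # never drop the intercept
        if pvals.empty or pvals.max() <= threshold:
            return model                          # every remaining predictor is significant
        X = X.drop(columns=[pvals.idxmax()])      # drop the least significant predictor
```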