What is Linear Regression?

Linear regression is a statistical modelling technique used to quantify and understand the relationship between one or more independent (explanatory) variables and a dependent (target) variable. It answers questions like: "If our marketing spend increases by £10,000, how does first-day viewership change?"

The key assumption is that this relationship can be approximated by a straight line — or a hyperplane when many variables are involved.

Statistical vs. Machine Learning Models

Statistical models like linear regression prioritise inference — understanding the relationships between variables, testing hypotheses, and quantifying uncertainty. Machine learning models prioritise prediction accuracy. The two are complementary: regression is often your first stop because it's interpretable and its outputs (coefficients, p-values) are directly actionable.

Fig 1. The line of best fit (teal) minimises the sum of squared residuals (gold dashed lines).

The Model Equation

Simple Linear Regression

Model
\[ \hat{y} = \beta_0 + \beta_1 x \]
where \(\hat{y}\) is the predicted value, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope — the change in \(y\) for a one-unit increase in \(x\).

Multiple Linear Regression

General Form
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \]
\(\varepsilon\) is the error term, assumed normally distributed with mean 0. Each \(\beta_k\) measures the marginal effect of predictor \(k\), holding all others constant.

Ordinary Least Squares (OLS)

Objective — minimise SSR
\[ \text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Squaring errors penalises large deviations more heavily, which is why outliers pull the line toward them.
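
To make the objective concrete, here is a minimal sketch that fits an OLS model with statsmodels on simulated data; the column names and coefficients are purely illustrative, not from the original model.

```python
# Minimal OLS sketch on made-up data (names and values are illustrative only).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "marketing_spend": rng.normal(size=200),
    "release_month": rng.normal(size=200),
})
y = 3.0 + 1.5 * X["marketing_spend"] - 0.8 * X["release_month"] + rng.normal(scale=0.5, size=200)

X_design = sm.add_constant(X)        # adds the intercept column for beta_0
model = sm.OLS(y, X_design).fit()    # OLS minimises the sum of squared residuals

print(model.params)                  # estimated beta_0 and slopes
print(model.ssr)                     # the SSR the fit minimised
```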

The Four Core Assumptions

1. Linearity

The relationship between predictors and target must be approximately linear. Test: pairplots. Fix: log or polynomial transformations.

2. No Multicollinearity

Independent variables should not be highly correlated. Test: correlation heatmap or VIF > 5–10. Fix: remove or merge correlated predictors.

3. Homoskedasticity

Residuals should have constant variance. Test: residuals vs. fitted plot — no funnel shape. Fix: transform the dependent variable.

4. Normality of Residuals

Errors should follow a normal distribution (required for valid inference). Test: histogram of residuals — should be bell-shaped.

Fig 2. Evenly spread residuals (left) = good. Fan shape (right) = heteroskedasticity.
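
As a quick sketch of how these checks might look in practice, the two plots below mirror Fig 2 and the residual histogram, reusing the `model` fitted in the earlier illustrative example.

```python
# Assumption checks for a fitted statsmodels OLS result (`model` from the sketch above).
import matplotlib.pyplot as plt

fitted = model.fittedvalues
resid = model.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Homoskedasticity: residuals vs fitted should show no funnel shape.
ax1.scatter(fitted, resid, alpha=0.6)
ax1.axhline(0, color="grey", linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fitted")

# Normality: the residual histogram should look roughly bell-shaped.
ax2.hist(resid, bins=20)
ax2.set(xlabel="Residual", title="Residual distribution")

plt.tight_layout()
plt.show()
```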

Evaluation Metrics

| Metric | Formula | Interpretation | Note |
| --- | --- | --- | --- |
| R² | \(1 - SSR/SST\) | % of variance explained | Higher is better |
| Adj. R² | \(1 - \frac{(1-R^2)(n-1)}{n-p-1}\) | R² penalised for extra variables | Use when comparing models |
| MAE | \(\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert\) | Average absolute error | Robust to outliers |
| RMSE | \(\sqrt{\frac{1}{n}\sum(y_i-\hat{y}_i)^2}\) | Root mean squared error | Penalises large errors |

⚠️ RMSE is sensitive to outliers. Because errors are squared, a single large residual can dominate. If your data has influential outliers, MAE gives a more representative picture of typical error.
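
A short sketch of computing the table's metrics with scikit-learn, reusing the fitted model and data from the earlier example; adjusted R² is computed by hand from the formula above.

```python
# Evaluation metrics for the illustrative model fitted earlier.
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_design)

r2 = r2_score(y, y_pred)                        # 1 - SSR/SST
mae = mean_absolute_error(y, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y, y_pred))   # penalises large errors

n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R² from the formula above

print(f"R²={r2:.3f}  Adj. R²={adj_r2:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```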

Statistical Significance: p-values & Coefficients

t-statistic for a coefficient
\[ t = \frac{\hat{\beta}_k}{\text{SE}(\hat{\beta}_k)} \]
Large |t| → small p-value → strong evidence the predictor is useful. Conventional threshold: p < 0.05.

📌 Practical insight: In our experience, removing variables with p > 0.05 rarely hurts predictive performance while improving interpretability significantly. In a real-world OTT viewership model, dropping 14 insignificant variables moved Adj. R² by only 0.1 percentage point.
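
If you are working in statsmodels, the fitted result already exposes the t-statistics and p-values; a brief sketch, continuing the illustrative example above:

```python
# Coefficient table: coef, std err, t, P>|t|, confidence intervals.
print(model.summary())

# Or pull the p-values out programmatically, e.g. to flag predictors with p > 0.05.
insignificant = model.pvalues[model.pvalues > 0.05].index
print("Candidates for removal:", list(insignificant))
```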

Bias–Variance Tradeoff

Error Decomposition
\[ \text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise} \]

Fig 3. Simple models have high bias; complex models have high variance. The sweet spot minimises total error.

| Model Type | Bias | Variance | Symptom |
| --- | --- | --- | --- |
| Too simple (underfitting) | High | Low | Poor train AND test performance |
| Well-calibrated | Low | Low | Good performance on both |
| Too complex (overfitting) | Low | High | Great train, poor test performance |
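
The sketch below illustrates the table on a toy problem: polynomials of increasing degree are fitted to simulated nonlinear data, and train versus test RMSE show underfitting, a reasonable fit, and overfitting. All data here is made up for illustration.

```python
# Train vs test error as model complexity (polynomial degree) grows.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(120, 1))
y_toy = np.sin(x).ravel() + rng.normal(scale=0.3, size=120)   # nonlinear truth + noise

x_tr, x_te, y_tr, y_te = train_test_split(x, y_toy, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, reasonable, too complex
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    rmse_tr = np.sqrt(mean_squared_error(y_tr, pipe.predict(x_tr)))
    rmse_te = np.sqrt(mean_squared_error(y_te, pipe.predict(x_te)))
    print(f"degree={degree:2d}  train RMSE={rmse_tr:.3f}  test RMSE={rmse_te:.3f}")
```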

Model Validation Strategies

K-Fold Cross-Validation

Fig 4. Each fold serves as validation once. Final score = average across all 5.

CV Score
\[ \text{CV Score} = \frac{1}{K} \sum_{k=1}^{K} \text{Performance}_k \]
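
A minimal K-fold sketch with scikit-learn, assuming the illustrative `X` and `y` from the earlier example, with K = 5 and RMSE as the per-fold performance measure.

```python
# 5-fold cross-validation: each fold serves as the validation set once.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")

print("RMSE per fold:", -scores)
print("CV score (mean RMSE):", -scores.mean())
```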

Leave-One-Out Cross-Validation (LOOCV)

A special case where K = n: every data point serves as the validation set exactly once. LOOCV makes the most thorough use of the training data and yields very stable estimates, at the cost of compute. In a real-world viewership model, LOOCV produced the best RMSE (0.040 vs 0.049 for the base model). A brief sketch follows.
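
Switching to LOOCV only requires swapping the splitter in the K-fold sketch above; note it refits the model n times, so it is slow for large datasets.

```python
# LOOCV is the K = n special case of cross-validation.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo_scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_absolute_error")
print("LOOCV mean absolute error:", -loo_scores.mean())
```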

Bootstrapping

Bootstrap estimate
\[ \hat{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b \]
Repeat B times: draw n samples with replacement, fit the model, and record the statistic of interest. The average of the B estimates gives a robust point estimate, and their spread gives a quantifiable confidence interval.
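
A hedged sketch of the procedure, reusing `X_design`, `y`, and the illustrative "marketing_spend" column from the earlier OLS example, and bootstrapping one coefficient as the statistic of interest.

```python
# Bootstrap: resample rows with replacement, refit, and collect a statistic B times.
import numpy as np
import statsmodels.api as sm

B = 1000
n = len(y)
rng = np.random.default_rng(1)
slopes = np.empty(B)

for b in range(B):
    idx = rng.integers(0, n, size=n)                       # draw n rows with replacement
    fit = sm.OLS(y.iloc[idx], X_design.iloc[idx]).fit()
    slopes[b] = fit.params["marketing_spend"]              # illustrative statistic

print("bootstrap estimate:", slopes.mean())
print("95% confidence interval:", np.percentile(slopes, [2.5, 97.5]))
```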

Variable Selection

📌 Key finding: Removing statistically insignificant variables (p > 0.05) rarely hurts predictive performance. A lean model with 5–6 strong predictors often matches a bloated 20-variable model while being far more interpretable and less prone to overfitting.

VIF — detecting multicollinearity
\[ \text{VIF}_j = \frac{1}{1 - R^2_j} \]
where \(R^2_j\) is the R² from regressing predictor \(j\) on all others. VIF > 5 is concerning; VIF > 10 is severe.
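
A short sketch of computing VIF per predictor with statsmodels' variance_inflation_factor, using the design matrix from the earlier illustrative example.

```python
# VIF for each predictor (the intercept column is skipped).
from statsmodels.stats.outliers_influence import variance_inflation_factor

for j, name in enumerate(X_design.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X_design.values, j))
```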