Decision Trees: The Foundation

A decision tree learns by repeatedly asking questions that best separate the data. At each node, it finds the split (feature + threshold) that maximises information gain or minimises impurity. The result is a transparent, human-readable model.

Fig 1. A decision tree for viewership prediction. Each internal node is a binary question; leaves contain the prediction.

Gini Impurity
\[ G = 1 - \sum_{k=1}^{K} p_k^2 \]
G = 0 means a perfectly pure node (all one class). For regression trees, variance reduction is used instead of Gini.
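To make the split criterion concrete, here is a minimal NumPy sketch that computes Gini impurity and exhaustively searches a single feature for the threshold minimising the weighted impurity of the two children. The function names `gini` and `best_split` are illustrative, not from any library:

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Search one feature for the threshold that minimises the
    weighted Gini impurity of the two child nodes."""
    best_t, best_g = None, np.inf
    for t in np.unique(x)[:-1]:          # candidate thresholds (exclude max)
        left, right = y[x <= t], y[x > t]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Toy example: one feature, two classes
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # threshold 3.0 yields perfectly pure children (G = 0)
```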

Random Forests: Wisdom of Crowds

A single decision tree overfits easily. Random forests fix this by training hundreds of trees, each on a bootstrap sample of the rows, with a random subset of features considered at every split. Predictions are then averaged across all trees.

The two sources of randomness — row sampling (bagging) and feature sampling — ensure the trees are decorrelated. Averaging many weakly-correlated trees dramatically reduces variance while keeping bias low.
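As a sketch of those two randomness sources, the following from-scratch bagging loop draws a bootstrap sample of rows for every tree and delegates per-split feature sampling to scikit-learn's `max_features` option. The helper names `fit_forest` and `predict_forest` are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=100, max_features="sqrt"):
    """Bagging: each tree sees a bootstrap sample of the rows;
    feature sampling at each split is handled via max_features."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # sample rows with replacement
        tree = DecisionTreeRegressor(max_features=max_features)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Average the predictions of all trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```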

Random Forest prediction (regression)
\[ \hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}) \]
where \(T_b(\mathbf{x})\) is the prediction of the \(b\)-th tree. A feature's importance is the average impurity decrease it produces, taken over all splits on that feature across all trees.
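In practice, both the averaging and the impurity-based importances come for free with scikit-learn. A minimal usage sketch on synthetic data (the dataset and hyperparameters are chosen purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: only 2 of the 5 features are informative
X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       random_state=0)

forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the mean impurity decrease per feature,
# normalised to sum to 1; the two informative features should dominate
print(forest.feature_importances_.round(3))
```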

Gradient Boosting: Sequential Refinement

Where random forests build trees in parallel, gradient boosting builds them sequentially — each new tree specifically targets the residual errors of the ensemble so far. Modern implementations (XGBoost, LightGBM, CatBoost) are among the most powerful tools in applied ML.

Boosting update rule
\[ F_{m}(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot T_m(\mathbf{x}) \]
\(\eta\) is the learning rate (typically 0.01–0.3). A small \(\eta\) means each tree contributes less, requiring more trees but reducing the risk of overfitting. \(T_m\) is fit on the pseudo-residuals (the negative gradient of the loss).
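For squared-error loss the negative gradient reduces to \(y - F_{m-1}(\mathbf{x})\), so the whole algorithm fits in a few lines. A from-scratch sketch (the helper names are illustrative; real implementations such as XGBoost add regularisation and second-order refinements):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_trees=200, eta=0.1, max_depth=3):
    """Gradient boosting for squared loss: each tree T_m is fit on the
    pseudo-residuals y - F_{m-1}(x), then F_m = F_{m-1} + eta * T_m."""
    F = np.full(len(y), float(y.mean()))   # F_0: best constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - F                  # negative gradient of 0.5*(y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        F = F + eta * tree.predict(X)      # the boosting update rule
        trees.append(tree)
    return float(y.mean()), trees

def predict_boosted(base, trees, X, eta=0.1):
    """F_M(x) = F_0 + eta * sum_m T_m(x)."""
    return base + eta * sum(t.predict(X) for t in trees)
```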
| Model | Key strength | Key weakness | Typical use |
|---|---|---|---|
| Decision Tree | Fully interpretable | Overfits easily | Exploratory analysis, rule extraction |
| Random Forest | Robust, low variance | Slower inference | Tabular data, feature importance |
| Gradient Boosting | State-of-the-art accuracy | Many hyperparameters | Competitions, production ML |