Decision Trees: The Foundation
A decision tree learns by repeatedly asking questions that best separate the data. At each node it chooses the split, a feature and a threshold on it, that maximises information gain or minimises an impurity measure such as Gini. The result is a transparent, human-readable model.
Fig 1. A decision tree for viewership prediction. Each internal node is a binary question; leaves contain the prediction.
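To make the split-finding concrete, here is a minimal sketch using scikit-learn. The synthetic dataset and the `feature_i` names are hypothetical placeholders, not the article's viewership data:

```python
# A minimal sketch: fit a small decision tree and print its learned rules.
# The dataset and feature names below are hypothetical placeholders.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# criterion="gini" minimises Gini impurity at each split;
# criterion="entropy" would maximise information gain instead.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree is literally a nested set of if/else rules.
print(export_text(tree, feature_names=[f"feature_{i}" for i in range(4)]))
```

Printing the rules this way is what makes the model "human-readable": every prediction can be traced to a handful of threshold comparisons.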
Random Forests: Wisdom of Crowds
A single decision tree overfits easily. Random forests mitigate this by training hundreds of trees on bootstrap samples of the data, each using a random subset of features at every split. Predictions are then averaged across all trees.
The two sources of randomness, row sampling (bagging) and feature sampling, keep the individual trees only weakly correlated. Averaging many weakly correlated trees dramatically reduces variance while keeping bias low.
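Both sources of randomness are exposed directly in scikit-learn. A minimal sketch, again assuming a synthetic placeholder dataset:

```python
# A minimal sketch of the two sources of randomness in a random forest.
# The synthetic dataset is a placeholder, not the article's data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,     # hundreds of trees, each trained independently
    bootstrap=True,       # row sampling: each tree sees a bootstrap sample
    max_features="sqrt",  # feature sampling: random subset at every split
    random_state=0,
)

# The averaged ensemble typically generalises better than one unpruned tree.
print("single tree:", cross_val_score(DecisionTreeClassifier(random_state=0), X, y).mean())
print("forest:     ", cross_val_score(forest, X, y).mean())
```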
Gradient Boosting: Sequential Refinement
Where random forests build trees in parallel, gradient boosting builds them sequentially: each new tree is fit to the residual errors of the ensemble so far (more precisely, to the negative gradient of the loss, which for squared error is exactly the residual). Modern implementations (XGBoost, LightGBM, CatBoost) are among the most powerful tools in applied ML.
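The core loop is simple enough to write from scratch. The following is an illustrative sketch of squared-error boosting with shallow regression trees, not a re-implementation of XGBoost/LightGBM/CatBoost; the data and hyperparameters are placeholders:

```python
# Illustrative from-scratch gradient boosting for squared-error loss.
# Synthetic 1-D regression data; all hyperparameters are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from a constant prediction
trees = []                        # keep trees so new points can be scored later

for _ in range(100):
    residuals = y - pred                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # a shallow "weak learner"
    tree.fit(X, residuals)                     # this tree targets the current residuals
    pred += learning_rate * tree.predict(X)    # sequential refinement
    trees.append(tree)

print("final training MSE:", np.mean((y - pred) ** 2))
```

The learning rate shrinks each tree's contribution, trading more trees for smoother fits; the production libraries layer regularisation and much faster split finding on top of this basic loop.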
| Model | Key strength | Key weakness | Typical use |
|---|---|---|---|
| Decision Tree | Fully interpretable | Overfits easily | Exploratory analysis, rule extraction |
| Random Forest | Robust, low variance | Slower inference | Tabular data, feature importance |
| Gradient Boosting | State-of-the-art accuracy | Many hyperparameters | Competitions, production ML |