Decision Trees: The Foundation

A decision tree learns by repeatedly asking questions that best separate the data. At each node, it finds the split (feature + threshold) that maximises information gain or minimises impurity. The result is a transparent, human-readable model.

Fig 1. A decision tree for viewership prediction. Each internal node is a binary question; leaves contain the prediction.

Gini Impurity
\[ G = 1 - \sum_{k=1}^{K} p_k^2 \]
G = 0 means a perfectly pure node (all one class). For regression trees, variance reduction is used instead of Gini.
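To make the split criterion concrete, here is a minimal NumPy sketch that computes Gini impurity and exhaustively searches a single feature for the threshold minimising the weighted impurity of the two children. The function names `gini` and `best_split` are illustrative, not from any library:

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Search one feature for the threshold that minimises the
    weighted Gini impurity of the two child nodes."""
    best_t, best_g = None, np.inf
    for t in np.unique(x)[:-1]:          # candidate thresholds (exclude max)
        left, right = y[x <= t], y[x > t]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Toy example: one feature, two classes
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # threshold 3.0 yields perfectly pure children (G = 0)
```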

Random Forests: Wisdom of Crowds

A single decision tree overfits easily. Random forests fix this by training hundreds of trees, each on a bootstrap sample of the rows, with a random subset of features considered at every split. Predictions are then averaged across all trees.

The two sources of randomness — row sampling (bagging) and feature sampling — ensure the trees are decorrelated. Averaging many weakly-correlated trees dramatically reduces variance while keeping bias low.
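As a sketch of those two randomness sources, the following from-scratch bagging loop draws a bootstrap sample of rows for every tree and delegates per-split feature sampling to scikit-learn's `max_features` option. The helper names `fit_forest` and `predict_forest` are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=100, max_features="sqrt"):
    """Bagging: each tree sees a bootstrap sample of the rows;
    feature sampling at each split is handled via max_features."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # sample rows with replacement
        tree = DecisionTreeRegressor(max_features=max_features)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Average the predictions of all trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```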

Random Forest prediction (regression)
\[ \hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}) \]
where \(T_b(\mathbf{x})\) is the prediction of the \(b\)-th tree. A feature's importance is the average impurity decrease it produces, taken over all splits on that feature across all trees.
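In practice, both the averaging and the impurity-based importances come for free with scikit-learn. A minimal usage sketch on synthetic data (the dataset and hyperparameters are chosen purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: only 2 of the 5 features are informative
X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       random_state=0)

forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the mean impurity decrease per feature,
# normalised to sum to 1; the two informative features should dominate
print(forest.feature_importances_.round(3))
```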

Gradient Boosting: Sequential Refinement

Where random forests build trees in parallel, gradient boosting builds them sequentially — each new tree specifically targets the residual errors of the ensemble so far. Modern implementations (XGBoost, LightGBM, CatBoost) are among the most powerful tools in applied ML.

Boosting update rule
\[ F_{m}(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot T_m(\mathbf{x}) \]
\(\eta\) is the learning rate (typically 0.01–0.3). A small \(\eta\) means each tree contributes less, requiring more trees but reducing the risk of overfitting. \(T_m\) is fit on the pseudo-residuals (the negative gradient of the loss).
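For squared-error loss the negative gradient reduces to \(y - F_{m-1}(\mathbf{x})\), so the whole algorithm fits in a few lines. A from-scratch sketch (the helper names are illustrative; real implementations such as XGBoost add regularisation and second-order refinements):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_trees=200, eta=0.1, max_depth=3):
    """Gradient boosting for squared loss: each tree T_m is fit on the
    pseudo-residuals y - F_{m-1}(x), then F_m = F_{m-1} + eta * T_m."""
    F = np.full(len(y), float(y.mean()))   # F_0: best constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - F                  # negative gradient of 0.5*(y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        F = F + eta * tree.predict(X)      # the boosting update rule
        trees.append(tree)
    return float(y.mean()), trees

def predict_boosted(base, trees, X, eta=0.1):
    """F_M(x) = F_0 + eta * sum_m T_m(x)."""
    return base + eta * sum(t.predict(X) for t in trees)
```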
| Model | Key strength | Key weakness | Typical use |
|---|---|---|---|
| Decision Tree | Fully interpretable | Overfits easily | Exploratory analysis, rule extraction |
| Random Forest | Robust, low variance | Slower inference | Tabular data, feature importance |
| Gradient Boosting | State-of-the-art accuracy | Many hyperparameters | Competitions, production ML |