# Cost complexity pruning for decision trees (CCP)
Last edited: 2024-02-02
When a decision tree is grown with no pruning it tends to overfit the training data. To reduce this we can “prune” the decision tree: we look at how useful each branch is and remove the ones that don’t add enough value. This is controlled by a parameter $\alpha$ which increases the cost of having more leaf nodes.
Suppose we are in a modelling framework where $T$ is our training data and let $\hat{f}$ be our decision tree. Assume $R(\hat{f}, T)$ is our function for evaluating how good a fit our tree is (likely this will use information entropy or the Gini index). Lastly let $L(\hat{f})$ be the set of leaf nodes in our decision tree $\hat{f}$. Set
$$ R_{\alpha}(\hat{f}, T) = R(\hat{f}, T) + \alpha \vert L(\hat{f}) \vert $$
to be the value we want to minimise over sub-trees.
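As a small illustration (with made-up numbers), once you know each leaf's contribution to $R$ the penalised cost is straightforward to compute; the sketch below assumes $R(\hat{f}, T)$ is simply the sum of per-leaf impurity contributions.

```python
# Hypothetical per-leaf contributions to R (e.g. weighted Gini impurity of each leaf).
leaf_costs = [0.02, 0.00, 0.05, 0.03]

def penalised_cost(leaf_costs, alpha):
    """R_alpha = R + alpha * (number of leaves)."""
    return sum(leaf_costs) + alpha * len(leaf_costs)

print(penalised_cost(leaf_costs, alpha=0.01))  # 0.10 + 4 * 0.01 = 0.14
```

A larger $\alpha$ makes each extra leaf more expensive, so the minimiser of $R_{\alpha}$ is a smaller tree.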
Ideally we would iterate over all potential subtrees of $\hat{f}$ and find the sub-tree that minimises $R_{\alpha}$ on $T$. However, this is computationally expensive, so it is usual to use a heuristic to choose which branch to prune instead.
One method to do this is to calculate the effective cost complexity of a non-terminal node. To give the intuition behind this, let $G = (V,E)$ be our decision tree and extend $R$ to be defined at each $v \in V$, where $R(v)$ is the evaluation function evaluated on the training data that makes it to $v$ (as when training the decision tree). Then the contribution of a leaf node $v \in L(G)$ to $R_{\alpha}(\hat{f}, T)$ is
$$ R(v) + \alpha. $$
The effective cost complexity of an internal node $v$ is defined as the value of $\alpha$ at which the contribution of $v$ treated as a leaf equals the contribution of $G_v$, the subtree rooted at $v$. Equating the two contributions,
$$ R(v) + \alpha = \sum_{x \in L(G_v)} R(x) + \alpha \vert L(G_v) \vert, $$
and solving for $\alpha$ gives
$$ \alpha_{eff}(v) = \frac{R(v) - \sum_{x \in L(G_v)} R(x)}{\vert L (G_v) \vert - 1}. $$
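As a worked example with made-up numbers: suppose $R(v) = 0.30$ and the subtree $G_v$ has three leaves with $R$-values $0.02$, $0.01$ and $0.03$. Then
$$ \alpha_{eff}(v) = \frac{0.30 - (0.02 + 0.01 + 0.03)}{3 - 1} = \frac{0.24}{2} = 0.12, $$
so collapsing $v$ into a leaf only becomes worthwhile once $\alpha$ exceeds $0.12$.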
The pruning process then repeatedly removes the branch with the lowest $\alpha_{eff}$ until that minimum value is greater than $\alpha$.
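For reference, scikit-learn exposes this style of pruning through `DecisionTreeClassifier`'s `ccp_alpha` parameter, and `cost_complexity_pruning_path` reports the effective alphas it finds on the training data. A minimal sketch (the dataset here is chosen arbitrarily for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_breast_cancer(return_X_y=True)

# Effective alphas at which successive subtrees get pruned away,
# together with the total leaf impurity of each pruned subtree.
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)
print(path.impurities)

# Fit a tree that is pruned at a chosen alpha.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
pruned.fit(X_train, y_train)
```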
# Pseudocode
```
CCP(decision_tree, alpha, R):
    Input:
        decision_tree   with a set of vertices V
        alpha           positive constant that determines how aggressively to prune
        R               the evaluation function such as entropy or Gini; this will
                        have to be defined on vertices of V (evaluated on the
                        training data that gets classified to v)
    Output:
        a pruned decision tree

    1. Set best_tree = decision_tree
    2. For each non-leaf v set alpha_eff(v) = calculate_effective_alpha(decision_tree, v, R)
    3. Set min_eff_alpha = min(alpha_eff(v) for v in V non-leaf)
    4. While min_eff_alpha < alpha
        4.1. Find v such that alpha_eff(v) = min_eff_alpha
        4.2. Prune best_tree at v
        4.3. Break if best_tree only has 1 vertex
        4.4. Let P be the set of vertices with v downstream of them (the ancestors of v)
        4.5. For each p in P set alpha_eff(p) = calculate_effective_alpha(best_tree, p, R)
        4.6. Set min_eff_alpha = min(alpha_eff(v) for v in best_tree non-leaf)
    5. Return best_tree

calculate_effective_alpha(tree, v, R):
    Input:
        tree   a decision tree
        v      a vertex of tree that is not a leaf node
        R      the evaluation function such as entropy or Gini, defined on
               vertices as above
    Output:
        alpha_eff, the effective alpha for that vertex

    1. Let L be the set of leaf vertices with v upstream of them in tree
    2. Set total_leaf_weight = sum(R(x) for x in L)
    3. Return (R(v) - total_leaf_weight) / (|L| - 1)
```
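Below is a minimal Python sketch of the same procedure, assuming the tree is represented as nested `Node` objects that already carry their own $R$-value. It is illustrative only; for simplicity it recomputes every effective alpha after each prune rather than only the ancestors of the pruned node.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    r: float                        # R evaluated on the training data reaching this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def leaves(node):
    """Leaf nodes of the subtree rooted at node."""
    if node.is_leaf:
        return [node]
    return leaves(node.left) + leaves(node.right)

def internal_nodes(node):
    """Non-leaf nodes of the subtree rooted at node."""
    if node.is_leaf:
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)

def effective_alpha(node):
    """alpha_eff(v) = (R(v) - sum of leaf R-values) / (|leaves| - 1)."""
    ls = leaves(node)
    return (node.r - sum(l.r for l in ls)) / (len(ls) - 1)

def prune(root, alpha):
    """Weakest-link pruning: repeatedly collapse the internal node with the
    smallest effective alpha until that minimum is no longer below alpha."""
    while not root.is_leaf:
        weakest = min(internal_nodes(root), key=effective_alpha)
        if effective_alpha(weakest) >= alpha:
            break
        weakest.left = weakest.right = None   # collapse the subtree into a leaf
    return root

# Tiny hypothetical tree: the R-values are made up for illustration.
tree = Node(0.50,
            left=Node(0.20, left=Node(0.08), right=Node(0.09)),
            right=Node(0.10))
prune(tree, alpha=0.1)
print(tree.left.is_leaf, tree.is_leaf)   # True False: only the left branch collapsed
```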
# Run time
As noted above, it would be ideal to iterate over all potential subtrees of $\hat{f}$ and find the sub-tree that minimises $R_{\alpha}$ on $T$, but this is computationally expensive. Using effective cost complexity cuts the run time down, though the procedure above can still take $O(\vert V \vert^3)$: there can be $O(\vert V \vert)$ pruning steps, each step may recompute effective alphas for up to $O(\vert V \vert)$ nodes, and each of those computations sums over up to $O(\vert V \vert)$ leaves.
# Correctness
This can reduce overfitting, however the parameter $\alpha$ needs to be fine-tuned; it is best to use cross-validation for this purpose.
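With scikit-learn, one way to do this (as a sketch, with an arbitrary dataset) is to take the candidate alphas from the pruning path and select among them by cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas from the pruning path rather than an arbitrary grid.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
)
search.fit(X, y)
print(search.best_params_["ccp_alpha"])
```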