Boosting
Boosting is an Ensemble learning method that trains multiple instances of another model and combines them, via a weighted vote, into a classifier that is better than any one individual model.
Suppose we are in the modelling framework where $B = \{-1,1\}$ and we have a training set $T$ whose elements $t \in T$ are pairs $(a_t, b_t)$. Boosting iteratively trains classifiers $h_i: A \rightarrow \mathbb{R}$ based on how the past classifiers have performed. To do this, before training $h_i$ we define a distribution $\mathbb{D}_i$ on $A_T = \{a_t \vert t \in T\}$.
We set
$$\mathbb{D}_1(a) = \frac{1}{\vert T \vert}.$$To define the later distributions let's introduce some notation.
Let $\epsilon_i$ be the error rate of $h_i$ with respect to $\mathbb{D}_i$, that is, $\epsilon_i = \sum_{t \in T \, : \, \mbox{sgn}(h_i(a_t)) \neq b_t} \mathbb{D}_i(a_t)$.
Set our learning rate
$$ \alpha_i = \frac{1}{2} \ln \left ( \frac{1 - \epsilon_i}{\epsilon_i} \right ). $$Other choices of $\alpha_i$ can be made.
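For example, a classifier that misclassifies a quarter of the weight gets
$$\alpha_i = \frac{1}{2} \ln \left ( \frac{1 - 0.25}{0.25} \right ) = \frac{1}{2} \ln 3 \approx 0.549,$$so better-than-chance classifiers receive positive weight, and $\alpha_i \to \infty$ as $\epsilon_i \to 0$.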
Now define $\mathbb{D}_{i+1}$ iteratively by setting
$$ C_{i,t} = \mathbb{D}_i(a_t) \cdot e^{- \alpha_i \ b_t \ h_i(a_t)}, \ \mbox{ with } \ Z_i = \sum_{t \in T} C_{i,t}$$to give
$$\mathbb{D}_{i+1}(a_t) = \frac{C_{i,t}}{Z_i}. $$Note that if $b_t$ and $h_i(a_t)$ share the same sign then the exponent is negative and we decrease the weight of $a_t$, whereas if they differ we increase the weight of $a_t$. This assumes $\alpha_i > 0$, i.e. $\epsilon_i < 1/2$; if $\epsilon_i > 1/2$ then $\alpha_i < 0$ and the behaviour reverses.
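Continuing the example above with $\alpha_i = \frac{1}{2}\ln 3$ and a classifier whose outputs lie in $\{-1,1\}$, each correctly classified point has its weight multiplied by $e^{-\alpha_i} = 1/\sqrt{3} \approx 0.577$ and each misclassified point by $e^{\alpha_i} = \sqrt{3} \approx 1.732$, before dividing by $Z_i$ to renormalise.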
Once we have trained for some number of iterations $N$ we aggregate this all by defining our output
$$\hat{f}(a) = \mbox{sgn}\left ( \sum_{i = 1}^N \alpha_i h_i(a) \right )$$with $\mbox{sgn}$ being the Sign function.
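For instance, if $N = 2$ with $\alpha_1 = 0.6$, $h_1(a) = 1$, $\alpha_2 = 0.9$ and $h_2(a) = -1$, then $\hat{f}(a) = \mbox{sgn}(0.6 - 0.9) = -1$: the more heavily weighted classifier wins the vote.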
Pseudocode
Boost(classifier, N, T):
Input:
classifier - some way to train a classifier given some training data and a distribution.
N - the number of iterations we want to do.
T - pairs of points (a_t, b_t) to train on.
Output:
a model f from the space A to {-1, 1}.
1. Set D: T -> [0,1] to be D(t) = 1/|T|
2. For i = 1, ... , N do
2.1. Let h_i = classifier(T, D)
2.2. Set alpha_i = calculate_learning_rate(h_i, T, D)
2.3. Set D = Update_distribution(h_i, T, D, alpha_i)
3. Return f(x) = sgn(sum(alpha_i h_i(x) for i = 1, ..., N))
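For concreteness, here is the same routine as a minimal Python sketch. It assumes `classifier(T, D)` returns a callable from points to real numbers, and it relies on the two helper functions, `calculate_learning_rate` and `update_distribution`, sketched in Python after their pseudocode below; the names `boost`, `hs` and `alphas` are mine.

```python
import numpy as np

def boost(classifier, N, T):
    """T is a list of (a_t, b_t) pairs with b_t in {-1, +1}.
    classifier(T, D) returns a function h: A -> R trained under the weights D."""
    D = np.full(len(T), 1.0 / len(T))             # step 1: D(t) = 1/|T|
    hs, alphas = [], []
    for _ in range(N):                            # step 2
        h = classifier(T, D)                      # 2.1
        alpha = calculate_learning_rate(h, T, D)  # 2.2
        D = update_distribution(h, T, D, alpha)   # 2.3
        hs.append(h)
        alphas.append(alpha)
    # step 3: f_hat(a) = sgn(sum_i alpha_i h_i(a))
    return lambda a: np.sign(sum(al * h(a) for al, h in zip(alphas, hs)))
```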
calculate_learning_rate(h, T, D):
Input:
h - A classifier mapping points in A to a real number
T - pairs of points (a_t, b_t) to train on.
D - A distribution over T
Output:
a learning rate for this iteration
1. Set e = 0
2. For each t in T
2.1. If b_t != sgn(h(a_t)) then
2.1.1 Set e = e + D(t)
3. Return ln((1-e)/e)/2
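A Python sketch of this helper. The clamp on e is an addition of mine, not in the pseudocode, to avoid dividing by zero (or taking ln of zero) when the classifier is perfect or always wrong:

```python
import numpy as np

def calculate_learning_rate(h, T, D):
    # Weighted error: the total weight of the points h gets wrong.
    e = sum(D[i] for i, (a, b) in enumerate(T) if b != np.sign(h(a)))
    e = min(max(e, 1e-10), 1 - 1e-10)  # guard against e = 0 or e = 1
    return 0.5 * np.log((1 - e) / e)
```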
Update_distribution(h, T, D, alpha):
Input:
h - A classifier mapping points in A to a real number
T - pairs of points (a_t, b_t) to train on.
D - A distribution over T
alpha - a learning rate
Output:
a new distribution over T
1. For each t in T
1.1. Set C_t = D(t) exp(- alpha b_t h(a_t))
2. Set Z = sum(C_t for t in T)
3. Initialise a new distribution D' over T
4. For each t in T
4.1. Set D'(t) = C_t/Z
5. Return D'
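And the final helper in Python:

```python
import numpy as np

def update_distribution(h, T, D, alpha):
    # C_t = D(t) * exp(-alpha * b_t * h(a_t)), then normalise by Z = sum of the C_t.
    C = np.array([D[i] * np.exp(-alpha * b * h(a)) for i, (a, b) in enumerate(T)])
    return C / C.sum()
```

Putting it all together on a toy one-dimensional problem. The `stump` weak learner and the data below are made up for illustration: `stump` returns the threshold/sign rule minimising the D-weighted error.

```python
def stump(T, D):
    """Hypothetical weak learner: best threshold/sign classifier under weights D."""
    best = None
    for thresh in sorted(set(a for a, _ in T)):
        for s in (-1, 1):
            # Weighted error of the rule "predict s if a >= thresh, else -s".
            err = sum(d for d, (a, b) in zip(D, T) if s * (1 if a >= thresh else -1) != b)
            if best is None or err < best[0]:
                best = (err, thresh, s)
    _, thresh, s = best
    return lambda a: s * (1 if a >= thresh else -1)

T = [(0.0, -1), (1.0, -1), (2.0, 1), (3.0, 1)]
f_hat = boost(stump, 10, T)
print([int(f_hat(a)) for a, _ in T])  # [-1, -1, 1, 1]
```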