# Naive Bayes classifier
Last edited: 2024-02-24
Suppose we have a classification problem and we are in the modelling framework. We are given features $A = \oplus_{i=1}^n A_i$ where each feature $A_i$ takes finitely many values (we also split up the input random variable $X$ into $X = \oplus_{i=1}^n X_i$ accordingly). Let our classification also have a finite domain $B$.
We can use our training data to estimate $\mathbb{P}[Y=b]$ for each $b \in B$ and $\mathbb{P}[X_i = a_i \vert Y = b]$ for each $a_i \in A_i$ and $b \in B$.
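As a quick sketch (the per-example notation $(x^{(j)}, y^{(j)})$ for $j = 1, \ldots, m$ is mine, not the note's), these estimates are just relative frequencies over the training set:
$$ \hat{\mathbb{P}}[Y=b] = \frac{\#\{j : y^{(j)} = b\}}{m}, \qquad \hat{\mathbb{P}}[X_i = a_i \vert Y = b] = \frac{\#\{j : x_i^{(j)} = a_i,\ y^{(j)} = b\}}{\#\{j : y^{(j)} = b\}}. $$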
We make the assumption that $Y$ and each of the $X_i$ form a Bayesian network where $Y$ is the root with one arrow to each of the $X_i$, with the probabilities above.
Using Bayes' rule, then marginalisation and the chain rule, we can calculate
$$ \mathbb{P}[Y = b \vert X = (a_1, a_2, \ldots, a_n)] = \frac{\mathbb{P}[Y=b] \prod_{i=1}^n \mathbb{P}[X_i = a_i \vert Y = b]}{Z} $$with the normalisation term $Z = \sum_{b' \in B} \mathbb{P}[Y=b'] \prod_{i=1}^n \mathbb{P}[X_i = a_i \vert Y = b']$. Then to compute the maximum a posteriori (MAP) estimate we don't need to compute $Z$ and we let
$$ f_{MAP}(a_1, a_2, \ldots, a_n) = \mbox{arg}\max_{b \in B} \mathbb{P}[Y=b] \prod_{i=1}^n \mathbb{P}[X_i = a_i \vert Y = b]. $$This is exactly our Naive Bayes classifier for this problem.
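To make the counting and the MAP rule concrete, here is a minimal sketch in Python; the class name `NaiveBayes` and the dictionary-based storage are illustrative choices, not something fixed by the note.

```python
from collections import Counter, defaultdict
from math import log, inf

class NaiveBayes:
    """Minimal categorical Naive Bayes: count frequencies, predict by MAP."""

    def fit(self, xs, ys):
        # xs: list of feature tuples (a_1, ..., a_n); ys: list of labels in B
        m = len(ys)
        self.class_counts = Counter(ys)
        self.prior = {b: c / m for b, c in self.class_counts.items()}
        # feature_counts[(i, a, b)] = #{j : x_i^(j) = a and y^(j) = b}
        self.feature_counts = defaultdict(int)
        for x, y in zip(xs, ys):
            for i, a in enumerate(x):
                self.feature_counts[(i, a, y)] += 1
        return self

    def predict(self, x):
        # arg max over b of P[Y=b] * prod_i P[X_i = a_i | Y = b],
        # computed in log-space to avoid numerical underflow.
        best_b, best_score = None, -inf
        for b, nb in self.class_counts.items():
            score = log(self.prior[b])
            for i, a in enumerate(x):
                p = self.feature_counts[(i, a, b)] / nb
                if p == 0:
                    score = -inf  # unseen value zeroes the product (see Limitations)
                    break
                score += log(p)
            if score > best_score:
                best_b, best_score = b, score
        return best_b
```

For instance, `NaiveBayes().fit([("sunny", "hot"), ("rainy", "cool")], ["no", "yes"]).predict(("sunny", "hot"))` returns `"no"`.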
# Run time
It is an eager learner, so it requires processing the training data up front. However, we only ever need to see each data point once, so training is fairly fast. Estimating the parameters is simple counting. This model also has close to the minimum number of parameters needed to account for each attribute's effect on the class.
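As a rough count (this arithmetic is my own, not stated in the note), such a model has
$$ (|B| - 1) + |B| \sum_{i=1}^n (|A_i| - 1) $$
free parameters: the class prior plus one conditional distribution over $A_i$ per class, versus $|B| \prod_{i=1}^n |A_i| - 1$ for the full joint distribution.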
# Correctness
Whilst this makes a large assumption about the different attributes, empirically it is considered very successful. It is thought that Google uses this a lot in Gmail for actual spam filtering!
# Limitations
- It makes a strong assumption that the attributes are not related to one another (conditionally independent given the class), which seems ridiculous.
- It is very vulnerable to co-dependent variables, as their influence grows exponentially with the number of related variables (e.g. if 3 variables were copies of the same attribute, we would cube that term).
- Though in practice it turns out not to matter that much!
- If any of the $\mathbb{P}[A_i = a_i \vert Y = b]$ is zero, it zeros out the entire product.
- We should account for this by always making these values non-zero. This is called “smoothing” of the probabilities (see the sketch after this list).
- We need a very large amount of data to guarantee our probabilities are accurate.
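One common way to do this smoothing is Laplace (add-one) smoothing; the sketch below is a standalone helper, with the pseudo-count `alpha` and the per-feature domain size passed in explicitly (these names are mine, not from the note).

```python
def smoothed_conditional(count, class_count, domain_size, alpha=1.0):
    """Laplace-smoothed estimate of P[X_i = a_i | Y = b].

    count:       #{j : x_i^(j) = a_i and y^(j) = b}
    class_count: #{j : y^(j) = b}
    domain_size: |A_i|, the number of values feature i can take
    alpha:       pseudo-count added to every (value, class) pair
    """
    return (count + alpha) / (class_count + alpha * domain_size)
```

With `alpha = 1`, a value never seen alongside class $b$ still gets probability $1 / (\#\{j : y^{(j)} = b\} + |A_i|)$ rather than zero, so a single unseen attribute value no longer wipes out the whole product.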