Week 2 - Neural networks

OMSCS

The basis of all neural networks is a simplified model of a neuron in the human body.

Perceptron (neural network)

A perceptron takes inputs $x = (x_1, \ldots, x_n)$ with weights $w = (w_1, \ldots, w_n)$, forms the weighted sum $a = \sum_{i=1}^n x_i w_i$, and outputs $f(a)$, where we define $f$ to be an activation function.

Activation function

First we will take the activation function to be the binary step, which introduces an extra parameter $\theta$ (the firing threshold).

Binary step

$$f(a) = \begin{cases} 1 & \mbox{if } a \geq \theta\\ 0 & \mbox{otherwise.}\end{cases}$$
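As a concrete illustration, here is a minimal sketch of such a perceptron in Python (the name `perceptron` and its signature are illustrative, not from the lectures):

```python
def perceptron(x, w, theta):
    """Binary-step perceptron: output 1 iff the weighted sum of the
    inputs x under the weights w reaches the threshold theta."""
    a = sum(x_i * w_i for x_i, w_i in zip(x, w))
    return 1 if a >= theta else 0
```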

Binary operations with perceptrons

Logical and ($\land$)

Let $n = 2$ and $x_1, x_2 \in \{0,1\}$. If we set $w_1 = w_2 = 0.5$ and $\theta = 1$, then $p$ becomes

$$p(x_1, x_2) = \begin{cases} 1 & \mbox{if } x_1 = x_2 = 1\\ 0 & \mbox{otherwise.}\end{cases}$$

Logical or ($\lor$)

Let $n = 2$ and $x_1, x_2 \in \{0,1\}$. If we set $w_1 = w_2 = 1$ and $\theta = 1$, then $p$ becomes

$$p(x_1, x_2) = \begin{cases} 1 & \mbox{if } x_1 + x_2 \geq 1\\ 0 & \mbox{otherwise.}\end{cases}$$

Not

Let $n = 1$ and $x \in \{0,1\}$. If we set $w = -1$ and $\theta = 0$, then $p$ becomes

$$p(x) = \begin{cases} 1 & \mbox{if } x = 0\\ 0 & \mbox{otherwise.}\end{cases}$$

Xor

Define two perceptrons $p_{\land} : \mathbb{R}^2 \rightarrow \{0,1\}$ (from before) and $p : \mathbb{R}^3 \rightarrow \{0,1\}$ with $w_1 = w_2 = 1$, $w_3 = -2$, and $\theta = 0.5$; then

$$p(x) = \begin{cases} 1 & \mbox{if } x \in \{(1,0,0), (0,1,0), (1,1,0)\}\\ 0 & \mbox{otherwise.}\end{cases}$$

Now consider $x_1, x_2 \in \{0,1\}$ and the composition

$$p(x_1, x_2, p_{\land}(x_1,x_2)) = \begin{cases} 1 & \mbox{if } (x_1, x_2) \in \{(1,0), (0,1)\} \\ 0 & \mbox{otherwise.} \end{cases}$$
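Using the `perceptron` sketch from above, we can verify all four gates (a quick illustration; the helper names are my own):

```python
def p_and(x1, x2):
    return perceptron([x1, x2], [0.5, 0.5], theta=1)

def p_or(x1, x2):
    return perceptron([x1, x2], [1, 1], theta=1)

def p_not(x):
    return perceptron([x], [-1], theta=0)

def p_xor(x1, x2):
    # Feed x1, x2 and the AND of the two into the third perceptron.
    return perceptron([x1, x2, p_and(x1, x2)], [1, 1, -2], theta=0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", p_xor(x1, x2))  # prints 1 exactly when x1 != x2
```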

Perceptron rule

Given a training example $(x, y)$ and current prediction $\hat{y} = p(x)$, update each weight by $w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i$, where $\eta > 0$ is the learning rate.
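A minimal sketch of training with this rule, assuming the threshold is learned alongside the weights (treated as a bias); the function name and default parameters are illustrative:

```python
def train_perceptron_rule(data, n, eta=0.1, epochs=100):
    """data: list of (x, y) pairs, x a length-n sequence and y in {0, 1}."""
    w = [0.0] * n
    theta = 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if sum(xi * wi for xi, wi in zip(x, w)) >= theta else 0
            error = y - y_hat                # zero when already correct
            for i in range(n):
                w[i] += eta * error * x[i]   # the perceptron rule update
            theta -= eta * error             # threshold updated as a bias
    return w, theta
```

If the data are linearly separable, the weights stop changing once every example is classified correctly.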

Gradient descent

With the perceptron rule we applied an activation function that made $p$ non-differentiable. Let's drop the activation function for now and see if we can use differentiation to derive another training method.

Let our training data $(x,y) \in T$ have $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, and let $w = (w_1, w_2, \ldots, w_n)$ be the weights of a perceptron as before. Then define the following error function

$$ E(w) = \frac{1}{2} \sum_{(x,y) \in T} (y - a)^2, \ \ \ \mbox{where} \ \ \ a = \sum_{i=1}^n x_iw_i. $$

Note $E$ is a function of the weights in this example.

Then we can compute the partial derivative with respect to each weight to find the direction in which to adjust the weights so as to reduce the error.

$$ \begin{align*} \frac{\partial E}{\partial w_i} & = \frac{1}{2} \sum_{(x,y) \in T} -2 (y-a) \frac{\partial a}{\partial w_i}\\ & = - \sum_{(x,y) \in T} (y-a) x_i \end{align*} $$

This is very similar to the perceptron rule. To complete the training we repeatedly move the weights in the direction opposite to this derivative. This method is called gradient descent.

Gradient descent

$$ w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} = w_i + \eta \sum_{(x,y) \in T} (y-a)\, x_i, $$

where $\eta > 0$ is the learning rate (step size).
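A sketch of batch gradient descent using the derivative derived above; the learning rate and epoch count are illustrative choices:

```python
def train_gradient_descent(data, n, eta=0.05, epochs=500):
    """Minimise E(w) = 1/2 * sum over (x, y) of (y - a)^2, a = sum_i x_i * w_i."""
    w = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for x, y in data:
            a = sum(xi * wi for xi, wi in zip(x, w))
            for i in range(n):
                grad[i] -= (y - a) * x[i]   # dE/dw_i from the derivation above
        for i in range(n):
            w[i] -= eta * grad[i]           # step against the gradient
    return w
```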

Comparison

Whilst the perceptron rule converges in finite time on linearly separable datasets, it is unstable on datasets that are not linearly separable. The advantage of gradient descent is that it is stable on all datasets, but it may only converge to a local minimum.

Sigmoid function

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

This function looks similar to the binary step function but has one big advantage: it is differentiable.

$$\frac{d \sigma}{d a} = \sigma(a) (1 - \sigma(a))$$

This shows the function is relatively stable when $\sigma(a)$ is close to either $0$ or $1$, and changes fastest when it is close to $1/2$.

This allows us to use gradient descent on a perceptron that uses the sigmoid function as its activation function.
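As a quick numerical check of the derivative identity above (illustrative, not from the lectures):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

for a in (-4.0, 0.0, 4.0):
    analytic = sigmoid(a) * (1 - sigmoid(a))
    h = 1e-6
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)  # central difference
    print(a, round(analytic, 6), round(numeric, 6))        # largest at a = 0
```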

Neural network

A neural network is built by wiring perceptrons together in layers, with the outputs of one layer's units feeding into the inputs of the next layer's units.

If we use the sigmoid function (or some other differentiable activation function) for each of the perceptrons' activation functions, then the function this neural network represents is also differentiable. In this case we can run gradient descent, which in this setting is called backward propagation of errors (back propagation).

Backward propagation of errors (back propagation)

The error at the output is propagated backwards through the layers, using the chain rule, to compute the gradient of $E$ with respect to every weight.
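As a sketch of how this looks in practice, here is a one-hidden-layer network trained with back propagation on XOR. The architecture, learning rate, random seed, and initialisation range are all illustrative assumptions, and with so few hidden units a run can occasionally get stuck, which is exactly the local-optimum issue discussed next.

```python
import math, random

random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

H = 2                                                         # hidden units
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]  # + bias
w_o = [random.uniform(-1, 1) for _ in range(H + 1)]                  # + bias
eta = 0.5

for _ in range(10000):
    for x, y in data:
        # Forward pass.
        xs = list(x) + [1.0]                                  # bias input
        h = [sigmoid(sum(xi * wi for xi, wi in zip(xs, w))) for w in w_h]
        hs = h + [1.0]
        out = sigmoid(sum(hi * wi for hi, wi in zip(hs, w_o)))

        # Backward pass: push the error back through each sigmoid.
        delta_o = (y - out) * out * (1 - out)
        delta_h = [delta_o * w_o[j] * h[j] * (1 - h[j]) for j in range(H)]

        for j in range(H + 1):
            w_o[j] += eta * delta_o * hs[j]
        for j in range(H):
            for i in range(3):
                w_h[j][i] += eta * delta_h[j] * xs[i]

for x, y in data:
    xs = list(x) + [1.0]
    hs = [sigmoid(sum(xi * wi for xi, wi in zip(xs, w))) for w in w_h] + [1.0]
    print(x, y, round(sigmoid(sum(hi * wi for hi, wi in zip(hs, w_o))), 3))
```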

This will still have the issue of local optima versus the global optimum, so there are some more advanced techniques people use:

  • Momentum: adding a fraction of the previous weight update to the current one so the search can coast through shallow local minima (see the sketch after this list),
  • Higher order derivatives,
  • Random optimization, or
  • Penalty for complexity.
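For instance, a momentum update can be sketched like this (the coefficient `beta` and in-place style are my own choices):

```python
def momentum_step(w, grad, velocity, eta=0.05, beta=0.9):
    """One momentum update: velocity accumulates a decaying sum of past
    gradients, letting the search coast through shallow local minima."""
    for i in range(len(w)):
        velocity[i] = beta * velocity[i] - eta * grad[i]
        w[i] += velocity[i]
    return w, velocity
```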

Complexity in a neural network is defined by:

  • more nodes,
  • more layers, and
  • larger values for the weights.

Bias for neural networks

Restriction bias

This is the set of functions neural networks can represent.

Whilst it appears that a multilayer perceptron with sigmoid activation functions and two hidden layers has no restriction bias, once you have picked your hyperparameters, i.e. the number of layers and nodes, you have restricted the set of functions you can represent.

Danger of overfitting

It is important to use techniques like cross-validation to avoid overfitting. For neural networks it is also important to avoid large weight values.

Preference bias

Starting weights

Normally when you start back propagation the weights are initialised to small random values. Random values help us avoid falling into the same local minima on each run, and small values keep the initial complexity low.

  • Training neural networks normally has a preference for low-complexity networks over high-complexity ones.
  • We prefer correct answers to incorrect ones.