Week 1 - Smoov & Curly's Bogus Journey
Review
The first part of this course is a review of:
Week 12 - Reinforcement learning
There is a change of notation in this course compared to the referenced lecture. Instead of using $U(s)$ for the utility, we use $V(s)$ for the value. Instead of considering the reward function as $R: S \rightarrow \mathbb{R}$, we consider it as $R: S \times A \rightarrow \mathbb{R}$, i.e. the reward takes into account the action you have taken. Restated in this notation, the Bellman equation is as below.
$$ V(s) = \max_{a \in A_s} \left ( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V(s') \right ) $$
A reminder of the notation used below:
- Discount factor: $\gamma$ with $0 \leq \gamma < 1$.
- Transition probability: Given you are in state $s$ and you take action $a$, $T(s,a,s')$ is the probability that you end up in state $s'$.
- States: $S$ is the set of all states.
- Actions: $A$ is the set of all actions. It could depend on the state, therefore we talk about $A_s$ for the actions available at state $s \in S$.
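To make this concrete, here is a minimal value-iteration sketch that repeatedly applies the $V$-form Bellman equation until it converges. The two-state MDP (its states, actions, rewards and transition probabilities) is entirely hypothetical, chosen only for illustration.

```python
# Minimal value-iteration sketch for the V-form Bellman equation.
# The two-state, two-action MDP below is hypothetical, purely for illustration.

GAMMA = 0.9  # discount factor, 0 <= gamma < 1

STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay", "go"]}  # A_s may depend on s

# R[s][a]: reward for taking action a in state s (made-up numbers)
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}

# T[s][a][s']: probability of ending up in s' after taking a in s
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 1.0, "s1": 0.0}},
}


def value_iteration(tol=1e-8):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {
            s: max(
                R[s][a] + GAMMA * sum(T[s][a][sp] * V[sp] for sp in STATES)
                for a in ACTIONS[s]
            )
            for s in STATES
        }
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new


V = value_iteration()
print(V)
```
Because $\gamma < 1$ the update is a contraction, so the loop converges to the unique fixed point of the Bellman equation.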
Quality
If we take the expression inside the brackets of the Bellman equation and define it as a new function $Q(s,a)$, the quality of taking action $a$ in state $s$, we derive the next set of equations.
$$ \begin{align*} V(s) & = \max_{a \in A_s} Q(s,a)\\ Q(s,a) & = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V(s')\\ & = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a' \in A_{s'}} Q(s',a') \end{align*} $$
The motivation for doing this will come later; intuitively, this form will be more useful when you do not have access to $T(s,a,s')$ and $R(s,a)$ directly and can only sample 'experience data'.
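The same fixed-point iteration can be written directly in terms of $Q$. The sketch below reuses the hypothetical toy MDP (`STATES`, `ACTIONS`, `R`, `T`, `GAMMA`) from the previous snippet and recovers $V(s) = \max_{a \in A_s} Q(s,a)$ at the end.

```python
# Q-form fixed-point iteration, reusing the hypothetical toy MDP defined above.

def q_value_iteration(tol=1e-8):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q(s',a')."""
    Q = {s: {a: 0.0 for a in ACTIONS[s]} for s in STATES}
    while True:
        Q_new = {
            s: {
                a: R[s][a]
                + GAMMA * sum(T[s][a][sp] * max(Q[sp].values()) for sp in STATES)
                for a in ACTIONS[s]
            }
            for s in STATES
        }
        diff = max(abs(Q_new[s][a] - Q[s][a]) for s in STATES for a in ACTIONS[s])
        if diff < tol:
            return Q_new
        Q = Q_new


Q = q_value_iteration()
print({s: max(Q[s].values()) for s in STATES})  # V(s) = max_a Q(s,a)
```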
Continuations
We can apply a similar trick to derive a third form of the Bellman equation; this time we set $C(s,a)$, the continuation, to be the summation within the definition of $Q(s,a)$.
$$ \begin{align*} Q(s,a) & = R(s,a) + C(s,a)\\ C(s,a) & = \gamma \sum_{s' \in S} T(s,a,s') \max_{a' \in A_{s'}} Q(s',a')\\ & = \gamma \sum_{s' \in S} T(s,a,s') \max_{a' \in A_{s'}} \left ( R(s',a') + C(s',a') \right ) \end{align*} $$
Each of these forms will enable us to do reinforcement learning in different circumstances; a short sketch of iterating the $C$ recursion is below, and the table after it shows how the forms relate to one another.
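As with $V$ and $Q$, the $C$ recursion can be iterated to a fixed point. The sketch below again reuses the hypothetical toy MDP from the earlier snippets and recovers $Q(s,a) = R(s,a) + C(s,a)$ at the end.

```python
# C-form fixed-point iteration, reusing the hypothetical toy MDP defined above.

def c_value_iteration(tol=1e-8):
    """Iterate C(s,a) <- gamma * sum_s' T(s,a,s') max_a' [R(s',a') + C(s',a')]."""
    C = {s: {a: 0.0 for a in ACTIONS[s]} for s in STATES}
    while True:
        C_new = {
            s: {
                a: GAMMA * sum(
                    T[s][a][sp] * max(R[sp][ap] + C[sp][ap] for ap in ACTIONS[sp])
                    for sp in STATES
                )
                for a in ACTIONS[s]
            }
            for s in STATES
        }
        diff = max(abs(C_new[s][a] - C[s][a]) for s in STATES for a in ACTIONS[s])
        if diff < tol:
            return C_new
        C = C_new


C = c_value_iteration()
print({s: {a: R[s][a] + C[s][a] for a in ACTIONS[s]} for s in STATES})  # Q = R + C
```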
| To ↓ \ From → | $V(s)$ | $Q(s,a)$ | $C(s,a)$ |
|---|---|---|---|
| $V(s)$ | $V(s) = V(s)$ | $V(s) = \max_{a \in A_s} Q(s,a)$ | $V(s) = \max_{a \in A_s} \left(R(s,a) + C(s,a) \right )$ |
| $Q(s,a)$ | $Q(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V(s')$ | $Q(s,a) = Q(s,a)$ | $Q(s,a) = R(s,a) + C(s,a)$ |
| $C(s,a)$ | $C(s,a) = \gamma \sum_{s' \in S} T(s,a,s')V(s')$ | $C(s,a) = \gamma \sum_{s' \in S} T(s,a,s') \max_{a' \in A_{s'}} Q(s', a')$ | $C(s,a) = C(s,a)$ |
If we find $V(s)$, we need to know both the transition probabilities and the reward to derive either $Q$ or $C$. However, $Q$ and $C$ have a nice property: from $Q$ we only need to know the transition probabilities to find $C$ and $V$, whereas from $C$ we only need to know the reward to determine $V$ and $Q$.
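As a rough illustration of the table and of which ingredient each direction needs, the sketch below carries out the conversions using the $V$, $Q$ and $C$ values computed in the earlier snippets (still the hypothetical toy MDP).

```python
# Conversions from the table above, reusing V, Q, C, R, T, GAMMA from the
# earlier sketches.

# From V: both T and R are required.
Q_from_V = {
    s: {a: R[s][a] + GAMMA * sum(T[s][a][sp] * V[sp] for sp in STATES)
        for a in ACTIONS[s]}
    for s in STATES
}

# From Q: only T is required (and V needs neither T nor R).
V_from_Q = {s: max(Q[s].values()) for s in STATES}
C_from_Q = {
    s: {a: GAMMA * sum(T[s][a][sp] * max(Q[sp].values()) for sp in STATES)
        for a in ACTIONS[s]}
    for s in STATES
}

# From C: only R is required.
V_from_C = {s: max(R[s][a] + C[s][a] for a in ACTIONS[s]) for s in STATES}
Q_from_C = {s: {a: R[s][a] + C[s][a] for a in ACTIONS[s]} for s in STATES}

print(V_from_Q, V_from_C)  # both should agree with V up to the tolerance
```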