Week 11 - Markov Decision Processes
Suppose we have a world with 12 states defined by a 2d grid $S = \{(x,y) \vert 1 \leq x \leq 4, 1 \leq y \leq 3\}$, where state $(2,2)$ cannot be entered, $(4,3)$ is home, $(4,2)$ is a pit of death, and $(1,1)$ is where we start.
Here we define actions $A = \{$up, down, left, right$\}$, where we can't take an action that would lead us off the board or into $(2,2)$; this gives us our function $A(s)$. The world is probabilistic: each action has a $0.8$ chance of doing what you would expect, and a $0.1$ chance of instead moving in each of the two perpendicular directions (if that move isn't possible, you stay where you are). This defines our transition probabilities, $T$. Lastly, if we make it home we get a point, and if we die in the pit of death we lose a point. All this data defines a Markov decision process, in fact a special kind of one.
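As a concrete sketch (the variable and function names below are my own choices for illustration, not part of the notes), the world above could be written down in Python as:

```python
# The grid world above as a Markov decision process.
GOAL, PIT, BLOCKED, START = (4, 3), (4, 2), (2, 2), (1, 1)

# The squares we can actually occupy: everything except the blocked square (2,2).
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != BLOCKED]

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step(s, a):
    """Deterministic effect of action a in state s (stay put if the move is illegal)."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def A(s):
    """Actions available in s: those that don't walk off the board or into (2,2)."""
    return [a for a in MOVES if step(s, a) != s]

def T(s, a):
    """Transition probabilities: 0.8 for the intended move, 0.1 for each perpendicular one."""
    probs = {}
    for action, p in [(a, 0.8), (PERPENDICULAR[a][0], 0.1), (PERPENDICULAR[a][1], 0.1)]:
        nxt = step(s, action)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def R(s):
    """Reward: +1 for reaching home, -1 for the pit of death, 0 elsewhere."""
    return 1 if s == GOAL else -1 if s == PIT else 0
```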
Given a Markov decision process we want to come up with a policy $\pi: S \rightarrow A$ that determines what action we take in each state.
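A policy is then just a table from states to actions. Reusing `STATES` and `A(s)` from the sketch above, a hand-written (and not necessarily good) policy that tries to head towards home might look like:

```python
def naive_policy(s):
    """Head right towards the last column if we can, otherwise up; else any legal action."""
    if s[0] < 4 and "right" in A(s):
        return "right"
    if "up" in A(s):
        return "up"
    return A(s)[0]

pi = {s: naive_policy(s) for s in STATES}
```

The rest of the problem is deciding which table of this form is best.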
Assigning credit
In problems like the one above we only get a reward at the very end. So, at the end of the game, we need to work out which moves were good and which were bad. This is hard to do without better feedback.
Cumulative rewards
Suppose we now get a reward for each state, and the score of a run is the sum of the rewards of all the states we have visited.
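Concretely, writing $R(s)$ for the reward of state $s$: if a run visits states $s_0, s_1, \dots, s_T$, its score under this rule is
$$ \sum_{t=0}^{T} R(s_t). $$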
For example, in the world above, suppose that as well as the rewards for the pit of death and for getting home we also give a reward for every other square; this might influence what we want to do. If it were strongly positive we might want to stay out for longer. If it were strongly negative we might risk walking past the pit of death to get home sooner, or even jump into the pit of death!
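As a rough illustration (the three runs and the per-square reward values below are made up for the example, not taken from the notes), here is how the per-square reward changes which run scores best:

```python
HOME, PIT = (4, 3), (4, 2)

def run_score(path, square_reward):
    """Cumulative score of a run: +1 for home, -1 for the pit, square_reward elsewhere."""
    special = {HOME: 1, PIT: -1}
    return sum(special.get(s, square_reward) for s in path)

direct = [(1, 1), (2, 1), (3, 1), (3, 2), (3, 3), HOME]   # straight home
wander = [(1, 1), (1, 2), (1, 3), (1, 2), (1, 1),
          (2, 1), (3, 1), (3, 2), (3, 3), HOME]           # dawdle, then go home
to_pit = [(1, 1), (2, 1), (3, 1), (4, 1), PIT]            # jump straight in the pit

for r in (0.5, -0.5, -3):
    scores = {"direct": run_score(direct, r),
              "wander": run_score(wander, r),
              "to_pit": run_score(to_pit, r)}
    print(f"square reward {r:+}: best run is {max(scores, key=scores.get)}", scores)
```

With a strongly positive per-square reward the wandering run wins, with a mildly negative one the direct run home wins, and with a strongly negative one jumping in the pit scores best.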
Trouble with cumulative rewards and infinite time
Suppose our actions lead to a sequence of rewards $r_i$ for $i \in \mathbb{N}$. If there exists some $c > 0$ such that $r_i > c$ for all $i$, then summing the rewards gives
$$ \sum_{i=1}^{\infty} r_i = \infty. $$
Now suppose there is another sequence of actions that leads to rewards $s_i$ with $s_i > r_i$ for all $i \in \mathbb{N}$. Taking this second set of actions we have
$$ \sum_{i=1}^{\infty} s_i = \infty. $$
So we have no preference between these two sets of actions, even though the second set seems strictly better. This is an issue, so we want to factor this divergence out.
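A toy numerical check of the problem (with made-up constant reward sequences $r_i = 1$ and $s_i = 2$, so $s_i > r_i$ for every $i$ and both exceed $c = 0.5$): the running total of either sequence passes any bound, so both infinite sums diverge and comparing totals gives no preference.

```python
def steps_to_exceed(reward_per_step, bound):
    """How many steps until the running total of a constant reward passes bound."""
    total, steps = 0.0, 0
    while total <= bound:
        total += reward_per_step
        steps += 1
    return steps

for bound in (10, 1_000, 1_000_000):
    # Both running totals blow past every bound, just at different speeds.
    print(bound, steps_to_exceed(1.0, bound), steps_to_exceed(2.0, bound))
```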
Whilst we might worry that the sums above take infinite time to converge, in reality we only care about when the policy based on these utilities stops changing, which might take much less time.