Week 1 - Chapter 3, Finite Markov Decision Process
Compulsory reading from Reinforcement Learning: An Introduction.
Finite Markov Decision Process
Using the dynamics function $p(s_{t+1}, r \vert s_t, a_t)$ we can calculate some other useful quantities.
State transition probability: when in state $s_t$ and taking action $a_t$, the probability of ending up in state $s_{t+1}$ is
$$ p(s_{t+1} \vert s_t, a_t) = \sum_{r \in R} p(s_{t+1}, r \vert s_t, a_t) $$
Expected state-action reward: given you are in state $s_t$ and take action $a_t$, your expected reward is
$$ r(s_t, a_t) = \mathbb{E}[R_{t+1} \vert s_t, a_t] = \sum_{r \in R} r \sum_{s_{t+1}} p(s_{t+1}, r \vert s_t, a_t) $$
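A minimal sketch of these two calculations in Python, assuming the dynamics are stored as a tabular dict; the `dynamics` table and the function names `transition_prob` and `expected_reward` are illustrative, not from the book.

```python
# Hypothetical tabular dynamics: dynamics[(s, a)] is a dict mapping
# (next_state, reward) pairs to their joint probability p(s', r | s, a).
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 5.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s' | s, a): marginalise the joint p(s', r | s, a) over rewards."""
    return sum(p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = E[R | s, a]: sum of rewards weighted by p(s', r | s, a)."""
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items())

print(transition_prob("s0", "a0", "s1"))  # 0.5
print(expected_reward("s0", "a0"))        # 0.5
```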
Discounted returns
If we run a game with infinitely many steps and a positive expected reward at each step, then the value of the game is infinite. This makes it hard to distinguish between options that are both, over the long term, worth infinite value. Therefore we instead consider discounted rewards.
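In the book's notation, the discounted return with discount factor $\gamma \in [0, 1)$ is
$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
which is finite whenever the rewards are bounded.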
Value and policy functions
The goal of reinforcement learning is to determine the optimal policy that maximizes some reward measure, such as the discounted return. However, when the reward lies far in the future it can be hard to determine what to do earlier on. Therefore we determine the ‘value’ of a given state - a statement about its long-term worth to the actor.
The best value function accurately evaluates each state's long-term worth to the actor. When using discounted rewards it is characterized by the Bellman equation.
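For reference, the Bellman optimality equation for the state-value function, written in the same notation as the equations above, is
$$ v_*(s_t) = \max_{a_t} \sum_{s_{t+1}, r} p(s_{t+1}, r \vert s_t, a_t) \left[ r + \gamma v_*(s_{t+1}) \right] $$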
It is important to note that for many problems computing this optimum exactly is infeasible - for example in games like chess or Go. So reinforcement learning focuses on computationally efficient ways to approximate the optimal value function and policy.
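As a hedged sketch, one way to compute $v_*$ is to repeatedly apply the Bellman optimality equation as an update (value iteration), shown here on the same made-up `dynamics` table as above. The exhaustive sweep over states is only possible because this toy state space is tiny - which is exactly what breaks down for games like chess or Go.

```python
GAMMA = 0.9  # discount factor (illustrative choice)

# dynamics[(s, a)] maps (next_state, reward) -> p(s', r | s, a)
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 5.0): 1.0},
}

states = {s for s, _ in dynamics}
actions = {s: [a for s2, a in dynamics if s2 == s] for s in states}

v = {s: 0.0 for s in states}
for _ in range(1000):  # sweep until (approximately) converged
    v = {
        s: max(
            sum(p * (r + GAMMA * v[sp]) for (sp, r), p in dynamics[(s, a)].items())
            for a in actions[s]
        )
        for s in states
    }

print(v)  # approximate v_* for each state
```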