Week 1 - Chapter 3, Finite Markov Decision Process
Compulsory reading from Reinforcement Learning: An Introduction.
Finite Markov Decision Process
Using the dynamics function $p(s_{t+1}, r \vert s_t, a_t)$ we can calculate some other useful quantities.
State transition probability: when in state $s_t$ and taking action $a_t$, the probability of ending up in state $s_{t+1}$ is
$$ p(s_{t+1} \vert s_t, a_t) = \sum_{r \in R} p(s_{t+1}, r \vert s_t, a_t) $$
Expected state-action reward: given you are in state $s_t$ and take action $a_t$, your expected reward is
$$ r(s_t, a_t) = \mathbb{E}[R_{t+1} \vert s_t, a_t] = \sum_{r \in R} r \sum_{s_{t+1}} p(s_{t+1}, r \vert s_t, a_t) $$
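A minimal sketch of these two calculations in Python, assuming the dynamics are stored as a tabular dict; the `dynamics` table and the function names `transition_prob` and `expected_reward` are illustrative, not from the book.

```python
# Hypothetical tabular dynamics: dynamics[(s, a)] is a dict mapping
# (next_state, reward) pairs to their joint probability p(s', r | s, a).
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 5.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s' | s, a): marginalise the joint p(s', r | s, a) over rewards."""
    return sum(p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = E[R | s, a]: sum of rewards weighted by p(s', r | s, a)."""
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items())

print(transition_prob("s0", "a0", "s1"))  # 0.5
print(expected_reward("s0", "a0"))        # 0.5
```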
Discounted returns
If we run a game with infinitely many steps and a positive expected reward at each step, then the value of the game is infinite. This makes it hard to distinguish between options that are both, over the long term, worth infinite value. Therefore we instead consider discounted rewards.
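In the book's notation, the discounted return with discount factor $\gamma \in [0, 1)$ is
$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
which is finite whenever the rewards are bounded.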
Value and policy functions
The goal of reinforcement learning is to determine the optimal policy that maximizes some reward measure, such as the discounted return. However, when the reward lies far in the future it can be hard to determine what to do earlier on. Therefore we determine the ‘value’ of a given state - a statement about its long-term worth to the actor.
The best value function accurately evaluates each state's long-term worth to the actor. When using discounted rewards it is characterized by the Bellman equation.
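For reference, the Bellman optimality equation for the state-value function, written in the same notation as the equations above, is
$$ v_*(s_t) = \max_{a_t} \sum_{s_{t+1}, r} p(s_{t+1}, r \vert s_t, a_t) \left[ r + \gamma v_*(s_{t+1}) \right] $$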
It is important to note that for many problems computing this optimum exactly is infeasible - for example in games like chess or Go. So reinforcement learning focuses on computationally efficient ways to approximate the optimal value function and policy.
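As a hedged sketch, one way to compute $v_*$ is to repeatedly apply the Bellman optimality equation as an update (value iteration), shown here on the same made-up `dynamics` table as above. The exhaustive sweep over states is only possible because this toy state space is tiny - which is exactly what breaks down for games like chess or Go.

```python
GAMMA = 0.9  # discount factor (illustrative choice)

# dynamics[(s, a)] maps (next_state, reward) -> p(s', r | s, a)
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 5.0): 1.0},
}

states = {s for s, _ in dynamics}
actions = {s: [a for s2, a in dynamics if s2 == s] for s in states}

v = {s: 0.0 for s in states}
for _ in range(1000):  # sweep until (approximately) converged
    v = {
        s: max(
            sum(p * (r + GAMMA * v[sp]) for (sp, r), p in dynamics[(s, a)].items())
            for a in actions[s]
        )
        for s in states
    }

print(v)  # approximate v_* for each state
```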