Week 12 - Reinforcement learning

OMSCS

Transitions (MDP)

Reinforcement learning

There are 4 kinds of process we think about:

  • a planner, which turns a model into a policy (planning);
  • a learner, which turns sample transitions into a policy (reinforcement learning);
  • a modeler, which turns sample transitions into a model; and
  • a simulator, which turns a model into sample transitions.

We can compose these strategies in two major ways.

The first uses a simulator and a learner to build an RL-based planner: simulate transitions from the model, then learn a policy from them. This method gave one of the first major RL successes: a competitive backgammon player (TD-Gammon).

Model-based reinforcement learning

The second composition uses a modeler and a planner: learn a model from the transitions, then plan over that model to get a policy. This is model-based reinforcement learning.

Approaches to Reinforcement learning

There are 3 different approaches to RL:

  • Policy search: Here you iterate through different policies, looking for the one that best fits the data.
  • Value-function based: Here you use your example transitions to generate a utility function that maps each state to a value, which, as in Policy Iteration (MDP), you can then use to derive a policy.
  • Model-based reinforcement learning: You use the transitions to generate a model, consisting of a transition function and a reward function (see the sketch after this list).
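
To make the model-based idea concrete, here is a minimal sketch (my own illustration, not the course's code) that estimates $T$ and $R$ from observed $(s, a, r, s')$ transitions by simple counting and averaging; the function name and toy samples are made up:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') and R(s) from (s, a, r, s') samples by counting
    and averaging -- a sketch of the modeling step, not the course's code."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    reward_sum = defaultdict(float)
    reward_n = defaultdict(int)

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[s] += r
        reward_n[s] += 1

    T = {sa: {sp: c / sum(nexts.values()) for sp, c in nexts.items()}
         for sa, nexts in counts.items()}
    R = {s: reward_sum[s] / reward_n[s] for s in reward_sum}
    return T, R

# Toy usage: two states, one action 'a'.
samples = [(0, 'a', 1.0, 1), (0, 'a', 1.0, 1), (0, 'a', 1.0, 0), (1, 'a', 0.0, 1)]
T, R = estimate_model(samples)
print(T[(0, 'a')])   # roughly {1: 0.67, 0: 0.33}
print(R)             # {0: 1.0, 1: 0.0}
```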

Value-function based

When using Value iteration (MDP) we look for the optimal utility function

$$ U^{\ast}(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s' \in S} T(s,a,s') U^{\ast}(s'). $$

Whereas for Policy Iteration (MDP) we look for the optimal policy given a utility function

$$\pi^{\ast}(s) = \mbox{arg}\max_{a \in A_s} \sum_{s' \in S} T(s,a,s')U^{\ast}(s').$$
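
For concreteness, a small sketch of both computations on a made-up three-state MDP; the numbers in $T$ and $R$ are arbitrary and only for illustration:

```python
import numpy as np

# Made-up 3-state, 2-action MDP: T[s, a, s'] transition probabilities, R[s] rewards.
T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Value iteration: U(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') U(s').
U = np.zeros(3)
for _ in range(200):
    U = R + gamma * (T @ U).max(axis=1)

# Greedy policy from the converged utilities: pi(s) = argmax_a sum_s' T(s, a, s') U(s').
pi = (T @ U).argmax(axis=1)
print(U)   # converged utilities
print(pi)  # greedy action index per state
```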

We would like a way to incorporate both approaches.

Q-function (RL)

The Q-function (RL), which we can write as

$$ Q(s,a) = R(s) + \gamma \sum_{s' \in S} T(s,a,s') U^{\ast}(s'), $$

has nice properties. First,

$$ U^{\ast}(s) = \max_{a \in A_s} Q(s,a) $$

from the definition of $Q$. But also we can write

$$ \pi^{\ast}(s) = \mbox{arg}\max_{a \in A_s} Q(s,a) $$

as neither $\gamma$ nor $R(s)$ affects the $\mbox{arg}\max$ over actions.
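
In code, both quantities fall straight out of a Q-table; a tiny sketch with placeholder numbers:

```python
import numpy as np

# Placeholder Q-table for 4 states and 2 actions; the numbers are arbitrary.
Q = np.array([[0.1, 0.4],
              [0.7, 0.2],
              [0.3, 0.3],
              [0.0, 0.9]])

U_star = Q.max(axis=1)      # U*(s) = max_a Q(s, a)
pi_star = Q.argmax(axis=1)  # pi*(s) = argmax_a Q(s, a)
print(U_star)   # [0.4 0.7 0.3 0.9]
print(pi_star)  # [1 0 0 1]
```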

Incremental learning

If we estimate the value of a random variable $X$ incrementally from i.i.d. samples $x_t$,

$$ V_t = V_{t-1} + \alpha_t (x_t - V_{t-1}), $$

and the learning rates form a sequence where

$$\sum_{t \in \mathbb{N}} \alpha_t = \infty, \mbox{ and } \sum_{t \in \mathbb{N}} \alpha_t^2 < \infty, $$

then $V_t \rightarrow \mathbb{E}[X]$.
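
A quick numerical sketch of this result, assuming the common choice $\alpha_t = 1/t$ (which satisfies both conditions):

```python
import random

random.seed(0)

V = 0.0
for t in range(1, 100_001):
    x_t = random.gauss(2.0, 1.0)    # i.i.d. samples of X with E[X] = 2
    alpha_t = 1.0 / t               # sum alpha_t = infinity, sum alpha_t^2 < infinity
    V += alpha_t * (x_t - V)        # incremental update

print(V)  # converges towards 2.0
```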

Q-learning
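
As a reference point, here is a minimal sketch of the standard tabular Q-learning update; the helper name and the toy states and actions are made up for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy usage: a hypothetical two-state problem with two actions.
Q = defaultdict(float)
actions = ['left', 'right']
q_learning_update(Q, s=0, a='right', r=1.0, s_next=1, actions=actions)
print(Q[(0, 'right')])  # 0.1
```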

Choosing an action

The two main ideas for choosing actions are:

  • choose randomly, and
  • choose the best action.

The trouble with the first is that you never use what you are learning, so it will be slow to explore the part of the space you care about. The trouble with the second is that you might not explore routes that are better than your current guess.

Explore exploit dilemma

You want to strike a balance between the two.

Epsilon-greedy exploration

If you really do explore infinitely often, and $\epsilon_t \rightarrow 0$, then $\hat{Q} \rightarrow Q$ and $\hat{\pi} \rightarrow \pi^{\ast}$.
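
A sketch of $\epsilon$-greedy selection with a decaying $\epsilon_t$; the $1/t$ schedule here is just one common choice, assumed for the example rather than taken from the lectures:

```python
import random

def epsilon_greedy(Q, s, actions, t):
    """Explore with probability eps_t, otherwise act greedily on the Q estimate."""
    eps_t = 1.0 / t   # decays to 0 while every action keeps some chance of being tried
    if random.random() < eps_t:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # exploit

# Toy usage with a fixed Q-table estimate.
Q_hat = {(0, 'left'): 0.2, (0, 'right'): 0.8}
print([epsilon_greedy(Q_hat, 0, ['left', 'right'], t) for t in range(1, 6)])
```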

Optimistic exploration