Reinforcement Learning (II.) - Transcript and Presenter's Notes
1
Reinforcement Learning (II.)
  • Ata Kaban
  • A.Kaban_at_cs.bham.ac.uk
  • School of Computer Science
  • University of Birmingham

2
Recall
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what
  • Reinforcement Learning: learning what to do from interactions with the environment

3
Recall
  • Markov Decision Process
  • r_t and s_{t+1} depend only on the current state s_t and action a_t (the Markov property, written out below)
  • Goal: get as much eventual reward as possible, no matter from which state you start off.
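The slide's own formulas are not reproduced in the transcript; standard statements of the Markov property and of the "eventual reward" objective, using a discount factor γ with 0 ≤ γ < 1 (the same γ = 0.9 used in the examples later), would read roughly:

    P(s_{t+1}, r_t \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1}, r_t \mid s_t, a_t)

    \text{maximise } E\big[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \big] = E\Big[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\Big] \text{ from every starting state } s_t.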

4
Today's lecture
  • We recall the formalism for the deterministic case
  • We reformulate the formalism for the non-deterministic case
  • We learn about
  • Bellman equations
  • Optimal policy
  • Policy iteration
  • Q-learning in nondeterministic environment

5
What to learn
  • In the non-deterministic case: learn an action policy
  • so that it maximises the expected eventual reward from each state
  • Learn this from interaction examples, i.e. data of the form ((s,a),r)
  • In the deterministic case (recall): learn an action policy
  • so that it maximises the eventual reward from each state
  • Learn this from interaction examples, i.e. data of the form ((s,a),r)

6
  • Notations used
  • We assume we are at a time t, so s_t, a_t and r_t denote the current state, action and reward
  • Recall / summarize other notations as well
  • Policy: π
  • Remember: in the deterministic case, π(s) is an action
  • In the non-deterministic case, π(s) is a random variable, i.e. we can only talk about π(a|s), which is the probability of doing action a in state s
  • State values under a policy: V^π(s)
  • Values of state-action pairs (i.e. Q-values): Q(s,a)
  • State transitions: the next state depends on the current state and current action.
  • The state that deterministically follows s if action a is taken: δ(s,a)
  • Now the state transitions may also be probabilistic; then the probability that s' follows s if action a is taken is p(s'|s,a)

7
State value function
  • Non-deterministic case: how much reward can I expect to accumulate from state s if I follow policy π?
  • This is called the Bellman equation (written out below)
  • It is a linear system which has a unique solution!
  • Deterministic case: how much reward can I accumulate from state s if I follow policy π?
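The equation itself is missing from the transcript; a reconstruction in the notation of slide 6, writing r(s,a,s') for the immediate reward (a notational assumption here), is:

    \text{Deterministic case: } \quad V^{\pi}(s) = r(s, \pi(s)) + \gamma\, V^{\pi}\big(\delta(s, \pi(s))\big)

    \text{Non-deterministic case: } \quad V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\big[\, r(s,a,s') + \gamma\, V^{\pi}(s') \,\big]

For a fixed policy this is one linear equation per state in the unknowns V^π(s), which is why the system has a unique solution.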

8
Example
  • Compute V^π for the given policy π
  • Compute V^πa for the random policy πa

For the random policy πa (one in-place sweep, starting from zero values):
V^πa(s6) = 0.5·(100 + 0.9·0) + 0.5·(0 + 0.9·0) = 50
V^πa(s5) = 0.66·(0 + 0.9·50) + 0.33·(0 + 0.9·0) ≈ 30
V^πa(s4) = 0.5·(0 + 0.9·30) + 0.5·(0 + 0.9·0) = 13.5
Etc. Once computed for all states, start again and keep iterating until the values converge.

For the given deterministic policy π:
V^π(s6) = 100 + 0.9·0 = 100
V^π(s5) = 0 + 0.9·100 = 90
V^π(s4) = 0 + 0.9·90 = 81

By the way, where has p(s'|s,a) disappeared?
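As a concrete illustration of these sweeps, here is a minimal Python sketch for the deterministic policy only. The chain s4 → s5 → s6 → goal with reward 100 on the final step is read off from the computation above; the variable names and the number of sweeps are assumptions made only to have something runnable.

    gamma = 0.9
    next_state = {"s4": "s5", "s5": "s6", "s6": "goal"}   # where the policy takes us from each state
    reward = {"s4": 0.0, "s5": 0.0, "s6": 100.0}          # immediate reward of the policy's action
    V = {"s4": 0.0, "s5": 0.0, "s6": 0.0, "goal": 0.0}    # start from zero values

    for _ in range(50):                                   # sweep repeatedly until the values settle
        for s in ("s4", "s5", "s6"):
            V[s] = reward[s] + gamma * V[next_state[s]]   # Bellman equation as an update

    print(V)                                              # converges to s4: 81, s5: 90, s6: 100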
9
What is good about it?
  • For any policy, the Bellman equation has a unique solution, so it has a unique solution for an optimal policy as well. Denote this by V*.
  • The optimal policy is what we want to learn. Denote it by π*.
  • If we could learn V*, then with one look-ahead we could compute π*.
  • How to learn V*? It depends on π*, which is unknown as well.
  • Iterate and improve on both in each iteration until converging to V* and an associated π*. This is called (generalised) policy iteration.

10
Generalised Policy Iteration
Geometric illustration: policy evaluation and policy improvement alternate until they converge to V* and π*.
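A minimal sketch of generalised policy iteration in Python. The two-state MDP below, the table format for p and r, the sweep counts and the helper name q are all invented here purely to make the sketch runnable; they are not part of the lecture.

    gamma = 0.9
    states = ["s1", "s2"]
    actions = ["a", "b"]
    # p[s][a] lists (next_state, probability); r[s][a] is the expected immediate reward
    p = {"s1": {"a": [("s1", 0.5), ("s2", 0.5)], "b": [("s2", 1.0)]},
         "s2": {"a": [("s2", 1.0)],              "b": [("s1", 1.0)]}}
    r = {"s1": {"a": 0.0, "b": 1.0}, "s2": {"a": 5.0, "b": 0.0}}

    V = {s: 0.0 for s in states}
    pi = {s: "a" for s in states}                # start from an arbitrary policy

    def q(s, a):
        # one-step look-ahead: expected reward plus discounted value of the successors
        return r[s][a] + gamma * sum(prob * V[s2] for s2, prob in p[s][a])

    for _ in range(20):
        for _ in range(30):                      # policy evaluation: sweeps of the Bellman equation
            for s in states:
                V[s] = q(s, pi[s])
        # policy improvement: act greedily with respect to the current V
        pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}

    print(pi, V)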
11
Before going on: what exactly does "optimal policy" mean?
  • π is said to be better than another policy π' if V^π(s) ≥ V^π'(s) for every state s
  • In an MDP there always exists at least one policy which is at least as good as all others. This is called an optimal policy.
  • Any policy which is greedy with respect to V* is an optimal policy (a formal statement is given below).
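The slide's formulas are not reproduced in the transcript; written out in the earlier notation, the two statements above read roughly:

    \pi \ge \pi' \quad \Longleftrightarrow \quad V^{\pi}(s) \ge V^{\pi'}(s) \ \text{ for all states } s

    \pi^*(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a)\,\big[\, r(s,a,s') + \gamma\, V^*(s') \,\big]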

12
What is bad about it?
  • We need to know the model of the environment for
    doing policy iteration.
  • i.e. we need to know the state transitions (what
    follows what)
  • We need to be able to look one step ahead
  • i.e. to try out all actions in order to choose
    the best one
  • In some applications this is not feasible
  • In some others it is
  • Can you think of any examples?
  • a fierce battle?
  • an Internet crawler?

13
Looking ahead
Backup diagram
14
The other route: Action value function
  • Non-deterministic case: how much eventual reward can I expect to get if I take action a in state s?
  • This is also a Bellman equation (written out below)
  • It has a unique solution!
  • Deterministic case: how much eventual reward can I get if I take action a in state s?
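The equation is again missing from the transcript; a reconstruction consistent with the earlier notation, writing Q for the optimal action values, is:

    \text{Deterministic case: } \quad Q(s,a) = r(s,a) + \gamma \max_{a'} Q\big(\delta(s,a), a'\big)

    \text{Non-deterministic case: } \quad Q(s,a) = \sum_{s'} p(s' \mid s,a)\,\big[\, r(s,a,s') + \gamma \max_{a'} Q(s', a') \,\big]

Note that V*(s) = max_a Q(s,a), which is why no extra look-ahead is needed once Q is known (next slide).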

15
What is good about it?
  • If we know Q, look-ahead is not needed for
    following an optimal policy!
  • i.e. if we know all action values then just do
    the best action from each state.
  • Different implementations exist in the literature
    to improve efficiency. We will stick with turning
    the Bellman equation for action-value functions
    into an iterative update.
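A minimal sketch of such an iterative update in Python, in its sample-based (Q-learning) form for a non-deterministic environment. The toy noisy chain, the learning rate alpha and the purely random exploration policy are all assumptions made only so the sketch runs.

    import random
    from collections import defaultdict

    gamma, alpha = 0.9, 0.1
    actions = ["left", "right"]
    Q = defaultdict(float)                       # Q[(state, action)], initialised to 0

    def step(s, a):
        # assumed noisy chain 0..3: the intended move succeeds with probability 0.8,
        # otherwise the agent stays put; entering state 3 pays reward 100
        if random.random() < 0.8:
            s = min(s + 1, 3) if a == "right" else max(s - 1, 0)
        return s, (100.0 if s == 3 else 0.0)

    for episode in range(2000):
        s = 0
        while s != 3:
            a = random.choice(actions)           # explore: any policy that keeps trying all actions will do
            s_next, r = step(s, a)
            # the iterative update: nudge Q(s,a) towards the sampled one-step target
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next

    print({k: round(v, 1) for k, v in sorted(Q.items())})

In a non-deterministic environment the learning rate does the averaging over p(s'|s,a) implicitly, one observed transition at a time, so the transition probabilities never need to be written down.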

16
Simple example of updating Q
(Figure: a grid world with states s1 to s6; the immediate rewards are given on the arrows. Simple, i.e. observe this is a deterministic world. A second diagram gives the Q values from a previous iteration on the arrows. Recall the update rule for Q from the previous slide.)
18
  • First iteration
  • Q(L1,A) = 0 + 0.9·(0.7·50 + 0.3·0)
  • Q(L1,S) = 0 + 0.9·(0.5·(-50) + 0.5·0)
  • Q(L1,M) = ?
  • What is your optimal action plan?
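For reference, if the two reconstructed expressions above are read literally they evaluate to Q(L1,A) = 0.9·35 = 31.5 and Q(L1,S) = 0.9·(-25) = -22.5; Q(L1,M) is left for you to work out before choosing the plan.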

19
Key points
  • Learning by reinforcement
  • Markov Decision Processes
  • Value functions
  • Optimal policy
  • Bellman equations
  • Methods and implementations for computing value
    functions
  • Policy iteration
  • Q-learning