Q-learning, SARSA, and Radioactive Breadcrumbs

1
Q-learning, SARSA, and Radioactive Breadcrumbs
  • Sutton & Barto, Ch. 6 and 7

2
Administrivia
  • Office hours truncated (9:00-10:15) on Nov 17
  • Someone scheduled a meeting for me :-P
  • HW3 assigned today
  • Due Dec 2
  • Large HW, but you have a little extra time on it

3
The Q-learning algorithm
  • Algorithm: Q_learn
  • Inputs: State space S; Action space A;
    Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
  • Outputs: Q
  • Repeat:
  • s ← get_current_world_state()
  • a ← pick_next_action(Q,s)
  • (r,s') ← act_in_world(a)
  • Q(s,a) ← Q(s,a) + α·(r + γ·max_a'(Q(s',a')) - Q(s,a))
  • Until (bored)
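
Below is a minimal Python sketch of the loop above. The env object with reset() and step(a) returning (next_state, reward, done), and the ε-greedy action picker, are illustrative assumptions standing in for get_current_world_state / pick_next_action / act_in_world; they are not part of the slides.

    import random
    from collections import defaultdict

    def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=500):
        Q = defaultdict(float)                   # Q(s,a) table, default 0.0
        for _ in range(episodes):
            s = env.reset()                      # s <- get_current_world_state()
            done = False
            while not done:
                # epsilon-greedy stand-in for pick_next_action(Q, s)
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)        # (r, s') <- act_in_world(a)
                best_next = max(Q[(s2, act)] for act in actions)
                # Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s2
        return Q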

4
(No Transcript)
5
Why does this work?
  • Still... why should that weighted average be the
    right thing?
  • Compare w/ the Bellman eqn... (see the
    side-by-side below)
  • I.e., the update is based on a sample from the true
    transition distribution, T, rather than the full
    expectation that is used in the Bellman eqn /
    policy iteration alg
  • The first time the agent finds a rewarding state,
    s_r, a fraction (α) of that reward will be
    propagated back by one step via the Q update to
    s_(r-1), a state one step away from s_r
  • The next time, the state two steps away from s_r
    will be updated, and so on...
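
For comparison, here are the two updates side by side (standard forms; T(s,a,s') is the transition distribution named above, and writing the reward as R(s,a,s') is an assumed convention):

    % Bellman optimality equation: a full expectation over T
    Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

    % Q-learning: move Q(s,a) a fraction alpha toward a single sampled (r, s')
    Q(s,a) \leftarrow (1 - \alpha)\, Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]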

6
Picking the action
  • One critical step is underspecified in the Q-learn
    alg:
  • a ← pick_next_action(Q,s)
  • How should you pick an action at each step?
  • Could pick greedily according to Q
  • Might tend to keep doing the same thing and never
    explore at all. Need to force exploration.
  • Could pick an action at random
  • Ignores everything you've learned about Q so far
  • Would you still converge?

7
Off-policy learning
  • Exploit a critical property of the Q-learning alg
  • Lemma (w/o proof): The Q-learning algorithm will
    converge to the correct Q independently of the
    policy being executed, so long as:
  • Every (s,a) pair is visited infinitely often in
    the infinite limit
  • α is chosen to be small enough (usually decayed;
    the precise conditions are written out below)
  • I.e., Q-learning doesn't care what policy is
    being executed -- it will still converge
  • Called an off-policy method: the policy being
    learned can be different from the policy being
    executed
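
The "small enough (usually decayed)" requirement on α is usually made precise with the standard stochastic-approximation conditions on the per-(s,a) step sizes α_t; this exact form comes from the general convergence results rather than from the slide itself:

    \sum_{t=1}^{\infty} \alpha_t = \infty
    \qquad \text{and} \qquad
    \sum_{t=1}^{\infty} \alpha_t^2 < \infty
    % e.g., alpha_t = 1/t satisfies both conditions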

8
Almost greedy exploring
  • The off-policy property tells us we're free to
    pick any policy we like to explore, so long as we
    guarantee infinite visits to each (s,a) pair
  • Might as well choose one that does (mostly) as
    well as we know how to do at each step
  • Can't be just greedy w.r.t. Q (why?)
  • Typical answers (both sketched in code below):
  • ε-greedy: execute argmax_a Q(s,a) w/ prob (1-ε)
    and a random action w/ prob ε
  • Boltzmann exploration: pick action a w/ prob
    P(a|s) = exp(Q(s,a)/τ) / Σ_a' exp(Q(s,a')/τ),
    where τ > 0 is a temperature parameter
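
A short Python sketch of the two exploration rules, assuming Q is a table keyed by (state, action); the helper names and the temperature τ (tau) are illustrative choices:

    import math
    import random

    def epsilon_greedy(Q, s, actions, eps=0.1):
        # With prob (1 - eps) act greedily w.r.t. Q; otherwise pick at random
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def boltzmann(Q, s, actions, tau=1.0):
        # Pick action a with prob exp(Q(s,a)/tau) / sum_a' exp(Q(s,a')/tau)
        weights = [math.exp(Q[(s, a)] / tau) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]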

9
The value of experience
  • We observed that Q-learning converges
    slooooooowly...
  • The same is true of many other RL algs
  • But we can do better (sometimes by orders of
    magnitude)
  • What're the biggest hurdles to Q convergence?

10
The value of experience
  • We observed that Q-learning converges
    slooooooowly...
  • The same is true of many other RL algs
  • But we can do better (sometimes by orders of
    magnitude)
  • What're the biggest hurdles to Q convergence?
  • Well, there are many
  • A big one, though, is poor use of experience
  • Each timestep only changes one Q(s,a) value
  • It takes many steps to back up experience very far

11
That eligible state
  • Basic problem: every step, Q only does a one-step
    backup
  • It forgets where it was before that
  • No sense of the sequence of states/actions that
    got it where it is now
  • Want a long-term memory of where the agent has
    been; update the Q values for all of them
  • The idea is called eligibility traces (one common
    form is written out below)
  • Have a memory cell for each state/action pair
  • Set the memory when you visit that state/action
  • Each step, update all eligible states
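
One common way to write that bookkeeping is the accumulating-trace form below, which matches the "increment by 1, then decay" rule on the later slides (the time index t is notation added here, not on the slide):

    e_t(s,a) =
    \begin{cases}
      \gamma \lambda \, e_{t-1}(s,a) + 1 & \text{if } (s,a) \text{ was visited at step } t \\
      \gamma \lambda \, e_{t-1}(s,a)     & \text{otherwise}
    \end{cases}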

12
Retrenching from Q
  • Can integrate eligibility traces w/ Q-learning
  • But it's a bit of a pain
  • Need to track when the agent is on-policy or
    off-policy, etc.
  • Good discussion in Sutton & Barto
  • We'll focus on a (slightly) simpler learning alg
  • SARSA learning
  • V. similar to Q-learning
  • Strictly on-policy: only learns about the policy
    it's actually executing
  • E.g., learns Q^π (the value of the policy it
    executes) instead of Q* (the optimal value
    function); both are written out below
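
For reference, the two learning targets being contrasted, written out in the standard form (the slide's own symbols were lost in extraction):

    % On-policy target (what SARSA learns): value of the executed policy pi
    Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a \right]

    % Off-policy target (what Q-learning learns): value of the optimal policy
    Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)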

13
The Q-learning algorithm
  • Algorithm: Q_learn
  • Inputs: State space S; Action space A;
    Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
  • Outputs: Q
  • Repeat:
  • s ← get_current_world_state()
  • a ← pick_next_action(Q,s)
  • (r,s') ← act_in_world(a)
  • Q(s,a) ← Q(s,a) + α·(r + γ·max_a'(Q(s',a')) - Q(s,a))
  • Until (bored)

14
SARSA-learning algorithm
  • Algorithm: SARSA_learn
  • Inputs: State space S; Action space A;
    Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
  • Outputs: Q
  • s ← get_current_world_state()
  • a ← pick_next_action(Q,s)
  • Repeat:
  • (r,s') ← act_in_world(a)
  • a' ← pick_next_action(Q,s')
  • Q(s,a) ← Q(s,a) + α·(r + γ·Q(s',a') - Q(s,a))
  • a ← a'; s ← s'
  • Until (bored)
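
A minimal Python sketch of SARSA_learn, under the same assumed env.reset()/env.step() interface and ε-greedy picker as the Q-learning sketch above; the only substantive difference is that the backup uses the action a' actually chosen for the next step rather than max_a' Q(s',a'):

    import random
    from collections import defaultdict

    def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=500):
        Q = defaultdict(float)                   # Q(s,a) table, default 0.0

        def pick(s):                             # epsilon-greedy pick_next_action
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s = env.reset()                      # s <- get_current_world_state()
            a = pick(s)                          # a <- pick_next_action(Q, s)
            done = False
            while not done:
                s2, r, done = env.step(a)        # (r, s') <- act_in_world(a)
                a2 = pick(s2)                    # a' <- pick_next_action(Q, s')
                # Q(s,a) <- Q(s,a) + alpha*(r + gamma*Q(s',a') - Q(s,a))
                Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
                s, a = s2, a2                    # s <- s'; a <- a'
        return Q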

15
SARSA vs. Q
  • SARSA and Q-learning are very similar
  • SARSA updates Q(s,a) for the policy it's actually
    executing
  • Lets the pick_next_action() function pick the
    action used in the update
  • Q-learning updates Q(s,a) for the greedy policy
    w.r.t. the current Q
  • Uses max_a' to pick the action used in the update
  • Might be different from the action it actually
    executes from s'
  • In practice: Q-learning will learn the true
    optimal policy π*, but SARSA will learn about
    what it's actually doing
  • Exploration can get Q-learning in trouble...

16
Getting Q in trouble...
Cliff walking example (Sutton & Barto, Sec 6.5)
17
Getting Q in trouble...
Cliff walking example (Sutton & Barto, Sec 6.5)
18
Radioactive breadcrumbs
  • Can now define eligibility traces for SARSA
  • In addition to the Q(s,a) table, keep an e(s,a)
    table
  • Records the eligibility (a real number) for each
    state/action pair
  • At every step ((s,a,r,s',a') tuple):
  • Increment e(s,a) for the current (s,a) pair by 1
  • Update all Q(s,a) vals in proportion to their
    e(s,a)
  • Decay all e(s,a) by a factor of γλ
  • Leslie Kaelbling calls this the "radioactive
    breadcrumbs" form of RL

19
SARSA(λ)-learning alg.
  • Algorithm: SARSA(λ)_learn
  • Inputs: S, A, γ (0 < γ < 1), α (0 < α < 1),
    λ (0 < λ < 1)
  • Outputs: Q
  • e(s,a) ← 0 // for all (s,a)
  • s ← get_curr_world_st(); a ← pick_nxt_act(Q,s)
  • Repeat:
  • (r,s') ← act_in_world(a)
  • a' ← pick_next_action(Q,s')
  • δ ← r + γ·Q(s',a') - Q(s,a)
  • e(s,a) ← e(s,a) + 1
  • foreach (s,a) pair in (S × A):
  • Q(s,a) ← Q(s,a) + α·e(s,a)·δ
  • e(s,a) ← γλ·e(s,a)
  • a ← a'; s ← s'
  • Until (bored)
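
A Python sketch of SARSA(λ) with accumulating traces, under the same assumed env interface as the earlier sketches. For efficiency it loops only over (s,a) pairs whose trace is nonzero (the full S × A loop in the pseudocode would leave the rest unchanged), and it resets the traces at the start of each episode, which is one common choice:

    import random
    from collections import defaultdict

    def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.8,
                           epsilon=0.1, episodes=500):
        Q = defaultdict(float)                   # Q(s,a) table, default 0.0

        def pick(s):                             # epsilon-greedy pick_next_action
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            e = defaultdict(float)               # e(s,a) <- 0 for all (s,a)
            s = env.reset()
            a = pick(s)
            done = False
            while not done:
                s2, r, done = env.step(a)        # (r, s') <- act_in_world(a)
                a2 = pick(s2)                    # a' <- pick_next_action(Q, s')
                delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]   # TD error
                e[(s, a)] += 1.0                 # increment trace for current pair
                for key in list(e.keys()):       # update every eligible (s,a)
                    Q[key] += alpha * e[key] * delta
                    e[key] *= gamma * lam        # decay traces by gamma*lambda
                s, a = s2, a2
        return Q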