Title: Q-learning, SARSA, and Radioactive Breadcrumbs
Q-learning, SARSA, and Radioactive Breadcrumbs
Administrivia
- Office hours truncated (9:00-10:15) on Nov 17
- Someone scheduled a meeting for me :-P
- HW3 assigned today
- Due Dec 2
- Large HW, but you have a little extra time on it
The Q-learning algorithm
- Algorithm Q_learn
- Inputs: State space S; Action space A;
  Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
- Outputs: Q
- Repeat:
  - s ← get_current_world_state()
  - a ← pick_next_action(Q, s)
  - (r, s') ← act_in_world(a)
  - Q(s, a) ← Q(s, a) + α(r + γ max_a'(Q(s', a')) - Q(s, a))
- Until (bored)
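For concreteness, here is a minimal tabular sketch of the loop above in Python. The environment interface (env.current_state(), env.act(a) returning (reward, next_state)) and the ε-greedy chooser are assumptions standing in for get_current_world_state(), act_in_world(), and pick_next_action(); they are not part of the slide.

```python
import random
from collections import defaultdict

def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    """Tabular Q-learning sketch; env.current_state() and env.act(a) are
    assumed stand-ins for the slide's world-interaction functions."""
    Q = defaultdict(float)                      # Q[(s, a)], defaults to 0

    def pick_next_action(s):
        # epsilon-greedy exploration (discussed on the later slides)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(steps):                      # "Repeat ... until bored"
        s = env.current_state()
        a = pick_next_action(s)
        r, s_next = env.act(a)
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```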
Why does this work?
- Still... why should that weighted average be the right thing?
- Compare w/ the Bellman eqn...
- I.e., the update is based on a sample from the true transition distribution, T, rather than the full expectation that is used in the Bellman eqn/policy iteration alg
- The first time the agent finds a rewarding state, s_r, an α-sized fraction of that reward will be propagated back by one step via the Q update to s_{r-1}, a state one step away from s_r
- The next time, the state two steps away from s_r will be updated, and so on...
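To make the sample-vs.-expectation point concrete, here are the two expressions side by side (standard forms; the reward is written R(s, a, s') here, which may differ from how R is defined elsewhere in the course):

```latex
% Bellman optimality equation: a full expectation over next states s'
Q^{*}(s,a) \;=\; \sum_{s'} T(s,a,s') \Big[ R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a') \Big]

% Q-learning update: one sampled s' stands in for that sum; the learning
% rate \alpha averages the sampled backups over time
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]
```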
Picking the action
- One critical step is underspecified in the Q_learn alg:
- a ← pick_next_action(Q, s)
- How should you pick an action at each step?
- Could pick greedily according to Q
  - Might tend to keep doing the same thing and not explore at all. Need to force exploration.
- Could pick an action at random
  - Ignores everything you've learned about Q so far
  - Would you still converge?
Off-policy learning
- Exploit a critical property of the Q_learn alg
- Lemma (w/o proof): The Q-learning algorithm will converge to the correct Q independently of the policy being executed, so long as:
  - Every (s, a) pair is visited infinitely often in the infinite limit
  - α is chosen to be small enough (usually decayed)
- I.e., Q-learning doesn't care what policy is being executed -- it will still converge
- Called an off-policy method: the policy being learned can be different from the policy being executed
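The "small enough (usually decayed)" condition is usually made precise with the stochastic-approximation (Robbins-Monro) requirements on the per-pair learning rates, which the slide does not spell out:

```latex
\sum_{t} \alpha_t(s,a) = \infty
\qquad \text{and} \qquad
\sum_{t} \alpha_t^{2}(s,a) < \infty
\qquad \text{for every pair } (s,a)
```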
Almost greedy exploring
- The off-policy property tells us we're free to pick any policy we like to explore, so long as we guarantee infinite visits to each (s, a) pair
- Might as well choose one that does (mostly) as well as we know how to do at each step
- Can't be just greedy w.r.t. Q (why?)
- Typical answers:
  - ε-greedy: execute argmax_a Q(s, a) w/ prob (1 - ε) and a random action w/ prob ε
  - Boltzmann exploration: pick action a w/ prob proportional to exp(Q(s, a)/τ), where τ is a temperature parameter
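As a sketch, both rules fit in a few lines of Python; the Q table is assumed to be a dict keyed by (state, action), and the temperature tau for Boltzmann exploration is a parameter not specified on the slide.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With prob (1 - epsilon) take argmax_a Q(s,a); otherwise a random action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau=1.0):
    """Pick action a with probability proportional to exp(Q(s,a) / tau)."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```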
The value of experience
- We observed that Q-learning converges slooooooowly...
- The same is true of many other RL algs
- But we can do better (sometimes by orders of magnitude)
- What're the biggest hurdles to Q convergence?
- Well, there are many
- Big one, though, is poor use of experience
- Each timestep only changes one Q(s,a) value
- Takes many steps to back up experience very far
That eligible state
- Basic problem: every step, Q only does a one-step backup
  - Forgets where it was before that
  - No sense of the sequence of states/actions that got it where it is now
- Want a long-term memory of where the agent has been; update the Q values for all of them
- The idea is called eligibility traces
  - Have a memory cell for each state/action pair
  - Set the memory when you visit that state/action
  - Each step, update all eligible states
Retrenching from Q
- Can integrate eligibility traces w/ Q-learning
- But it's a bit of a pain
  - Need to track when the agent is on policy or off policy, etc.
  - Good discussion in Sutton & Barto
- We'll focus on a (slightly) simpler learning alg
- SARSA learning
  - V. similar to Q-learning
  - Strictly on policy: only learns about the policy it's actually executing
  - E.g., learns Q^π for the exploration policy π it actually follows, instead of Q* (definitions below)
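For reference, the two objects being contrasted, in standard notation (not reproduced on the slide):

```latex
Q^{\pi}(s,a) \;=\; \mathbb{E}\Big[\, \textstyle\sum_{t=0}^{\infty} \gamma^{t} r_t \;\Big|\; s_0 = s,\ a_0 = a,\ \text{later actions drawn from } \pi \Big]
\qquad
Q^{*}(s,a) \;=\; \max_{\pi} Q^{\pi}(s,a)
```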
The Q-learning algorithm
- Algorithm Q_learn
- Inputs: State space S; Action space A;
  Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
- Outputs: Q
- Repeat:
  - s ← get_current_world_state()
  - a ← pick_next_action(Q, s)
  - (r, s') ← act_in_world(a)
  - Q(s, a) ← Q(s, a) + α(r + γ max_a'(Q(s', a')) - Q(s, a))
- Until (bored)
SARSA-learning algorithm
- Algorithm SARSA_learn
- Inputs: State space S; Action space A;
  Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
- Outputs: Q
- s ← get_current_world_state()
- a ← pick_next_action(Q, s)
- Repeat:
  - (r, s') ← act_in_world(a)
  - a' ← pick_next_action(Q, s')
  - Q(s, a) ← Q(s, a) + α(r + γ Q(s', a') - Q(s, a))
  - a ← a'; s ← s'
- Until (bored)
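A matching sketch of SARSA, under the same assumed environment interface as the Q-learning sketch above; note that the backed-up action a' is the one the exploration policy actually chooses, not a greedy max.

```python
import random
from collections import defaultdict

def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    """Tabular SARSA sketch: backs up the value of the action actually taken next."""
    Q = defaultdict(float)

    def pick_next_action(s):
        # same epsilon-greedy chooser as before
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.current_state()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.act(a)
        a_next = pick_next_action(s_next)
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next                   # "a ← a'; s ← s'"
    return Q
```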
SARSA vs. Q
- SARSA and Q-learning are very similar
- SARSA updates Q(s, a) for the policy it's actually executing
  - Lets the pick_next_action() function pick the action to update
- Q-learning updates Q(s, a) for the greedy policy w.r.t. the current Q
  - Uses max_a' to pick the action to update
  - Might be different from the action it executes at s'
- In practice: Q-learning will learn the true π*, but SARSA will learn about what it's actually doing
- Exploration can get Q-learning in trouble...
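Writing the two update rules side by side makes the single difference explicit (same notation as the algorithm slides):

```latex
\text{Q-learning:} \quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]
\\[4pt]
\text{SARSA:} \quad\;\; Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma\, Q(s',a') - Q(s,a) \,\big], \quad a' \text{ chosen by the exploration policy}
```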
Getting Q in trouble...
- Cliff-walking example (Sutton & Barto, Sec. 6.5)
- [Figure omitted: with ε-greedy exploration, Q-learning learns the optimal path along the cliff edge but occasionally falls off while exploring, so its online reward is worse than SARSA's, which learns the longer but safer path]
Radioactive breadcrumbs
- Can now define eligibility traces for SARSA
- In addition to the Q(s, a) table, keep an e(s, a) table
  - Records the eligibility (a real number) for each state/action pair
- At every step ((s, a, r, s', a') tuple):
  - Increment e(s, a) for the current (s, a) pair by 1
  - Update all Q(s, a) vals in proportion to their e(s, a)
  - Decay all e(s, a) by a factor of γλ
- Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL
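The bookkeeping on this slide is the standard accumulating-trace update; in symbols, with δ_t the usual TD error r + γQ(s', a') − Q(s, a):

```latex
e_t(s,a) \;=\;
\begin{cases}
\gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } (s,a) = (s_t, a_t) \\
\gamma\lambda\, e_{t-1}(s,a) & \text{otherwise}
\end{cases}
\qquad
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\, \delta_t\, e_t(s,a)
```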
SARSA(λ)-learning alg.
- Algorithm SARSA(λ)_learn
- Inputs: S, A, γ (0 < γ < 1), α (0 < α < 1), λ (0 < λ < 1)
- Outputs: Q
- e(s, a) ← 0 // for all s, a
- s ← get_curr_world_st(); a ← pick_nxt_act(Q, s)
- Repeat:
  - (r, s') ← act_in_world(a)
  - a' ← pick_next_action(Q, s')
  - δ ← r + γ Q(s', a') - Q(s, a)
  - e(s, a) ← e(s, a) + 1
  - foreach (s, a) pair in (S × A):
    - Q(s, a) ← Q(s, a) + α e(s, a) δ
    - e(s, a) ← γλ e(s, a)
  - a ← a'; s ← s'
- Until (bored)
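A sketch of SARSA(λ) with accumulating traces, again under the assumed env interface used in the earlier sketches; the dictionary e plays the role of the e(s, a) table, and only pairs that have actually been visited are stored.

```python
import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.8,
                       epsilon=0.1, steps=10000):
    """Tabular SARSA(lambda) sketch with accumulating eligibility traces."""
    Q = defaultdict(float)
    e = defaultdict(float)                      # eligibility trace e(s, a)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.current_state()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.act(a)
        a_next = pick_next_action(s_next)
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                        # drop a breadcrumb on (s, a)
        for sa in list(e.keys()):               # update every eligible pair
            Q[sa] += alpha * e[sa] * delta
            e[sa] *= gamma * lam                # let old breadcrumbs decay
        s, a = s_next, a_next
    return Q
```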