Title: Q-learning, SARSA, and Radioactive Breadcrumbs
Q-learning, SARSA, and Radioactive Breadcrumbs
Administrivia
- Office hours truncated (9:00-10:15) on Nov 17
- Someone scheduled a meeting for me :-P
- HW3 assigned today
- Due Dec 2
- Large HW, but you have a little extra time on it
The Q-learning algorithm
- Algorithm Q_learn
- Inputs: State space S; Action space A;
  Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
- Outputs: Q
- Repeat:
  - s ← get_current_world_state()
  - a ← pick_next_action(Q, s)
  - (r, s') ← act_in_world(a)
  - Q(s, a) ← Q(s, a) + α(r + γ max_a'(Q(s', a')) - Q(s, a))
- Until (bored)
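For concreteness, here is a minimal tabular sketch of the loop above in Python. The environment interface (env.current_state(), env.act(a) returning (reward, next_state)) and the ε-greedy chooser are assumptions standing in for get_current_world_state(), act_in_world(), and pick_next_action(); they are not part of the slide.

```python
import random
from collections import defaultdict

def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    """Tabular Q-learning sketch; env.current_state() and env.act(a) are
    assumed stand-ins for the slide's world-interaction functions."""
    Q = defaultdict(float)                      # Q[(s, a)], defaults to 0

    def pick_next_action(s):
        # epsilon-greedy exploration (discussed on the later slides)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(steps):                      # "Repeat ... until bored"
        s = env.current_state()
        a = pick_next_action(s)
        r, s_next = env.act(a)
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```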
Why does this work?
- Still... why should that weighted average be the right thing?
- Compare w/ the Bellman eqn...
- I.e., the update is based on a sample from the true transition distribution, T, rather than the full expectation that is used in the Bellman eqn/policy iteration alg
- The first time the agent finds a rewarding state, s_r, an α-sized fraction of that reward will be propagated back by one step via the Q update to s_{r-1}, a state one step away from s_r
- The next time, the state two steps away from s_r will be updated, and so on...
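To make the sample-vs.-expectation point concrete, here are the two expressions side by side (standard forms; the reward is written R(s, a, s') here, which may differ from how R is defined elsewhere in the course):

```latex
% Bellman optimality equation: a full expectation over next states s'
Q^{*}(s,a) \;=\; \sum_{s'} T(s,a,s') \Big[ R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a') \Big]

% Q-learning update: one sampled s' stands in for that sum; the learning
% rate \alpha averages the sampled backups over time
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]
```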
Picking the action
- One critical step is underspecified in the Q_learn alg:
- a ← pick_next_action(Q, s)
- How should you pick an action at each step?
- Could pick greedily according to Q
  - Might tend to keep doing the same thing and not explore at all. Need to force exploration.
- Could pick an action at random
  - Ignores everything you've learned about Q so far
  - Would you still converge?
Off-policy learning
- Exploit a critical property of the Q_learn alg
- Lemma (w/o proof): The Q-learning algorithm will converge to the correct Q independently of the policy being executed, so long as:
  - Every (s, a) pair is visited infinitely often in the infinite limit
  - α is chosen to be small enough (usually decayed)
- I.e., Q-learning doesn't care what policy is being executed -- it will still converge
- Called an off-policy method: the policy being learned can be different from the policy being executed
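The "small enough (usually decayed)" condition is usually made precise with the stochastic-approximation (Robbins-Monro) requirements on the per-pair learning rates, which the slide does not spell out:

```latex
\sum_{t} \alpha_t(s,a) = \infty
\qquad \text{and} \qquad
\sum_{t} \alpha_t^{2}(s,a) < \infty
\qquad \text{for every pair } (s,a)
```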
Almost greedy exploring
- The off-policy property tells us we're free to pick any policy we like to explore, so long as we guarantee infinite visits to each (s, a) pair
- Might as well choose one that does (mostly) as well as we know how to do at each step
- Can't be just greedy w.r.t. Q (why?)
- Typical answers:
  - ε-greedy: execute argmax_a Q(s, a) w/ prob (1 - ε) and a random action w/ prob ε
  - Boltzmann exploration: pick action a w/ prob proportional to exp(Q(s, a)/τ), where τ is a temperature parameter
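As a sketch, both rules fit in a few lines of Python; the Q table is assumed to be a dict keyed by (state, action), and the temperature tau for Boltzmann exploration is a parameter not specified on the slide.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With prob (1 - epsilon) take argmax_a Q(s,a); otherwise a random action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau=1.0):
    """Pick action a with probability proportional to exp(Q(s,a) / tau)."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```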
The value of experience
- We observed that Q-learning converges slooooooowly...
- The same is true of many other RL algs
- But we can do better (sometimes by orders of magnitude)
- What're the biggest hurdles to Q convergence?
- Well, there are many
- Big one, though, is poor use of experience
- Each timestep only changes one Q(s,a) value
- Takes many steps to back up experience very far
That eligible state
- Basic problem: every step, Q only does a one-step backup
  - Forgets where it was before that
  - No sense of the sequence of states/actions that got it where it is now
- Want a long-term memory of where the agent has been; update the Q values for all of them
- The idea is called eligibility traces
  - Have a memory cell for each state/action pair
  - Set the memory when you visit that state/action
  - Each step, update all eligible states
Retrenching from Q
- Can integrate eligibility traces w/ Q-learning
- But it's a bit of a pain
  - Need to track when the agent is on policy or off policy, etc.
  - Good discussion in Sutton & Barto
- We'll focus on a (slightly) simpler learning alg
- SARSA learning
  - V. similar to Q-learning
  - Strictly on policy: only learns about the policy it's actually executing
  - E.g., learns Q^π for the exploration policy π it actually follows, instead of Q* (definitions below)
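For reference, the two objects being contrasted, in standard notation (not reproduced on the slide):

```latex
Q^{\pi}(s,a) \;=\; \mathbb{E}\Big[\, \textstyle\sum_{t=0}^{\infty} \gamma^{t} r_t \;\Big|\; s_0 = s,\ a_0 = a,\ \text{later actions drawn from } \pi \Big]
\qquad
Q^{*}(s,a) \;=\; \max_{\pi} Q^{\pi}(s,a)
```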
The Q-learning algorithm
- Algorithm Q_learn
- Inputs: State space S; Action space A;
  Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
- Outputs: Q
- Repeat:
  - s ← get_current_world_state()
  - a ← pick_next_action(Q, s)
  - (r, s') ← act_in_world(a)
  - Q(s, a) ← Q(s, a) + α(r + γ max_a'(Q(s', a')) - Q(s, a))
- Until (bored)
SARSA-learning algorithm
- Algorithm SARSA_learn
- Inputs: State space S; Action space A;
  Discount γ (0 < γ < 1); Learning rate α (0 < α < 1)
- Outputs: Q
- s ← get_current_world_state()
- a ← pick_next_action(Q, s)
- Repeat:
  - (r, s') ← act_in_world(a)
  - a' ← pick_next_action(Q, s')
  - Q(s, a) ← Q(s, a) + α(r + γ Q(s', a') - Q(s, a))
  - a ← a'; s ← s'
- Until (bored)
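A matching sketch of SARSA, under the same assumed environment interface as the Q-learning sketch above; note that the backed-up action a' is the one the exploration policy actually chooses, not a greedy max.

```python
import random
from collections import defaultdict

def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    """Tabular SARSA sketch: backs up the value of the action actually taken next."""
    Q = defaultdict(float)

    def pick_next_action(s):
        # same epsilon-greedy chooser as before
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.current_state()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.act(a)
        a_next = pick_next_action(s_next)
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next                   # "a ← a'; s ← s'"
    return Q
```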
SARSA vs. Q
- SARSA and Q-learning are very similar
- SARSA updates Q(s, a) for the policy it's actually executing
  - Lets the pick_next_action() function pick the action to update
- Q-learning updates Q(s, a) for the greedy policy w.r.t. the current Q
  - Uses max_a' to pick the action to update
  - Might be different from the action it executes at s'
- In practice: Q-learning will learn the true π*, but SARSA will learn about what it's actually doing
- Exploration can get Q-learning in trouble...
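Writing the two update rules side by side makes the single difference explicit (same notation as the algorithm slides):

```latex
\text{Q-learning:} \quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]
\\[4pt]
\text{SARSA:} \quad\;\; Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma\, Q(s',a') - Q(s,a) \,\big], \quad a' \text{ chosen by the exploration policy}
```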
Getting Q in trouble...
- Cliff-walking example (Sutton & Barto, Sec. 6.5)
- [Figure omitted: with ε-greedy exploration, Q-learning learns the optimal path along the cliff edge but occasionally falls off while exploring, so its online reward is worse than SARSA's, which learns the longer but safer path]
Radioactive breadcrumbs
- Can now define eligibility traces for SARSA
- In addition to the Q(s, a) table, keep an e(s, a) table
  - Records the eligibility (a real number) for each state/action pair
- At every step ((s, a, r, s', a') tuple):
  - Increment e(s, a) for the current (s, a) pair by 1
  - Update all Q(s, a) vals in proportion to their e(s, a)
  - Decay all e(s, a) by a factor of γλ
- Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL
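The bookkeeping on this slide is the standard accumulating-trace update; in symbols, with δ_t the usual TD error r + γQ(s', a') − Q(s, a):

```latex
e_t(s,a) \;=\;
\begin{cases}
\gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } (s,a) = (s_t, a_t) \\
\gamma\lambda\, e_{t-1}(s,a) & \text{otherwise}
\end{cases}
\qquad
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\, \delta_t\, e_t(s,a)
```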
SARSA(λ)-learning alg.
- Algorithm SARSA(λ)_learn
- Inputs: S, A, γ (0 < γ < 1), α (0 < α < 1), λ (0 < λ < 1)
- Outputs: Q
- e(s, a) ← 0 // for all s, a
- s ← get_curr_world_st(); a ← pick_nxt_act(Q, s)
- Repeat:
  - (r, s') ← act_in_world(a)
  - a' ← pick_next_action(Q, s')
  - δ ← r + γ Q(s', a') - Q(s, a)
  - e(s, a) ← e(s, a) + 1
  - foreach (s, a) pair in (S × A):
    - Q(s, a) ← Q(s, a) + α e(s, a) δ
    - e(s, a) ← γλ e(s, a)
  - a ← a'; s ← s'
- Until (bored)
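A sketch of SARSA(λ) with accumulating traces, again under the assumed env interface used in the earlier sketches; the dictionary e plays the role of the e(s, a) table, and only pairs that have actually been visited are stored.

```python
import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.8,
                       epsilon=0.1, steps=10000):
    """Tabular SARSA(lambda) sketch with accumulating eligibility traces."""
    Q = defaultdict(float)
    e = defaultdict(float)                      # eligibility trace e(s, a)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.current_state()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.act(a)
        a_next = pick_next_action(s_next)
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                        # drop a breadcrumb on (s, a)
        for sa in list(e.keys()):               # update every eligible pair
            Q[sa] += alpha * e[sa] * delta
            e[sa] *= gamma * lam                # let old breadcrumbs decay
        s, a = s_next, a_next
    return Q
```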