Reinforcement Learning: Learning Algorithms - PowerPoint PPT Presentation

Slides: 73
Provided by: CsabaSze
1
Reinforcement Learning: Learning Algorithms
  • Csaba Szepesvári
  • University of Alberta
  • Kioloa, MLSS08
  • Slides: http://www.cs.ualberta.ca/szepesva/MLSS08/

2
Contents
  • Defining the problem(s)
  • Learning optimally
  • Learning a good policy
  • Monte-Carlo
  • Temporal Difference (bootstrapping)
  • Batch fitted value iteration and relatives

3
The Learning Problem
  • The MDP is unknown but the agent can interact
    with the system
  • Goals:
  • Learn an optimal policy
  • Where do the samples come from?
  • Samples are generated externally
  • The agent interacts with the system to get the
    samples (active learning)
  • Performance measure: What is the performance of
    the policy obtained?
  • Learn optimally: Minimize regret while
    interacting with the system
  • Performance measure: loss in rewards due to not
    using the optimal policy from the beginning
  • Exploration vs. exploitation

4
Learning from Feedback
  • A protocol for prediction problems:
  • $x_t$: situation (observed by the agent)
  • $y_t \in Y$: value to be predicted
  • $p_t \in Y$: predicted value (can depend on all past
    values ⇒ learning!)
  • $r_t(x_t, y_t, y)$: value of predicting $y$; loss of
    learner: $\lambda_t = r_t(x_t, y_t, y) - r_t(x_t, y_t, p_t)$
  • Supervised learning: agent is told $y_t$,
    $r_t(x_t, y_t, \cdot)$
  • Regression: $r_t(x_t, y_t, y) = -(y - y_t)^2 \Rightarrow \lambda_t = (y_t - p_t)^2$
  • Full information prediction problem: $\forall y \in Y$,
    $r_t(x_t, y)$ is communicated to the agent, but not $y_t$
  • Bandit (partial information) problem: $r_t(x_t, p_t)$
    is communicated to the agent only

5
Learning Optimally
  • Explore or exploit?
  • Bandit problems
  • Simple schemes
  • Optimism in the face of uncertainty (OFU) ⇒ UCB
  • Learning optimally in MDPs with the OFU
    principle

6
Learning Optimally: Exploration vs. Exploitation
  • Two treatments
  • Unknown success probabilities
  • Goal
  • find the best treatment while losing few
    patients
  • Explore or exploit?

7
Exploration vs. Exploitation: Some Applications
  • Simple processes
  • Clinical trials
  • Job shop scheduling (random jobs)
  • What ad to put on a web-page
  • More complex processes (memory)
  • Optimizing production
  • Controlling an inventory
  • Optimal investment
  • Poker
  • ...

8
Bernoulli Bandits
  • Payoff is 0 or 1
  • Arm 1:
  • $R_1(1), R_2(1), R_3(1), R_4(1), \dots$
  • Arm 2:
  • $R_1(2), R_2(2), R_3(2), R_4(2), \dots$

[Figure: example 0/1 payoff sequences for the two arms: 0 1 0 0 1 1 0 1]
9
Some definitions
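Now: $t = 9$, $T_1(t-1) = 4$, $T_2(t-1) = 4$, $A_1 = 1$, $A_2 = 2$, ...
  • Payoff is 0 or 1
  • Arm 1:
  • $R_1(1), R_2(1), R_3(1), R_4(1), \dots$
  • Arm 2:
  • $R_1(2), R_2(2), R_3(2), R_4(2), \dots$

[Figure: example 0/1 payoff sequences for the two arms: 0 1 0 0 1 1 0 1]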
10
The Exploration/Exploitation Dilemma
  • Action values: $Q(a) = E[R_t(a)]$
  • Suppose you form estimates $Q_t(a) \approx Q(a)$
  • The greedy action at $t$ is $A_t^* = \arg\max_a Q_t(a)$
  • Exploitation: when the agent chooses to follow
    $A_t^*$
  • Exploration: when the agent chooses to do
    something else
  • You can't exploit all the time; you can't explore
    all the time
  • You can never stop exploring, but you should
    always reduce exploring. Maybe.

11
Action-Value Methods
  • Methods that adapt action-value estimates and
    nothing else
  • How to estimate action-values?
  • Sample average: $Q_t(a) = \frac{R_1(a) + \dots + R_{n_t(a)}(a)}{n_t(a)}$
  • Claim: $Q_t(a) \to Q(a)$ if
    $n_t(a) \to \infty$
  • Why??

12
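ε-Greedy Action Selection
  • Greedy action selection: $A_t = A_t^* = \arg\max_a Q_t(a)$
  • ε-Greedy: $A_t = A_t^*$ with probability $1 - \varepsilon$,
    a random action with probability $\varepsilon$

... the simplest way to balance exploration
and exploitation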
13
10-Armed Testbed
  • $n = 10$ possible actions
  • Repeat 2000 times:
  • $Q(a) \sim N(0,1)$
  • Play 1000 rounds
  • $R_t(a) \sim N(Q(a),1)$
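
The testbed protocol above can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: the function names `run_bandit` and `average_reward` are hypothetical, estimates start at zero (an assumption), and the number of repetitions is a parameter so the sketch runs quickly.

```python
import random

def run_bandit(epsilon, n_arms=10, n_rounds=1000, rng=None):
    """Play one randomly drawn task with epsilon-greedy; return the rewards."""
    if rng is None:
        rng = random.Random()
    q_true = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]    # Q(a) ~ N(0, 1)
    q_est = [0.0] * n_arms     # sample-average estimates Q_t(a), assumed to start at 0
    counts = [0] * n_arms      # n_t(a): number of pulls of arm a so far
    rewards = []
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                          # explore
        else:
            a = max(range(n_arms), key=lambda i: q_est[i])     # exploit (greedy)
        r = rng.gauss(q_true[a], 1.0)                          # R_t(a) ~ N(Q(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]                 # running sample average
        rewards.append(r)
    return rewards

def average_reward(epsilon, n_tasks=2000, n_rounds=1000, seed=0):
    """Average per-round reward over n_tasks independently drawn tasks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        total += sum(run_bandit(epsilon, n_rounds=n_rounds, rng=rng))
    return total / (n_tasks * n_rounds)
```

Calling `average_reward(0.0)` and `average_reward(0.1)` gives a rough comparison of pure greedy and ε-greedy selection on this testbed.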

14
ε-Greedy Methods on the 10-Armed Testbed
15
Softmax Action Selection
  • Problem with ε-greedy: neglects action values
  • Softmax idea: grade action probs. by estimated
    values.
  • Gibbs, or Boltzmann action selection, or
    exponential weights: $P(A_t = a) = \frac{e^{Q_t(a)/\tau_t}}{\sum_b e^{Q_t(b)/\tau_t}}$
  • $\tau = \tau_t$ is the computational temperature
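
A minimal sketch of Gibbs/Boltzmann action selection, assuming the standard form in which an action's probability is proportional to exp(Q(a)/τ). The function names are hypothetical.

```python
import math
import random

def softmax_probs(q_values, tau):
    """Gibbs/Boltzmann probabilities: P(a) proportional to exp(Q(a) / tau)."""
    m = max(q_values)                    # subtract the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

def softmax_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A large temperature makes the choice nearly uniform; a small temperature makes it nearly greedy.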

16
Incremental Implementation
  • Sample average
  • Incremental computation
  • Common update rule form:
  • NewEstimate = OldEstimate +
    StepSize [Target - OldEstimate]
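
As a small illustration of this update form: with step size 1/n the rule reproduces the sample average exactly, while a constant step size gives an exponentially weighted average. The function names are hypothetical.

```python
def incremental_mean(samples):
    """Running average via NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    estimate = 0.0
    for n, target in enumerate(samples, start=1):
        step_size = 1.0 / n              # 1/n reproduces the sample average
        estimate += step_size * (target - estimate)
    return estimate

def constant_step_estimate(samples, step_size, initial=0.0):
    """Same rule with a constant step size: recent targets weigh more."""
    estimate = initial
    for target in samples:
        estimate += step_size * (target - estimate)
    return estimate
```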

17
UCB: Upper Confidence Bounds
[Auer et al. 02]
  • Principle: Optimism in the face of uncertainty
  • Works when the environment is not adversarial
  • Assume rewards are in [0,1]. Let
  • ($p > 2$)
  • For a stationary environment, with iid rewards
    this algorithm is hard to beat!
  • Formally: regret in T steps is O(log T)
  • Improvement: Estimate variance, use it in place
    of $p$ [AuSzeMu 07]
  • This principle can be used for achieving small
    regret in the full RL problem!
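
The index formula on this slide did not survive extraction. The sketch below therefore uses the UCB1 index from Auer et al. (2002), mean plus sqrt(2 ln t / n), which may differ in its constant from the slide's version with parameter p. The function name `ucb1` and the Bernoulli test arms are assumptions for illustration.

```python
import math
import random

def ucb1(arm_means, n_rounds, seed=0):
    """Run UCB1 on Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            a = t - 1                    # pull every arm once to initialise
        else:
            # optimistic index: empirical mean plus a confidence width
            a = max(range(n_arms),
                    key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < arm_means[a] else 0.0   # payoff in {0, 1}
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
    return counts
```

On two arms with success probabilities 0.2 and 0.8, most pulls should go to the second arm, with the number of pulls of the worse arm growing only logarithmically.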

18
UCRL2: UCB Applied to RL
  • [Auer, Jaksch & Ortner 07]
  • Algorithm UCRL2($\delta$):
  • Phase initialization:
  • Estimate mean model $p_0$ using maximum likelihood
    (counts)
  • $C = \{ p : \|p(\cdot|x,a) - p_0(\cdot|x,a)\|_1 \le c \sqrt{|X| \log(|A| T/\delta) / N(x,a)} \}$
  • $p^* = \arg\max_{p \in C} \rho^*(p)$, $\pi = \pi^*(p^*)$
  • $N_0(x,a) = N(x,a)$, $\forall (x,a) \in X \times A$
  • Execution:
  • Execute $\pi$ until some $(x,a)$ has been visited at
    least $N_0(x,a)$ times in this phase

19
UCRL2 Results
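  • Def.: Diameter of an MDP $M$: $D(M) = \max_{x,y} \min_\pi E[T(x \to y; \pi)]$
  • Regret bounds
  • Lower bound: $E[L_T] = \Omega((D |X| |A| T)^{1/2})$
  • Upper bounds:
  • w.p. $1 - \delta/T$: $L_T = O(D |X| (|A| T \log(|A| T/\delta))^{1/2})$
  • w.p. $1 - \delta$: $L_T = O(D^2 |X|^2 |A| \log(|A| T/\delta)/\Delta)$,
    where $\Delta$ is the performance gap between best and second best
    policy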

20
Learning a Good Policy
  • Monte-Carlo methods
  • Temporal Difference methods
  • Tabular case
  • Function approximation
  • Batch learning

21
Learning a good policy
  • Model-based learning
  • Learn p,r
  • Solve the resulting MDP
  • Model-free learning
  • Learn the optimal action-value function and
    (then) act greedily
  • Actor-critic learning
  • Policy gradient methods
  • Hybrid
  • Learn a model and mix planning and a model-free
    method, e.g., Dyna

22
Monte-Carlo Methods
  • Episodic MDPs!
  • Goal: Learn $V^\pi(\cdot)$
  • $V^\pi(x) = E_\pi[\sum_t \gamma^t R_t | X_0 = x]$
  • $(X_t, A_t, R_t)$: trajectory of $\pi$
  • Visits to a state:
  • $f(x) = \min\{t : X_t = x\}$
  • First visit
  • $E(x) = \{t : X_t = x\}$
  • Every visit
  • Return:
  • $S(t) = \gamma^0 R_t + \gamma^1 R_{t+1} + \dots$
  • K independent trajectories ⇒ $S^{(k)}, E^{(k)}, f^{(k)}$,
    $k = 1..K$
  • First-visit MC:
  • Average over
  • $\{ S^{(k)}( f^{(k)}(x) ) \}_{k=1..K}$
  • Every-visit MC:
  • Average over
  • $\{ S^{(k)}(t) \}_{k=1..K,\ t \in E^{(k)}(x)}$
  • Claim: Both converge to $V^\pi(\cdot)$
  • From now on $S_t = S(t)$

[Singh & Sutton 96]
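
The two estimators can be sketched as follows, assuming each trajectory is given as a list of (state, reward) pairs. That data layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def returns(rewards, gamma):
    """S(t) = R_t + gamma * R_{t+1} + ..., computed backwards for every t."""
    out = [0.0] * len(rewards)
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc = rewards[t] + gamma * acc
        out[t] = acc
    return out

def mc_evaluate(trajectories, gamma, first_visit=True):
    """Average returns over first visits (or over every visit) to each state."""
    sums, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        s = returns([r for _, r in traj], gamma)
        seen = set()
        for t, (x, _) in enumerate(traj):
            if first_visit and x in seen:
                continue                 # skip later visits in first-visit mode
            seen.add(x)
            sums[x] += s[t]
            counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}
```

For a single trajectory A, B, A with rewards 0, 1, 2 and discount 0.5, the first-visit estimate for A uses only the return from time 0, while the every-visit estimate averages the returns from times 0 and 2.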
23
Learning to Control with MC
  • Goal: Learn to behave optimally
  • Method:
  • Learn $Q^\pi(x,a)$
  • ...to be used in an approximate policy iteration
    (PI) algorithm
  • Idea/algorithm:
  • Add randomness
  • Goal: all actions are sampled eventually
    infinitely often
  • e.g., ε-greedy or exploring starts
  • Use the first-visit or the every-visit method to
    estimate $Q^\pi(x,a)$
  • Update policy
  • Once values converged... or...
  • Always at the states visited

24
Monte-Carlo: Evaluation
  • Convergence rate: $\mathrm{Var}(S(0) | X = x)/N$
  • Advantages over DP:
  • Learn from interaction with environment
  • No need for full models
  • No need to learn about ALL states
  • Less harm by Markovian violations (no
    bootstrapping)
  • Issue: maintaining sufficient exploration
  • exploring starts, soft policies

25
Temporal Difference Methods
[Samuel 59], [Holland 75], [Sutton 88]
  • Every-visit Monte-Carlo:
  • $V(X_t) \leftarrow V(X_t) + \alpha_t(X_t) (S_t - V(X_t))$
  • Bootstrapping:
  • $S_t = R_t + \gamma S_{t+1}$
  • $S'_t = R_t + \gamma V(X_{t+1})$
  • TD(0):
  • $V(X_t) \leftarrow V(X_t) + \alpha_t(X_t) (S'_t - V(X_t))$
  • Value iteration:
  • $V(X_t) \leftarrow E[S'_t | X_t]$
  • Theorem: Let $V_t$ be the sequence of functions
    generated by TD(0). Assume $\forall x$, w.p.1 $\sum_t \alpha_t(x) = \infty$,
    $\sum_t \alpha_t^2(x) < \infty$. Then $V_t \to V^\pi$ w.p.1.
  • Proof: Stochastic approximations: $V_{t+1} = T_t(V_t, V_t)$,
    $U_{t+1} = T_t(U_t, V^\pi) \to T V^\pi$. [Jaakkola et al. 94],
    [Tsitsiklis 94], [SzeLi99]
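
A minimal sketch of tabular TD(0) on a small symmetric random walk. The environment (five non-terminal states, reward 1 for exiting on the right) and the step-size schedule n^(-0.6) are assumptions chosen for illustration; the schedule satisfies the two step-size conditions of the theorem.

```python
import random

def td0_random_walk(n_states=5, n_episodes=2000, gamma=1.0, seed=0):
    """Estimate state values of a random walk with TD(0)."""
    rng = random.Random(seed)
    v = [0.0] * (n_states + 2)       # indices 0 and n_states+1 are terminal (value 0)
    visits = [0] * (n_states + 2)
    for _ in range(n_episodes):
        x = (n_states + 1) // 2      # start in the middle state
        while 0 < x < n_states + 1:
            x_next = x + rng.choice((-1, 1))
            r = 1.0 if x_next == n_states + 1 else 0.0
            visits[x] += 1
            alpha = visits[x] ** -0.6                      # sum = inf, sum of squares < inf
            v[x] += alpha * (r + gamma * v[x_next] - v[x])  # TD(0) update
            x = x_next
    return v[1:-1]
```

With undiscounted returns the true value of state i (counting from 1) is i/6, so the estimates should increase from left to right.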

26
TD or MC?
  • TD advantages
  • can be fully incremental, i.e., learn before
    knowing the final outcome
  • Less memory
  • Less peak computation
  • learn without the final outcome
  • From incomplete sequences
  • MC advantage
  • Less harm by Markovian violations
  • Convergence rate?
  • $\mathrm{Var}(S(0) | X = x)$ decides!

27
Learning to Control with TD
  • Q-learning [Watkins 90]: $Q(X_t,A_t) \leftarrow Q(X_t,A_t) + \alpha_t(X_t,A_t) \{R_t + \gamma \max_a Q(X_{t+1},a) - Q(X_t,A_t)\}$
  • Theorem: Converges to $Q^*$ [JJS94, Tsi94, SzeLi99]
  • SARSA [Rummery & Niranjan 94]:
  • $A_t \sim \mathrm{Greedy}_\varepsilon(Q, X_t)$
  • $Q(X_t,A_t) \leftarrow Q(X_t,A_t) + \alpha_t(X_t,A_t) \{R_t + \gamma Q(X_{t+1},A_{t+1}) - Q(X_t,A_t)\}$
  • Off-policy (Q-learning) vs. on-policy (SARSA)
  • Expecti-SARSA
  • Actor-Critic [Witten 77], [Barto, Sutton &
    Anderson 83], [Sutton 84]
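
The two update rules differ only in the target. The sketch below runs both on a hypothetical five-state corridor (reward 1 on reaching the right end); the environment, the constant step size and all names are assumptions made for illustration.

```python
import random

N_STATES, GOAL, ACTIONS = 5, 4, (-1, +1)   # hypothetical corridor; actions: left, right

def step(x, a):
    """Deterministic dynamics: reward 1 on reaching the goal, 0 otherwise."""
    x_next = min(max(x + a, 0), GOAL)
    return x_next, (1.0 if x_next == GOAL else 0.0)

def eps_greedy(q, x, eps, rng):
    """Pick a random action w.p. eps, otherwise a greedy one (ties broken at random)."""
    if rng.random() < eps:
        return rng.randrange(len(ACTIONS))
    best = max(q[x])
    return rng.choice([i for i, v in enumerate(q[x]) if v == best])

def learn(method, n_episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """method is 'q_learning' (off-policy) or 'sarsa' (on-policy)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # the goal row stays at 0 (terminal)
    for _ in range(n_episodes):
        x = 0
        a = eps_greedy(q, x, eps, rng)
        while x != GOAL:
            x_next, r = step(x, ACTIONS[a])
            a_next = eps_greedy(q, x_next, eps, rng)
            if method == "q_learning":
                target = r + gamma * max(q[x_next])       # value of the greedy successor action
            else:
                target = r + gamma * q[x_next][a_next]    # value of the action actually taken
            q[x][a] += alpha * (target - q[x][a])
            x, a = x_next, a_next
    return q
```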

28
Cliffwalking
ε-greedy, ε = 0.1
29
N-step TD Prediction
  • Idea: Look farther into the future when you do TD
    backup (1, 2, 3, ..., n steps)

30
N-step TD Prediction
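  • Monte Carlo:
  • $S_t = R_t + \gamma R_{t+1} + \dots + \gamma^{T-t} R_T$
  • TD: $S_t^{(1)} = R_t + \gamma V(X_{t+1})$
  • Use $V$ to estimate remaining return
  • n-step TD:
  • 2-step return:
  • $S_t^{(2)} = R_t + \gamma R_{t+1} + \gamma^2 V(X_{t+2})$
  • n-step return:
  • $S_t^{(n)} = R_t + \gamma R_{t+1} + \dots + \gamma^n V(X_{t+n})$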

31
Learning with n-step Backups
  • Learning with n-step backups:
  • $V(X_t) \leftarrow V(X_t) + \alpha_t ( S_t^{(n)} - V(X_t) )$
  • $n$ controls how much to bootstrap
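
A small sketch of how the n-step target can be computed from a recorded episode. The list-based layout (one more value than rewards, with the terminal value last) and the function name are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards, then bootstrap from V.

    rewards[k] is R_k for k = 0..len(rewards)-1; values[k] is V(X_k), and
    values has one extra entry at the end for the terminal state (usually 0).
    """
    end = min(t + n, len(rewards))       # truncate at the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]
```

With `n = 1` this is the TD(0) target; once `n` reaches past the end of the episode it equals the Monte-Carlo return.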

32
Random Walk Examples
  • How does 2-step TD work here?
  • How about 3-step TD?

33
A Larger Example
  • Task: 19-state random walk
  • Do you think there is an optimal n? For
    everything?

34
Averaging N-step Returns
  • Idea: backup an average of several returns
  • e.g., backup half of 2-step and half of 4-step
  • complex backup

One backup
35
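Forward View of TD(λ)
[Sutton 88]
  • Idea: Average over multiple backups
  • λ-return:
  • $S_t^{(\lambda)} = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n S_t^{(n+1)}$
  • TD(λ):
  • $V(X_t) \leftarrow V(X_t) + \alpha_t ( S_t^{(\lambda)} - V(X_t) )$
  • Relation to TD(0) and MC:
  • $\lambda = 0$ ⇒ TD(0)
  • $\lambda = 1$ ⇒ MC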
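
For an episode that terminates, the infinite sum in the λ-return collapses: every n-step return that reaches past the end equals the Monte-Carlo return, so the remaining weight falls on it. The sketch below uses that form; the data layout and names are assumptions for illustration.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """Lambda-return at time t for an episode with len(rewards) steps.

    rewards[k] is R_k; values[k] is V(X_k), with one extra final entry for
    the terminal state (usually 0).
    """
    T = len(rewards)

    def n_step(n):
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        return g + gamma ** (end - t) * values[end]

    total = 0.0
    for n in range(1, T - t):            # n-step returns that still bootstrap
        total += (1.0 - lam) * lam ** (n - 1) * n_step(n)
    # the remaining weight lam^(T-t-1) goes to the full Monte-Carlo return
    return total + lam ** (T - t - 1) * n_step(T - t)
```

Setting `lam = 0` recovers the one-step TD target and `lam = 1` the Monte-Carlo return.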

36
λ-return on the Random Walk
  • Same 19-state random walk as before
  • Why are intermediate values of λ best?

37
Backward View of TD(λ)
[Sutton 88], [Singh & Sutton 96]
  • $\delta_t = R_t + \gamma V(X_{t+1}) - V(X_t)$
  • $V(x) \leftarrow V(x) + \alpha_t \delta_t e(x)$
  • $e(x) \leftarrow \gamma\lambda e(x) + I(x = X_t)$
  • Off-line updates ⇒ Same as FW TD(λ)
  • $e(x)$: eligibility trace
  • Accumulating trace
  • Replacing traces speed up convergence:
  • $e(x) \leftarrow \max( \gamma\lambda e(x), I(x = X_t) )$
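
A minimal tabular sketch of the backward view. It follows the common ordering in which the traces are decayed and incremented before the value update, which is an assumption about the intended order; the episode format and names are also hypothetical.

```python
def td_lambda_episode(v, episode, alpha, gamma, lam, replacing=False):
    """One on-line pass of backward-view TD(lambda) over an episode.

    v: dict state -> value (updated in place).
    episode: list of (x, r, x_next) transitions, x_next=None at termination.
    """
    e = {x: 0.0 for x in v}                        # eligibility traces
    for x, r, x_next in episode:
        v_next = 0.0 if x_next is None else v[x_next]
        delta = r + gamma * v_next - v[x]          # TD error
        for s in e:
            e[s] *= gamma * lam                    # decay all traces
        e[x] = 1.0 if replacing else e[x] + 1.0    # replacing vs. accumulating trace
        for s in v:
            v[s] += alpha * delta * e[s]           # credit all recently visited states
    return v
```

With `lam = 1` the final reward is credited to every state visited earlier in the episode; with `lam = 0` only the current state is updated.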

38
Function Approximation with TD
39
Gradient Descent Methods
  • Assume $V_t$ is a differentiable function of $\theta$:
  • $V_t(x) = V(x; \theta)$.
  • Assume, for now, training examples of the form
  • $(X_t, V^\pi(X_t))$

40
Performance Measures
  • Many are applicable but
  • a common and simple one is the mean-squared error
    (MSE) over a distribution P
  • Why P?
  • Why minimize MSE?
  • Let us assume that P is always the distribution
    of states at which backups are done.
  • The on-policy distribution: the distribution
    created while following the policy being
    evaluated. Stronger results are available for
    this distribution.

41
Gradient Descent
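  • Let $L$ be any function of the parameters. Its
    gradient at any point $\theta$ in this space is
    $\nabla_\theta L = (\partial L/\partial\theta_1, \dots, \partial L/\partial\theta_n)^T$
  • Iteratively move down the gradient:
    $\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta L(\theta_t)$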

42
Gradient Descent in RL
  • Function to descend on
  • Gradient
  • Gradient descent procedure
  • Bootstrapping with St
  • TD(λ) (forward view)
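
The formulas for this slide were lost in extraction. A standard reconstruction, assuming the mean-squared error over the state distribution $P$ as the objective (the factor 1/2 is a convenience assumption) and using $S_t$ for a sampled return and $S_t^{(\lambda)}$ for the λ-return, would read:

```latex
\begin{align*}
L(\theta) &= \tfrac{1}{2} \sum_x P(x)\,\bigl(V^\pi(x) - V(x;\theta)\bigr)^2
  && \text{(function to descend on)} \\
\nabla_\theta L(\theta) &= -\sum_x P(x)\,\bigl(V^\pi(x) - V(x;\theta)\bigr)\,\nabla_\theta V(x;\theta)
  && \text{(gradient)} \\
\theta_{t+1} &= \theta_t + \alpha_t \bigl(V^\pi(X_t) - V(X_t;\theta_t)\bigr)\,\nabla_\theta V(X_t;\theta_t)
  && \text{(sampled descent step, } X_t \sim P\text{)} \\
\theta_{t+1} &= \theta_t + \alpha_t \bigl(S_t - V(X_t;\theta_t)\bigr)\,\nabla_\theta V(X_t;\theta_t)
  && \text{(unknown } V^\pi(X_t) \text{ replaced by } S_t\text{)} \\
\theta_{t+1} &= \theta_t + \alpha_t \bigl(S_t^{(\lambda)} - V(X_t;\theta_t)\bigr)\,\nabla_\theta V(X_t;\theta_t)
  && \text{(TD($\lambda$), forward view)}
\end{align*}
```

When the target itself depends on $\theta$ through bootstrapping, the last two updates are no longer true gradient steps on $L$, since the dependence of the target on $\theta$ is ignored.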

43
Linear Methods
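[Sutton 84, 88], [Tsitsiklis & Van Roy 97]
  • Linear FAPP: $V(x; \theta) = \theta^T \phi(x)$
  • $\nabla_\theta V(x; \theta) = \phi(x)$
  • Tabular representation: $\phi(x)_y = I(x = y)$
  • Backward view:
  • $\delta_t = R_t + \gamma V(X_{t+1}) - V(X_t)$
  • $\theta \leftarrow \theta + \alpha_t \delta_t e$
  • $e \leftarrow \gamma\lambda e + \nabla_\theta V(X_t; \theta)$
  • Theorem [TsiVaR97]: $V_t$ converges to $V$ s.t.
    $\|V - V^\pi\|_{D,2} \le \|V^\pi - \Pi V^\pi\|_{D,2}/(1-\gamma)$.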
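
A minimal sketch of backward-view TD(λ) with a linear value function, written with plain lists so it needs no external libraries. The feature map is passed in as a function; the trace is updated before the parameters, which is an assumption about the intended order. All names are hypothetical.

```python
def linear_td_lambda(theta, episode, features, alpha, gamma, lam):
    """Backward-view TD(lambda) with V(x; theta) = theta^T phi(x).

    episode: list of (x, r, x_next) transitions, x_next=None at termination.
    features: function mapping a state to its feature vector phi(x) (a list).
    Returns the updated parameter vector.
    """
    def value(x):
        if x is None:                    # terminal state has value 0
            return 0.0
        return sum(w * f for w, f in zip(theta, features(x)))

    e = [0.0] * len(theta)               # one trace per parameter
    for x, r, x_next in episode:
        delta = r + gamma * value(x_next) - value(x)
        phi = features(x)                # gradient of V(x; theta) w.r.t. theta
        e = [gamma * lam * ei + fi for ei, fi in zip(e, phi)]
        theta = [w + alpha * delta * ei for w, ei in zip(theta, e)]
    return theta
```

With indicator (one-hot) features this reduces to the tabular algorithm with accumulating traces.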

44
Control with FA
[Rummery & Niranjan 94]
  • Learning state-action values
  • Training examples
  • The general gradient-descent rule
  • Gradient-descent Sarsa(λ)

45
Mountain-Car Task
[Sutton 96], [Singh & Sutton 96]
46
Mountain-Car Results
47
Baird's Counterexample: Off-policy Updates Can Diverge
[Baird 95]
48
Baird's Counterexample Cont.
49
Should We Bootstrap?
50
Batch Reinforcement Learning
51
Batch RL
  • Goal: Given the trajectory of the behavior policy
    $\pi_b$: $X_1, A_1, R_1, \dots, X_t, A_t, R_t, \dots, X_N$, compute a
    good policy!
  • Batch learning
  • Properties:
  • Data collection is not influenced
  • Emphasis is on the quality of the solution
  • Computational complexity plays a secondary role
  • Performance measures:
  • $\|V^* - V^\pi\|_\infty = \sup_x |V^*(x) - V^\pi(x)| = \sup_x (V^*(x) - V^\pi(x))$
  • $\|V^* - V^\pi\|_2^2 = \int (V^*(x) - V^\pi(x))^2 d\mu(x)$

52
Solution methods
[Bradtke & Barto 96], [Lagoudakis & Parr 03],
[AnSzeMu 07]
  • Build a model
  • Do not build a model, but find an approximation
    to $Q^*$
  • using value iteration ⇒ fitted Q-iteration
  • using policy iteration ⇒
  • Policy evaluated by approximate value iteration
  • Policy evaluated by Bellman-residual minimization
    (BRM)
  • Policy evaluated by least-squares temporal
    difference learning (LSTD) ⇒ LSPI
  • Policy search

53
Evaluating a policy: Fitted value iteration
  • Choose a function space F.
  • Solve for $i = 1, 2, \dots, M$ the LS (regression) problems
  • Counterexamples?!?!? [Baird 95], [Tsitsiklis
    and van Roy 96]
  • When does this work??
  • Requirement: If $M$ is big enough and the number of
    samples is big enough, $Q_M$ should be close to $Q^\pi$
  • We have to make some assumptions on F

54
Least-squares vs. gradient
  • Linear least squares (ordinary regression): $y_t =
    w_*^T x_t + \varepsilon_t$, $(x_t, y_t)$ jointly distributed
    r.v.s., iid, $E[\varepsilon_t | x_t] = 0$.
  • Seeing $(x_t, y_t)$, $t = 1, \dots, T$, find out $w_*$.
  • Loss function: $L(w) = E[(y_1 - w^T x_1)^2]$.
  • Least-squares approach:
  • $w_T = \arg\min_w \sum_{t=1}^T (y_t - w^T x_t)^2$
  • Stochastic gradient method:
  • $w_{t+1} = w_t + \alpha_t (y_t - w_t^T x_t) x_t$
  • Tradeoffs:
  • Sample complexity: How good is the estimate?
  • Computational complexity: How expensive is the
    computation?
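
The two approaches can be compared on synthetic data. The sketch below uses a scalar weight so that the least-squares solution has a one-line closed form; the data generator, the step-size schedule t^(-0.6) and the function names are assumptions for illustration.

```python
import random

def make_data(w_star, n, noise=0.1, seed=0):
    """Draw n pairs with y = w_star * x + Gaussian noise (zero mean given x)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [w_star * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def least_squares_1d(xs, ys):
    """Closed-form argmin_w sum_t (y_t - w x_t)^2 for a scalar w."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sgd_1d(xs, ys, w0=0.0):
    """Stochastic gradient: w <- w + alpha_t (y_t - w x_t) x_t, one pass over the data."""
    w = w0
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        alpha = t ** -0.6                # decaying step size
        w += alpha * (y - w * x) * x
    return w
```

Least squares uses every sample in a single solve, while the stochastic gradient method needs only constant work per sample; comparing how close each estimate gets to `w_star` after the same number of samples illustrates the tradeoff.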

55
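Fitted value iteration: Analysis
After [AnSzeMu 07]
  • Goal: Bound $\|Q_M - Q^\pi\|_\mu^2$ in terms of
    $\max_m \|\varepsilon_m\|_\nu^2$, $\|\varepsilon_m\|_\nu^2 = \int \varepsilon_m^2(x,a)
    \nu(dx,da)$, where $Q_{m+1} = T^\pi Q_m + \varepsilon_m$, $\varepsilon_{-1} = Q_0 - Q^\pi$
  • $U_m = Q_m - Q^\pi$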

56
Analysis/2
57
Summary
  • If the regression errors are all small and the
    system is noisy ($\forall \pi, \rho$: $\rho P^\pi \le C_1 \nu$) then the
    final error will be small.
  • How to make the regression errors small?
  • Regression error decomposition

58
Controlling the approximation error
59
Controlling the approximation error
60
Controlling the approximation error
61
Controlling the approximation error
  • Assume smoothness!

62
Learning with (lots of) historical data
  • Data: A long trajectory of some exploration
    policy
  • Goal: Efficient algorithm to learn a policy
  • Idea: Use fitted action-values
  • Algorithms:
  • Bellman residual minimization, FQI [AnSzeMu 07]
  • LSPI [Lagoudakis & Parr 03]
  • Bounds:
  • Oracle inequalities (BRM, FQI and LSPI)
  • ⇒ consistency

63
BRM insight
[AnSzeMu 07]
  • TD error: $\delta_t = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)$
  • Bellman error: $E[ (E[\delta_t | X_t, A_t])^2 ]$
  • What we can compute/estimate: $E[ E[\delta_t^2 | X_t, A_t] ]$
  • They are different!
  • However
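
The gap between the two quantities can be written out with the standard conditional variance decomposition. Here $\delta_t$ denotes the TD error, that is, the reward plus the discounted value of the next state-action pair minus $Q(X_t, A_t)$. This identity is a general fact and may not be the exact formula shown on the slide:

```latex
\begin{align*}
E\bigl[\delta_t^2 \mid X_t, A_t\bigr]
  &= \bigl(E[\delta_t \mid X_t, A_t]\bigr)^2 + \mathrm{Var}\bigl(\delta_t \mid X_t, A_t\bigr), \\
E\Bigl[E\bigl[\delta_t^2 \mid X_t, A_t\bigr]\Bigr]
  &= \underbrace{E\Bigl[\bigl(E[\delta_t \mid X_t, A_t]\bigr)^2\Bigr]}_{\text{Bellman error}}
   + E\Bigl[\mathrm{Var}\bigl(\delta_t \mid X_t, A_t\bigr)\Bigr].
\end{align*}
```

The extra variance term depends on $Q$, so minimizing the mean squared TD error is in general not the same as minimizing the Bellman error; the two coincide when the transitions and rewards are deterministic.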

64
Loss function
65
Algorithm (BRM)
66
Do we need to reweight or throw away data?
  • NO!
  • WHY?
  • Intuition from regression
  • $m(x) = E[Y | X = x]$ can be learnt no matter what $p(x)$
    is!
  • $\pi^*(a|x)$: the same should be possible!
  • BUT..
  • Performance might be poor! ⇒ YES!
  • Like in supervised learning when training and
    test distributions are different

67
Bound
68
The concentration coefficients
  • Lyapunov exponents
  • Our case
  • yt is infinite dimensional
  • Pt depends on the policy chosen
  • If the top Lyapunov exponent is ≤ 0, we are good?

69
Open question
  • Abstraction
  • Let
  • True?

70
Relation to LSTD
[AnSzeMu 07]
  • LSTD
  • Linear function space
  • Bootstrap the normal equation

71
Open issues
  • Adaptive algorithms to take advantage of
    regularity when present to address the curse of
    dimensionality
  • Penalized least-squares/aggregation?
  • Feature relevance
  • Factorization
  • Manifold estimation
  • Abstraction: build automatically
  • Active learning
  • Optimal on-line learning for infinite problems

72
References
  • [Auer et al. 02] P. Auer, N. Cesa-Bianchi and P.
    Fischer: Finite time analysis of the multiarmed
    bandit problem, Machine Learning, 47:235-256,
    2002.
  • [AuSzeMu 07] J.-Y. Audibert, R. Munos and Cs.
    Szepesvári: Tuning bandit algorithms in
    stochastic environments, ALT, 2007.
  • [Auer, Jaksch & Ortner 07] P. Auer, T. Jaksch
    and R. Ortner: Near-optimal Regret Bounds for
    Reinforcement Learning, (2007), available at
    http://www.unileoben.ac.at/infotech/publications/ucrlrevised.pdf
  • [Singh & Sutton 96] S.P. Singh and R.S.
    Sutton: Reinforcement learning with replacing
    eligibility traces. Machine Learning, 22:123-158,
    1996.
  • [Sutton 88] R.S. Sutton: Learning to predict by
    the method of temporal differences. Machine
    Learning, 3:9-44, 1988.
  • [Jaakkola et al. 94] T. Jaakkola, M.I. Jordan,
    and S.P. Singh: On the convergence of stochastic
    iterative dynamic programming algorithms. Neural
    Computation, 6:1185-1201, 1994.
  • [Tsitsiklis 94] J.N. Tsitsiklis: Asynchronous
    stochastic approximation and Q-learning. Machine
    Learning, 16:185-202, 1994.
  • [SzeLi99] Cs. Szepesvári and M.L. Littman: A
    Unified Analysis of Value-Function-Based
    Reinforcement-Learning Algorithms, Neural
    Computation, 11:2017-2059, 1999.
  • [Watkins 90] C.J.C.H. Watkins: Learning from
    Delayed Rewards, PhD Thesis, 1990.
  • [Rummery & Niranjan 94] G.A. Rummery and M.
    Niranjan: On-line Q-learning using connectionist
    systems. Technical Report CUED/F-INFENG/TR 166,
    Cambridge University Engineering Department,
    1994.
  • [Sutton 84] R.S. Sutton: Temporal Credit
    Assignment in Reinforcement Learning. PhD
    thesis, University of Massachusetts, Amherst, MA,
    1984.
  • [Tsitsiklis & Van Roy 97] J.N. Tsitsiklis and B.
    Van Roy: An analysis of temporal-difference
    learning with function approximation. IEEE
    Transactions on Automatic Control, 42:674-690,
    1997.
  • [Sutton 96] R.S. Sutton: Generalization in
    reinforcement learning: Successful examples using
    sparse coarse coding. NIPS, 1996.
  • [Baird 95] L.C. Baird: Residual algorithms:
    Reinforcement learning with function
    approximation, ICML, 1995.
  • [Bradtke & Barto 96] S.J. Bradtke and A.G. Barto:
    Linear least-squares algorithms for temporal
    difference learning. Machine Learning, 22:33-57,
    1996.
  • [Lagoudakis & Parr 03] M. Lagoudakis and R. Parr:
    Least-squares policy iteration, Journal of
    Machine Learning Research, 4:1107-1149, 2003.
  • [AnSzeMu 07] A. Antos, Cs. Szepesvári and R.
    Munos: Learning near-optimal policies with
    Bellman-residual minimization based fitted policy
    iteration and a single sample path, Machine
    Learning Journal, 2007.