1
Reinforcement Learning
  • Presented by
  • Bibhas Chakraborty and Lacey Gunter

2
What is Machine Learning?
  • A method to learn about some phenomenon from
    data, when there is little scientific theory
    (e.g., physical or biological laws) relative to
    the size of the feature space.
  • The goal is to build an intelligent machine that can make decisions (or predictions) in unknown situations.
  • The science of learning plays a key role in areas
    like statistics, data mining and artificial
    intelligence. It also arises in engineering,
    medicine, psychology and finance.

3
Types of Learning
  • Supervised Learning
  • - Training data: (X, Y). (features, label)
  • - Predict Y, minimizing some loss.
  • - Regression, classification.
  • Unsupervised Learning
  • - Training data: X. (features only)
  • - Find similar points in high-dimensional X-space.
  • - Clustering. (Both settings are sketched in code below.)
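To make the two settings concrete, here is a minimal sketch (added for illustration, not from the original slides) on made-up data; the scikit-learn estimators stand in for any regression or clustering method.

import numpy as np
from sklearn.linear_model import LinearRegression  # supervised: needs (X, Y)
from sklearn.cluster import KMeans                  # unsupervised: needs only X

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                         # features
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)  # labels

# Supervised learning: predict Y from X, minimizing squared-error loss.
reg = LinearRegression().fit(X, Y)
print("regression coefficients:", reg.coef_)

# Unsupervised learning: no labels, just group similar points in X-space.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("cluster sizes:", np.bincount(labels))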

4
Types of Learning (Contd.)
  • Reinforcement Learning
  • - Training data: (S, A, R). (State, Action, Reward) (See the small sketch after this list.)
  • - Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.
  • - Robotics, board-game-playing programs.
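As a toy illustration of what (S, A, R) training data and a policy might look like in code (the states, actions, and rewards here are invented for the example):

from collections import namedtuple

# One step of experience: the learner saw a state, took an action, got a reward.
Step = namedtuple("Step", ["state", "action", "reward"])

# A trajectory of (S, A, R) triples collected by interacting with the environment.
trajectory = [
    Step(state="low_battery", action="recharge", reward=1.0),
    Step(state="charged",     action="explore",  reward=0.5),
    Step(state="stuck",       action="back_up",  reward=-1.0),
]

# A deterministic policy: a rule mapping each state to an action.
policy = {"low_battery": "recharge", "charged": "explore", "stuck": "back_up"}

# The long-term reward for this trajectory is the sum of its rewards.
total_reward = sum(step.reward for step in trajectory)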

5
Example of Supervised Learning
  • Predict the price of a stock in 6 months from
    now, based on economic data. (Regression)
  • Predict whether a patient, hospitalized due to a
    heart attack, will have a second heart attack.
    The prediction is to be based on demographic,
    diet and clinical measurements for that patient.
    (Logistic Regression)

6
Example of Supervised Learning
  • Identify the numbers in a handwritten ZIP code,
    from a digitized image (pixels). (Classification)

7
Example of Unsupervised Learning
  • From the DNA micro-array data, determine which
    genes are most similar in terms of their
    expression profiles. (Clustering)

8
Examples of Reinforcement Learning
  • How should a robot behave so as to optimize its
    performance? (Robotics)
  • How to automate the motion of a helicopter?
    (Control Theory)
  • How to make a good chess-playing program? (Artificial Intelligence)

9
History of Reinforcement Learning
  • Roots in the psychology of animal learning (Thorndike, 1911).
  • Another independent thread was the problem of
    optimal control, and its solution using dynamic
    programming (Bellman, 1957).
  • Idea of temporal difference learning (on-line
    method), e.g., playing board games (Samuel,
    1959).
  • A major breakthrough was the discovery of
    Q-learning (Watkins, 1989).

10
What is special about RL?
  • RL is learning how to map states to actions, so
    as to maximize a numerical reward over time.
  • Unlike other forms of learning, it is a
    multistage decision-making process (often
    Markovian).
  • An RL agent must learn by trial-and-error. (Not
    entirely supervised, but interactive)
  • Actions may affect not only the immediate reward
    but also subsequent rewards (Delayed effect).

11
Elements of RL
  • A policy
  • - A map from the state space to the action space.
  • - May be stochastic.
  • A reward function
  • - Maps each state (or state-action pair) to a real number, called the reward.
  • A value function
  • - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair). (These are written in symbols below.)
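In symbols (a standard way of writing these definitions; the notation on the original slide may differ):

  \pi : \mathcal{S} \to \mathcal{A}   (or a conditional distribution \pi(a \mid s), if stochastic)
  r : \mathcal{S} \to \mathbb{R}  or  r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
  V^{\pi}(s) = E_{\pi}\left[\sum_{t \ge 0} R_t \mid S_0 = s\right], \qquad
  Q^{\pi}(s, a) = E_{\pi}\left[\sum_{t \ge 0} R_t \mid S_0 = s, A_0 = a\right]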

12
The Precise Goal
  • To find a policy that maximizes the Value
    function.
  • There are different approaches to achieve this
    goal in various situations.
  • Q-learning and A-learning are two different approaches to this problem, but both are essentially temporal-difference methods.

13
The Basic Setting
  • Training data: n finite-horizon trajectories, of the form given below.
  • Deterministic policy: a sequence of decision rules, written below.
  • Each decision rule maps from the observable history (states and actions) to the action space at that time point.
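The displayed formulas are not reproduced in the transcript; a standard reconstruction for this finite-horizon setting is:

  Training data:  \{(S_{1i}, A_{1i}, R_{1i}, \ldots, S_{Ti}, A_{Ti}, R_{Ti})\}_{i=1}^{n}
  Deterministic policy:  \pi = (\pi_1, \ldots, \pi_T), \quad A_t = \pi_t(H_t),
  where H_t = (S_1, A_1, \ldots, A_{t-1}, S_t) denotes the observable history at time t.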

14
Value and Advantage
  • Time-t state value function, for a given history.
  • Time-t state-action value function, the Q-function.
  • Time-t advantage, the A-function. (All three are written out below.)
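A reconstruction of the three definitions (for a policy \pi and observed history h_t; the original slide displays these as equations):

  V_t^{\pi}(h_t) = E_{\pi}\left[\sum_{j=t}^{T} R_j \mid H_t = h_t\right]
  Q_t^{\pi}(h_t, a_t) = E_{\pi}\left[\sum_{j=t}^{T} R_j \mid H_t = h_t, A_t = a_t\right]
  A_t^{\pi}(h_t, a_t) = Q_t^{\pi}(h_t, a_t) - V_t^{\pi}(h_t)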

15
Optimal Value and Advantage
  • Optimal time-t value function, for a given history.
  • Optimal time-t Q-function.
  • Optimal time-t A-function. (All three are written out below.)
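A reconstruction of the optimal versions (with the convention V_{T+1}^{*} \equiv 0):

  V_t^{*}(h_t) = \max_{\pi} V_t^{\pi}(h_t) = \max_{a} Q_t^{*}(h_t, a)
  Q_t^{*}(h_t, a_t) = E\left[R_t + V_{t+1}^{*}(H_{t+1}) \mid H_t = h_t, A_t = a_t\right]
  A_t^{*}(h_t, a_t) = Q_t^{*}(h_t, a_t) - V_t^{*}(h_t)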

16
Return (sum of the rewards)
  • The conditional expectation of the return can be decomposed as shown below,
  • where µ denotes the advantages
  • and δ denotes the temporal-difference errors.
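A hedged reconstruction of the displayed decomposition, in the style of Murphy (2004); the original slide's notation may differ:

  \sum_{t=1}^{T} R_t = V_1^{*}(H_1) + \sum_{t=1}^{T} \mu_t(H_t, A_t) + \sum_{t=1}^{T} \delta_t
  \mu_t(H_t, A_t) = Q_t^{*}(H_t, A_t) - V_t^{*}(H_t)
  \delta_t = R_t + V_{t+1}^{*}(H_{t+1}) - Q_t^{*}(H_t, A_t), \qquad V_{T+1}^{*} \equiv 0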

17
Return (continued)
  • The conditional expectation of the return is a telescoping sum.
  • The temporal-difference errors have conditional mean zero (shown below).
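Continuing the reconstruction above: summing R_t + V_{t+1}^{*}(H_{t+1}) - V_t^{*}(H_t) over t telescopes, and the temporal-difference errors satisfy

  E\left[\delta_t \mid H_t, A_t\right] = E\left[R_t + V_{t+1}^{*}(H_{t+1}) \mid H_t, A_t\right] - Q_t^{*}(H_t, A_t) = 0,

by the definition of Q_t^{*}.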

18
Q-Learning (Watkins, 1989)
  • Estimate the Q-function using some approximator (for example, linear regression, neural networks, or decision trees).
  • Derive the estimated policy as a maximizer (arg max) of the estimated Q-function.
  • Allow different parameter vectors at different
    time points.
  • Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function (a code sketch follows below).
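A minimal sketch of that illustration (the data, features, and binary action set here are invented; the formal steps appear on the next slide): fit the Q-function backward in time by ordinary least squares, using the observed reward plus the maximized fitted future Q as the pseudo-outcome.

import numpy as np

# Toy trajectories: n subjects, T time points, binary actions.
rng = np.random.default_rng(0)
n, T = 200, 3
S = rng.normal(size=(n, T))                                # states
A = rng.integers(0, 2, size=(n, T))                        # actions in {0, 1}
R = S * (2 * A - 1) + rng.normal(scale=0.1, size=(n, T))   # rewards (made up)

def features(s, a):
    # Linear approximator: intercept, state and action main effects, interaction.
    return np.column_stack([np.ones_like(s), s, a, s * a])

theta = [None] * T          # a separate parameter vector at each time point
Y = R[:, T - 1].copy()      # terminal pseudo-outcome: the last reward
for t in range(T - 1, -1, -1):
    X = features(S[:, t], A[:, t])
    theta[t], *_ = np.linalg.lstsq(X, Y, rcond=None)   # squared-error fit of Q_t
    if t > 0:
        # Pseudo-outcome for time t-1: reward plus max over actions of fitted Q_t.
        q_next = np.column_stack(
            [features(S[:, t], np.full(n, a)) @ theta[t] for a in (0, 1)])
        Y = R[:, t - 1] + q_next.max(axis=1)

def estimated_policy(t, s):
    # The estimated policy picks the action maximizing the fitted Q-function.
    q = [features(np.atleast_1d(float(s)), np.array([a])) @ theta[t] for a in (0, 1)]
    return int(np.argmax(q))

print("recommended action at t=0 when s=1.2:", estimated_policy(0, 1.2))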

19
Q-Learning Algorithm
  • Set the terminal condition.
  • For each time point, moving backward, fit the time-t Q-function.
  • The estimated policy satisfies the arg-max relation given below.
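The displayed equations are not in the transcript; a hedged reconstruction of the backward-recursive form, with a working model Q_t(h_t, a_t; \theta_t), is:

  Set  Y_{Ti} = R_{Ti}.
  For  t = T, T-1, \ldots, 1:
      \hat\theta_t = \arg\min_{\theta_t} \sum_{i=1}^{n} \left[ Y_{ti} - Q_t(H_{ti}, A_{ti}; \theta_t) \right]^2,
      Y_{t-1,i} = R_{t-1,i} + \max_{a} Q_t(H_{ti}, a; \hat\theta_t).
  The estimated policy satisfies  \hat\pi_t(h_t) \in \arg\max_{a} Q_t(h_t, a; \hat\theta_t).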

20
What is the intuition?
  • The Bellman equation gives the recursion shown below.
  • If the model were correctly specified and the training set were infinite, then Q-learning minimizes the mean squared error shown below,
  • which is equivalent to minimizing the mean squared distance to the optimal Q-function.
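A reconstruction of the displayed formulas (assuming the working models contain the truth, which is the usual idealization behind this intuition):

  Bellman equation:  Q_t^{*}(h_t, a_t) = E\left[R_t + \max_{a} Q_{t+1}^{*}(H_{t+1}, a) \mid H_t = h_t, A_t = a_t\right].
  With correctly specified models and an infinite training set, Q-learning minimizes
      E\left[\left(R_t + \max_{a} Q_{t+1}(H_{t+1}, a; \hat\theta_{t+1}) - Q_t(H_t, A_t; \theta_t)\right)^2\right],
  which, because the temporal-difference error has conditional mean zero, is equivalent to minimizing
      E\left[\left(Q_t^{*}(H_t, A_t) - Q_t(H_t, A_t; \theta_t)\right)^2\right].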

21
A Success Story
  • TD Gammon (Tesauro, G., 1992)
  • - A Backgammon playing program.
  • - Application of temporal difference
    learning.
  • - The basic learner is a neural network.
  • - It trained itself to the world-class level by playing against itself and learning from the outcome. So smart!!
  • - More information: http://www.research.ibm.com/massive/tdl.html

22
A-Learning (Murphy, 2003, and Robins, 2004)
  • Estimate the A-function (advantages) using some
    approximator, as in Q-learning.
  • Derive the estimated policy as a maximizer (arg max) of the estimated A-function.
  • Allow different parameter vectors at different time points.
  • Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

23
A-Learning Algorithm (Inefficient Version)
  • For each time point, moving backward, fit the time-t advantage function.
  • The estimated policy satisfies the arg-max relation given below.
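The displayed steps are not in the transcript. Broadly, and as a hedged reconstruction (the exact estimating equations are given in Murphy, 2003, and Blatt, Murphy and Zhu, 2004): for t = T, T-1, \ldots, 1, fit a working model A_t(h_t, a_t; \psi_t) for the time-t advantage, using the observed future rewards adjusted by the already-fitted later advantages; the estimated policy then satisfies

  \hat\pi_t(h_t) \in \arg\max_{a} A_t(h_t, a; \hat\psi_t).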

24
Differences between Q and A-learning
  • Q-learning
  • At time t we model the main effects of the history (S_t, A_{t-1}), the action A_t, and their interaction.
  • Our Y_{t-1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t-1}).
  • A-learning
  • At time t we model only the effects of A_t and its interaction with (S_t, A_{t-1}).
  • Our Y_{t-1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t-1}).

25
Q-Learning Vs. A-Learning
  • The relative merits and demerits are not yet completely understood.
  • Q-learning has low variance but high bias.
  • A-learning has high variance but low bias.
  • Comparing Q-learning with A-learning thus involves a bias-variance trade-off.

26
References
  • Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction.
  • Murphy, S.A. (2003). Optimal Dynamic Treatment Regimes. JRSS-B.
  • Blatt, D., Murphy, S.A. and Zhu, J. (2004). A-Learning for Approximate Planning.
  • Murphy, S.A. (2004). A Generalization Error for Q-Learning.