Reinforcement Learning - PowerPoint PPT Presentation

Slides: 29
Provided by: fs87

1
Reinforcement Learning
  • Introduction
  • Presented by
  • Alp Sardag

2
Supervised vs Unsupervised Learning
  • Any situation in which both the inputs and outputs of a component
    can be perceived is called supervised learning.
  • Learning when there is no hint at all about the correct outputs is
    called unsupervised learning.
  • In reinforcement learning, the agent receives some evaluation of
    its action but is not told the correct action.

3
Sequential Decision Problems
  • In single decision problems, the utility of each
    action's outcome is well known.

[Figure: actions Aj1, Aj2, ..., Ajn with outcome utilities Uj1, Uj2, ..., Ujn. Choose the next action with Max(U).]
4
Sequential Decision Problems
  • In sequential decision problems, the agent's utility
    depends on a sequence of actions.
  • The difference is that what is returned is not a
    single action but rather a policy, arrived at by
    calculating the utilities for each state.

5
Example
The available actions (A): North, South, East and West.
$P(IE \mid A) = 0.8$, $P(\neg IE \mid A) = 0.2$, where IE is the intended action.
  • Terminal states: the environment terminates when
    the agent reaches one of the states marked +1 or
    -1.
  • Model: the set of probabilities associated with the
    possible transitions between states after any
    given action. The notation $M^a_{ij}$ means the
    probability of reaching state j if action a is
    done in state i. (Accessible environment; MDP: the
    next state depends only on the current state and action.)
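
As a rough illustration of what a model $M^a_{ij}$ can look like in code, the sketch below builds one for a 4x3 grid. Several details are assumptions rather than something the slides state: the 0.2 "unintended" probability is split evenly between the two directions at right angles to the intended one, bumping into a wall leaves the agent in place, square (2,2) is blocked, and the terminal states sit at (4,3) and (4,2).

```python
from collections import defaultdict

# Assumed layout (not given explicitly on this slide): 4 columns, 3 rows,
# one blocked square, two terminal squares.
COLS, ROWS = 4, 3
BLOCKED = {(2, 2)}
TERMINALS = {(4, 3), (4, 2)}
MOVES = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}
# Assumption: unintended outcomes are the two perpendicular moves, 0.1 each.
SIDEWAYS = {"North": ("East", "West"), "South": ("East", "West"),
            "East": ("North", "South"), "West": ("North", "South")}

STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) not in BLOCKED]


def step(state, direction):
    """Deterministic move; stay in place when the target is off-grid or blocked."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def build_model():
    """Return M with M[a][i][j] = probability of reaching j from i under action a."""
    model = {a: {i: defaultdict(float) for i in STATES} for a in MOVES}
    for a in MOVES:
        for i in STATES:
            if i in TERMINALS:
                continue  # no transitions out of terminal states
            model[a][i][step(i, a)] += 0.8
            for side in SIDEWAYS[a]:
                model[a][i][step(i, side)] += 0.1
    return model


M = build_model()
print(dict(M["North"][(1, 1)]))
```

Storing each row as a sparse dictionary keeps the sums $\sum_j M^a_{ij} U(j)$ cheap, since only a few successors have non-zero probability.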

6
Model
Obtained by simulation
7
Example
  • There is no utility for the states other than the
    terminal states (T).
  • We have to base the utility function on a
    sequence of states rather than on a single state.
    E.g. $U_{ex}(s_1,\dots,s_n) = -\frac{1}{25} n + U(T)$
  • To select the next action: consider sequences as
    one long action and apply the basic maximum
    expected utility principle to sequences.
  • $\max_a EU(a \mid i) = \max_a \sum_j M^a_{ij} U_j$
  • Result: the first action of the optimal sequence.

8
Drawback
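  • Consider the action sequence starting from state
    (3,2): North, East. Because actions are stochastic,
    the agent may not be where the sequence assumes
    after the first move.
  • Then it would be better to calculate the utility
    function for each state.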

9
VALUE ITERATION
  • The basic idea is to calculate the utility of
    each state, U(state), and then use the state
    utilities to select an optimal action in each
    state.
  • Policy: a complete mapping from states to
    actions.
  • H(state, policy): the history tree starting from the
    state and taking actions according to the policy.
  • $U(i) \equiv EU(H(i,\text{policy}) \mid M) \equiv \sum P(H(i,\text{policy}) \mid M)\, U_h(H(i,\text{policy}))$

10
The Property of Utility Function
  • For a utility function on states (U) to make
    sense, we require that the utility function on
    histories ($U_h$) have the property of separability.
  • $U_h(s_0,s_1,\dots,s_n) = f(s_0, U_h(s_1,\dots,s_n))$
  • The simplest form of separable utility function is
    additive.
  • $U_h(s_0,s_1,\dots,s_n) = R(s_0) + U_h(s_1,\dots,s_n)$
  • where R is called the reward function.
  • Notice: additivity was implicit in our use of
    path cost functions in heuristic search
    algorithms. Under an additive utility function, the
    utility of a state is the expected sum of rewards from
    that state until the terminal state is reached.

11
Refreshing
  • We have to base the utility function on a
    sequence of states rather than on a single state.
    E.g. $U_{ex}(s_1,\dots,s_n) = -\frac{1}{25} n + U(T)$
  • In that case $R(s_i) = -\frac{1}{25}$ for non-terminal
    states, +1 for state (4,3) and -1 for state
    (4,2).
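
The additive form can be checked on a single history. The sketch below is only an illustration: the function names and the sample path are hypothetical, and the reward values are the ones quoted above for the 4x3 example.

```python
def reward(state):
    """R(s) for the 4x3 example: +1 at (4,3), -1 at (4,2), -1/25 elsewhere."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -1.0 / 25.0


def history_utility(history):
    """Additive U_h: R(s0) + U_h(s1, ..., sn), written recursively."""
    if not history:
        return 0.0
    return reward(history[0]) + history_utility(history[1:])


# Hypothetical history: five non-terminal squares, then the +1 terminal state.
path = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
print(history_utility(path))  # about 0.8: five steps at -1/25, then +1
```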

12
Utility of States
  • Given a separable utility function $U_h$, the
    utility of a state can be expressed in terms of
    the utility of its successors.
  • $U(i) = R(i) + \max_a \sum_j M^a_{ij} U(j)$
  • The above equation is the basis for dynamic
    programming.

13
Dynamic Programming
  • There are two approaches.
  • The first approach starts by calculating the
    utilities of all states at step n-1 in terms of the
    utilities of the terminal states, then at step n-2,
    and so on.
  • The second approach approximates the utilities of
    states to any degree of accuracy using an
    iterative procedure. This is used because in most
    decision problems the environment histories are
    potentially of unbounded length.

14
Algorithm
  • Function DP(M, R) Returns Utility Function
  • Begin
  • // Initialization
  • U ← R; U' ← R
  • Repeat
  • U ← U'
  • For Each State i do
  • U'(i) ← R(i) + max_a Σ_j M^a_ij U(j)
  • end
  • Until |U - U'| < ε
  • End
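
A minimal Python sketch of this iterative procedure follows. The three-state MDP used to exercise it is hypothetical (it is not the 4x3 world from the slides), and the stopping test uses the largest per-state change as the distance between successive utility tables, which is one reasonable reading of the termination condition. There is no discount factor, matching the update above, so convergence relies on the terminal state being reachable.

```python
def value_iteration(states, actions, M, R, epsilon=1e-6):
    """Iterate U'(i) = R(i) + max_a sum_j M[a][i][j] * U(j) until stable.

    M[a][i] is a dict mapping successor j to its probability; an empty
    dict marks a terminal state, whose utility is then just R(i).
    """
    U_new = dict(R)  # initialization: U' <- R
    while True:
        U = dict(U_new)  # U <- U'
        for i in states:
            best = max(sum(p * U[j] for j, p in M[a][i].items())
                       for a in actions)
            U_new[i] = R[i] + best
        if max(abs(U_new[i] - U[i]) for i in states) < epsilon:
            return U_new


# Hypothetical toy MDP: "go" advances with probability 0.8, "stay" never moves.
states = ["s0", "s1", "goal"]
actions = ["go", "stay"]
M = {
    "go": {"s0": {"s1": 0.8, "s0": 0.2}, "s1": {"goal": 0.8, "s1": 0.2}, "goal": {}},
    "stay": {"s0": {"s0": 1.0}, "s1": {"s1": 1.0}, "goal": {}},
}
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}

U = value_iteration(states, actions, M, R)
print(U)  # utilities approach 0.90, 0.95 and 1.0 for s0, s1 and goal
```

Keeping two tables (U for reading, U_new for writing) mirrors the U / U' pair in the pseudocode; updating a single table in place would also work but changes the order in which information propagates.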

15
Policy
  • Policy Function
  • $\text{policy}(i) = \arg\max_a \sum_j M^a_{ij} U(j)$
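
Reading a policy off a utility table can be sketched as below. The toy model and the utility values are hypothetical, chosen only so the result is easy to verify by hand; terminal states (no outgoing transitions) are skipped because no action is needed there.

```python
def greedy_policy(states, actions, M, U):
    """For each non-terminal state, pick the action maximizing sum_j M[a][i][j] * U(j)."""
    policy = {}
    for i in states:
        if not any(M[a][i] for a in actions):
            continue  # terminal state: nothing to choose
        policy[i] = max(
            actions,
            key=lambda a: sum(p * U[j] for j, p in M[a][i].items()),
        )
    return policy


# Hypothetical toy model and utilities, for illustration only.
states = ["s0", "s1", "goal"]
actions = ["go", "stay"]
M = {
    "go": {"s0": {"s1": 0.8, "s0": 0.2}, "s1": {"goal": 0.8, "s1": 0.2}, "goal": {}},
    "stay": {"s0": {"s0": 1.0}, "s1": {"s1": 1.0}, "goal": {}},
}
U = {"s0": 0.90, "s1": 0.95, "goal": 1.0}

print(greedy_policy(states, actions, M, U))  # {'s0': 'go', 's1': 'go'}
```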

16
Reinforcement Learning
  • The task is to use rewards and punishments to
    learn a successful agent function (policy).
  • Difficult: the agent is never told what the right
    actions are, nor which reward is due to which action.
    The agent starts with no model and no utility
    function.
  • In many complex domains, RL is the only feasible
    way to train a program to perform at high levels.

17
Example: An agent learning to play chess
  • Supervised learning: it is very hard for a teacher
    to provide accurate evaluations for the large number
    of positions needed to train directly from examples.
  • In RL the program is told when it has won or lost,
    and can use this information to learn an
    evaluation function.

18
Two Basic Designs
  • The agent learns a utility function on states (or
    histories) and uses it to select actions that
    maximize the expected utility of their outcomes.
  • The agent learns an action-value function giving
    the expected utility of taking a given action in
    a given state. This is called Q-learning. The
    agent does not need to know the outcome of its
    actions.

19
Active and Passive Learners
  • A passive learner simply watches the world going
    by, and tries to learn the utility of being in
    various states.
  • An active learner must also act using the learned
    information, and can use its problem generator to
    suggest explorations of unknown portions of the
    environment.

20
Comparison of Basic Designs
  • The policy for an agent that learns a utility
    function on states is
  • $\text{policy}(i) = \arg\max_a \sum_j M^a_{ij} U(j)$
  • The policy for an agent that learns an
    action-value function is
  • $\text{policy}(i) = \arg\max_a Q(a,i)$

21
Passive Learning
[Figure panels:]
(a) Simple stochastic environment
(b) $M_{ij}$ is provided in PL, $M^a_{ij}$ is provided in AL
(c) The exact utility values
22
Calculation of Utility on States for PL
  • Adaptive Dynamic Programming (ADP)
  • $U(i) \leftarrow R(i) + \sum_j M_{ij} U(j)$
  • Because the agent is passive, there is no maximization
    over actions.
  • Temporal Difference Learning
  • $U(i) \leftarrow U(i) + \alpha\,(R(i) + U(j) - U(i))$
  • where $\alpha$ is the learning rate. This update pushes
    U(i) to agree with its successor.
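
The temporal-difference rule amounts to a one-line update per observed transition. The sketch below is a minimal illustration: the function name, the default utility of 0 for unseen states, and the state labels are assumptions, not part of the slides.

```python
def td_update(U, i, j, reward_i, alpha=0.1):
    """Apply U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)) for one transition i -> j.

    States not yet seen are treated as having utility 0 (an assumption).
    """
    u_i = U.get(i, 0.0)
    u_j = U.get(j, 0.0)
    U[i] = u_i + alpha * (reward_i + u_j - u_i)
    return U[i]


# Hypothetical example: state "b" already has utility 1.0; we observe a -> b.
U = {"b": 1.0}
td_update(U, "a", "b", reward_i=-0.04, alpha=0.5)
print(U["a"])  # 0.48, halfway from 0 towards R(a) + U(b) = 0.96
```

No transition model appears anywhere in the update, which is why TD needs only the observed successor of each state.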

23
Comparison of ADP and TD
  • ADP converges faster than TD because ADP uses the
    current environment model.
  • ADP uses the full model; TD uses no model, just
    information about the connectedness of states from
    the current training sequence.
  • TD adjusts a state to agree with its observed
    successor, whereas ADP adjusts the state to agree
    with all of its successors. This difference
    disappears when the effects of TD adjustments are
    averaged over a large number of transitions.
  • Full ADP may be intractable when the number of
    states is large. The prioritized-sweeping heuristic
    prefers to make adjustments to states whose
    likely successors have just undergone a large
    adjustment in their own utility.
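
For ADP to use a model in the passive setting, the transition probabilities $M_{ij}$ have to come from somewhere. One common choice, shown here as an assumption rather than something the slides specify, is to estimate them as observed transition frequencies over the training sequences. The sequences and state names below are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_model(sequences):
    """Estimate M[i][j] as (times i was followed by j) / (times i was left)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, j in zip(seq, seq[1:]):
            counts[i][j] += 1
    return {
        i: {j: n / sum(successors.values()) for j, n in successors.items()}
        for i, successors in counts.items()
    }


# Hypothetical training sequences ending in a terminal state "t".
training = [["a", "b", "t"], ["a", "a", "b", "t"]]
M = estimate_model(training)
print(M["a"])  # "a" was followed by "b" twice and by "a" once
```

The estimated table can then be plugged into the ADP update $U(i) \leftarrow R(i) + \sum_j M_{ij} U(j)$, whereas TD skips this step entirely.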

24
Calculation of Utility on States for AL
  • Adaptive Dynamic Programming (ADP)
  • $U(i) \leftarrow R(i) + \max_a \sum_j M^a_{ij} U(j)$
  • Temporal Difference Learning
  • $U(i) \leftarrow U(i) + \alpha\,(R(i) + U(j) - U(i))$

25
Problem of Exploration in AL
  • An active learner acts using the learned
    information, and can use its problem generator to
    suggest explorations of unknown portions of the
    environment.
  • There is a trade-off between immediate good and
    long-term well-being.
  • One idea: change the constraint equation so
    that it assigns a higher utility estimate to
    relatively unexplored action-state pairs.

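$U(i) \leftarrow R(i) + \max_a F\left(\sum_j M^a_{ij} U(j),\; N(a,i)\right)$, where N(a,i) counts how often action a has been tried in state i and F(u,n) is the exploration function.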
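
The definition of F(u,n) does not survive in the transcript, so the sketch below assumes one simple, commonly used form: return an optimistic value until an action-state pair has been tried a minimum number of times, and the ordinary estimate u afterwards. The constants `r_plus` and `n_e`, the function names, and the toy model are all hypothetical.

```python
def exploration_value(u, n, r_plus=1.0, n_e=5):
    """Assumed form of F(u, n): optimistic r_plus until tried n_e times, then u."""
    return r_plus if n < n_e else u


def choose_action(i, actions, M, U, N):
    """Pick the action maximizing F(sum_j M[a][i][j] * U(j), N[(a, i)])."""
    return max(
        actions,
        key=lambda a: exploration_value(
            sum(p * U[j] for j, p in M[a][i].items()), N[(a, i)]
        ),
    )


# Hypothetical toy setup: "go" looks better, but "stay" has never been tried.
actions = ["go", "stay"]
M = {"go": {"s0": {"s1": 1.0}}, "stay": {"s0": {"s0": 1.0}}}
U = {"s0": 0.5, "s1": 0.9}
N = {("go", "s0"): 10, ("stay", "s0"): 0}

print(choose_action("s0", actions, M, U, N))  # 'stay': untried, so valued at r_plus
N[("stay", "s0")] = 10
print(choose_action("s0", actions, M, U, N))  # 'go': both explored, 0.9 beats 0.5
```
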
26
Learning an Action-Value Function
  • The function assigns an expected utility to
    taking a given action in a given state: Q(a,i) is the
    expected utility of taking action a in state i.
  • Like condition-action rules, Q-values suffice for
    decision making.
  • Unlike condition-action rules, they can be
    learned directly from reward feedback.

27
Calculation of Action-Value Function
  • Dynamic Programming
  • $Q(a,i) \leftarrow R(i) + \sum_j M^a_{ij} \max_{a'} Q(a',j)$
  • Temporal Difference Learning
  • $Q(a,i) \leftarrow Q(a,i) + \alpha\,(R(i) + \max_{a'} Q(a',j) - Q(a,i))$
  • where $\alpha$ is the learning rate.
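
The temporal-difference form of the Q update, together with the policy it induces, can be sketched as follows. Function names, the zero default for unseen pairs, and the example values are assumptions for illustration only.

```python
from collections import defaultdict


def q_update(Q, a, i, j, reward_i, actions, alpha=0.1):
    """Apply Q(a,i) <- Q(a,i) + alpha * (R(i) + max_a' Q(a',j) - Q(a,i))."""
    best_next = max(Q[(b, j)] for b in actions)
    Q[(a, i)] += alpha * (reward_i + best_next - Q[(a, i)])
    return Q[(a, i)]


def q_policy(Q, i, actions):
    """policy(i) = argmax_a Q(a, i); no transition model is consulted."""
    return max(actions, key=lambda a: Q[(a, i)])


# Hypothetical example: unseen pairs default to 0; Q("go", "s1") is known.
actions = ["go", "stay"]
Q = defaultdict(float)
Q[("go", "s1")] = 1.0

q_update(Q, "go", "s0", "s1", reward_i=-0.04, actions=actions, alpha=0.5)
print(Q[("go", "s0")])            # 0.48
print(q_policy(Q, "s0", actions))  # 'go'
```

Unlike the utility-based policy, `q_policy` never touches $M^a_{ij}$, which is what lets a Q-learning agent act without learning a model.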

28
Question: The Answer that Refused to be Found
  • Is it better to learn a utility function or to
    learn an action-value function?