Machine Learning: Symbol-based

Provided by: MBE
Learn more at: https://pages.mtu.edu

1
Machine Learning: Symbol-based
  • 9.0 Introduction
  • 9.1 A Framework for Symbol-based Learning
  • 9.2 Version Space Search
  • 9.3 The ID3 Decision Tree Induction Algorithm
  • 9.4 Inductive Bias and Learnability
  • 9.5 Knowledge and Learning
  • 9.6 Unsupervised Learning
  • 9.7 Reinforcement Learning
  • 9.8 Epilogue and References
  • 9.9 Exercises
Additional references for the slides: Thomas
Dean, James Allen, and Yiannis Aloimonos,
Artificial Intelligence: Theory and Practice,
Addison Wesley, 1995, Section 5.9.
2
Reinforcement Learning
  • A form of learning where the agent can explore
    and learn through interaction with the
    environment
  • The agent learns a policy which is a mapping
    from states to actions. The policy tells what the
    best move is in a particular state.
  • It is a general methodology: planning, decision
    making, and search can all be viewed as forms of
    reinforcement learning.

3
Tic-tac-toe: a different approach
  • Recall the minimax approach: the agent knows
    its current state, generates a two-layer search
    tree taking into account all the possible moves
    for itself and the opponent, backs up values from
    the leaf nodes, and takes the best move assuming
    that the opponent will also do so.
  • An alternative is to start playing directly against
    an opponent (who does not have to be perfect, but
    could be). Assume no prior knowledge or
    lookahead. Assign values to states: 1 for a
    win, 0 for a loss or draw, 0.5 for anything else.

4
Notice that 0.5 is arbitrary; it cannot
differentiate between good moves and bad moves,
so the learner has no guidance initially. It
simply starts playing. When a game ends in
a win, the value 1 is propagated backwards.
When it ends in a draw or a loss, the value 0 is
propagated backwards. Eventually, earlier states
will be labeled to reflect their true value.
After several games, the learner will have learned
the best move in a given state (a policy).
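The slides do not say how the final value is propagated backwards. The sketch below is one possible reading, not the slides' method: the final state receives the outcome, and each earlier state is moved part of the way toward the value of its successor. The function name back_up, the step size alpha, and the use of arbitrary hashable objects as states are all assumptions made for illustration.

```python
def back_up(values, visited, outcome, alpha=0.5, default=0.5):
    """Propagate a game's outcome backwards through the visited states.

    values  : dict mapping a state (any hashable) to its current estimate
    visited : states in the order they occurred; the last one is the final state
    outcome : 1 for a win, 0 for a loss or draw
    alpha   : step size (an assumption; the slides give no update rule)
    default : value of a state never seen before (0.5, as in the slides)
    """
    *earlier, final = visited
    values[final] = float(outcome)
    target = values[final]
    for state in reversed(earlier):
        old = values.get(state, default)
        values[state] = old + alpha * (target - old)  # move toward the successor's value
        target = values[state]
    return values


# Hypothetical game with three states, ending in a win.
table = back_up({}, ["s0", "s1", "s2"], outcome=1)
print(table)  # {'s2': 1.0, 's1': 0.75, 's0': 0.625}
```

Repeating the call for further games moves the earlier states closer to the outcome, which is the "eventually" in the text above.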
5
Issues in generalizing this approach
  • How will the state values be initialized or
    propagated backwards?
  • What if there is no end to the game (infinite
    horizon)?
  • This is an optimization problem which suggests
    that it is hard. How can an optimal policy be
    learned?

6
A simple robot domain
The robot is in one of the states 0, 1, 2, 3.
Each one represents an office; the offices are
connected in a ring. Three actions are
available:
  • + moves to the next state
  • - moves to the previous state
  • @ remains at the same state
[Figure: state-transition diagram of offices 0, 1, 2, 3 arranged in a ring, with +, - and @ arcs leaving each state]

7
The robot domain (contd)
  • The robot can observe the label of the state it
    is in and perform any action corresponding to an
    arc leading out of its current state.
  • We assume that there is a clock governing the
    passage of time, and that at each tick of the
    clock the robot has to perform an action.
  • The environment is deterministic: there is a
    unique state resulting from any initial state and
    action.
  • Each state has a reward: 10 for state 3, 0 for
    the others.
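A minimal sketch of this domain in Python, assuming the action symbols + (next office), - (previous office) and @ (stay), and assuming the ring wraps around so that the office after 3 is 0. The names STATES, ACTIONS, f and R are illustrative choices.

```python
STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "@"]


def f(state, action):
    """Deterministic state-transition function on the ring of offices."""
    if action == "+":
        return (state + 1) % 4  # next office; office 3 wraps around to 0
    if action == "-":
        return (state - 1) % 4  # previous office; office 0 wraps around to 3
    if action == "@":
        return state            # remain in the same office
    raise ValueError(f"unknown action: {action!r}")


def R(state):
    """Reward attached to a state: 10 for office 3, 0 for the others."""
    return 10 if state == 3 else 0


print(f(3, "+"), f(0, "-"), f(2, "@"))  # 0 3 2
```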

8
The reinforcement learning problem
  • Given: information about the environment
  • States
  • Actions
  • State-transition function (or diagram)
  • Output: a policy π: states → actions, i.e., find
    the best action to execute at each state
  • Assumes that the state is completely observable
    (the agent always knows which state it is in)

9
Compare three policies
  • a. Every state is mapped to @.
  • The value of this policy is 0, because the
    robot will never get to office 3.
  • b. Every state is mapped to +
    (policy 0).
  • The value of this policy is ∞, because the
    robot will end up in office 3 infinitely often.
  • c. Every state except 3 is mapped to +, and 3 is
    mapped to @
    (policy 1).
  • The value of this policy is also ∞, because
    the robot will end up (stay) in office 3
    infinitely often.

10
Compare three policies
So it is easy to rule out case a, but how can we
show that policy 1 is better than policy 0? One
way would be to compute the average reward per
tick:
  • POLICY 1: the average reward per tick for state 0 is 10.
  • POLICY 0: the average reward per tick for state 0 is 10/4.
Another way would be to assign higher values to
immediate rewards and apply a discount to future
rewards.
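The two averages can be checked with a short simulation. This is a sketch under stated assumptions: the actions are written + (next office), - (previous office) and @ (stay), the ring wraps from office 3 to office 0, and a finite run of 4000 ticks stands in for the long-run limit. The helper names are illustrative.

```python
def step(state, action):
    """Ring of four offices: '+' next, '-' previous, '@' stay."""
    return {"+": (state + 1) % 4, "-": (state - 1) % 4, "@": state}[action]


def reward(state):
    return 10 if state == 3 else 0


def average_reward(policy, start, ticks):
    """Average reward per tick over a finite run (approximates the limit)."""
    state, total = start, 0
    for _ in range(ticks):
        total += reward(state)
        state = step(state, policy[state])
    return total / ticks


policy_0 = {0: "+", 1: "+", 2: "+", 3: "+"}  # always move to the next office
policy_1 = {0: "+", 1: "+", 2: "+", 3: "@"}  # move to office 3, then stay

print(average_reward(policy_0, start=0, ticks=4000))  # 2.5
print(average_reward(policy_1, start=0, ticks=4000))  # 9.9925
```

The second figure approaches 10 as the run gets longer, because the three ticks spent walking to office 3 matter less and less.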
11
Discounted cumulative reward
  • Assume that the robot associates a higher value
    with more immediate rewards and therefore
    discounts future rewards.
  • The discount rate (γ) is a number between 0 and 1
    used to discount future rewards.
  • The discounted cumulative reward for a particular
    state with respect to a given policy is the sum
    for n from 0 to infinity of γ^n times the reward
    associated with the state reached after the n-th
    tick of the clock.
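Written as a formula, the definition above reads as follows. The symbols s_n (the state reached after the n-th tick, with s_0 the starting state), f (the state-transition function), R (the reward of a state) and V^π (the value under policy π) are notation assumed here for compactness.

```latex
V^{\pi}(s_0) \;=\; \sum_{n=0}^{\infty} \gamma^{n} \, R(s_n),
\qquad s_{n+1} = f\bigl(s_n, \pi(s_n)\bigr)
```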

POLICY 1: with a discount rate of 0.5, the discounted
cumulative reward for state 0 is 2.5.
POLICY 0: with a discount rate of 0.5, the discounted
cumulative reward for state 0 is 1.33.
12
Discounted cumulative reward (contd)
  • Take γ = 0.5.
  • For state 0 with respect to policy 0:
    0.5^0 × 0 + 0.5^1 × 0 + 0.5^2 × 0 + 0.5^3 × 10
    + 0.5^4 × 0 + 0.5^5 × 0 + 0.5^6 × 0 + 0.5^7 × 10 + ...
    = 1.25 + 0.078 + ... = 1.33 in the limit
  • For state 0 with respect to policy 1:
    0.5^0 × 0 + 0.5^1 × 0 + 0.5^2 × 0 + 0.5^3 × 10
    + 0.5^4 × 10 + 0.5^5 × 10 + 0.5^6 × 10 + 0.5^7 × 10 + ...
    = 2.5 in the limit
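Both limits can be checked numerically by summing enough terms. The sketch assumes the actions are written + (next office), - (previous office) and @ (stay), that the ring wraps from office 3 to office 0, and that 200 terms are enough for the truncated tail to be negligible. The function name is illustrative.

```python
def discounted_return(policy, start, gamma=0.5, ticks=200):
    """Truncated sum of gamma**n * R(s_n), starting from `start`."""
    state, total = start, 0.0
    for n in range(ticks):
        total += (gamma ** n) * (10 if state == 3 else 0)  # reward 10 only in office 3
        move = policy[state]
        state = {"+": (state + 1) % 4, "-": (state - 1) % 4, "@": state}[move]
    return total


policy_0 = {0: "+", 1: "+", 2: "+", 3: "+"}
policy_1 = {0: "+", 1: "+", 2: "+", 3: "@"}

print(round(discounted_return(policy_0, start=0), 2))  # 1.33
print(round(discounted_return(policy_1, start=0), 2))  # 2.5
```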

13
Discounted cumulative reward (contd)
  • Let j be a state, R(j) be the reward for ending
    up in state j, π be a fixed policy, π(j) be the
    action dictated by π in state j, f(j, a) be the
    next state given that the robot starts in state j and
    performs action a, and V^π_i(j) be the estimated value
    of state j with respect to the policy π after the
    i-th iteration of the algorithm.
  • Using a dynamic programming algorithm, one can
    obtain a good estimate of V^π, the value function
    for policy π, as i → ∞.

14
A dynamic programming algorithm to compute values
for states for a policy π
  • 1. For each j, set V^π_0(j) to 0.
  • 2. Set i to 0.
  • 3. For each j, set V^π_{i+1}(j) to
    R(j) + γ V^π_i(f(j, π(j))).
  • 4. Set i to i + 1.
  • 5. If i is equal to the maximum number of
    iterations, then return V^π_i; otherwise, return
    to step 3.
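A sketch of these five steps for the four-office robot domain. It assumes the actions are written + (next office), - (previous office) and @ (stay), that the ring wraps from office 3 to office 0, and that every state is updated from the previous sweep's values (a synchronous update). The function and variable names are illustrative.

```python
def f(j, a):
    """Next state on the ring of four offices."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


def R(j):
    """Reward for ending up in state j: 10 for office 3, 0 otherwise."""
    return 10.0 if j == 3 else 0.0


def evaluate_policy(pi, gamma=0.5, max_iterations=5):
    V = {j: 0.0 for j in range(4)}       # step 1: all values start at 0
    for _ in range(max_iterations):      # steps 2, 4, 5: count the sweeps
        # step 3: each state is updated from the previous sweep's values
        V = {j: R(j) + gamma * V[f(j, pi[j])] for j in range(4)}
    return V


policy_0 = {0: "+", 1: "+", 2: "+", 3: "+"}
policy_1 = {0: "+", 1: "+", 2: "+", 3: "@"}
print(evaluate_policy(policy_0))  # {0: 1.25, 1: 2.5, 2: 5.0, 3: 10.625}
print(evaluate_policy(policy_1))  # {0: 1.875, 1: 4.375, 2: 9.375, 3: 19.375}
```

With more sweeps the value of state 0 approaches 1.33 under policy 0 and 2.5 under policy 1, the discounted cumulative rewards quoted earlier.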

15
Values of states for policy 0
  • initialize
  • V(0) = 0
  • V(1) = 0
  • V(2) = 0
  • V(3) = 0
  • iteration 0
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 0 = 0
  • For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
  • (iteration 0 essentially initializes the values of
    states to their immediate rewards)

16
Values of states for policy 0 (contd)
  • iteration 0: V(0) = V(1) = V(2) = 0, V(3) = 10
  • iteration 1
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
  • iteration 2
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10

17
Values of states for policy 0 (contd)
  • iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 5,
    V(3) = 10
  • iteration 3
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
  • iteration 4
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 × 1.25 = 10.625

18
Values of states for policy 1
  • initialize
  • V(0) = 0
  • V(1) = 0
  • V(2) = 0
  • V(3) = 0
  • iteration 0
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 0 = 0
  • For office 3: R(3) + γ V(3) = 10 + 0.5 × 0 = 10

19
Values of states for policy 1 (contd)
  • iteration 0: V(0) = V(1) = V(2) = 0, V(3) = 10
  • iteration 1
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
  • For office 3: R(3) + γ V(3) = 10 + 0.5 × 10 = 15
  • iteration 2
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 15 = 7.5
  • For office 3: R(3) + γ V(3) = 10 + 0.5 × 15 = 17.5

20
Values of states for policy 1 (contd)
  • iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 7.5,
    V(3) = 17.5
  • iteration 3
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 7.5 = 3.75
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 17.5 = 8.75
  • For office 3: R(3) + γ V(3) = 10 + 0.5 × 17.5 = 18.75
  • iteration 4
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 3.75 = 1.875
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 8.75 = 4.375
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 18.75 = 9.375
  • For office 3: R(3) + γ V(3) = 10 + 0.5 × 18.75 = 19.375

21
Compare policies
  • Policy 0 after five iterations
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 × 1.25 = 10.625
  • Policy 1 after five iterations
  • For office 0: R(0) + γ V(1) = 0 + 0.5 × 3.75 = 1.875
  • For office 1: R(1) + γ V(2) = 0 + 0.5 × 8.75 = 4.375
  • For office 2: R(2) + γ V(3) = 0 + 0.5 × 18.75 = 9.375
  • For office 3: R(3) + γ V(3) = 10 + 0.5 × 18.75 = 19.375
  • Policy 1 is better because each state has a higher
    value than under policy 0.

22
Temporal credit assignment problem
  • It is the problem of assigning credit or blame
    to the actions in a sequence of actions where
    feedback is available only at the end of the
    sequence.
  • When you lose a game of chess or checkers, the
    blame for your loss cannot necessarily be
    attributed to the last move you made, or even the
    next-to-the-last move.
  • Dynamic programming solves the temporal credit
    assignment problem by propagating rewards
    backwards to earlier states and hence to actions
    earlier in the sequence of actions determined by
    a policy.
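The backward propagation can be made visible by recording the first sweep at which each state's value becomes nonzero. This is a sketch for the four-office robot domain, assuming the actions are written + (next office), - (previous office) and @ (stay), a discount rate of 0.5, and synchronous sweeps numbered from 0. The function names are illustrative.

```python
def next_state(j, a):
    """Ring of four offices: '+' next, '-' previous, '@' stay."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


def reward(j):
    return 10.0 if j == 3 else 0.0


def first_nonzero_sweep(policy, gamma=0.5, sweeps=10):
    """Map each state to the first sweep at which its value is above 0."""
    V = {j: 0.0 for j in range(4)}
    first = {}
    for i in range(sweeps):
        V = {j: reward(j) + gamma * V[next_state(j, policy[j])] for j in range(4)}
        for j, v in V.items():
            if v > 0 and j not in first:
                first[j] = i  # credit for the reward reaches state j at sweep i
    return first


always_next = {0: "+", 1: "+", 2: "+", 3: "+"}
print(first_nonzero_sweep(always_next))  # {3: 0, 2: 1, 1: 2, 0: 3}
```

Each additional sweep carries the reward at office 3 one step further back along the sequence of actions, so a state that is k moves away from the reward receives credit k sweeps later.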

23
Computing an optimal policy
  • Given a method for estimating the value of states
    with respect to a fixed policy, it is possible to
    find an optimal policy. We would like to maximize
    the discounted cumulative reward.
  • Policy iteration (Howard, 1960) is an algorithm
    that uses the algorithm for computing the value
    of a state as a subroutine.

24
Policy iteration algorithm
  • 1. Let π_0 be an arbitrary policy.
  • 2. Set i to 0.
  • 3. Compute V^{π_i}(j) for each j.
  • 4. Compute a new policy π_{i+1} so that π_{i+1}(j) is
    the action a maximizing R(j) + γ V^{π_i}(f(j, a)).
  • 5. If π_{i+1} = π_i, then return π_i; otherwise, set
    i to i + 1, and go to step 3.
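A sketch of policy iteration on the four-office robot domain. Several choices here are assumptions rather than part of the slides: the actions are written + (next office), - (previous office) and @ (stay), the ring wraps from office 3 to office 0, the discount rate is 0.5, values are estimated with 50 synchronous sweeps, ties between actions go to the first action listed, and a cap on the number of rounds guards against cycling.

```python
STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "@"]
GAMMA = 0.5


def f(j, a):
    """Next state on the ring of four offices."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


def R(j):
    return 10.0 if j == 3 else 0.0


def evaluate(policy, sweeps=50):
    """Estimate the value of every state under a fixed policy."""
    V = {j: 0.0 for j in STATES}
    for _ in range(sweeps):
        V = {j: R(j) + GAMMA * V[f(j, policy[j])] for j in STATES}
    return V


def policy_iteration(policy, max_rounds=100):
    for _ in range(max_rounds):                       # safety cap (assumption)
        V = evaluate(policy)                          # step 3
        new_policy = {
            j: max(ACTIONS, key=lambda a: R(j) + GAMMA * V[f(j, a)])
            for j in STATES
        }                                             # step 4
        if new_policy == policy:                      # step 5
            return policy
        policy = new_policy
    return policy


start = {j: "@" for j in STATES}                      # an arbitrary initial policy
print(policy_iteration(start))  # {0: '-', 1: '+', 2: '+', 3: '@'}
```

Under these assumptions the returned policy sends office 0 backwards, since office 3 is one step away in that direction on the ring, and keeps the robot in office 3 once it arrives. Office 1 is a tie between + and -, so its action depends on the tie-breaking order.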

25
Policy iteration algorithm (contd)
  • A policy π is said to be the optimal policy if
    there is no other policy π' and state j such that
    V^{π'}(j) > V^π(j) and, for all k ≠ j,
    V^{π'}(k) > V^π(k).
  • The policy iteration algorithm is guaranteed to
    terminate in a finite number of steps with an
    optimal policy.

26
Comments on reinforcement learning
  • A general model where an agent can learn to
    function in dynamic environments
  • The agent can learn while interacting with the
    environment
  • No prior knowledge except the (probabilistic)
    transitions is assumed
  • Can be generalized to stochastic domains (an
    action might have several different probabilistic
    consequences, i.e., the state-transition function
    is not deterministic)
  • Can also be generalized to domains where the
    reward function is not known

27
Famous example: TD-Gammon (Tesauro, 1995)
  • Learns to play Backgammon
  • Immediate reward: 100 if win, -100 if lose, 0
    for all other states
  • Trained by playing 1.5 million games against
    itself (several weeks)
  • Now approximately equal to the best human player
    (won World Cup of Backgammon in 1992; among top 3
    since 1995)
  • Predecessor: NeuroGammon (Tesauro and Sejnowski,
    1989) learned from examples of labelled moves
    (very tedious for human expert)

28
Other examples
  • Robot learning to dock on battery charger
  • Pole balancing
  • Elevator dispatching (Crites and Barto, 1995):
    better than industry standard
  • Inventory management (Van Roy et al.): 10-15%
    improvement over industry standards
  • Job-shop scheduling for NASA space missions
    (Zhang and Dietterich, 1997)
  • Dynamic channel assignment in cellular phones
    (Singh and Bertsekas, 1994)
  • Robotic soccer

29
Common characteristics
  • delayed reward
  • opportunity for active exploration
  • possibility that state only partially observable
  • possible need to learn multiple tasks with same
    sensors/effectors
  • there may not be an adequate teacher