Title: Machine Learning: Symbol-based
1 Machine Learning: Symbol-based
9d
9.0 Introduction
9.1 A Framework for Symbol-based Learning
9.2 Version Space Search
9.3 The ID3 Decision Tree Induction Algorithm
9.4 Inductive Bias and Learnability
9.5 Knowledge and Learning
9.6 Unsupervised Learning
9.7 Reinforcement Learning
9.8 Epilogue and References
9.9 Exercises
Additional references for the slides: Thomas
Dean, James Allen, and Yiannis Aloimonos, Artificial
Intelligence: Theory and Practice, Addison-Wesley,
1995, Section 5.9.
2 Reinforcement Learning
- A form of learning where the agent can explore
and learn through interaction with the
environment
- The agent learns a policy, which is a mapping
from states to actions. The policy tells what the
best move is in a particular state.
- It is a general methodology: planning, decision
making, and search can all be viewed as some form of
reinforcement learning.
3 Tic-tac-toe: a different approach
- Recall the minimax approach: the agent knows
its current state. It generates a two-layer search
tree taking into account all the possible moves
for itself and the opponent, backs up values from
the leaf nodes, and takes the best move assuming
that the opponent will also do so.
- An alternative is to start playing directly against
an opponent (who does not have to be perfect, but
could be). Assume no prior knowledge or
lookahead. Assign values to states:
- 1 is a win
- 0 is a loss or draw
- 0.5 is anything else
4
Notice that 0.5 is arbitrary: it cannot
differentiate between good moves and bad moves.
So, the learner has no guidance initially. It
simply plays. When the game ends, if it is
a win, the value 1 is propagated backwards.
If it is a draw or a loss, the value 0 is
propagated backwards. Eventually, earlier states
will be labeled to reflect their true value.
After several plays, the learner will learn the
best move given a state (a policy).
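The slide does not say how the final value is propagated backwards. A minimal Python sketch, assuming one common choice: after each game, walk the visited states from last to first and move each value a fraction of the way toward the value of its successor. The function name `backup`, the `step` parameter, and the string state labels are hypothetical and are not taken from the slides.

```python
def backup(values, visited, outcome, step=0.5, default=0.5):
    """Propagate a game's outcome backwards through the states the learner visited.

    values  : dict mapping a state to its current estimated value
    visited : states the learner passed through, in playing order
    outcome : 1.0 for a win, 0.0 for a loss or draw
    step    : assumed fraction of the gap closed per update (not from the slides)
    default : value of a state never seen before (0.5, as on the slide)
    """
    target = outcome
    for state in reversed(visited):
        old = values.get(state, default)
        values[state] = old + step * (target - old)
        target = values[state]  # the next-earlier state moves toward this value
    return values


# One won game that passed through three (hypothetical) board states.
values = backup({}, ["s0", "s1", "s2"], outcome=1.0)
print(values)
```

After a single win the last state moves from 0.5 to 0.75 and earlier states move less. Repeating the same winning game pushes all three values toward 1, which is the sense in which earlier states come to reflect their true value.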
5 Issues in generalizing this approach
- How will the state values be initialized or
propagated backwards?
- What if there is no end to the game (infinite
horizon)?
- This is an optimization problem, which suggests
that it is hard. How can an optimal policy be
learned?
6 A simple robot domain
The robot is in one of the states 0, 1, 2, 3.
Each state represents an office, and the offices are
connected in a ring. Three actions are
available:
+ moves to the next state
- moves to the previous state
@ remains at the same state
[Figure: state-transition diagram of the offices 0, 1, 2, 3 arranged in a ring, with arcs between neighbouring offices and an @ self-loop at each office]
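The domain can be written down as a small transition function. The sketch below is in Python and makes two assumptions: the three actions are encoded as the strings "+", "-", and "@" (with "+" standing for the move to the next state), and "next" follows the numbering 0, 1, 2, 3 and then wraps around to 0.

```python
STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "@"]


def f(j, a):
    """Next office when the robot is in office j and performs action a."""
    if a == "+":
        return (j + 1) % 4  # next office, wrapping from 3 back to 0
    if a == "-":
        return (j - 1) % 4  # previous office, wrapping from 0 back to 3
    if a == "@":
        return j            # stay where you are
    raise ValueError("unknown action: " + repr(a))


# Print the full transition table, one office per line.
for j in STATES:
    print(j, {a: f(j, a) for a in ACTIONS})
```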
7 The robot domain (contd)
- The robot can observe the label of the state it
is in and perform any action corresponding to an
arc leading out of its current state.
- We assume that there is a clock governing the
passage of time, and that at each tick of the
clock the robot has to perform an action.
- The environment is deterministic: there is a
unique state resulting from any initial state and
action.
- Each state has a reward: 10 for state 3, 0 for
the others.
8 The reinforcement learning problem
- Given information about the environment
- States
- Actions
- State-transition function (or diagram)
- Output a policy $\pi$: states $\to$ actions, i.e., find
the best action to execute at each state
- Assumes that the state is completely observable
(the agent always knows which state it is in)
9 Compare three policies
- a. Every state is mapped to @
- The value of this policy is 0, because a robot
that starts outside office 3 will never get to it.
- b. Every state is mapped to + (policy 0)
- The value of this policy is $\infty$, because the
robot will end up in office 3 infinitely often.
- c. Every state except 3 is mapped to +, and 3 is
mapped to @ (policy 1)
- The value of this policy is also $\infty$, because
the robot will end up (stay) in office 3
infinitely often.
10 Compare three policies
So, it is easy to rule out case a, but how can we
show that policy 1 is better than policy 0? One
way would be to compute the average reward per
tick:
- POLICY 1: The average reward per tick for state 0
approaches 10 in the long run.
- POLICY 0: The average reward per tick for state 0
is 10/4.
Another way would be to assign higher values to
immediate rewards and apply a discount to future
rewards.
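The two averages can be checked by simulation. This is a minimal Python sketch, assuming the actions are encoded as the strings "+", "-", "@" and that the reward of the starting office counts as tick 0; the names `average_reward`, `POLICY_0`, and `POLICY_1` are illustrative.

```python
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}


def f(j, a):
    """Ring of four offices: '+' next, '-' previous, '@' stay (assumed encoding)."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


POLICY_0 = {0: "+", 1: "+", 2: "+", 3: "+"}  # always move on
POLICY_1 = {0: "+", 1: "+", 2: "+", 3: "@"}  # move on, then stay in office 3


def average_reward(policy, start, ticks):
    """Mean reward over the first `ticks` states visited, starting at `start`."""
    state, total = start, 0
    for _ in range(ticks):
        total += REWARD[state]
        state = f(state, policy[state])
    return total / ticks


print(average_reward(POLICY_0, 0, 4000))
print(average_reward(POLICY_1, 0, 4000))
```

Over 4000 ticks policy 0 averages exactly 10/4, while policy 1 falls just short of 10 because of the three reward-free ticks spent walking to office 3.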
11 Discounted cumulative reward
- Assume that the robot associates a higher value
with more immediate rewards and therefore
discounts future rewards.
- The discount rate ($\gamma$) is a number between 0 and 1
used to discount future rewards.
- The discounted cumulative reward for a particular
state with respect to a given policy is the sum
for $n$ from 0 to infinity of $\gamma^n$ times the reward
associated with the state reached after the $n$-th
tick of the clock.
POLICY 1: The discounted cumulative reward for
state 0 is 2.5 (taking $\gamma = 0.5$).
POLICY 0: The discounted cumulative reward for
state 0 is 1.33 (taking $\gamma = 0.5$).
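12 Discounted cumulative reward (contd)
- Take $\gamma = 0.5$
- For state 0 with respect to policy 0:
$0.5^0 \times 0 + 0.5^1 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 10 + 0.5^4 \times 0 + 0.5^5 \times 0 + 0.5^6 \times 0 + 0.5^7 \times 10 + \dots = 1.25 + 0.078 + \dots = 1.33$ in the limit
- For state 0 with respect to policy 1:
$0.5^0 \times 0 + 0.5^1 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 10 + 0.5^4 \times 10 + 0.5^5 \times 10 + 0.5^6 \times 10 + 0.5^7 \times 10 + \dots = 2.5$ in the limit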
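The two sums can be reproduced numerically by truncating the infinite series. The Python sketch below assumes the string encoding "+", "-", "@" for the actions and cuts the sum off after 60 ticks, by which point the remaining terms are negligible for a discount rate of 0.5; the function name `discounted_reward` is illustrative.

```python
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}


def f(j, a):
    """Ring of four offices: '+' next, '-' previous, '@' stay (assumed encoding)."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


POLICY_0 = {0: "+", 1: "+", 2: "+", 3: "+"}
POLICY_1 = {0: "+", 1: "+", 2: "+", 3: "@"}


def discounted_reward(policy, start, gamma=0.5, ticks=60):
    """Sum of gamma**n times the reward of the state occupied at tick n, for n < ticks."""
    state, total = start, 0.0
    for n in range(ticks):
        total += gamma ** n * REWARD[state]
        state = f(state, policy[state])
    return total


print(round(discounted_reward(POLICY_0, 0), 2))  # policy 0, starting in office 0
print(round(discounted_reward(POLICY_1, 0), 2))  # policy 1, starting in office 0
```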
13 Discounted cumulative reward (contd)
- Let
- $j$ be a state,
- $R(j)$ be the reward for ending up in state $j$,
- $\pi$ be a fixed policy,
- $\pi(j)$ be the action dictated by $\pi$ in state $j$,
- $f(j, a)$ be the next state given the robot starts in state $j$ and performs action $a$,
- $V^{\pi}_{i}(j)$ be the estimated value of state $j$ with respect to the policy $\pi$ after the $i$-th iteration of the algorithm
- Using a dynamic programming algorithm, one can
obtain a good estimate of $V^{\pi}$, the value function
for policy $\pi$, as $i \to \infty$.
14 A dynamic programming algorithm to compute values
for states for a policy $\pi$
- 1. For each $j$, set $V^{\pi}_{0}(j)$ to 0.
- 2. Set $i$ to 0.
- 3. For each $j$, set $V^{\pi}_{i+1}(j)$ to $R(j) + \gamma V^{\pi}_{i}(f(j, \pi(j)))$.
- 4. Set $i$ to $i + 1$.
- 5. If $i$ is equal to the maximum number of
iterations, then return $V^{\pi}_{i}$; otherwise, return
to step 3.
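The five steps translate almost line for line into code. This is a minimal Python sketch for the four-office robot domain, assuming the actions are encoded as the strings "+", "-", "@" and a discount rate of 0.5; the function name `evaluate` is illustrative.

```python
STATES = [0, 1, 2, 3]
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}


def f(j, a):
    """Ring of four offices: '+' next, '-' previous, '@' stay (assumed encoding)."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


def evaluate(policy, iterations, gamma=0.5):
    """Estimate the value of every state under a fixed policy."""
    values = {j: 0.0 for j in STATES}  # step 1: all values start at 0
    for _ in range(iterations):        # steps 2, 4, 5: count the iterations
        # Step 3: every new value is computed from the previous iteration's values.
        values = {j: REWARD[j] + gamma * values[f(j, policy[j])] for j in STATES}
    return values


POLICY_0 = {0: "+", 1: "+", 2: "+", 3: "+"}
POLICY_1 = {0: "+", 1: "+", 2: "+", 3: "@"}
print(evaluate(POLICY_0, 5))
print(evaluate(POLICY_1, 5))
```

Building a fresh dictionary on each pass keeps the update synchronous: no state sees a value that was already updated in the same iteration, which matches step 3. With five iterations the sketch gives 1.25, 2.5, 5, 10.625 for policy 0 and 1.875, 4.375, 9.375, 19.375 for policy 1.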
15 Values of states for policy 0
- initialize
- $V(0) = 0$
- $V(1) = 0$
- $V(2) = 0$
- $V(3) = 0$
- iteration 0
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 0 = 0$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 0 = 0$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 0 = 0$
- For office 3: $R(3) + \gamma V(0) = 10 + 0.5 \times 0 = 10$
- (iteration 0 essentially initializes the values of
states to their immediate rewards)
16 Values of states for policy 0 (contd)
- iteration 0: $V(0) = V(1) = V(2) = 0$, $V(3) = 10$
- iteration 1
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 0 = 0$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 0 = 0$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 10 = 5$
- For office 3: $R(3) + \gamma V(0) = 10 + 0.5 \times 0 = 10$
- iteration 2
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 0 = 0$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 5 = 2.5$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 10 = 5$
- For office 3: $R(3) + \gamma V(0) = 10 + 0.5 \times 0 = 10$
17 Values of states for policy 0 (contd)
- iteration 2: $V(0) = 0$, $V(1) = 2.5$, $V(2) = 5$, $V(3) = 10$
- iteration 4
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 2.5 = 1.25$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 5 = 2.5$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 10 = 5$
- For office 3: $R(3) + \gamma V(0) = 10 + 0.5 \times 0 = 10$
- iteration 5
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 2.5 = 1.25$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 5 = 2.5$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 10 = 5$
- For office 3: $R(3) + \gamma V(0) = 10 + 0.5 \times 1.25 = 10.625$
18 Values of states for policy 1
- initialize
- $V(0) = 0$
- $V(1) = 0$
- $V(2) = 0$
- $V(3) = 0$
- iteration 0
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 0 = 0$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 0 = 0$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 0 = 0$
- For office 3: $R(3) + \gamma V(3) = 10 + 0.5 \times 0 = 10$
19 Values of states for policy 1 (contd)
- iteration 0: $V(0) = V(1) = V(2) = 0$, $V(3) = 10$
- iteration 1
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 0 = 0$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 0 = 0$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 10 = 5$
- For office 3: $R(3) + \gamma V(3) = 10 + 0.5 \times 10 = 15$
- iteration 2
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 0 = 0$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 5 = 2.5$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 15 = 7.5$
- For office 3: $R(3) + \gamma V(3) = 10 + 0.5 \times 15 = 17.5$
20 Values of states for policy 1 (contd)
- iteration 2: $V(0) = 0$, $V(1) = 2.5$, $V(2) = 7.5$, $V(3) = 17.5$
- iteration 4
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 2.5 = 1.25$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 7.5 = 3.75$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 17.5 = 8.75$
- For office 3: $R(3) + \gamma V(3) = 10 + 0.5 \times 17.5 = 18.75$
- iteration 5
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 3.75 = 1.875$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 8.75 = 4.375$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 18.75 = 9.375$
- For office 3: $R(3) + \gamma V(3) = 10 + 0.5 \times 18.75 = 19.375$
21 Compare policies
- Policy 0 after iteration 5
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 2.5 = 1.25$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 5 = 2.5$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 10 = 5$
- For office 3: $R(3) + \gamma V(0) = 10 + 0.5 \times 1.25 = 10.625$
- Policy 1 after iteration 5
- For office 0: $R(0) + \gamma V(1) = 0 + 0.5 \times 3.75 = 1.875$
- For office 1: $R(1) + \gamma V(2) = 0 + 0.5 \times 8.75 = 4.375$
- For office 2: $R(2) + \gamma V(3) = 0 + 0.5 \times 18.75 = 9.375$
- For office 3: $R(3) + \gamma V(3) = 10 + 0.5 \times 18.75 = 19.375$
- Policy 1 is better because each state has a higher
value than under policy 0
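As a cross-check on where these iterates are heading, the limiting values can be worked out by hand. The derivation below assumes the update $R(j) + \gamma V(f(j, \pi(j)))$ with $\gamma = 0.5$ and sets each value equal to its own update, which gives a small system of linear equations for each policy:

```latex
\begin{align*}
\text{Policy 1:}\quad & V(3) = 10 + 0.5\,V(3) \;\Rightarrow\; V(3) = 20, \\
& V(2) = 0.5\,V(3) = 10,\quad V(1) = 0.5\,V(2) = 5,\quad V(0) = 0.5\,V(1) = 2.5. \\[4pt]
\text{Policy 0:}\quad & V(0) = 0.5^{3}\,V(3),\quad V(3) = 10 + 0.5\,V(0) \\
& \Rightarrow\; V(3) = \frac{10}{1 - 0.5^{4}} = \frac{32}{3} \approx 10.67,\quad
V(2) = \frac{16}{3},\quad V(1) = \frac{8}{3},\quad V(0) = \frac{4}{3} \approx 1.33.
\end{align*}
```

The limiting values for office 0, 2.5 under policy 1 and about 1.33 under policy 0, agree with the discounted cumulative rewards computed directly from the series.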
22 Temporal credit assignment problem
- It is the problem of assigning credit or blame
to the actions in a sequence of actions where
feedback is available only at the end of the
sequence.
- When you lose a game of chess or checkers, the
blame for your loss cannot necessarily be
attributed to the last move you made, or even the
next-to-the-last move.
- Dynamic programming solves the temporal credit
assignment problem by propagating rewards
backwards to earlier states and hence to actions
earlier in the sequence of actions determined by
a policy.
23 Computing an optimal policy
- Given a method for estimating the value of states
with respect to a fixed policy, it is possible to
find an optimal policy. We would like to maximize
the discounted cumulative reward.
- Policy iteration (Howard, 1960) is an algorithm
that uses the algorithm for computing the value
of a state as a subroutine.
24 Policy iteration algorithm
- 1. Let $\pi_0$ be an arbitrary policy.
- 2. Set $i$ to 0.
- 3. Compute $V^{\pi_i}(j)$ for each $j$.
- 4. Compute a new policy $\pi_{i+1}$ so that $\pi_{i+1}(j)$ is
the action $a$ maximizing $R(j) + \gamma V^{\pi_i}(f(j, a))$.
- 5. If $\pi_{i+1} = \pi_i$, then return $\pi_i$; otherwise, set
$i$ to $i + 1$, and go to step 3.
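The loop can be sketched in Python for the four-office robot domain. Several things here are assumptions rather than part of the slides: the string encoding "+", "-", "@" for the actions, a discount rate of 0.5, a fixed 200 sweeps of the dynamic programming evaluation as a stand-in for the exact value function, and a tie-breaking rule that keeps the current action unless another one is better by more than a small tolerance.

```python
STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "@"]
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}
GAMMA = 0.5


def f(j, a):
    """Ring of four offices: '+' next, '-' previous, '@' stay (assumed encoding)."""
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "@": j}[a]


def evaluate(policy, iterations=200):
    """Step 3: estimate the value of every state under a fixed policy."""
    values = {j: 0.0 for j in STATES}
    for _ in range(iterations):
        values = {j: REWARD[j] + GAMMA * values[f(j, policy[j])] for j in STATES}
    return values


def improve(policy, values, tol=1e-9):
    """Step 4: per state, pick the action with the best backed-up value.

    The current action is kept unless another beats it by more than `tol`
    (an assumed rule, so equally good actions cannot make the policy flip-flop).
    """
    new_policy = {}
    for j in STATES:
        best = policy[j]
        best_value = REWARD[j] + GAMMA * values[f(j, best)]
        for a in ACTIONS:
            value = REWARD[j] + GAMMA * values[f(j, a)]
            if value > best_value + tol:
                best, best_value = a, value
        new_policy[j] = best
    return new_policy


def policy_iteration(policy):
    """Steps 1, 2, and 5: alternate evaluation and improvement until nothing changes."""
    while True:
        values = evaluate(policy)
        new_policy = improve(policy, values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


# Start from the policy that maps every office to "@".
policy, values = policy_iteration({j: "@" for j in STATES})
print(policy)
print(values)
```

Starting from the all-@ policy, this sketch settles on - for office 0, + for office 2, and @ for office 3. Because the offices form a ring, a single - move takes the robot from office 0 straight to office 3, so the result differs from policy 1 at office 0. At office 1 the actions + and - are equally good, and the tie-breaking rule decides between them.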
25 Policy iteration algorithm (contd)
- A policy $\pi$ is said to be optimal if
there is no other policy $\pi'$ and state $j$ such that
$V^{\pi'}(j) > V^{\pi}(j)$ and, for all $k \neq j$, $V^{\pi'}(k) \geq V^{\pi}(k)$.
- The policy iteration algorithm is guaranteed to
terminate in a finite number of steps with an
optimal policy.
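The condition in the definition can be phrased as a test on two value functions. The Python sketch below reads the definition as "strictly better in at least one state and at least as good in every other state", which is an interpretation, and applies it to the estimates reached after iteration 5 for the robot domain; the function name `improves_on` is illustrative.

```python
def improves_on(v_new, v_old):
    """True if v_new beats v_old in some state and is no worse in any state."""
    states = v_old.keys()
    no_worse = all(v_new[j] >= v_old[j] for j in states)
    strictly_better = any(v_new[j] > v_old[j] for j in states)
    return no_worse and strictly_better


# Estimated values after iteration 5, as computed on the earlier slides.
V_POLICY_0 = {0: 1.25, 1: 2.5, 2: 5.0, 3: 10.625}
V_POLICY_1 = {0: 1.875, 1: 4.375, 2: 9.375, 3: 19.375}

print(improves_on(V_POLICY_1, V_POLICY_0))
print(improves_on(V_POLICY_0, V_POLICY_1))
```

A policy is optimal in this sense when no other policy's value function passes the test against it. Strictly, the definition refers to the true value functions, so using finite-iteration estimates is only indicative.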
26 Comments on reinforcement learning
- A general model where an agent can learn to
function in dynamic environments
- The agent can learn while interacting with the
environment
- No prior knowledge except the (probabilistic)
transitions is assumed
- Can be generalized to stochastic domains (an
action might have several different probabilistic
consequences, i.e., the state-transition function
is not deterministic)
- Can also be generalized to domains where the
reward function is not known
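For the stochastic case, one common way to write the value update replaces the single next state with an expectation over possible next states. The symbol $P(k \mid j, a)$, the probability of landing in state $k$ after performing action $a$ in state $j$, is an assumption introduced here and is not used in the slides:

```latex
V^{\pi}_{i+1}(j) = R(j) + \gamma \sum_{k} P\bigl(k \mid j, \pi(j)\bigr)\, V^{\pi}_{i}(k)
```

With a deterministic transition function $f$, $P(k \mid j, a)$ is 1 for $k = f(j, a)$ and 0 otherwise, so the sum collapses to the single term $V^{\pi}_{i}(f(j, \pi(j)))$ used in the deterministic algorithm.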
27 Famous example: TD-Gammon (Tesauro, 1995)
- Learns to play Backgammon
- Immediate reward: 100 if win, -100 if lose, 0
for all other states
- Trained by playing 1.5 million games against
itself (several weeks)
- Now approximately equal to the best human player
(won World Cup of Backgammon in 1992; among top 3
since 1995)
- Predecessor: NeuroGammon (Tesauro and Sejnowski,
1989) learned from examples of labelled moves
(very tedious for a human expert)
28 Other examples
- Robot learning to dock on battery charger
- Pole balancing
- Elevator dispatching (Crites and Barto, 1995):
better than industry standard
- Inventory management (Van Roy et al.): 10-15%
improvement over industry standards
- Job-shop scheduling for NASA space missions
(Zhang and Dietterich, 1997)
- Dynamic channel assignment in cellular phones
(Singh and Bertsekas, 1994)
- Robotic soccer
29 Common characteristics
- delayed reward
- opportunity for active exploration
- possibility that state is only partially observable
- possible need to learn multiple tasks with the same
sensors/effectors
- there may not be an adequate teacher