Title: RELATIONAL REINFORCEMENT LEARNING
1 RELATIONAL REINFORCEMENT LEARNING
- PLL Seminar - Talk 2
- Tayfun Gürel
2 Overview
- Reinforcement Learning Background
- Need for Relational Representations
- Logical Decision Trees - TILDE
- Q-RRL: integration of RL with TILDE
- P-RRL
- Experimental Results
- Final Discussion and Conclusion
3 The standard reinforcement-learning model
- [Diagram: the agent receives an input i, the current state s, and a reward r from the environment, and responds with an action a.]
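A minimal sketch of this interaction loop, assuming an environment object with reset() and step(action) methods and a policy function; these names are illustrative, not taken from the slides:

    # Illustrative agent-environment loop: the agent sees state s and reward r,
    # and responds with action a.
    def run_interaction(env, policy, steps=100):
        s, r = env.reset(), 0.0
        for _ in range(steps):
            a = policy(s, r)       # agent chooses an action from its current input
            s, r = env.step(a)     # environment returns the next state and reward
        return s, r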
4 MDP EXAMPLE
- [Slide shows a worked MDP example: the states and rewards, the transition function, the Bellman equation, and greedy policy selection.]
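Written in standard MDP notation (transition function T, reward function R, discount factor gamma; the slide's own symbols are not recoverable), the Bellman equation and the greedy policy it induces are:

    V^*(s)   = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \, V^*(s') \right]
    \pi^*(s) = \arg\max_{a} \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \, V^*(s') \right]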
5 Value Iteration Algorithm
- An alternative iteration (Singh, 1993), important for Q-learning
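A minimal value-iteration sketch, assuming the MDP is given explicitly as dictionaries R[s][a] (reward) and T[s][a] (a list of (next_state, probability) pairs); these names are illustrative, not from the slides. The bracketed expression is the Q-form of the update, i.e. the alternative iteration that Q-learning approximates without a model:

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * V(s')
                q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                     for a in actions]
                new_v = max(q)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < eps:
                return V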
6 Q-Learning (Model Free)
- Action selection is done with a Q-value-based exploration strategy (Q-exploration).
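A minimal tabular Q-learning sketch. Softmax (Boltzmann) selection is shown as one common form of Q-exploration; the learning rate alpha, temperature tau, and the tuple-keyed Q table are assumptions for illustration, not the slides' own formulas:

    import math
    import random
    from collections import defaultdict

    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    def q_explore(state, actions, tau=1.0):
        # Boltzmann exploration: actions with higher Q-values are chosen more often.
        weights = [math.exp(Q[(state, a)] / tau) for a in actions]
        return random.choices(actions, weights=weights)[0]

    def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # Model-free update towards r + gamma * max_a' Q(s', a').
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])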
7 P-Learning
- The Q-function encodes the distance from the goal.
- Instead, code whether an (s,a) pair is optimal (P = 1) or not (P = 0).
- Use the P-exploration strategy.
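A minimal sketch of deriving P-values from learned Q-values, under the reading that P(s,a) = 1 when a is an optimal action in s and 0 otherwise; the helper name and tolerance are illustrative:

    def p_values(Q, state, actions, tol=1e-9):
        # Mark an action optimal (P = 1) if its Q-value matches the best
        # Q-value available in this state; otherwise P = 0.
        best = max(Q[(state, a)] for a in actions)
        return {a: 1 if Q[(state, a)] >= best - tol else 0 for a in actions}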
8 Relational Representations
- In most applications the state space is too large.
- Generalization over states is essential: many states are similar in some respects.
- Representations for RL have to be enriched to allow generalization; RRL was initiated for this purpose (Dzeroski, De Raedt, Blockeel 1998).
9 Blocks World
- An action move(a,b) has the precondition clear(a) ∈ S and clear(b) ∈ S.
- Example state: s1 = {clear(b), clear(a), on(b,c), on(c,floor), on(a,floor)}
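A minimal sketch of this state and precondition in Python, with a state encoded as a set of ground facts (the tuple encoding is an assumption for illustration):

    # State s1 as a set of ground facts.
    s1 = {("clear", "a"), ("clear", "b"),
          ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}

    def can_move(state, x, y):
        # move(x, y) is applicable when clear(x) and clear(y) both hold in the state.
        return ("clear", x) in state and ("clear", y) in state

    print(can_move(s1, "a", "b"))  # True: both a and b are clear in s1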
10 Relational Representations
- With relational representations for states:
- Abstraction from details, e.g. a learned rule such as
  q_value(0.72) :- goal_unstack, numberofblocks(A), action_move(B,C), height(D,E), E2, on(C,D), !.
- Flexibility with respect to goal changes: retraining from the beginning is not necessary.
- Transfer of experience to more complex domains.
11 Relational Reinforcement Learning
- How does it work?
- An integration of RL with ILP:
- Do forever (see the sketch below):
  - Use Q-learning to generate sample Q-values for sample state-action pairs
  - Generalize them using ILP (in this case TILDE)
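A minimal sketch of that loop, with the episode runner and the TILDE induction step passed in as functions because they are not specified here; the names are illustrative:

    def relational_q_learning(run_episode, induce_tree, n_episodes=100):
        """Alternate Q-learning episodes with TILDE induction (Q-RRL outer loop)."""
        examples = []
        q_tree = None
        for _ in range(n_episodes):
            # One Q-learning episode, guided by the current Q-tree, yields
            # (state, action, q_value) training examples.
            examples.extend(run_episode(q_tree))
            # Generalize all examples collected so far into a logical regression tree.
            q_tree = induce_tree(examples)
        return q_tree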
12 TILDE (Top-Down Induction of Logical Decision Trees)
- A generalization of the Q/P-values is represented by a logical decision tree.
- Logical Decision Trees (a minimal representation is sketched below):
  - Nodes are first-order logic atoms used as Prolog queries (tests), e.g. on(A,c): is there any block on c?
  - Training data is a relational database or a Prolog knowledge base.
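A minimal sketch of how such a tree could be held in memory, with each internal node carrying a test (here a plain Python predicate standing in for a Prolog query over the state) and yes/no subtrees; the class names are illustrative:

    from dataclasses import dataclass
    from typing import Callable, Union

    @dataclass
    class Leaf:
        value: float                   # predicted Q/P value or class label

    @dataclass
    class Node:
        test: Callable[[set], bool]    # stand-in for a Prolog query used as a test
        yes: "Tree"
        no: "Tree"

    Tree = Union[Leaf, Node]

    def predict(tree: Tree, state: set) -> float:
        # Walk down the tree, branching on whether each test succeeds in the state.
        while isinstance(tree, Node):
            tree = tree.yes if tree.test(state) else tree.no
        return tree.value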
13 Logical Decision Tree vs. Decision Tree
- [Figure: a decision tree and a logical decision tree, each deciding whether the blocks are stacked.]
14 TILDE Algorithm
- Declarative bias, e.g. the mode declaration on(+,-)
- Background knowledge: a Prolog program. [Slide shows an example fragment of the program.]
15 TILDE Refinement Operators
- How to find all possible tests for a node?
- Use:
  1. the declarative bias
  2. the tests from the previous nodes
- Note: predicates from the background knowledge must be declared in the declarative bias.
16 Finding possible tests
- Assuming only the mode declaration on(+,-) and the current query on(A,B):
- The refinement operator generates
  - on(A,B), on(A,C)
  - on(A,B), on(B,C)
- [Figure: a partially built logical decision tree; the root tests on(A,B) (no: leaf "unstacked"), its yes-branch tests on(B,C) (no: leaf "unstacked"), and the remaining node "?" is still to be refined.]
- For that node the refinement operator generates (a sketch of the operator follows)
  1. on(A,B), on(B,C), on(A,D)
  2. on(A,B), on(B,C), on(B,D)
  3. on(A,B), on(A,C), on(C,D)
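A minimal sketch of this refinement step for the single mode declaration on(+,-): each refinement adds one on/2 literal whose first argument is a variable already in the query and whose second argument is a fresh variable (literals encoded as tuples for illustration; variable naming assumes the A, B, C, ... convention):

    def refine(query):
        # query: list of literals, e.g. [("on", "A", "B")]
        seen, old_vars = set(), []
        for lit in query:
            for v in lit[1:]:
                if v not in seen:
                    seen.add(v)
                    old_vars.append(v)
        fresh = chr(ord("A") + len(old_vars))   # next unused variable name
        # Mode on(+,-): first argument is an existing variable, second is new.
        return [query + [("on", v, fresh)] for v in old_vars]

    # refine([("on", "A", "B")]) yields on(A,B),on(A,C) and on(A,B),on(B,C).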
17 Q-RRL Algorithm
- [Figure: examples generated by Q-learning.]
18 Logical regression tree generated by TILDE-RT
- [Figure: the induced logical regression tree and its equivalent Prolog program.]
19 P-RRL algorithm
- Idea: the P-function may be easier to learn.
- The Q-function encodes the distance from the goal; what happens if the number of blocks is changed?
20 P-RRL
- Together with the induction of the Q-tree, perform induction of a P-tree.
- Use the P-exploration strategy.
21 A P-RRL Result
- [Figure: examples generated by P-RRL in an episode, and the decision tree generated by TILDE from these examples.]
22 EXPERIMENTS
- Tested for three different goals
- 1. one-stack
- 2. on(a,b)
- 3. unstack
- Tested for the following as well
- 1. Fixed number of blocks
- 2. Number of blocks changed after learning
- 3. Number of blocks changed while learning
- P-RRL vs. Q-RRL
23 Results: Fixed number of blocks
- Accuracy: percentage of correctly classified (s,a) pairs (optimal vs. non-optimal)
- For comparison: accuracy of random policies
24 Results: Fixed number of blocks
25 Results: Evaluating learned policies on a varying number of blocks
26 Results: Varying the number of blocks while learning
- STRANGE!
27 Results: Varying the number of blocks while learning
- After adding additional background knowledge
28 Conclusion
- RRL shows satisfying initial results but needs more research.
- P-RRL is more successful when the number of blocks is increased (it generalizes better to more complex domains).
- RRL does not work very well for more complex goals and for large numbers of blocks.
29 Discussion
- Theoretical research proving why RRL works is still missing.
- LOMDPs (Logical Markov Decision Programs) were introduced (Kersting, De Raedt 2003).
- A LOMDP consists of:
  - a logical alphabet
  - a set of abstract states (first-order conjunctions)
  - abstract actions and transition functions
30 Discussion
- It is proven that for every LOMDP and abstract policy there is a corresponding MDP and policy.
- Still open: why does RRL work?
- Why does the optimal policy found for a LOMDP also work for the corresponding MDP?
31 Other RRL approaches
- Symbolic Dynamic Programming (Reiter, 2001): a combination of dynamic programming with the situation calculus.
- Deictic representations (Finney 2002): no FOL; variables like the-pen-in-my-hand.
32 References
- S. Dzeroski, L. De Raedt, K. Driessens. Relational Reinforcement Learning. Machine Learning 43(1/2): 7-52, 2001.
- K. Kersting, L. De Raedt. Logical Markov Decision Programs. In L. Getoor and D. Jensen, editors, Working Notes of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data (SRL-03), pp. 63-70, August 11, Acapulco, Mexico, 2003.
- M. van Otterlo. Relational Representations in Reinforcement Learning: Review and Open Problems. Proceedings of the ICML'02 Workshop on Development of Representations, 2002.
33 References
- L. P. Kaelbling, M. L. Littman, A. W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
- S. Dzeroski, L. De Raedt, H. Blockeel. Relational Reinforcement Learning. In D. Page, editor, Proceedings of the 8th International Conference on Inductive Logic Programming, Lecture Notes in Artificial Intelligence, vol. 1446, Springer, 1998.