Title: Collaborative Reinforcement Learning
1 Collaborative Reinforcement Learning
2 Credits
- Reinforcement Learning: A User's Guide. Bill Smart, ICAC 2005.
- Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems, pages 700-704, 2004. Winner of the Best Paper Award.
3 What is RL?
- "A way of programming agents by reward and punishment without needing to specify how the task is to be achieved" - Kaelbling, Littman and Moore, 1996
4 Basic RL Model
- Observe state, s_t
- Decide on an action, a_t
- Perform action
- Observe new state, s_(t+1)
- Observe reward, r_(t+1)
- Learn from experience
- Repeat
- Goal: Find a control policy that will maximize the observed rewards over the lifetime of the agent
[Diagram: agent-environment loop - the agent sends actions (A) to the environment and receives states (S) and rewards (R).]
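To make the loop concrete, here is a minimal Python sketch of the observe-act-learn cycle; the env and agent objects and their methods (reset, step, choose_action, update) are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the basic RL loop (interfaces are illustrative assumptions).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # observe initial state s_t
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)               # decide on an action a_t
        next_state, reward, done = env.step(action)       # perform it, observe s_(t+1), r_(t+1)
        agent.update(state, action, reward, next_state)   # learn from experience
        total_reward += reward
        state = next_state                                 # repeat
        if done:
            break
    return total_reward
```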
5 An Example: Gridworld
- Canonical RL domain
- States are grid cells
- 4 actions: N, S, E, W
- Reward of +1 for entering the top right cell
- -0.01 for every other move
- Maximizing the sum of rewards → the shortest path, in this instance
[Diagram: gridworld with a +1 reward in the top right cell.]
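A minimal sketch of such a gridworld environment, assuming a 5x5 grid (the slide does not give the grid size); the reward structure matches the bullets above: +1 for entering the top right cell, -0.01 for every other move.

```python
# Minimal gridworld sketch: grid size is an assumption, goal is the top right cell.
class Gridworld:
    ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)   # top right cell
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.state
        # Moves that would leave the grid keep the agent in place.
        nx = min(max(x + dx, 0), self.size - 1)
        ny = min(max(y + dy, 0), self.size - 1)
        self.state = (nx, ny)
        reward = 1.0 if self.state == self.goal else -0.01
        done = self.state == self.goal
        return self.state, reward, done
```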
6 The Promise of RL
- Specify what to do, but not how to do it
- Through the reward function
- Learning fills in the details
- Better final solutions
- Based on actual experiences, not programmer assumptions
- Less (human) time needed for a good solution
7 Mathematics of RL
- Before we talk about RL, we need to cover some background material
- Some simple decision theory
- Markov Decision Processes
- Value functions
8 Making Single Decisions
- Single decision to be made
- Multiple discrete actions
- Each action has a reward associated with it
- Goal is to maximize reward
- Not hard: just pick the action with the largest reward
- State 0 has a value of 2
- Value = the sum of rewards from taking the best action from the state
[Diagram: from state 0, action A gives reward 1 and action B gives reward 2.]
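In code, the single-decision case is just an argmax over per-action rewards; the numbers below are the ones from the diagram (A gives 1, B gives 2).

```python
# Rewards for each action from state 0, as in the diagram: A -> 1, B -> 2.
rewards = {"A": 1, "B": 2}

best_action = max(rewards, key=rewards.get)   # "B"
state_value = rewards[best_action]            # 2, the value of state 0
print(best_action, state_value)
```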
9 Markov Decision Processes
- We can generalize the previous example to multiple sequential decisions
- Each decision affects subsequent decisions
- This is formally modeled by a Markov Decision Process (MDP)
[Diagram: example MDP with states 0-5 and actions A and B. From state 0, one action leads to state 1 (reward 1) and the other to state 2 (reward 2); from state 1 to state 3 (reward 1) or state 4 (reward 1); from state 2 to state 4 (reward -1000); from state 3 to state 5 (reward 1); from state 4 to state 5 (reward 10).]
10 Markov Decision Processes
- Formally, an MDP is
- A set of states, S = {s1, s2, ... , sn}
- A set of actions, A = {a1, a2, ... , am}
- A reward function, R: S × A × S → ℝ
- A transition function, T(s, a, s') = P(s' | s, a)
- We want to learn a policy, π: S → A
- Maximize the sum of rewards we see over our lifetime
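As a concrete sketch, the example MDP used in these slides can be written down as plain Python dictionaries. The transitions are deterministic here, and the action labels on some edges are assumptions read off the diagram; the rewards are the ones used on the policy-comparison slide.

```python
# Example MDP from the slides as dictionaries (deterministic transitions assumed).
# transitions[(state, action)] = next_state
# rewards[(state, action, next_state)] = reward
states = [0, 1, 2, 3, 4, 5]
actions = ["A", "B"]

transitions = {
    (0, "A"): 1, (0, "B"): 2,
    (1, "A"): 3, (1, "B"): 4,
    (2, "A"): 4,
    (3, "A"): 5,
    (4, "A"): 5,
}

rewards = {
    (0, "A", 1): 1, (0, "B", 2): 2,
    (1, "A", 3): 1, (1, "B", 4): 1,
    (2, "A", 4): -1000,
    (3, "A", 5): 1,
    (4, "A", 5): 10,
}
```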
11 Policies
- There are 3 policies for this MDP
- 0 → 1 → 3 → 5
- 0 → 1 → 4 → 5
- 0 → 2 → 4 → 5
- Which is the best one?
[Diagram: the same example MDP as on slide 9.]
12 Comparing Policies
- Order policies by how much reward they see
- 0 → 1 → 3 → 5: 1 + 1 + 1 = 3
- 0 → 1 → 4 → 5: 1 + 1 + 10 = 12
- 0 → 2 → 4 → 5: 2 - 1000 + 10 = -988
[Diagram: the same example MDP as on slide 9.]
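The three totals can be checked with a few lines of Python; the per-step rewards are the ones listed above.

```python
# Rewards collected along each policy's path through the example MDP.
paths = {
    "0 -> 1 -> 3 -> 5": [1, 1, 1],
    "0 -> 1 -> 4 -> 5": [1, 1, 10],
    "0 -> 2 -> 4 -> 5": [2, -1000, 10],
}

for path, rs in paths.items():
    print(path, "return =", sum(rs))   # 3, 12, -988
```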
13 Value Functions
- We can define value without specifying the policy
- Specify the value of taking action a from state s and then performing optimally
- This is the state-action value function, Q
- How do you tell which action to take from each state?
[Diagram: the same example MDP as on slide 9.]
14 Value Functions
- So, we have the value function
- Q(s, a) = R(s, a, s') + max_a' Q(s', a')
- In the form of
- Next reward plus the best I can do from the next state
- These extend to probabilistic actions
- Q(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + max_a' Q(s', a') ]
- s' is the next state
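For the deterministic example MDP, this Q equation can be solved by simple value iteration. The sketch below reuses the assumed transition and reward tables from the slide 10 sketch and treats state 5 as terminal with value 0.

```python
# Q-value iteration on the deterministic example MDP (terminal state 5 has value 0).
transitions = {(0, "A"): 1, (0, "B"): 2, (1, "A"): 3, (1, "B"): 4,
               (2, "A"): 4, (3, "A"): 5, (4, "A"): 5}
rewards = {(0, "A"): 1, (0, "B"): 2, (1, "A"): 1, (1, "B"): 1,
           (2, "A"): -1000, (3, "A"): 1, (4, "A"): 10}

Q = {sa: 0.0 for sa in transitions}

def best_value(state):
    # max_a' Q(s', a'); 0 if the state has no outgoing actions (terminal).
    vals = [Q[(s, a)] for (s, a) in Q if s == state]
    return max(vals) if vals else 0.0

for _ in range(10):                      # a few sweeps suffice for this small MDP
    for (s, a), s_next in transitions.items():
        Q[(s, a)] = rewards[(s, a)] + best_value(s_next)

print(Q[(0, "A")], Q[(0, "B")])          # 12.0 and -988.0
```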
15 Getting the Policy
- If we have the value function, then finding the best policy is easy
- π(s) = arg max_a Q(s, a)
- We're looking for the optimal policy, π*(s)
- No policy generates more reward than π*
- The optimal policy defines the optimal value function
- Q*(s, a) = R(s, a, s') + max_a' Q*(s', a')
- The easiest way to learn the optimal policy is to learn the optimal value function first
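Given a Q table like the one computed above, extracting the greedy policy is one argmax per state. The sketch assumes the same (state, action) dictionary layout as before.

```python
# Greedy policy extraction: pi(s) = arg max_a Q(s, a).
def greedy_policy(Q):
    policy = {}
    for (s, a), value in Q.items():
        if s not in policy or value > Q[(s, policy[s])]:
            policy[s] = a
    return policy

Q = {(0, "A"): 12.0, (0, "B"): -988.0, (1, "A"): 2.0, (1, "B"): 11.0}
print(greedy_policy(Q))   # {0: 'A', 1: 'B'}
```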
16 Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing
- Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill
17 Overview
- Building autonomic distributed systems with self-* properties
- Self-Organizing
- Self-Healing
- Self-Optimizing
- Add a collaborative learning mechanism to a self-adaptive component model
- Improved ad-hoc routing protocol
18 Introduction
- Autonomous distributed systems will consist of interacting components free from human interference
- Existing top-down management and programming solutions require too much global state
- A bottom-up, decentralized collection of components that make their own decisions based on local information
- System-wide self-* behavior emerges from interactions
19 Self-* Behavior
- Self-adaptive components that change structure and/or behavior at run-time
- Adapt to discovered faults
- Reduced performance
- Requires active monitoring of component states and external dependencies
20 Self-* Distributed Systems using Distributed (Collaborative) Reinforcement Learning
- For complex systems, programmers cannot be expected to describe all conditions
- Self-adaptive behavior is learnt by components
- Decentralized co-ordination of components to support system-wide properties
- Distributed Reinforcement Learning (DRL) is an extension to RL that uses only neighbor interactions
21 Model-Based Reinforcement Learning
Markov Decision Process: States, Actions, R(States, Actions) → ℝ
1. Action Reward
2. State Transition Model
3. Next State Reward
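A rough sketch of how a model-based agent combines these three ingredients into a Q estimate; the variable names and data layout are illustrative assumptions, not taken from the paper.

```python
# Model-based backup sketch (names and layout are illustrative assumptions):
# R[s][a]     - estimated action reward
# P[s][a][s'] - estimated state transition probability
# V[s']       - current estimate of the next state's value ("next state reward")
def q_estimate(s, a, P, R, V):
    return R[s][a] + sum(prob * V[s_next] for s_next, prob in P[s][a].items())

P = {"s0": {"forward": {"s1": 0.9, "s0": 0.1}}}
R = {"s0": {"forward": 1.0}}
V = {"s0": 0.0, "s1": 5.0}
print(q_estimate("s0", "forward", P, R, V))   # 1.0 + 0.9*5.0 + 0.1*0.0 = 5.5
```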
22 Decentralised System Optimisation
- Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)
- Components have a Partial System View
- Coordination Actions
- Actions = delegation ∪ DOP actions ∪ discovery
- Connection Costs
23 Collaborative Reinforcement Learning
- Advertisement
- Update Partial Views of Neighbours
- Decay
- Negative Feedback on State Values in the Absence of Advertisements
[Diagram: the CRL update combines the action reward, the connection cost, the state transition model and the cached neighbours' V-values.]
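The exact CRL update rule is given in the paper; the sketch below only illustrates the two mechanisms named on this slide, with assumed names and an assumed decay schedule: cached neighbour V-values are refreshed by advertisements and decayed in their absence.

```python
import time

# Illustrative-only sketch: names, decay factor and timing policy are assumptions,
# not the definitions from the CRL paper.
class NeighbourCache:
    def __init__(self, decay_factor=0.9, staleness=5.0):
        self.decay_factor = decay_factor   # multiplicative negative feedback
        self.staleness = staleness         # seconds without an advertisement
        self.views = {}                    # neighbour -> [cached V-value, last advert time]

    def on_advertisement(self, neighbour, v_value):
        # Advertisement: a neighbour pushes its V-value, refreshing our partial view.
        self.views[neighbour] = [v_value, time.time()]

    def decay(self):
        # Decay: apply negative feedback to cached values whose advertisements are stale.
        now = time.time()
        for neighbour, entry in self.views.items():
            if now - entry[1] > self.staleness:
                entry[0] *= self.decay_factor
```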
24 Adaptation in a CRL System
- A feedback process responding to
- Changes in the optimal policy of any RL agent
- Changes in the system environment
- The passing of time
25 SAMPLE: Ad-hoc Routing using DRL
- Probabilistic ad-hoc routing protocol based on DRL
- Adaptation of network traffic around areas of congestion
- Exploitation of stable routes
- Routing decisions based on local information and information obtained from neighbors
- Outperforms Ad-hoc On-Demand Distance Vector Routing (AODV) and Dynamic Source Routing (DSR)
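SAMPLE selects next hops probabilistically rather than always greedily. The Boltzmann (softmax) selection below is one common way to implement such a choice over per-neighbour Q-values; it is an illustration under that assumption, not necessarily SAMPLE's exact rule.

```python
import math
import random

# Illustrative Boltzmann (softmax) choice of next hop from per-neighbour Q-values.
def choose_next_hop(q_values, temperature=1.0):
    neighbours = list(q_values)
    weights = [math.exp(q_values[n] / temperature) for n in neighbours]
    return random.choices(neighbours, weights=weights, k=1)[0]

print(choose_next_hop({"node_a": 2.0, "node_b": 1.0, "node_c": -5.0}))
```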
26 SAMPLE: A CRL System (I)
27 SAMPLE: A CRL System (II)
28 SAMPLE: A CRL System (III)
29 Performance
- Metrics
- Maximize
- throughput
- ratio of delivered packets to undelivered packets
- Minimize
- number of transmissions required per packet sent
- Figures 5-10
30 Observations/Questions
- How general is this approach?
- Is it easy to represent problems in DRL?
- Does DRL work for all problems?
- Needs many more examples
- Separation of self-* behavior from functional components?
- What guarantees of optimization are there?
- Does it work for problems requiring system-wide guarantees?
- Learning algorithms can and do reduce performance during intermediate stages
- More suited to off-line discovery?