1
Collaborative Reinforcement Learning
  • Presented by Dr. Ying Lu

2
Credits
  • Reinforcement Learning: A User's Guide. Bill
    Smart, ICAC 2005
  • Jim Dowling, Eoin Curran, Raymond Cunningham and
    Vinny Cahill, "Collaborative Reinforcement
    Learning of Autonomic Behaviour", 2nd
    International Workshop on Self-Adaptive and
    Autonomic Computing Systems, pages 700-704, 2004.
    Winner of the Best Paper Award.

3
What is RL?
  • "a way of programming agents by reward and
    punishment without needing to specify how the
    task is to be achieved"
  • Kaelbling, Littman and Moore, 1996

4
Basic RL Model
  • Observe state, st
  • Decide on an action, at
  • Perform action
  • Observe new state, st+1
  • Observe reward, rt+1
  • Learn from experience
  • Repeat
  • Goal: Find a control policy that will maximize
    the observed rewards over the lifetime of the
    agent

[Diagram: the agent-environment loop; the agent sends action A to the environment and receives state S and reward R]
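A minimal sketch of this loop in Python (the env and agent interfaces are illustrative assumptions, not from the slides):

# Agent-environment loop: observe, act, observe outcome, learn, repeat.
def run_episode(env, agent):
    s = env.reset()                    # observe initial state st
    done = False
    total_reward = 0.0
    while not done:
        a = agent.choose_action(s)     # decide on an action at
        s_next, r, done = env.step(a)  # observe new state st+1, reward rt+1
        agent.learn(s, a, r, s_next)   # learn from experience
        total_reward += r
        s = s_next
    return total_reward                # sum of observed rewards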
5
An Example Gridworld
  • Canonical RL domain
  • States are grid cells
  • 4 actions: N, S, E, W
  • +1 reward for entering the top-right cell
  • -0.01 for every other move
  • Maximizing the sum of rewards ⇒ shortest path,
    in this instance

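A sketch of these gridworld dynamics in Python (the 5x5 grid size is an illustrative assumption):

# Gridworld: 4 moves; +1 for entering the top-right goal, -0.01 otherwise.
N_ROWS, N_COLS = 5, 5
GOAL = (0, N_COLS - 1)  # top-right cell
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def step(state, action):
    # Bumping into a wall leaves the agent where it is.
    dr, dc = MOVES[action]
    row = min(max(state[0] + dr, 0), N_ROWS - 1)
    col = min(max(state[1] + dc, 0), N_COLS - 1)
    next_state = (row, col)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward, next_state == GOAL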
6
The Promise of RL
  • Specify what to do, but not how to do it
  • Through the reward function
  • Learning fills in the details
  • Better final solutions
  • Based on actual experience, not programmer
    assumptions
  • Less (human) time needed for a good solution

7
Mathematics of RL
  • Before we talk about RL, we need to cover some
    background material
  • Some simple decision theory
  • Markov Decision Processes
  • Value functions

8
Making Single Decisions
[Diagram: a single decision from state 0; action A yields reward 1, action B yields reward 2]
  • Single decision to be made
  • Multiple discrete actions
  • Each action has a reward associated
    with it
  • Goal is to maximize reward
  • Not hard: just pick the action with the largest
    reward
  • State 0 has a value of 2
  • Sum of rewards from taking the best action from
    the state

9
Markov Decision Processes
  • We can generalize the previous example to
    multiple sequential decisions
  • Each decision affects subsequent decisions
  • This is formally modeled by a Markov Decision
    Process (MDP)

[Diagram: example MDP with states 0-5 and actions A/B; transitions 0→1 (reward 1), 0→2 (reward 2), 1→3 (reward 1), 1→4 (reward 1), 2→4 (reward -1000), 3→5 (reward 1), 4→5 (reward 10)]
10
Markov Decision Processes
  • Formally, an MDP is
  • A set of states, S = {s1, s2, ..., sn}
  • A set of actions, A = {a1, a2, ..., am}
  • A reward function, R : S × A × S → ℝ
  • A transition function, T : S × A → S
  • We want to learn a policy, π : S → A
  • Maximize the sum of rewards we see over our
    lifetime
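The example MDP from these slides can be written out as plain Python tables; the action labels for states 1-4 are assumptions read off the figure, but the rewards match the policy sums on slide 12:

# Deterministic example MDP: T[s][a] is the next state, R[s][a] the reward.
T = {
    0: {"A": 1, "B": 2},
    1: {"A": 3, "B": 4},
    2: {"A": 4},
    3: {"A": 5},
    4: {"A": 5},
}
R = {
    0: {"A": 1, "B": 2},
    1: {"A": 1, "B": 1},
    2: {"A": -1000},
    3: {"A": 1},
    4: {"A": 10},
}
TERMINAL = {5}  # state 5 ends the episode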

11
Policies
  • There are 3 policies for this MDP
  • 0 → 1 → 3 → 5
  • 0 → 1 → 4 → 5
  • 0 → 2 → 4 → 5
  • Which is the best one?

[Diagram: the same example MDP as on slide 9]
12
Comparing Policies
  • Order policies by how much reward they see
  • 0 → 1 → 3 → 5: 1 + 1 + 1 = 3
  • 0 → 1 → 4 → 5: 1 + 1 + 10 = 12
  • 0 → 2 → 4 → 5: 2 - 1000 + 10 = -988
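Summing the rewards along each path reproduces these totals (using the T and R tables sketched under slide 10):

# Return of a policy given as the sequence of states it visits.
def policy_return(path):
    total = 0
    for s, s_next in zip(path, path[1:]):
        # Find the action that links the two consecutive states.
        a = next(a for a, t in T[s].items() if t == s_next)
        total += R[s][a]
    return total

for path in ([0, 1, 3, 5], [0, 1, 4, 5], [0, 2, 4, 5]):
    print(path, policy_return(path))  # -> 3, 12, -988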

[Diagram: the same example MDP as on slide 9]
13
Value Functions
  • We can define value without specifying the policy
  • Specify the value of taking action a from state s
    and then performing optimally
  • This is the state-action value function, Q

How do you tell which action to take from each
state?
[Diagram: the same example MDP as on slide 9]
14
Value Functions
  • So, we have the value function
  • Q(s, a) = R(s, a, s') + maxa' Q(s', a')
  • In the form of: the next reward plus the best I
    can do from the next state
  • These extend to probabilistic actions

s' is the next state
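For the deterministic example MDP, this recursion can be solved by sweeping the update until it settles; a minimal sketch using the T, R and TERMINAL tables from slide 10:

# Iterate Q(s, a) = R(s, a, s') + max_a' Q(s', a') to a fixed point.
def compute_q(sweeps=10):
    Q = {s: {a: 0.0 for a in T[s]} for s in T}
    for _ in range(sweeps):
        for s in T:
            for a in T[s]:
                s_next = T[s][a]
                best = 0.0 if s_next in TERMINAL else max(Q[s_next].values())
                Q[s][a] = R[s][a] + best
    return Q  # e.g. Q[0]["A"] settles at 12, Q[0]["B"] at -988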
15
Getting the Policy
  • If we have the value function, then finding the
    best policy is easy
  • π(s) = arg maxa Q(s, a)
  • We're looking for the optimal policy, π*(s)
  • No policy generates more reward than π*
  • The optimal policy defines the optimal value
    functions
  • The easiest way to learn the optimal policy is to
    learn the optimal value function first
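Extracting the greedy policy from Q is then one line per state:

# pi(s) = argmax_a Q(s, a)
def greedy_policy(Q):
    return {s: max(Q[s], key=Q[s].get) for s in Q}

pi = greedy_policy(compute_q())
# -> {0: 'A', 1: 'B', 2: 'A', 3: 'A', 4: 'A'}, i.e. the path 0 → 1 → 4 → 5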

16
Collaborative Reinforcement Learning to
Adaptively Optimize MANET Routing
  • Jim Dowling, Eoin Curran, Raymond Cunningham and
    Vinny Cahill

17
Overview
  • Building autonomic distributed systems with
    self-* properties
  • Self-Organizing
  • Self-Healing
  • Self-Optimizing
  • Add collaborative learning mechanism to
    self-adaptive component model
  • Improved ad-hoc routing protocol

18
Introduction
  • Autonomous distributed systems will consist of
    interacting components free from human
    interference
  • Existing top-down management and programming
    solutions require too much global state
  • Bottom-up: a decentralized collection of
    components that make their own decisions based
    on local information
  • System-wide self-* behavior emerges from
    interactions

19
Self-* Behavior
  • Self-Adaptive components that change structure
    and/or behavior at run-time
  • Adapt to discovered faults
  • Reduced performance
  • Requires active monitoring of component states
    and external dependencies

20
Self-* Distributed Systems using Distributed
(collaborative) Reinforcement Learning
  • For complex systems, programmers cannot be
    expected to describe all conditions
  • Self-adaptive behavior learnt by components
  • Decentralized co-ordination of components to
    support system-wide properties
  • Distributed Reinforcement Learning (DRL) is an
    extension of RL that uses only neighbor
    interactions

21
Model-Based Reinforcement Learning
Markov Decision Process: States, Actions,
R : States × Actions → ℝ
1. Action reward
2. State transition model
3. Next-state reward
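A hedged sketch of one model-based backup in this style (R_hat and P_hat are assumed names for the learned reward and transition models, not the paper's API):

# Combine (1) the action reward, (2) the state transition model and
# (3) the value of the next state into one updated estimate.
def backup(Q, R_hat, P_hat, s, a):
    # P_hat[s][a] maps each possible next state to its probability.
    expected_next = sum(p * max(Q[s2].values())
                        for s2, p in P_hat[s][a].items())
    Q[s][a] = R_hat[s][a] + expected_next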
22
Decentralised System Optimisation
  • Coordinating the solution to a set of Discrete
    Optimisation Problems (DOPs)
  • Components have a Partial System View
  • Coordination Actions
  • Actions = {delegation} ∪ {DOP actions} ∪
    {discovery}
  • Connection Costs

23
Collaborative Reinforcement Learning
  • Advertisement
  • Update Partial Views of Neighbours
  • Decay
  • Negative Feedback on State Values in the Absence
    of Advertisements

[Diagram: the CRL value update, combining the action reward, the connection cost, the state transition model, and cached neighbours' V-values]
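An illustrative sketch of advertisement and decay (class and method names are assumptions, not the paper's API):

import math, time

# Each component caches neighbours' advertised V-values and decays them
# between advertisements, so stale information gradually loses influence.
class NeighbourCache:
    def __init__(self, decay_rate=0.1):
        self.decay_rate = decay_rate
        self.entries = {}  # neighbour -> (advertised V-value, timestamp)

    def advertise(self, neighbour, v_value):
        self.entries[neighbour] = (v_value, time.time())

    def decayed_v(self, neighbour):
        # Exponential decay is one plausible form of negative feedback.
        v, t = self.entries[neighbour]
        return v * math.exp(-self.decay_rate * (time.time() - t))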
24
Adaptation in CRL System
  • A feedback process responding to
  • Changes in the optimal policy of any RL agent
  • Changes in the system environment
  • The passing of time

25
SAMPLE: Ad-hoc Routing using DRL
  • Probabilistic ad-hoc routing protocol based on
    DRL
  • Adaptation of network traffic around areas of
    congestion
  • Exploitation of stable routes
  • Routing decisions based on local information and
    information obtained from neighbors
  • Outperforms Ad-hoc On-Demand Distance Vector
    (AODV) routing and Dynamic Source Routing (DSR)
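One standard way to make such routing decisions probabilistic is a Boltzmann distribution over next-hop Q-values; this sketch (the temperature parameter tau is an assumption) trades exploiting stable routes against exploring around congestion:

import math, random

def choose_next_hop(q_values, tau=1.0):
    # q_values: neighbour -> estimated routing Q-value at this node.
    hops = list(q_values)
    weights = [math.exp(q_values[h] / tau) for h in hops]
    return random.choices(hops, weights=weights, k=1)[0]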

26
SAMPLE: A CRL System (I)
27
SAMPLE: A CRL System (II)
28
SAMPLE: A CRL System (III)
29
Performance
  • Metric
  • Maximize
  • throughput
  • ratio of delivered packets to undelivered packets
  • Minimize
  • number of transmissions required per packet sent
  • Figures 5-10

30
Observations/Questions
  • How general is this approach?
  • Is it easy to represent problems in DRL?
  • Does DRL work for all problems?
  • Needs many more examples
  • Separation of self-* behavior from functional
    components?
  • What guarantees of optimization are there?
  • Does it work for problems requiring system-wide
    guarantees?
  • Learning algorithms can and do reduce performance
    at intermediate stages
  • More suited to off-line discovery?