Title: Collaborative Reinforcement Learning
1 Collaborative Reinforcement Learning
2 Credits
- Reinforcement Learning: A User's Guide. Bill Smart, ICAC 2005.
- Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems, pages 700-704, 2004. Winner of the Best Paper Award.
3 What is RL?
- "A way of programming agents by reward and punishment without needing to specify how the task is to be achieved" - Kaelbling, Littman and Moore, 1996
4 Basic RL Model
- Observe state, s_t
- Decide on an action, a_t
- Perform action
- Observe new state, s_(t+1)
- Observe reward, r_(t+1)
- Learn from experience
- Repeat
- Goal: Find a control policy that will maximize the observed rewards over the lifetime of the agent
[Diagram: agent-environment loop - the agent sends actions (A) to the environment and receives states (S) and rewards (R).]
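To make the loop concrete, here is a minimal Python sketch of the observe-act-learn cycle; the env and agent objects and their methods (reset, step, choose_action, update) are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the basic RL loop (interfaces are illustrative assumptions).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # observe initial state s_t
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)               # decide on an action a_t
        next_state, reward, done = env.step(action)       # perform it, observe s_(t+1), r_(t+1)
        agent.update(state, action, reward, next_state)   # learn from experience
        total_reward += reward
        state = next_state                                 # repeat
        if done:
            break
    return total_reward
```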
5 An Example: Gridworld
- Canonical RL domain
- States are grid cells
- 4 actions: N, S, E, W
- Reward of +1 for entering the top right cell
- -0.01 for every other move
- Maximizing the sum of rewards → the shortest path, in this instance
[Diagram: gridworld with a +1 reward in the top right cell.]
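A minimal sketch of such a gridworld environment, assuming a 5x5 grid (the slide does not give the grid size); the reward structure matches the bullets above: +1 for entering the top right cell, -0.01 for every other move.

```python
# Minimal gridworld sketch: grid size is an assumption, goal is the top right cell.
class Gridworld:
    ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)   # top right cell
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.state
        # Moves that would leave the grid keep the agent in place.
        nx = min(max(x + dx, 0), self.size - 1)
        ny = min(max(y + dy, 0), self.size - 1)
        self.state = (nx, ny)
        reward = 1.0 if self.state == self.goal else -0.01
        done = self.state == self.goal
        return self.state, reward, done
```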
6 The Promise of RL
- Specify what to do, but not how to do it
- Through the reward function
- Learning fills in the details
- Better final solutions
- Based on actual experiences, not programmer assumptions
- Less (human) time needed for a good solution
7 Mathematics of RL
- Before we talk about RL, we need to cover some background material
- Some simple decision theory
- Markov Decision Processes
- Value functions
8 Making Single Decisions
- Single decision to be made
- Multiple discrete actions
- Each action has a reward associated with it
- Goal is to maximize reward
- Not hard: just pick the action with the largest reward
- State 0 has a value of 2
- Value = the sum of rewards from taking the best action from the state
[Diagram: from state 0, action A gives reward 1 and action B gives reward 2.]
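In code, the single-decision case is just an argmax over per-action rewards; the numbers below are the ones from the diagram (A gives 1, B gives 2).

```python
# Rewards for each action from state 0, as in the diagram: A -> 1, B -> 2.
rewards = {"A": 1, "B": 2}

best_action = max(rewards, key=rewards.get)   # "B"
state_value = rewards[best_action]            # 2, the value of state 0
print(best_action, state_value)
```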
9 Markov Decision Processes
- We can generalize the previous example to multiple sequential decisions
- Each decision affects subsequent decisions
- This is formally modeled by a Markov Decision Process (MDP)
[Diagram: example MDP with states 0-5 and actions A and B. From state 0, one action leads to state 1 (reward 1) and the other to state 2 (reward 2); from state 1 to state 3 (reward 1) or state 4 (reward 1); from state 2 to state 4 (reward -1000); from state 3 to state 5 (reward 1); from state 4 to state 5 (reward 10).]
10 Markov Decision Processes
- Formally, an MDP is
- A set of states, S = {s1, s2, ... , sn}
- A set of actions, A = {a1, a2, ... , am}
- A reward function, R: S × A × S → ℝ
- A transition function, T(s, a, s') = P(s' | s, a)
- We want to learn a policy, π: S → A
- Maximize the sum of rewards we see over our lifetime
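As a concrete sketch, the example MDP used in these slides can be written down as plain Python dictionaries. The transitions are deterministic here, and the action labels on some edges are assumptions read off the diagram; the rewards are the ones used on the policy-comparison slide.

```python
# Example MDP from the slides as dictionaries (deterministic transitions assumed).
# transitions[(state, action)] = next_state
# rewards[(state, action, next_state)] = reward
states = [0, 1, 2, 3, 4, 5]
actions = ["A", "B"]

transitions = {
    (0, "A"): 1, (0, "B"): 2,
    (1, "A"): 3, (1, "B"): 4,
    (2, "A"): 4,
    (3, "A"): 5,
    (4, "A"): 5,
}

rewards = {
    (0, "A", 1): 1, (0, "B", 2): 2,
    (1, "A", 3): 1, (1, "B", 4): 1,
    (2, "A", 4): -1000,
    (3, "A", 5): 1,
    (4, "A", 5): 10,
}
```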
11 Policies
- There are 3 policies for this MDP
- 0 → 1 → 3 → 5
- 0 → 1 → 4 → 5
- 0 → 2 → 4 → 5
- Which is the best one?
[Diagram: the same example MDP as on slide 9.]
12 Comparing Policies
- Order policies by how much reward they see
- 0 → 1 → 3 → 5: 1 + 1 + 1 = 3
- 0 → 1 → 4 → 5: 1 + 1 + 10 = 12
- 0 → 2 → 4 → 5: 2 - 1000 + 10 = -988
[Diagram: the same example MDP as on slide 9.]
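The three totals can be checked with a few lines of Python; the per-step rewards are the ones listed above.

```python
# Rewards collected along each policy's path through the example MDP.
paths = {
    "0 -> 1 -> 3 -> 5": [1, 1, 1],
    "0 -> 1 -> 4 -> 5": [1, 1, 10],
    "0 -> 2 -> 4 -> 5": [2, -1000, 10],
}

for path, rs in paths.items():
    print(path, "return =", sum(rs))   # 3, 12, -988
```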
13 Value Functions
- We can define value without specifying the policy
- Specify the value of taking action a from state s and then performing optimally
- This is the state-action value function, Q
- How do you tell which action to take from each state?
[Diagram: the same example MDP as on slide 9.]
14 Value Functions
- So, we have the value function
- Q(s, a) = R(s, a, s') + max_a' Q(s', a')
- In the form of
- Next reward plus the best I can do from the next state
- These extend to probabilistic actions
- Q(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + max_a' Q(s', a') ]
- s' is the next state
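For the deterministic example MDP, this Q equation can be solved by simple value iteration. The sketch below reuses the assumed transition and reward tables from the slide 10 sketch and treats state 5 as terminal with value 0.

```python
# Q-value iteration on the deterministic example MDP (terminal state 5 has value 0).
transitions = {(0, "A"): 1, (0, "B"): 2, (1, "A"): 3, (1, "B"): 4,
               (2, "A"): 4, (3, "A"): 5, (4, "A"): 5}
rewards = {(0, "A"): 1, (0, "B"): 2, (1, "A"): 1, (1, "B"): 1,
           (2, "A"): -1000, (3, "A"): 1, (4, "A"): 10}

Q = {sa: 0.0 for sa in transitions}

def best_value(state):
    # max_a' Q(s', a'); 0 if the state has no outgoing actions (terminal).
    vals = [Q[(s, a)] for (s, a) in Q if s == state]
    return max(vals) if vals else 0.0

for _ in range(10):                      # a few sweeps suffice for this small MDP
    for (s, a), s_next in transitions.items():
        Q[(s, a)] = rewards[(s, a)] + best_value(s_next)

print(Q[(0, "A")], Q[(0, "B")])          # 12.0 and -988.0
```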
15 Getting the Policy
- If we have the value function, then finding the best policy is easy
- π(s) = arg max_a Q(s, a)
- We're looking for the optimal policy, π*(s)
- No policy generates more reward than π*
- The optimal policy defines the optimal value function
- Q*(s, a) = R(s, a, s') + max_a' Q*(s', a')
- The easiest way to learn the optimal policy is to learn the optimal value function first
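Given a Q table like the one computed above, extracting the greedy policy is one argmax per state. The sketch assumes the same (state, action) dictionary layout as before.

```python
# Greedy policy extraction: pi(s) = arg max_a Q(s, a).
def greedy_policy(Q):
    policy = {}
    for (s, a), value in Q.items():
        if s not in policy or value > Q[(s, policy[s])]:
            policy[s] = a
    return policy

Q = {(0, "A"): 12.0, (0, "B"): -988.0, (1, "A"): 2.0, (1, "B"): 11.0}
print(greedy_policy(Q))   # {0: 'A', 1: 'B'}
```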
16 Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing
- Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill
17 Overview
- Building autonomic distributed systems with self-* properties
- Self-Organizing
- Self-Healing
- Self-Optimizing
- Add a collaborative learning mechanism to a self-adaptive component model
- Improved ad-hoc routing protocol
18 Introduction
- Autonomous distributed systems will consist of interacting components free from human interference
- Existing top-down management and programming solutions require too much global state
- A bottom-up, decentralized collection of components that make their own decisions based on local information
- System-wide self-* behavior emerges from interactions
19 Self-* Behavior
- Self-adaptive components that change structure and/or behavior at run-time
- Adapt to discovered faults
- Reduced performance
- Requires active monitoring of component states and external dependencies
20 Self-* Distributed Systems using Distributed (Collaborative) Reinforcement Learning
- For complex systems, programmers cannot be expected to describe all conditions
- Self-adaptive behavior is learnt by components
- Decentralized co-ordination of components to support system-wide properties
- Distributed Reinforcement Learning (DRL) is an extension to RL that uses only neighbor interactions
21 Model-Based Reinforcement Learning
Markov Decision Process: States, Actions, R(States, Actions) → ℝ
1. Action Reward
2. State Transition Model
3. Next State Reward
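A rough sketch of how a model-based agent combines these three ingredients into a Q estimate; the variable names and data layout are illustrative assumptions, not taken from the paper.

```python
# Model-based backup sketch (names and layout are illustrative assumptions):
# R[s][a]     - estimated action reward
# P[s][a][s'] - estimated state transition probability
# V[s']       - current estimate of the next state's value ("next state reward")
def q_estimate(s, a, P, R, V):
    return R[s][a] + sum(prob * V[s_next] for s_next, prob in P[s][a].items())

P = {"s0": {"forward": {"s1": 0.9, "s0": 0.1}}}
R = {"s0": {"forward": 1.0}}
V = {"s0": 0.0, "s1": 5.0}
print(q_estimate("s0", "forward", P, R, V))   # 1.0 + 0.9*5.0 + 0.1*0.0 = 5.5
```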
22 Decentralised System Optimisation
- Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)
- Components have a Partial System View
- Coordination Actions
- Actions = delegation ∪ DOP actions ∪ discovery
- Connection Costs
23 Collaborative Reinforcement Learning
- Advertisement
- Update Partial Views of Neighbours
- Decay
- Negative Feedback on State Values in the Absence of Advertisements
[Diagram: the CRL update combines the action reward, the connection cost, the state transition model and the cached neighbours' V-values.]
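The exact CRL update rule is given in the paper; the sketch below only illustrates the two mechanisms named on this slide, with assumed names and an assumed decay schedule: cached neighbour V-values are refreshed by advertisements and decayed in their absence.

```python
import time

# Illustrative-only sketch: names, decay factor and timing policy are assumptions,
# not the definitions from the CRL paper.
class NeighbourCache:
    def __init__(self, decay_factor=0.9, staleness=5.0):
        self.decay_factor = decay_factor   # multiplicative negative feedback
        self.staleness = staleness         # seconds without an advertisement
        self.views = {}                    # neighbour -> [cached V-value, last advert time]

    def on_advertisement(self, neighbour, v_value):
        # Advertisement: a neighbour pushes its V-value, refreshing our partial view.
        self.views[neighbour] = [v_value, time.time()]

    def decay(self):
        # Decay: apply negative feedback to cached values whose advertisements are stale.
        now = time.time()
        for neighbour, entry in self.views.items():
            if now - entry[1] > self.staleness:
                entry[0] *= self.decay_factor
```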
24 Adaptation in a CRL System
- A feedback process responding to
- Changes in the optimal policy of any RL agent
- Changes in the system environment
- The passing of time
25 SAMPLE: Ad-hoc Routing using DRL
- Probabilistic ad-hoc routing protocol based on DRL
- Adaptation of network traffic around areas of congestion
- Exploitation of stable routes
- Routing decisions based on local information and information obtained from neighbors
- Outperforms Ad-hoc On-Demand Distance Vector Routing (AODV) and Dynamic Source Routing (DSR)
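SAMPLE selects next hops probabilistically rather than always greedily. The Boltzmann (softmax) selection below is one common way to implement such a choice over per-neighbour Q-values; it is an illustration under that assumption, not necessarily SAMPLE's exact rule.

```python
import math
import random

# Illustrative Boltzmann (softmax) choice of next hop from per-neighbour Q-values.
def choose_next_hop(q_values, temperature=1.0):
    neighbours = list(q_values)
    weights = [math.exp(q_values[n] / temperature) for n in neighbours]
    return random.choices(neighbours, weights=weights, k=1)[0]

print(choose_next_hop({"node_a": 2.0, "node_b": 1.0, "node_c": -5.0}))
```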
26 SAMPLE: A CRL System (I)
27 SAMPLE: A CRL System (II)
28 SAMPLE: A CRL System (III)
29 Performance
- Metrics
- Maximize
- throughput
- ratio of delivered packets to undelivered packets
- Minimize
- number of transmissions required per packet sent
- Figures 5-10
30 Observations/Questions
- How general is this approach?
- Is it easy to represent problems in DRL?
- Does DRL work for all problems?
- Needs many more examples
- Separation of self-* behavior from functional components?
- What guarantees of optimization are there?
- Does it work for problems requiring system-wide guarantees?
- Learning algorithms can and do reduce performance during intermediate stages
- More suited to off-line discovery?