Reward Functions for Accelerated Learning

1
Reward Functions for Accelerated Learning
  • Presented by Alp Sardag

2
Why RL?
  • RL is a methodology of choice for learning in a
    variety of different domains.
  • Convergence property.
  • Potential biological relevance.
  • RL is good in
  • Game playing
  • Simulations

3
Cause of Failure
  • The fundamental assumption of RL models is that
    the agent-environment interaction can be modeled
    as an MDP.
  • The agent (A) and environment (E) are
    synchronized finite state automata.
  • A and E interact in discrete time intervals.
  • A can sense the state of E and use it to act.
  • After A acts, E transitions to a new state.
  • A receives a reward after performing an action.
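
A minimal sketch of the synchronized agent-environment loop these assumptions
describe; the toy chain environment and the random action choice are
illustrative, not part of the presentation:

    import random

    class ToyEnv:
        """A 5-state chain; action 1 moves right, action 0 moves left."""
        def __init__(self):
            self.state = 0

        def step(self, action):
            # E transitions to a new state in response to A's action ...
            self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
            # ... and A receives a reward after acting.
            reward = 1.0 if self.state == 4 else 0.0
            return self.state, reward

    env = ToyEnv()
    state = env.state
    for t in range(20):                    # discrete, synchronized time steps
        action = random.choice([0, 1])     # A senses the state of E and acts
        state, reward = env.step(action)
        print(t, state, reward)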

4
States vs. Descriptors
  • Traditional RL depends on accurate state
    information, whereas in physical robot
    environments
  • Even for the simplest agents the state space is
    very large.
  • Sensor inputs are noisy.
  • The agent usually perceives only local
    information.

5
Transitions vs. Events
  • World and agent states change asynchronously, in
    response to events, not all of which are caused
    by the agent.
  • The same event can vary in duration under
    different circumstances and have different
    consequences.
  • Nondeterministic and stochastic models are closer
    to the real world. However, the information
    needed to build a stochastic model is usually not
    available.

6
Learning Trials
  • Generating a complete policy requires a search
    over a very large state space.
  • In the real world, the agent cannot choose which
    states it will transition to, and cannot visit
    all states.
  • Convergence in the real world depends on focusing
    only on the relevant parts of the state space.
  • The better the problem is formulated, the fewer
    learning trials are needed.

7
Reinforcement vs. Feedback
  • Current RL work uses two types of reward
  • Immediate
  • Delayed
  • Real-world situations tend to fall between these
    two extremes
  • Some immediate rewards
  • Plenty of intermittent rewards
  • Few very delayed rewards
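
To make the distinction concrete, a small illustration of the three reward
schedules over one hypothetical 10-step episode; the subgoal steps and
magnitudes are made up:

    T = 10
    immediate    = [0.1] * T                          # feedback at every step
    delayed      = [0.0] * (T - 1) + [1.0]            # feedback only at the goal
    intermittent = [0.5 if t in (3, 7) else 0.0
                    for t in range(T - 1)] + [1.0]    # occasional subgoal feedback
    print(immediate, delayed, intermittent, sep="\n")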

8
Multiple Goals
  • Traditional RL deals with specialized problems in
    which the learning task can be specified with a
    single goal. The problems
  • A very specific task is learned
  • It conflicts with any future learning
  • The extensions
  • Sequentially formulated goals, where the state
    space explicitly encodes which goals have been
    reached so far.
  • Separate state spaces and reward functions for
    each goal (see the sketch below).
  • W-learning: competition among selfish Q-learners.
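
A rough sketch of the "separate state space and reward function per goal"
extension: one Q-table per goal, each updated from its own reward signal. The
arbitration rule here (act for the goal with the largest value gap) is only a
crude stand-in for W-learning's competition between selfish Q-learners; all
names are illustrative:

    from collections import defaultdict

    ACTIONS = ["search", "home", "avoid"]

    class GoalLearner:
        def __init__(self, alpha=0.1, gamma=0.9):
            self.q = defaultdict(float)        # (state, action) -> value
            self.alpha, self.gamma = alpha, gamma

        def update(self, s, a, r, s_next):
            best_next = max(self.q[(s_next, b)] for b in ACTIONS)
            self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

        def gap(self, s):
            # How strongly this goal cares about what happens in state s.
            vals = [self.q[(s, a)] for a in ACTIONS]
            return max(vals) - min(vals)

        def best_action(self, s):
            return max(ACTIONS, key=lambda a: self.q[(s, a)])

    learners = {"deliver": GoalLearner(), "avoid_intruder": GoalLearner()}

    def select_action(state):
        goal = max(learners, key=lambda g: learners[g].gap(state))
        return learners[goal].best_action(state)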

9
Goal
  • Given the complexity and uncertainty of
    real-world domains, the goal is a learning model
    that minimizes the state space and maximizes the
    amount of learning at each trial.

10
Intermediate Rewards
  • Intermittent rewards can be introduced
  • By reinforcing multiple goals and by using
    progress estimators.
  • Heterogeneous reinforcement function: since
    multiple goals exist in the real world, it is
    natural to reinforce each goal individually
    rather than through a single monolithic reward.
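
A minimal sketch of a heterogeneous reinforcement function: each goal
contributes its own reinforcer and the contributions are summed. The event
names and magnitudes are assumptions for illustration only:

    def heterogeneous_reward(events):
        """events: set of strings describing what just happened."""
        r = 0.0
        if "grasped_puck" in events:          r += 1.0   # subgoal reached
        if "dropped_puck_at_home" in events:  r += 3.0   # top-level goal reached
        if "bumped_obstacle" in events:       r -= 1.0   # punishment
        if "woke_up_at_night" in events:      r -= 0.5
        return r

    print(heterogeneous_reward({"grasped_puck", "bumped_obstacle"}))   # 0.0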

11
Progress Estimators
  • Partial internal critics associated with specific
    goals provide a metric of improvement relative to
    those goals. They are important in noisy worlds
  • They decrease the learner's sensitivity to
    intermittent errors.
  • They encourage exploration; without them, the
    agent can thrash, repeatedly attempting
    inappropriate behaviors.
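
A minimal sketch of one such internal critic, assuming a homing progress
estimator that rewards a decrease in the distance to home while a puck is
carried; the magnitudes are made up:

    class HomingProgress:
        def __init__(self):
            self.prev_distance = None

        def __call__(self, distance_to_home, carrying_puck):
            if not carrying_puck:
                self.prev_distance = None      # only active while carrying a puck
                return 0.0
            r = 0.0
            if self.prev_distance is not None:
                if distance_to_home < self.prev_distance:
                    r = 0.2                    # measurable progress toward home
                elif distance_to_home > self.prev_distance:
                    r = -0.2                   # regress: discourage thrashing
            self.prev_distance = distance_to_home
            return r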

12
Experimental Design
  • To validate the proposed approach, experiments
    were designed to compare the new RL formulation
    with traditional RL.
  • Robots
  • Learning Task
  • Learning Algorithm
  • Control Algorithm

13
Robots
  • The experiments use four fully autonomous R2
    mobile robots, each equipped with
  • A differentially steerable base
  • A gripper for lifting objects
  • A piezo-electric bump sensor for detecting
    contact-collisions and monitoring the grasping
    force.
  • A set of IR sensors for obstacle avoidance.
  • Radio transceivers, used for determining absolute
    position.

14
Robot Algorithm
  • The robots are programmed in the Behavior
    Language
  • Based on the subsumption architecture.
  • The parallel control system is formed from
    concurrently active behaviors, some of which
    gather information, some drive effectors, and
    some monitor progress and contribute
    reinforcement.

15
The Learning Task
  • The learning task consists of finding a mapping
    of all conditions and behaviors into the most
    efficient policy for group foraging.
  • Basic behaviors from which to learn behavior
    selection
  • Avoiding
  • Searching
  • Resting
  • Dispersing
  • Homing

16
The Learning Task Cont.
  • The state space can be reduced to the
    cross-product of the following state variables
  • Have-puck?
  • At-home?
  • Near-intruder?
  • Night-time?
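
The reduced state space is therefore the cross-product of four binary
predicates, i.e. 2^4 = 16 states, as a quick enumeration shows:

    from itertools import product

    PREDICATES = ["have_puck", "at_home", "near_intruder", "night_time"]
    STATES = list(product([False, True], repeat=len(PREDICATES)))
    print(len(STATES))                        # 16
    print(dict(zip(PREDICATES, STATES[5])))   # one example state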

17
Learning Task Cont.
  • Some behaviors are kept instinctive because
    learning them would have a high cost
  • As soon as the robot detects a puck between its
    fingers, it grasps it.
  • As soon as the robot reaches the home region, it
    drops the puck if it is carrying one.
  • Whenever the robot is too near an obstacle, it
    avoids it.
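
A small sketch of how such hard-wired reflexes might sit in front of the
learned behavior selection; the sensor field names are assumptions:

    def reflexes(sensors):
        if sensors.get("puck_between_fingers"):
            return "grasp"                    # grasp a detected puck immediately
        if sensors.get("at_home") and sensors.get("carrying_puck"):
            return "drop"                     # drop the puck in the home region
        if sensors.get("obstacle_too_near"):
            return "avoid"                    # reflexive obstacle avoidance
        return None                           # fall through to learned selection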

18
The learning Algorithm
  • The algorithm produces and maintains a matrix
    that stores the appropriateness of each behavior
    in each condition.
  • The values in the matrix fluctuate over time
    based on the received reinforcement, and are
    updated asynchronously whenever a reward is
    received.

19
The Learning Algorithm
20
The Learning Algorithm
  • The algorithm sums the reinforcement over time.
  • The influence of the different types of feedback
    is weighted by feedback constants.
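
A rough sketch of this update, assuming a matrix A(condition, behavior) that
accumulates weighted reinforcement; the feedback types and weight values are
placeholders, not the presentation's actual constants:

    from collections import defaultdict

    WEIGHTS = {"event": 1.0, "intruder_progress": 0.3, "homing_progress": 0.3}

    A = defaultdict(float)   # (condition, behavior) -> summed appropriateness

    def reinforce(condition, behavior, feedback):
        """feedback: dict mapping feedback type -> raw reinforcement value."""
        A[(condition, behavior)] += sum(WEIGHTS[k] * v for k, v in feedback.items())

    reinforce(("have_puck", "not_at_home"), "homing",
              {"event": 0.0, "homing_progress": 0.2, "intruder_progress": 0.0})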

21
The Control Algorithm
  • Whenever an event is detected, the following
    control sequence is executed
  • The appropriate reinforcement is delivered for
    the current condition-behavior pair,
  • The current behavior is terminated,
  • Another behavior is selected.
  • Behaviors are selected according to the following
    rule
  • Choose an untried behavior if one is available.
  • Otherwise choose the best-valued behavior, as
    sketched below.
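
A minimal, self-contained sketch of this event-driven rule; the behavior set
matches the earlier slide, everything else is illustrative:

    import random
    from collections import defaultdict

    BEHAVIORS = ["avoiding", "searching", "resting", "dispersing", "homing"]
    A = defaultdict(float)          # (condition, behavior) -> learned value
    tried = defaultdict(set)        # condition -> behaviors already tried

    def on_event(condition, current_behavior, reward):
        A[(condition, current_behavior)] += reward   # deliver reinforcement
        tried[condition].add(current_behavior)       # current behavior terminates
        untried = [b for b in BEHAVIORS if b not in tried[condition]]
        if untried:
            return random.choice(untried)            # try untried behaviors first
        return max(BEHAVIORS, key=lambda b: A[(condition, b)])   # else the best one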

22
Experimental Results
  • The following three approaches are compared
  • A monolithic single-goal reward function (puck
    delivery to the home region) using Q-learning:
    R(t) = P(t)
  • A heterogeneous reinforcement function using
    multiple goals: R(t) = E(t)
  • A heterogeneous reinforcement function using
    multiple goals and two progress estimator
    functions: R(t) = E(t) + I(t) + H(t)
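
Spelled out, the three compositions being compared are as follows, assuming P
is the monolithic puck-delivery reward, E the heterogeneous per-goal
reinforcement, and I, H the intruder-avoidance and homing progress estimators:

    def R_monolithic(P, E, I, H):      return P            # R(t) = P(t)
    def R_heterogeneous(P, E, I, H):   return E            # R(t) = E(t)
    def R_full(P, E, I, H):            return E + I + H    # R(t) = E(t) + I(t) + H(t)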

23
Experimental Results
  • Values are collected twice per minute.
  • The final learning values are collected after a
    15-minute run.
  • Convergence is defined in terms of the relative
    ordering of condition-behavior pairs (a check of
    this kind is sketched below).
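
A sketch of such a convergence check: learning has converged for a condition
once the learned values rank the behaviors in the desired relative order; the
example ordering is hypothetical:

    def converged(values, desired_order):
        """values: behavior -> learned value; desired_order: best-to-worst list."""
        learned_order = sorted(values, key=values.get, reverse=True)
        return learned_order == desired_order

    print(converged({"homing": 2.1, "searching": 0.4, "resting": -0.3},
                    ["homing", "searching", "resting"]))   # True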

24
Evaluation
  • Given the nondeterminism and the noisy sensor
    inputs, the single-goal reward provides
    insufficient feedback and was vulnerable to
    interference.
  • The second learning strategy outperforms the
    first because it detects the achievement of
    subgoals on the way to the top-level goal of
    depositing pucks at home.
  • The complete heterogeneous reinforcement with
    progress estimators outperforms the others
    because it uses all of the available information
    for every condition and behavior.

25
Additional Evaluation
  • Each part of the policy was evaluated separately,
    according to the following criteria
  • Number of trials required,
  • Correctness,
  • Stability.
  • Some condition-behavior pairs proved much more
    difficult to learn than others, in particular
  • Those without a progress estimator
  • Those involving rare states

26
Discussion
  • Summing reinforcement
  • Scaling
  • Transition models

27
Summing Reinforcement
  • Allows for oscillations.
  • In theory, the more reinforcement, the faster the
    learning; in practice, noise and error can have
    the opposite effect.
  • The experiments described here demonstrate that
    even with a significant amount of noise, multiple
    reinforcers and progress estimators significantly
    accelerate learning.

28
Scaling
  • Interference was detrimental to all three
    approaches.
  • In terms of the amount of time required, the
    learned group foraging strategy outperformed
    hand-coded greedy agent strategies.
  • Foraging can be improved further by minimizing
    interference, e.g. by letting only one robot move
    at a time.

29
Transition Models
  • In noisy and uncertain environments, a transition
    model is not available to aid the learner.
  • The absence of a model made it difficult to
    compute discounted future reward.
  • Future work includes applying this approach to
    problems that involve incomplete and approximate
    state transition models.