Reward Functions for Accelerated Learning

1
Reward Functions for Accelerated Learning
  • Presented by Alp Sardag

2
Why RL?
  • RL is a methodology of choice for learning in a
    variety of different domains.
  • Convergence property.
  • Potential biological relevance.
  • RL is good in
  • Game playing
  • Simulations

3
Cause of Failure
  • The fundamental assumption of RL models is that
    the agent-environment interaction can be modeled
    as an MDP.
  • The agent (A) and environment (E) are
    synchronized finite state automata.
  • A and E interact in discrete time intervals.
  • A can sense the state of E and use it to act.
  • After A acts, E transitions to a new state.
  • A receives a reward after performing an action.
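
A minimal sketch of the synchronized agent-environment loop these assumptions
describe; the toy chain environment and the random action choice are
illustrative, not part of the presentation:

    import random

    class ToyEnv:
        """A 5-state chain; action 1 moves right, action 0 moves left."""
        def __init__(self):
            self.state = 0

        def step(self, action):
            # E transitions to a new state in response to A's action ...
            self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
            # ... and A receives a reward after acting.
            reward = 1.0 if self.state == 4 else 0.0
            return self.state, reward

    env = ToyEnv()
    state = env.state
    for t in range(20):                    # discrete, synchronized time steps
        action = random.choice([0, 1])     # A senses the state of E and acts
        state, reward = env.step(action)
        print(t, state, reward)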

4
States vs. Descriptors
  • Traditional RL depends on accurate state
    information, whereas in physical robot
    environments
  • Even for the simplest agents the state space is
    very large.
  • Sensor inputs are noisy.
  • The agent usually perceives only local
    information.

5
Transitions vs. Events
  • World and agent states change asynchronously, in
    response to events, not all of which are caused
    by the agent.
  • The same event can vary in duration under
    different circumstances and have different
    consequences.
  • Nondeterministic and stochastic models are closer
    to the real world. However, the information
    needed to build a stochastic model is usually not
    available.

6
Learning Trials
  • Generating a complete policy requires a search
    over a very large state space.
  • In the real world, the agent cannot choose which
    states it will transition to, and cannot visit
    all states.
  • Convergence in the real world depends on focusing
    only on the relevant parts of the state space.
  • The better the problem is formulated, the fewer
    learning trials are needed.

7
Reinforcement vs. Feedback
  • Current RL work uses two types of reward
  • Immediate
  • Delayed
  • Real-world situations tend to fall between these
    two extremes
  • Some immediate rewards
  • Plenty of intermittent rewards
  • Few very delayed rewards
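
To make the distinction concrete, a small illustration of the three reward
schedules over one hypothetical 10-step episode; the subgoal steps and
magnitudes are made up:

    T = 10
    immediate    = [0.1] * T                          # feedback at every step
    delayed      = [0.0] * (T - 1) + [1.0]            # feedback only at the goal
    intermittent = [0.5 if t in (3, 7) else 0.0
                    for t in range(T - 1)] + [1.0]    # occasional subgoal feedback
    print(immediate, delayed, intermittent, sep="\n")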

8
Multiple Goals
  • Traditional RL deals with specialized problems in
    which the learning task can be specified with a
    single goal. The problems
  • A very specific task is learned
  • It conflicts with any future learning
  • The extensions
  • Sequentially formulated goals, where the state
    space explicitly encodes which goals have been
    reached so far.
  • Separate state spaces and reward functions for
    each goal (see the sketch below).
  • W-learning: competition among selfish Q-learners.
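
A rough sketch of the "separate state space and reward function per goal"
extension: one Q-table per goal, each updated from its own reward signal. The
arbitration rule here (act for the goal with the largest value gap) is only a
crude stand-in for W-learning's competition between selfish Q-learners; all
names are illustrative:

    from collections import defaultdict

    ACTIONS = ["search", "home", "avoid"]

    class GoalLearner:
        def __init__(self, alpha=0.1, gamma=0.9):
            self.q = defaultdict(float)        # (state, action) -> value
            self.alpha, self.gamma = alpha, gamma

        def update(self, s, a, r, s_next):
            best_next = max(self.q[(s_next, b)] for b in ACTIONS)
            self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

        def gap(self, s):
            # How strongly this goal cares about what happens in state s.
            vals = [self.q[(s, a)] for a in ACTIONS]
            return max(vals) - min(vals)

        def best_action(self, s):
            return max(ACTIONS, key=lambda a: self.q[(s, a)])

    learners = {"deliver": GoalLearner(), "avoid_intruder": GoalLearner()}

    def select_action(state):
        goal = max(learners, key=lambda g: learners[g].gap(state))
        return learners[goal].best_action(state)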

9
Goal
  • Given the complexity and uncertainty of
    real-world domains, the goal is a learning model
    that minimizes the state space and maximizes the
    amount of learning at each trial.

10
Intermediate Rewards
  • Intermittent rewards can be introduced
  • By reinforcing multiple goals and by using
    progress estimators.
  • Heterogeneous reinforcement function: since
    multiple goals exist in the real world, it is
    natural to reinforce each goal individually
    rather than through a single monolithic reward.
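
A minimal sketch of a heterogeneous reinforcement function: each goal
contributes its own reinforcer and the contributions are summed. The event
names and magnitudes are assumptions for illustration only:

    def heterogeneous_reward(events):
        """events: set of strings describing what just happened."""
        r = 0.0
        if "grasped_puck" in events:          r += 1.0   # subgoal reached
        if "dropped_puck_at_home" in events:  r += 3.0   # top-level goal reached
        if "bumped_obstacle" in events:       r -= 1.0   # punishment
        if "woke_up_at_night" in events:      r -= 0.5
        return r

    print(heterogeneous_reward({"grasped_puck", "bumped_obstacle"}))   # 0.0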

11
Progress Estimators
  • Partial internal critics associated with specific
    goals provide a metric of improvement relative to
    those goals. They are important in noisy worlds
  • They decrease the learner's sensitivity to
    intermittent errors.
  • They encourage exploration; without them, the
    agent can thrash, repeatedly attempting
    inappropriate behaviors.
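
A minimal sketch of one such internal critic, assuming a homing progress
estimator that rewards a decrease in the distance to home while a puck is
carried; the magnitudes are made up:

    class HomingProgress:
        def __init__(self):
            self.prev_distance = None

        def __call__(self, distance_to_home, carrying_puck):
            if not carrying_puck:
                self.prev_distance = None      # only active while carrying a puck
                return 0.0
            r = 0.0
            if self.prev_distance is not None:
                if distance_to_home < self.prev_distance:
                    r = 0.2                    # measurable progress toward home
                elif distance_to_home > self.prev_distance:
                    r = -0.2                   # regress: discourage thrashing
            self.prev_distance = distance_to_home
            return r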

12
Experimental Design
  • To validate the proposed approach, experiments
    were designed to compare the new RL formulation
    with traditional RL.
  • Robots
  • Learning Task
  • Learning Algorithm
  • Control Algorithm

13
Robots
  • The experiments use four fully autonomous R2
    mobile robots, each equipped with
  • A differentially steerable base
  • A gripper for lifting objects
  • A piezo-electric bump sensor for detecting
    contact-collisions and monitoring the grasping
    force.
  • A set of IR sensors for obstacle avoidance.
  • Radio transceivers, used for determining absolute
    position.

14
Robot Algorithm
  • The robots are programmed in the Behavior
    Language
  • Based on the subsumption architecture.
  • The parallel control system is formed from
    concurrently active behaviors, some of which
    gather information, some drive effectors, and
    some monitor progress and contribute
    reinforcement.

15
The Learning Task
  • The learning task consists of finding a mapping
    of all conditions and behaviors into the most
    efficient policy for group foraging.
  • Basic behaviors from which to learn behavior
    selection
  • Avoiding
  • Searching
  • Resting
  • Dispersing
  • Homing

16
The Learning Task Cont.
  • The state space can be reduced to the
    cross-product of the following state variables
  • Have-puck?
  • At-home?
  • Near-intruder?
  • Night-time?
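
The reduced state space is therefore the cross-product of four binary
predicates, i.e. 2^4 = 16 states, as a quick enumeration shows:

    from itertools import product

    PREDICATES = ["have_puck", "at_home", "near_intruder", "night_time"]
    STATES = list(product([False, True], repeat=len(PREDICATES)))
    print(len(STATES))                        # 16
    print(dict(zip(PREDICATES, STATES[5])))   # one example state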

17
Learning Task Cont.
  • Some behaviors are kept instinctive because
    learning them would have a high cost
  • As soon as the robot detects a puck between its
    fingers, it grasps it.
  • As soon as the robot reaches the home region, it
    drops the puck if it is carrying one.
  • Whenever the robot is too near an obstacle, it
    avoids it.
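
A small sketch of how such hard-wired reflexes might sit in front of the
learned behavior selection; the sensor field names are assumptions:

    def reflexes(sensors):
        if sensors.get("puck_between_fingers"):
            return "grasp"                    # grasp a detected puck immediately
        if sensors.get("at_home") and sensors.get("carrying_puck"):
            return "drop"                     # drop the puck in the home region
        if sensors.get("obstacle_too_near"):
            return "avoid"                    # reflexive obstacle avoidance
        return None                           # fall through to learned selection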

18
The learning Algorithm
  • The algorithm produces and maintains a matrix
    that stores the appropriateness of each behavior
    in each condition.
  • The values in the matrix fluctuate over time
    based on the received reinforcement, and are
    updated asynchronously whenever a reward is
    received.

19
The Learning Algorithm
20
The Learning Algorithm
  • The algorithm sums the reinforcement over time.
  • The influence of the different types of feedback
    is weighted by feedback constants.
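
A rough sketch of this update, assuming a matrix A(condition, behavior) that
accumulates weighted reinforcement; the feedback types and weight values are
placeholders, not the presentation's actual constants:

    from collections import defaultdict

    WEIGHTS = {"event": 1.0, "intruder_progress": 0.3, "homing_progress": 0.3}

    A = defaultdict(float)   # (condition, behavior) -> summed appropriateness

    def reinforce(condition, behavior, feedback):
        """feedback: dict mapping feedback type -> raw reinforcement value."""
        A[(condition, behavior)] += sum(WEIGHTS[k] * v for k, v in feedback.items())

    reinforce(("have_puck", "not_at_home"), "homing",
              {"event": 0.0, "homing_progress": 0.2, "intruder_progress": 0.0})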

21
The Control Algorithm
  • Whenever an event is detected, the following
    control sequence is executed
  • The appropriate reinforcement is delivered for
    the current condition-behavior pair,
  • The current behavior is terminated,
  • Another behavior is selected.
  • Behaviors are selected according to the following
    rule
  • Choose an untried behavior if one is available.
  • Otherwise choose the best-valued behavior, as
    sketched below.
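
A minimal, self-contained sketch of this event-driven rule; the behavior set
matches the earlier slide, everything else is illustrative:

    import random
    from collections import defaultdict

    BEHAVIORS = ["avoiding", "searching", "resting", "dispersing", "homing"]
    A = defaultdict(float)          # (condition, behavior) -> learned value
    tried = defaultdict(set)        # condition -> behaviors already tried

    def on_event(condition, current_behavior, reward):
        A[(condition, current_behavior)] += reward   # deliver reinforcement
        tried[condition].add(current_behavior)       # current behavior terminates
        untried = [b for b in BEHAVIORS if b not in tried[condition]]
        if untried:
            return random.choice(untried)            # try untried behaviors first
        return max(BEHAVIORS, key=lambda b: A[(condition, b)])   # else the best one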

22
Experimental Results
  • The following three approaches are compared
  • A monolithic single-goal reward function (puck
    delivery to the home region) using Q-learning:
    R(t) = P(t)
  • A heterogeneous reinforcement function using
    multiple goals: R(t) = E(t)
  • A heterogeneous reinforcement function using
    multiple goals and two progress estimator
    functions: R(t) = E(t) + I(t) + H(t)
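
Spelled out, the three compositions being compared are as follows, assuming P
is the monolithic puck-delivery reward, E the heterogeneous per-goal
reinforcement, and I, H the intruder-avoidance and homing progress estimators:

    def R_monolithic(P, E, I, H):      return P            # R(t) = P(t)
    def R_heterogeneous(P, E, I, H):   return E            # R(t) = E(t)
    def R_full(P, E, I, H):            return E + I + H    # R(t) = E(t) + I(t) + H(t)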

23
Experimental Results
  • Values are collected twice per minute.
  • The final learning values are collected after a
    15-minute run.
  • Convergence is defined in terms of the relative
    ordering of condition-behavior pairs (a check of
    this kind is sketched below).
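
A sketch of such a convergence check: learning has converged for a condition
once the learned values rank the behaviors in the desired relative order; the
example ordering is hypothetical:

    def converged(values, desired_order):
        """values: behavior -> learned value; desired_order: best-to-worst list."""
        learned_order = sorted(values, key=values.get, reverse=True)
        return learned_order == desired_order

    print(converged({"homing": 2.1, "searching": 0.4, "resting": -0.3},
                    ["homing", "searching", "resting"]))   # True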

24
Evaluation
  • Given the nondeterminism and the noisy sensor
    inputs, the single-goal reward provides
    insufficient feedback and was vulnerable to
    interference.
  • The second learning strategy outperforms the
    first because it detects the achievement of
    subgoals on the way to the top-level goal of
    depositing pucks at home.
  • The complete heterogeneous reinforcement with
    progress estimators outperforms the others
    because it uses all of the available information
    for every condition and behavior.

25
Additional Evaluation
  • Each part of the policy was evaluated separately,
    according to the following criteria
  • Number of trials required,
  • Correctness,
  • Stability.
  • Some condition-behavior pairs proved much more
    difficult to learn than others, in particular
  • Those without a progress estimator
  • Those involving rare states

26
Discussion
  • Summing reinforcement
  • Scaling
  • Transition models

27
Summing Reinforcement
  • Allows for oscillations.
  • In theory, the more reinforcement, the faster the
    learning; in practice, noise and error can have
    the opposite effect.
  • The experiments described here demonstrate that
    even with a significant amount of noise, multiple
    reinforcers and progress estimators significantly
    accelerate learning.

28
Scaling
  • Interference was detrimental to all three
    approaches.
  • In terms of the amount of time required, the
    learned group foraging strategy outperformed
    hand-coded greedy agent strategies.
  • Foraging can be improved further by minimizing
    interference, e.g. by letting only one robot move
    at a time.

29
Transition Models
  • In noisy and uncertain environments, a transition
    model is not available to aid the learner.
  • The absence of a model made it difficult to
    compute discounted future reward.
  • Future work includes applying this approach to
    problems that involve incomplete and approximate
    state transition models.