Design Principles for Creating Human-Shapable Agents

1
Design Principles for Creating Human-Shapable
Agents
  • W. Bradley Knox, Ian Fasel,
  • and Peter Stone

The University of Texas at Austin Department of
Computer Sciences
2
Transferring human knowledge through natural
forms of communication
  • Potential benefits over purely autonomous
    learners
  • Decrease sample complexity
  • Learn in the absence of a reward function
  • Allow lay users to teach agents the policies that
    they prefer (no programming!)
  • Learn in more complex domains

3
Shaping
(Photo: LOOK magazine, 1952)
  • Definition: creating a desired behavior by reinforcing
    successive approximations of the behavior

4
The Shaping Scenario (in this context)
  • A human trainer observes an agent and manually
    delivers reinforcement (a scalar value),
    signaling approval or disapproval.
  • E.g., training a dog with treats as in the
    previous picture

5
The Shaping Problem (for computational agents)
  • Within a sequential decision making task, how can
    an agent harness state descriptions and
    occasional scalar human reinforcement signals to
    learn a good task policy?

6
Previous work on human-shapable agents
  • Clicker training for entertainment agents
    (Blumberg et al., 2002; Kaplan et al., 2002)
  • Sophie's World (Thomaz and Breazeal, 2006)
  • RL with reward = environmental (MDP) reward +
    human reinforcement
  • Social software agent Cobot in LambdaMOO (Isbell
    et al., 2006)
  • RL with reward = human reinforcement

7
MDP reward vs. Human reinforcement
  • MDP reward (within reinforcement learning)
  • Key problem: credit assignment from sparse
    rewards
  • Reinforcement from a human trainer
  • Trainer has long-term impact in mind
  • Reinforcement is within a small temporal window
    of the targeted behavior
  • Credit assignment problem is largely removed

8
Teaching an Agent Manually via Evaluative
Reinforcement (TAMER)
  • TAMER approach
  • Learn a model of human reinforcement, Ĥ(s, a)
  • Directly exploit the model to determine the policy
  • If greedy: a = argmax_a Ĥ(s, a)  (sketched below)

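A minimal sketch of the greedy exploitation step above, in Python. The names reinf_model, predict, and extract_features are illustrative stand-ins (not from the original slides); the only assumption is a learned model that predicts human reinforcement from (state, action) features.

# Sketch only: reinf_model and extract_features are hypothetical placeholders.
def greedy_action(state, actions, reinf_model, extract_features):
    """Choose the action whose predicted human reinforcement is highest."""
    return max(actions, key=lambda a: reinf_model.predict(extract_features(state, a)))
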
9
Teaching an Agent Manually via Evaluative
Reinforcement (TAMER)
  • Learning from targeted human reinforcement is a
    supervised learning problem, not a reinforcement
    learning problem.

10
Teaching an Agent Manually via Evaluative
Reinforcement (TAMER)
11
The Shaped Agent's Perspective
  • Each time step, the agent (see the loop sketch below)
  • receives state description
  • might receive a scalar human reinforcement signal
  • chooses an action
  • does not receive an environmental reward signal
    (if learning purely from shaping)

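A rough sketch of this per-time-step loop, assuming hypothetical agent and environment interfaces (choose_action, update, step, get_human_signal); note that no environmental reward enters the loop.

def run_shaping_episode(env, agent, get_human_signal):
    # Pure-shaping loop: the agent sees states and occasional human
    # reinforcement, never an environmental reward signal.
    state, done = env.reset(), False
    while not done:
        human_reinf = get_human_signal()       # scalar approval/disapproval, or None
        if human_reinf is not None:
            agent.update(human_reinf)          # learn only when the trainer responds
        action = agent.choose_action(state)    # greedy w.r.t. the learned model
        state, done = env.step(action)         # no reward term is returned or used
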
12
Tetris
  • Drop blocks to make solid horizontal lines, which
    then disappear
  • state space > 2^250
  • Challenging but slow
  • 21 features extracted from (s, a)
  • TAMER model (sketched below)
  • Linear model over features
  • Gradient descent updates
  • Greedy action selection

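A minimal sketch of the model just described: a linear predictor over the 21 (s, a) features, updated by gradient descent toward the trainer's signal. The class and parameter names are illustrative; the Tetris-specific feature extraction is omitted.

import numpy as np

class LinearReinfModel:
    """Linear model of human reinforcement over (s, a) features (sketch)."""

    def __init__(self, n_features=21, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def predict(self, features):
        return float(self.w @ features)

    def update(self, features, human_reinf):
        # One stochastic gradient step on squared error toward the signal.
        error = human_reinf - self.predict(features)
        self.w += self.alpha * error * np.asarray(features)
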
13
TAMER in action: Tetris
(Videos: before training, during training, after training)
14
TAMER Results: Tetris (9 subjects)
15
TAMER Results: Tetris (9 subjects)
16
TAMER Results: Mountain Car (19 subjects)
17
Conjectures on how to create an agent that can be
interactively shaped by a human trainer
  • For many tasks, greedily exploiting the human
    trainer's reinforcement function yields a good
    policy.
  • Modeling a human trainer's reinforcement is a
    supervised learning problem (not RL).
  • Exploration can be driven by negative
    reinforcement alone.
  • Credit assignment to a dense state-action history
    should distribute each reinforcement signal over a
    short window of recent time steps (see the HOW TO
    slides below).
  • A human trainer's reinforcement function is not
    static.
  • Human reinforcement is a function of states and
    actions.
  • In an MDP, human reinforcement should be treated
    differently from environmental reward.
  • Human trainers reinforce predicted action as well
    as recent action.

18
the end.
19
Mountain Car
  • Drive back and forth, gaining enough momentum to
    get to the goal on top of the hill
  • Continuous state space
  • Velocity and position
  • Simple but rapid actions
  • Feature extraction (sketched below)
  • 2D Gaussian RBFs over velocity and position of
    car
  • One grid of RBFs per action
  • TAMER model
  • Linear model over RBF features
  • Gradient descent updates

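A sketch of the feature extraction described above: a grid of 2D Gaussian RBFs over (position, velocity), duplicated once per action so each action owns its own block of features. The grid size, state ranges, and RBF width below are illustrative choices, not values from the slides.

import numpy as np

def rbf_features(position, velocity, action, n_actions=3, grid=8,
                 pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07), width=0.3):
    """2D Gaussian RBF features over (position, velocity), one grid per action."""
    # Normalize state and centers to [0, 1] so both dimensions contribute equally.
    p = (position - pos_range[0]) / (pos_range[1] - pos_range[0])
    v = (velocity - vel_range[0]) / (vel_range[1] - vel_range[0])
    cp = np.linspace(0.0, 1.0, grid)
    cv = np.linspace(0.0, 1.0, grid)
    grid_feats = np.exp(-((p - cp[:, None]) ** 2 + (v - cv[None, :]) ** 2)
                        / (2 * width ** 2)).ravel()
    # Place the active grid in the block belonging to the chosen action.
    feats = np.zeros(n_actions * grid * grid)
    feats[action * grid * grid:(action + 1) * grid * grid] = grid_feats
    return feats
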
20
TAMER in action: Mountain Car
(Videos: before training, during training, after training)
21
TAMER Results: Mountain Car (19 subjects)
22
TAMER Results: Mountain Car (19 subjects)
23
HOW TO Convert a basic TD-learning agent into a
TAMER agent (without temporal credit assignment)
  • the underlying function approximator must be a
    Q-function (for state-action values)
  • set the discount factor (gamma) to 0
  • make action selection fully greedy
  • human reinforcement replaces environmental reward
  • if no human input is received, perform no update
  • remove any eligibility traces (can just set the
    parameter lambda to 0)
  • consider lowering alpha to 0.01 or less (see the
    code sketch below)

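The recipe above, sketched in Python under the assumption of a linear Q-function approximator with one weight vector per action. With gamma = 0, lambda = 0, greedy selection, and human reinforcement in place of environmental reward, the TD update collapses to a supervised regression step on the trainer's signal. All class and method names here are illustrative.

import numpy as np

class TamerFromTD:
    """A TD-style agent with the TAMER conversion applied (sketch)."""

    def __init__(self, n_features, n_actions, alpha=0.01):
        self.w = np.zeros((n_actions, n_features))
        self.alpha = alpha        # lowered learning rate, per the slide
        self.gamma = 0.0          # discount factor set to 0
        self.lam = 0.0            # eligibility traces removed

    def q(self, features, action):
        return float(self.w[action] @ features)

    def choose_action(self, features):
        # Fully greedy action selection: no epsilon, no exploration bonus.
        return int(np.argmax(self.w @ features))

    def update(self, features, action, human_reinf):
        # Human reinforcement replaces environmental reward; with gamma = 0
        # there is no bootstrapped next-state term. Callers should skip this
        # entirely on time steps with no human input.
        error = human_reinf - self.q(features, action)
        self.w[action] += self.alpha * error * np.asarray(features)
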
24
HOW TO Convert a TD-learning agent into a TAMER
agent (cont.)
  • With credit assignment (more frequent time steps)
  • 1. Save (features, human reinforcement) for each
    time step in a window from 0.2 seconds to about 0.8
    seconds before the reinforcement signal
  • 2. Define a probability distribution function (pdf)
    over the window (a uniform distribution is probably
    fine)
  • 3. The credit for each state-action pair is the
    integral of the pdf from the time of the next most
    recent time step to the time step for that pair
  • For the update, both the reward prediction (in
    place of the state-action-value prediction) used to
    calculate the error and the gradient for any one
    weight use the credit-weighted sum, for each action,
    of the features in the window (the weights are the
    "credit" from the previous step)
  • Time measurements used for credit assignment should
    be in real time, not simulation time (see the code
    sketch below)
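A rough sketch of the credit-assignment update above, assuming a uniform pdf over a delay window of 0.2 to 0.8 seconds before the human signal and a linear model with one weight vector per action (as in the previous sketch). Function and variable names are illustrative, and timestamps are wall-clock times.

import numpy as np

WINDOW_START, WINDOW_END = 0.2, 0.8   # seconds of delay covered by the pdf

def credit(delay_newer, delay_older):
    """Integral of a uniform pdf on [WINDOW_START, WINDOW_END] between two delays."""
    lo = max(delay_newer, WINDOW_START)
    hi = min(delay_older, WINDOW_END)
    return max(0.0, hi - lo) / (WINDOW_END - WINDOW_START)

def tamer_credit_update(history, reinf_time, human_reinf, w, alpha=0.01):
    """history: list of (timestamp, action, features) tuples, oldest first.
    w: weight matrix of shape (n_actions, n_features)."""
    # Credit-weighted sum of features, kept separate per action.
    weighted = np.zeros_like(w)
    for i, (t, action, feats) in enumerate(history):
        # The pair at time t was "current" until the next time step began.
        t_next = history[i + 1][0] if i + 1 < len(history) else reinf_time
        c = credit(reinf_time - t_next, reinf_time - t)
        weighted[action] += c * np.asarray(feats)
    # Both the reward prediction and the gradient use the credit-weighted sum.
    error = human_reinf - float((w * weighted).sum())
    w += alpha * error * weighted
    return w
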