Design Principles for Creating Human-Shapable Agents

1
Design Principles for Creating Human-Shapable
Agents
  • W. Bradley Knox, Ian Fasel,
  • and Peter Stone

The University of Texas at Austin Department of
Computer Sciences
2
Transferring human knowledge through natural
forms of communication
  • Potential benefits over purely autonomous
    learners
  • Decrease sample complexity
  • Learn in the absence of a reward function
  • Allow lay users to teach agents the policies that
    they prefer (no programming!)
  • Learn in more complex domains

3
Shaping
(Photo: LOOK magazine, 1952)
  • Definition: creating a desired behavior by reinforcing
    successive approximations of the behavior

4
The Shaping Scenario (in this context)
  • A human trainer observes an agent and manually
    delivers reinforcement (a scalar value),
    signaling approval or disapproval.
  • E.g., training a dog with treats as in the
    previous picture

5
The Shaping Problem (for computational agents)
  • Within a sequential decision making task, how can
    an agent harness state descriptions and
    occasional scalar human reinforcement signals to
    learn a good task policy?

6
Previous work on human-shapable agents
  • Clicker training for entertainment agents
    (Blumberg et al., 2002; Kaplan et al., 2002)
  • Sophie's World (Thomaz and Breazeal, 2006)
  • RL with reward = environmental (MDP) reward +
    human reinforcement
  • Social software agent Cobot in LambdaMOO (Isbell
    et al., 2006)
  • RL with reward = human reinforcement

7
MDP reward vs. Human reinforcement
  • MDP reward (within reinforcement learning)
  • Key problem: credit assignment from sparse
    rewards
  • Reinforcement from a human trainer
  • Trainer has long-term impact in mind
  • Reinforcement is within a small temporal window
    of the targeted behavior
  • Credit assignment problem is largely removed

8
Teaching an Agent Manually via Evaluative
Reinforcement (TAMER)
  • TAMER approach
  • Learn a model of human reinforcement, Ĥ(s, a)
  • Directly exploit the model to determine the policy
  • If greedy: a = argmax_a Ĥ(s, a)  (sketched below)

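A minimal sketch of the greedy exploitation step above, in Python. The names reinf_model, predict, and extract_features are illustrative stand-ins (not from the original slides); the only assumption is a learned model that predicts human reinforcement from (state, action) features.

# Sketch only: reinf_model and extract_features are hypothetical placeholders.
def greedy_action(state, actions, reinf_model, extract_features):
    """Choose the action whose predicted human reinforcement is highest."""
    return max(actions, key=lambda a: reinf_model.predict(extract_features(state, a)))
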
9
Teaching an Agent Manually via Evaluative
Reinforcement (TAMER)
  • Learning from targeted human reinforcement is a
    supervised learning problem, not a reinforcement
    learning problem.

10
Teaching an Agent Manually via Evaluative
Reinforcement (TAMER)
11
The Shaped Agent's Perspective
  • Each time step, the agent (see the loop sketch below)
  • receives state description
  • might receive a scalar human reinforcement signal
  • chooses an action
  • does not receive an environmental reward signal
    (if learning purely from shaping)

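A rough sketch of this per-time-step loop, assuming hypothetical agent and environment interfaces (choose_action, update, step, get_human_signal); note that no environmental reward enters the loop.

def run_shaping_episode(env, agent, get_human_signal):
    # Pure-shaping loop: the agent sees states and occasional human
    # reinforcement, never an environmental reward signal.
    state, done = env.reset(), False
    while not done:
        human_reinf = get_human_signal()       # scalar approval/disapproval, or None
        if human_reinf is not None:
            agent.update(human_reinf)          # learn only when the trainer responds
        action = agent.choose_action(state)    # greedy w.r.t. the learned model
        state, done = env.step(action)         # no reward term is returned or used
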
12
Tetris
  • Drop blocks to make solid horizontal lines, which
    then disappear
  • state space > 2^250
  • Challenging but slow
  • 21 features extracted from (s, a)
  • TAMER model (sketched below)
  • Linear model over features
  • Gradient descent updates
  • Greedy action selection

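A minimal sketch of the model just described: a linear predictor over the 21 (s, a) features, updated by gradient descent toward the trainer's signal. The class and parameter names are illustrative; the Tetris-specific feature extraction is omitted.

import numpy as np

class LinearReinfModel:
    """Linear model of human reinforcement over (s, a) features (sketch)."""

    def __init__(self, n_features=21, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def predict(self, features):
        return float(self.w @ features)

    def update(self, features, human_reinf):
        # One stochastic gradient step on squared error toward the signal.
        error = human_reinf - self.predict(features)
        self.w += self.alpha * error * np.asarray(features)
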
13
TAMER in action: Tetris
(Videos: before training, during training, after training)
14
TAMER Results: Tetris (9 subjects)
15
TAMER Results: Tetris (9 subjects)
16
TAMER Results: Mountain Car (19 subjects)
17
Conjectures on how to create an agent that can be
interactively shaped by a human trainer
  • For many tasks, greedily exploiting the human
    trainer's reinforcement function yields a good
    policy.
  • Modeling a human trainer's reinforcement is a
    supervised learning problem (not RL).
  • Exploration can be driven by negative
    reinforcement alone.
  • Credit assignment to a dense state-action history
    should distribute each reinforcement signal over a
    short window of recent time steps (see the HOW TO
    slides below).
  • A human trainer's reinforcement function is not
    static.
  • Human reinforcement is a function of states and
    actions.
  • In an MDP, human reinforcement should be treated
    differently from environmental reward.
  • Human trainers reinforce predicted action as well
    as recent action.

18
the end.
19
Mountain Car
  • Drive back and forth, gaining enough momentum to
    get to the goal on top of the hill
  • Continuous state space
  • Velocity and position
  • Simple but rapid actions
  • Feature extraction (sketched below)
  • 2D Gaussian RBFs over velocity and position of
    car
  • One grid of RBFs per action
  • TAMER model
  • Linear model over RBF features
  • Gradient descent updates

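A sketch of the feature extraction described above: a grid of 2D Gaussian RBFs over (position, velocity), duplicated once per action so each action owns its own block of features. The grid size, state ranges, and RBF width below are illustrative choices, not values from the slides.

import numpy as np

def rbf_features(position, velocity, action, n_actions=3, grid=8,
                 pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07), width=0.3):
    """2D Gaussian RBF features over (position, velocity), one grid per action."""
    # Normalize state and centers to [0, 1] so both dimensions contribute equally.
    p = (position - pos_range[0]) / (pos_range[1] - pos_range[0])
    v = (velocity - vel_range[0]) / (vel_range[1] - vel_range[0])
    cp = np.linspace(0.0, 1.0, grid)
    cv = np.linspace(0.0, 1.0, grid)
    grid_feats = np.exp(-((p - cp[:, None]) ** 2 + (v - cv[None, :]) ** 2)
                        / (2 * width ** 2)).ravel()
    # Place the active grid in the block belonging to the chosen action.
    feats = np.zeros(n_actions * grid * grid)
    feats[action * grid * grid:(action + 1) * grid * grid] = grid_feats
    return feats
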
20
TAMER in action: Mountain Car
(Videos: before training, during training, after training)
21
TAMER Results: Mountain Car (19 subjects)
22
TAMER Results: Mountain Car (19 subjects)
23
HOW TO Convert a basic TD-learning agent into a
TAMER agent (without temporal credit assignment)
  • the underlying function approximator must be a
    Q-function (for state-action values)
  • set the discount factor (gamma) to 0
  • make action selection fully greedy
  • human reinforcement replaces environmental reward
  • if no human input is received, perform no update
  • remove any eligibility traces (can just set the
    parameter lambda to 0)
  • consider lowering alpha to 0.01 or less (see the
    code sketch below)

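The recipe above, sketched in Python under the assumption of a linear Q-function approximator with one weight vector per action. With gamma = 0, lambda = 0, greedy selection, and human reinforcement in place of environmental reward, the TD update collapses to a supervised regression step on the trainer's signal. All class and method names here are illustrative.

import numpy as np

class TamerFromTD:
    """A TD-style agent with the TAMER conversion applied (sketch)."""

    def __init__(self, n_features, n_actions, alpha=0.01):
        self.w = np.zeros((n_actions, n_features))
        self.alpha = alpha        # lowered learning rate, per the slide
        self.gamma = 0.0          # discount factor set to 0
        self.lam = 0.0            # eligibility traces removed

    def q(self, features, action):
        return float(self.w[action] @ features)

    def choose_action(self, features):
        # Fully greedy action selection: no epsilon, no exploration bonus.
        return int(np.argmax(self.w @ features))

    def update(self, features, action, human_reinf):
        # Human reinforcement replaces environmental reward; with gamma = 0
        # there is no bootstrapped next-state term. Callers should skip this
        # entirely on time steps with no human input.
        error = human_reinf - self.q(features, action)
        self.w[action] += self.alpha * error * np.asarray(features)
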
24
HOW TO Convert a TD-learning agent into a TAMER
agent (cont.)
  • With credit assignment (more frequent time steps)
  • 1. Save (features, human reinforcement) for each
    time step in a window from 0.2 seconds to about 0.8
    seconds before the reinforcement signal
  • 2. Define a probability distribution function (pdf)
    over the window (a uniform distribution is probably
    fine)
  • 3. The credit for each state-action pair is the
    integral of the pdf from the time of the next most
    recent time step to the time step for that pair
  • For the update, both the reward prediction (in
    place of the state-action-value prediction) used to
    calculate the error and the gradient for any one
    weight use the credit-weighted sum, for each action,
    of the features in the window (the weights are the
    "credit" from the previous step)
  • Time measurements used for credit assignment should
    be in real time, not simulation time (see the code
    sketch below)
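A rough sketch of the credit-assignment update above, assuming a uniform pdf over a delay window of 0.2 to 0.8 seconds before the human signal and a linear model with one weight vector per action (as in the previous sketch). Function and variable names are illustrative, and timestamps are wall-clock times.

import numpy as np

WINDOW_START, WINDOW_END = 0.2, 0.8   # seconds of delay covered by the pdf

def credit(delay_newer, delay_older):
    """Integral of a uniform pdf on [WINDOW_START, WINDOW_END] between two delays."""
    lo = max(delay_newer, WINDOW_START)
    hi = min(delay_older, WINDOW_END)
    return max(0.0, hi - lo) / (WINDOW_END - WINDOW_START)

def tamer_credit_update(history, reinf_time, human_reinf, w, alpha=0.01):
    """history: list of (timestamp, action, features) tuples, oldest first.
    w: weight matrix of shape (n_actions, n_features)."""
    # Credit-weighted sum of features, kept separate per action.
    weighted = np.zeros_like(w)
    for i, (t, action, feats) in enumerate(history):
        # The pair at time t was "current" until the next time step began.
        t_next = history[i + 1][0] if i + 1 < len(history) else reinf_time
        c = credit(reinf_time - t_next, reinf_time - t)
        weighted[action] += c * np.asarray(feats)
    # Both the reward prediction and the gradient use the credit-weighted sum.
    error = human_reinf - float((w * weighted).sum())
    w += alpha * error * weighted
    return w
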