Title: Design Principles for Creating Human-Shapable Agents
1. Design Principles for Creating Human-Shapable Agents
W. Bradley Knox, Ian Fasel, and Peter Stone
The University of Texas at Austin, Department of Computer Sciences
2. Transferring human knowledge through natural forms of communication
- Potential benefits over purely autonomous learners:
  - Decrease sample complexity
  - Learn in the absence of a reward function
  - Allow lay users to teach agents the policies that they prefer (no programming!)
  - Learn in more complex domains
3. Shaping
(Image: LOOK magazine, 1952)
- Definition: creating a desired behavior by reinforcing successive approximations of the behavior
4. The Shaping Scenario (in this context)
- A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.
- E.g., training a dog with treats, as in the previous picture
5. The Shaping Problem (for computational agents)
- Within a sequential decision-making task, how can an agent harness state descriptions and occasional scalar human reinforcement signals to learn a good task policy?
6. Previous work on human-shapable agents
- Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)
- Sophie's World (Thomaz & Breazeal, 2006)
  - RL with reward = environmental (MDP) reward + human reinforcement
- Social software agent Cobot in LambdaMOO (Isbell et al., 2006)
  - RL with reward = human reinforcement
7. MDP reward vs. human reinforcement
- MDP reward (within reinforcement learning)
  - Key problem: credit assignment from sparse rewards
- Reinforcement from a human trainer
  - Trainer has long-term impact in mind
  - Reinforcement is within a small temporal window of the targeted behavior
  - Credit assignment problem is largely removed
8. Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
- TAMER approach:
  - Learn a model of human reinforcement
  - Directly exploit the model to determine the policy
  - If greedy, choose the action with the highest predicted human reinforcement (sketched below)
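A minimal sketch of this greedy exploitation step, assuming a learned model H_hat(s, a) of human reinforcement; the function name and interface below are illustrative, not from the paper.

```python
# Greedy exploitation of a learned human-reinforcement model (illustrative sketch).

def greedy_tamer_action(state, actions, predict_reinforcement):
    """Pick the action whose predicted human reinforcement is highest.

    predict_reinforcement(state, action) -> float plays the role of the
    learned model H_hat(s, a); its name and signature are assumptions.
    """
    return max(actions, key=lambda a: predict_reinforcement(state, a))
```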
9. Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
- Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.
10. Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
11. The Shaped Agent's Perspective
- Each time step, the agent (see the sketch below):
  - receives a state description
  - might receive a scalar human reinforcement signal
  - chooses an action
  - does not receive an environmental reward signal (if learning purely from shaping)
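A minimal sketch of this per-time-step loop; the environment and trainer interfaces (env.reset, env.step, trainer.poll) are hypothetical stand-ins, and note that no environmental reward is consumed.

```python
# The shaped agent's loop, learning purely from human reinforcement (illustrative sketch).

def run_shaped_episode(env, trainer, choose_action, update_model, get_features):
    state = env.reset()
    done = False
    while not done:
        action = choose_action(state)      # agent chooses an action
        human_signal = trainer.poll()      # scalar reinforcement, or None this step
        if human_signal is not None:
            # Supervised update of the reinforcement model H_hat
            update_model(get_features(state, action), human_signal)
        state, done = env.step(action)     # state description only; no reward term
```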
12. Tetris
- Drop blocks to make solid horizontal lines, which then disappear
- State space > 2^250
- Challenging but slow
- 21 features extracted from (s, a)
- TAMER model (sketched below):
  - Linear model over features
  - Gradient descent updates
  - Greedy action selection
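A sketch of a TAMER model of this form, under illustrative assumptions (class name, step size, and interfaces are not from the paper): a linear model over the 21 (s, a) features, updated by gradient descent toward each received human reinforcement, with greedy action selection.

```python
import numpy as np

class LinearTamerModel:
    """Linear model of human reinforcement over state-action features (sketch)."""

    def __init__(self, n_features=21, step_size=0.02):
        self.w = np.zeros(n_features)
        self.step_size = step_size

    def predict(self, features):
        # H_hat(s, a) = w . phi(s, a)
        return float(np.dot(self.w, features))

    def update(self, features, human_reinforcement):
        # One gradient-descent step on 0.5 * (h - H_hat(s, a))^2
        error = human_reinforcement - self.predict(features)
        self.w += self.step_size * error * features

    def greedy_action(self, state, actions, extract_features):
        # Greedy selection: the action with the highest predicted reinforcement
        return max(actions, key=lambda a: self.predict(extract_features(state, a)))
```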
13. TAMER in action: Tetris
(Videos: before training, during training, after training)
14. TAMER Results: Tetris (9 subjects)
15. TAMER Results: Tetris (9 subjects)
16. TAMER Results: Mountain Car (19 subjects)
17. Conjectures on how to create an agent that can be interactively shaped by a human trainer
- For many tasks, greedily exploiting the human trainer's reinforcement function yields a good policy.
- Modeling a human trainer's reinforcement is a supervised learning problem (not RL).
- Exploration can be driven by negative reinforcement alone.
- Credit assignment over a dense state-action history should spread each reinforcement signal across a window of recent time steps.
- A human trainer's reinforcement function is not static.
- Human reinforcement is a function of states and actions.
- In an MDP, human reinforcement should be treated differently from environmental reward.
- Human trainers reinforce predicted actions as well as recent actions.
18. The end.
19. Mountain Car
- Drive back and forth, gaining enough momentum to get to the goal on top of the hill
- Continuous state space
  - Velocity and position
- Simple but rapid actions
- Feature extraction (sketched below)
  - 2D Gaussian RBFs over the velocity and position of the car
  - One grid of RBFs per action
- TAMER model
  - Linear model over RBF features
  - Gradient descent updates
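A sketch of feature extraction of this kind, assuming the standard Mountain Car state bounds, an integer action index, and an illustrative 6 x 6 RBF grid per action; the grid size and RBF widths are assumptions, not the values used in the experiments.

```python
import numpy as np

def mountain_car_features(position, velocity, action, n_actions=3, grid=6,
                          pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07)):
    """2D Gaussian RBF activations over (position, velocity), one grid per action."""
    centers_p = np.linspace(*pos_range, grid)
    centers_v = np.linspace(*vel_range, grid)
    sigma_p = (pos_range[1] - pos_range[0]) / grid
    sigma_v = (vel_range[1] - vel_range[0]) / grid

    # Activations of the RBF grid for this state
    pp, vv = np.meshgrid(centers_p, centers_v)
    activ = np.exp(-((position - pp) ** 2) / (2 * sigma_p ** 2)
                   - ((velocity - vv) ** 2) / (2 * sigma_v ** 2)).ravel()

    # One copy of the grid per action: place activations in that action's slot
    features = np.zeros(n_actions * activ.size)
    features[action * activ.size:(action + 1) * activ.size] = activ
    return features
```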
20. TAMER in action: Mountain Car
(Videos: before training, during training, after training)
21. TAMER Results: Mountain Car (19 subjects)
22. TAMER Results: Mountain Car (19 subjects)
23. HOW TO: Convert a basic TD-learning agent into a TAMER agent (without temporal credit assignment)
- The underlying function approximator must be a Q-function (over state-action values)
- Set the discount factor (gamma) to 0
- Make action selection fully greedy
- Human reinforcement replaces environmental reward (the resulting update is sketched after this list)
- If no human input is received, make no update
- Remove any eligibility traces (you can just set the parameter lambda to 0)
- Consider lowering alpha to 0.01 or less
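With those changes, the per-step update reduces to a supervised step toward the human signal. A sketch of the resulting update, assuming a linear function approximator with NumPy weight and feature vectors (names are illustrative):

```python
import numpy as np

def converted_td_update(w, phi_sa, h, alpha=0.01):
    """One update of the (former) Q-function toward the human signal h.

    Call only on time steps where human reinforcement h was received.
    With gamma = 0 and no eligibility traces, the target is h alone.
    """
    prediction = w @ phi_sa      # old Q(s, a), now a prediction of h
    error = h - prediction
    return w + alpha * error * phi_sa

# Example: one update after the trainer gives +1
w = np.zeros(4)
w = converted_td_update(w, np.array([1.0, 0.0, 0.5, 0.0]), h=1.0)
```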
24. HOW TO: Convert a TD-learning agent into a TAMER agent (cont.)
- With credit assignment (for more frequent time steps):
  1. Save (features, human reinforcement) for each time step in a window from 0.2 seconds before the reinforcement signal to about 0.8 seconds before it
  2. Define a probability distribution function over the window (a uniform distribution is probably fine)
  3. The credit for each state-action pair is the integral of the pdf from the time of the next most recent time step to the time step for that pair
  4. For the update, both the reward prediction (in place of the state-action-value prediction) used to calculate the error and the calculation of the gradient for any one weight use the weighted sum, for each action, of the features in the window (the weights are the "credit" calculated in the previous step); see the sketch after this list
  5. Time measurements used for credit assignment should be in real time, not simulation time
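A sketch of this credit-assignment scheme, assuming a uniform distribution over the window and wall-clock timestamps; all names and the handling of the window bounds are illustrative.

```python
import numpy as np

def uniform_credit(t_signal, t_prev_step, t_step, lo=0.2, hi=0.8):
    """Credit for the time step at t_step: the integral of a uniform pdf on
    [t_signal - hi, t_signal - lo] over the interval (t_prev_step, t_step]."""
    a, b = t_signal - hi, t_signal - lo
    overlap = max(0.0, min(t_step, b) - max(t_prev_step, a))
    return overlap / (hi - lo)

def credit_weighted_update(w, window, t_signal, h, alpha=0.01):
    """window: list of (timestamp, feature_vector) pairs, oldest first.

    Both the prediction used for the error and the gradient use the
    credit-weighted sum of the features saved in the window.
    """
    weighted = np.zeros_like(w)
    for i, (t_step, phi) in enumerate(window):
        t_prev = window[i - 1][0] if i > 0 else float("-inf")
        weighted += uniform_credit(t_signal, t_prev, t_step) * phi
    error = h - w @ weighted
    return w + alpha * error * weighted
```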