1. How a Modeler's Conception of Rewards Influences a Model's Behavior
- Investigating ACT-R 6's utility learning mechanism
- Christian P. Janssen
- Wayne D. Gray
- Michael J. Schoelles
2. Temporal difference learning & ACT-R
- Temporal difference learning has recently been introduced as ACT-R's new utility learning mechanism (e.g., Fu & Anderson, 2004; Anderson, 2006, 2007; Bothell, 2005)
- Utility learning optimizes behavior so as to maximize the rewards that the model receives (see the sketch after this list)
- A model can
  - Receive rewards at different moments in time
  - Receive rewards of different magnitudes
- There are no guidelines for choosing when a reward should be given and what its magnitude should be
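To make the mechanism concrete, here is a minimal sketch of the temporal-difference utility update described by Anderson (2007): U(n) = U(n-1) + alpha * [R(n) - U(n-1)], where the effective reward R(n) is the reward's magnitude minus the time between the production's firing and the reward. The function names, the learning-rate value, and the trial loop below are illustrative assumptions, not ACT-R's own code.

    ALPHA = 0.2  # learning rate (ACT-R's :alpha parameter; default value assumed here)

    def effective_reward(reward_magnitude, time_to_reward):
        """Effective reward for one production: the external reward minus the
        time (in seconds) between the production's firing and the reward."""
        return reward_magnitude - time_to_reward

    def update_utility(utility, reward_magnitude, time_to_reward, alpha=ALPHA):
        """One TD update: U(n) = U(n-1) + alpha * [R(n) - U(n-1)]."""
        r = effective_reward(reward_magnitude, time_to_reward)
        return utility + alpha * (r - utility)

    # Hypothetical example: a reward of magnitude 10 is delivered 3 s after the
    # rule fired.  Over repeated trials the utility converges toward 10 - 3 = 7.
    u = 0.0
    for _ in range(50):
        u = update_utility(u, reward_magnitude=10.0, time_to_reward=3.0)
    print(round(u, 2))  # approximately 7.0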
3. New issues for ACT-R
- We studied two aspects of TD learning
  - When the reward is given
  - The magnitude of the reward
- This is a new issue for ACT-R
  - When the reward is given could already be varied in ACT-R 5
  - The magnitude of the reward could not be varied in ACT-R 5
- As we will show, the modeler's conception of rewards has a big influence on a model's behavior
- Case study: Blocks World task (Gray et al., 2006)
4. Why the Blocks World task?
- Previous work indicates that the utility learning mechanism is crucial for this task
- ACT-R 5 models (Gray, Sims, & Schoelles, 2005)
  - Regular ACT-R 5 cannot provide a good fit to the human data
  - Because rewards in ACT-R 5 are binary (i.e., successes and failures) and not scalar
- Ideal Performer Model (Gray et al., 2006)
  - A model outside of ACT-R that uses temporal difference learning provided a very good fit (Gray et al., 2006)
5. Blocks World task
6. Blocks World task
- Task: Copy the pattern in the target window by moving blocks from the resource window to the workspace window
7. Blocks World task
- Windows are covered with gray rectangles. Accessing information requires interaction with the interface.
8. Blocks World task
9. Blocks World task
10. Blocks World task
11. Blocks World task
- Information in the target window is only available after waiting for a lockout time
  - 0, 400, or 3200 milliseconds (between subjects)
12. Blocks World task: human data (Gray et al., 2006)
- The size of the lockout time influences human behavior
13. Blocks World task: modeling strategies
- Strategy: How many blocks do you plan to place after a visit to the target window?
  - 8 encode-x production rules
  - "Study x blocks"
  - Encode-1 through Encode-8
- The model learns the utility value of each production rule using ACT-R's temporal difference learning algorithm (see the sketch below)
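For intuition, a rough sketch of how a model could pick one of the eight encode-x rules on a given visit: ACT-R selects the production whose learned utility plus noise is highest. The noise parameter value and the flat starting utilities below are assumptions for illustration, not values from the published model.

    import math
    import random

    EGS = 0.5  # utility noise parameter (ACT-R's :egs); value assumed here

    def logistic_noise(s=EGS):
        """Sample noise from a logistic distribution with scale s."""
        u = random.uniform(0.001, 0.999)  # clip away the extreme tails
        return s * math.log(u / (1.0 - u))

    # Utilities of the eight strategy rules, learned over trials (start flat).
    utilities = {"encode-%d" % k: 0.0 for k in range(1, 9)}

    def choose_strategy(utilities):
        """Pick the production with the highest noisy utility."""
        noisy = {name: u + logistic_noise() for name, u in utilities.items()}
        return max(noisy, key=noisy.get)

    print(choose_strategy(utilities))  # e.g. "encode-3"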
14. Utility learning
- Utility learning requires the incorporation of rewards
- Two choices are crucial
  - When is the reward given?
  - What is the magnitude of the reward?
- After some experience, the utility of a production rule approximates (Anderson, 2007):
  utility ≈ magnitude of the reward - time between the rule firing and the moment the reward is given
15. Utility learning
- Choice 1: When is the reward given?
- Important because
  - The utility value has a linear relationship with the time at which the reward is given
- Choice in Blocks World (compare the sketch below)
  - "Once" model: update once, at the end of the trial
  - "Each" model: update each time that part of the task is completed
    - A (set of) block(s) has been placed and the model either returns to the target window to study more blocks, or finishes the trial
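A small numeric illustration (assumed timeline and reward value, not the model's actual parameters) of why this choice matters, using the relation from slide 14 that the effective reward equals its magnitude minus the time elapsed since the production fired.

    # One hypothetical trial with three target-window visits: an encode-x rule
    # fires at t = 0, 10 and 20 s, each visit's blocks are placed 8 s later,
    # and the trial ends at t = 28 s.
    firing_times = [0.0, 10.0, 20.0]   # when each encode-x rule fired
    subtask_done = [8.0, 18.0, 28.0]   # when that part of the task finished
    REWARD = 10.0                      # assumed reward magnitude

    # "Once" model: a single reward at the end of the trial (t = 28 s).
    once = [REWARD - (28.0 - t) for t in firing_times]

    # "Each" model: a reward every time part of the task is completed.
    each = [REWARD - (done - t) for t, done in zip(firing_times, subtask_done)]

    print(once)  # [-18.0, -8.0, 2.0]  -> early rules are heavily penalized for elapsed time
    print(each)  # [2.0, 2.0, 2.0]     -> every rule receives the same effective reward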
16. Utility learning
- Choice 2: Magnitude of the reward
- Important because
  - The utility value has a linear relationship with the magnitude of the reward
- But how to set this value?
  - Experimental tweaking? -> unfavorable
  - A fixed range of values (e.g., between 0 and 1)? -> difficult
  - Relate it to neurological data? -> not available for most models
17. Utility learning
- Choice 2: Magnitude of the reward
- Choice in Blocks World
  - Relate the reward to what might be important in the task (see the sketch below)
- Accuracy: the accuracy with which the task is performed. Options:
  - Success: blocks placed (once)
  - Success: blocks placed (each)
  - Success & Failure: blocks placed - blocks forgotten (each)
- Time: how much time does (part of) the task take? Options:
  - Time spent on the task: -1 x time spent (once)
  - Time spent waiting for a specific aspect of the task: -1 x lockout size x number of visits to the target window (once)
  - Number of blocks placed per second (each)
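As an illustration, one way these options could be turned into concrete reward magnitudes for a single hypothetical visit; the formulas are our reading of the labels above, not the published model code.

    # Assumed quantities for one part of a trial (illustrative numbers only).
    blocks_placed = 3       # blocks placed after this target-window visit
    blocks_forgotten = 1    # blocks encoded but not placed correctly
    time_spent = 15.0       # seconds spent on this part of the task
    lockout = 3.2           # lockout size in seconds
    visits = 1              # target-window visits during this part of the task

    rewards = {
        "success: blocks placed": blocks_placed,
        "success & failure: placed - forgotten": blocks_placed - blocks_forgotten,
        "-1 x time spent": -1 * time_spent,
        "-1 x lockout x visits": -1 * lockout * visits,
        "blocks placed per second": blocks_placed / time_spent,
    }
    print(rewards)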
18. Blocks World task: modeling strategies
- 6 models were developed
- Each model is run 6 times for each of the 3 experimental conditions
  - 0, 400, and 3200 milliseconds
- The models interact with the same interface as the human participants
19. Blocks World task: general results
- Each model has unique results
20. Blocks World task: general results
- What is the impact of
  - When the reward is given (once/each)
  - The concept of the reward (related to accuracy/time)
- Results are averaged over 3 models
21. Utility learning: impact of when the reward is given
22. Utility learning: impact of the concept of the reward
23. Comparison with ACT-R 5 (Gray, Sims, & Schoelles, 2005)
24. Conclusion
- Rewards can be given at different times during a trial and according to different concepts
- There are no guidelines for what the best choices are
- The Blocks World task suggests that rewards should
  - Be given once: the model can then optimize behavior over the entire task
  - Relate to the concept of time, because different strategy choices have a big impact on reward size
- Models of other tasks should show whether this is consistent
25. Conclusion
- This is not just a Blocks World issue
  - It is a general Computer Science / AI issue: representing a task in the right way is crucial (e.g., Russell & Norvig, 1995; Sutton & Barto, 1998)
  - Many experiments involve manipulations and measurements of accuracy and speed of performance
- This is a new issue for ACT-R
  - When the reward is given could already be varied in ACT-R 5
  - The magnitude of the reward could not be varied in ACT-R 5
26. Thank you for your attention
- Questions?
- More information
  - cjanssen@ai.rug.nl
  - www.ai.rug.nl/cjanssen
  - www.cogsci.rpi.edu/cogworks
- Poster session at CogSci 2008, Thursday, July 24th: "Cognitive Models of Strategy Shifts in Interactive Behavior" (session: Attention and Implicit Learning)
27. References
- Anderson, J. R. (2006). A new utility learning mechanism. Paper presented at the 2006 ACT-R Workshop.
- Anderson, J. R. (2007). How can the human mind occur in the physical universe? New York: Oxford University Press.
- Bothell, D. (2005). ACT-R 6 official release. Proceedings of the 12th ACT-R Workshop.
- Fu, W. T., & Anderson, J. R. (2004). Extending the computational abilities of the procedural learning mechanism in ACT-R. Proceedings of the 26th Annual Meeting of the Cognitive Science Society, 416-421.
- Gray, W. D., Schoelles, M. J., & Sims, C. R. (2005). Adapting to the task environment: Explorations in expected value. Cognitive Systems Research, 6(1), 27-40.
- Gray, W. D., Sims, C. R., Fu, W. T., & Schoelles, M. J. (2006). The soft constraints hypothesis: A rational analysis approach to resource allocation for interactive behavior. Psychological Review, 113(3), 461-482.
- Russell, S. J., & Norvig, P. (1995). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice-Hall.
- Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.