Title: Computational Neuroscience of Reinforcement Learning
1. Computational Neuroscience of Reinforcement Learning
- François Rivest
- September 25, 2007
- McGill University
- Affiliation: Département d'informatique et de recherche opérationnelle, Groupe de recherche sur le système nerveux central, Université de Montréal
2. Computational Neuroscience of Reinforcement Learning
3. What is Reinforcement Learning?
- The reinforcement learning framework was originally developed to create a neural model of animal learning behaviors (by Andrew Barto in the 1980s).
- The framework is the following (a minimal sketch in code follows this slide):
- The agent is in an environment.
- It sees a state and takes an action.
- It receives a reward and ends up in a new state.
- The agent's goal is to maximize its reward.
- This led to the development of the temporal-difference (TD) learning algorithm.
The reward is the reinforcement signal
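A minimal sketch of this interaction loop (the toy environment and all names below are illustrative, not from the slides):

```python
import random

class TwoStateEnv:
    """Toy environment: action 1 taken in state 1 pays a reward of 1."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = random.choice([0, 1])   # the agent ends up in a new state
        return self.state, reward

env = TwoStateEnv()
state = env.reset()
total = 0.0
for t in range(100):
    action = random.choice([0, 1])       # a learning agent would instead pick actions to maximize reward
    state, reward = env.step(action)     # see a state, take an action, receive a reward
    total += reward
print("total reward over 100 steps:", total)
```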
4. Temporal Difference Learning
- The key idea in TD is to learn a function of the state that estimates the sum of future rewards to be expected from that state.
- The TD trick for learning this: the current estimate should equal the reward received plus the estimate from the next state.
Long-term estimate of reward
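In standard notation, with γ the usual discount factor, the quantity being estimated and the recursion behind the TD trick are:

```latex
V(s_t) \approx \mathbb{E}\left[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \right]
       = \mathbb{E}\left[\, r_{t+1} + \gamma\, V(s_{t+1}) \right]
```

The second equality is what licenses bootstrapping: the discounted sum from t+1 onward is itself V(s_{t+1}).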
5. Temporal Difference Learning
- Thus we would like: V(s_t) = r_{t+1} + γ V(s_{t+1}).
- The error in the current state estimate is therefore the temporal difference in estimates (the difference between the current and next time step):
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
Called the effective reinforcement signal!
6. Example: The expected time in traffic
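The slide's figure is not reproduced here, but a hedged TD(0) sketch of this kind of prediction task follows (the states, segment times, and learning rate are invented for illustration; since we are predicting elapsed time rather than discounted reward, no discounting is used):

```python
import random

# V(s): estimated minutes remaining until home from each stage of a commute.
states = ["office", "car", "highway", "exit", "home_street", "home"]
V = {s: 0.0 for s in states}
alpha = 0.1   # learning rate

def segment_time(i):
    """Minutes spent between consecutive states (noisy, made-up numbers)."""
    base = [5.0, 15.0, 10.0, 10.0, 3.0][i]
    return max(0.0, random.gauss(base, 2.0))

for episode in range(1000):
    for i in range(len(states) - 1):
        s, s_next = states[i], states[i + 1]
        elapsed = segment_time(i)
        # TD error: observed segment time plus next-state estimate,
        # minus the current estimate.
        delta = elapsed + V[s_next] - V[s]
        V[s] += alpha * delta

for s in states:
    print(f"{s:12s} expected minutes remaining: {V[s]:5.1f}")
```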
7. Linear Model for Neuron V
- Assume a linear neuron V fed by some other neurons x_k: V_t = Σ_k w_k x_{k,t}.
- To minimize the error δ_t, we take the gradient of the squared error and find the learning rule Δw_k = α δ_t x_{k,t} (derivation below).
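Spelling out the derivation the slide alludes to, treating the bootstrapped target r_{t+1} + γV_{t+1} as a constant (the usual "semi-gradient" step):

```latex
V_t = \sum_k w_k x_{k,t}, \qquad
\delta_t = r_{t+1} + \gamma V_{t+1} - V_t
```

```latex
-\frac{\partial}{\partial w_k}\,\tfrac{1}{2}\delta_t^2
  \;\approx\; \delta_t \, x_{k,t}
\quad\Longrightarrow\quad
\Delta w_k = \alpha \, \delta_t \, x_{k,t}
```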
8. Demo
- Conditioning
- Masking
- Secondary conditioning
- Note that this TD actor-critic (A-C) model covers far more conditioning cases than the Rescorla-Wagner model (secondary conditioning, masking, etc.) (1)
9. The Actor-Critic Model
This is a policy
10. The Actor-Critic Model
- (Stochastic) softmax rule
- Similar to hardmax or winner-take-all, but the probability that a given actor unit will be the winner (the active neuron) is proportional to its activation.
- Intuitive learning rule (for the winner unit a_j), sketched in code below:
- More reward than expected means a good alternative: strengthen the state/action association.
- Less reward than expected means a bad alternative: weaken the state/action association.
Hebbian?
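A sketch of this stochastic winner-take-all actor and its δ-modulated update (the sizes, constants, and the exponential form of the selection rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_actions = 4, 3
W = rng.normal(0.0, 0.1, size=(n_actions, n_inputs))  # state/action associations
alpha = 0.05                                           # actor learning rate

def select_action(x, temperature=1.0):
    """Winner is drawn with probability increasing in each unit's activation."""
    a = W @ x                      # actor unit activations for this state
    p = np.exp(a / temperature)
    p /= p.sum()
    return rng.choice(n_actions, p=p)

def actor_update(x, j, delta):
    """delta > 0 (more reward than expected): strengthen the winner's
    association with this state; delta < 0: weaken it."""
    W[j] += alpha * delta * x

x = rng.normal(size=n_inputs)      # some state representation
j = select_action(x)
actor_update(x, j, delta=+0.5)     # e.g., a positive TD error
```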
11. Example: Finding the traffic shortcut
12. Demo
13. How is this Related to Neurophysiology?
[Diagram: dopamine projections from the SNc (substantia nigra pars compacta) and VTA (ventral tegmental area) to the striatum and frontal cortex]
14. The Actor-Critic Model
15. TD Error Signal and Dopaminergic Neurons
16. Incentive Salience
17. Incentive Salience
- When DA is blocked, animals trained to run a maze for reward tend to stay still. But if placed manually near the reward, they take as much of it as they normally would.
- Question:
- How can δ (the DA signal) also play a role in real-time action-selection modulation?
18. Incentive Salience
- Assume that, to trigger an action, the winning actor must reach a minimal activation threshold.
- How could we use the DA signal to modulate activity? (One possible scheme is sketched below.)
- Now, what happens when we block DA in this modified model?
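One possible scheme, purely as a sketch (the multiplicative gain and the specific threshold are assumptions, not something the slides commit to): let DA scale actor activations; blocking DA then drops every unit below the firing threshold, so no action is triggered even though the learned associations remain intact.

```python
import numpy as np

# Hypothetical scheme: dopamine acts as a multiplicative gain on actor
# activations; an action fires only if the winning unit clears a threshold.
THRESHOLD = 0.5

def triggered_action(activations, da_gain):
    """Return the winning action index, or None if nothing clears threshold."""
    a = da_gain * np.asarray(activations)
    j = int(np.argmax(a))
    return j if a[j] >= THRESHOLD else None

acts = [0.2, 0.7, 0.4]                        # learned actor activations for some state
print(triggered_action(acts, da_gain=1.0))    # normal DA: action 1 fires
print(triggered_action(acts, da_gain=0.3))    # DA blocked: None, the animal stays still
```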
19. Drug Addiction
20. Relations to Drug Addiction
- Some addictive drugs increase the dopamine level.
- Questions (without incentive salience):
- What does it mean in terms of the reward estimate?
- What does it mean in terms of action learning (conditioning)?
- Is there hope for unlearning?
- What would be the effect of incentive salience?
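For the reward-estimate and unlearning questions, one well-known formalization (Redish, 2004) treats the drug as putting a floor under δ that cannot be predicted away; a sketch with invented numbers:

```python
# Sketch of Redish's (2004) account: the drug-induced dopamine surge puts a
# floor under the TD error, so it can never be predicted away. The value of
# the drug-taking state then grows without bound, and ordinary extinction
# (delta < 0 once reward stops) cannot occur. Numbers are illustrative.
alpha, gamma, D = 0.1, 0.9, 1.0     # learning rate, discount, drug dopamine surge
V = 0.0                             # value estimate of the drug-taking state
for trial in range(100):
    delta_td = 0.0 + gamma * 0.0 - V        # no natural reward, terminal next state
    delta = max(delta_td + D, D)            # drug: effective delta is floored at D
    V += alpha * delta
print(f"V after 100 trials: {V:.1f}")       # keeps growing ~ alpha * D per trial
```

Since δ never goes negative at the drug state, the value keeps inflating and unlearning never gets a chance to occur.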
21. The Actor-Critic Model
22. Relation to Representation Learning
23. Relation to Representation Learning
- Assume the cortex learns by experimenting.
- The basal ganglia A-C TD model uses cortical inputs as part of its representation.
- Assume that novelty (something we can't yet represent is novel) is rewarding (1).
- How could this be helpful in learning new representations? (A sketch follows this slide.)
- Can this be related to play?
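A minimal sketch of the novelty-as-reward idea, assuming (purely for illustration) that novelty is measured by the prediction error of a simple linear representation model:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.zeros((4, 4))          # toy linear "representation" predicting x from x
beta, eta = 0.5, 0.1          # novelty bonus scale, representation learning rate

def novelty_bonus(x):
    """Prediction error of the representation model: high for novel inputs."""
    err = x - W @ x
    return beta * float(err @ err)

def observe(x, external_reward):
    global W
    r = external_reward + novelty_bonus(x)    # augmented reward fed to the critic
    W += eta * np.outer(x - W @ x, x)         # representation improves with exposure
    return r

x = rng.normal(size=4)
for step in range(5):
    print(f"step {step}: augmented reward = {observe(x, 0.0):.3f}")  # bonus decays
```

As the representation improves, the bonus for that input decays to zero, so the model is drawn toward whatever it cannot yet represent, which is one way to read the connection to play.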
24. Relation to Representation Learning
[Diagram: actor-critic architecture showing motor output, actor units, the value unit V, reward, DA, and a novelty signal connected to frontal, higher, and primary cortex]
25. Relation to Cognition, Frontal Cortex, and Gating
26. DA and Frontal Cortex
[Diagram: DA projections from the SNc and VTA to the striatum and frontal cortex]
27. Working Memory (and Gating)
- Frontal cortex is often considered to implement working memory.
- Working memory must contain some form of gating (a sketch follows).
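A minimal sketch of what "some form of gating" might mean computationally (the cell and gate signal are illustrative; nothing here is specific to the frontal cortex):

```python
# Gated working-memory cell: stored content changes only when a gate signal
# (here, imagined as a phasic DA event) opens the gate; otherwise the memory
# is maintained against new input. Purely illustrative.
class GatedMemoryCell:
    def __init__(self):
        self.content = None
    def step(self, new_input, gate_open):
        if gate_open:                  # e.g., a salient / reward-predicting event
            self.content = new_input   # load: overwrite memory with current input
        return self.content            # otherwise: maintain previous content

wm = GatedMemoryCell()
print(wm.step("cue A", gate_open=True))        # cue A      (loaded)
print(wm.step("distractor", gate_open=False))  # cue A      (maintained)
print(wm.step("cue B", gate_open=True))        # cue B      (updated)
```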
28. Working Memory (and Gating)
- Frontal cortex differs from the rest of cortex by its DA input.
29. How could DA act on gates (directly)?
- By indicating when to shift memory content or memory context (WCS example)? (1)
- By indicating salient events?
  - Novelty?
  - Prediction error?
  - Reward-related stimulus?
- By directing attention?
  - Incentive salience in goals?
30. How could DA act on learning (indirectly)?
- By guiding learning:
  - What to learn (structural abstraction)?
  - When to learn (temporal abstraction)?
  - Positive or negative corrections?
  - Large or small corrections?
31. DA and Frontal Cortex
[Diagram: DA projections from the SNc and VTA to the striatum and frontal cortex]
32. Questions?
Email: francois.rivest_at_mail.mcgill.ca
Blog: www-etud.iro.umontreal.ca/rivestfr/wordpress