Title: Neural Implementations of Bayesian Inference
Slide 1: Neural Implementations of Bayesian Inference
Alexandre Pouget, Department of Brain and Cognitive Sciences, University of Rochester
Slide 2: Outline
- Encoding probability distributions with spikes
- Bayesian inference with spikes: multisensory integration
- Bayesian inference with spikes: decision making
- Alternative schemes
- Maximum likelihood estimation
Slide 3: Visuo-Tactile Integration
(Ernst and Banks, Nature, 2002)
Slide 4: Visuo-Tactile Integration
Bimodal: p(s|Vision,Touch) = α p(s|Vision) p(s|Touch)
[Figure: probability vs. S (width), showing p(s|Vision) and the bimodal posterior p(s|Vision,Touch)]
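Illustrative sketch (not from the original slides): a minimal Python example of this product rule, assuming the two cue likelihoods are Gaussian as in the Ernst and Banks setup. The means, standard deviations, and grid are made-up values.

```python
import numpy as np

# Hypothetical numbers chosen for illustration only.
s = np.linspace(40.0, 60.0, 1001)            # grid over the stimulus (width)
ds = s[1] - s[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_vision = gaussian(s, 52.0, 2.0)            # p(s|Vision), assumed Gaussian
p_touch  = gaussian(s, 49.0, 1.0)            # p(s|Touch), assumed Gaussian

# Product rule: p(s|Vision,Touch) = alpha * p(s|Vision) * p(s|Touch)
p_bimodal = p_vision * p_touch
p_bimodal /= p_bimodal.sum() * ds            # normalization constant alpha

mean_vt = np.sum(s * p_bimodal) * ds
var_vt = np.sum((s - mean_vt) ** 2 * p_bimodal) * ds
# mean_vt ~ 49.6 and var_vt ~ 0.8 under these assumed parameters: the combined
# estimate is pulled toward the more reliable cue and is more precise than
# either cue alone (1/sigma_VT^2 = 1/sigma_V^2 + 1/sigma_T^2).
print(mean_vt, var_vt)
```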
Slide 5: Main Issues
- How do cortical neurons represent probability distributions?
- How do they take products of distributions?
- How do we make optimal decisions? How do neurons collapse distributions onto maximum likelihood estimates?
Slide 6: Main Issues
- And how do they do so given the high level of variability in neuronal responses in cortex?
Slide 7: Poisson Variability in Cortex
The variability is Poisson-like: p(r|s) (r: spike count) is bell shaped, with variance proportional to the mean (Fano factors within 0.3-1.8; the Fano factor for a Poisson process is 1).
[Figure: spike rasters for four trials with the same stimulus]
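Illustrative sketch (not from the slides) of what "Poisson-like" means in practice: simulate repeated trials and compute the Fano factor, the variance of the spike count divided by its mean. The rate and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

mean_rate = 12.0                             # assumed mean spike count per trial
counts = rng.poisson(mean_rate, size=1000)   # 1000 simulated trials

fano = counts.var() / counts.mean()          # Fano factor = variance / mean
print(fano)                                  # close to 1 for a Poisson process
```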
Slide 8: Probabilistic population code
- As an example, we consider a population of neurons with Gaussian tuning curves and independent Poisson variability.
r = (r1, r2, ..., rn)
[Figure: left, Gaussian tuning curves (activity vs. stimulus); right, population pattern of activity on a single trial (spike count vs. preferred stimulus)]
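A minimal Python sketch (an illustration, not the authors' code) of such a single-trial population pattern: Gaussian tuning curves scaled by a gain, with independent Poisson spike counts. The preferred stimuli, gain, and tuning width are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)

prefs = np.linspace(-45, 45, 61)        # preferred stimuli of the population
sigma_tc = 10.0                         # assumed tuning-curve width (deg)
gain = 20.0                             # assumed gain (sets the peak mean count)

def tuning(s):
    """Mean spike counts f_i(s): Gaussian tuning curves scaled by the gain."""
    return gain * np.exp(-0.5 * ((s - prefs) / sigma_tc) ** 2)

s_true = 5.0                            # stimulus on this trial
r = rng.poisson(tuning(s_true))         # population pattern of activity on a single trial
print(r)
```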
Slide 9: Population codes
- Standard approach: estimate a single value of s from r (e.g., with the population vector).
[Figure: population pattern of activity (spike count vs. preferred stimulus) with the population vector estimate marked]
Underlying assumption: population codes encode single values.
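A minimal sketch of this standard single-value readout (illustrative only): a population-vector estimate, here written for a periodic variable, with a tiny made-up example.

```python
import numpy as np

def population_vector(r, prefs_deg, period_deg=360.0):
    """Population-vector readout: activity-weighted sum of unit vectors at each
    preferred value; returns the angle of the resultant (a single estimate)."""
    angles = 2 * np.pi * np.asarray(prefs_deg) / period_deg
    x = np.sum(r * np.cos(angles))
    y = np.sum(r * np.sin(angles))
    return np.arctan2(y, x) * period_deg / (2 * np.pi)

# Tiny made-up example: three neurons preferring -10, 0 and 10 deg.
print(population_vector(np.array([2, 8, 4]), np.array([-10.0, 0.0, 10.0])))
# Returns a single value of s; all information about uncertainty is discarded.
```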
Slide 10: Probabilistic population codes
- Alternative: compute a posterior distribution p(s|r) from r (Foldiak, 1993; Sanger, 1996).
[Figure: population pattern of activity (spike count vs. preferred stimulus)]
Variability in neural responses for a constant stimulus: Poisson-like.
Slide 11: Probabilistic population codes
- For independent Poisson noise, the posterior takes a product-of-experts form:
  p(s|r) ∝ p(s) ∏_i e^(-f_i(s)) f_i(s)^(r_i) / r_i!
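Illustrative sketch (not the original code): evaluating this posterior on a grid under a flat prior, reusing the made-up Gaussian-tuning, Poisson population from the earlier sketch.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
prefs = np.linspace(-45, 45, 61)             # preferred stimuli (made up)
sigma_tc, gain = 10.0, 20.0                  # assumed tuning width and gain

def tuning(s):
    return gain * np.exp(-0.5 * ((s - prefs) / sigma_tc) ** 2)

r = rng.poisson(tuning(5.0))                 # a single-trial pattern at s = 5

# Posterior on a grid: product over neurons ("experts") of Poisson terms, flat prior.
s_grid = np.linspace(-45, 45, 181)
loglik = np.array([poisson.logpmf(r, tuning(s)).sum() for s in s_grid])
post = np.exp(loglik - loglik.max())
post /= post.sum()                           # normalized p(s|r) on the grid
# The posterior peaks near 5 and is narrower when the gain is higher.
```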
Slide 12:
For independent Poisson noise, the width of the posterior is set by the gain of the population activity. Therefore, the gain encodes the certainty associated with the encoded variable.
Slide 13: Gain and variance
- For independent Poisson noise, the variance of the posterior is inversely proportional to the total spike count, and hence, on average, to the gain g: the higher the gain, the narrower the posterior.
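A hedged derivation sketch (my reconstruction, not reproduced from the slides), assuming Gaussian tuning curves f_i(s) = g exp(-(s - s_i)^2 / 2σ_a^2) that tile the stimulus densely, so that Σ_i f_i(s) is approximately constant in s, and a flat prior:

\log p(s \mid \mathbf{r}) = \sum_i r_i \log f_i(s) - \sum_i f_i(s) + \text{const}
\approx -\frac{1}{2\sigma_a^2} \sum_i r_i (s - s_i)^2 + \text{const},

so the posterior is approximately Gaussian with

\sigma_{\text{post}}^2 \approx \frac{\sigma_a^2}{\sum_i r_i},
\qquad
\mathbb{E}\Big[\sum_i r_i\Big] \propto g,

which is why the gain controls the (average) width of the posterior.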
Slide 14: Experimental Evidence
Anderson et al., Nature, 2000
Slide 15: Experimental Evidence
- Contrast
- Motion coherence
- Retinal eccentricity
Slide 16: Outline
- Bayesian inference: multisensory integration
Slide 17: Inferences with probabilistic population codes
[Figure: two input populations (activity vs. preferred s). Cue 1 (Vision) has gain g1; Cue 2 (Touch) has gain g2.]
Slide 18:
[Figure: adding the two population patterns (gains g1 and g2) yields a combined pattern with gain g = g1 + g2.]
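A minimal numerical check of this claim (my illustration, building on the earlier sketches): with independent Poisson noise and the same Gaussian tuning profile for both cues, decoding the summed pattern r3 = r1 + r2 gives the same posterior as multiplying the two individual posteriors. Gains, stimulus values, and population parameters are made up.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
prefs = np.linspace(-60, 60, 121)          # preferred stimuli (made-up population)
s_grid = np.linspace(-45, 45, 181)
sigma_tc = 10.0                            # assumed tuning width

def tuning(s, gain):
    return gain * np.exp(-0.5 * ((s - prefs) / sigma_tc) ** 2)

def post(r, gain):
    """Posterior over s under independent Poisson noise and a flat prior."""
    ll = np.array([poisson.logpmf(r, tuning(s, gain)).sum() for s in s_grid])
    p = np.exp(ll - ll.max())
    return p / p.sum()

g1, g2 = 10.0, 25.0                        # made-up gains (cue reliabilities)
r1 = rng.poisson(tuning(0.0, g1))          # cue 1 pattern
r2 = rng.poisson(tuning(3.0, g2))          # cue 2 pattern, slightly different mean

# Sum of independent Poisson counts is Poisson with the summed mean,
# so the summed pattern is decoded with tuning curves of gain g1 + g2.
p3 = post(r1 + r2, g1 + g2)                # posterior decoded from the summed pattern
p12 = post(r1, g1) * post(r2, g2)          # product of the individual posteriors
p12 /= p12.sum()

print(np.max(np.abs(p3 - p12)))            # ~0: adding patterns multiplies posteriors
```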
Slide 19: Visuo-Tactile Integration
Bimodal: p(s|Vision,Touch) = α p(s|Vision) p(s|Touch)
[Figure: probability vs. S (width), showing p(s|Vision) and the bimodal posterior p(s|Vision,Touch)]
Slide 20: Normalization
- Divisive normalization can be used to keep neurons within their firing range.
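For illustration only (the slides do not give the exact expression used), one common form of divisive normalization rescales each neuron's drive by the summed activity of the population:

r_i = \frac{u_i^2}{\sigma^2 + \sum_j u_j^2},

where u_i is the neuron's feedforward drive and σ is a constant. Because every neuron is divided by the same factor, the overall gain is kept bounded while the shape of the population pattern is preserved.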
Slide 21: Assumptions
- Neural noise: independent Poisson
- Gaussian tuning curves
- Unimodal Gaussian probability distributions over the stimulus
- Is this result more general?
Slide 22: Bayesian decoder
[Figure: two input patterns r1 (Cue 1) and r2 (Cue 2) are summed into r1 + r2; applying a Bayesian decoder to the sum should be equivalent to decoding each cue separately and combining the results by Bayesian inference.]
Slide 23: Variability requirements
- Exponential-family distributions with linear sufficient statistics: p(r|s) ∝ φ(r) exp(h(s)·r), where the kernel h(s) is determined by Σ(s), the covariance matrix of r, and f'(s), the derivative of the tuning curves.
Slide 24: Kernel h(s)
- The kernel satisfies h'(s) = Σ(s)^-1 f'(s): the inverse covariance matrix of r times the derivative of the tuning curves (which captures the covariance between r and s).
- This is the local optimal linear estimator!
Slide 25: Covariance requirements
- This family includes any distribution in which the covariance matrix is proportional to the mean, regardless of the form of the correlations.
- Any exponential-family distribution with a fixed Fano factor works.
Slide 26: Tuning curve requirements
- The tuning curve f(s) can take any shape. However, h(s) has to be the same in all populations. What if it's not the same?
Slide 27: Tuning curves: Identical Gaussians
[Figure: Cue 1 and Cue 2 populations with identical Gaussian tuning curves (activity vs. preferred s)]
Slide 28: Tuning curves: Gaussians with different widths
[Figure: Cue 1 and Cue 2 populations with Gaussian tuning curves of different widths]
Slide 29: Tuning curves: Gaussians vs. Sigmoids
[Figure: Cue 1 with Gaussian tuning curves and Cue 2 with sigmoidal tuning curves (activity vs. preferred s)]
Slide 30: Tuning curve requirements
- Let's say r1 has Gaussian tuning curves and r2 has sigmoidal tuning curves. Then the optimal combination is still a linear combination, now through a matrix A.
- The matrix A exists if the tuning curves form basis sets.
Slide 31: Distribution over s
- p(s|r) does not have to be a normal distribution over s.
Slide 32: Prior Distributions
- Priors are easily incorporated.
- Prediction: baseline activity in cortex (e.g., before the start of a trial) should encode the prior distribution.
- There is evidence for this idea in LIP (Glimcher and Platt) and the superior colliculus (Basso and Wurtz).
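One way to see why this works (my sketch in the exponential-family notation above, not an equation from the slides): if the prior itself is written in the same form, p(s) ∝ exp(h(s)·r0) for some baseline pattern r0, then

p(s \mid \mathbf{r}) \propto p(\mathbf{r} \mid s)\, p(s) \propto \exp\!\big(\mathbf{h}(s)\cdot(\mathbf{r} + \mathbf{r}_0)\big),

so adding the baseline pattern r0 to the evoked activity multiplies the likelihood by the prior.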
Slide 33: Summary
- Linear combinations of PPCs are equivalent to optimal Bayesian inference when the variability follows an exponential-family distribution. This works for:
  - all covariance matrices that are proportional to the mean (fixed Fano factor)
  - any set of tuning curves that forms a basis set
  - any probability distribution over s
  - any prior distribution over s
Slide 34: Integrate-and-fire neurons
- Can we get a similar result with realistic networks of spiking neurons, such as integrate-and-fire neurons?
Slide 35: Integrate-and-fire neurons
- Output layer
  - 1200 conductance-based integrate-and-fire neurons, 1000 excitatory, 200 inhibitory
  - Lateral connections
  - High Fano factors (0.3 to 1)
  - Correlated activity
  - Linear in rates
- Input: near-Poisson correlated spike trains with different gains and slightly different means
[Figure: Cue 1 and Cue 2 input patterns (activity vs. preferred s) with gains g1 and g2]
Slide 36: Test cue 1 alone
[Figure: input pattern for Cue 1 and the resulting output pattern r1 (activity vs. preferred s)]
Slide 37: Test cue 2 alone
[Figure: input pattern for Cue 2 and the resulting output pattern r2 (activity vs. preferred s)]
Slide 38: Test cue 1 and cue 2 together
[Figure: input patterns for Cue 1 and Cue 2 and the resulting output pattern r3 (activity vs. preferred s)]
Slide 39: Compare the distributions
How does p(r3|s) compare to p(r1|s) p(r2|s)?
[Figure: output pattern r3 and the two input patterns for Cue 1 and Cue 2 (activity vs. preferred s)]
Slide 40: p(r3|s) versus p(r1|s) p(r2|s)
Identical tuning curves.
[Figure: Cue 1 and Cue 2 tuning curves (activity vs. preferred s)]
Slide 41: p(r3|s) versus p(r1|s) p(r2|s)
Different tuning curves and different correlations.
[Figure: Cue 1 and Cue 2 tuning curves; scatter plots comparing the mean (approx. 89-96) and the variance (approx. 0-3) of p(r3|s) against those of p(r1|s) p(r2|s)]
Slide 42: p(r3|s) versus p(r1|s) p(r2|s)
[Figure: Cue 1 and Cue 2 tuning curves (activity vs. preferred s)]
Slide 43: Experimental prediction
- Multisensory neurons should be linear on average.
Slide 44: Experimental prediction
- The main results in the literature are nonlinear combinations (superadditivity)!
Wallace, Meredith, and Stein, J Neurophys 1998
Slide 45: Experimental prediction
- The main results in the literature are nonlinear combinations (superadditivity)!
- In fact, nonlinearity is the criterion used to define multisensory areas in fMRI.
- Are we already proven wrong?
Slide 46: Experimental prediction
Perrault, Vaughan, Stein, and Wallace, J Neurophys 2005
Slide 47: Inference over time
- Can we generalize this approach to inference over time, and more generally to time-varying signals?
Slide 48: Outline
- Bayesian inference: decision making
Slide 49: Binary Decision Making
Shadlen et al.
Slide 50: Binary Decision Making
- The Bayesian strategy involves computing the posterior distribution given all activity patterns from MT up to the current time.
- Therefore, all we need to do is add the activity patterns over time.
- This predicts that decision neurons act like integrators.
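A minimal sketch of this temporal accumulation (my illustration, reusing the made-up Poisson population above, not the model in the slides): summing the population patterns across time bins and decoding the running sum sharpens the posterior as evidence accumulates, which is what an integrator-like decision neuron would implement.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
prefs = np.linspace(-90, 90, 91)           # preferred directions (made up)
s_grid = np.linspace(-90, 90, 181)
sigma_tc, gain = 20.0, 2.0                 # assumed tuning width and per-bin gain

def tuning(s):
    return gain * np.exp(-0.5 * ((s - prefs) / sigma_tc) ** 2)

def posterior_from_sum(r_sum, n_bins):
    """Decode the accumulated counts: the sum of n_bins Poisson patterns is
    Poisson with mean n_bins * f(s), so the decoder just rescales the tuning curves."""
    ll = np.array([poisson.logpmf(r_sum, n_bins * tuning(s)).sum() for s in s_grid])
    p = np.exp(ll - ll.max())
    return p / p.sum()

s_true = 10.0
n_bins = 20
r_sum = np.zeros(len(prefs), dtype=int)
for t in range(n_bins):                    # 20 time bins of MT-like activity
    r_sum = r_sum + rng.poisson(tuning(s_true))   # integration: add patterns over time

post = posterior_from_sum(r_sum, n_bins)
# The posterior over direction narrows as more time bins are added.
```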
Slide 51: Bayesian decoder
[Figure: accumulated population pattern (activity vs. preferred s) fed to a Bayesian decoder]
Slide 52: LIP
Roitman and Shadlen, 2002, J. Neurosci.
Slide 53: Outline
- Alternative schemes
Slide 54: Alternative schemes
- Log likelihood ratio (Shadlen et al.; Deneve)
Slide 55: Log Likelihood
- Race models and the Bayesian approach: both amount to a temporal sum over the incoming evidence (log likelihood).
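For concreteness (my notation, not reproduced from the slides), the log-likelihood-ratio scheme for a binary choice between directions s_A and s_B accumulates, over time bins t,

L_T = \sum_{t=1}^{T} \log \frac{p(\mathbf{r}_t \mid s_A)}{p(\mathbf{r}_t \mid s_B)},

and a decision is made when L_T crosses a bound, as in race and diffusion models.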
Slide 56: Differences between Log odds and PPCs
- With PPCs, LIP neurons do not compute the activity difference between MT neurons with opposite direction preferences.
- PPCs and log odds both turn products into sums, but for log odds, sums are products regardless of the noise distribution. Not so for PPCs.
- At the end of the integration, LIP encodes the posterior distribution over direction, i.e., LIP knows how much it can trust its choice.
Slide 57: Alternative schemes
- Log likelihood ratio (Shadlen et al.; Deneve)
- Log probability (Barlow; Rao; Jazayeri and Movshon)
- Probability (Anastasio et al.; Simoncelli; Hoyer and Hyvarinen; Rao; Koechlin et al.)
- Convolution codes (Anderson; Zemel, Dayan, and Pouget)
Slides 58-60: Alternative schemes
(The same list as slide 57, repeated with a population-activity figure: activity vs. stimulus, -90 to 90 deg, asking which quantity over s the pattern encodes under each scheme.)
Slide 61: Alternative schemes
The convolution codes and the log likelihood fail to account for contrast invariance.
[Figure: log p(s|r) vs. orientation (deg, -45 to 45); contrast invariance compared with the predictions for convolution codes and for the log likelihood code]
Slide 62: Outline
- Maximum likelihood estimation
Slide 63: Decision Making
[Diagram: LIP and the superior colliculus]
Slide 64: Maximum Likelihood
[Figure: population activity vs. preferred direction (deg)]
Slide 65: Neural implementation
Slide 66: Optimal decision making
[Figure: LIP and superior colliculus population patterns (activity vs. preferred saccade direction)]
Slide 67: Nonlinear Networks
- Networks in which the activity at time t+1 is a nonlinear function of the activity at the previous time step.
Slide 68: Line Attractor Networks
- Attractor network with population code
- Periodic variable
- Translation-invariant weights
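A minimal sketch of such a network (illustrative only; not the network used in the slides, and with made-up parameters): a ring of units with translation-invariant weights and a simple rectification-plus-normalization nonlinearity. A noisy hill of activity relaxes to a smooth, stereotyped hill whose position serves as the estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
theta = np.linspace(-np.pi, np.pi, n, endpoint=False)   # preferred directions (periodic)

# Translation-invariant weights: each connection depends only on the difference
# in preferred direction (a cosine profile with uniform inhibition, made up here).
W = 0.9 * np.cos(theta[:, None] - theta[None, :]) - 0.2

def step(a, total=20.0):
    """One update: linear recurrence, rectification, and divisive normalization
    that keeps the total activity fixed."""
    u = np.maximum(W @ a, 0.0)
    return total * u / u.sum()

# Noisy initial hill of activity centered near 0.3 rad.
a = np.exp(-0.5 * ((theta - 0.3) / 0.4) ** 2) + 0.2 * rng.standard_normal(n)
a = np.maximum(a, 0.0)

for _ in range(30):
    a = step(a)            # the activity relaxes to a smooth hill of fixed shape

# Read out the position of the hill (phase of its first Fourier component):
s_hat = np.arctan2(np.sum(a * np.sin(theta)), np.sum(a * np.cos(theta)))
print(s_hat)               # roughly 0.3 rad under these made-up parameters
```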
Slide 69: Line Attractor Networks
Slide 70: Line Attractor Networks
[Figure: desired activity profile, and the desired profile over the net input u]
Slide 71: Line Attractor Networks
- The problem with the previous approach is that the weights tend to oscillate. Instead, we minimize a regularized cost.
- The solution is a smooth weight pattern (next slide).
Slide 72: Weight Pattern
[Figure: weight amplitude (roughly -2 to 5) as a function of the difference in preferred orientation]
Slide 73: Optimal decision making
[Figure: LIP and the superior colliculus; population pattern of activity vs. preferred saccade direction]
Slide 74: Optimal decision making
A maximum likelihood estimate minimizes this variance.
[Figure: LIP and superior colliculus population patterns (activity vs. preferred saccade direction)]
Slide 75: Is the network an ML estimator?
[Figure: variances of the estimates, shown as the amount above the maximum likelihood variance, for the population vector and for the network]
Slide 76: Optimality constraint
- For the network to be optimal, the direction along the attractor (the eigenvector with eigenvalue equal to 0) must match Σ^-1 f'(s): the inverse covariance matrix of r times the derivative of the tuning curves (which captures the covariance between r and s). This is the local optimal linear estimator!
- This network is effectively projecting its input onto the LOLE.
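For reference (my formulation; the slide only labeled the terms), the locally optimal linear estimator around a point s0 can be written

\hat{s} = s_0 + \frac{\mathbf{f}'(s_0)^{\top} \Sigma^{-1} \big(\mathbf{r} - \mathbf{f}(s_0)\big)}{\mathbf{f}'(s_0)^{\top} \Sigma^{-1} \mathbf{f}'(s_0)},

whose variance, 1 / (f'(s0)ᵀ Σ⁻¹ f'(s0)), attains the local Cramér-Rao bound for additive Gaussian noise with stimulus-independent covariance. The network is optimal to the extent that its dynamics project r onto this same direction, Σ⁻¹ f'(s0).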
Slide 77: General Results
- Line attractor networks (stable smooth hills) are equivalent to maximum likelihood estimators.
- This result holds regardless of the exact form of the nonlinear activation function.
Slide 78: Performance Over Time
[Figure: standard deviation of the estimate (deg, 0-6) as a function of time (number of iterations, 0-15)]
Slide 79: Optimal decision making
(See the sensorimotor transformation lecture.)
[Figure: two population patterns of activity over S, one mapped through f(S)]
Slide 80: Kalman and Particle Filters