Title: Bayesian models of human learning and inference
1 Bayesian models of human learning and inference
Josh Tenenbaum
- MIT
- Department of Brain and Cognitive Sciences
- Computer Science and AI Lab (CSAIL)
(http://web.mit.edu/cocosci/Talks/nips06-tutorial.ppt)
Thanks to Tom Griffiths, Charles Kemp, Vikash Mansinghka
2 The probabilistic revolution in AI
- Principled and effective solutions for inductive inference from ambiguous data:
  - Vision
  - Robotics
  - Machine learning
  - Expert systems / reasoning
  - Natural language processing
- Standard view: no necessary connection to how the human brain solves these problems.
3 Probabilistic inference in human cognition?
- "People aren't Bayesian"
- Kahneman and Tversky (1970s-present): "heuristics and biases" research program. 2002 Nobel Prize in Economics.
- Slovic, Fischhoff, and Lichtenstein (1976): "It appears that people lack the correct programs for many important judgmental tasks.... it may be argued that we have not had the opportunity to evolve an intellect capable of dealing conceptually with uncertainty."
- Stephen Jay Gould (1992): "Our minds are not built (for whatever reason) to work by the rules of probability."
4 A. greater than 90%  B. between 70% and 90%  C. between 50% and 70%  D. between 30% and 50%  E. between 10% and 30%  F. less than 10%
The probability of breast cancer is 1% for a woman at 40 who participates in a routine screening. If a woman has breast cancer, the probability is 80% that she will have a positive mammography. If a woman does not have breast cancer, the probability is 9.6% that she will also have a positive mammography.
A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
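For readers who want to check the answer, a minimal sketch of the Bayes' rule calculation (the numbers come from the problem above; the variable names are my own):

```python
# Posterior probability of breast cancer given a positive mammogram,
# via Bayes' rule: P(h|d) = P(d|h) P(h) / P(d).
prior = 0.01        # P(cancer): 1% base rate at age 40
sensitivity = 0.80  # P(positive | cancer)
false_pos = 0.096   # P(positive | no cancer)

p_positive = sensitivity * prior + false_pos * (1 - prior)
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # about 0.078, i.e. answer F: less than 10%
```

Most people guess far too high, which is the point of the demonstration: the low base rate dominates the fairly accurate test.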
5 Availability biases in probability judgment
- How likely is it that a randomly chosen word
  - ends in "g"?
  - ends in "ing"?
- When buying a car, how much do you weigh your friend's experience relative to consumer satisfaction surveys?
8 Probabilistic inference in human cognition?
- "People aren't Bayesian"
- Kahneman and Tversky (1970s-present): "heuristics and biases" research program. 2002 Nobel Prize in Economics.
- Psychology is often drawn towards the mind's errors and apparent irrationalities.
- But the computationally interesting question remains: how does the mind work so well?
9 Bayesian models of cognition
- Visual perception: Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...
- Language acquisition and processing: Brent, de Marcken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, ...
- Motor learning and motor control: Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr, ...
- Associative learning: Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, ...
- Memory: Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, ...
- Attention: Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, ...
- Categorization and concept learning: Anderson, Nosofsky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, ...
- Reasoning: Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, ...
- Causal inference: Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, ...
- Decision making and theory of mind: Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, ...
10 Learning concepts from examples
"horse" ... "horse" ... "horse"
11 Learning concepts from examples
12 Everyday inductive leaps
- How can people learn so much about the world...
  - Kinds of objects and their properties
  - The meanings of words, phrases, and sentences
  - Cause-effect relations
  - The beliefs, goals and plans of other people
  - Social structures, conventions, and rules
- ...from such limited evidence?
13 Contributions of Bayesian models
- Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.
- Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.
- A framework for studying people's implicit knowledge about the world: how it is structured, used, and acquired.
- A two-way bridge to state-of-the-art AI and machine learning.
14 Marr's Three Levels of Analysis
- Computation:
  What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
- Algorithm:
  Cognitive psychology
- Implementation:
  Neurobiology
15 What about those errors?
- The human mind is not a universal Bayesian engine.
- But the mind does appear adapted to solve important real-world inference problems in approximately Bayesian ways, e.g.:
  - Predicting everyday events
  - Causal learning and reasoning
  - Learning concepts from examples
- As with perceptual tasks, adults and even young children solve these problems mostly unconsciously, effortlessly, and successfully.
16 Technical themes
- Inference in probabilistic models
  - Role of priors, explaining away.
- Learning in graphical models
  - Parameter learning, structure learning.
- Bayesian model averaging
  - Being Bayesian over network structures.
- Bayesian Occam's razor
  - Trade off model complexity against data fit.
17 Technical themes
- Structured probabilistic models
  - Grammars, first-order logic, relational schemas.
- Hierarchical Bayesian models
  - Acquire abstract knowledge, support transfer.
- Nonparametric Bayes
  - Flexible models that grow in complexity as new data warrant.
- Tractable approximate inference
  - Markov chain Monte Carlo (MCMC), sequential Monte Carlo (particle filtering).
18 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
19 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
20 Basics of Bayesian inference
- Bayes' rule: P(h|d) = P(d|h) P(h) / Σ_h' P(d|h') P(h')
- An example
  - Data: John is coughing
  - Some hypotheses:
    1. John has a cold
    2. John has lung cancer
    3. John has a stomach flu
  - Likelihood P(d|h) favors 1 and 2 over 3
  - Prior probability P(h) favors 1 and 3 over 2
  - Posterior probability P(h|d) favors 1 over 2 and 3
21 Bayesian inference in perception and sensorimotor integration
(Weiss, Simoncelli & Adelson 2002; Kording & Wolpert 2004)
22 Memory retrieval as Bayesian inference (Anderson & Schooler, 1991)
[Figure: three panels: power law of forgetting (log memory strength vs. log delay in hours), spacing effects in forgetting (mean recalled vs. retention interval in days), and additive effects of practice and delay (log memory strength vs. log delay in seconds)]
23 Memory retrieval as Bayesian inference (Anderson & Schooler, 1991)
- For each item in memory, estimate the probability that it will be useful in the present context.
- Use priors based on the statistics of natural information sources.
24 Memory retrieval as Bayesian inference (Anderson & Schooler, 1991)
[Figure: the same three effects (power law of forgetting, spacing effects, additive effects of practice and delay) reproduced as log need odds vs. log days since last occurrence, estimated from New York Times data; c.f. email sources, child-directed speech]
25 Everyday prediction problems (Griffiths & Tenenbaum, 2006)
- You read about a movie that has made $60 million to date. How much money will it make in total?
- You see that something has been baking in the oven for 34 minutes. How long until it's ready?
- You meet someone who is 78 years old. How long will they live?
- Your friend quotes to you from line 17 of his favorite poem. How long is the poem?
- You see taxicab #107 pull up to the curb in front of the train station. How many cabs are in this city?
26 Making predictions
- You encounter a phenomenon that has existed for t_past units of time. How long will it continue into the future? (i.e., what is t_total?)
- We could replace time with any other quantity that ranges from 0 to some unknown upper limit.
27 Bayesian inference
- P(t_total | t_past) ∝ P(t_past | t_total) P(t_total)
  (posterior probability ∝ likelihood × prior)
28 Bayesian inference
- P(t_total | t_past) ∝ P(t_past | t_total) P(t_total) ∝ 1/t_total × 1/t_total
  - Likelihood 1/t_total: assume t_past is a random sample (0 < t_past < t_total)
  - Prior 1/t_total: uninformative prior (e.g., Jeffreys, Jaynes)
29 Bayesian inference
- P(t_total | t_past) ∝ 1/t_total × 1/t_total
  (random-sampling likelihood × uninformative prior)
[Figure: posterior P(t_total | t_past) as a function of t_total, for a fixed t_past]
30 Bayesian inference
- Best guess for t_total: the t* such that P(t_total > t* | t_past) = 0.5
31 Bayesian inference
- Yields Gott's Rule: P(t_total > t* | t_past) = 0.5 when t* = 2 t_past, i.e., the best guess for t_total is 2 t_past.
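The median-of-the-posterior rule can be checked numerically. This is my own illustrative sketch, with a large finite truncation point standing in for the infinite upper range:

```python
import numpy as np

# Posterior P(t_total | t_past) proportional to 1/t_total^2 for t_total >= t_past
# (likelihood 1/t_total times the uninformative prior 1/t_total).
t_past = 30.0
t = np.linspace(t_past, 1000 * t_past, 400_000)  # truncate the infinite range
weights = 1.0 / t**2
cdf = np.cumsum(weights)
median = t[np.searchsorted(cdf, 0.5 * cdf[-1])]
print(median)  # close to 60 = 2 * t_past, as Gott's Rule predicts
```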
32 Evaluating Gott's Rule
- You read about a movie that has made $78 million to date. How much money will it make in total?
  - $156 million seems reasonable.
- You meet someone who is 35 years old. How long will they live?
  - 70 years seems reasonable.
- Not so simple:
  - You meet someone who is 78 years old. How long will they live?
  - You meet someone who is 6 years old. How long will they live?
33 The effects of priors
- Different kinds of priors P(t_total) are appropriate in different domains.
[Figure: example prior shapes for different domains, e.g., wealth and contacts vs. height and lifespan; Gott's prior P(t_total) ∝ 1/t_total]
34 The effects of priors
35 Evaluating human predictions
- Different domains with different priors:
  - A movie has made $60 million
  - Your friend quotes from line 17 of a poem
  - You meet a 78-year-old man
  - A movie has been running for 55 minutes
  - A U.S. congressman has served for 11 years
  - A cake has been in the oven for 34 minutes
- Use 5 values of t_past for each.
- People predict t_total.
37 You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign?
38 You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign?
How long did the typical pharaoh reign in ancient Egypt?
39 Exponential or power law?
If a friend is calling a telephone box office to book tickets and tells you he has been on hold for 3 minutes, how long do you think he will be on hold in total?
40 Summary: prediction
- Predictions about the extent or magnitude of everyday events follow Bayesian principles.
- Contrast with Bayesian inference in perception, motor control, memory: no universal priors here.
- Predictions depend rationally on priors that are appropriately calibrated for different domains:
  - Form of the prior (e.g., power-law or exponential)
  - Specific distribution given that form (parameters)
  - Non-parametric distribution when necessary.
- In the absence of concrete experience, priors may be generated by qualitative background knowledge.
41 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
42 Bayesian networks
- Nodes = variables. Links = direct dependencies. Each node has a conditional probability distribution. Data = observations of X1, ..., X4.
- Four random variables:
  - X1: coughing
  - X2: high body temperature
  - X3: flu
  - X4: lung cancer
43 Causal Bayesian networks
- Nodes = variables. Links = causal mechanisms. Each node has a conditional probability distribution. Data = observations of, and interventions on, X1, ..., X4.
- Four random variables:
  - X1: coughing
  - X2: high body temperature
  - X3: flu
  - X4: lung cancer
(Pearl; Glymour & Cooper)
44 Inference in causal graphical models
- "Explaining away" or "discounting" in social reasoning (Kelley; Morris & Larrick)
- "Screening off" in intuitive causal reasoning (Waldmann; Rehder & Burnett; Blok & Sloman; Gopnik & Sobel)
  - Better in chains than common-cause structures; common-cause better if mechanisms are clearly independent
- Understanding and predicting the effects of interventions (Sloman & Lagnado; Gopnik & Schulz)
[Figure: chain and common-cause structures over A, B, C; comparing P(c|b) vs. P(c|b, a) and P(c|b, not a)]
45 Learning graphical models
- Structure learning: what causes what?
- Parameter learning: how do causes work?
46 Bayesian learning of causal structure
- Data d; causal hypotheses h
[Figure: candidate networks over X1, X2, X3, X4]
1. What is the most likely network h given observed data d?
2. How likely is there to be a link X4 → X2? (Bayesian model averaging)
47 Bayesian Occam's Razor
(MacKay, 2003; Ghahramani tutorials)
For any model M, Σ_D P(D | M) = 1.
Law of "conservation of belief": a model that can predict many possible data sets must assign each of them low probability.
48 Learning causation from contingencies
e.g., Does injecting this chemical cause mice to express a certain gene?

                 C present (c+)   C absent (c-)
E present (e+)        a                c
E absent (e-)         b                d

Subjects judge the extent to which C causes E (rated on a scale from 0 to 100)
49 Two models of causal judgment
- ΔP (Jenkins & Ward, 1965): ΔP = P(e+|c+) - P(e+|c-)
- Power PC (Cheng, 1997): power = ΔP / (1 - P(e+|c-))
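In terms of the contingency table above, both models reduce to simple formulas. A sketch (the example counts are my own):

```python
def delta_p_and_power(a, b, c, d):
    """a, b: effect present/absent with the cause; c, d: without the cause."""
    p_e_c = a / (a + b)    # P(e+ | c+)
    p_e_nc = c / (c + d)   # P(e+ | c-)
    delta_p = p_e_c - p_e_nc
    power = delta_p / (1 - p_e_nc)  # Cheng's causal power (generative case)
    return delta_p, power

dp, power = delta_p_and_power(a=6, b=2, c=2, d=6)
print(dp, power)  # 0.5 and ~0.667
```

The same ΔP can correspond to different causal power when the base rate P(e+|c-) differs, which is what lets the two models be teased apart experimentally.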
50 Judging the probability that C → E (Buehner & Cheng, 1997; 2003)
- Independent effects of both ΔP and causal power.
- At ΔP = 0, judgments decrease with base rate ("frequency illusion").
51 Learning causal strength (parameter learning)
- Assume this causal structure: a background cause B and the candidate cause C both point at effect E.
- ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C):
  - linear → ΔP; noisy-OR → causal power
52 Learning causal structure (Griffiths & Tenenbaum, 2005)
- Hypotheses:
  - h1: both C and B cause E
  - h0: only B causes E
- Bayesian causal support: the log likelihood ratio (Bayes factor), log P(d|h1) / P(d|h0), gives the evidence in favor of h1
- noisy-OR parameterization (assume uniform parameter priors, but see Yuille et al., Danks et al.)
53 Buehner and Cheng (1997)
[Figure: people's judgments vs. model predictions: ΔP (r = 0.89), Power (r = 0.88), Support (r = 0.97)]
54 Implicit background theory
- Injections may or may not cause gene expression, but gene expression does not cause injections.
  - No hypotheses with E → C
- Other naturally occurring processes may also cause gene expression.
  - All hypotheses include an always-present background cause B → E
- Causes are generative, probabilistically sufficient and independent; i.e., each cause independently produces the effect in some proportion of cases.
  - Noisy-OR parameterization
55 Sensitivity analysis
[Figure: people's judgments vs. Support (noisy-OR), χ², and Support with a generic parameterization]
56 Generativity is essential
[Figure: judgments and Support-model predictions for conditions with P(e+|c+) = P(e+|c-) = 8/8, 6/8, 4/8, 2/8, 0/8; prediction scale 100 / 50 / 0]
- Predictions result from a ceiling effect: ceiling effects only matter if you believe a cause increases the probability of an effect.
57 Different parameterizations for different kinds of mechanisms
Does C cause E?
Is there a difference in E with C vs. not-C?
Does C prevent E?
58 Blicket detector (Sobel, Gopnik, and colleagues)
59 Backwards blocking (Sobel, Tenenbaum & Gopnik, 2004)
[Figure: AB trial, A trial]
- Initially: nothing on detector; detector silent (A=0, B=0, E=0)
- Trial 1: A and B on detector; detector active (A=1, B=1, E=1)
- Trial 2: A on detector; detector active (A=1, B=0, E=1)
- 4-year-olds judge whether each object is a blicket:
  - A: a blicket (100% say yes)
  - B: probably not a blicket (34% say yes)
[Figure: candidate causal links A → E, B → E]
(cf. explaining away in weight space, Dayan & Kakade)
60 Possible hypotheses?
[Figure: the full set of candidate causal graph structures over A, B, and E, varying which links among them are present and their directions]
61 Bayesian causal learning
- With a uniform prior on hypotheses and a generic parameterization:
[Figure: probability of being a blicket is roughly equal for A and B (values around 0.32-0.34)]
62 A stronger hypothesis space
- Links can only exist from blocks to detectors.
- Blocks are blickets with prior probability q.
- Blickets always activate detectors; detectors never activate on their own (i.e., deterministic OR parameterization, no hidden causes).

Hypotheses (which of A, B link to E):
  h00: neither,  P(h00) = (1 - q)^2
  h01: B only,   P(h01) = (1 - q) q
  h10: A only,   P(h10) = q (1 - q)
  h11: both,     P(h11) = q^2

                        h00   h01   h10   h11
  P(E=1 | A=0, B=0)      0     0     0     0
  P(E=1 | A=1, B=0)      0     0     1     1
  P(E=1 | A=0, B=1)      0     1     0     1
  P(E=1 | A=1, B=1)      0     1     1     1
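Under this hypothesis space, backwards blocking falls out of the posterior directly. A sketch of the computation (my own code, following the table above):

```python
from itertools import product

def blicket_posteriors(q, data):
    """data: (A, B, E) triples; E is a deterministic OR of the blickets present."""
    post = {}
    for ha, hb in product([0, 1], repeat=2):          # is A / B a blicket?
        prior = (q if ha else 1 - q) * (q if hb else 1 - q)
        consistent = all(e == ((a and ha) or (b and hb)) for a, b, e in data)
        post[(ha, hb)] = prior * consistent
    z = sum(post.values())
    p_a = (post[(1, 0)] + post[(1, 1)]) / z
    p_b = (post[(0, 1)] + post[(1, 1)]) / z
    return p_a, p_b

# nothing -> silent; A+B together -> active; A alone -> active
p_a, p_b = blicket_posteriors(q=0.3, data=[(0, 0, 0), (1, 1, 1), (1, 0, 1)])
print(p_a, p_b)  # 1.0 and 0.3
```

A is certainly a blicket, while the probability that B is a blicket falls back to its prior q: exactly the asymmetry the children show.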
63 Manipulating prior probability (Tenenbaum, Sobel, Griffiths & Gopnik)
[Figure: judgments across the Initial, AB Trial, and A Trial phases]
64 Learning more complex structures
- Tenenbaum et al.; Griffiths & Sobel: detectors with more than two objects and noisy mechanisms
- Steyvers et al.; Sobel & Kushnir: active learning with interventions (c.f. Tong & Koller; Murphy)
- Lagnado & Sloman: learning from interventions on continuous dynamical systems
65 Inferring hidden causes
The "stick ball" machine (Kushnir, Schulz, Gopnik & Danks, 2003)
[Figure: observed trial frequencies under three hypotheses: common unobserved cause (4x, 2x, 2x), independent unobserved causes (1x, 2x, 2x, 2x, 2x), one observed cause (2x, 4x)]
66 Bayesian learning with an unknown number of hidden variables (Griffiths et al., 2006)
68 Inferring latent causes in classical conditioning (Courville, Daw, Gordon & Touretzky, 2003)
e.g., A = noise, X = tone, B = click, US = shock
Training: A with US; A, X, B with US. Test: X alone; X with B.
69 Inferring latent causes in perceptual learning (Orban, Fiser, Aslin & Lengyel, 2006)
Learning to recognize objects and segment scenes
70 Inferring latent causes in sensory integration (Kording et al., NIPS 2006)
71 Coincidences (Griffiths & Tenenbaum, in press)
- The birthday problem
  - How many people do you need to have in the room before the probability exceeds 50% that two of them have the same birthday? (Answer: 23.)
- The bombing of London
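The standard computation behind the birthday answer, as a short sketch (my own code):

```python
def people_needed(threshold=0.5, days=365):
    """Smallest group size where P(some shared birthday) exceeds threshold."""
    p_no_match = 1.0
    n = 0
    while 1.0 - p_no_match <= threshold:
        p_no_match *= (days - n) / days  # person n+1 avoids all earlier birthdays
        n += 1
    return n

print(people_needed())  # 23
```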
72 How much of a coincidence?
73 Bayesian coincidence factor
[Figure: birthdays marked along a calendar line; a chance hypothesis vs. a latent common cause C generating the cluster of dates in August]
- Alternative hypotheses: proximity in date, matching days of the month, matching month, ...
74 How much of a coincidence?
75 Bayesian coincidence factor
[Figure: chance vs. latent common cause; the regularity modeled as uniform vs. uniform + regularity]
76 Summary: causal inference and learning
- Human causal induction can be explained using core principles of graphical models:
  - Bayesian inference (explaining away, screening off)
  - Bayesian structure learning (Occam's razor, model averaging)
  - Active learning with interventions
  - Identifying latent causes
77 Summary: causal inference and learning
- Crucial constraints on hypothesis spaces come from abstract prior knowledge, or "intuitive theories":
  - What are the variables?
  - How can they be connected?
  - How are their effects parameterized?
- Big open questions:
  - How can these theories be described formally?
  - How can these theories be learned?
78 Hierarchical Bayesian framework
Abstract Principles → Structure → Data
(Griffiths, Tenenbaum, Kemp et al.)
79 A theory for blickets (c.f. PRMs, BLOG, FOPL)
80 Learning with a uniform prior on network structures
[Figure: true network over 12 attributes; a sample of 75 observations (patients x observed attributes)]
81 Learning a block-structured prior on network structures (Mansinghka et al. 2006)
[Figure: attributes 1-12 assigned to classes z; class-to-class link probabilities (e.g., 0.8, 0.75, 0.01, 0.0); true network and a sample of 75 observations]
82 The blessing of abstraction
[Figure: recovery of the true graphical model G as the number of samples grows (20, 80, 1000), comparing learning the graph alone (Data D → Graph G, edges) with jointly learning an abstract theory Z of node classes]
83 The nonparametric safety-net
[Figure: with a true structure outside the block-structured class (a ring over nodes 1-12), the model still recovers it as samples grow (40, 100, 1000)]
84 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
85 Learning from just one or a few examples, and mostly unlabeled examples (semi-supervised learning).
86 Simple model of concept learning
"Can you show me the other blickets?"
87 Simple model of concept learning
"Other blickets."
88 Simple model of concept learning
"Other blickets."
- Learning from just one positive example is possible if:
  - We assume concepts refer to clusters in the world.
  - We observe enough unlabeled data to identify clear clusters.
- (c.f. learning with mixture models and EM: Ghahramani & Jordan, 1994; Nigam et al., 2000)
89 Concept learning with mixture models in cognitive science
- Fried & Holyoak (1984)
  - Modeled unsupervised and semi-supervised categorization as EM in a Gaussian mixture.
- Anderson (1990)
  - Modeled unsupervised and semi-supervised categorization as greedy sequential search in an infinite (Chinese restaurant process) mixture.
90 Infinite (CRP) mixture models
- Constructed from k-component mixtures by integrating out mixing weights, collapsing equivalent partitions, and taking the limit as k → ∞.
- Does not require that we commit to a fixed or even finite number of classes.
- Effective number of classes can grow with the number of data points, balancing complexity with data fit.
- Computationally much simpler than applying Bayesian Occam's razor or cross-validation.
- Easy to learn with standard Monte Carlo approximations (MCMC, particle filtering), hopefully avoiding local minima.
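The Chinese restaurant process prior itself is easy to sample from. A minimal sketch (my own implementation; `alpha` is the concentration parameter):

```python
import random

def sample_crp_partition(n, alpha, rng):
    """Sequentially seat n customers; returns table sizes (a partition of n)."""
    tables = []
    for i in range(n):
        # New table with probability alpha / (i + alpha); otherwise join an
        # existing table with probability proportional to its current size.
        weights = tables + [alpha]
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for t, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if t == len(tables):
            tables.append(1)
        else:
            tables[t] += 1
    return tables

rng = random.Random(0)
print(sample_crp_partition(100, alpha=1.0, rng=rng))
```

Typical draws show the rich-get-richer behavior the lunch-room analogy below describes: a few large tables and a tail of small ones, with the number of tables growing roughly logarithmically in n.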
91 High school lunch room analogy
92 Sampling from the CRP
[Figure: tables labeled punks, preppies, jocks, nerds]
94 Assign to larger groups; group with similar objects
Gibbs sampler (Neal)
[Figure: tables labeled punks, preppies, jocks, nerds]
95 A typical cognitive experiment

Training stimuli:
  F1 F2 F3 F4  Label
   1  1  1  1    1
   1  0  1  0    1
   0  1  0  1    1
   0  0  0  0    0
   0  1  0  0    0
   1  0  1  1    0

Test stimuli:
  F1 F2 F3 F4  Label
   0  1  1  1    ?
   1  1  0  1    ?
   1  1  1  0    ?
   1  0  0  0    ?
   0  0  1  0    ?
   0  0  0  1    ?
96 Anderson (1990), "Rational model of categorization": greedy sequential search in an infinite mixture model.
Sanborn, Griffiths & Navarro (2006), "More rational model of categorization": particle filter with a small number of particles.
97 Towards more natural concepts
98 CrossCat: discovering multiple structures that capture different subsets of features (Shafto, Kemp, Mansinghka, Gordon & Tenenbaum, 2006)
99 Infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
(c.f. Xu, Tresp, et al. SRL 06)
[Figure: a three-way concept x predicate x concept array]
- Biomedical predicate data from UMLS (McCrae et al.):
  - 134 concepts: enzyme, hormone, organ, disease, cell function, ...
  - 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease)
100 Infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases
101 Learning from very few examples
[Figure: three images labeled "tufa"]
Premises: Cows have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. Conclusion: All mammals have T9 hormones.
Premises: Cows have T9 hormones. Sheep have T9 hormones. Goats have T9 hormones. Conclusion: All mammals have T9 hormones.
102 The computational problem (c.f. semi-supervised learning)
[Figure: matrix of species (Horse, Cow, Chimp, Gorilla, Mouse, Squirrel, Dolphin, Seal, Rhino, Elephant) x features, plus a new property marked "?" for most species]
Features: 85 features from Osherson et al.; e.g., for Elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, ...
103 Hypotheses h and prior P(h)
[Figure: data matrix X over the ten species, candidate extensions Y of the new property (hypotheses h), and a prior P(h) over hypotheses]
104 Prediction P(Y | X)
[Figure: predictions P(Y | X) formed by averaging over hypotheses h weighted by the prior P(h)]
105 Many sources of priors
106 Hierarchical Bayesian Framework (Kemp & Tenenbaum)
F: form (Tree)
S: structure (tree over mouse, squirrel, chimp, gorilla)
D: data (features F1 F2 F3 F4; "Has T9 hormones" with ? entries)
107 P(D|S): how the structure constrains the data of experience
- Define a stochastic process over structure S that generates hypotheses h.
- For generic properties, the prior should favor hypotheses that vary smoothly over the structure.
- Many properties of biological species were actually generated by such a process (i.e., mutation plus selection).
[Figure: smooth labeling → P(h) high; not smooth → P(h) low]
108 P(D|S): how the structure constrains the data of experience
Gaussian process (~ random walk, diffusion) over structure S (Zhu, Ghahramani & Lafferty 2003): draw a continuous function y over the nodes, then threshold to get a binary hypothesis h.
109 A graph-based prior
- Let d_ij be the length of the edge between i and j (∞ if i and j are not connected), and let w_ij = 1/d_ij.
- A Gaussian prior y ~ N(0, Σ), with inverse covariance Σ^-1 built from the graph Laplacian of the weights w_ij (Zhu, Lafferty & Ghahramani, 2003)
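A sketch of how such a prior makes nearby nodes covary. This is my own construction for a 5-node chain; the diagonal regularizer is an illustrative choice, and the exact form in the cited paper differs in details:

```python
import numpy as np

# Chain graph 0-1-2-3-4 with unit-length edges, so w_ij = 1/d_ij = 1.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

L = np.diag(W.sum(axis=1)) - W        # graph Laplacian
precision = L + np.eye(n) / 4.0       # regularizer keeps the prior proper
cov = np.linalg.inv(precision)        # prior covariance over node values y

# Smoothness: adjacent nodes covary more strongly than distant ones.
print(cov[0, 1] > cov[0, 4])  # True
```

Thresholding draws y ~ N(0, cov) then yields binary hypotheses that tend to label neighboring nodes alike, exactly the smoothness bias the previous slide asks for.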
110 Structure S and data D
[Figure: structure S over Species 1-10; data D is the species x feature matrix (85 features from Osherson et al., e.g., for Elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, ...)]
113 Structure S and data D with a new property
[Figure: structure S over Species 1-10; the feature matrix now includes a new property with "?" entries to be predicted (85 features from Osherson et al., as before)]
114 Premises: Cows have property P. Elephants have property P. Horses have property P.
[Figure: model fits using a Tree structure vs. a 2D space]
Premises: Gorillas have property P. Mice have property P. Seals have property P. Conclusion: All mammals have property P.
115 Reasoning about spatially varying properties
- Native American artifacts task
116 Theories for different property types
- "has T9 hormones": taxonomic tree structure + diffusion process
- "can bite through wire": directed chain + drift process
- "carry E. Spirus bacteria": directed network + noisy transmission
[Figure: hypotheses over Classes A-G under each theory]
117 [Figure: Herring, Tuna, Mako shark, Sand shark, Dolphin, Human, Kelp arranged under two different structures]
118 Hierarchical Bayesian Framework
F: form (Tree, Space, Chain)
S: structure over mouse, squirrel, chimp, gorilla
D: data (features F1 F2 F3 F4)
119 Discovering structural forms
[Figure: Ostrich, Robin, Crocodile, Snake, Bat, Orangutan, Turtle arranged under alternative candidate structures]
120 Discovering structural forms
[Figure: the same animals organized as the "great chain of being" (Rock, Plant, ..., Orangutan, Angel, God) and as Linnaeus's tree]
121 People can discover structural forms
- Scientists
  - Tree structure for living kinds (Linnaeus)
  - Periodic structure for chemical elements (Mendeleev)
- Children
  - Hierarchical structure of category labels
  - Clique structure of social groups
  - Cyclical structure of seasons or days of the week
  - Transitive structure for value
122 The value of structural form knowledge: inductive bias
123 Typical structure learning algorithms assume a fixed structural form
- Flat clusters: K-means, mixture models, competitive learning
- Line: Guttman scaling, ideal point models
- Circle: circumplex models
- Grid: self-organizing map, generative topographic mapping
- Tree: hierarchical clustering, Bayesian phylogenetics
- Euclidean space: MDS, PCA, factor analysis
124 Goal: a universal framework for unsupervised learning
"Universal Learner": Data → Representation, subsuming K-means, hierarchical clustering, factor analysis, Guttman scaling, circumplex models, self-organizing maps, ...
125 Hierarchical Bayesian Framework
F: form → S: structure → D: data (features F1 F2 F3 F4 over mouse, squirrel, chimp, gorilla)
126 Structural forms as graph grammars
[Figure: structural forms paired with the generative processes that grow them]
127-129 Node-replacement graph grammars
[Figure: the production rule for the Line form, with a derivation built up step by step across three slides]
130 Model fitting
- Evaluate each form in parallel
- For each form, heuristic search over structures based on greedy growth from a one-node seed
132 Development of structural forms as more data are observed
133 Beyond "Nativism" versus "Empiricism"
- Nativism: explicit knowledge of structural forms for core domains is innate.
  - Atran (1998): the tendency to group living kinds into hierarchies reflects an "innately determined cognitive structure."
  - Chomsky (1980): "The belief that various systems of mind are organized along quite different principles leads to the natural conclusion that these systems are intrinsically determined, not simply the result of common mechanisms of learning or growth."
- Empiricism: general-purpose learning systems without explicit knowledge of structural form.
  - Connectionist networks (e.g., Rogers and McClelland, 2004).
  - Traditional structure learning in probabilistic graphical models.
134 Summary: concept learning
- Models based on Bayesian inference over hierarchies of structured representations.
  - How does abstract domain knowledge guide learning of new concepts?
  - How can this knowledge be represented, and how might it be learned?
[Figure: F (form) → S (structure over mouse, squirrel, chimp, gorilla) → D (data: features F1 F2 F3 F4)]
- How can probabilistic inference work together with flexibly structured representations to model complex, real-world learning and reasoning?
135 Contributions of Bayesian models
- Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.
- Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.
- A framework for studying people's implicit knowledge about the world: how it is structured, used, and acquired.
- A two-way bridge to state-of-the-art AI and machine learning.
136 Looking forward
- What we need to understand: the mind's ability to build rich models of the world from sparse data.
  - Learning about objects, categories, and their properties
  - Causal inference
  - Language comprehension and production
  - Scene understanding
  - Understanding other people's actions, plans, thoughts, goals
- What do we need to understand these abilities?
  - Bayesian inference in probabilistic generative models
  - Hierarchical models, with inference at all levels of abstraction
  - Structured representations: graphs, grammars, logic
  - Flexible representations, growing in response to observed data
137 Learning word meanings
Abstract principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
Structure → Data
(Tenenbaum & Xu)
138 Causal learning and reasoning
Abstract Principles → Structure → Data
(Griffiths, Tenenbaum, Kemp et al.)
139 Universal Grammar
Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Grammar → phrase structure → utterance → speech signal
140 Vision as probabilistic parsing
(Han & Zhu, 2006; c.f. Zhu, Yuanhao & Yuille, NIPS 06)
142 Goal-directed action (production and comprehension)
(Wolpert et al., 2003)
143 Bayesian models of action understanding
(Baker, Tenenbaum & Saxe; Verma & Rao)
144 Open directions and challenges
- Effective methods for learning structured knowledge
  - How to balance the expressiveness/learnability tradeoff?
- More precise relation to psychological processes
  - To what extent do mental processes implement boundedly rational methods of approximate inference?
- Relation to neural computation
  - How to implement structured representations in brains?
- Modeling individual subjects and single trials
  - Is there a rational basis for probability matching?
- Understanding failure cases
  - Are these simply not Bayesian, or are people using a different model? How do we avoid circularity?
145 Want to learn more?
- Special issue of Trends in Cognitive Sciences (TiCS), July 2006 (Vol. 10, no. 7), on "Probabilistic models of cognition."
- Tom Griffiths' reading list, a/k/a http://bayesiancognition.com
- Summer school on probabilistic models of cognition, July 2007, Institute for Pure and Applied Mathematics (IPAM) at UCLA.
147 Extra slides
148 Bayesian prediction
- P(t_total | t_past) ∝ 1/t_total × P(t_total)
  (random-sampling likelihood × domain-dependent prior)
- What is the best guess for t_total? Compute t* such that P(t_total > t* | t_past) = 0.5.
- We compared the median of the Bayesian posterior with the median of subjects' judgments; but what about the distribution of subjects' judgments?
149 Sources of individual differences
- Individuals' judgments could be noisy.
- Individuals' judgments could be optimal, but with different priors.
  - e.g., each individual has seen only a sparse sample of the relevant population of events.
- Individuals' inferences about the posterior could be optimal, but their judgments could be based on probability (or utility) matching rather than maximizing.
150 Individual differences in prediction
[Figure: posterior P(t_total | t_past); proportion of judgments below predicted value vs. quantile of the Bayesian posterior distribution]
151 Individual differences in prediction
[Figure: the same quantile plot, averaged over all prediction tasks: movie run times, movie grosses, poem lengths, life spans, terms in congress, cake baking times]
152 Individual differences in concept learning
153 Why probability matching?
- Optimal behavior under some (evolutionarily natural) circumstances:
  - Optimal betting theory, portfolio theory
  - Optimal foraging theory
  - Competitive games
  - Dynamic tasks (changing probabilities or utilities)
- Side-effect of algorithms for approximating complex Bayesian computations:
  - Markov chain Monte Carlo (MCMC): instead of integrating over complex hypothesis spaces, construct a sample of high-probability hypotheses.
  - Judgments from individual (independent) samples can on average be almost as good as using the full posterior distribution.
154 Markov chain Monte Carlo
(Metropolis-Hastings algorithm)
155 The puzzle of coincidences
- Discoveries of hidden causal structure are often driven by noticing coincidences...
- Science
  - Halley's comet (1705)
156 (Halley, 1705)
157 (Halley, 1705)
158 The puzzle of coincidences
- Discoveries of hidden causal structure are often driven by noticing coincidences...
- Science
  - Halley's comet (1705)
  - John Snow and the cause of cholera (1854)
160 Rational analysis of cognition
- Often we can show that apparently irrational behavior is actually rational.
[Figure: Wason selection task cards]
Which cards do you have to turn over to test this rule? "If there is an A on one side, then there is a 2 on the other side."
161 Rational analysis of cognition
- Often we can show that apparently irrational behavior is actually rational.
- Oaksford & Chater's rational analysis:
  - Optimal data selection based on maximizing expected information gain.
  - Test the rule "If p, then q" against the null hypothesis that p and q are independent.
  - Assuming p and q are rare predicts people's choices.
162 Integrating multiple forms of reasoning (Kemp, Shafto, Berke & Tenenbaum, NIPS 06)
1) Taxonomic relations between categories
2) Causal relations between features: parameters of causal relations vary smoothly over the category hierarchy.
T9 hormones cause elevated heart rates. Elevated heart rates cause faster metabolisms. Mice have T9 hormones. → ?
163 Integrating multiple forms of reasoning
164 Infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
(c.f. Xu, Tresp, et al. SRL 06)
[Figure: a three-way concept x predicate x concept array]
- Biomedical predicate data from UMLS (McCrae et al.):
  - 134 concepts: enzyme, hormone, organ, disease, cell function, ...
  - 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease)
165 Learning relational theories
e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases
166 Learning annotated hierarchies from relational data (Roy, Kemp, Mansinghka & Tenenbaum, NIPS 06)
167 Learning abstract relational structures
- Dominance hierarchy: primate troop, "x beats y"
- Tree: Bush administration, "x told y"
- Cliques: prison inmates, "x likes y"
- Ring: Kula islands, "x trades with y"
168 Bayesian inference in neural networks
(Rao, in press)
169 The big problem of intelligence
- The development of intuitive theories in childhood.
  - Psychology: how do we learn to understand others' actions in terms of beliefs, desires, plans, intentions, values, morals?
  - Biology: how do we learn that people, dogs, bees, worms, trees, flowers, grass, coral, moss are alive, but chairs, cars, tricycles, computers, the sun, Roomba, robots, clocks, rocks are not?
170 The big problem of intelligence
- Consider a man named Boris.
  - Is the mother of Boris's father his grandmother?
  - Is the mother of Boris's sister his mother?
  - Is the son of Boris's sister his son?
(Note: Boris and his family were stranded on a desert island when he was a young boy.)