Title: Bayesian models of human learning and inference
1 Bayesian models of human learning and inference
Josh Tenenbaum
- MIT
- Department of Brain and Cognitive Sciences
- Computer Science and AI Lab (CSAIL)
(http://web.mit.edu/cocosci/Talks/nips06-tutorial.ppt)
Thanks to Tom Griffiths, Charles Kemp, Vikash Mansinghka
2 The probabilistic revolution in AI
- Principled and effective solutions for inductive inference from ambiguous data:
  - Vision
  - Robotics
  - Machine learning
  - Expert systems / reasoning
  - Natural language processing
- Standard view: no necessary connection to how the human brain solves these problems.
3 Probabilistic inference in human cognition?
- "People aren't Bayesian"
- Kahneman and Tversky (1970s-present): "heuristics and biases" research program. 2002 Nobel Prize in Economics.
- Slovic, Fischhoff, and Lichtenstein (1976): "It appears that people lack the correct programs for many important judgmental tasks.... it may be argued that we have not had the opportunity to evolve an intellect capable of dealing conceptually with uncertainty."
- Stephen Jay Gould (1992): "Our minds are not built (for whatever reason) to work by the rules of probability."
4 A. greater than 90%  B. between 70% and 90%  C. between 50% and 70%  D. between 30% and 50%  E. between 10% and 30%  F. less than 10%
The probability of breast cancer is 1% for a woman at 40 who participates in a routine screening. If a woman has breast cancer, the probability is 80% that she will have a positive mammography. If a woman does not have breast cancer, the probability is 9.6% that she will also have a positive mammography.
A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
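For readers who want to check the answer, a minimal sketch of the Bayes' rule calculation (the numbers come from the problem above; the variable names are my own):

```python
# Posterior probability of breast cancer given a positive mammogram,
# via Bayes' rule: P(h|d) = P(d|h) P(h) / P(d).
prior = 0.01        # P(cancer): 1% base rate at age 40
sensitivity = 0.80  # P(positive | cancer)
false_pos = 0.096   # P(positive | no cancer)

p_positive = sensitivity * prior + false_pos * (1 - prior)
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # about 0.078, i.e. answer F: less than 10%
```

Most people guess far too high, which is the point of the demonstration: the low base rate dominates the fairly accurate test.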
5 Availability biases in probability judgment
- How likely is it that a randomly chosen word
  - ends in "g"?
  - ends in "ing"?
- When buying a car, how much do you weigh your friend's experience relative to consumer satisfaction surveys?
8 Probabilistic inference in human cognition?
- "People aren't Bayesian"
- Kahneman and Tversky (1970s-present): "heuristics and biases" research program. 2002 Nobel Prize in Economics.
- Psychology is often drawn towards the mind's errors and apparent irrationalities.
- But the computationally interesting question remains: how does the mind work so well?
9 Bayesian models of cognition
- Visual perception: Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...
- Language acquisition and processing: Brent, de Marcken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, ...
- Motor learning and motor control: Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr, ...
- Associative learning: Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, ...
- Memory: Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, ...
- Attention: Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, ...
- Categorization and concept learning: Anderson, Nosofsky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, ...
- Reasoning: Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, ...
- Causal inference: Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, ...
- Decision making and theory of mind: Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, ...
10 Learning concepts from examples
"horse" ... "horse" ... "horse"
11 Learning concepts from examples
12 Everyday inductive leaps
- How can people learn so much about the world...
  - Kinds of objects and their properties
  - The meanings of words, phrases, and sentences
  - Cause-effect relations
  - The beliefs, goals and plans of other people
  - Social structures, conventions, and rules
- ...from such limited evidence?
13 Contributions of Bayesian models
- Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.
- Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.
- A framework for studying people's implicit knowledge about the world: how it is structured, used, and acquired.
- A two-way bridge to state-of-the-art AI and machine learning.
14 Marr's Three Levels of Analysis
- Computation:
  What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
- Algorithm:
  Cognitive psychology
- Implementation:
  Neurobiology
15 What about those errors?
- The human mind is not a universal Bayesian engine.
- But the mind does appear adapted to solve important real-world inference problems in approximately Bayesian ways, e.g.:
  - Predicting everyday events
  - Causal learning and reasoning
  - Learning concepts from examples
- As with perceptual tasks, adults and even young children solve these problems mostly unconsciously, effortlessly, and successfully.
16 Technical themes
- Inference in probabilistic models
  - Role of priors, explaining away.
- Learning in graphical models
  - Parameter learning, structure learning.
- Bayesian model averaging
  - Being Bayesian over network structures.
- Bayesian Occam's razor
  - Trade off model complexity against data fit.
17 Technical themes
- Structured probabilistic models
  - Grammars, first-order logic, relational schemas.
- Hierarchical Bayesian models
  - Acquire abstract knowledge, support transfer.
- Nonparametric Bayes
  - Flexible models that grow in complexity as new data warrant.
- Tractable approximate inference
  - Markov chain Monte Carlo (MCMC), sequential Monte Carlo (particle filtering).
18 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
19 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
20 Basics of Bayesian inference
- Bayes' rule: P(h|d) = P(d|h) P(h) / Σ_h' P(d|h') P(h')
- An example
  - Data: John is coughing
  - Some hypotheses:
    1. John has a cold
    2. John has lung cancer
    3. John has a stomach flu
  - Likelihood P(d|h) favors 1 and 2 over 3
  - Prior probability P(h) favors 1 and 3 over 2
  - Posterior probability P(h|d) favors 1 over 2 and 3
21 Bayesian inference in perception and sensorimotor integration
(Weiss, Simoncelli & Adelson 2002; Kording & Wolpert 2004)
22 Memory retrieval as Bayesian inference (Anderson & Schooler, 1991)
[Figure: three panels: power law of forgetting (log memory strength vs. log delay in hours), spacing effects in forgetting (mean recalled vs. retention interval in days), and additive effects of practice and delay (log memory strength vs. log delay in seconds)]
23 Memory retrieval as Bayesian inference (Anderson & Schooler, 1991)
- For each item in memory, estimate the probability that it will be useful in the present context.
- Use priors based on the statistics of natural information sources.
24 Memory retrieval as Bayesian inference (Anderson & Schooler, 1991)
[Figure: the same three effects (power law of forgetting, spacing effects, additive effects of practice and delay) reproduced as log need odds vs. log days since last occurrence, estimated from New York Times data; c.f. email sources, child-directed speech]
25 Everyday prediction problems (Griffiths & Tenenbaum, 2006)
- You read about a movie that has made $60 million to date. How much money will it make in total?
- You see that something has been baking in the oven for 34 minutes. How long until it's ready?
- You meet someone who is 78 years old. How long will they live?
- Your friend quotes to you from line 17 of his favorite poem. How long is the poem?
- You see taxicab #107 pull up to the curb in front of the train station. How many cabs are in this city?
26 Making predictions
- You encounter a phenomenon that has existed for t_past units of time. How long will it continue into the future? (i.e., what is t_total?)
- We could replace time with any other quantity that ranges from 0 to some unknown upper limit.
27 Bayesian inference
- P(t_total | t_past) ∝ P(t_past | t_total) P(t_total)
  (posterior probability ∝ likelihood × prior)
28 Bayesian inference
- P(t_total | t_past) ∝ P(t_past | t_total) P(t_total) ∝ 1/t_total × 1/t_total
  - Likelihood 1/t_total: assume t_past is a random sample (0 < t_past < t_total)
  - Prior 1/t_total: uninformative prior (e.g., Jeffreys, Jaynes)
29 Bayesian inference
- P(t_total | t_past) ∝ 1/t_total × 1/t_total
  (random-sampling likelihood × uninformative prior)
[Figure: posterior P(t_total | t_past) as a function of t_total, for a fixed t_past]
30 Bayesian inference
- Best guess for t_total: the t* such that P(t_total > t* | t_past) = 0.5
31 Bayesian inference
- Yields Gott's Rule: P(t_total > t* | t_past) = 0.5 when t* = 2 t_past, i.e., the best guess for t_total is 2 t_past.
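The median-of-the-posterior rule can be checked numerically. This is my own illustrative sketch, with a large finite truncation point standing in for the infinite upper range:

```python
import numpy as np

# Posterior P(t_total | t_past) proportional to 1/t_total^2 for t_total >= t_past
# (likelihood 1/t_total times the uninformative prior 1/t_total).
t_past = 30.0
t = np.linspace(t_past, 1000 * t_past, 400_000)  # truncate the infinite range
weights = 1.0 / t**2
cdf = np.cumsum(weights)
median = t[np.searchsorted(cdf, 0.5 * cdf[-1])]
print(median)  # close to 60 = 2 * t_past, as Gott's Rule predicts
```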
32 Evaluating Gott's Rule
- You read about a movie that has made $78 million to date. How much money will it make in total?
  - $156 million seems reasonable.
- You meet someone who is 35 years old. How long will they live?
  - 70 years seems reasonable.
- Not so simple:
  - You meet someone who is 78 years old. How long will they live?
  - You meet someone who is 6 years old. How long will they live?
33 The effects of priors
- Different kinds of priors P(t_total) are appropriate in different domains.
[Figure: example prior shapes for different domains, e.g., wealth and contacts vs. height and lifespan; Gott's prior P(t_total) ∝ 1/t_total]
34 The effects of priors
35 Evaluating human predictions
- Different domains with different priors:
  - A movie has made $60 million
  - Your friend quotes from line 17 of a poem
  - You meet a 78-year-old man
  - A movie has been running for 55 minutes
  - A U.S. congressman has served for 11 years
  - A cake has been in the oven for 34 minutes
- Use 5 values of t_past for each.
- People predict t_total.
37 You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign?
38 You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign?
How long did the typical pharaoh reign in ancient Egypt?
39 Exponential or power law?
If a friend is calling a telephone box office to book tickets and tells you he has been on hold for 3 minutes, how long do you think he will be on hold in total?
40 Summary: prediction
- Predictions about the extent or magnitude of everyday events follow Bayesian principles.
- Contrast with Bayesian inference in perception, motor control, memory: no universal priors here.
- Predictions depend rationally on priors that are appropriately calibrated for different domains:
  - Form of the prior (e.g., power-law or exponential)
  - Specific distribution given that form (parameters)
  - Non-parametric distribution when necessary.
- In the absence of concrete experience, priors may be generated by qualitative background knowledge.
41 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
42 Bayesian networks
- Nodes = variables. Links = direct dependencies. Each node has a conditional probability distribution. Data = observations of X1, ..., X4.
- Four random variables:
  - X1: coughing
  - X2: high body temperature
  - X3: flu
  - X4: lung cancer
43 Causal Bayesian networks
- Nodes = variables. Links = causal mechanisms. Each node has a conditional probability distribution. Data = observations of, and interventions on, X1, ..., X4.
- Four random variables:
  - X1: coughing
  - X2: high body temperature
  - X3: flu
  - X4: lung cancer
(Pearl; Glymour & Cooper)
44 Inference in causal graphical models
- "Explaining away" or "discounting" in social reasoning (Kelley; Morris & Larrick)
- "Screening off" in intuitive causal reasoning (Waldmann; Rehder & Burnett; Blok & Sloman; Gopnik & Sobel)
  - Better in chains than common-cause structures; common-cause better if mechanisms are clearly independent
- Understanding and predicting the effects of interventions (Sloman & Lagnado; Gopnik & Schulz)
[Figure: chain and common-cause structures over A, B, C; comparing P(c|b) vs. P(c|b, a) and P(c|b, not a)]
45 Learning graphical models
- Structure learning: what causes what?
- Parameter learning: how do causes work?
46 Bayesian learning of causal structure
- Data d; causal hypotheses h
[Figure: candidate networks over X1, X2, X3, X4]
1. What is the most likely network h given observed data d?
2. How likely is there to be a link X4 → X2? (Bayesian model averaging)
47 Bayesian Occam's Razor
(MacKay, 2003; Ghahramani tutorials)
For any model M, Σ_D P(D | M) = 1.
Law of "conservation of belief": a model that can predict many possible data sets must assign each of them low probability.
48 Learning causation from contingencies
e.g., Does injecting this chemical cause mice to express a certain gene?

                 C present (c+)   C absent (c-)
E present (e+)        a                c
E absent (e-)         b                d

Subjects judge the extent to which C causes E (rated on a scale from 0 to 100)
49 Two models of causal judgment
- ΔP (Jenkins & Ward, 1965): ΔP = P(e+|c+) - P(e+|c-)
- Power PC (Cheng, 1997): power = ΔP / (1 - P(e+|c-))
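In terms of the contingency table above, both models reduce to simple formulas. A sketch (the example counts are my own):

```python
def delta_p_and_power(a, b, c, d):
    """a, b: effect present/absent with the cause; c, d: without the cause."""
    p_e_c = a / (a + b)    # P(e+ | c+)
    p_e_nc = c / (c + d)   # P(e+ | c-)
    delta_p = p_e_c - p_e_nc
    power = delta_p / (1 - p_e_nc)  # Cheng's causal power (generative case)
    return delta_p, power

dp, power = delta_p_and_power(a=6, b=2, c=2, d=6)
print(dp, power)  # 0.5 and ~0.667
```

The same ΔP can correspond to different causal power when the base rate P(e+|c-) differs, which is what lets the two models be teased apart experimentally.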
50 Judging the probability that C → E (Buehner & Cheng, 1997; 2003)
- Independent effects of both ΔP and causal power.
- At ΔP = 0, judgments decrease with base rate ("frequency illusion").
51 Learning causal strength (parameter learning)
- Assume this causal structure: a background cause B and the candidate cause C both point at effect E.
- ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C):
  - linear → ΔP; noisy-OR → causal power
52 Learning causal structure (Griffiths & Tenenbaum, 2005)
- Hypotheses:
  - h1: both C and B cause E
  - h0: only B causes E
- Bayesian causal support: the log likelihood ratio (Bayes factor), log P(d|h1) / P(d|h0), gives the evidence in favor of h1
- noisy-OR parameterization (assume uniform parameter priors, but see Yuille et al., Danks et al.)
53 Buehner and Cheng (1997)
[Figure: people's judgments vs. model predictions: ΔP (r = 0.89), Power (r = 0.88), Support (r = 0.97)]
54 Implicit background theory
- Injections may or may not cause gene expression, but gene expression does not cause injections.
  - No hypotheses with E → C
- Other naturally occurring processes may also cause gene expression.
  - All hypotheses include an always-present background cause B → E
- Causes are generative, probabilistically sufficient and independent; i.e., each cause independently produces the effect in some proportion of cases.
  - Noisy-OR parameterization
55 Sensitivity analysis
[Figure: people's judgments vs. Support (noisy-OR), χ², and Support with a generic parameterization]
56 Generativity is essential
[Figure: judgments and Support-model predictions for conditions with P(e+|c+) = P(e+|c-) = 8/8, 6/8, 4/8, 2/8, 0/8; prediction scale 100 / 50 / 0]
- Predictions result from a ceiling effect: ceiling effects only matter if you believe a cause increases the probability of an effect.
57 Different parameterizations for different kinds of mechanisms
Does C cause E?
Is there a difference in E with C vs. not-C?
Does C prevent E?
58 Blicket detector (Sobel, Gopnik, and colleagues)
59 Backwards blocking (Sobel, Tenenbaum & Gopnik, 2004)
[Figure: AB trial, A trial]
- Initially: nothing on detector; detector silent (A=0, B=0, E=0)
- Trial 1: A and B on detector; detector active (A=1, B=1, E=1)
- Trial 2: A on detector; detector active (A=1, B=0, E=1)
- 4-year-olds judge whether each object is a blicket:
  - A: a blicket (100% say yes)
  - B: probably not a blicket (34% say yes)
[Figure: candidate causal links A → E, B → E]
(cf. explaining away in weight space, Dayan & Kakade)
60 Possible hypotheses?
[Figure: the full set of candidate causal graph structures over A, B, and E, varying which links among them are present and their directions]
61 Bayesian causal learning
- With a uniform prior on hypotheses and a generic parameterization:
[Figure: probability of being a blicket is roughly equal for A and B (values around 0.32-0.34)]
62 A stronger hypothesis space
- Links can only exist from blocks to detectors.
- Blocks are blickets with prior probability q.
- Blickets always activate detectors; detectors never activate on their own (i.e., deterministic OR parameterization, no hidden causes).

Hypotheses (which of A, B link to E):
  h00: neither,  P(h00) = (1 - q)^2
  h01: B only,   P(h01) = (1 - q) q
  h10: A only,   P(h10) = q (1 - q)
  h11: both,     P(h11) = q^2

                        h00   h01   h10   h11
  P(E=1 | A=0, B=0)      0     0     0     0
  P(E=1 | A=1, B=0)      0     0     1     1
  P(E=1 | A=0, B=1)      0     1     0     1
  P(E=1 | A=1, B=1)      0     1     1     1
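Under this hypothesis space, backwards blocking falls out of the posterior directly. A sketch of the computation (my own code, following the table above):

```python
from itertools import product

def blicket_posteriors(q, data):
    """data: (A, B, E) triples; E is a deterministic OR of the blickets present."""
    post = {}
    for ha, hb in product([0, 1], repeat=2):          # is A / B a blicket?
        prior = (q if ha else 1 - q) * (q if hb else 1 - q)
        consistent = all(e == ((a and ha) or (b and hb)) for a, b, e in data)
        post[(ha, hb)] = prior * consistent
    z = sum(post.values())
    p_a = (post[(1, 0)] + post[(1, 1)]) / z
    p_b = (post[(0, 1)] + post[(1, 1)]) / z
    return p_a, p_b

# nothing -> silent; A+B together -> active; A alone -> active
p_a, p_b = blicket_posteriors(q=0.3, data=[(0, 0, 0), (1, 1, 1), (1, 0, 1)])
print(p_a, p_b)  # 1.0 and 0.3
```

A is certainly a blicket, while the probability that B is a blicket falls back to its prior q: exactly the asymmetry the children show.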
63 Manipulating prior probability (Tenenbaum, Sobel, Griffiths & Gopnik)
[Figure: judgments across the Initial, AB Trial, and A Trial phases]
64 Learning more complex structures
- Tenenbaum et al.; Griffiths & Sobel: detectors with more than two objects and noisy mechanisms
- Steyvers et al.; Sobel & Kushnir: active learning with interventions (c.f. Tong & Koller; Murphy)
- Lagnado & Sloman: learning from interventions on continuous dynamical systems
65 Inferring hidden causes
The "stick ball" machine (Kushnir, Schulz, Gopnik & Danks, 2003)
[Figure: observed trial frequencies under three hypotheses: common unobserved cause (4x, 2x, 2x), independent unobserved causes (1x, 2x, 2x, 2x, 2x), one observed cause (2x, 4x)]
66 Bayesian learning with an unknown number of hidden variables (Griffiths et al., 2006)
68 Inferring latent causes in classical conditioning (Courville, Daw, Gordon & Touretzky, 2003)
e.g., A = noise, X = tone, B = click, US = shock
Training: A with US; A, X, B with US. Test: X alone; X with B.
69 Inferring latent causes in perceptual learning (Orban, Fiser, Aslin & Lengyel, 2006)
Learning to recognize objects and segment scenes
70 Inferring latent causes in sensory integration (Kording et al., NIPS 2006)
71 Coincidences (Griffiths & Tenenbaum, in press)
- The birthday problem
  - How many people do you need to have in the room before the probability exceeds 50% that two of them have the same birthday? (Answer: 23.)
- The bombing of London
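The standard computation behind the birthday answer, as a short sketch (my own code):

```python
def people_needed(threshold=0.5, days=365):
    """Smallest group size where P(some shared birthday) exceeds threshold."""
    p_no_match = 1.0
    n = 0
    while 1.0 - p_no_match <= threshold:
        p_no_match *= (days - n) / days  # person n+1 avoids all earlier birthdays
        n += 1
    return n

print(people_needed())  # 23
```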
72 How much of a coincidence?
73 Bayesian coincidence factor
[Figure: birthdays marked along a calendar line; a chance hypothesis vs. a latent common cause C generating the cluster of dates in August]
- Alternative hypotheses: proximity in date, matching days of the month, matching month, ...
74 How much of a coincidence?
75 Bayesian coincidence factor
[Figure: chance vs. latent common cause; the regularity modeled as uniform vs. uniform + regularity]
76 Summary: causal inference and learning
- Human causal induction can be explained using core principles of graphical models:
  - Bayesian inference (explaining away, screening off)
  - Bayesian structure learning (Occam's razor, model averaging)
  - Active learning with interventions
  - Identifying latent causes
77 Summary: causal inference and learning
- Crucial constraints on hypothesis spaces come from abstract prior knowledge, or "intuitive theories":
  - What are the variables?
  - How can they be connected?
  - How are their effects parameterized?
- Big open questions:
  - How can these theories be described formally?
  - How can these theories be learned?
78 Hierarchical Bayesian framework
Abstract Principles → Structure → Data
(Griffiths, Tenenbaum, Kemp et al.)
79 A theory for blickets (c.f. PRMs, BLOG, FOPL)
80 Learning with a uniform prior on network structures
[Figure: true network over 12 attributes; a sample of 75 observations (patients x observed attributes)]
81 Learning a block-structured prior on network structures (Mansinghka et al. 2006)
[Figure: attributes 1-12 assigned to classes z; class-to-class link probabilities (e.g., 0.8, 0.75, 0.01, 0.0); true network and a sample of 75 observations]
82 The blessing of abstraction
[Figure: recovery of the true graphical model G as the number of samples grows (20, 80, 1000), comparing learning the graph alone (Data D → Graph G, edges) with jointly learning an abstract theory Z of node classes]
83 The nonparametric safety-net
[Figure: with a true structure outside the block-structured class (a ring over nodes 1-12), the model still recovers it as samples grow (40, 100, 1000)]
84 Outline
- Predicting everyday events
- Causal learning and reasoning
- Learning concepts from examples
85 Learning from just one or a few examples, and mostly unlabeled examples (semi-supervised learning).
86 Simple model of concept learning
"Can you show me the other blickets?"
87 Simple model of concept learning
"Other blickets."
88 Simple model of concept learning
"Other blickets."
- Learning from just one positive example is possible if:
  - We assume concepts refer to clusters in the world.
  - We observe enough unlabeled data to identify clear clusters.
- (c.f. learning with mixture models and EM: Ghahramani & Jordan, 1994; Nigam et al., 2000)
89 Concept learning with mixture models in cognitive science
- Fried & Holyoak (1984)
  - Modeled unsupervised and semi-supervised categorization as EM in a Gaussian mixture.
- Anderson (1990)
  - Modeled unsupervised and semi-supervised categorization as greedy sequential search in an infinite (Chinese restaurant process) mixture.
90 Infinite (CRP) mixture models
- Constructed from k-component mixtures by integrating out mixing weights, collapsing equivalent partitions, and taking the limit as k → ∞.
- Does not require that we commit to a fixed or even finite number of classes.
- Effective number of classes can grow with the number of data points, balancing complexity with data fit.
- Computationally much simpler than applying Bayesian Occam's razor or cross-validation.
- Easy to learn with standard Monte Carlo approximations (MCMC, particle filtering), hopefully avoiding local minima.
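The Chinese restaurant process prior itself is easy to sample from. A minimal sketch (my own implementation; `alpha` is the concentration parameter):

```python
import random

def sample_crp_partition(n, alpha, rng):
    """Sequentially seat n customers; returns table sizes (a partition of n)."""
    tables = []
    for i in range(n):
        # New table with probability alpha / (i + alpha); otherwise join an
        # existing table with probability proportional to its current size.
        weights = tables + [alpha]
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for t, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if t == len(tables):
            tables.append(1)
        else:
            tables[t] += 1
    return tables

rng = random.Random(0)
print(sample_crp_partition(100, alpha=1.0, rng=rng))
```

Typical draws show the rich-get-richer behavior the lunch-room analogy below describes: a few large tables and a tail of small ones, with the number of tables growing roughly logarithmically in n.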
91 High school lunch room analogy
92 Sampling from the CRP
[Figure: tables labeled punks, preppies, jocks, nerds]
94 Assign to larger groups; group with similar objects
Gibbs sampler (Neal)
[Figure: tables labeled punks, preppies, jocks, nerds]
95 A typical cognitive experiment

Training stimuli:
  F1 F2 F3 F4  Label
   1  1  1  1    1
   1  0  1  0    1
   0  1  0  1    1
   0  0  0  0    0
   0  1  0  0    0
   1  0  1  1    0

Test stimuli:
  F1 F2 F3 F4  Label
   0  1  1  1    ?
   1  1  0  1    ?
   1  1  1  0    ?
   1  0  0  0    ?
   0  0  1  0    ?
   0  0  0  1    ?
96 Anderson (1990), "Rational model of categorization": greedy sequential search in an infinite mixture model.
Sanborn, Griffiths & Navarro (2006), "More rational model of categorization": particle filter with a small number of particles.
97 Towards more natural concepts
98 CrossCat: discovering multiple structures that capture different subsets of features (Shafto, Kemp, Mansinghka, Gordon & Tenenbaum, 2006)
99 Infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
(c.f. Xu, Tresp, et al. SRL 06)
[Figure: a three-way concept x predicate x concept array]
- Biomedical predicate data from UMLS (McCrae et al.):
  - 134 concepts: enzyme, hormone, organ, disease, cell function, ...
  - 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease)
100 Infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases
101 Learning from very few examples
[Figure: three images labeled "tufa"]
Premises: Cows have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. Conclusion: All mammals have T9 hormones.
Premises: Cows have T9 hormones. Sheep have T9 hormones. Goats have T9 hormones. Conclusion: All mammals have T9 hormones.
102 The computational problem (c.f. semi-supervised learning)
[Figure: matrix of species (Horse, Cow, Chimp, Gorilla, Mouse, Squirrel, Dolphin, Seal, Rhino, Elephant) x features, plus a new property marked "?" for most species]
Features: 85 features from Osherson et al.; e.g., for Elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, ...
103 Hypotheses h and prior P(h)
[Figure: data matrix X over the ten species, candidate extensions Y of the new property (hypotheses h), and a prior P(h) over hypotheses]
104 Prediction P(Y | X)
[Figure: predictions P(Y | X) formed by averaging over hypotheses h weighted by the prior P(h)]
105 Many sources of priors
106 Hierarchical Bayesian Framework (Kemp & Tenenbaum)
F: form (Tree)
S: structure (tree over mouse, squirrel, chimp, gorilla)
D: data (features F1 F2 F3 F4; "Has T9 hormones" with ? entries)
107 P(D|S): how the structure constrains the data of experience
- Define a stochastic process over structure S that generates hypotheses h.
- For generic properties, the prior should favor hypotheses that vary smoothly over the structure.
- Many properties of biological species were actually generated by such a process (i.e., mutation plus selection).
[Figure: smooth labeling → P(h) high; not smooth → P(h) low]
108 P(D|S): how the structure constrains the data of experience
Gaussian process (~ random walk, diffusion) over structure S (Zhu, Ghahramani & Lafferty 2003): draw a continuous function y over the nodes, then threshold to get a binary hypothesis h.
109 A graph-based prior
- Let d_ij be the length of the edge between i and j (∞ if i and j are not connected), and let w_ij = 1/d_ij.
- A Gaussian prior y ~ N(0, Σ), with inverse covariance Σ^-1 built from the graph Laplacian of the weights w_ij (Zhu, Lafferty & Ghahramani, 2003)
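A sketch of how such a prior makes nearby nodes covary. This is my own construction for a 5-node chain; the diagonal regularizer is an illustrative choice, and the exact form in the cited paper differs in details:

```python
import numpy as np

# Chain graph 0-1-2-3-4 with unit-length edges, so w_ij = 1/d_ij = 1.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

L = np.diag(W.sum(axis=1)) - W        # graph Laplacian
precision = L + np.eye(n) / 4.0       # regularizer keeps the prior proper
cov = np.linalg.inv(precision)        # prior covariance over node values y

# Smoothness: adjacent nodes covary more strongly than distant ones.
print(cov[0, 1] > cov[0, 4])  # True
```

Thresholding draws y ~ N(0, cov) then yields binary hypotheses that tend to label neighboring nodes alike, exactly the smoothness bias the previous slide asks for.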
110 Structure S and data D
[Figure: structure S over Species 1-10; data D is the species x feature matrix (85 features from Osherson et al., e.g., for Elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, ...)]
113 Structure S and data D with a new property
[Figure: structure S over Species 1-10; the feature matrix now includes a new property with "?" entries to be predicted (85 features from Osherson et al., as before)]
114 Premises: Cows have property P. Elephants have property P. Horses have property P.
[Figure: model fits using a Tree structure vs. a 2D space]
Premises: Gorillas have property P. Mice have property P. Seals have property P. Conclusion: All mammals have property P.
115 Reasoning about spatially varying properties
- Native American artifacts task
116 Theories for different property types
- "has T9 hormones": taxonomic tree structure + diffusion process
- "can bite through wire": directed chain + drift process
- "carry E. Spirus bacteria": directed network + noisy transmission
[Figure: hypotheses over Classes A-G under each theory]
117 [Figure: Herring, Tuna, Mako shark, Sand shark, Dolphin, Human, Kelp arranged under two different structures]
118 Hierarchical Bayesian Framework
F: form (Tree, Space, Chain)
S: structure over mouse, squirrel, chimp, gorilla
D: data (features F1 F2 F3 F4)
119 Discovering structural forms
[Figure: Ostrich, Robin, Crocodile, Snake, Bat, Orangutan, Turtle arranged under alternative candidate structures]
120 Discovering structural forms
[Figure: the same animals organized as the "great chain of being" (Rock, Plant, ..., Orangutan, Angel, God) and as Linnaeus's tree]
121 People can discover structural forms
- Scientists
  - Tree structure for living kinds (Linnaeus)
  - Periodic structure for chemical elements (Mendeleev)
- Children
  - Hierarchical structure of category labels
  - Clique structure of social groups
  - Cyclical structure of seasons or days of the week
  - Transitive structure for value
122 The value of structural form knowledge: inductive bias
123 Typical structure learning algorithms assume a fixed structural form
- Flat clusters: K-means, mixture models, competitive learning
- Line: Guttman scaling, ideal point models
- Circle: circumplex models
- Grid: self-organizing map, generative topographic mapping
- Tree: hierarchical clustering, Bayesian phylogenetics
- Euclidean space: MDS, PCA, factor analysis
124 Goal: a universal framework for unsupervised learning
"Universal Learner": Data → Representation, subsuming K-means, hierarchical clustering, factor analysis, Guttman scaling, circumplex models, self-organizing maps, ...
125 Hierarchical Bayesian Framework
F: form → S: structure → D: data (features F1 F2 F3 F4 over mouse, squirrel, chimp, gorilla)
126 Structural forms as graph grammars
[Figure: structural forms paired with the generative processes that grow them]
127-129 Node-replacement graph grammars
[Figure: the production rule for the Line form, with a derivation built up step by step across three slides]
130 Model fitting
- Evaluate each form in parallel
- For each form, heuristic search over structures based on greedy growth from a one-node seed
132 Development of structural forms as more data are observed
133 Beyond "Nativism" versus "Empiricism"
- Nativism: explicit knowledge of structural forms for core domains is innate.
  - Atran (1998): the tendency to group living kinds into hierarchies reflects an "innately determined cognitive structure."
  - Chomsky (1980): "The belief that various systems of mind are organized along quite different principles leads to the natural conclusion that these systems are intrinsically determined, not simply the result of common mechanisms of learning or growth."
- Empiricism: general-purpose learning systems without explicit knowledge of structural form.
  - Connectionist networks (e.g., Rogers and McClelland, 2004).
  - Traditional structure learning in probabilistic graphical models.
134 Summary: concept learning
- Models based on Bayesian inference over hierarchies of structured representations.
  - How does abstract domain knowledge guide learning of new concepts?
  - How can this knowledge be represented, and how might it be learned?
[Figure: F (form) → S (structure over mouse, squirrel, chimp, gorilla) → D (data: features F1 F2 F3 F4)]
- How can probabilistic inference work together with flexibly structured representations to model complex, real-world learning and reasoning?
135 Contributions of Bayesian models
- Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.
- Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.
- A framework for studying people's implicit knowledge about the world: how it is structured, used, and acquired.
- A two-way bridge to state-of-the-art AI and machine learning.
136 Looking forward
- What we need to understand: the mind's ability to build rich models of the world from sparse data.
  - Learning about objects, categories, and their properties
  - Causal inference
  - Language comprehension and production
  - Scene understanding
  - Understanding other people's actions, plans, thoughts, goals
- What do we need to understand these abilities?
  - Bayesian inference in probabilistic generative models
  - Hierarchical models, with inference at all levels of abstraction
  - Structured representations: graphs, grammars, logic
  - Flexible representations, growing in response to observed data
137 Learning word meanings
Abstract principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
Structure → Data
(Tenenbaum & Xu)
138 Causal learning and reasoning
Abstract Principles → Structure → Data
(Griffiths, Tenenbaum, Kemp et al.)
139 Universal Grammar
Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Grammar → phrase structure → utterance → speech signal
140 Vision as probabilistic parsing
(Han & Zhu, 2006; c.f. Zhu, Yuanhao & Yuille, NIPS 06)
142 Goal-directed action (production and comprehension)
(Wolpert et al., 2003)
143 Bayesian models of action understanding
(Baker, Tenenbaum & Saxe; Verma & Rao)
144 Open directions and challenges
- Effective methods for learning structured knowledge
  - How to balance the expressiveness/learnability tradeoff?
- More precise relation to psychological processes
  - To what extent do mental processes implement boundedly rational methods of approximate inference?
- Relation to neural computation
  - How to implement structured representations in brains?
- Modeling individual subjects and single trials
  - Is there a rational basis for probability matching?
- Understanding failure cases
  - Are these simply not Bayesian, or are people using a different model? How do we avoid circularity?
145 Want to learn more?
- Special issue of Trends in Cognitive Sciences (TiCS), July 2006 (Vol. 10, no. 7), on "Probabilistic models of cognition."
- Tom Griffiths' reading list, a/k/a http://bayesiancognition.com
- Summer school on probabilistic models of cognition, July 2007, Institute for Pure and Applied Mathematics (IPAM) at UCLA.
147 Extra slides
148 Bayesian prediction
- P(t_total | t_past) ∝ 1/t_total × P(t_total)
  (random-sampling likelihood × domain-dependent prior)
- What is the best guess for t_total? Compute t* such that P(t_total > t* | t_past) = 0.5.
- We compared the median of the Bayesian posterior with the median of subjects' judgments; but what about the distribution of subjects' judgments?
149 Sources of individual differences
- Individuals' judgments could be noisy.
- Individuals' judgments could be optimal, but with different priors.
  - e.g., each individual has seen only a sparse sample of the relevant population of events.
- Individuals' inferences about the posterior could be optimal, but their judgments could be based on probability (or utility) matching rather than maximizing.
150 Individual differences in prediction
[Figure: posterior P(t_total | t_past); proportion of judgments below predicted value vs. quantile of the Bayesian posterior distribution]
151 Individual differences in prediction
[Figure: the same quantile plot, averaged over all prediction tasks: movie run times, movie grosses, poem lengths, life spans, terms in congress, cake baking times]
152 Individual differences in concept learning
153 Why probability matching?
- Optimal behavior under some (evolutionarily natural) circumstances:
  - Optimal betting theory, portfolio theory
  - Optimal foraging theory
  - Competitive games
  - Dynamic tasks (changing probabilities or utilities)
- Side-effect of algorithms for approximating complex Bayesian computations:
  - Markov chain Monte Carlo (MCMC): instead of integrating over complex hypothesis spaces, construct a sample of high-probability hypotheses.
  - Judgments from individual (independent) samples can on average be almost as good as using the full posterior distribution.
154 Markov chain Monte Carlo
(Metropolis-Hastings algorithm)
155 The puzzle of coincidences
- Discoveries of hidden causal structure are often driven by noticing coincidences...
- Science
  - Halley's comet (1705)
156 (Halley, 1705)
157 (Halley, 1705)
158 The puzzle of coincidences
- Discoveries of hidden causal structure are often driven by noticing coincidences...
- Science
  - Halley's comet (1705)
  - John Snow and the cause of cholera (1854)
160 Rational analysis of cognition
- Often we can show that apparently irrational behavior is actually rational.
[Figure: Wason selection task cards]
Which cards do you have to turn over to test this rule? "If there is an A on one side, then there is a 2 on the other side."
161 Rational analysis of cognition
- Often we can show that apparently irrational behavior is actually rational.
- Oaksford & Chater's rational analysis:
  - Optimal data selection based on maximizing expected information gain.
  - Test the rule "If p, then q" against the null hypothesis that p and q are independent.
  - Assuming p and q are rare predicts people's choices.
162 Integrating multiple forms of reasoning (Kemp, Shafto, Berke & Tenenbaum, NIPS 06)
1) Taxonomic relations between categories
2) Causal relations between features: parameters of causal relations vary smoothly over the category hierarchy.
T9 hormones cause elevated heart rates. Elevated heart rates cause faster metabolisms. Mice have T9 hormones. → ?
163 Integrating multiple forms of reasoning
164 Infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
(c.f. Xu, Tresp, et al. SRL 06)
[Figure: a three-way concept x predicate x concept array]
- Biomedical predicate data from UMLS (McCrae et al.):
  - 134 concepts: enzyme, hormone, organ, disease, cell function, ...
  - 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease)
165 Learning relational theories
e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases
166 Learning annotated hierarchies from relational data (Roy, Kemp, Mansinghka & Tenenbaum, NIPS 06)
167 Learning abstract relational structures
- Dominance hierarchy: primate troop, "x beats y"
- Tree: Bush administration, "x told y"
- Cliques: prison inmates, "x likes y"
- Ring: Kula islands, "x trades with y"
168 Bayesian inference in neural networks
(Rao, in press)
169 The big problem of intelligence
- The development of intuitive theories in childhood.
  - Psychology: how do we learn to understand others' actions in terms of beliefs, desires, plans, intentions, values, morals?
  - Biology: how do we learn that people, dogs, bees, worms, trees, flowers, grass, coral, moss are alive, but chairs, cars, tricycles, computers, the sun, Roomba, robots, clocks, rocks are not?
170 The big problem of intelligence
- Consider a man named Boris.
  - Is the mother of Boris's father his grandmother?
  - Is the mother of Boris's sister his mother?
  - Is the son of Boris's sister his son?
(Note: Boris and his family were stranded on a desert island when he was a young boy.)