Title: Bayesian models of inductive learning
1Bayesian models of inductive learning
- Josh Tenenbaum and Tom Griffiths
- MIT
- Computational Cognitive Science Group, Department of Brain and Cognitive Sciences
- Computer Science and AI Lab (CSAIL)
2What to expect
- What you'll get out of this tutorial
- Our view of what Bayesian models have to offer cognitive science.
- In-depth examples of basic and advanced models: how the math works and what it buys you.
- Some comparison to other approaches.
- Opportunities to ask questions.
- What you won't get
- Detailed, hands-on how-to.
- Where you can learn more
- http://bayesiancognition.com
3Outline
- Morning
- Introduction (Josh)
- Basic case study 1: Flipping coins (Tom)
- Basic case study 2: Rules and similarity (Josh)
- Afternoon
- Advanced case study 1: Causal induction (Tom)
- Advanced case study 2: Property induction (Josh)
- Quick tour of more advanced topics (Tom)
5Bayesian models in cognitive science
- Vision
- Motor control
- Memory
- Language
- Inductive learning and reasoning.
6Everyday inductive leaps
- Learning concepts and words from examples
[Figure: three examples, each labeled "horse"]
7Learning concepts and words
- Can you pick out the tufas?
8Inductive reasoning
Input:
(premises)
(conclusion)
Task: Judge how likely the conclusion is to be true, given that the premises are true.
9Inferring causal relations
Input:
        Took vitamin B23   Headache
Day 1   yes                no
Day 2   yes                yes
Day 3   no                 yes
Day 4   yes                no
. . .   . . .              . . .
Does vitamin B23 cause headaches?
Task: Judge the probability of a causal link given several joint observations.
10Everyday inductive leaps
- How can we learn so much about . . .
- Properties of natural kinds
- Meanings of words
- Future outcomes of a dynamic process
- Hidden causal properties of an object
- Causes of a person's action (beliefs, goals)
- Causal laws governing a domain
- . . . from such limited data?
11The Challenge
- How do we generalize successfully from very limited data?
- Just one or a few examples
- Often only positive examples
- Philosophy
- Induction is a "problem", a "riddle", a "paradox", a "scandal", or a "myth".
- Machine learning and statistics
- Focus on generalization from many examples, both positive and negative.
12Rational statistical inference (Bayes, Laplace)
$p(h \mid d) = \dfrac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$
Denominator: sum over space of hypotheses
13Bayesian models of inductive learning: some recent history
- Shepard (1987)
- Analysis of one-shot stimulus generalization, to explain the universal exponential law.
- Anderson (1990)
- Models of categorization and causal induction.
- Oaksford & Chater (1994)
- Model of conditional reasoning (Wason selection task).
- Heit (1998)
- Framework for category-based inductive reasoning.
14Theory-Based Bayesian Models
- Rational statistical inference (Bayes)
- Learners' domain theories generate their hypothesis space H and prior p(h).
- Well-matched to structure of the natural world.
- Learnable from limited data.
- Computationally tractable inference.
15What is a theory?
- Working definition
- An ontology and a system of abstract principles that generates a hypothesis space of candidate world structures along with their relative probabilities.
- Analogy to grammar in language.
- Example: Newton's laws
17Structure and statistics
- A framework for understanding how structured knowledge and statistical inference interact.
- How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning.
- Hierarchical Bayes.
- How simplicity trades off with fit to the data in evaluating structural hypotheses.
- Bayesian Occam's Razor.
- How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
- Non-parametric Bayes.
18Alternative approaches to inductive generalization
- Associative learning
- Connectionist networks
- Similarity to examples
- Toolkit of simple heuristics
- Constraint satisfaction
- Analogical mapping
19Marr's Three Levels of Analysis
- Computation
- What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
- Representation and algorithm
- Cognitive psychology
- Implementation
- Neurobiology
20Why Bayes?
- A framework for explaining cognition.
- How people can learn so much from such limited data.
- Why process-level models work the way that they do.
- Strong quantitative models with minimal ad hoc assumptions.
- A framework for understanding how structured knowledge and statistical inference interact.
- How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning.
- How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam's razor).
- How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
22Coin flipping
23Coin flipping
HHTHT
HHHHH
What process produced these sequences?
24Bayes' rule
For data D and a hypothesis H, we have
$P(H \mid D) = \dfrac{P(D \mid H)\, P(H)}{P(D)}$
- Posterior probability: $P(H \mid D)$
- Prior probability: $P(H)$
- Likelihood: $P(D \mid H)$
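25The origin of Bayes' rule
- A simple consequence of using probability to represent degrees of belief
- For any two random variables A and B: $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$, so $P(B \mid A) = P(A \mid B)\, P(B) / P(A)$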
26Why represent degrees of belief with probabilities?
- Good statistics
- consistency, and worst-case error bounds.
- Cox Axioms
- necessary to cohere with common sense
- Dutch Book and Survival of the Fittest
- if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
- Provides a theory of learning
- a common currency for combining prior knowledge and the lessons of experience.
28Hypotheses in Bayesian inference
- Hypotheses H refer to processes that could have generated the data D
- Bayesian inference provides a distribution over these hypotheses, given D
- $P(D \mid H)$ is the probability of D being generated by the process identified by H
- Hypotheses H are mutually exclusive: only one process could have generated D
29Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
31Representing generative models
- Graphical model notation
- Pearl (1988), Jordan (1998)
- Variables are nodes, edges indicate dependency
- Directed edges show causal process of data
generation
32Models with latent structure
- Not all nodes in a graphical model need to be observed
- Some variables reflect latent structure, used in generating D but unobserved
33Coin flipping
- Comparing two simple hypotheses
- P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses
- P(H) = 0.5 vs. P(H) = p
- Comparing infinitely many hypotheses
- P(H) = p
- Psychology: Representativeness
35Comparing two simple hypotheses
- Contrast simple hypotheses
- H1: fair coin, P(H) = 0.5
- H2: always heads, P(H) = 1.0
- Bayes' rule
- With two hypotheses, use odds form
36Bayes' rule in odds form
$\dfrac{P(H_1 \mid D)}{P(H_2 \mid D)} = \dfrac{P(D \mid H_1)}{P(D \mid H_2)} \times \dfrac{P(H_1)}{P(H_2)}$
- D: data
- H1, H2: models
- $P(H_1 \mid D)$: posterior probability that H1 generated the data
- $P(D \mid H_1)$: likelihood of data under model H1
- $P(H_1)$: prior probability that H1 generated the data
37Coin flipping
HHTHT
HHHHH
What process produced these sequences?
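38Comparing two simple hypotheses
$\dfrac{P(H_1 \mid D)}{P(H_2 \mid D)} = \dfrac{P(D \mid H_1)}{P(D \mid H_2)} \times \dfrac{P(H_1)}{P(H_2)}$
- D: HHTHT
- H1, H2: fair coin, always heads
- $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
- $P(D \mid H_2) = 0$, $P(H_2) = 1/1000$
- $P(H_1 \mid D) / P(H_2 \mid D) = \infty$
39Comparing two simple hypotheses
$\dfrac{P(H_1 \mid D)}{P(H_2 \mid D)} = \dfrac{P(D \mid H_1)}{P(D \mid H_2)} \times \dfrac{P(H_1)}{P(H_2)}$
- D: HHHHH
- H1, H2: fair coin, always heads
- $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
- $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
- $P(H_1 \mid D) / P(H_2 \mid D) \approx 30$
40Comparing two simple hypotheses
$\dfrac{P(H_1 \mid D)}{P(H_2 \mid D)} = \dfrac{P(D \mid H_1)}{P(D \mid H_2)} \times \dfrac{P(H_1)}{P(H_2)}$
- D: HHHHHHHHHH
- H1, H2: fair coin, always heads
- $P(D \mid H_1) = 1/2^{10}$, $P(H_1) = 999/1000$
- $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
- $P(H_1 \mid D) / P(H_2 \mid D) \approx 1$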
41Comparing two simple hypotheses
- Bayes' rule tells us how to combine prior beliefs with new data
- top-down and bottom-up influences
- As a model of human inference
- predicts conclusions drawn from data
- identifies point at which prior beliefs are overwhelmed by new experiences
- But more complex cases?
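The three odds calculations for the fair-coin and always-heads hypotheses can be checked with a short script. This is a minimal sketch, not code from the tutorial: the function names are illustrative, the prior P(H1) = 999/1000 is the one used on the slides, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def likelihood_fair(seq):
    """P(D | H1): each flip has probability 1/2 under a fair coin."""
    return Fraction(1, 2) ** len(seq)

def likelihood_always_heads(seq):
    """P(D | H2): 1 if every flip is heads, 0 otherwise."""
    return Fraction(1) if set(seq) <= {"H"} else Fraction(0)

def posterior_odds(seq, prior_h1=Fraction(999, 1000)):
    """Posterior odds P(H1|D) / P(H2|D).

    Returns None when H2 is ruled out by the data (the odds are infinite).
    """
    prior_h2 = 1 - prior_h1
    numerator = likelihood_fair(seq) * prior_h1
    denominator = likelihood_always_heads(seq) * prior_h2
    return None if denominator == 0 else numerator / denominator

for seq in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    odds = posterior_odds(seq)
    print(seq, "infinite" if odds is None else float(odds))
```

The exact odds are 999/32 (about 31) for five heads and 999/1024 (just under 1) for ten heads, which is where the prior in favor of a fair coin is roughly cancelled by the data.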
43Comparing simple and complex hypotheses
P(H) = 0.5 vs. P(H) = p
- Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?
44Comparing simple and complex hypotheses
- P(H) = p is more complex than P(H) = 0.5 in two ways
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is at least as probable as it is when P(H) = 0.5
45Comparing simple and complex hypotheses
[Figure: plot with axis "Probability"]
46Comparing simple and complex hypotheses
[Figure: plot with axis "Probability"; HHHHH, p = 1.0]
47Comparing simple and complex hypotheses
[Figure: plot with axis "Probability"; HHTHT, p = 0.6]
48Comparing simple and complex hypotheses
- P(H) = p is more complex than P(H) = 0.5 in two ways
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is at least as probable as it is when P(H) = 0.5
- How can we deal with this?
- frequentist: hypothesis testing
- information theorist: minimum description length
- Bayesian: just use probability theory!
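49Comparing simple and complex hypotheses
$\dfrac{P(H_1 \mid D)}{P(H_2 \mid D)} = \dfrac{P(D \mid H_1)}{P(D \mid H_2)} \times \dfrac{P(H_1)}{P(H_2)}$
- Computing $P(D \mid H_1)$ is easy
- $P(D \mid H_1) = 1/2^N$
- Compute $P(D \mid H_2)$ by averaging over p: $P(D \mid H_2) = \int_0^1 P(D \mid p)\, P(p)\, dp$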
50Comparing simple and complex hypotheses
[Figure: plot with axis "Probability"]
Distribution is an average over all values of p
52Comparing simple and complex hypotheses
- Simple and complex hypotheses can be compared directly using Bayes' rule
- requires summing over latent variables
- Complex hypotheses are penalized for their greater flexibility: Bayesian Occam's razor
- This principle is used in model selection methods in psychology (e.g. Myung & Pitt, 1997)
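The Occam's razor effect can be seen numerically by averaging the likelihood over p. The sketch below assumes a uniform prior on p for the complex hypothesis (an assumption made here for illustration; the slides do not state the prior), in which case the integral of p^NH (1-p)^NT over [0, 1] has the closed form NH! NT! / (NH + NT + 1)!. Function names are illustrative.

```python
from math import factorial

def lik_fair(n_heads, n_tails):
    """P(D | H1): fair coin, one specific sequence of N flips."""
    return 0.5 ** (n_heads + n_tails)

def lik_unknown_p(n_heads, n_tails):
    """P(D | H2): likelihood averaged over p.

    Assumes a uniform prior on p, so the integral of
    p^NH (1-p)^NT over [0, 1] equals NH! NT! / (NH + NT + 1)!.
    """
    return (factorial(n_heads) * factorial(n_tails)
            / factorial(n_heads + n_tails + 1))

def bayes_factor(seq):
    """P(D | H1) / P(D | H2); values above 1 favor the fair coin."""
    n_heads, n_tails = seq.count("H"), seq.count("T")
    return lik_fair(n_heads, n_tails) / lik_unknown_p(n_heads, n_tails)

for seq in ["HHTHT", "HHHHH"]:
    print(seq, bayes_factor(seq))
```

Under this assumption HHTHT favors the simple hypothesis even though p = 0.6 fits it better than p = 0.5, because the complex hypothesis spreads its probability over many values of p; HHHHH favors the complex hypothesis.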
54Comparing infinitely many hypotheses
- Assume data are generated from a model
- What is the value of p?
- each value of p is a hypothesis H
- requires inference over infinitely many hypotheses
55Comparing infinitely many hypotheses
- Flip a coin 10 times and see 5 heads, 5 tails.
- P(H) on next flip? 50%
- Why? 50% = 5 / (5+5) = 5/10.
- Future will be like the past.
- Suppose we had seen 4 heads and 6 tails.
- P(H) on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.
56Integrating prior knowledge and data
- Posterior distribution $P(p \mid D)$ is a probability density over p = P(H)
- Need to work out likelihood $P(D \mid p)$ and specify prior distribution $P(p)$
$P(p \mid D) \propto P(D \mid p)\, P(p)$
57Likelihood and prior
- Likelihood
- $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
- $N_H$: number of heads
- $N_T$: number of tails
- Prior
- $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$
?
58A simple method of specifying priors
- Imagine some fictitious trials, reflecting a set of previous experiences
- strategy often used with neural networks
- e.g., F = {1000 heads, 1000 tails}: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea...
59Likelihood and prior
- Likelihood
- $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
- $N_H$: number of heads
- $N_T$: number of tails
- Prior
- $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$
- $F_H$: fictitious observations of heads
- $F_T$: fictitious observations of tails
- This prior is $\mathrm{Beta}(F_H, F_T)$
60Conjugate priors
- Exist for many standard distributions
- formula for exponential family conjugacy
- Define prior in terms of fictitious observations
- Beta is conjugate to Bernoulli (coin-flipping)
[Figure: Beta densities for $F_H = F_T = 1$, $F_H = F_T = 3$, and $F_H = F_T = 1000$]
62Comparing infinitely many hypotheses
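$P(p \mid D) \propto P(D \mid p)\, P(p) \propto p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$
- Posterior is $\mathrm{Beta}(N_H + F_H, N_T + F_T)$
- same form as conjugate prior
- Posterior mean: $\dfrac{N_H + F_H}{N_H + F_H + N_T + F_T}$
- Posterior predictive distribution: P(H) on the next flip equals the posterior mean, $\dfrac{N_H + F_H}{N_H + F_H + N_T + F_T}$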
63Some examples
- e.g., F = {1000 heads, 1000 tails}: strong expectation that any new coin will be fair
- After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
- e.g., F = {3 heads, 3 tails}: weak expectation that any new coin will be fair
- After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%
- Prior knowledge too weak
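The arithmetic in these examples follows one rule: add the fictitious counts to the observed counts and take the proportion of heads. A minimal sketch, with an illustrative function name:

```python
def predict_heads(n_heads, n_tails, f_heads, f_tails):
    """P(H) on the next flip under a Beta(f_heads, f_tails) prior.

    Observed and fictitious counts are pooled; the result is the mean
    of the Beta(n_heads + f_heads, n_tails + f_tails) posterior.
    """
    heads = n_heads + f_heads
    tails = n_tails + f_tails
    return heads / (heads + tails)

# Strong prior: 1000 fictitious heads and tails, then 4 heads, 6 tails.
print(predict_heads(4, 6, 1000, 1000))
# Weak prior: 3 fictitious heads and tails, same data.
print(predict_heads(4, 6, 3, 3))
```

The function is undefined when all four counts are zero, which reflects the point that some prior knowledge is needed before any data arrive.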
64But flipping thumbtacks
- e.g., F = {4 heads, 3 tails}: weak expectation that tacks are slightly biased towards heads
- After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%
- Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
- Suppose F = {}: After seeing 2 heads, 0 tails, P(H) on next flip = 2 / (2+0) = 100%
65Origin of prior knowledge
- Tempting answer: prior experience
- Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails
- By assuming all coins (and flips) are alike, these observations of other coins are as good as observations of the present coin
66Problems with simple empiricism
- Haven't really seen 2000 coin flips, or any flips of a thumbtack
- Prior knowledge is stronger than raw experience justifies
- Haven't seen exactly equal number of heads and tails
- Prior knowledge is smoother than raw experience justifies
- Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
- Prior knowledge is more structured than raw experience
67A simple theory
- Coins are manufactured by a standardized procedure that is effective but not perfect.
- Justifies generalizing from previous coins to the present coin.
- Justifies smoother and stronger prior than raw experience alone.
- Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
- Tacks are asymmetric, and manufactured to less exacting standards.
68Limitations
- Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations?
- Suppose you flip a coin 25 times and get all heads. Something funny is going on...
- But with F = {1000 heads, 1000 tails}, P(H) on next flip = 1025 / (1025+1000) = 50.6%.
- Looks like nothing unusual
69Hierarchical priors
- Higher-order hypothesis: is this coin fair or unfair?
- Example probabilities
- P(fair) = 0.99
- $P(p \mid \text{fair})$ is Beta(1000,1000)
- $P(p \mid \text{unfair})$ is Beta(1,1)
- 25 heads in a row propagates up, affecting p and then $P(\text{fair} \mid D)$
[Figure: graphical model, fair -> p -> d1, d2, d3, d4]
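To see how 25 heads in a row "propagates up", the posterior on the higher-order hypothesis can be computed from the marginal likelihood under each Beta prior. This is a sketch using the example probabilities from the slide; the closed form B(a + NH, b + NT) / B(a, b) for the marginal likelihood of a specific sequence under a Beta(a, b) prior is a standard result, and the function names are illustrative.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(n_heads, n_tails, a, b):
    """log P(D | Beta(a, b) prior on p) for one specific sequence."""
    return log_beta(a + n_heads, b + n_tails) - log_beta(a, b)

def posterior_fair(n_heads, n_tails, prior_fair=0.99):
    """P(fair | D), with p ~ Beta(1000, 1000) if fair and Beta(1, 1) if unfair."""
    log_fair = log(prior_fair) + log_marginal(n_heads, n_tails, 1000, 1000)
    log_unfair = log(1 - prior_fair) + log_marginal(n_heads, n_tails, 1, 1)
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(log_fair, log_unfair)
    w_fair, w_unfair = exp(log_fair - top), exp(log_unfair - top)
    return w_fair / (w_fair + w_unfair)

print(posterior_fair(5, 5))    # balanced data: the coin still looks fair
print(posterior_fair(25, 0))   # 25 heads in a row: "fair" becomes very unlikely
```

Unlike the single Beta(1000, 1000) prior, this two-level prior lets a run of 25 heads overturn the belief that the coin is fair.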
70More hierarchical priors
- Latent structure can capture coin variability
- 10 flips from 200 coins is better than 2000 flips from a single coin: allows estimation of $F_H$, $F_T$
[Figure: graphical model, ($F_H$, $F_T$) -> p for each of Coin 1, Coin 2, ..., Coin 200, with p ~ Beta($F_H$, $F_T$) and each p -> d1, d2, d3, d4]
71Yet more hierarchical priors
- Discrete beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. $F_H$, $F_T$)
[Figure: graphical model, physical knowledge -> ($F_H$, $F_T$) -> p for each coin, each p -> d1, d2, d3, d4]
72Comparing infinitely many hypotheses
- Apply Bayes' rule to obtain posterior probability density
- Requires prior over all hypotheses
- computation simplified by conjugate priors
- richer structure with hierarchical priors
- Hierarchical priors indicate how simple theories can inform statistical inferences
- one step towards structure and statistics
74Psychology: Representativeness
- Which sequence is more likely from a fair coin?
HHTHT
more representative of a fair coin (Kahneman & Tversky, 1972)
HHHHH
75What might representativeness mean?
Evidence for a random generating process
76A constrained hypothesis space
- Four hypotheses
- h1: fair coin, e.g. HHTHTTTH
- h2: always alternates, e.g. HTHTHTHT
- h3: mostly heads, e.g. HHTHTHHH
- h4: always heads, e.g. HHHHHHHH
77Representativeness judgments
78Results
- Good account of representativeness data, with three pseudo-free parameters, $\rho$ = 0.91
- "always alternates" means 99% of the time
- "mostly heads" means P(H) = 0.85
- "always heads" means P(H) = 0.99
- With scaling parameter, r = 0.95
(Tenenbaum & Griffiths, 2001)
79The role of theories
- The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
- Easy to imagine how a trick all-heads coin could work: high prior probability.
- Hard to imagine how a trick "HHTHT" coin could work: low prior probability.
80Summary
- Three kinds of Bayesian inference
- comparing two simple hypotheses
- comparing simple and complex hypotheses
- comparing an infinite number of hypotheses
- Critical notions
- generative models, graphical models
- Bayesian Occam's razor
- priors: conjugate, hierarchical (theories)
82Rules and similarity
83Structure versus statistics
Statistics Similarity Typicality
Rules Logic Symbols
84A better metaphor
85A better metaphor
86Structure and statistics
Statistics Similarity Typicality
Rules Logic Symbols
87Structure and statistics
- Basic case study 1: Flipping coins
- Learning and reasoning with structured statistical models.
- Basic case study 2: Rules and similarity
- Statistical learning with structured representations.
88The number game
- Program input: number between 1 and 100
- Program output: "yes" or "no"
89The number game
- Learning task
- Observe one or more positive ("yes") examples.
- Judge whether other numbers are "yes" or "no".
92The number game
Examples of "yes" numbers | Generalization judgments (N = 20)
60 | Diffuse similarity
60 80 10 30 | Rule: "multiples of 10"
60 52 57 55 | Focused similarity: numbers near 50-60
93The number game
Examples of "yes" numbers | Generalization judgments (N = 20)
16 | Diffuse similarity
16 8 2 64 | Rule: "powers of 2"
16 23 19 20 | Focused similarity: numbers near 20
94The number game
- Main phenomena to explain
- Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
- Learning from just a few positive examples.
95Rule/similarity hybrid models
- Category learning
- Nosofsky, Palmeri et al.: RULEX
- Erickson & Kruschke: ATRIUM
96Divisions into rule and similarity subsystems
- Category learning
- Nosofsky, Palmeri et al.: RULEX
- Erickson & Kruschke: ATRIUM
- Language processing
- Pinker, Marcus et al.: Past tense morphology
- Reasoning
- Sloman
- Rips
- Nisbett, Smith et al.
97Rule/similarity hybrid models
- Why two modules?
- Why do these modules work the way that they do, and interact as they do?
- How do people infer a rule or similarity metric from just a few positive examples?
98Bayesian model
- H: Hypothesis space of possible concepts
- h1 = {2, 4, 6, 8, 10, 12, ..., 96, 98, 100} ("even numbers")
- h2 = {10, 20, 30, 40, ..., 90, 100} ("multiples of 10")
- h3 = {2, 4, 8, 16, 32, 64} ("powers of 2")
- h4 = {50, 51, 52, ..., 59, 60} ("numbers between 50 and 60")
- . . .
- Representational interpretations for H
- Candidate rules
- Features for similarity
- "Consequential subsets" (Shepard, 1987)
99- Inferring hypotheses from similarity judgment
- Additive clustering (Shepard & Arabie, 1977): $s_{ij} = \sum_k w_k f_{ik} f_{jk}$
- $s_{ij}$: similarity of stimuli i, j
- $w_k$: weight of cluster k
- $f_{ik}$: membership of stimulus i in cluster k (1 if stimulus i in cluster k, 0 otherwise)
- Equivalent to similarity as a weighted sum of common features (Tversky, 1977).
100- Additive clustering for the integers 0-9
Rank | Weight | Interpretation
1 | .444 | powers of two
2 | .345 | small numbers
3 | .331 | multiples of three
4 | .291 | large numbers
5 | .255 | middle numbers
6 | .216 | odd numbers
7 | .214 | smallish numbers
8 | .172 | largish numbers
101Three hypothesis subspaces for number concepts
- Mathematical properties (24 hypotheses)
- Odd, even, square, cube, prime numbers
- Multiples of small integers
- Powers of small integers
- Raw magnitude (5050 hypotheses)
- All intervals of integers with endpoints between 1 and 100.
- Approximate magnitude (10 hypotheses)
- Decades (1-10, 10-20, 20-30, ...)
102Hypothesis spaces and theories
- Why a hypothesis space is like a domain theory
- Represents one particular way of classifying entities in a domain.
- Not just an arbitrary collection of hypotheses, but a principled system.
- What's missing?
- Explicit representation of the principles.
- Hypothesis spaces (and priors) are generated by theories. Some analogies:
- Grammars generate languages (and priors over structural descriptions)
- Hierarchical Bayesian modeling
103Bayesian model
- H: Hypothesis space of possible concepts
- Mathematical properties: even, odd, square, prime, . . . .
- Approximate magnitude: 1-10, 10-20, 20-30, . . . .
- Raw magnitude: all intervals between 1 and 100.
- X = {x1, . . . , xn}: n examples of a concept C.
- Evaluate hypotheses given data: $p(h \mid X) = \dfrac{p(X \mid h)\, p(h)}{\sum_{h' \in H} p(X \mid h')\, p(h')}$
- $p(h)$ prior: domain knowledge, pre-existing biases
- $p(X \mid h)$ likelihood: statistical information in examples.
- $p(h \mid X)$ posterior: degree of belief that h is the true extension of C.
105- Likelihood $p(X \mid h)$
- Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.
- $p(X \mid h) = [1/\mathrm{size}(h)]^n$ if $x_1, \ldots, x_n \in h$, and 0 if any $x_i \notin h$
- Follows from assumption of randomly sampled examples.
- Captures the intuition of a representative sample.
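The size principle is easy to check on two hypotheses from the number game. The sketch below assumes examples are sampled uniformly at random from the true concept, which is the sampling assumption named on the slide; the function and variable names are illustrative.

```python
def size_principle_likelihood(examples, hypothesis):
    """p(X | h) when each example is drawn uniformly at random from h.

    Returns 0 if any example falls outside the hypothesis.
    """
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

even_numbers = set(range(2, 101, 2))       # 50 members
multiples_of_10 = set(range(10, 101, 10))  # 10 members

def likelihood_ratio(examples):
    """How strongly the data favor "multiples of 10" over "even numbers"."""
    return (size_principle_likelihood(examples, multiples_of_10)
            / size_principle_likelihood(examples, even_numbers))

print(likelihood_ratio([60]))              # one example
print(likelihood_ratio([60, 80, 10, 30]))  # four examples
```

With one example the smaller hypothesis is favored by a factor of 5; with four examples the factor is 5 to the fourth power, 625, which is the exponential sharpening the slide describes.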
106Illustrating the size principle
[Figure: the even numbers from 2 to 100, with two hypotheses h1 and h2 marked]
107Illustrating the size principle
[Figure: the even numbers from 2 to 100, with two hypotheses h1 and h2 marked]
Data slightly more of a coincidence under h1
108Illustrating the size principle
[Figure: the even numbers from 2 to 100, with two hypotheses h1 and h2 marked]
Data much more of a coincidence under h1
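109Bayesian Occam's Razor
Law of "Conservation of Belief": for any model M, $\sum_{d} p(D = d \mid M) = 1$, where the sum runs over all possible data sets d.
[Figure: $p(D = d \mid M)$ plotted over all possible data sets d for two models, M1 and M2]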
110Comparing simple and complex hypotheses
[Figure: plot with axis "Probability"]
Distribution is an average over all values of p
112- Prior p(h)
- Choice of hypothesis space embodies a strong prior: effectively, p(h) ≈ 0 for many logically possible but conceptually unnatural hypotheses.
- Prevents overfitting by highly specific but unnatural hypotheses, e.g. "multiples of 10 except 50 and 70".
- p(h) encodes relative weights of alternative theories
H: Total hypothesis space
- H1: Math properties (24)
- even numbers
- powers of two
- multiples of three
- ...
- H2: Raw magnitude (5050)
- 10-15
- 20-32
- 37-54
- ...
- H3: Approx. magnitude (10)
- 10-20
- 20-30
- 30-40
- ...
113A more complex approach to priors
- Start with a base set of regularities R and combination operators C.
- Hypothesis space = closure of R under C.
- C = {and, or}: H = unions and intersections of regularities in R (e.g., "multiples of 10 between 30 and 70").
- C = {and-not}: H = regularities in R with exceptions (e.g., "multiples of 10 except 50 and 70").
- Two qualitatively similar priors
- Description length: number of combinations in C needed to generate hypothesis from R.
- Bayesian Occam's Razor, with model classes defined by number of combinations: more combinations -> more hypotheses -> lower prior
114- Posterior
- X = {60, 80, 10, 30}
- Why prefer "multiples of 10" over "even numbers"? $p(X \mid h)$.
- Why prefer "multiples of 10" over "multiples of 10 except 50 and 20"? $p(h)$.
- Why does a good generalization need both high prior and high likelihood? $p(h \mid X) \propto p(X \mid h)\, p(h)$
115Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
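116Generalizing to new objects
Given $p(h \mid X)$, how do we compute $p(y \in C \mid X)$, the probability that C applies to some new stimulus y?
117Generalizing to new objects
Hypothesis averaging: Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by $p(h \mid X)$:
$p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X)$, where $p(y \in C \mid h) = 1$ if $y \in h$ and 0 otherwise.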
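Hypothesis averaging can be sketched on a toy version of the number game. Everything below is an illustrative assumption rather than the model fitted in the tutorial: the hypothesis space has only four members, the prior is uniform, and the likelihood is the size-principle likelihood (1/size of h, raised to the number of examples, for consistent hypotheses).

```python
# Toy hypothesis space (a hypothetical subset of the full space).
HYPOTHESES = {
    "even numbers": set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
    "powers of 2": {2, 4, 8, 16, 32, 64},
    "between 50 and 60": set(range(50, 61)),
}
# Uniform prior: an assumption made for this sketch only.
PRIOR = {name: 1.0 / len(HYPOTHESES) for name in HYPOTHESES}

def posterior(examples):
    """p(h | X) with the size-principle likelihood and the prior above."""
    scores = {}
    for name, members in HYPOTHESES.items():
        consistent = all(x in members for x in examples)
        likelihood = (1.0 / len(members)) ** len(examples) if consistent else 0.0
        scores[name] = PRIOR[name] * likelihood
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the examples")
    return {name: score / total for name, score in scores.items()}

def generalize(y, examples):
    """p(y in C | X): total posterior mass of the hypotheses that contain y."""
    post = posterior(examples)
    return sum(p for name, p in post.items() if y in HYPOTHESES[name])

print(generalize(22, [60]))              # one example: graded generalization
print(generalize(22, [60, 80, 10, 30]))  # four examples: close to all-or-none
print(generalize(20, [60, 80, 10, 30]))
```

With one example, 22 keeps a modest probability because "even numbers" retains some posterior weight; after four multiples of 10, nearly all of the weight sits on "multiples of 10", so 22 is almost ruled out while 20 is accepted.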
118Examples: 16
119Connection to feature-based similarity
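- Additive clustering model of similarity: $s_{ij} = \sum_k w_k f_{ik} f_{jk}$
- Bayesian hypothesis averaging: $p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X)$
- Equivalent if we identify features $f_k$ with hypotheses h, and weights $w_k$ with $p(h \mid X)$.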
120Examples: 16 8 2 64
121Examples: 16 23 19 20
122Model fits
[Figure: generalization judgments (N = 20) and Bayesian model predictions (r = 0.96) for the example sets 60; 60 80 10 30; 60 52 57 55]
123Model fits
[Figure: generalization judgments (N = 20) and Bayesian model predictions (r = 0.93) for the example sets 16; 16 8 2 64; 16 23 19 20]
124Summary of the Bayesian model
- How do the statistics of the examples interact with prior knowledge to guide generalization?
- Why does generalization appear rule-based or similarity-based?
126Alternative models
127Alternative models
- Neural networks
- Hypothesis ranking and elimination
[Figure: hypothesis ranking (1: even, 2: multiple of 10, 3: power of 2, 4: multiple of 3, ...) checked against the examples 60, 80, 10, 30]
128Alternative models
- Neural networks
- Hypothesis ranking and elimination
- Similarity to exemplars
- Average similarity
[Figure: model (r = 0.80) vs. data for the example sets 60; 60 80 10 30; 60 52 57 55]
129Alternative models
- Neural networks
- Hypothesis ranking and elimination
- Similarity to exemplars
- Max similarity
[Figure: model (r = 0.64) vs. data for the example sets 60; 60 80 10 30; 60 52 57 55]
130Alternative models
- Neural networks
- Hypothesis ranking and elimination
- Similarity to exemplars
- Average similarity
- Max similarity
- Flexible similarity? Bayes.
131Alternative models
- Neural networks
- Hypothesis ranking and elimination
- Similarity to exemplars
- Toolbox of simple heuristics
- 60: "general" similarity
- 60 80 10 30: most specific rule ("subset principle").
- 60 52 57 55: similarity in magnitude
- Why these heuristics? When to use which heuristic? Bayes.
132Summary
- Generalization from limited data possible via the interaction of structured knowledge and statistics.
- Structured knowledge: space of candidate rules, theories generate hypothesis space (c.f. hierarchical priors)
- Statistics: Bayesian Occam's razor.
- Better understand the interactions between traditionally opposing concepts
- Rules and statistics
- Rules and similarity
- Explains why central but notoriously slippery processing-level concepts work the way they do.
- Similarity
- Representativeness
- Rules and representativeness
135Looking towards the afternoon
- How do we apply these ideas to more natural and complex aspects of cognition?
- Where do the hypothesis spaces come from?
- Can we formalize the contributions of domain theories?
141Working at the computational level
- What is the computational problem?
- input: data
- output: solution
- What knowledge is available to the learner?
- Where does that knowledge come from?
143Causality
144Bayes nets and beyond...
- Increasingly popular approach to studying human causal inferences
- (e.g. Glymour, 2001; Gopnik et al., 2004)
- Three reactions
- Bayes nets are the solution!
- Bayes nets are missing the point, not sure why...
- what is a Bayes net?
145Bayes nets and beyond...
- What are Bayes nets?
- graphical models
- causal graphical models
- An example: elemental causal induction
- Beyond Bayes nets
- other knowledge in causal induction
- formalizing causal theories
147Graphical models
- Express the probabilistic dependency structure among a set of variables (Pearl, 1988)
- Consist of
- a set of nodes, corresponding to variables
- a set of edges, indicating dependency
- a set of functions defined on the graph that
defines a probability distribution
148Undirected graphical models
(Figure: undirected graph over nodes X1, X2, X3, X4, X5)
- Consist of
- a set of nodes
- a set of edges
- a potential for each clique, multiplied together to yield the distribution over variables
- Examples
- statistical physics: Ising model, spin glasses
- early neural networks (e.g. Boltzmann machines)
149Directed graphical models
(Figure: directed graph over nodes X1, X2, X3, X4, X5)
- Consist of
- a set of nodes
- a set of edges
- a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables
- Constrained to directed acyclic graphs (DAG)
- AKA Bayesian networks, Bayes nets
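The product described on this slide can be written out explicitly. As a sketch, with pa(X_i) standing for the parents of node X_i in the DAG (this notation is an assumption here, not taken from the slide):

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```

A node with no parents contributes its marginal distribution P(X_i).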
150Bayesian networks and Bayes
- Two different problems
- Bayesian statistics is a method of inference
- Bayesian networks are a form of representation
- There is no necessary connection
- many users of Bayesian networks rely upon frequentist statistical methods (e.g. Glymour)
- many Bayesian inferences cannot be easily represented using Bayesian networks
151Properties of Bayesian networks
- Efficient representation and inference
- exploiting dependency structure makes it easier to represent and compute with probabilities
- Explaining away
- pattern of probabilistic reasoning characteristic
of Bayesian networks, especially early use in AI
153Efficient representation and inference
- Three binary variables: Cavity, Toothache, Catch
- Specifying P(Cavity, Toothache, Catch) requires 7 parameters (1 for each set of values, minus 1 because it's a probability distribution)
- With n variables, we need 2^n - 1 parameters
- Here n = 3. Realistically, many more: X-ray, diet, oral hygiene, personality, . . . .
154Conditional independence
- All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity
- In probabilistic terms: P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
- With n evidence variables, x1, ..., xn, we need 2n conditional probabilities
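The parameter counts on the last two slides can be checked with a small sketch. The probability values below are hypothetical, chosen only for illustration, and the count for the factorized model includes one extra parameter for the prior on Cavity in addition to the 2n conditional probabilities.

```python
from itertools import product

def full_joint_params(n_vars: int) -> int:
    """Free parameters of an unconstrained joint over n binary variables."""
    return 2 ** n_vars - 1

def cond_indep_params(n_evidence: int) -> int:
    """One prior for the cause plus 2 conditionals per evidence variable."""
    return 1 + 2 * n_evidence

# Hypothetical numbers (not from the slides).
P_CAVITY = 0.2
P_TOOTHACHE = {True: 0.6, False: 0.1}  # P(Toothache = 1 | Cavity)
P_CATCH = {True: 0.9, False: 0.2}      # P(Catch = 1 | Cavity)

def bern(p: float, value: bool) -> float:
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def joint(cavity: bool, toothache: bool, catch: bool) -> float:
    """P(Cavity) * P(Toothache | Cavity) * P(Catch | Cavity)."""
    return (bern(P_CAVITY, cavity)
            * bern(P_TOOTHACHE[cavity], toothache)
            * bern(P_CATCH[cavity], catch))

# The factorized joint is still a proper distribution over all 8 outcomes.
total = sum(joint(*values) for values in product([True, False], repeat=3))
print(full_joint_params(3), cond_indep_params(2), round(total, 10))
```

With n evidence variables the factorized count grows linearly while the full joint grows exponentially, which is the efficiency argument the slides make.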
155A simple Bayesian network
- Graphical representation of relations between a set of random variables
- Probabilistic interpretation: factorizing complex terms
158A more complex system
(Figure: Bayesian network over Battery, Radio, Ignition, Gas, Starts, On time to work)
- Joint distribution sufficient for any inference
- General inference algorithm: local message passing (belief propagation; Pearl, 1988)
- efficiency depends on sparseness of graph structure
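To illustrate the claim that the joint distribution is sufficient for any inference, here is a minimal sketch that answers queries by brute-force enumeration rather than message passing. The edge structure (Battery to Radio and Ignition, Ignition and Gas to Starts, Starts to On time) and every probability value are assumptions made for illustration; the transcript does not preserve the figure.

```python
from itertools import product

VARS = ["battery", "radio", "ignition", "gas", "starts", "on_time"]

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def joint(battery, radio, ignition, gas, starts, on_time):
    """Joint probability as a product of one conditional per node (assumed CPTs)."""
    p = bern(0.9, battery)
    p *= bern(0.95 if battery else 0.0, radio)
    p *= bern(0.97 if battery else 0.0, ignition)
    p *= bern(0.9, gas)
    p *= bern(0.99 if (ignition and gas) else 0.0, starts)
    p *= bern(0.9 if starts else 0.1, on_time)
    return p

def query(target, evidence):
    """P(target = True | evidence), summing the joint over all 2^6 assignments."""
    num = den = 0.0
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(**world)
        den += p
        if world[target]:
            num += p
    return num / den

p_battery_no_start = query("battery", {"starts": False})
p_battery_no_start_radio_on = query("battery", {"starts": False, "radio": True})
print(p_battery_no_start, p_battery_no_start_radio_on)
```

Enumeration costs time exponential in the number of variables, which is why the slide points to message passing on sparse graphs as the general algorithm.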
159Explaining away
- Assume grass will be wet if and only if it rained
last night, or if the sprinklers were left on
160Explaining away
Compute probability it rained last night, given
that the grass is wet
165Explaining away
Compute probability it rained last night, given
that the grass is wet and sprinklers were left
on
167Explaining away
Discounting to prior probability.
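The explaining-away computation can be sketched numerically. The sketch assumes that Rain and Sprinkler are independent a priori, that the grass is wet if and only if at least one of them occurred (as stated on the first slide of this sequence), and it uses hypothetical prior probabilities.

```python
from itertools import product

P_RAIN = 0.3       # hypothetical prior
P_SPRINKLER = 0.4  # hypothetical prior

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) with Wet a deterministic OR of its two causes."""
    p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    return p if wet == (rain or sprinkler) else 0.0

def p_rain(**evidence):
    """P(Rain = True | evidence) by enumeration over all assignments."""
    num = den = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(rain, sprinkler, wet)
        den += p
        if rain:
            num += p
    return num / den

prior = p_rain()
given_wet = p_rain(wet=True)
given_wet_and_sprinkler = p_rain(wet=True, sprinkler=True)
print(prior, given_wet, given_wet_and_sprinkler)
```

Observing wet grass raises the probability of rain above its prior; additionally observing the sprinkler brings it back down to the prior, which is the discounting named on the slide.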
168Contrast w/ production system
(Figure: nodes Rain and Grass Wet)
- Formulate IF-THEN rules
- IF Rain THEN Wet
- IF Wet THEN Rain
- Rules do not distinguish directions of inference
- Requires combinatorial explosion of rules
169Contrast w/ spreading activation
(Figure: nodes Rain, Sprinkler, Grass Wet)
- Excitatory links between Rain and Wet, and between Sprinkler and Wet
- Observing rain, Wet becomes more active.
- Observing grass wet, Rain and Sprinkler become more active.
- Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
170Contrast w/ spreading activation
(Figure: nodes Rain, Sprinkler, Grass Wet)
- Excitatory links between Rain and Wet, and between Sprinkler and Wet
- Inhibitory link between Rain and Sprinkler
- Observing grass wet, Rain and Sprinkler become more active.
- Observing grass wet and sprinkler, Rain becomes less active: explaining away.
171Contrast w/ spreading activation
(Figure: nodes Rain, Burst pipe, Sprinkler, Grass Wet)
- Each new variable requires more inhibitory connections.
- Interactions between variables are not causal.
- Not modular.
- Whether a connection exists depends on what other connections exist, in non-transparent ways.
- Big holism problem.
- Combinatorial explosion.
172Graphical models
- Capture dependency structure in distributions
- Provide an efficient means of representing and reasoning with probabilities
- Allow kinds of inference that are problematic for other representations: explaining away
- hard to capture in a production system
- hard to capture with spreading activation
174Causal graphical models
- Graphical models represent statistical dependencies among variables (i.e. correlations)
- can answer questions about observations
- Causal graphical models represent causal dependencies among variables
- express underlying causal structure
- can answer questions about both observations and interventions (actions upon a variable)
176Observation and intervention
(Figure: Bayesian network over Battery, Radio, Ignition, Gas, Starts, On time to work)
Graphical model: P(Radio | Ignition)
Causal graphical model: P(Radio | do(Ignition))
Graph surgery produces a "mutilated graph"
177Assessing interventions
- To compute P(Y | do(X = x)), delete all edges coming into X and reason with the resulting Bayesian network (do calculus; Pearl, 2000)
- Allows a single structure to make predictions about both observations and interventions
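The difference between conditioning and intervening can be sketched on a three-node fragment of the car example. The structure (Battery is a common cause of Radio and Ignition) and all probability values are assumptions for illustration. Intervening on Ignition deletes the edge from Battery into Ignition and clamps Ignition to the chosen value.

```python
from itertools import product

# Hypothetical parameters (not from the slides).
P_BATTERY = 0.9
P_RADIO = {True: 0.95, False: 0.0}     # P(Radio = 1 | Battery)
P_IGNITION = {True: 0.97, False: 0.0}  # P(Ignition = 1 | Battery)

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def joint(battery, radio, ignition, do_ignition=None):
    """Joint probability; if do_ignition is set, use the mutilated graph."""
    p = bern(P_BATTERY, battery) * bern(P_RADIO[battery], radio)
    if do_ignition is None:
        p *= bern(P_IGNITION[battery], ignition)
    else:
        # Graph surgery: the edge Battery -> Ignition is removed,
        # and Ignition is fixed to the intervened value.
        p *= 1.0 if ignition == do_ignition else 0.0
    return p

def p_radio(ignition, intervene):
    """P(Radio = 1 | Ignition) if intervene is False, else P(Radio = 1 | do(Ignition))."""
    do = ignition if intervene else None
    num = den = 0.0
    for battery, radio in product([True, False], repeat=2):
        p = joint(battery, radio, ignition, do_ignition=do)
        den += p
        if radio:
            num += p
    return num / den

observed = p_radio(True, intervene=False)
intervened = p_radio(True, intervene=True)
print(observed, intervened)
```

Under these assumptions, seeing the ignition work is evidence about the battery and so changes the belief about the radio, whereas forcing the ignition on leaves the radio at its marginal probability.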
178Causality simplifies inference
- Using a representation in which the direction of causality is correct produces sparser graphs
- Suppose we get the direction of causality wrong, thinking that symptoms cause diseases
- Does not capture the correlation between symptoms: falsely believe P(Ache, Catch) = P(Ache) P(Catch).
(Figure: graph over nodes Ache, Catch, Cavity)
179Causality simplifies inference
- Using a representation in which the direction of causality is correct produces sparser graphs
- Suppose we get the direction of causality wrong, thinking that symptoms cause diseases
- Inserting a new arrow allows us to capture this correlation.
- This model is too complex: do not believe that (equation not transcribed)
(Figure: graph over nodes Ache, Catch, Cavity)
180Causality simplifies inference
- Using a representation in which the direction of causality is correct produces sparser graphs
- Suppose we get the direction of causality wrong, thinking that symptoms cause diseases
- New symptoms require a combinatorial proliferation of new arrows. This reduces efficiency of inference.
(Figure: graph over nodes Ache, X-ray, Catch, Cavity)
181Learning causal graphical models
- Strength: how strong is a relationship?
- Structure: does a relationship exist?
183Causal structure vs. causal strength
- Strength: how strong is a relationship?
- requires defining nature of relationship
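184Parameterization
- Structures: h1, h0
(Figure: graphs for h1 and h0 over nodes C, B, E)
- Parameterization
(Table: P(E = 1 | C, B) under h1 and under h0, for (C, B) = (0, 0), (1, 0), (0, 1), (1, 1); entries not transcribed)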
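The table entries did not survive in the transcript. As a sketch, the two parameterizations named later in the deck (linear and noisy-OR) are commonly written as below, with w0 the strength of the background cause B and w1 the strength of the candidate cause C. Treating h0 as h1 with w1 = 0 (no edge from C to E) is an assumption of this sketch, and the weights are hypothetical.

```python
def p_effect_linear(c, b, w0, w1):
    """Linear: P(E = 1 | C = c, B = b) = w0*b + w1*c (needs w0 + w1 <= 1)."""
    return w0 * b + w1 * c

def p_effect_noisy_or(c, b, w0, w1):
    """Noisy-OR: P(E = 1 | C = c, B = b) = 1 - (1 - w0)^b * (1 - w1)^c."""
    return 1.0 - (1.0 - w0) ** b * (1.0 - w1) ** c

w0, w1 = 0.2, 0.5  # hypothetical strengths

# Rows in the same (C, B) order as the slide's table.
for c, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    h1_linear = p_effect_linear(c, b, w0, w1)
    h1_noisy_or = p_effect_noisy_or(c, b, w0, w1)
    h0 = p_effect_noisy_or(c, b, w0, 0.0)  # under h0, C has no influence on E
    print(c, b, round(h1_linear, 3), round(h1_noisy_or, 3), round(h0, 3))
```

The two forms agree when only one cause is present and differ when both are: the linear form adds the strengths, while noisy-OR gives w0 + w1 - w0*w1.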
187Parameter estimation
- Maximum likelihood estimation
- maximize ∏_i P(b_i, c_i, e_i | w0, w1)
- Bayesian methods: as in the "Comparing infinitely many hypotheses" example
188Causal structure vs. causal strength
- Structure: does a relationship exist?
192Approaches to structure learning
- Constraint-based
- dependency from statistical tests (e.g. χ²)
- deduce structure from dependencies
(Figure: graph over nodes B, C, E)
(Pearl, 2000; Spirtes et al., 1993)
Attempts to reduce inductive problem to deductive problem
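The first step of the constraint-based approach, deciding whether two variables are dependent, can be sketched with a Pearson χ² test on a 2x2 table. The counts and the 0.05 threshold are hypothetical, the statistic is computed without continuity correction, and the p-value uses the identity that a χ² variable with one degree of freedom is a squared standard normal.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom).

    Cell layout assumed here: a = C present, E present; b = C present, E absent;
    c = C absent, E present; d = C absent, E absent.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = chi_square_2x2(6, 2, 2, 6)  # hypothetical counts
dependent = p_value < 0.05                  # conventional threshold, an assumption
print(stat, p_value, dependent)
```

The test yields a yes-or-no verdict on dependency, and structure is then deduced from the pattern of such verdicts, which is the sense in which induction is reduced to deduction.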
193Approaches to structure learning
- Constraint-based
- dependency from statistical tests (e.g. χ²)
- deduce structure from dependencies
(Figure: graph over nodes B, C, E)
(Pearl, 2000; Spirtes et al., 1993)
- Bayesian
- compute posterior probability of structures, given observed data
(Figure: candidate structures S1 and S0 over nodes C, B, E, compared via P(S1 | data) and P(S0 | data))
P(S | data) ∝ P(data | S) P(S)
(Heckerman, 1998; Friedman, 1999)
194Causal graphical models
- Extend graphical models to deal with interventions as well as observations
- Respecting the direction of causality results in efficient representation and inference
- Two steps in learning causal models
- parameter estimation
- structure learning
196Elemental causal induction
            C present   C absent
E present       a           c
E absent        b           d
To what extent does C cause E?
197Causal structure vs. causal strength
- Strength: how strong is a relationship?
- Structure: does a relationship exist?
198Causal strength
- Assume structure
- Leading models (ΔP and causal power) are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E | B, C)
- linear → ΔP, noisy-OR → causal power
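The claim that ΔP and causal power are maximum likelihood estimates of w1 can be checked numerically. The sketch below uses hypothetical counts, assumes the background cause B is always present, and compares the closed-form values ΔP = P(e | c) - P(e | not c) and power = ΔP / (1 - P(e | not c)) with a coarse grid search over (w0, w1).

```python
import math

def linear(c, w0, w1):
    """P(E = 1 | C = c) with B always present, linear parameterization."""
    return w0 + w1 * c

def noisy_or(c, w0, w1):
    """P(E = 1 | C = c) with B always present, noisy-OR parameterization."""
    return 1.0 - (1.0 - w0) * (1.0 - w1) ** c

def log_likelihood(param, w0, w1, e_c, n_c, e_nc, n_nc):
    """Log-likelihood of the counts; -inf where a probability leaves (0, 1)."""
    p_c, p_nc = param(1, w0, w1), param(0, w0, w1)
    if not (0.0 < p_c < 1.0 and 0.0 < p_nc < 1.0):
        return -math.inf
    return (e_c * math.log(p_c) + (n_c - e_c) * math.log(1.0 - p_c)
            + e_nc * math.log(p_nc) + (n_nc - e_nc) * math.log(1.0 - p_nc))

def grid_mle(param, counts, steps=100):
    """Return the (w0, w1) grid point with the highest log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(((w0, w1) for w0 in grid for w1 in grid),
               key=lambda w: log_likelihood(param, w[0], w[1], *counts))

# Hypothetical data: E occurs on 6 of 8 trials with C, 2 of 8 without C.
counts = (6, 8, 2, 8)
e_c, n_c, e_nc, n_nc = counts
delta_p = e_c / n_c - e_nc / n_nc
power = delta_p / (1.0 - e_nc / n_nc)

w1_linear = grid_mle(linear, counts)[1]
w1_noisy_or = grid_mle(noisy_or, counts)[1]
print(delta_p, w1_linear, power, w1_noisy_or)
```

The grid estimates only match the closed forms up to the grid resolution; a finer grid or a numerical optimizer would tighten the agreement.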
199Causal structure
- Hypotheses: h1, h0
- Bayesian causal inference
- support
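The formula for support is not preserved in the transcript. A minimal sketch, assuming support is the log ratio of marginal likelihoods log [P(D | h1) / P(D | h0)], with a noisy-OR parameterization, uniform priors on w0 and w1, and the background cause always present, is given below. The integrals are approximated with a midpoint rule, and the counts are hypothetical.

```python
import math

def likelihood(w0, w1, e_c, n_c, e_nc, n_nc):
    """P(D | w0, w1) under noisy-OR (binomial coefficients cancel in the ratio)."""
    p_nc = w0
    p_c = 1.0 - (1.0 - w0) * (1.0 - w1)
    return (p_c ** e_c * (1.0 - p_c) ** (n_c - e_c)
            * p_nc ** e_nc * (1.0 - p_nc) ** (n_nc - e_nc))

def causal_support(e_c, n_c, e_nc, n_nc, m=200):
    """log P(D | h1) - log P(D | h0), averaging the likelihood over uniform priors."""
    grid = [(i + 0.5) / m for i in range(m)]  # midpoints of m equal cells on (0, 1)
    evidence_h1 = sum(likelihood(w0, w1, e_c, n_c, e_nc, n_nc)
                      for w0 in grid for w1 in grid) / m ** 2
    # Under h0 there is no edge from C to E, which corresponds to w1 = 0.
    evidence_h0 = sum(likelihood(w0, 0.0, e_c, n_c, e_nc, n_nc)
                      for w0 in grid) / m
    return math.log(evidence_h1 / evidence_h0)

support_contingent = causal_support(6, 8, 2, 8)  # E more frequent when C is present
support_null = causal_support(4, 8, 4, 8)        # E equally frequent either way
print(support_contingent, support_null)
```

Positive values favor h1 (a causal link exists) and negative values favor h0. Combining this ratio with prior probabilities for the two structures gives their posterior probabilities.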
200Buehner and Cheng (1997)
(Figure panels: People; ΔP (r = 0.89); Power (r = 0.88); Support (r = 0.97))
201The importance of parameterization
- Noisy-OR incorporates mechanism assumptions
- generativity: causes increase probability of effects
- each cause is sufficient to produce the effect
- causes act via independent mechanisms
- (Cheng, 1997)
- Co