Bayesian models of inductive learning

Slides: 385

Provided by: joshtenenb

Learn more at: https://cocosci.princeton.edu

Transcript and Presenter's Notes

1
Bayesian models of inductive learning

Josh Tenenbaum & Tom Griffiths
MIT
Computational Cognitive Science Group
Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)

2
What to expect

What you'll get out of this tutorial:
Our view of what Bayesian models have to offer cognitive science.
In-depth examples of basic and advanced models: how the math works and what it buys you.
Some comparison to other approaches.
Opportunities to ask questions.
What you won't get:
Detailed, hands-on how-to.
Where you can learn more:
http://bayesiancognition.com

3
Outline

Morning
Introduction (Josh)
Basic case study 1 Flipping coins (Tom)
Basic case study 2 Rules and similarity (Josh)
Afternoon
Advanced case study 1 Causal induction (Tom)
Advanced case study 2 Property induction (Josh)
Quick tour of more advanced topics (Tom)

5
Bayesian models in cognitive science

Vision
Motor control
Memory
Language
Inductive learning and reasoning.

6
Everyday inductive leaps

Learning concepts and words from examples

horse
horse
horse
7
Learning concepts and words

Can you pick out the tufas?

8
Inductive reasoning
Input
(premises)
(conclusion)
Task: Judge how likely the conclusion is to be true, given that the premises are true.
9
Inferring causal relations
Input:
Day    Took vitamin B23    Headache
1      yes                 no
2      yes                 yes
3      no                  yes
4      yes                 no
...    ...                 ...
Does vitamin B23 cause headaches?
Task: Judge the probability of a causal link given several joint observations.
10
Everyday inductive leaps

How can we learn so much about . . .
Properties of natural kinds
Meanings of words
Future outcomes of a dynamic process
Hidden causal properties of an object
Causes of a person's action (beliefs, goals)
Causal laws governing a domain
. . . from such limited data?

11
The Challenge

How do we generalize successfully from very
limited data?
Just one or a few examples
Often only positive examples
Philosophy
Induction is a problem, a riddle, a
paradox, a scandal, or a myth.
Machine learning and statistics
Focus on generalization from many examples, both
positive and negative.

12
Rational statistical inference (Bayes, Laplace)
p(h | d) = p(d | h) p(h) / sum over h' in H of p(d | h') p(h')
The denominator is a sum over the space of hypotheses.
13
Bayesian models of inductive learning: some recent history

Shepard (1987)
Analysis of one-shot stimulus generalization, to
explain the universal exponential law.
Anderson (1990)
Models of categorization and causal induction.
Oaksford & Chater (1994)
Model of conditional reasoning (Wason selection
task).
Heit (1998)
Framework for category-based inductive reasoning.

14
Theory-Based Bayesian Models

Rational statistical inference (Bayes)
Learners' domain theories generate their
hypothesis space H and prior p(h).
Well-matched to structure of the natural world.
Learnable from limited data.
Computationally tractable inference.

15
What is a theory?

Working definition:
An ontology and a system of abstract principles that generates a hypothesis space of candidate world structures along with their relative probabilities.
Analogy to grammar in language.
Example: Newton's laws

16
Structure and statistics

A framework for understanding how structured
knowledge and statistical inference interact.
How structured knowledge guides statistical
inference, and is itself acquired through
higher-order statistical learning.
How simplicity trades off with fit to the data in
evaluating structural hypotheses.
How increasingly complex structures may grow as
required by new data, rather than being
pre-specified in advance.

17
Structure and statistics

A framework for understanding how structured
knowledge and statistical inference interact.
How structured knowledge guides statistical
inference, and is itself acquired through
higher-order statistical learning.
Hierarchical Bayes.
How simplicity trades off with fit to the data in
evaluating structural hypotheses.
Bayesian Occam's Razor.
How increasingly complex structures may grow as
required by new data, rather than being
pre-specified in advance.
Non-parametric Bayes.

18
Alternative approaches to inductive generalization

Associative learning
Connectionist networks
Similarity to examples
Toolkit of simple heuristics
Constraint satisfaction
Analogical mapping

19
Marr's Three Levels of Analysis

Computation
What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?
Representation and algorithm
Cognitive psychology
Implementation
Neurobiology

20
Why Bayes?

A framework for explaining cognition.
How people can learn so much from such limited
data.
Why process-level models work the way that they
do.
Strong quantitative models with minimal ad hoc
assumptions.
A framework for understanding how structured
knowledge and statistical inference interact.
How structured knowledge guides statistical
inference, and is itself acquired through
higher-order statistical learning.
How simplicity trades off with fit to the data in
evaluating structural hypotheses (Occam's razor).
How increasingly complex structures may grow as
required by new data, rather than being
pre-specified in advance.

21
Outline

Morning
Introduction (Josh)
Basic case study 1 Flipping coins (Tom)
Basic case study 2 Rules and similarity (Josh)
Afternoon
Advanced case study 1 Causal induction (Tom)
Advanced case study 2 Property induction (Josh)
Quick tour of more advanced topics (Tom)

22
Coin flipping
23
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
24
Bayes' rule
For data D and a hypothesis H, we have

P(H | D) = P(D | H) P(H) / P(D)

Posterior probability: P(H | D)
Prior probability: P(H)
Likelihood: P(D | H)

25
The origin of Bayes' rule

A simple consequence of using probability to represent degrees of belief
For any two random variables A and B:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
so P(A | B) = P(B | A) P(A) / P(B)

26
Why represent degrees of belief with
probabilities?

Good statistics:
consistency, and worst-case error bounds.
Cox Axioms:
necessary to cohere with common sense
Dutch Book, Survival of the Fittest:
if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
Provides a theory of learning:
a common currency for combining prior knowledge and the lessons of experience.

27
Bayes' rule
For data D and a hypothesis H, we have

P(H | D) = P(D | H) P(H) / P(D)

Posterior probability: P(H | D)
Prior probability: P(H)
Likelihood: P(D | H)

28
Hypotheses in Bayesian inference

Hypotheses H refer to processes that could have generated the data D
Bayesian inference provides a distribution over these hypotheses, given D
P(D | H) is the probability of D being generated by the process identified by H
Hypotheses H are mutually exclusive: only one process could have generated D

29
Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT

Fair coin, P(H) = 0.5
Coin with P(H) = p
Markov model
Hidden Markov model
...

31
Representing generative models

Graphical model notation
Pearl (1988), Jordan (1998)
Variables are nodes, edges indicate dependency
Directed edges show causal process of data
generation

32
Models with latent structure

Not all nodes in a graphical model need to be
observed
Some variables reflect latent structure, used in
generating D but unobserved

33
Coin flipping

Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses
P(H) = 0.5 vs. P(H) = p
Comparing infinitely many hypotheses
P(H) = p
Psychology: Representativeness

35
Comparing two simple hypotheses

Contrast simple hypotheses:
H1: fair coin, P(H) = 0.5
H2: always heads, P(H) = 1.0
Bayes' rule: P(Hi | D) = P(D | Hi) P(Hi) / P(D)
With two hypotheses, use odds form

36
Bayes' rule in odds form

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] x [P(H1) / P(H2)]

D: data
H1, H2: models
P(H1 | D): posterior probability that H1 generated the data
P(D | H1): likelihood of the data under model H1
P(H1): prior probability that H1 generated the data
37
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
38
Comparing two simple hypotheses

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] x [P(H1) / P(H2)]

D: HHTHT
H1, H2: fair coin, always heads
P(D | H1) = 1/2^5    P(H1) = 999/1000
P(D | H2) = 0        P(H2) = 1/1000
P(H1 | D) / P(H2 | D) = infinity
39
Comparing two simple hypotheses

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] x [P(H1) / P(H2)]

D: HHHHH
H1, H2: fair coin, always heads
P(D | H1) = 1/2^5    P(H1) = 999/1000
P(D | H2) = 1        P(H2) = 1/1000
P(H1 | D) / P(H2 | D) ≈ 30
40
Comparing two simple hypotheses

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] x [P(H1) / P(H2)]

D: HHHHHHHHHH
H1, H2: fair coin, always heads
P(D | H1) = 1/2^10   P(H1) = 999/1000
P(D | H2) = 1        P(H2) = 1/1000
P(H1 | D) / P(H2 | D) ≈ 1
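
The three odds calculations for the fair coin versus the always-heads coin can be checked with a short sketch. The function name `posterior_odds` is ours, not from the slides; the priors 999/1000 and 1/1000 are the ones given on the slides.

```python
from fractions import Fraction

def posterior_odds(sequence, prior_h1=Fraction(999, 1000)):
    """Posterior odds P(H1|D) / P(H2|D), H1 = fair coin, H2 = always heads."""
    n = len(sequence)
    like_h1 = Fraction(1, 2) ** n  # a fair coin gives every length-n sequence probability 1/2^n
    # An always-heads coin gives probability 1 to all-heads data and 0 to anything else.
    like_h2 = Fraction(1) if set(sequence) <= {"H"} else Fraction(0)
    prior_h2 = 1 - prior_h1
    if like_h2 == 0:
        return float("inf")  # H2 is ruled out by a single tail
    return (like_h1 / like_h2) * (prior_h1 / prior_h2)

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, float(posterior_odds(d)))
```

The exact odds are 999/32 (about 31) for five heads and 999/1024 (just under 1) for ten heads, which the slides round to 30 and 1.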
41
Comparing two simple hypotheses

Bayes' rule tells us how to combine prior beliefs
with new data
top-down and bottom-up influences
As a model of human inference
predicts conclusions drawn from data
identifies point at which prior beliefs are
overwhelmed by new experiences
But more complex cases?

42
Coin flipping

Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses
P(H) = 0.5 vs. P(H) = p
Comparing infinitely many hypotheses
P(H) = p
Psychology: Representativeness

43
Comparing simple and complex hypotheses

Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

44
Comparing simple and complex hypotheses

P(H) = p is more complex than P(H) = 0.5 in two ways:
P(H) = 0.5 is a special case of P(H) = p
for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

45
Comparing simple and complex hypotheses
Probability
46
Comparing simple and complex hypotheses
Probability
HHHHH: p = 1.0
47
Comparing simple and complex hypotheses
Probability
HHTHT: p = 0.6
48
Comparing simple and complex hypotheses

P(H) = p is more complex than P(H) = 0.5 in two ways:
P(H) = 0.5 is a special case of P(H) = p
for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
How can we deal with this?
frequentist: hypothesis testing
information theorist: minimum description length
Bayesian: just use probability theory!

49
Comparing simple and complex hypotheses

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] x [P(H1) / P(H2)]

Computing P(D | H1) is easy:
P(D | H1) = 1/2^N
Compute P(D | H2) by averaging over p:
P(D | H2) = ∫ P(D | p) P(p) dp, integrating p from 0 to 1
50
Comparing simple and complex hypotheses
Probability
Distribution is an average over all values of p
51
Comparing simple and complex hypotheses
Probability
Distribution is an average over all values of p
52
Comparing simple and complex hypotheses

Simple and complex hypotheses can be compared directly using Bayes' rule
requires summing over latent variables
Complex hypotheses are penalized for their greater flexibility: Bayesian Occam's razor
This principle is used in model selection methods in psychology (e.g. Myung & Pitt, 1997)
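
As a rough illustration of averaging over p, the sketch below assumes a uniform prior on p under the complex hypothesis. The transcript does not preserve which prior the slides used, so that choice is an assumption, as are the function names. With a uniform prior the average has the closed form NH! NT! / (NH + NT + 1)!.

```python
from fractions import Fraction
from math import factorial

def likelihood_fair(n_heads, n_tails):
    """P(D | H1): a fair coin gives any specific sequence probability 1/2^N."""
    return Fraction(1, 2 ** (n_heads + n_tails))

def marginal_likelihood_uniform(n_heads, n_tails):
    """P(D | H2) = integral of p^NH (1-p)^NT dp over [0, 1] (uniform prior, assumed)."""
    return Fraction(factorial(n_heads) * factorial(n_tails),
                    factorial(n_heads + n_tails + 1))

for seq in ["HHTHT", "HHHHH"]:
    nh, nt = seq.count("H"), seq.count("T")
    ratio = likelihood_fair(nh, nt) / marginal_likelihood_uniform(nh, nt)
    print(seq, "P(D|H1) / P(D|H2) =", float(ratio))
```

Under this assumption HHTHT favors the fair coin (ratio above 1) even though some value of p fits it better, while HHHHH favors the flexible hypothesis (ratio below 1).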

53
Coin flipping

Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses
P(H) = 0.5 vs. P(H) = p
Comparing infinitely many hypotheses
P(H) = p
Psychology: Representativeness

54
Comparing infinitely many hypotheses

Assume data are generated from a model
What is the value of p?
each value of p is a hypothesis H
requires inference over infinitely many hypotheses

55
Comparing infinitely many hypotheses

Flip a coin 10 times and see 5 heads, 5 tails.
P(H) on next flip? 50%
Why? 50% = 5 / (5+5) = 5/10.
Future will be like the past.
Suppose we had seen 4 heads and 6 tails.
P(H) on next flip? Closer to 50% than to 40%.
Why? Prior knowledge.

56
Integrating prior knowledge and data

Posterior distribution P(p | D) is a probability density over p = P(H)
Need to work out likelihood P(D | p) and specify prior distribution P(p)

P(p | D) ∝ P(D | p) P(p)
57
Likelihood and prior

Likelihood:
P(D | p) = p^NH (1-p)^NT
NH: number of heads
NT: number of tails
Prior:
P(p) ∝ p^(FH-1) (1-p)^(FT-1)
58
A simple method of specifying priors

Imagine some fictitious trials, reflecting a set
of previous experiences
strategy often used with neural networks
e.g., F = {1000 heads, 1000 tails}: strong expectation that any new coin will be fair
In fact, this is a sensible statistical idea...

59
Likelihood and prior

Likelihood:
P(D | p) = p^NH (1-p)^NT
NH: number of heads
NT: number of tails
Prior:
P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
FH: fictitious observations of heads
FT: fictitious observations of tails
60
Conjugate priors

Exist for many standard distributions
formula for exponential family conjugacy
Define prior in terms of fictitious observations
Beta is conjugate to Bernoulli (coin-flipping)

[Figure panels: FH = FT = 1; FH = FT = 3; FH = FT = 1000]
61
Likelihood and prior

Likelihood:
P(D | p) = p^NH (1-p)^NT
NH: number of heads
NT: number of tails
Prior:
P(p) ∝ p^(FH-1) (1-p)^(FT-1)
FH: fictitious observations of heads
FT: fictitious observations of tails

62
Comparing infinitely many hypotheses
P(p | D) ∝ P(D | p) P(p) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)

Posterior is Beta(NH+FH, NT+FT)
same form as conjugate prior
Posterior mean: (NH+FH) / (NH+FH+NT+FT)
Posterior predictive distribution: P(H on next flip | D) = (NH+FH) / (NH+FH+NT+FT)

63
Some examples

e.g., F = {1000 heads, 1000 tails}: strong expectation that any new coin will be fair
After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
e.g., F = {3 heads, 3 tails}: weak expectation that any new coin will be fair
After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%
Prior knowledge too weak

64
But flipping thumbtacks

e.g., F = {4 heads, 3 tails}: weak expectation that tacks are slightly biased towards heads
After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%
Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
Suppose F = {}: After seeing 2 heads, 0 tails, P(H) on next flip = 2 / (2+0) = 100%
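
The coin and thumbtack predictions all follow one pattern: add the fictitious counts to the observed counts and take the fraction of heads. A minimal sketch, with a function name of our own choosing:

```python
from fractions import Fraction

def predict_heads(n_heads, n_tails, f_heads, f_tails):
    """P(heads on next flip) after observing (n_heads, n_tails)
    with a prior worth (f_heads, f_tails) fictitious flips."""
    return Fraction(n_heads + f_heads,
                    n_heads + f_heads + n_tails + f_tails)

print(float(predict_heads(4, 6, 1000, 1000)))  # strong fair-coin prior
print(float(predict_heads(4, 6, 3, 3)))        # weak fair-coin prior
print(float(predict_heads(2, 0, 4, 3)))        # thumbtack prior
print(float(predict_heads(2, 0, 0, 0)))        # no prior: raw frequency
```

The last call corresponds to the empty set of fictitious observations on the slide, where the prediction collapses to the observed frequency.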

65
Origin of prior knowledge

Tempting answer: prior experience
Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails
By assuming all coins (and flips) are alike,
these observations of other coins are as good as
observations of the present coin

66
Problems with simple empiricism

Haven't really seen 2000 coin flips, or any flips of a thumbtack
Prior knowledge is stronger than raw experience justifies
Haven't seen exactly equal number of heads and tails
Prior knowledge is smoother than raw experience justifies
Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
Prior knowledge is more structured than raw experience

67
A simple theory

Coins are manufactured by a standardized
procedure that is effective but not perfect.
Justifies generalizing from previous coins to the
present coin.
Justifies smoother and stronger prior than raw
experience alone.
Explains why seeing 10 flips each for 200 coins
is more valuable than seeing 2000 flips of one
coin.
Tacks are asymmetric, and manufactured to less
exacting standards.

68
Limitations

Can all domain knowledge be represented so
simply, in terms of an equivalent number of
fictional observations?
Suppose you flip a coin 25 times and get all heads. Something funny is going on...
But with F = {1000 heads, 1000 tails}, P(H) on next flip = 1025 / (1025+1000) = 50.6%.
Looks like nothing unusual

69
Hierarchical priors

Higher-order hypothesis: is this coin fair or unfair?
Example probabilities:
P(fair) = 0.99
P(p | fair) is Beta(1000, 1000)
P(p | unfair) is Beta(1, 1)
25 heads in a row propagates up, affecting p and then P(fair | D)

[Graphical model: fair -> p -> d1, d2, d3, d4]
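
To see how 25 heads in a row "propagates up", the sketch below computes P(fair | D) with the example probabilities from the slide. It relies on the standard result that a specific sequence with NH heads and NT tails has marginal probability B(a+NH, b+NT) / B(a, b) under a Beta(a, b) prior. The function names are ours.

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(n_heads, n_tails, a, b):
    """log P(D | p ~ Beta(a, b)) for one specific sequence of flips."""
    return log_beta(a + n_heads, b + n_tails) - log_beta(a, b)

def prob_fair(n_heads, n_tails, prior_fair=0.99):
    """P(fair | D), with Beta(1000, 1000) if fair and Beta(1, 1) if unfair."""
    log_fair = log(prior_fair) + log_marginal(n_heads, n_tails, 1000, 1000)
    log_unfair = log(1 - prior_fair) + log_marginal(n_heads, n_tails, 1, 1)
    m = max(log_fair, log_unfair)  # subtract the max for numerical stability
    w_fair, w_unfair = exp(log_fair - m), exp(log_unfair - m)
    return w_fair / (w_fair + w_unfair)

print(prob_fair(5, 5))   # balanced data: the coin still looks fair
print(prob_fair(25, 0))  # 25 heads in a row: almost certainly unfair
```

Unlike the flat Beta(1000, 1000) prior, which predicts about 50.6% heads after 25 heads, the two-level model lets the data overturn the belief that the coin is fair.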
70
More hierarchical priors

Latent structure can capture coin variability
10 flips from 200 coins is better than 2000 flips from a single coin: allows estimation of FH, FT

[Graphical model: FH, FT -> p ~ Beta(FH, FT), one p for each of Coin 1, Coin 2, ..., Coin 200; each p -> d1, d2, d3, d4]
71
Yet more hierarchical priors

Discrete beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. FH, FT)

[Graphical model: physical knowledge -> FH, FT -> p (one per coin) -> d1, d2, d3, d4]
72
Comparing infinitely many hypotheses

Apply Bayes' rule to obtain posterior probability
density
Requires prior over all hypotheses
computation simplified by conjugate priors
richer structure with hierarchical priors
Hierarchical priors indicate how simple theories
can inform statistical inferences
one step towards structure and statistics

73
Coin flipping

Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses
P(H) = 0.5 vs. P(H) = p
Comparing infinitely many hypotheses
P(H) = p
Psychology: Representativeness

74
Psychology: Representativeness

Which sequence is more likely from a fair coin?

HHTHT: more representative of a fair coin (Kahneman & Tversky, 1972)
HHHHH
75
What might representativeness mean?
Evidence for a random generating process
76
A constrained hypothesis space

Four hypotheses:
h1: fair coin, e.g. HHTHTTTH
h2: always alternates, e.g. HTHTHTHT
h3: mostly heads, e.g. HHTHTHHH
h4: always heads, e.g. HHHHHHHH

77
Representativeness judgments
78
Results

Good account of representativeness data, with three pseudo-free parameters, ρ = 0.91
"always alternates" means 99% of the time
"mostly heads" means P(H) = 0.85
"always heads" means P(H) = 0.99
With scaling parameter, r = 0.95

(Tenenbaum & Griffiths, 2001)
79
The role of theories

The fact that HHTHT looks representative of a
fair coin and HHHHH does not reflects our
implicit theories of how the world works.
Easy to imagine how a trick all-heads coin could work: high prior probability.
Hard to imagine how a trick HHTHT coin could work: low prior probability.

80
Summary

Three kinds of Bayesian inference
comparing two simple hypotheses
comparing simple and complex hypotheses
comparing an infinite number of hypotheses
Critical notions
generative models, graphical models
Bayesian Occam's razor
priors: conjugate, hierarchical (theories)

81
Outline

Morning
Introduction (Josh)
Basic case study 1 Flipping coins (Tom)
Basic case study 2 Rules and similarity (Josh)
Afternoon
Advanced case study 1 Causal induction (Tom)
Advanced case study 2 Property induction (Josh)
Quick tour of more advanced topics (Tom)

82
Rules and similarity
83
Structure versus statistics
Statistics Similarity Typicality
Rules Logic Symbols
84
A better metaphor
85
A better metaphor
86
Structure and statistics
Statistics Similarity Typicality
Rules Logic Symbols
87
Structure and statistics

Basic case study 1 Flipping coins
Learning and reasoning with structured
statistical models.
Basic case study 2 Rules and similarity
Statistical learning with structured
representations.

88
The number game

Program input: number between 1 and 100
Program output: yes or no

89
The number game

Learning task
Observe one or more positive (yes) examples.
Judge whether other numbers are yes or no.

90
The number game
Examples of yes numbers    Generalization judgments (N = 20)
60                         Diffuse similarity
91
The number game
Examples of yes numbers    Generalization judgments (N = 20)
60                         Diffuse similarity
60 80 10 30                Rule: multiples of 10
92
The number game
Examples of yes numbers    Generalization judgments (N = 20)
60                         Diffuse similarity
60 80 10 30                Rule: multiples of 10
60 52 57 55                Focused similarity: numbers near 50-60
93
The number game
Examples of yes numbers    Generalization judgments (N = 20)
16                         Diffuse similarity
16 8 2 64                  Rule: powers of 2
16 23 19 20                Focused similarity: numbers near 20
94
The number game

Main phenomena to explain
Generalization can appear either similarity-based
(graded) or rule-based (all-or-none).
Learning from just a few positive examples.

95
Rule/similarity hybrid models

Category learning
Nosofsky, Palmeri et al.: RULEX
Erickson & Kruschke: ATRIUM

96
Divisions into rule and similarity subsystems

Category learning
Nosofsky, Palmeri et al.: RULEX
Erickson & Kruschke: ATRIUM
Language processing
Pinker, Marcus et al.: Past tense morphology
Reasoning
Sloman
Rips
Nisbett, Smith et al.

97
Rule/similarity hybrid models

Why two modules?
Why do these modules work the way that they do,
and interact as they do?
How do people infer a rule or similarity metric
from just a few positive examples?

98
Bayesian model

H: Hypothesis space of possible concepts
h1 = {2, 4, 6, 8, 10, 12, ..., 96, 98, 100} (even numbers)
h2 = {10, 20, 30, 40, ..., 90, 100} (multiples of 10)
h3 = {2, 4, 8, 16, 32, 64} (powers of 2)
h4 = {50, 51, 52, ..., 59, 60} (numbers between 50 and 60)
. . .

Representational interpretations for H
Candidate rules
Features for similarity
Consequential subsets (Shepard, 1987)

99
Inferring hypotheses from similarity judgment
Additive clustering (Shepard & Arabie, 1977):
s_ij = sum over k of w_k f_ik f_jk
s_ij: similarity of stimuli i, j
w_k: weight of cluster k
f_ik: membership of stimulus i in cluster k (1 if stimulus i in cluster k, 0 otherwise)
Equivalent to similarity as a weighted sum of common features (Tversky, 1977).

100

Additive clustering for the integers 0-9

Rank   Weight   Interpretation
1      .444     powers of two
2      .345     small numbers
3      .331     multiples of three
4      .291     large numbers
5      .255     middle numbers
6      .216     odd numbers
7      .214     smallish numbers
8      .172     largish numbers
[The "Stimuli in cluster" column, which marked members among 0 1 2 3 4 5 6 7 8 9, is not preserved in the transcript.]
101
Three hypothesis subspaces for number concepts

Mathematical properties (24 hypotheses)
Odd, even, square, cube, prime numbers
Multiples of small integers
Powers of small integers
Raw magnitude (5050 hypotheses)
All intervals of integers with endpoints between
1 and 100.
Approximate magnitude (10 hypotheses)
Decades (1-10, 10-20, 20-30, ...)

102
Hypothesis spaces and theories

Why a hypothesis space is like a domain theory
Represents one particular way of classifying
entities in a domain.
Not just an arbitrary collection of hypotheses,
but a principled system.
What's missing?
Explicit representation of the principles.
Hypothesis spaces (and priors) are generated by
theories. Some analogies
Grammars generate languages (and priors over
structural descriptions)
Hierarchical Bayesian modeling

103
Bayesian model

H: Hypothesis space of possible concepts
Mathematical properties: even, odd, square, prime, . . . .
Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
Raw magnitude: all intervals between 1 and 100.
X = {x1, . . . , xn}: n examples of a concept C.
Evaluate hypotheses given data:
p(h) [prior]: domain knowledge, pre-existing biases
p(X | h) [likelihood]: statistical information in examples.
p(h | X) [posterior]: degree of belief that h is the true extension of C.

105

Likelihood: p(X | h)
Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.
p(X | h) = (1 / size(h))^n if x1, . . . , xn are all in h, and 0 otherwise
Follows from assumption of randomly sampled examples.
Captures the intuition of a representative sample.
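
A small sketch of the size principle, comparing "even numbers" (50 members) with "multiples of 10" (10 members) on the number-game examples. It assumes each example is drawn uniformly at random from the hypothesis; the names below are ours.

```python
from fractions import Fraction

EVENS = set(range(2, 101, 2))             # 50 numbers
MULTIPLES_OF_10 = set(range(10, 101, 10)) # 10 numbers

def likelihood(examples, hypothesis):
    """p(X | h) under random sampling: (1/|h|)^n if every example is in h, else 0."""
    if not set(examples) <= hypothesis:
        return Fraction(0)
    return Fraction(1, len(hypothesis)) ** len(examples)

for examples in ([60], [60, 80, 10, 30]):
    ratio = likelihood(examples, MULTIPLES_OF_10) / likelihood(examples, EVENS)
    print(examples, "favor multiples of 10 by a factor of", ratio)
```

One example favors the smaller hypothesis by a factor of 5; four examples favor it by 5^4 = 625, which is the exponential growth the size principle describes.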

106
Illustrating the size principle
[Figure: grid of the even numbers 2 to 100, with two hypotheses marked h1 and h2]
107
Illustrating the size principle
[Figure: grid of the even numbers 2 to 100, with two hypotheses marked h1 and h2]
Data slightly more of a coincidence under h1
108
Illustrating the size principle
[Figure: grid of the even numbers 2 to 100, with two hypotheses marked h1 and h2]
Data much more of a coincidence under h1
109
Bayesian Occam's Razor
Law of Conservation of Belief: for any model M,
sum over all possible data sets d of p(D = d | M) = 1
[Figure: p(D = d | M) plotted over all possible data sets d, for models M1 and M2]
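
The conservation law can be checked by brute force on coin sequences. The sketch below enumerates all 32 sequences of five flips under two models: a fair coin, and a flexible coin whose bias p is averaged over a uniform prior. The uniform prior and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def p_fair(seq):
    """Fair coin: every sequence of length N has probability 1/2^N."""
    return Fraction(1, 2 ** len(seq))

def p_flexible(seq):
    """Unknown bias p with a uniform prior (assumed): NH! NT! / (N + 1)!."""
    nh, nt = seq.count("H"), seq.count("T")
    return Fraction(factorial(nh) * factorial(nt), factorial(nh + nt + 1))

sequences = ["".join(s) for s in product("HT", repeat=5)]
total_fair = sum(p_fair(s) for s in sequences)
total_flexible = sum(p_flexible(s) for s in sequences)
print(total_fair, total_flexible)  # both models spend exactly one unit of belief
print(p_fair("HHTHT"), p_flexible("HHTHT"))
print(p_fair("HHHHH"), p_flexible("HHHHH"))
```

Because both totals are 1, the flexible model can only assign more probability to extreme sequences such as HHHHH by assigning less to typical ones such as HHTHT.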
110
Comparing simple and complex hypotheses
Probability
Distribution is an average over all values of p
111

Prior: p(h)
Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses.
Prevents overfitting by highly specific but unnatural hypotheses, e.g. "multiples of 10 except 50 and 70".

112

Prior: p(h)
Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses.
Prevents overfitting by highly specific but unnatural hypotheses, e.g. "multiples of 10 except 50 and 70".
p(h) encodes relative weights of alternative theories:

H: Total hypothesis space

H1: Math properties (24)
even numbers
powers of two
multiples of three
...

H2: Raw magnitude (5050)
10-15
20-32
37-54
...

H3: Approx. magnitude (10)
10-20
20-30
30-40
...

113
A more complex approach to priors

Start with a base set of regularities R and combination operators C.
Hypothesis space = closure of R under C.
C = {and, or}: H = unions and intersections of regularities in R (e.g., "multiples of 10 between 30 and 70").
C = {and-not}: H = regularities in R with exceptions (e.g., "multiples of 10 except 50 and 70").
Two qualitatively similar priors:
Description length: number of combinations in C needed to generate hypothesis from R.
Bayesian Occam's Razor, with model classes defined by number of combinations: more combinations -> more hypotheses -> lower prior

114

Posterior:
X = {60, 80, 10, 30}
Why prefer "multiples of 10" over "even numbers"? p(X | h).
Why prefer "multiples of 10" over "multiples of 10 except 50 and 20"? p(h).
Why does a good generalization need both high prior and high likelihood? p(h | X) ∝ p(X | h) p(h)

115
Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
116
Generalizing to new objects
Given p(h | X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y?
117
Generalizing to new objects
Hypothesis averaging: Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h | X):
p(y ∈ C | X) = sum over h in H of p(y ∈ C | h) p(h | X)
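
Hypothesis averaging can be sketched on a toy version of the number game. The four hypotheses are the ones listed earlier in the talk; the uniform prior over them is an assumption made here for simplicity (the actual model uses a structured prior), and the function names are ours.

```python
from fractions import Fraction

HYPOTHESES = {
    "even numbers": set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
    "powers of 2": {2, 4, 8, 16, 32, 64},
    "numbers between 50 and 60": set(range(50, 61)),
}

def posterior(examples):
    """p(h | X) with a uniform prior (assumed) and the size-principle likelihood.
    Raises ZeroDivisionError if no hypothesis contains all the examples."""
    scores = {}
    for name, members in HYPOTHESES.items():
        if set(examples) <= members:
            scores[name] = Fraction(1, len(members)) ** len(examples)
        else:
            scores[name] = Fraction(0)
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

def prob_in_concept(y, examples):
    """p(y in C | X): total posterior weight of the hypotheses that contain y."""
    post = posterior(examples)
    return sum(w for name, w in post.items() if y in HYPOTHESES[name])

for examples in ([60], [60, 80, 10, 30]):
    print(examples, {y: float(prob_in_concept(y, examples)) for y in (20, 52, 87)})
```

With the single example 60, generalization is graded across several hypotheses; with four multiples of 10, nearly all posterior weight sits on one hypothesis and generalization looks rule-like.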
118
Examples 16
119
Connection to feature-based similarity

Additive clustering model of similarity
Bayesian hypothesis averaging
Equivalent if we identify features fk with hypotheses h, and weights wk with p(h | X).

120
Examples 16 8 2 64
121
Examples 16 23 19 20
122
Model fits
Examples of yes numbers
Generalization judgments (N = 20)
Bayesian Model (r = 0.96)
60
60 80 10 30
60 52 57 55
123
Model fits
Examples of yes numbers
Generalization judgments (N = 20)
Bayesian Model (r = 0.93)
16
16 8 2 64
16 23 19 20
124
Summary of the Bayesian model

How do the statistics of the examples interact
with prior knowledge to guide generalization?
Why does generalization appear rule-based or
similarity-based?

126
Alternative models

Neural networks

127
Alternative models

Neural networks
Hypothesis ranking and elimination

Hypothesis ranking 1
2 3 4 .
even
multiple of 10
power of 2
multiple of 3
.
60
80
10
30
128
Alternative models

Neural networks
Hypothesis ranking and elimination
Similarity to exemplars
Average similarity

60
60 80 10 30
60 52 57 55
Model (r = 0.80)
Data
129
Alternative models

Neural networks
Hypothesis ranking and elimination
Similarity to exemplars
Max similarity

60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
130
Alternative models

Neural networks
Hypothesis ranking and elimination
Similarity to exemplars
Average similarity
Max similarity
Flexible similarity?

Bayes.
131
Alternative models

Neural networks
Hypothesis ranking and elimination
Similarity to exemplars
Toolbox of simple heuristics
60: general similarity
60 80 10 30: most specific rule (subset principle).
60 52 57 55: similarity in magnitude

Why these heuristics? When to use which
heuristic? Bayes.
132
Summary

Generalization from limited data possible via the
interaction of structured knowledge and
statistics.
Structured knowledge: space of candidate rules, theories generate hypothesis space (cf. hierarchical priors)
Statistics: Bayesian Occam's razor.
Better understand the interactions between
traditionally opposing concepts
Rules and statistics
Rules and similarity
Explains why central but notoriously slippery
processing-level concepts work the way they do.
Similarity
Representativeness

Rules and representativeness

133
Why Bayes?

A framework for explaining cognition.
How people can learn so much from such limited
data.
Why process-level models work the way that they
do.
Strong quantitative models with minimal ad hoc
assumptions.
A framework for understanding how structured
knowledge and statistical inference interact.
How structured knowledge guides statistical
inference, and is itself acquired through
higher-order statistical learning.
How simplicity trades off with fit to the data in
evaluating structural hypotheses (Occam's razor).
How increasingly complex structures may grow as
required by new data, rather than being
pre-specified in advance.

134
Theory-Based Bayesian Models

Rational statistical inference (Bayes)
Learners' domain theories generate their
hypothesis space H and prior p(h).
Well-matched to structure of the natural world.
Learnable from limited data.
Computationally tractable inference.

135
Looking towards the afternoon

How do we apply these ideas to more natural and
complex aspects of cognition?
Where do the hypothesis spaces come from?
Can we formalize the contributions of domain
theories?

137
Outline

Morning
Introduction (Josh)
Basic case study 1 Flipping coins (Tom)
Basic case study 2 Rules and similarity (Josh)
Afternoon
Advanced case study 1 Causal induction (Tom)
Advanced case study 2 Property induction (Josh)
Quick tour of more advanced topics (Tom)

139
Marr's Three Levels of Analysis

Computation
What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?
Representation and algorithm
Cognitive psychology
Implementation
Neurobiology

140
Working at the computational level

What is the computational problem?
input data
output solution

141
Working at the computational level

What is the computational problem?
input data
output solution
What knowledge is available to the learner?
Where does that knowledge come from?

142
Theory-Based Bayesian Models

Rational statistical inference (Bayes)
Learners' domain theories generate their
hypothesis space H and prior p(h).
Well-matched to structure of the natural world.
Learnable from limited data.
Computationally tractable inference.

143
Causality
144
Bayes nets and beyond...

Increasingly popular approach to studying human
causal inferences
(e.g. Glymour, 2001; Gopnik et al., 2004)
Three reactions
Bayes nets are the solution!
Bayes nets are missing the point, not sure why
what is a Bayes net?

145
Bayes nets and beyond...

What are Bayes nets?
graphical models
causal graphical models
An example: elemental causal induction
Beyond Bayes nets
other knowledge in causal induction
formalizing causal theories

147
Graphical models

Express the probabilistic dependency structure
among a set of variables (Pearl, 1988)
Consist of
a set of nodes, corresponding to variables
a set of edges, indicating dependency
a set of functions defined on the graph that
defines a probability distribution

148
Undirected graphical models
[Figure: undirected graph over nodes X1 to X5]

Consist of
a set of nodes
a set of edges
a potential for each clique, multiplied together
to yield the distribution over variables
Examples
statistical physics: Ising model, spin glasses
early neural networks (e.g. Boltzmann machines)
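
The "potential for each clique, multiplied together" recipe can be sketched in a few lines. The example below is only an illustration: the three-node chain, the coupling strength, and all function names are assumptions, not taken from the slides. It builds an Ising-style distribution by multiplying edge potentials and normalizing.

```python
import math
from itertools import product

COUPLING = 1.0            # assumed interaction strength
EDGES = [(0, 1), (1, 2)]  # assumed three-node chain; each edge is a clique


def clique_potential(s_i: int, s_j: int) -> float:
    """Ising-style potential: larger when neighbouring spins agree."""
    return math.exp(COUPLING * s_i * s_j)


def unnormalized(state) -> float:
    """Product of the potentials of all cliques in the graph."""
    value = 1.0
    for i, j in EDGES:
        value *= clique_potential(state[i], state[j])
    return value


states = list(product([-1, 1], repeat=3))
Z = sum(unnormalized(s) for s in states)  # normalizing constant
distribution = {s: unnormalized(s) / Z for s in states}

# Aligned spins should come out more probable than a frustrated configuration.
print(distribution[(1, 1, 1)], distribution[(1, -1, 1)])
```

Enumerating every state to compute the normalizer is only feasible for tiny graphs; it is used here to keep the sketch self-contained.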

149
Directed graphical models
[Figure: directed acyclic graph over nodes X1 to X5]

Consist of
a set of nodes
a set of edges
a conditional probability distribution for each
node, conditioned on its parents, multiplied
together to yield the distribution over variables
Constrained to directed acyclic graphs (DAG)
AKA Bayesian networks, Bayes nets

150
Bayesian networks and Bayes

Two different problems
Bayesian statistics is a method of inference
Bayesian networks are a form of representation
There is no necessary connection
many users of Bayesian networks rely upon
frequentist statistical methods (e.g. Glymour)
many Bayesian inferences cannot be easily
represented using Bayesian networks

151
Properties of Bayesian networks

Efficient representation and inference
exploiting dependency structure makes it easier
to represent and compute with probabilities
Explaining away
pattern of probabilistic reasoning characteristic
of Bayesian networks, especially early use in AI

152
Efficient representation and inference

Three binary variables: Cavity, Toothache, Catch

153
Efficient representation and inference

Three binary variables: Cavity, Toothache, Catch
Specifying P(Cavity, Toothache, Catch) requires 7
parameters (1 for each set of values, minus 1
because it's a probability distribution)
With n variables, we need 2^n - 1 parameters
Here n = 3. Realistically, many more: X-ray, diet,
oral hygiene, personality, ...

154
Conditional independence

All three variables are dependent, but Toothache
and Catch are independent given the presence or
absence of Cavity
In probabilistic terms:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
With n evidence variables, x1, ..., xn, we need 2n
conditional probabilities
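
To make the counting argument concrete, the sketch below compares the number of free parameters in an unconstrained joint table with the number needed when the evidence variables are conditionally independent given a single binary cause. The function names are hypothetical, and the factored count includes one extra parameter for the prior on the cause.

```python
def full_joint_params(num_vars: int) -> int:
    """Free parameters in an unconstrained joint over binary variables."""
    return 2 ** num_vars - 1


def factored_params(num_evidence: int) -> int:
    """Parameters when evidence variables are independent given one binary cause.

    One prior P(cause), plus P(x_i = 1 | cause) for each of the two cause values.
    """
    return 1 + 2 * num_evidence


# n evidence variables plus the cause itself gives n + 1 variables in total.
for n in (2, 5, 10, 20):
    print(n, full_joint_params(n + 1), factored_params(n))
```

For Cavity with Toothache and Catch as evidence (n = 2), this gives 7 parameters for the full joint and 5 for the factored form; the gap widens exponentially as n grows.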

155
A simple Bayesian network

Graphical representation of relations between a
set of random variables
Probabilistic interpretation: factorizing complex
terms
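
As a small illustration of the factorization, the sketch below builds the joint over Cavity, Toothache, and Catch as P(Cavity) P(Toothache | Cavity) P(Catch | Cavity). All probability values are made up for the example, and the helper names are hypothetical. It then checks that the two symptoms are marginally dependent even though they are independent given Cavity.

```python
from itertools import product

P_CAVITY = 0.2                      # assumed prior P(Cavity = true)
P_ACHE = {True: 0.6, False: 0.1}    # assumed P(Toothache = true | Cavity)
P_CATCH = {True: 0.9, False: 0.2}   # assumed P(Catch = true | Cavity)


def bern(p: float, value: bool) -> float:
    """Probability of a binary outcome with P(true) = p."""
    return p if value else 1.0 - p


def joint(cavity: bool, ache: bool, catch: bool) -> float:
    """Factored joint: prior on the cause times one term per symptom."""
    return (bern(P_CAVITY, cavity)
            * bern(P_ACHE[cavity], ache)
            * bern(P_CATCH[cavity], catch))


total = sum(joint(*s) for s in product([True, False], repeat=3))
p_ache = sum(joint(c, True, k) for c, k in product([True, False], repeat=2))
p_catch = sum(joint(c, a, True) for c, a in product([True, False], repeat=2))
p_both = sum(joint(c, True, True) for c in (True, False))

# If the symptoms were marginally independent, these two numbers would match.
print(p_both, p_ache * p_catch)
```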

156
A more complex system
[Figure: Bayesian network over Battery, Radio, Ignition, Gas, Starts, On time to work]

Joint distribution sufficient for any inference

158
A more complex system
[Figure: Bayesian network over Battery, Radio, Ignition, Gas, Starts, On time to work]

Joint distribution sufficient for any inference
General inference algorithm: local message
passing (belief propagation; Pearl, 1988)
Efficiency depends on sparseness of graph
structure

159
Explaining away

Assume grass will be wet if and only if it rained
last night or the sprinklers were left on

160
Explaining away
Compute probability it rained last night, given
that the grass is wet
165
Explaining away
Compute probability it rained last night, given
that the grass is wet and sprinklers were left
on
167
Explaining away
Discounting to prior probability.
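
The explaining-away pattern can be reproduced by brute-force enumeration. The sketch below assumes independent priors on Rain and Sprinkler (the numbers are invented) and treats Wet as a deterministic OR of the two, as stated in the setup. The function names are hypothetical.

```python
from itertools import product

P_RAIN = 0.3       # assumed prior, not from the slides
P_SPRINKLER = 0.2  # assumed prior, not from the slides


def joint(rain: bool, sprinkler: bool, wet: bool) -> float:
    """P(rain, sprinkler, wet) with Wet a deterministic OR of its causes."""
    p = ((P_RAIN if rain else 1 - P_RAIN)
         * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER))
    return p if wet == (rain or sprinkler) else 0.0


def prob_rain(**evidence: bool) -> float:
    """P(rain = True | evidence), summing the joint over all states."""
    num = den = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        state = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if all(state[name] == value for name, value in evidence.items()):
            p = joint(rain, sprinkler, wet)
            den += p
            if rain:
                num += p
    return num / den


print(prob_rain(wet=True))                  # raised above the prior
print(prob_rain(wet=True, sprinkler=True))  # back at the prior
```

Under these assumptions, observing wet grass raises the probability of rain above 0.3, and additionally observing the sprinkler returns it exactly to the prior, which is the discounting described above.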
168
Contrast w/ production system
Rain
Grass Wet

Formulate IF-THEN rules
IF Rain THEN Wet
IF Wet THEN Rain
Rules do not distinguish directions of inference
Requires combinatorial explosion of rules

169
Contrast w/ spreading activation
Rain
Sprinkler
Grass Wet

Observing rain, Wet becomes more active.
Observing grass wet, Rain and Sprinkler become
more active.
Observing grass wet and sprinkler, Rain cannot
become less active. No explaining away!

Excitatory links: Rain → Wet, Sprinkler → Wet

170
Contrast w/ spreading activation
Rain
Sprinkler
Grass Wet

Excitatory links: Rain → Wet, Sprinkler → Wet
Inhibitory link: between Rain and Sprinkler

Observing grass wet, Rain and Sprinkler become
more active.
Observing grass wet and sprinkler, Rain becomes
less active: explaining away.

171
Contrast w/ spreading activation
Rain
Burst pipe
Sprinkler
Grass Wet

Each new variable requires more inhibitory
connections.
Interactions between variables are not causal.
Not modular.
Whether a connection exists depends on what other
connections exist, in non-transparent ways.
Big holism problem.
Combinatorial explosion.

172
Graphical models

Capture dependency structure in distributions
Provide an efficient means of representing and
reasoning with probabilities
Allow kinds of inference that are problematic for
other representations: explaining away
hard to capture in a production system
hard to capture with spreading activation

173
Bayes nets and beyond...

What are Bayes nets?
graphical models
causal graphical models
An example: causal induction
Beyond Bayes nets
other knowledge in causal induction
formalizing causal theories

174
Causal graphical models

Graphical models represent statistical
dependencies among variables (i.e. correlations)
can answer questions about observations
Causal graphical models represent causal
dependencies among variables
express underlying causal structure
can answer questions about both observations and
interventions (actions upon a variable)

175
Observation and intervention
[Figure: Bayesian network over Battery, Radio, Ignition, Gas, Starts, On time to work]
Graphical model: P(Radio | Ignition)
Causal graphical model: P(Radio | do(Ignition))
176
Observation and intervention
[Figure: Bayesian network over Battery, Radio, Ignition, Gas, Starts, On time to work]
Graphical model: P(Radio | Ignition)
Causal graphical model: P(Radio | do(Ignition))
Graph surgery produces mutilated graph
177
Assessing interventions

To compute P(Y | do(X = x)), delete all edges coming
into X and reason with the resulting Bayesian
network (do calculus; Pearl, 2000)
Allows a single structure to make predictions
about both observations and interventions
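
A minimal sketch of the difference between observing and intervening, assuming a fragment of the car network in which Battery is a common cause of Radio and Ignition. The edge structure and every number below are assumptions made for illustration, and the function names are hypothetical.

```python
P_BATTERY = 0.9                       # assumed P(Battery = ok)
P_RADIO = {True: 0.95, False: 0.0}    # assumed P(Radio = on | Battery)
P_IGNITION = {True: 0.9, False: 0.0}  # assumed P(Ignition = on | Battery)


def p_radio_given_ignition() -> float:
    """Observation: P(Radio = on | Ignition = on), summing over Battery."""
    num = den = 0.0
    for battery in (True, False):
        p_b = P_BATTERY if battery else 1 - P_BATTERY
        weight = p_b * P_IGNITION[battery]  # joint of Battery and the evidence
        num += weight * P_RADIO[battery]
        den += weight
    return num / den


def p_radio_do_ignition() -> float:
    """Intervention: P(Radio = on | do(Ignition = on)).

    Graph surgery deletes the edge Battery -> Ignition, so forcing the
    ignition on tells us nothing about Battery, which keeps its prior.
    """
    return sum((P_BATTERY if b else 1 - P_BATTERY) * P_RADIO[b]
               for b in (True, False))


print(p_radio_given_ignition(), p_radio_do_ignition())
```

Seeing the ignition work is evidence that the battery is fine, which raises the probability that the radio works; setting the ignition by hand carries no such evidence.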

178
Causality simplifies inference

Using a representation in which the direction of
causality is correct produces sparser graphs
Suppose we get the direction of causality wrong,
thinking that symptoms cause diseases
Does not capture the correlation between
symptoms: falsely believe P(Ache, Catch) =
P(Ache) P(Catch).

Ache
Catch
Cavity
179
Causality simplifies inference

Using a representation in which the direction of
causality is correct produces sparser graphs
Suppose we get the direction of causality wrong,
thinking that symptoms cause diseases
Inserting a new arrow allows us to capture this
correlation.
This model is too complex do not believe that

Ache
Catch
Cavity
180
Causality simplifies inference

Using a representation in which the direction of
causality is correct produces sparser graphs
Suppose we get the direction of causality wrong,
thinking that symptoms cause diseases
New symptoms require a combinatorial
proliferation of new arrows. This reduces
efficiency of inference.

Ache
X-ray
Catch
Cavity
181
Learning causal graphical models

Strength: how strong is a relationship?
Structure: does a relationship exist?

B
B
182
Causal structure vs. causal strength

Strength: how strong is a relationship?

B
B
183
Causal structure vs. causal strength

Strength: how strong is a relationship?
Requires defining nature of relationship

B
B
184
Parameterization

Structures: h1, h0
Parameterization:
[Figure: graphs over C, B, E for h1 and h0]
[Table: P(E = 1 | C, B) under h1 and under h0, with rows (C, B) = (0, 0), (1, 0), (0, 1), (1, 1)]
187
Parameter estimation

Maximum likelihood estimation
maximize ∏i P(bi, ci, ei | w0, w1)
Bayesian methods: as in the "Comparing infinitely
many hypotheses" example

188
Causal structure vs. causal strength

Structure: does a relationship exist?

B
B
189
Approaches to structure learning

Constraint-based
dependency from statistical tests (e.g. χ²)
deduce structure from dependencies

C
B
B
E
(Pearl, 2000; Spirtes et al., 1993)
192
Approaches to structure learning

Constraint-based
dependency from statistical tests (e.g. χ²)
deduce structure from dependencies

C
B
B
E
(Pearl, 2000; Spirtes et al., 1993)
Attempts to reduce inductive problem to deductive
problem
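
The first step of the constraint-based approach, deciding whether two variables are dependent, can be sketched with a Pearson χ² test on a 2x2 table of counts. The function names and the example counts are hypothetical; the threshold 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
CRITICAL_05_DF1 = 3.841  # chi-square critical value, df = 1, alpha = 0.05


def chi_square_2x2(table) -> float:
    """Pearson chi-square statistic for a 2x2 table of counts.

    table[i][j] is the count for row category i and column category j.
    Assumes every row and column total is positive.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def looks_dependent(table) -> bool:
    """Treat the variables as dependent if the test rejects independence."""
    return chi_square_2x2(table) > CRITICAL_05_DF1


print(chi_square_2x2([[6, 2], [2, 6]]), looks_dependent([[6, 2], [2, 6]]))
print(chi_square_2x2([[4, 4], [4, 4]]), looks_dependent([[4, 4], [4, 4]]))
```

The yes/no outcome of each test is then treated as a fixed constraint from which the graph is deduced, which is the sense in which an inductive problem is reduced to a deductive one.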
193
Approaches to structure learning

Constraint-based
dependency from statistical tests (e.g. χ²)
deduce structure from dependencies

C
B
B
E
(Pearl, 2000; Spirtes et al., 1993)

Bayesian
compute posterior
probability of structures,
given observed data

C
B
C
B
E
E
P(S1 | data)
P(S0 | data)
P(S | data) ∝ P(data | S) P(S)
(Heckerman, 1998; Friedman, 1999)
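
A rough sketch of the Bayesian alternative, comparing two structures by their posterior probability. Everything specific here is an assumption made for illustration: S1 has both B → E and C → E, S0 has only B → E, the background cause B is always present, E follows a noisy-OR with strengths w0 (for B) and w1 (for C), the strengths have uniform priors, and the integrals are approximated on a grid. The function names are hypothetical.

```python
def likelihood(w0: float, w1: float, counts) -> float:
    """P(data | w0, w1) under a noisy-OR parameterization.

    counts = (E with C, no E with C, E without C, no E without C).
    """
    e_c, no_e_c, e_no_c, no_e_no_c = counts
    p_with_c = w0 + w1 - w0 * w1  # P(E = 1 | C = 1)
    p_without_c = w0              # P(E = 1 | C = 0)
    return (p_with_c ** e_c * (1 - p_with_c) ** no_e_c
            * p_without_c ** e_no_c * (1 - p_without_c) ** no_e_no_c)


def marginal_likelihoods(counts, grid: int = 200):
    """Approximate P(data | S1) and P(data | S0) with a midpoint rule."""
    points = [(i + 0.5) / grid for i in range(grid)]
    m1 = sum(likelihood(w0, w1, counts)
             for w0 in points for w1 in points) / grid ** 2
    m0 = sum(likelihood(w0, 0.0, counts) for w0 in points) / grid
    return m1, m0


def posterior_s1(counts, prior_s1: float = 0.5) -> float:
    """P(S1 | data) from P(S | data) ∝ P(data | S) P(S)."""
    m1, m0 = marginal_likelihoods(counts)
    return m1 * prior_s1 / (m1 * prior_s1 + m0 * (1 - prior_s1))


print(posterior_s1((8, 0, 0, 8)))  # effect occurs only when C is present
print(posterior_s1((4, 4, 4, 4)))  # effect unrelated to C
```

Because the extra parameter in S1 is integrated out rather than fitted, the simpler structure S0 is favored when the data show no contingency, without any explicit complexity penalty.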
194
Causal graphical models

Extend graphical models to deal with
interventions as well as observations
Respecting the direction of causality results in
efficient representation and inference
Two steps in learning causal models
parameter estimation
structure learning

195
Bayes nets and beyond...

What are Bayes nets?
graphical models
causal graphical models
An example: elemental causal induction
Beyond Bayes nets
other knowledge in causal induction
formalizing causal theories

196
Elemental causal induction
            C present   C absent
E present       a           c
E absent        b           d
To what extent does C cause E?
197
Causal structure vs. causal strength

Strength: how strong is a relationship?
Structure: does a relationship exist?

B
B
198
Causal strength

Assume structure
Leading models (ΔP and causal power) are maximum
likelihood estimates of the strength parameter
w1, under different parameterizations for
P(E | B, C):
linear → ΔP, Noisy-OR → causal power
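
Both strength measures can be computed directly from contingency counts. The sketch below uses the standard definitions, ΔP = P(E | C) - P(E | not C) and causal power = ΔP / (1 - P(E | not C)); the function and argument names are hypothetical, and the example counts are invented.

```python
def delta_p(e_with_c: int, no_e_with_c: int,
            e_without_c: int, no_e_without_c: int) -> float:
    """Delta-P: P(E | C present) minus P(E | C absent)."""
    p_e_given_c = e_with_c / (e_with_c + no_e_with_c)
    p_e_given_not_c = e_without_c / (e_without_c + no_e_without_c)
    return p_e_given_c - p_e_given_not_c


def causal_power(e_with_c: int, no_e_with_c: int,
                 e_without_c: int, no_e_without_c: int) -> float:
    """Causal power for a generative cause: Delta-P rescaled by the
    headroom left when the effect already occurs without C.

    Undefined (division by zero) when the effect always occurs without C.
    """
    p_e_given_not_c = e_without_c / (e_without_c + no_e_without_c)
    dp = delta_p(e_with_c, no_e_with_c, e_without_c, no_e_without_c)
    return dp / (1 - p_e_given_not_c)


# Invented counts: E occurs in 6 of 8 cases with C and 2 of 8 without.
print(delta_p(6, 2, 2, 6), causal_power(6, 2, 2, 6))
```

With these counts ΔP is 0.5 while causal power is about 0.67: the two measures agree only when the effect never occurs in the absence of C.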

B
199
Causal structure