Bayesian models of inductive learning


1
Bayesian models of inductive learning
  • Josh Tenenbaum & Tom Griffiths
  • MIT
  • Computational Cognitive Science Group, Department
    of Brain and Cognitive Sciences
  • Computer Science and AI Lab (CSAIL)

2
What to expect
  • What you'll get out of this tutorial:
  • Our view of what Bayesian models have to offer
    cognitive science.
  • In-depth examples of basic and advanced models:
    how the math works and what it buys you.
  • Some comparison to other approaches.
  • Opportunities to ask questions.
  • What you won't get:
  • Detailed, hands-on how-to.
  • Where you can learn more:
  • http://bayesiancognition.com

3
Outline
  • Morning
  • Introduction (Josh)
  • Basic case study 1: Flipping coins (Tom)
  • Basic case study 2: Rules and similarity (Josh)
  • Afternoon
  • Advanced case study 1: Causal induction (Tom)
  • Advanced case study 2: Property induction (Josh)
  • Quick tour of more advanced topics (Tom)

5
Bayesian models in cognitive science
  • Vision
  • Motor control
  • Memory
  • Language
  • Inductive learning and reasoning.

6
Everyday inductive leaps
  • Learning concepts and words from examples

[Figure: three examples, each labeled "horse"]
7
Learning concepts and words
  • Can you pick out the tufas?

8
Inductive reasoning
Input
(premises)
(conclusion)
Task: Judge how likely the conclusion is to be
true, given that the premises are true.
9
Inferring causal relations
Input
         Took vitamin B23   Headache
Day 1    yes                no
Day 2    yes                yes
Day 3    no                 yes
Day 4    yes                no
. . .    . . .              . . .
Does vitamin B23 cause headaches?
Task: Judge the probability of a causal link
given several joint observations.
10
Everyday inductive leaps
  • How can we learn so much about . . .
  • Properties of natural kinds
  • Meanings of words
  • Future outcomes of a dynamic process
  • Hidden causal properties of an object
  • Causes of a person's action (beliefs, goals)
  • Causal laws governing a domain
  • . . . from such limited data?

11
The Challenge
  • How do we generalize successfully from very
    limited data?
  • Just one or a few examples
  • Often only positive examples
  • Philosophy:
  • Induction is a problem, a riddle, a
    paradox, a scandal, or a myth.
  • Machine learning and statistics:
  • Focus on generalization from many examples, both
    positive and negative.

12
Rational statistical inference (Bayes, Laplace)
$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$
Denominator: sum over space of hypotheses
13
Bayesian models of inductive learning: some
recent history
  • Shepard (1987)
  • Analysis of one-shot stimulus generalization, to
    explain the universal exponential law.
  • Anderson (1990)
  • Models of categorization and causal induction.
  • Oaksford & Chater (1994)
  • Model of conditional reasoning (Wason selection
    task).
  • Heit (1998)
  • Framework for category-based inductive reasoning.

14
Theory-Based Bayesian Models
  • Rational statistical inference (Bayes)
  • Learners' domain theories generate their
    hypothesis space H and prior p(h).
  • Well-matched to structure of the natural world.
  • Learnable from limited data.
  • Computationally tractable inference.

15
What is a theory?
  • Working definition
  • An ontology and a system of abstract principles
    that generates a hypothesis space of candidate
    world structures along with their relative
    probabilities.
  • Analogy to grammar in language.
  • Example: Newton's laws

16
Structure and statistics
  • A framework for understanding how structured
    knowledge and statistical inference interact.
  • How structured knowledge guides statistical
    inference, and is itself acquired through
    higher-order statistical learning.
  • How simplicity trades off with fit to the data in
    evaluating structural hypotheses.
  • How increasingly complex structures may grow as
    required by new data, rather than being
    pre-specified in advance.

17
Structure and statistics
  • A framework for understanding how structured
    knowledge and statistical inference interact.
  • How structured knowledge guides statistical
    inference, and is itself acquired through
    higher-order statistical learning.
  • Hierarchical Bayes.
  • How simplicity trades off with fit to the data in
    evaluating structural hypotheses.
  • Bayesian Occam's Razor.
  • How increasingly complex structures may grow as
    required by new data, rather than being
    pre-specified in advance.
  • Non-parametric Bayes.

18
Alternative approaches to inductive generalization
  • Associative learning
  • Connectionist networks
  • Similarity to examples
  • Toolkit of simple heuristics
  • Constraint satisfaction
  • Analogical mapping

19
Marr's Three Levels of Analysis
  • Computation
  • What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?
  • Representation and algorithm
  • Cognitive psychology
  • Implementation
  • Neurobiology

20
Why Bayes?
  • A framework for explaining cognition.
  • How people can learn so much from such limited
    data.
  • Why process-level models work the way that they
    do.
  • Strong quantitative models with minimal ad hoc
    assumptions.
  • A framework for understanding how structured
    knowledge and statistical inference interact.
  • How structured knowledge guides statistical
    inference, and is itself acquired through
    higher-order statistical learning.
  • How simplicity trades off with fit to the data in
    evaluating structural hypotheses (Occam's razor).
  • How increasingly complex structures may grow as
    required by new data, rather than being
    pre-specified in advance.

22
Coin flipping
23
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
24
Bayes' rule
For data D and a hypothesis H, we have
$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$
  • Posterior probability: $P(H \mid D)$
  • Prior probability: $P(H)$
  • Likelihood: $P(D \mid H)$

25
The origin of Bayes' rule
  • A simple consequence of using probability to
    represent degrees of belief
  • For any two random variables A and B:
    $p(A, B) = p(A \mid B)\, p(B) = p(B \mid A)\, p(A)$

26
Why represent degrees of belief with
probabilities?
  • Good statistics
  • consistency, and worst-case error bounds.
  • Cox Axioms
  • necessary to cohere with common sense
  • Dutch Book Survival of the Fittest
  • if your beliefs do not accord with the laws of
    probability, then you can always be out-gambled
    by someone whose beliefs do so accord.
  • Provides a theory of learning
  • a common currency for combining prior knowledge
    and the lessons of experience.

28
Hypotheses in Bayesian inference
  • Hypotheses H refer to processes that could have
    generated the data D
  • Bayesian inference provides a distribution over
    these hypotheses, given D
  • $P(D \mid H)$ is the probability of D being generated by
    the process identified by H
  • Hypotheses H are mutually exclusive: only one
    process could have generated D

29
Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
  • Fair coin, P(H) = 0.5
  • Coin with P(H) = p
  • Markov model
  • Hidden Markov model
  • ...

31
Representing generative models
  • Graphical model notation
  • Pearl (1988), Jordan (1998)
  • Variables are nodes, edges indicate dependency
  • Directed edges show causal process of data
    generation

32
Models with latent structure
  • Not all nodes in a graphical model need to be
    observed
  • Some variables reflect latent structure, used in
    generating D but unobserved

33
Coin flipping
  • Comparing two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Comparing simple and complex hypotheses
  • P(H) = 0.5 vs. P(H) = p
  • Comparing infinitely many hypotheses
  • P(H) = p
  • Psychology: Representativeness

35
Comparing two simple hypotheses
  • Contrast simple hypotheses:
  • $H_1$: fair coin, P(H) = 0.5
  • $H_2$: always heads, P(H) = 1.0
  • Bayes' rule
  • With two hypotheses, use odds form

36
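Bayes' rule in odds form
  • $\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$
  • $D$: data
  • $H_1$, $H_2$: models
  • $P(H_1 \mid D)$: posterior probability that $H_1$ generated the
    data
  • $P(D \mid H_1)$: likelihood of the data under model $H_1$
  • $P(H_1)$: prior probability that $H_1$ generated the data
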
38
Comparing two simple hypotheses
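  • $\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$
  • $D$ = HHTHT
  • $H_1$, $H_2$: fair coin, always heads
  • $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
  • $P(D \mid H_2) = 0$, $P(H_2) = 1/1000$
  • $P(H_1 \mid D) / P(H_2 \mid D) = \infty$
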
39
Comparing two simple hypotheses
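  • $\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$
  • $D$ = HHHHH
  • $H_1$, $H_2$: fair coin, always heads
  • $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
  • $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
  • $P(H_1 \mid D) / P(H_2 \mid D) \approx 30$
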
40
Comparing two simple hypotheses
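  • $\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$
  • $D$ = HHHHHHHHHH
  • $H_1$, $H_2$: fair coin, always heads
  • $P(D \mid H_1) = 1/2^{10}$, $P(H_1) = 999/1000$
  • $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
  • $P(H_1 \mid D) / P(H_2 \mid D) \approx 1$
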
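The three odds calculations above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `posterior_odds` is made up for illustration, and the prior 999/1000 for the fair coin is the value used on the slides.

```python
from fractions import Fraction

def posterior_odds(data, prior_h1=Fraction(999, 1000)):
    """Posterior odds P(H1|D) / P(H2|D) for H1 = fair coin, H2 = always heads.

    Returns None when the odds are infinite (H2 gives the data probability 0).
    """
    prior_h2 = 1 - prior_h1
    like_h1 = Fraction(1, 2) ** len(data)         # every sequence is equally likely
    like_h2 = Fraction(int(set(data) <= {"H"}))   # 1 if all heads, otherwise 0
    if like_h2 == 0:
        return None
    return (like_h1 / like_h2) * (prior_h1 / prior_h2)

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    odds = posterior_odds(d)
    print(d, "infinite" if odds is None else float(odds))
```

Five heads in a row give odds of 999/32 (about 31) in favor of the fair coin; ten heads give 999/1024, just under even odds.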
41
Comparing two simple hypotheses
  • Bayes' rule tells us how to combine prior beliefs
    with new data
  • top-down and bottom-up influences
  • As a model of human inference
  • predicts conclusions drawn from data
  • identifies point at which prior beliefs are
    overwhelmed by new experiences
  • But more complex cases?

43
Comparing simple and complex hypotheses
  • Which provides a better account of the data: the
    simple hypothesis of a fair coin, or the complex
    hypothesis that P(H) = p?

44
Comparing simple and complex hypotheses
  • P(H) = p is more complex than P(H) = 0.5 in two
    ways:
  • P(H) = 0.5 is a special case of P(H) = p
  • for any observed sequence X, we can choose p such
    that X is at least as probable as if P(H) = 0.5

45
Comparing simple and complex hypotheses
Probability
46
Comparing simple and complex hypotheses
Probability
HHHHH: p = 1.0
47
Comparing simple and complex hypotheses
Probability
HHTHT: p = 0.6
48
Comparing simple and complex hypotheses
  • P(H) = p is more complex than P(H) = 0.5 in two
    ways:
  • P(H) = 0.5 is a special case of P(H) = p
  • for any observed sequence X, we can choose p such
    that X is at least as probable as if P(H) = 0.5
  • How can we deal with this?
  • frequentist: hypothesis testing
  • information theorist: minimum description length
  • Bayesian: just use probability theory!

49
Comparing simple and complex hypotheses
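  • $\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$
  • Computing $P(D \mid H_1)$ is easy:
  • $P(D \mid H_1) = 1/2^N$
  • Compute $P(D \mid H_2)$ by averaging over p:
  • $P(D \mid H_2) = \int_0^1 P(D \mid p)\, P(p)\, dp$
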
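As a rough illustration of averaging over p, the sketch below assumes a uniform prior on p. The slides do not state which prior is used, so that choice is an assumption. Under it the integral has the closed form NH! NT! / (NH + NT + 1)!, where NH and NT are the numbers of heads and tails. The function names are illustrative.

```python
from math import factorial

def likelihood_fair(n_heads, n_tails):
    """P(D | H1): a fair coin gives every length-N sequence probability 1/2^N."""
    return 0.5 ** (n_heads + n_tails)

def likelihood_any_p(n_heads, n_tails):
    """P(D | H2), ASSUMING a uniform prior on p (not stated on the slides).

    Integral of p^NH (1-p)^NT over [0, 1] = NH! NT! / (NH + NT + 1)!
    """
    return (factorial(n_heads) * factorial(n_tails)
            / factorial(n_heads + n_tails + 1))

for seq in ["HHTHT", "HHHHH"]:
    nh, nt = seq.count("H"), seq.count("T")
    ratio = likelihood_fair(nh, nt) / likelihood_any_p(nh, nt)
    print(seq, "P(D|H1) / P(D|H2) =", ratio)
```

For HHTHT the likelihood ratio favors the fair coin (ratio above 1), even though some value of p fits that sequence better than 0.5. For HHHHH the ratio drops below 1.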
50
Comparing simple and complex hypotheses
Probability
Distribution is an average over all values of p
52
Comparing simple and complex hypotheses
  • Simple and complex hypotheses can be compared
    directly using Bayes' rule
  • requires summing over latent variables
  • Complex hypotheses are penalized for their
    greater flexibility: Bayesian Occam's razor
  • This principle is used in model selection methods
    in psychology (e.g. Myung & Pitt, 1997)

54
Comparing infinitely many hypotheses
  • Assume data are generated from a model
  • What is the value of p?
  • each value of p is a hypothesis H
  • requires inference over infinitely many hypotheses

55
Comparing infinitely many hypotheses
  • Flip a coin 10 times and see 5 heads, 5 tails.
  • P(H) on next flip? 50%
  • Why? 50% = 5 / (5+5) = 5/10.
  • Future will be like the past.
  • Suppose we had seen 4 heads and 6 tails.
  • P(H) on next flip? Closer to 50% than to 40%.
  • Why? Prior knowledge.

56
Integrating prior knowledge and data
  • Posterior distribution $P(p \mid D)$ is a probability
    density over p = P(H)
  • Need to work out likelihood $P(D \mid p)$ and specify
    prior distribution $P(p)$

$P(p \mid D) \propto P(D \mid p)\, P(p)$
57
Likelihood and prior
  • Likelihood:
  • $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
  • $N_H$: number of heads
  • $N_T$: number of tails
  • Prior:
  • $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$

58
A simple method of specifying priors
  • Imagine some fictitious trials, reflecting a set
    of previous experiences
  • strategy often used with neural networks
  • e.g., F = {1000 heads, 1000 tails}: strong
    expectation that any new coin will be fair
  • In fact, this is a sensible statistical idea...

59
Likelihood and prior
  • Likelihood:
  • $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
  • $N_H$: number of heads
  • $N_T$: number of tails
  • Prior:
  • $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$, i.e. Beta($F_H$, $F_T$)
  • $F_H$: fictitious observations of heads
  • $F_T$: fictitious observations of tails

60
Conjugate priors
  • Exist for many standard distributions
  • formula for exponential family conjugacy
  • Define prior in terms of fictitious observations
  • Beta is conjugate to Bernoulli (coin-flipping)

[Figure: Beta priors for $F_H = F_T = 1$, $F_H = F_T = 3$, and $F_H = F_T = 1000$]
62
Comparing infinitely many hypotheses
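$P(p \mid D) \propto P(D \mid p)\, P(p) \propto p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$
  • Posterior is Beta($N_H + F_H$, $N_T + F_T$)
  • same form as conjugate prior
  • Posterior mean: $(N_H + F_H) / (N_H + F_H + N_T + F_T)$
  • Posterior predictive distribution: P(H) on next flip
    equals the posterior mean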

63
Some examples
  • e.g., F = {1000 heads, 1000 tails}: strong
    expectation that any new coin will be fair
  • After seeing 4 heads, 6 tails, P(H) on next flip
    = 1004 / (1004+1006) = 49.95%
  • e.g., F = {3 heads, 3 tails}: weak expectation
    that any new coin will be fair
  • After seeing 4 heads, 6 tails, P(H) on next flip
    = 7 / (7+9) = 43.75%
  • Prior knowledge too weak
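The arithmetic in these examples follows one rule: add the fictitious counts to the observed counts and take the fraction of heads. A minimal sketch, with `predict_heads` as an illustrative name rather than anything from the slides:

```python
def predict_heads(n_heads, n_tails, f_heads, f_tails):
    """P(H on next flip) under a Beta(f_heads, f_tails) prior.

    This is the posterior mean of Beta(n_heads + f_heads, n_tails + f_tails).
    """
    return (n_heads + f_heads) / (n_heads + n_tails + f_heads + f_tails)

# Observed 4 heads and 6 tails, with a strong and a weak "fair coin" prior.
strong = predict_heads(4, 6, 1000, 1000)   # 1004 / 2010
weak = predict_heads(4, 6, 3, 3)           # 7 / 16
print(f"strong prior: {strong:.4f}, weak prior: {weak:.4f}")
```

The same function covers any other fictitious-count prior, for example `predict_heads(2, 0, 4, 3)` for a prior slightly biased towards heads.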

64
But: flipping thumbtacks
  • e.g., F = {4 heads, 3 tails}: weak expectation
    that tacks are slightly biased towards heads
  • After seeing 2 heads, 0 tails, P(H) on next flip
    = 6 / (6+3) = 67%
  • Some prior knowledge is always necessary to avoid
    jumping to hasty conclusions...
  • Suppose F = { }: After seeing 2 heads, 0 tails,
    P(H) on next flip = 2 / (2+0) = 100%

65
Origin of prior knowledge
  • Tempting answer: prior experience
  • Suppose you have previously seen 2000 coin flips:
    1000 heads, 1000 tails
  • By assuming all coins (and flips) are alike,
    these observations of other coins are as good as
    observations of the present coin

66
Problems with simple empiricism
  • Haven't really seen 2000 coin flips, or any flips
    of a thumbtack
  • Prior knowledge is stronger than raw experience
    justifies
  • Haven't seen exactly equal number of heads and
    tails
  • Prior knowledge is smoother than raw experience
    justifies
  • Should be a difference between observing 2000
    flips of a single coin versus observing 10 flips
    each for 200 coins, or 1 flip each for 2000 coins
  • Prior knowledge is more structured than raw
    experience

67
A simple theory
  • Coins are manufactured by a standardized
    procedure that is effective but not perfect.
  • Justifies generalizing from previous coins to the
    present coin.
  • Justifies smoother and stronger prior than raw
    experience alone.
  • Explains why seeing 10 flips each for 200 coins
    is more valuable than seeing 2000 flips of one
    coin.
  • Tacks are asymmetric, and manufactured to less
    exacting standards.

68
Limitations
  • Can all domain knowledge be represented so
    simply, in terms of an equivalent number of
    fictional observations?
  • Suppose you flip a coin 25 times and get all
    heads. Something funny is going on...
  • But with F = {1000 heads, 1000 tails}, P(H) on
    next flip = 1025 / (1025+1000) = 50.6%.
  • Looks like nothing unusual

69
Hierarchical priors
  • Higher-order hypothesis: is this coin fair or
    unfair?
  • Example probabilities:
  • P(fair) = 0.99
  • $P(p \mid \text{fair})$ is Beta(1000,1000)
  • $P(p \mid \text{unfair})$ is Beta(1,1)
  • 25 heads in a row propagates up, affecting p and
    then $P(\text{fair} \mid D)$

[Graphical model: fair → p → d1, d2, d3, d4]
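A small sketch of how 25 heads in a row "propagates up", using the example probabilities on this slide. It relies on the standard Beta-Bernoulli marginal likelihood, P(D | Beta(a, b)) = B(a + NH, b + NT) / B(a, b), where B is the Beta function. The function names are illustrative and not from the slides.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(n_heads, n_tails, a, b):
    """log P(D | p ~ Beta(a, b)) for one specific sequence of flips."""
    return log_beta(a + n_heads, b + n_tails) - log_beta(a, b)

def posterior_fair(n_heads, n_tails, prior_fair=0.99):
    """P(fair | D), with Beta(1000, 1000) for fair and Beta(1, 1) for unfair."""
    joint_fair = prior_fair * exp(log_marginal(n_heads, n_tails, 1000, 1000))
    joint_unfair = (1 - prior_fair) * exp(log_marginal(n_heads, n_tails, 1, 1))
    return joint_fair / (joint_fair + joint_unfair)

print("P(fair | 25 heads)         =", posterior_fair(25, 0))
print("P(fair | 5 heads, 5 tails) =", posterior_fair(5, 5))
```

With 25 heads the posterior probability that the coin is fair falls well below 1%, whereas the flat Beta(1000,1000) prior on its own would still predict heads at about 50.6%.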
70
More hierarchical priors
  • Latent structure can capture coin variability
  • 10 flips from 200 coins is better than 2000 flips
    from a single coin: allows estimation of $F_H$, $F_T$

[Graphical model: $F_H, F_T$ → p for each of Coin 1, Coin 2, ..., Coin 200, with p ~ Beta($F_H$, $F_T$); each p → d1, d2, d3, d4]
71
Yet more hierarchical priors
  • Discrete beliefs (e.g. symmetry) can influence
    estimation of continuous properties (e.g. $F_H$, $F_T$)

[Graphical model: physical knowledge → $F_H, F_T$ → p (one per coin) → d1, d2, d3, d4]
72
Comparing infinitely many hypotheses
  • Apply Bayes' rule to obtain posterior probability
    density
  • Requires prior over all hypotheses
  • computation simplified by conjugate priors
  • richer structure with hierarchical priors
  • Hierarchical priors indicate how simple theories
    can inform statistical inferences
  • one step towards structure and statistics

74
Psychology: Representativeness
  • Which sequence is more likely from a fair coin?

HHTHT vs. HHHHH
HHTHT is more representative of a fair coin (Kahneman &
Tversky, 1972)
75
What might representativeness mean?
Evidence for a random generating process
76
A constrained hypothesis space
  • Four hypotheses:
  • h1: fair coin (HHTHTTTH)
  • h2: always alternates (HTHTHTHT)
  • h3: mostly heads (HHTHTHHH)
  • h4: always heads (HHHHHHHH)

77
Representativeness judgments
78
Results
  • Good account of representativeness data, with
    three pseudo-free parameters, ρ = 0.91
  • "always alternates" means 99% of the time
  • "mostly heads" means P(H) = 0.85
  • "always heads" means P(H) = 0.99
  • With scaling parameter, r = 0.95

(Tenenbaum & Griffiths, 2001)
79
The role of theories
  • The fact that HHTHT looks representative of a
    fair coin and HHHHH does not reflects our
    implicit theories of how the world works.
  • Easy to imagine how a trick all-heads coin could
    work: high prior probability.
  • Hard to imagine how a trick "HHTHT" coin could
    work: low prior probability.

80
Summary
  • Three kinds of Bayesian inference
  • comparing two simple hypotheses
  • comparing simple and complex hypotheses
  • comparing an infinite number of hypotheses
  • Critical notions
  • generative models, graphical models
  • Bayesian Occam's razor
  • priors: conjugate, hierarchical (theories)

82
Rules and similarity
83
Structure versus statistics
Statistics Similarity Typicality
Rules Logic Symbols
84
A better metaphor
85
A better metaphor
86
Structure and statistics
Statistics Similarity Typicality
Rules Logic Symbols
87
Structure and statistics
  • Basic case study 1: Flipping coins
  • Learning and reasoning with structured
    statistical models.
  • Basic case study 2: Rules and similarity
  • Statistical learning with structured
    representations.

88
The number game
  • Program input: number between 1 and 100
  • Program output: "yes" or "no"

89
The number game
  • Learning task:
  • Observe one or more positive ("yes") examples.
  • Judge whether other numbers are "yes" or "no".

90
The number game
Examples of "yes" numbers, with generalization judgments (N = 20)
  • 60: diffuse similarity
91
The number game
Examples of "yes" numbers, with generalization judgments (N = 20)
  • 60: diffuse similarity
  • 60 80 10 30: rule, "multiples of 10"
92
The number game
Examples of "yes" numbers, with generalization judgments (N = 20)
  • 60: diffuse similarity
  • 60 80 10 30: rule, "multiples of 10"
  • 60 52 57 55: focused similarity, numbers near 50-60
93
The number game
Examples of "yes" numbers, with generalization judgments (N = 20)
  • 16: diffuse similarity
  • 16 8 2 64: rule, "powers of 2"
  • 16 23 19 20: focused similarity, numbers near 20
94
The number game
  • Main phenomena to explain
  • Generalization can appear either similarity-based
    (graded) or rule-based (all-or-none).
  • Learning from just a few positive examples.

95
Rule/similarity hybrid models
  • Category learning
  • Nosofsky, Palmeri et al.: RULEX
  • Erickson & Kruschke: ATRIUM

96
Divisions into rule and similarity subsystems
  • Category learning
  • Nosofsky, Palmeri et al.: RULEX
  • Erickson & Kruschke: ATRIUM
  • Language processing
  • Pinker, Marcus et al.: Past tense morphology
  • Reasoning
  • Sloman
  • Rips
  • Nisbett, Smith et al.

97
Rule/similarity hybrid models
  • Why two modules?
  • Why do these modules work the way that they do,
    and interact as they do?
  • How do people infer a rule or similarity metric
    from just a few positive examples?

98
Bayesian model
  • H: Hypothesis space of possible concepts
  • h1 = {2, 4, 6, 8, 10, 12, ..., 96, 98, 100}
    ("even numbers")
  • h2 = {10, 20, 30, 40, ..., 90, 100} ("multiples
    of 10")
  • h3 = {2, 4, 8, 16, 32, 64} ("powers of 2")
  • h4 = {50, 51, 52, ..., 59, 60} ("numbers between
    50 and 60")
  • . . .
  • Representational interpretations for H
  • Candidate rules
  • Features for similarity
  • Consequential subsets (Shepard, 1987)

99
  • Inferring hypotheses from similarity judgment
  • Additive clustering (Shepard & Arabie, 1977):
    $s_{ij} = \sum_k w_k f_{ik} f_{jk}$
  • $s_{ij}$: similarity of stimuli i, j
  • $w_k$: weight of cluster k
  • $f_{ik}$: membership of stimulus i in cluster k
    (1 if stimulus i in cluster k, 0 otherwise)
  • Equivalent to similarity as a weighted sum of
    common features (Tversky, 1977).
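To make the additive clustering rule concrete: the similarity of two stimuli is the summed weight of every cluster that contains both. The sketch below uses two clusters from the digits example, with weights .331 ("multiples of three") and .216 ("odd numbers"). The cluster memberships are read off the interpretations and are an assumption, since the membership marks did not survive in this transcript.

```python
def similarity(i, j, clusters):
    """s_ij = sum over clusters k of w_k * f_ik * f_jk.

    f_ik * f_jk is 1 only when both stimuli belong to cluster k.
    """
    return sum(w for w, members in clusters if i in members and j in members)

# (weight, members); memberships are ASSUMED from the cluster interpretations.
clusters = [
    (0.331, {3, 6, 9}),        # "multiples of three"
    (0.216, {1, 3, 5, 7, 9}),  # "odd numbers"
]

print(similarity(3, 9, clusters))  # shares both clusters
print(similarity(3, 6, clusters))  # shares only "multiples of three"
print(similarity(2, 4, clusters))  # shares neither of these two clusters
```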

100
  • Additive clustering for the integers 0-9

Rank  Weight  Interpretation
1     .444    powers of two
2     .345    small numbers
3     .331    multiples of three
4     .291    large numbers
5     .255    middle numbers
6     .216    odd numbers
7     .214    smallish numbers
8     .172    largish numbers
(The "Stimuli in cluster" column, marking which of 0 1 2 3 4 5 6 7 8 9 belong to each cluster, is not preserved in this transcript.)
101
Three hypothesis subspaces for number concepts
  • Mathematical properties (24 hypotheses):
  • Odd, even, square, cube, prime numbers
  • Multiples of small integers
  • Powers of small integers
  • Raw magnitude (5050 hypotheses):
  • All intervals of integers with endpoints between
    1 and 100.
  • Approximate magnitude (10 hypotheses):
  • Decades (1-10, 10-20, 20-30, ...)

102
Hypothesis spaces and theories
  • Why a hypothesis space is like a domain theory
  • Represents one particular way of classifying
    entities in a domain.
  • Not just an arbitrary collection of hypotheses,
    but a principled system.
  • What's missing?
  • Explicit representation of the principles.
  • Hypothesis spaces (and priors) are generated by
    theories. Some analogies:
  • Grammars generate languages (and priors over
    structural descriptions)
  • Hierarchical Bayesian modeling

103
Bayesian model
  • H: Hypothesis space of possible concepts
  • Mathematical properties: even, odd, square,
    prime, . . . .
  • Approximate magnitude: 1-10, 10-20, 20-30,
    . . . .
  • Raw magnitude: all intervals between 1 and 100.
  • $X = \{x_1, \ldots, x_n\}$: n examples of a concept C.
  • Evaluate hypotheses given data:
    $p(h \mid X) = \frac{p(X \mid h)\, p(h)}{\sum_{h' \in H} p(X \mid h')\, p(h')}$
  • $p(h)$: prior; domain knowledge, pre-existing
    biases
  • $p(X \mid h)$: likelihood; statistical information in
    examples.
  • $p(h \mid X)$: posterior; degree of belief that h is
    the true extension of C.

105
  • Likelihood $p(X \mid h)$
  • Size principle: Smaller hypotheses receive
    greater likelihood, and exponentially more so as
    n increases.
  • Follows from assumption of randomly sampled
    examples.
  • Captures the intuition of a representative
    sample.
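One common way to write the size principle, assuming each example is drawn uniformly at random (with replacement) from the concept, is p(X | h) = 1 / |h|^n when every example falls inside h, and 0 otherwise. The sketch below uses that form. The exact formula is not shown in this transcript, so treat it as an assumption. The two hypotheses are "even numbers" and "multiples of 10" from the hypothesis space above.

```python
def likelihood(examples, hypothesis):
    """p(X | h) = 1 / |h|^n if every example is in h, else 0 (size principle)."""
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

even = set(range(2, 101, 2))       # 50 numbers
mult10 = set(range(10, 101, 10))   # 10 numbers

for X in ([60], [60, 80, 10, 30]):
    ratio = likelihood(X, mult10) / likelihood(X, even)
    print(X, "multiples of 10 vs. even:", ratio)
```

With one example the smaller hypothesis is favored 5 to 1; with four examples the ratio grows to 5^4 = 625, which is the exponential effect of n.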

106
Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
107
Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data slightly more of a coincidence under h1
108
Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data much more of a coincidence under h1
109
Bayesian Occam's Razor
Law of Conservation of Belief
For any model M, $\sum_{\text{all } d} p(D = d \mid M) = 1$
[Figure: $p(D = d \mid M)$ plotted over all possible data sets d, for models M1 and M2]
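The conservation law can be checked by enumeration for the coin models from the first case study. In this sketch the flexible model averages over p with a uniform prior, which is an assumption made for illustration. Both models spread a total probability of 1 over the 32 possible five-flip sequences, so the flexible model can only give more to HHHHH by giving less to sequences like HHTHT.

```python
from itertools import product
from math import factorial

def p_fair(seq):
    """p(d | M1): fair coin, every sequence equally likely."""
    return 0.5 ** len(seq)

def p_flexible(seq):
    """p(d | M2): coin with unknown p, ASSUMING a uniform prior on p."""
    nh = seq.count("H")
    nt = len(seq) - nh
    return factorial(nh) * factorial(nt) / factorial(nh + nt + 1)

datasets = ["".join(s) for s in product("HT", repeat=5)]
total_fair = sum(p_fair(d) for d in datasets)
total_flexible = sum(p_flexible(d) for d in datasets)
print(len(datasets), total_fair, total_flexible)
```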
110
Comparing simple and complex hypotheses
Probability
Distribution is an average over all values of p
111
  • Prior p(h)
  • Choice of hypothesis space embodies a strong
    prior: effectively, p(h) = 0 for many logically
    possible but conceptually unnatural hypotheses.
  • Prevents overfitting by highly specific but
    unnatural hypotheses, e.g. "multiples of 10
    except 50 and 70".

112
  • Prior p(h)
  • Choice of hypothesis space embodies a strong
    prior: effectively, p(h) = 0 for many logically
    possible but conceptually unnatural hypotheses.
  • Prevents overfitting by highly specific but
    unnatural hypotheses, e.g. "multiples of 10
    except 50 and 70".
  • p(h) encodes relative weights of alternative
    theories

H: Total hypothesis space
  • H1: Math properties (24): even numbers, powers of
    two, multiples of three, ...
  • H2: Raw magnitude (5050): 10-15, 20-32, 37-54, ...
  • H3: Approx. magnitude (10): 10-20, 20-30, 30-40, ...

113
A more complex approach to priors
  • Start with a base set of regularities R and
    combination operators C.
  • Hypothesis space = closure of R under C.
  • C = {and, or}: H = unions and intersections of
    regularities in R (e.g., "multiples of 10 between
    30 and 70").
  • C = {and-not}: H = regularities in R with
    exceptions (e.g., "multiples of 10 except 50 and
    70").
  • Two qualitatively similar priors:
  • Description length: number of combinations in C
    needed to generate hypothesis from R.
  • Bayesian Occam's Razor, with model classes
    defined by number of combinations: more
    combinations → more hypotheses → lower
    prior

114
  • Posterior
  • X = {60, 80, 10, 30}
  • Why prefer "multiples of 10" over "even numbers"?
    $p(X \mid h)$.
  • Why prefer "multiples of 10" over "multiples of
    10 except 50 and 20"? $p(h)$.
  • Why does a good generalization need both high
    prior and high likelihood? $p(h \mid X) \propto p(X \mid h)\, p(h)$

115
Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
116
Generalizing to new objects
Given $p(h \mid X)$, how do we compute $p(y \in C \mid X)$,
the probability that C applies to some new
stimulus y?
117
Generalizing to new objects
Hypothesis averaging: Compute the probability
that C applies to some new object y by averaging
the predictions of all hypotheses h, weighted by
$p(h \mid X)$:
$p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X)$
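A toy version of hypothesis averaging, restricted to the four example hypotheses listed earlier in the talk. Two things here are assumptions made for illustration and are not the model's actual settings: the uniform prior over the four hypotheses, and the size-principle likelihood 1 / |h|^n. Each hypothesis predicts 1 if y is in its extension and 0 otherwise, so the average reduces to summing the posterior of every hypothesis that contains y.

```python
def posterior(examples, hypotheses, prior):
    """p(h | X) for each named hypothesis, using a size-principle likelihood."""
    scores = {}
    for name, members in hypotheses.items():
        consistent = all(x in members for x in examples)
        like = (1.0 / len(members)) ** len(examples) if consistent else 0.0
        scores[name] = prior[name] * like
    total = sum(scores.values())  # zero if no hypothesis contains all examples
    return {name: s / total for name, s in scores.items()}

def prob_in_concept(y, examples, hypotheses, prior):
    """p(y in C | X): total posterior mass of the hypotheses that contain y."""
    post = posterior(examples, hypotheses, prior)
    return sum(p for name, p in post.items() if y in hypotheses[name])

hypotheses = {
    "even numbers": set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
    "powers of 2": {2, 4, 8, 16, 32, 64},
    "numbers between 50 and 60": set(range(50, 61)),
}
prior = {name: 0.25 for name in hypotheses}  # ASSUMED uniform prior

for X in ([60], [60, 80, 10, 30]):
    print(X, {y: round(prob_in_concept(y, X, hypotheses, prior), 3)
              for y in (20, 22, 55, 87)})
```

After the single example 60, generalization is graded across several hypotheses. After 60, 80, 10, 30, nearly all the posterior mass sits on "multiples of 10", so 20 is accepted and 22 is almost entirely rejected.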
118
Examples: 16
119
Connection to feature-based similarity
  • Additive clustering model of similarity
  • Bayesian hypothesis averaging
  • Equivalent if we identify features $f_k$ with
    hypotheses h, and weights $w_k$ with $p(h \mid X)$.

120
Examples: 16 8 2 64
121
Examples: 16 23 19 20
122
Model fits
Examples of "yes" numbers, with generalization judgments (N = 20)
and Bayesian model predictions (r = 0.96)
  • 60
  • 60 80 10 30
  • 60 52 57 55
123
Model fits
Examples of "yes" numbers, with generalization judgments (N = 20)
and Bayesian model predictions (r = 0.93)
  • 16
  • 16 8 2 64
  • 16 23 19 20
124
Summary of the Bayesian model
  • How do the statistics of the examples interact
    with prior knowledge to guide generalization?
  • Why does generalization appear rule-based or
    similarity-based?

126
Alternative models
  • Neural networks

127
Alternative models
  • Neural networks
  • Hypothesis ranking and elimination

[Figure: hypotheses ranked 1, 2, 3, 4, ... (even, multiple of 10, power of 2, multiple of 3, ...), checked against the examples 60, 80, 10, 30]
128
Alternative models
  • Neural networks
  • Hypothesis ranking and elimination
  • Similarity to exemplars
  • Average similarity

60
60 80 10 30
60 52 57 55
Model (r = 0.80)
Data
129
Alternative models
  • Neural networks
  • Hypothesis ranking and elimination
  • Similarity to exemplars
  • Max similarity

60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
130
Alternative models
  • Neural networks
  • Hypothesis ranking and elimination
  • Similarity to exemplars
  • Average similarity
  • Max similarity
  • Flexible similarity?

Bayes.
131
Alternative models
  • Neural networks
  • Hypothesis ranking and elimination
  • Similarity to exemplars
  • Toolbox of simple heuristics
  • 60: general similarity
  • 60 80 10 30: most specific rule (subset
    principle).
  • 60 52 57 55: similarity in magnitude

Why these heuristics? When to use which
heuristic? Bayes.
132
Summary
  • Generalization from limited data is possible via the
    interaction of structured knowledge and
    statistics.
  • Structured knowledge: space of candidate rules,
    theories generate hypothesis space (cf.
    hierarchical priors)
  • Statistics: Bayesian Occam's razor.
  • Better understand the interactions between
    traditionally opposing concepts:
  • Rules and statistics
  • Rules and similarity
  • Explains why central but notoriously slippery
    processing-level concepts work the way they do:
  • Similarity
  • Representativeness
  • Rules and representativeness

133
Why Bayes?
  • A framework for explaining cognition.
  • How people can learn so much from such limited
    data.
  • Why process-level models work the way that they
    do.
  • Strong quantitative models with minimal ad hoc
    assumptions.
  • A framework for understanding how structured
    knowledge and statistical inference interact.
  • How structured knowledge guides statistical
    inference, and is itself acquired through
    higher-order statistical learning.
  • How simplicity trades off with fit to the data in
    evaluating structural hypotheses (Occam's razor).
  • How increasingly complex structures may grow as
    required by new data, rather than being
    pre-specified in advance.

134
Theory-Based Bayesian Models
  • Rational statistical inference (Bayes)
  • Learners' domain theories generate their
    hypothesis space H and prior p(h).
  • Well-matched to structure of the natural world.
  • Learnable from limited data.
  • Computationally tractable inference.

135
Looking towards the afternoon
  • How do we apply these ideas to more natural and
    complex aspects of cognition?
  • Where do the hypothesis spaces come from?
  • Can we formalize the contributions of domain
    theories?

137
Outline
  • Morning
  • Introduction (Josh)
  • Basic case study 1: Flipping coins (Tom)
  • Basic case study 2: Rules and similarity (Josh)
  • Afternoon
  • Advanced case study 1: Causal induction (Tom)
  • Advanced case study 2: Property induction (Josh)
  • Quick tour of more advanced topics (Tom)

139
Marr's Three Levels of Analysis
  • Computation
  • What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?
  • Representation and algorithm
  • Cognitive psychology
  • Implementation
  • Neurobiology

141
Working at the computational level
  • What is the computational problem?
  • input: data
  • output: solution
  • What knowledge is available to the learner?
  • Where does that knowledge come from?

142
Theory-Based Bayesian Models
  • Rational statistical inference (Bayes)
  • Learners' domain theories generate their
    hypothesis space H and prior p(h).
  • Well-matched to structure of the natural world.
  • Learnable from limited data.
  • Computationally tractable inference.

143
Causality
144
Bayes nets and beyond...
  • Increasingly popular approach to studying human
    causal inferences
  • (e.g. Glymour, 2001; Gopnik et al., 2004)
  • Three reactions:
  • "Bayes nets are the solution!"
  • "Bayes nets are missing the point, not sure why"
  • "What is a Bayes net?"

145
Bayes nets and beyond...
  • What are Bayes nets?
  • graphical models
  • causal graphical models
  • An example: elemental causal induction
  • Beyond Bayes nets
  • other knowledge in causal induction
  • formalizing causal theories

147
Graphical models
  • Express the probabilistic dependency structure
    among a set of variables (Pearl, 1988)
  • Consist of
  • a set of nodes, corresponding to variables
  • a set of edges, indicating dependency
  • a set of functions defined on the graph that
    defines a probability distribution

148
Undirected graphical models
  • Consist of
  • a set of nodes
  • a set of edges
  • a potential for each clique, multiplied together
    to yield the distribution over variables
  • Examples
  • statistical physics: Ising model, spin glasses
  • early neural networks (e.g. Boltzmann machines)

(Figure: undirected graph over nodes X1 to X5)
149
Directed graphical models
  • Consist of
  • a set of nodes
  • a set of edges
  • a conditional probability distribution for each
    node, conditioned on its parents, multiplied
    together to yield the distribution over variables
  • Constrained to directed acyclic graphs (DAGs)
  • AKA Bayesian networks, Bayes nets

(Figure: directed graph over nodes X1 to X5)
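The "multiplied together" bullet for directed models can be written compactly as a product over nodes. This is a restatement, not text from the slide; the symbol pa(x_i), standing for the parents of node x_i in the graph, is notation assumed here.

```latex
P(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```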
150
Bayesian networks and Bayes
  • Two different problems
  • Bayesian statistics is a method of inference
  • Bayesian networks are a form of representation
  • There is no necessary connection
  • many users of Bayesian networks rely upon
    frequentist statistical methods (e.g. Glymour)
  • many Bayesian inferences cannot be easily
    represented using Bayesian networks

151
Properties of Bayesian networks
  • Efficient representation and inference
  • exploiting dependency structure makes it easier
    to represent and compute with probabilities
  • Explaining away
  • pattern of probabilistic reasoning characteristic
    of Bayesian networks, especially early use in AI

153
Efficient representation and inference
  • Three binary variables: Cavity, Toothache, Catch
  • Specifying P(Cavity, Toothache, Catch) requires 7
    parameters (1 for each set of values, minus 1
    because it's a probability distribution)
  • With n variables, we need 2^n - 1 parameters
  • Here n = 3. Realistically, many more: X-ray, diet,
    oral hygiene, personality, ...

154
Conditional independence
  • All three variables are dependent, but Toothache
    and Catch are independent given the presence or
    absence of Cavity
  • In probabilistic terms: P(Toothache, Catch | Cavity)
    = P(Toothache | Cavity) P(Catch | Cavity)
  • With n evidence variables, x1, ..., xn, we need 2n
    conditional probabilities
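A minimal sketch of the parameter counts for binary variables, with and without the conditional independence assumption. The function names and all probability values are hypothetical, chosen only for illustration. The factored count adds one parameter for the prior on the cause, which the slide's "2n" does not include.

```python
from itertools import product


def full_joint_params(num_vars: int) -> int:
    """Free parameters of an unrestricted joint over binary variables."""
    return 2 ** num_vars - 1


def factored_params(num_evidence: int) -> int:
    """One prior for the cause plus two conditionals per evidence variable."""
    return 1 + 2 * num_evidence


# Hypothetical numbers, not taken from the tutorial.
P_CAVITY = 0.2
P_TOOTHACHE_GIVEN_CAVITY = {True: 0.6, False: 0.1}
P_CATCH_GIVEN_CAVITY = {True: 0.9, False: 0.2}


def bernoulli(p: float, value: bool) -> float:
    """Probability of a binary value when P(value is True) = p."""
    return p if value else 1.0 - p


def joint(cavity: bool, toothache: bool, catch: bool) -> float:
    """Joint probability built from the factored (5-parameter) model."""
    return (bernoulli(P_CAVITY, cavity)
            * bernoulli(P_TOOTHACHE_GIVEN_CAVITY[cavity], toothache)
            * bernoulli(P_CATCH_GIVEN_CAVITY[cavity], catch))


# The factored model still defines a proper joint over all 8 outcomes.
total_probability = sum(joint(*values)
                        for values in product([True, False], repeat=3))

print(full_joint_params(3), factored_params(2))
```

For the dentist example, the full joint needs 7 parameters and the factored model needs 5. The gap grows quickly as evidence variables are added, since one count is exponential and the other linear.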

155
A simple Bayesian network
  • Graphical representation of relations between a
    set of random variables
  • Probabilistic interpretation: factorizing complex
    terms

156
A more complex system
(Figure: network over Battery, Radio, Ignition, Gas, Starts, On time to work)
  • Joint distribution sufficient for any inference

158
A more complex system
(Figure: network over Battery, Radio, Ignition, Gas, Starts, On time to work)
  • Joint distribution sufficient for any inference
  • General inference algorithm: local message
    passing (belief propagation; Pearl, 1988)
  • efficiency depends on sparseness of graph
    structure

159
Explaining away
  • Assume grass will be wet if and only if it rained
    last night, or if the sprinklers were left on

160
Explaining away
Compute probability it rained last night, given
that the grass is wet
165
Explaining away
Compute probability it rained last night, given
that the grass is wet and sprinklers were left
on
167
Explaining away
Discounting to prior probability.
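The explaining-away computation can be checked by brute-force enumeration. This is a sketch under stated assumptions: the priors are hypothetical, rain and sprinkler are assumed independent a priori, and the grass is wet if and only if it rained or the sprinkler was on.

```python
from itertools import product

P_RAIN = 0.3        # hypothetical prior
P_SPRINKLER = 0.4   # hypothetical prior


def joint(rain: bool, sprinkler: bool, wet: bool) -> float:
    """P(rain, sprinkler, wet) with Wet = Rain OR Sprinkler (deterministic)."""
    p = (P_RAIN if rain else 1 - P_RAIN)
    p *= (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    return p if wet == (rain or sprinkler) else 0.0


def prob_rain(**evidence: bool) -> float:
    """P(rain = True | evidence), summing the joint over consistent worlds."""
    numerator = denominator = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(rain, sprinkler, wet)
        denominator += p
        if rain:
            numerator += p
    return numerator / denominator


p_rain_given_wet = prob_rain(wet=True)
p_rain_given_wet_and_sprinkler = prob_rain(wet=True, sprinkler=True)

print(P_RAIN, p_rain_given_wet, p_rain_given_wet_and_sprinkler)
```

Observing wet grass raises the probability of rain above its prior. Additionally observing the sprinkler returns it to the prior, because the sprinkler fully accounts for the wet grass in this deterministic model.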
168
Contrast w/ production system
(Figure: nodes Rain and Grass Wet)
  • Formulate IF-THEN rules
  • IF Rain THEN Wet
  • IF Wet THEN Rain
  • Rules do not distinguish directions of inference
  • Requires combinatorial explosion of rules

169
Contrast w/ spreading activation
(Figure: nodes Rain, Sprinkler, Grass Wet)
  • Observing rain, Wet becomes more active.
  • Observing grass wet, Rain and Sprinkler become
    more active.
  • Observing grass wet and sprinkler, Rain cannot
    become less active. No explaining away!
  • Excitatory links: Rain-Wet, Sprinkler-Wet

170
Contrast w/ spreading activation
(Figure: nodes Rain, Sprinkler, Grass Wet)
  • Excitatory links: Rain-Wet, Sprinkler-Wet
  • Inhibitory link: Rain-Sprinkler
  • Observing grass wet, Rain and Sprinkler become
    more active.
  • Observing grass wet and sprinkler, Rain becomes
    less active: explaining away.

171
Contrast w/ spreading activation
(Figure: nodes Rain, Burst pipe, Sprinkler, Grass Wet)
  • Each new variable requires more inhibitory
    connections.
  • Interactions between variables are not causal.
  • Not modular.
  • Whether a connection exists depends on what other
    connections exist, in non-transparent ways.
  • Big holism problem.
  • Combinatorial explosion.

172
Graphical models
  • Capture dependency structure in distributions
  • Provide an efficient means of representing and
    reasoning with probabilities
  • Allow kinds of inference that are problematic for
    other representations: explaining away
  • hard to capture in a production system
  • hard to capture with spreading activation

173
Bayes nets and beyond...
  • What are Bayes nets?
  • graphical models
  • causal graphical models
  • An example: causal induction
  • Beyond Bayes nets
  • other knowledge in causal induction
  • formalizing causal theories

174
Causal graphical models
  • Graphical models represent statistical
    dependencies among variables (i.e. correlations)
  • can answer questions about observations
  • Causal graphical models represent causal
    dependencies among variables
  • express underlying causal structure
  • can answer questions about both observations and
    interventions (actions upon a variable)

175
Observation and intervention
(Figure: network over Battery, Radio, Ignition, Gas, Starts, On time to work)
Graphical model: P(Radio | Ignition)
Causal graphical model: P(Radio | do(Ignition))
176
Observation and intervention
(Figure: network over Battery, Radio, Ignition, Gas, Starts, On time to work)
Graphical model: P(Radio | Ignition)
Causal graphical model: P(Radio | do(Ignition))
Graph surgery produces a "mutilated graph"
177
Assessing interventions
  • To compute P(Y | do(X = x)), delete all edges coming
    into X and reason with the resulting Bayesian
    network ("do calculus"; Pearl, 2000)
  • Allows a single structure to make predictions
    about both observations and interventions
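A small sketch of the difference between observing and intervening, using graph surgery on a fragment of the car network. The edges Battery → Radio and Battery → Ignition are an assumption (the transcript preserves only the node names), and all probabilities are hypothetical.

```python
# Hypothetical conditional probability tables for an assumed fragment:
# Battery -> Radio, Battery -> Ignition.
P_BATTERY = 0.9
P_RADIO_GIVEN_BATTERY = {True: 0.95, False: 0.05}
P_IGNITION_GIVEN_BATTERY = {True: 0.98, False: 0.02}


def bernoulli(p: float, value: bool) -> float:
    """Probability of a binary value when P(value is True) = p."""
    return p if value else 1.0 - p


def joint(battery: bool, radio: bool, ignition: bool,
          cut_ignition: bool = False) -> float:
    """Joint probability; cut_ignition deletes the edge into Ignition."""
    p = bernoulli(P_BATTERY, battery)
    p *= bernoulli(P_RADIO_GIVEN_BATTERY[battery], radio)
    if not cut_ignition:
        # In the mutilated graph, Ignition is set from outside, so its
        # conditional distribution given Battery is dropped.
        p *= bernoulli(P_IGNITION_GIVEN_BATTERY[battery], ignition)
    return p


def p_radio(ignition: bool, do: bool = False) -> float:
    """P(Radio | Ignition) if do is False, else P(Radio | do(Ignition))."""
    values = (True, False)
    numerator = sum(joint(b, True, ignition, cut_ignition=do) for b in values)
    denominator = sum(joint(b, r, ignition, cut_ignition=do)
                      for b in values for r in values)
    return numerator / denominator


p_observed = p_radio(ignition=True)
p_intervened = p_radio(ignition=True, do=True)

print(p_observed, p_intervened)
```

Observing that the ignition works is evidence that the battery is good, which raises the probability that the radio works. Forcing the ignition to work says nothing about the battery, so the radio's probability stays at its marginal value.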

178
Causality simplifies inference
  • Using a representation in which the direction of
    causality is correct produces sparser graphs
  • Suppose we get the direction of causality wrong,
    thinking that symptoms cause diseases
  • Does not capture the correlation between
    symptoms: falsely believe P(Ache, Catch) =
    P(Ache) P(Catch).

(Figure: nodes Ache, Catch, Cavity)
179
Causality simplifies inference
  • Using a representation in which the direction of
    causality is correct produces sparser graphs
  • Suppose we get the direction of causality wrong,
    thinking that symptoms cause diseases
  • Inserting a new arrow allows us to capture this
    correlation.
  • This model is too complex: do not believe that

(Figure: nodes Ache, Catch, Cavity)
180
Causality simplifies inference
  • Using a representation in which the direction of
    causality is correct produces sparser graphs
  • Suppose we get the direction of causality wrong,
    thinking that symptoms cause diseases
  • New symptoms require a combinatorial
    proliferation of new arrows. This reduces
    efficiency of inference.

(Figure: nodes Ache, X-ray, Catch, Cavity)
181
Learning causal graphical models
  • Strength: how strong is a relationship?
  • Structure: does a relationship exist?

183
Causal structure vs. causal strength
  • Strength: how strong is a relationship?
  • requires defining nature of relationship

184
Parameterization
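  • Structures: h1, h0
  • Parameterization

(Figure: graphs h1 and h0 over nodes B, C, E, with a table of P(E = 1 | C, B) under h1 and h0 for (C, B) = (0, 0), (1, 0), (0, 1), (1, 1); table entries not preserved)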
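The table entries on this slide did not survive in the transcript. As a sketch, assuming the two parameterizations named later in the tutorial (linear and noisy-OR), with w_0 the strength of the background cause B and w_1 the strength of C, the conditional probabilities under h1 are commonly written as below. Under h0, where C has no influence on E, the same expressions apply with w_1 = 0.

```latex
\begin{aligned}
\text{linear:}   \quad & P(e = 1 \mid b, c) = w_0\, b + w_1\, c \\
\text{noisy-OR:} \quad & P(e = 1 \mid b, c) = 1 - (1 - w_0)^{b}\,(1 - w_1)^{c}
\end{aligned}
```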
187
Parameter estimation
  • Maximum likelihood estimation
  • maximize ∏_i P(b_i, c_i, e_i | w0, w1)
  • Bayesian methods: as in the "Comparing infinitely
    many hypotheses" example

188
Causal structure vs. causal strength
  • Structure: does a relationship exist?

189
Approaches to structure learning
  • Constraint-based
  • dependency from statistical tests (e.g. χ²)
  • deduce structure from dependencies

(Figure: nodes B, C, E)
(Pearl, 2000; Spirtes et al., 1993)
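The first step of the constraint-based approach, testing whether C and E are dependent, might look like the following for a 2x2 contingency table. This is a sketch, not the procedure of any cited author: it uses the Pearson χ² statistic without continuity correction and the standard 0.05 critical value for one degree of freedom (3.841). The argument names and the example counts are hypothetical.

```python
CHI2_CRITICAL_DF1_ALPHA_05 = 3.841  # standard critical value, 1 d.o.f.


def chi_square_2x2(n_c_e: int, n_c_not_e: int,
                   n_not_c_e: int, n_not_c_not_e: int) -> float:
    """Pearson chi-square statistic for a 2x2 table of counts.

    n_c_e:         cause present, effect present
    n_c_not_e:     cause present, effect absent
    n_not_c_e:     cause absent,  effect present
    n_not_c_not_e: cause absent,  effect absent
    """
    total = n_c_e + n_c_not_e + n_not_c_e + n_not_c_not_e
    margins = ((n_c_e + n_c_not_e) * (n_not_c_e + n_not_c_not_e)
               * (n_c_e + n_not_c_e) * (n_c_not_e + n_not_c_not_e))
    if margins == 0:
        raise ValueError("a row or column total is zero; test is undefined")
    cross = n_c_e * n_not_c_not_e - n_c_not_e * n_not_c_e
    return total * cross ** 2 / margins


def dependent(*counts: int) -> bool:
    """True if the test rejects independence of C and E at the 0.05 level."""
    return chi_square_2x2(*counts) > CHI2_CRITICAL_DF1_ALPHA_05


statistic = chi_square_2x2(6, 2, 2, 6)
print(statistic, dependent(6, 2, 2, 6), dependent(4, 4, 4, 4))
```

The test yields a yes/no answer about dependency, which is then treated as a constraint on the graph. It does not express degrees of belief in a structure, which is where the Bayesian approach differs.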
192
Approaches to structure learning
  • Constraint-based
  • dependency from statistical tests (e.g. χ²)
  • deduce structure from dependencies

(Figure: nodes B, C, E)
(Pearl, 2000; Spirtes et al., 1993)
Attempts to reduce inductive problem to deductive
problem
193
Approaches to structure learning
  • Constraint-based
  • dependency from statistical tests (e.g. χ²)
  • deduce structure from dependencies

(Figure: nodes B, C, E)
(Pearl, 2000; Spirtes et al., 1993)
  • Bayesian
  • compute posterior probability of structures,
    given observed data

(Figure: candidate structures S1 and S0 over nodes B, C, E)
P(S1 | data), P(S0 | data)
P(S | data) ∝ P(data | S) P(S)
(Heckerman, 1998; Friedman, 1999)
194
Causal graphical models
  • Extend graphical models to deal with
    interventions as well as observations
  • Respecting the direction of causality results in
    efficient representation and inference
  • Two steps in learning causal models
  • parameter estimation
  • structure learning

195
Bayes nets and beyond...
  • What are Bayes nets?
  • graphical models
  • causal graphical models
  • An example: elemental causal induction
  • Beyond Bayes nets
  • other knowledge in causal induction
  • formalizing causal theories

196
Elemental causal induction
             C present   C absent
E present        a           c
E absent         b           d
To what extent does C cause E?
197
Causal structure vs. causal strength
  • Strength: how strong is a relationship?
  • Structure: does a relationship exist?

198
Causal strength
  • Assume structure
  • Leading models (ΔP and causal power) are maximum
    likelihood estimates of the strength parameter
    w1, under different parameterizations for
    P(E | B, C)
  • linear → ΔP, noisy-OR → causal power

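A minimal sketch of the two strength measures, computed from contingency counts. It assumes the standard definitions ΔP = P(e | c) - P(e | not c) and, for generative causes, causal power = ΔP / (1 - P(e | not c)). The argument names and example counts are hypothetical.

```python
def delta_p(n_c_e: int, n_c_not_e: int,
            n_not_c_e: int, n_not_c_not_e: int) -> float:
    """Delta-P: P(e | c) - P(e | not c), estimated from counts."""
    p_e_given_c = n_c_e / (n_c_e + n_c_not_e)
    p_e_given_not_c = n_not_c_e / (n_not_c_e + n_not_c_not_e)
    return p_e_given_c - p_e_given_not_c


def causal_power(n_c_e: int, n_c_not_e: int,
                 n_not_c_e: int, n_not_c_not_e: int) -> float:
    """Generative causal power: Delta-P / (1 - P(e | not c))."""
    p_e_given_not_c = n_not_c_e / (n_not_c_e + n_not_c_not_e)
    if p_e_given_not_c == 1.0:
        raise ValueError("power is undefined when the effect always "
                         "occurs without the cause")
    counts = (n_c_e, n_c_not_e, n_not_c_e, n_not_c_not_e)
    return delta_p(*counts) / (1.0 - p_e_given_not_c)


# Example: E occurs on 6 of 8 trials with C and 2 of 8 trials without C.
example_delta_p = delta_p(6, 2, 2, 6)
example_power = causal_power(6, 2, 2, 6)
print(example_delta_p, example_power)
```

The two measures agree only when the effect never occurs without the cause. Otherwise power is larger, since it rescales ΔP by the room left for C to produce E.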
199
Causal structure
  • Hypotheses: h1, h0
  • Bayesian causal inference
  • support

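The formula for support did not survive in the transcript. The sketch below assumes it is the log ratio of marginal likelihoods, log P(data | h1) / P(data | h0), with uniform priors on w0 and w1 and a noisy-OR parameterization. The integrals are approximated on a midpoint grid; the function names, grid size, and example counts are hypothetical.

```python
import math


def noisy_or_likelihood(w0: float, w1: float, n_c_e: int, n_c_not_e: int,
                        n_not_c_e: int, n_not_c_not_e: int) -> float:
    """Likelihood of the counts given strengths w0 (background) and w1 (C)."""
    p_e_given_c = w0 + w1 - w0 * w1      # noisy-OR with both causes present
    p_e_given_not_c = w0                 # background cause only
    return (p_e_given_c ** n_c_e * (1 - p_e_given_c) ** n_c_not_e
            * p_e_given_not_c ** n_not_c_e
            * (1 - p_e_given_not_c) ** n_not_c_not_e)


def causal_support(n_c_e: int, n_c_not_e: int, n_not_c_e: int,
                   n_not_c_not_e: int, grid: int = 200) -> float:
    """Approximate log P(data | h1) - log P(data | h0)."""
    counts = (n_c_e, n_c_not_e, n_not_c_e, n_not_c_not_e)
    points = [(i + 0.5) / grid for i in range(grid)]  # midpoints in (0, 1)
    # h1: C -> E exists; average the likelihood over uniform w0 and w1.
    evidence_h1 = sum(noisy_or_likelihood(w0, w1, *counts)
                      for w0 in points for w1 in points) / grid ** 2
    # h0: no C -> E edge, i.e. w1 = 0; average over uniform w0 only.
    evidence_h0 = sum(noisy_or_likelihood(w0, 0.0, *counts)
                      for w0 in points) / grid
    return math.log(evidence_h1 / evidence_h0)


strong_contingency = causal_support(8, 0, 0, 8)
no_contingency = causal_support(4, 4, 4, 4)
print(strong_contingency, no_contingency)
```

Positive values favor the structure with a causal link and negative values favor the structure without one. With no contingency the simpler structure h0 wins, an instance of the Bayesian Occam's razor mentioned earlier in the tutorial.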
200
Buehner and Cheng (1997)
People
ΔP (r = 0.89)
Power (r = 0.88)
Support (r = 0.97)
201
The importance of parameterization
  • Noisy-OR incorporates mechanism assumptions
  • generativity: causes increase probability of
    effects
  • each cause is sufficient to produce the effect
  • causes act via independent mechanisms
  • (Cheng, 1997)
  • Co