Title: Bayesian approaches to cognitive sciences
1. Bayesian approaches to cognitive sciences
2.
- Word learning
- Bayesian property induction
- Theory-based causal inference
3. Word Learning
4. Word Learning
- Some constraints on word learning:
- Very few examples are required
- Learning is possible with only positive examples
- Word meanings overlap
- Learning is often graded
5. Word Learning
- Given a few instances of a particular word, say dog, how do we generalize to new instances?
- Hypothesis elimination: use deductive logic (along with prior knowledge) to eliminate hypotheses that are inconsistent with the use of the word.
6. Word Learning
- Some constraints on word learning:
- Very few examples are required
- Learning is possible with only positive examples
- Word meanings overlap
- Learning is often graded
7. Word Learning
- Given a few instances of a particular word, say dog, how do we generalize to new instances?
- Connectionist (associative) approach: compute the probability of co-occurrence of object features and the corresponding word.
8. Word Learning
- Some constraints on word learning:
- Very few examples are required
- Learning is possible with only positive examples
- Word meanings overlap
- Learning is often graded
9. Word Learning
- Alternative: rational statistical inference with a structured hypothesis space.
- Suppose you see a Dalmatian and you hear "fep". Does fep refer to all dogs or just to Dalmatians? What if you hear 3 more examples, all corresponding to Dalmatians? Then it should be clear that feps are Dalmatians, because this observation would be a suspicious coincidence if fep referred to all dogs.
- Therefore, logic is not enough; you also need probabilities. However, you don't need that many examples. And co-occurrence frequencies are not enough (in our example, fep is associated 100% of the time with Dalmatians whether you see one or three examples).
- We need structured prior knowledge.
10. Word Learning
- Suppose objects are organized in taxonomic trees.
[Tree: animals ⊃ dogs ⊃ Dalmatians]
11. Word Learning
- We're given N examples of a word C. The goal of learning is to determine whether C corresponds to the subordinate, basic, or superordinate level. The level in the taxonomy is what we mean by "meaning".
- h: hypothesis, i.e., word meaning.
- The set of possible hypotheses is strongly constrained by the tree structure.
12. Word Learning
- T: tree structure
- H: hypotheses
[Tree: animals ⊃ dogs ⊃ Dalmatians]
13. Word Learning
- Inference just follows Bayes' rule.
- h: hypothesis, e.g., is this a basic-level word (dog)?
- x: data, e.g., a set of labeled images of animals
- T: type of representation being assumed (e.g., tree structure)
14. Word Learning
- Inference just follows Bayes' rule.
- Prior: the prior is strongly constrained by the tree structure. Only some hypotheses are possible (the ones corresponding to the hierarchical levels in the tree).
- Likelihood function: the probability of the data.
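In the notation above, Bayes' rule reads:

$$p(h \mid x, T) = \frac{p(x \mid h, T)\, p(h \mid T)}{\sum_{h' \in H} p(x \mid h', T)\, p(h' \mid T)}$$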
15. Word Learning
- Likelihood functions and the size principle.
- Assume you're given n examples of a particular group (e.g., 3 examples of dogs, or 3 examples of Dalmatians).
16. Word Learning
- Let's assume there are 100 dogs in the world, 10 of them Dalmatians. If examples are drawn randomly with replacement from those pools, we have:
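$$p(X \mid h = \text{dogs}) = \left(\tfrac{1}{100}\right)^n, \qquad p(X \mid h = \text{Dalmatians}) = \left(\tfrac{1}{10}\right)^n$$

so n Dalmatian examples are 10^n times more likely under the subordinate hypothesis.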
17. Word Learning
- More generally, the probability of getting n examples from a particular hypothesis h is given by the formula below.
- This is known as the size principle: multiple examples drawn from smaller sets are more likely.
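In symbols, assuming examples are sampled uniformly with replacement from the extension of h:

$$p(X \mid h) = \left(\frac{1}{\text{size}(h)}\right)^{n} \quad \text{if all } n \text{ examples lie in } h, \text{ and } 0 \text{ otherwise.}$$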
18. Word Learning
- Let's say you're given 1 example, a Dalmatian.
- Conclusion: it's very likely to be a subordinate word.
19. Word Learning
- Let's say you're given 4 examples, all Dalmatians.
- Conclusion: it's a subordinate word (Dalmatian) with near certainty, or it's a very suspicious coincidence!
20. Word Learning
- Let's say you're given 5 examples: 2 Dalmatians and 3 German Shepherds.
- Conclusion: it's a basic-level word (dog) with near certainty.
Probability that the images got mislabeled; assumed to be very small.
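A minimal sketch of the computation on slides 15-20, assuming a flat prior over the three taxonomy levels; the set sizes for dogs and Dalmatians come from slide 16, while the size of the superordinate set (1000 animals) is an illustrative assumption:

```python
# Size-principle posterior over word meanings in a three-level taxonomy.
sizes = {"dalmatian": 10, "dog": 100, "animal": 1000}
contains_dalmatian = {"dalmatian", "dog", "animal"}  # hypotheses consistent with a Dalmatian example
prior = {h: 1.0 / len(sizes) for h in sizes}  # flat prior over the three levels

def posterior(n):
    """Posterior over word meanings after n examples, all Dalmatians."""
    # Size principle: p(X|h) = (1/|h|)^n if h contains all the examples, else 0.
    lik = {h: (1.0 / sizes[h]) ** n if h in contains_dalmatian else 0.0
           for h in sizes}
    z = sum(lik[h] * prior[h] for h in sizes)
    return {h: lik[h] * prior[h] / z for h in sizes}

print(posterior(1))  # 1 Dalmatian: subordinate favored (~0.90)
print(posterior(4))  # 4 Dalmatians: subordinate near-certain (~0.9999)
```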
21. Word Learning
- Subject is shown one Dalmatian and told it's a fep.
- Subordinate match: subject is shown a new Dalmatian and asked if it's a fep.
- Basic match: subject is shown a new dog (non-Dalmatian) and asked whether it's a fep.
22. Word Learning
As more subordinate examples are collected, the probabilities for the basic and superordinate levels go down.
- Subject is shown three Dalmatians and told they are feps.
- Subordinate match: subject is shown a new Dalmatian and asked if it's a fep.
- Basic match: subject is shown a new dog (non-Dalmatian) and asked whether it's a fep.
23. Word Learning
As more basic-level examples are collected, the probability for the basic level goes up.
With only one example, the subordinate level is favored.
24. Word Learning
The model produces similar behavior.
27. Bayesian property induction
28. Bayesian property induction
- Given that gorillas and chimpanzees have gene X, do macaques have gene X?
- If cheetahs and giraffes carry disease X, do polar bears carry disease X?
- Classic approach: Boolean logic.
- Problem: such questions are inherently probabilistic. Answering yes or no would be very misleading.
- Fuzzy logic?
29. Bayesian property induction
- C: concept (e.g., "mammals that can get disease X"), i.e., a set of animals defined by a particular property.
- H: hypothesis space, the space of all possible concepts, i.e., all possible sets of animals. With 10 animals, H contains 2^10 sets.
- h: a particular set. Note that there is an h for which h = C.
- y: a particular statement ("dolphins can get disease X"), that is, a claim that an item belongs to a hypothesis.
- X: a set of observations drawn from the concept C.
30. Bayesian property induction
- The goal of inference is to determine whether y belongs to a concept C for which we have samples X.
- E.g., given that gorillas and chimpanzees have gene X, do macaques have gene X?
- X = {gorillas, chimpanzees}
- y = macaques
- C = the set of all animals with gene X.
- Note that we don't know the full list of animals in set C. C is a hidden (or latent) variable. We need to integrate it out.
31. Bayesian property induction
- Animals: buffalo, zebra, giraffe, seal = {b, z, g, s}
- X = {b, z} have property a
- y = g: do giraffes have property a?
Probability that g has property a, given that b and z have property a.
Probability that g has property a, given that h contains g: it must equal 1.
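In symbols, integrating out the hidden concept (a sketch consistent with the definitions on slides 29-31):

$$p(y \in C \mid X) = \sum_{h \in H} p(y \in h \mid h)\, p(h \mid X) = \sum_{h \,:\, y \in h} p(h \mid X), \qquad p(h \mid X) \propto p(X \mid h)\, p(h)$$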
32. Bayesian property induction
Warning: this is the probability that X will be observed given h. It is not the probability that the members of X belong to h.
33. Bayesian property induction
- The likelihood function
- Animals: buffalo, zebra, giraffe, seal = {b, z, g, s}
- If h = {b, z}:
- p(X = {b} | h) = 0.5
- p(X = {b, z} | h) = 0.5 × 0.5 = 0.25
- p(X = {g} | h) = 0. This rules out all hypotheses that do not contain all of the observations.
- If h = {b, z, g} have property a:
- p(X = {b, z} | h) = (1/3)² ≈ 0.11. The larger the set, the smaller the likelihood: Occam's razor.
34. Bayesian property induction
- The likelihood function
- Nonzero only if X is a subset of h. Note that sets that contain none or only some of the elements of X, but not all of them, are ruled out. Also, data are less likely to come from large sets (Occam's razor), as sketched below.
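A minimal sketch of this size-principle likelihood for the four-animal example, assuming examples are drawn uniformly with replacement from h:

```python
from itertools import chain, combinations

animals = ["b", "z", "g", "s"]  # buffalo, zebra, giraffe, seal

def likelihood(X, h):
    """p(X | h): examples drawn uniformly with replacement from h;
    zero unless every observation lies inside h."""
    if not set(X) <= set(h):
        return 0.0
    return (1.0 / len(h)) ** len(X)

# The full hypothesis space: all 2^4 = 16 subsets of the animals.
H = [set(c) for c in chain.from_iterable(
        combinations(animals, k) for k in range(len(animals) + 1))]

X = ["b", "z"]
print(likelihood(X, {"b", "z"}))       # 0.25: smallest consistent set
print(likelihood(X, {"b", "z", "g"}))  # ~0.11: larger set, Occam's razor
print(likelihood(X, {"g", "s"}))       # 0.0: ruled out (misses b and z)
```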
35. Bayesian property induction
- The prior: the prior should embed knowledge of the domain.
- Naive approach: if we have 10 animals and we consider all possible sets, we end up with 2^10 = 1024 sets. A flat prior over these would yield a prior of 1/1024 for each set.
36. Bayesian property induction
- Bayesian taxonomic prior
- Note: only 19 sets have a nonzero prior. 2^10 - 19 = 1005 sets have a prior of zero. A HUGE constraint!
37. Bayesian property induction
- The taxonomic prior is not enough. Why?
- 1. Seals and squirrels catch disease X, so horses are also susceptible.
- 2. Seals and cows catch disease X, so horses are also susceptible.
- Most people say that statement 2 is stronger.
38. Bayesian property induction
- 1. Seals and squirrels catch disease X, so horses are also susceptible.
- 2. Seals and cows catch disease X, so horses are also susceptible.
[Figure: two taxonomic trees; in each, the only hypothesis that can contain all three animals is marked]
39. Bayesian property induction
- This approach does not distinguish the two statements: in both cases, the only hypothesis that can contain all three animals is the same.
40. Bayesian property induction
41. Bayesian property induction
- Evolutionary Bayes: all 1024 sets of animals are possible, but they differ in their prior. Sets that contain animals that are nearby in the tree are more likely.
42. Bayesian property induction
[Figure: tree with mutation probabilities p and p² marked on branches]
43. Bayesian property induction
[Figure: a likely set under the evolutionary prior (p · p²)]
44. Bayesian property induction
[Figure: an unlikely set under the evolutionary prior (p²)]
45. Bayesian property induction
- 1. Seals and squirrels catch disease X, so horses are also susceptible.
- 2. Seals and cows catch disease X, so horses are also susceptible.
- Under the evolutionary prior, there are more likely scenarios compatible with the second statement.
46. Bayesian property induction
- The evolutionary prior explains the data better than any other model (but not by much).
47. Other priors
48. How to learn the structure of the prior
- Syntactic rules for growing graphs
51. Theory-based causal inference
52. Theory-based causal inference
- Can we use this framework to infer causality?
53. Theory-based causal inference
- A blicket detector activates when a blicket is placed onto it.
- Observation 1: B1 and B2 together, detector on.
- Most kids say that B1 and B2 are blickets.
- Observation 2: B1 alone, detector on.
- All kids say B1 is a blicket but not B2.
- This is known as extinction, or explaining away.
54. Theory-based causal inference
- Impossible to capture with the usual learning algorithms, because there aren't enough trials to learn all the probabilities involved.
- Simple reasoning could be used, along with Occam's razor (e.g., B1 alone is enough to explain all the data), but it's hard to formalize. (How do we define Occam's razor?)
55. Theory-based causal inference
- Alternative: assume the data were generated by a causal process.
- We observed two trials: d1 = {e = 1, x1 = 1, x2 = 1} and d2 = {e = 1, x1 = 1, x2 = 0}. What kind of Bayesian net can account for these data?
- There are only four possible networks.
56. Theory-based causal inference
- Which network is most likely to explain the data?
- Bayesian approach: compute the posterior over networks given the data.
- If we assume that the probability of any object being a blicket is r, the prior over Bayesian nets is given by the formula below.
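Writing h_jk for the network in which x1 is a blicket iff j = 1 and x2 is a blicket iff k = 1, the prior factorizes as:

$$p(h_{jk}) = r^{\,j+k}\,(1-r)^{\,2-(j+k)}$$

so $p(h_{00}) = (1-r)^2$, $p(h_{10}) = p(h_{01}) = r(1-r)$, and $p(h_{11}) = r^2$.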
57. Theory-based causal inference
- Let's consider what happens after we observe d1 = {e = 1, x1 = 1, x2 = 1}.
- If we assume the machine does not go off by itself (p(e = 1 | no blickets) = 0), we have:
58. Theory-based causal inference
- Let's consider what happens after we observe d1 = {e = 1, x1 = 1, x2 = 1}.
- For the other nets:
59. Theory-based causal inference
- Therefore, we're left with three networks, and for each of them we have the posterior below.
- Assuming blickets are rare (r < 0.5), the most likely explanations are the ones for which only one object is a blicket (h10 and h01). Therefore, object 1 or object 2 is a blicket (but it's unlikely that both are blickets!)
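As a sketch: d1 rules out only h00, so the posterior is proportional to the prior on the remaining three networks:

$$p(h_{10} \mid d_1) = p(h_{01} \mid d_1) = \frac{r(1-r)}{2r(1-r) + r^2}, \qquad p(h_{11} \mid d_1) = \frac{r^2}{2r(1-r) + r^2}$$

With $r < 0.5$, $r(1-r) > r^2$, so the single-blicket networks dominate.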
60. Theory-based causal inference
- We now observe d2 = {e = 1, x1 = 1, x2 = 0}.
- Again, if we assume the machine does not go off by itself (p(e = 1 | no blickets) = 0), we have:
We're assuming that the machine does not go off if there is no blicket.
61. Theory-based causal inference
- We observed two trials: d1 = {e = 1, x1 = 1, x2 = 1} and d2 = {e = 1, x1 = 1, x2 = 0}. h00 and h01 are inconsistent with these data, so we're left with the other two.
62. Theory-based causal inference
- We observed two trials, d1 = {e = 1, x1 = 1, x2 = 1} and d2 = {e = 1, x1 = 1, x2 = 0}, and we're left with two networks.
- Assuming blickets are rare (r < 0.5), the network in which only one object is a blicket (h10) is the most likely explanation.
63. Theory-based causal inference
- But what happens to the probability that X2 is a blicket?
- To compute this we need to compute the sum below.
Either the x2 link is in network hjk, or it's not.
Hypotheses for which this link exists must have a 1 here.
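In symbols, a sketch of that sum (the link indicator is 1 exactly when k = 1):

$$p(x_2 \to e \mid d) = \sum_{j,k} p(x_2 \to e \mid h_{jk})\, p(h_{jk} \mid d) = p(h_{01} \mid d) + p(h_{11} \mid d)$$

With $r = 1/3$ and $d = d_1$, this gives $1/(2 - r) = 3/5$.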
64. Theory-based causal inference
- But what happens to the probability that X2 is a blicket?
(Data suggest that r = 1/3.)
65. Theory-based causal inference
- Probability that X2 is a blicket after the second observation:
- Therefore, the probability that X2 is a blicket went down after the second observation (from 3/5 to 1/3), which is consistent with the kids' reports. Occam's razor comes from assuming r < 0.5.
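A minimal sketch of the whole computation, assuming a deterministic detector (it fires iff a blicket is placed on it) and r = 1/3 as on slide 64:

```python
from itertools import product

R = 1 / 3  # prior probability that any object is a blicket

def prior(h):
    """p(h_jk) = r^(j+k) (1-r)^(2-(j+k)) for h = (j, k)."""
    n = sum(h)
    return R ** n * (1 - R) ** (2 - n)

def lik(trial, h):
    """Deterministic detector: e = 1 iff a placed object is a blicket."""
    e, x1, x2 = trial
    fires = (x1 and h[0]) or (x2 and h[1])
    return 1.0 if e == int(bool(fires)) else 0.0

def posterior(trials):
    hs = list(product([0, 1], repeat=2))  # h00, h01, h10, h11
    w = {}
    for h in hs:
        p = prior(h)
        for t in trials:
            p *= lik(t, h)
        w[h] = p
    z = sum(w.values())
    return {h: w[h] / z for h in hs}

d1 = (1, 1, 1)  # detector on, both objects placed
d2 = (1, 1, 0)  # detector on, object 1 alone

for data in ([d1], [d1, d2]):
    post = posterior(data)
    p_x2 = post[(0, 1)] + post[(1, 1)]  # networks with the x2 -> e link
    print(f"{len(data)} trial(s): p(x2 is a blicket) = {p_x2:.3f}")
# Prints 0.600 after d1, then 0.333 after d2: explaining away.
```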
66. Theory-based causal inference
- This approach can be generalized to much more complicated generative models.