Title: Lexicalized and Probabilistic Parsing
1Lexicalized and Probabilistic Parsing
Read J&M, Chapter 12.
2Using Probabilities
- Resolving ambiguities
- I saw the Statue of Liberty flying over New York.
- Predicting for recognition
- I have to go. vs. I half to go.
- vs. I half way thought I'd go.
3It's Mostly About Semantics
He drew one card.
I saw the Statue of Liberty flying over New York.
I saw a plane flying over New York.
Workers dumped sacks into a bin.
Moscow sent more than 100,000 soldiers into Afghanistan.
John hit the ball with the bat.
John hit the ball with the autograph.
Visiting relatives can be trying.
Visiting museums can be trying.
4How to Add Semantics to Parsing?
- The classic approach to this problem
- Ask a semantics module to choose. Two ways to do that:
- Cascade the two systems. Build all the parses, then pass them to semantics to rate them. Combinatorially awful.
- Do semantics incrementally. Pass constituents, get ratings and filter.
- In either case, we need to reason about the world.
5The Modern Approach
- The modern approach
- Skip meaning and the corresponding need for a knowledge base and an inference engine.
- Notice that the facts about meaning manifest themselves in probabilities of observed sentences if there are enough sentences.
- Why is this approach in vogue?
- Building world models is a lot harder than early researchers realized.
- But we do have huge text corpora from which we can draw statistics.
6Probabilistic Context-Free Grammars
A PCFG is a context-free grammar in which each rule has been augmented with a probability:
A → β [p]
Here p is the probability that a given nonterminal symbol A will be rewritten as β via this rule. Another way to think of this is as the conditional probability P(A → β | A). So the sum of all the probabilities of rules with left-hand side A must be 1.
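The normalization constraint above can be checked mechanically. The following is a minimal sketch, assuming a grammar stored as a dictionary from (left-hand side, right-hand side) pairs to probabilities; the rules, the numbers, and the names `RULES` and `is_normalized` are made up for illustration and are not from the slides.

```python
from collections import defaultdict
from math import isclose

# Hypothetical toy grammar: (lhs, rhs) -> probability. All numbers are invented.
RULES = {
    ("S", ("NP", "VP")): 0.8,
    ("S", ("VP",)): 0.2,
    ("NP", ("Pronoun",)): 0.4,
    ("NP", ("Det", "Noun")): 0.6,
    ("VP", ("Verb", "NP")): 0.7,
    ("VP", ("Verb",)): 0.3,
}


def is_normalized(rules):
    """Return True if, for every left-hand side A, the rule probabilities sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    # isclose tolerates floating-point rounding in the sums.
    return all(isclose(total, 1.0) for total in totals.values())
```

Calling `is_normalized(RULES)` on this toy grammar returns `True`; raising any single probability without lowering another for the same left-hand side makes it return `False`.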
7A Toy Example
8How Can We Use These?
In a top-down parser, we can follow the more
likely path first. In a bottom-up parser, we can
build all the constituents and then compare them.
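For the top-down case, "follow the more likely path first" can be read as trying the expansions of a nonterminal in order of decreasing probability. This is only a sketch of that ordering step, not a full parser; the function name `expansions_by_likelihood` and the two illustrative rules are assumptions.

```python
def expansions_by_likelihood(rules, lhs):
    """Return the right-hand sides available for `lhs`, most probable first."""
    options = [(p, rhs) for (a, rhs), p in rules.items() if a == lhs]
    options.sort(key=lambda pair: pair[0], reverse=True)
    return [rhs for _p, rhs in options]


# Illustrative rules only: (lhs, rhs) -> probability.
TOY_RULES = {
    ("NP", ("LexNP",)): 0.2,
    ("NP", ("Pronoun",)): 0.8,
}

ordered = expansions_by_likelihood(TOY_RULES, "NP")
```

A top-down parser would try `ordered[0]` first and fall back to the later entries only if that path fails.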
9The Probability of Some Parse T
$$P(T) = \prod_{n \in T} p(r(n))$$
where p(r(n)) means the probability that rule r will apply to expand the nonterminal n.
Note the independence assumption.
So what we want is
$$\hat{T}(S) = \arg\max_{T \in \tau(S)} P(T)$$
where τ(S) is the set of possible parses for S.
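The product over rules and the choice of the most probable parse can be sketched as follows. The nested-tuple tree encoding, the function names, and every probability below are assumptions made for illustration; the two candidate trees are a simplified stand-in for an ambiguous phrase, not the grammar from the slides.

```python
from math import prod


def rules_used(tree):
    """Yield one (lhs, rhs) rule per nonterminal node of a nested-tuple tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):  # strings are words (leaves)
            yield from rules_used(child)


def parse_probability(tree, rule_prob):
    """P(T): the product of the probabilities of all rules used in the tree."""
    return prod(rule_prob[rule] for rule in rules_used(tree))


def best_parse(candidates, rule_prob):
    """Return the candidate tree with the highest P(T)."""
    return max(candidates, key=lambda t: parse_probability(t, rule_prob))


# Hypothetical grammar fragment; the numbers are invented.
RULE_PROB = {
    ("VP", ("Verb", "NP")): 0.6,
    ("VP", ("Verb", "NP", "NP")): 0.4,
    ("NP", ("Noun", "Noun")): 0.2,
    ("NP", ("Noun",)): 0.8,
    ("Verb", ("book",)): 0.5,
    ("Noun", ("TWA",)): 0.1,
    ("Noun", ("flights",)): 0.2,
}

# Two candidate analyses of "book TWA flights".
T1 = ("VP", ("Verb", "book"), ("NP", ("Noun", "TWA"), ("Noun", "flights")))
T2 = ("VP", ("Verb", "book"), ("NP", ("Noun", "TWA")), ("NP", ("Noun", "flights")))
```

With these invented numbers, `best_parse([T1, T2], RULE_PROB)` selects `T2`; different rule probabilities would of course change the winner.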
10An Example
Can you book TWA flights?
11An Example: The Probabilities
1.5 × 10^-6 and 1.7 × 10^-6
Note how small the probabilities are, even with this tiny grammar.
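Because parse probabilities shrink so quickly, a long enough product of rule probabilities can underflow to zero in floating point. A common implementation choice, which the slides do not discuss and which is offered here only as an aside, is to sum log probabilities instead. The function name `log_score` and the artificial input of 400 rules at probability 0.1 are assumptions.

```python
import math


def log_score(rule_probs):
    """Sum of log probabilities: the log of the product, without the underflow."""
    return sum(math.log(p) for p in rule_probs)


# Artificial input: 400 rule applications, each with probability 0.1.
probs = [0.1] * 400
product = math.prod(probs)  # underflows to 0.0 in double precision
score = log_score(probs)    # stays a finite negative number
```

Since the logarithm is monotonic, comparing log scores ranks parses the same way as comparing the raw products.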
12Using Probabilities for Language Modeling
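Since there are fewer grammar rules than there are word sequences, it can be useful, in language modeling, to use grammar probabilities instead of flat n-gram frequencies. So the probability of some sentence S is the sum of the probabilities of its possible parses:
$$P(S) = \sum_{T \in \tau(S)} P(T)$$
where τ(S) is the set of possible parses for S.
Contrast with disambiguation, where we want only the single most probable parse:
$$\hat{T}(S) = \arg\max_{T \in \tau(S)} P(T)$$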
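As a worked instance of summing over parses, take the two parse probabilities quoted in the earlier example (1.5 × 10^-6 and 1.7 × 10^-6) and assume, for illustration, that these are the only parses of the sentence. The function names are hypothetical.

```python
def sentence_probability(parse_probs):
    """Language-modeling view: P(S) is the sum over all parses of S."""
    return sum(parse_probs)


def best_parse_probability(parse_probs):
    """Disambiguation view: keep only the most probable parse."""
    return max(parse_probs)


# The two values quoted in the example; assumed here to be the only parses.
parse_probs = [1.5e-6, 1.7e-6]
total = sentence_probability(parse_probs)    # about 3.2e-6
best = best_parse_probability(parse_probs)   # 1.7e-6
```

The sum is what a language model needs; the maximum is what a parser reports when it must pick one analysis.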
13Adding Probabilities to a Parser
- Adding probabilities to a top-down parser, e.g., Earley
- This is easy: since we're going top-down, we can choose which rule to prefer.
- Adding probabilities to a bottom-up parser
- At each step, build the pieces, then add probabilities to them.
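One well-known way to "build the pieces, then add probabilities to them" bottom-up is a probabilistic CKY chart. The slides do not name a specific algorithm, so the following is a sketch of that technique under stated assumptions: the grammar is in Chomsky normal form, and the grammar, its probabilities, and the function name `pcky` are invented for illustration.

```python
from collections import defaultdict


def pcky(words, lexical, binary, start="S"):
    """Return (probability, tree) for the best parse, or (0.0, None) if none."""
    n = len(words)
    # best[(i, j)][A] = (probability, tree) of the best A spanning words[i:j]
    best = defaultdict(dict)
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                best[(i, i + 1)][a] = (p, (a, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (a, b, c), p in binary.items():
                    if b in best[(i, k)] and c in best[(k, j)]:
                        pb, tb = best[(i, k)][b]
                        pc, tc = best[(k, j)][c]
                        cand = p * pb * pc
                        # Keep only the most probable way to build A over (i, j).
                        if cand > best[(i, j)].get(a, (0.0, None))[0]:
                            best[(i, j)][a] = (cand, (a, tb, tc))
    return best[(0, n)].get(start, (0.0, None))


# Hypothetical CNF grammar; all probabilities are invented.
LEXICAL = {("NP", "workers"): 0.3, ("NP", "sacks"): 0.3, ("NP", "bins"): 0.3,
           ("V", "dumped"): 1.0, ("P", "into"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.6, ("VP", "VP", "PP"): 0.4,
          ("NP", "NP", "PP"): 0.1, ("PP", "P", "NP"): 1.0}

prob, tree = pcky("workers dumped sacks into bins".split(), LEXICAL, BINARY)
```

With these invented numbers the chart prefers attaching the PP to the verb phrase; a grammar with a higher NP → NP PP probability would attach it to the noun phrase instead.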
14Limitations to Attaching Probabilities Just to Rules
Sometimes it's enough to know that one rule applies more often than another:
Can you book TWA flights?
But often it matters what the context is. Consider:
S → NP VP
NP → Pronoun [.8]
NP → LexNP [.2]
But when the NP is the subject, the true probability of a pronoun is .91. When the NP is the direct object, the true probability of a pronoun is .34.
15Often the Probabilities Depend on Lexical Choices
I saw the Statue of Liberty flying over New York.
I saw a plane flying over New York.
Workers dumped sacks into a bin.
Workers dumped sacks of potatoes.
John hit the ball with the bat.
John hit the ball with the autograph.
Visiting relatives can be trying.
Visiting museums can be trying.
There were dogs in houses and cats.
There were dogs in houses and cages.
16The Dogs in Houses Example
The problem is that both parses used the same
rules so they will get the same probabilities
assigned to them.
17The Fix: Use the Lexicon
The lexicon is an approximation to a knowledge base. It will let us treat "into" and "of" differently with respect to dumping without any clue what dumping means or what "into" and "of" mean.
Note the difference between this approach and subcategorization rules, e.g.,
dump: SUBCAT NP, SUBCAT LOCATION
Subcategorization rules specify requirements, not preferences.
18Lexicalized Trees
Key idea: each constituent has a HEAD word.
19Adding Lexical Items to the Rules
VP(dumped) → VBD(dumped) NP(sacks) PP(into)  [3 × 10^-10]
VP(dumped) → VBD(dumped) NP(cats) PP(into)  [8 × 10^-10]
VP(dumped) → VBD(dumped) NP(hats) PP(into)  [4 × 10^-10]
VP(dumped) → VBD(dumped) NP(sacks) PP(above)  [1 × 10^-12]
We need fewer numbers than we would for N-gram frequencies:
The workers dumped sacks of potatoes into a bin.
The workers dumped sacks of onions into a bin.
The workers dumped all the sacks of potatoes into a bin.
But there are still too many, and most will be 0 in any given corpus.
20Collapsing These Cases
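Instead of caring about specific rules like
VP(dumped) → VBD(dumped) NP(sacks) PP(into)  [3 × 10^-10]
or about very general rules like
VP → VBD NP PP
we'll do something partway in between:
VP(dumped) → VBD NP PP  [p(r(n) | n, h(n))]
That is, the probability of rule r at node n is conditioned on the syntactic category of n and on its head h(n).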
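The slides do not say how p(r(n) | n, h(n)) is obtained. One standard option, assumed here, is relative-frequency estimation from a treebank: count how often each expansion occurs for a given category and head, and divide by the number of times that category and head occur at all. The observation list below is hypothetical, with counts chosen so that the estimate comes out near .67.

```python
from collections import Counter


def rule_given_head(observations):
    """Relative-frequency estimate of p(rule | category, head).

    `observations` is a list of (category, head, rhs) triples, one per
    node seen in a (hypothetical) treebank.
    """
    joint = Counter(observations)
    context = Counter((cat, head) for cat, head, _rhs in observations)
    return {
        (cat, head, rhs): count / context[(cat, head)]
        for (cat, head, rhs), count in joint.items()
    }


# Hypothetical counts: 9 occurrences of VP headed by "dumped".
OBSERVATIONS = (
    [("VP", "dumped", ("VBD", "NP", "PP"))] * 6
    + [("VP", "dumped", ("VBD", "PP"))] * 3
)

estimates = rule_given_head(OBSERVATIONS)
p_vbd_np_pp = estimates[("VP", "dumped", ("VBD", "NP", "PP"))]  # 6/9
# Expansions never observed with this head get no entry, i.e. probability 0.
p_vbd_np = estimates.get(("VP", "dumped", ("VBD", "NP")), 0.0)
```

Conditioning only on the head of the node, rather than on the heads of all its children, is what keeps the number of parameters manageable.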
21Computing Probabilities of Heads
- We'll let the probability of some node n having head h depend on two factors:
- the syntactic category of the node, and
- the head of the node's mother, h(m(n))
- So we will compute
- P(h(n) = word_i | n, h(m(n)))
[Figure: three mother-daughter pairs, each labeled with a head probability: PP(into) under VP(dumped), p = p1; PP(of) under VP(dumped), p = p2; PP(of) under NP(sacks), p = p3.]
So now we've got probabilistic subcat information.
22Revised Rule for Probability of a Parse
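Our initial rule:
$$P(T) = \prod_{n \in T} p(r(n))$$
where p(r(n)) means the probability that rule r will apply to expand the nonterminal n.
Our new rule:
$$P(T) = \prod_{n \in T} p(r(n) \mid n, h(n)) \times p(h(n) \mid n, h(m(n)))$$
The first factor is the probability of choosing this rule given the nonterminal and its head. The second is the probability that this node has head h given the nonterminal and the head of its mother.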
23So We Can Solve the Dumped Sacks Problem
From the Brown corpus:
p(VP → VBD NP PP | VP, dumped) = .67
p(VP → VBD NP | VP, dumped) = 0
p(into | PP, dumped) = .22
p(into | PP, sacks) = 0
So, the contribution of this part of the parse to the total scores for the two candidates is:
dumped into: .67 × .22 = .147
sacks into: 0 × 0 = 0
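The arithmetic above can be reproduced directly. This sketch uses only the four probabilities quoted in the text; the dictionary layout and the names `P_RULE`, `P_HEAD`, and `contribution` are assumptions made for illustration.

```python
# Probabilities as quoted in the text.
# P_RULE keys: (category, head, right-hand side)
P_RULE = {
    ("VP", "dumped", ("VBD", "NP", "PP")): 0.67,
    ("VP", "dumped", ("VBD", "NP")): 0.0,
}
# P_HEAD keys: (head of node, category of node, head of mother)
P_HEAD = {
    ("into", "PP", "dumped"): 0.22,
    ("into", "PP", "sacks"): 0.0,
}


def contribution(rule_key, head_key):
    """Rule probability times head probability for one attachment choice."""
    return P_RULE[rule_key] * P_HEAD[head_key]


# PP attached to the verb phrase headed by "dumped".
vp_attachment = contribution(("VP", "dumped", ("VBD", "NP", "PP")),
                             ("into", "PP", "dumped"))
# PP attached inside the noun phrase headed by "sacks".
np_attachment = contribution(("VP", "dumped", ("VBD", "NP")),
                             ("into", "PP", "sacks"))
```

Here `vp_attachment` rounds to .147 and `np_attachment` is 0, so the verb-phrase attachment wins, matching the figures in the text.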
24It's Mostly About Semantics, But It's Also About Psychology
- What do people do?
- People have limited memory for processing language.
- So we should consider two aspects of language skill:
- competence (what could we in principle do?), and
- performance (what do we actually do, including mistakes?)
25Garden Path Sentences
- Are people deterministic parsers?
- Consider garden path sentences such as
- The horse raced past the barn fell.
- The complex houses married and single students and their families.
- I told the boy the dog bit Sue would help him.
26Embedding Limitations
There are limits to the theoretical ability to apply recursion in grammar rules:
The Republicans who the senator who she voted for chastised were trying to cut all benefits for veterans.
Tom figured that that Susan wanted to take the cat out bothered Betsy out. (Church)
Harold heard that John told the teacher that Bill said that Sam thought that Mike threw the first punch yesterday. (Church)
27Building Deterministic Parsers
- What if we impose performance constraints on our parsers? Will they work?
- Require that the parser be deterministic. At any point, it must simply choose the best parse given what has come so far and, perhaps, some limited number of lookahead constituents (Marcus allowed 3).
- Limit the amount of memory that the parser may use. This effectively makes the parser an FSM, in fact a deterministic FSM.