Lexicalized and Probabilistic Parsing - PowerPoint PPT Presentation

Slides: 28

Provided by: elaine8

Learn more at: https://www.cs.utexas.edu

Transcript and Presenter's Notes

1
Lexicalized and Probabilistic Parsing
Read J&M, Chapter 12.
2
Using Probabilities

Resolving ambiguities:
I saw the Statue of Liberty flying over New York.
Predicting for recognition:
I have to go. vs. I half to go. vs. I half way thought I'd go.

3
It's Mostly About Semantics
He drew one card.
I saw the Statue of Liberty flying over New York.
I saw a plane flying over New York.
Workers dumped sacks into a bin.
Moscow sent more than 100,000 soldiers into Afghanistan.
John hit the ball with the bat.
John hit the ball with the autograph.
Visiting relatives can be trying.
Visiting museums can be trying.
4
How to Add Semantics to Parsing?

The classic approach to this problem:
Ask a semantics module to choose. There are two ways to do that:
Cascade the two systems. Build all the parses, then pass them to semantics to rate them. Combinatorially awful.
Do semantics incrementally. Pass constituents, get ratings, and filter.
In either case, we need to reason about the world.

5
The Modern Approach

The modern approach:
Skip meaning and the corresponding need for a knowledge base and an inference engine.
Notice that the facts about meaning manifest themselves in the probabilities of observed sentences, if there are enough sentences.
Why is this approach in vogue?
Building world models is a lot harder than early researchers realized.
But we do have huge text corpora from which we can draw statistics.

6
Probabilistic Context-Free Grammars
A PCFG is a context-free grammar in which each rule has been augmented with a probability:
A → β [p]
p is the probability that a given nonterminal symbol A will be rewritten as β via this rule.
Another way to think of this is P(A → β | A).
So the sum of all the probabilities of rules with left-hand side A must be 1.
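The sum-to-1 constraint is easy to check mechanically. The following is a minimal sketch, assuming a grammar stored as a dictionary from (left-hand side, right-hand side) to probability. The grammar, its numbers, and the names RULES, lhs_totals, and is_proper are hypothetical and are not taken from the slides.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar: (lhs, rhs) -> probability.
# The numbers are made up for illustration, not taken from the slides.
RULES = {
    ("S", ("NP", "VP")): 0.8,
    ("S", ("Aux", "NP", "VP")): 0.2,
    ("NP", ("Pronoun",)): 0.4,
    ("NP", ("Det", "Noun")): 0.6,
    ("VP", ("Verb",)): 0.3,
    ("VP", ("Verb", "NP")): 0.7,
}

def lhs_totals(rules):
    """Sum the rule probabilities separately for each left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return dict(totals)

def is_proper(rules):
    """True if, for every nonterminal A, the P(A -> beta | A) sum to 1."""
    return all(math.isclose(t, 1.0) for t in lhs_totals(rules).values())

print(is_proper(RULES))
```

A grammar that fails this check does not define a conditional distribution over expansions of some nonterminal, so the parse probabilities computed from it would not be comparable.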
7
A Toy Example
8
How Can We Use These?
In a top-down parser, we can follow the more
likely path first. In a bottom-up parser, we can
build all the constituents and then compare them.
9
The Probability of Some Parse T
P(T) = ∏_{n ∈ T} p(r(n))
where p(r(n)) means the probability that rule r will apply to expand the nonterminal n.
Note the independence assumption.
So what we want is
T*(S) = argmax_{T ∈ τ(S)} P(T)
where τ(S) is the set of possible parses for S.
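One way to read this slide in code: P(T) multiplies the rule probabilities at every nonterminal node, and the parse we want is the most probable member of the set of candidate parses. This is only a sketch, assuming trees are nested tuples of the form (label, child, ...) with words as plain strings. The grammar, its probabilities, and the names RULE_P, parse_prob, and best_parse are hypothetical.

```python
import math

# Hypothetical rule probabilities keyed by (lhs, rhs); values are made up.
RULE_P = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("I",)): 0.5,
    ("NP", ("planes",)): 0.5,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
    ("V", ("saw",)): 1.0,
}

def label(node):
    """A word is its own label; a subtree's label is its first element."""
    return node if isinstance(node, str) else node[0]

def parse_prob(tree):
    """P(T): the product of p(r(n)) over every nonterminal node n in T."""
    if isinstance(tree, str):          # a word applies no rule
        return 1.0
    lhs, children = tree[0], tree[1:]
    rhs = tuple(label(c) for c in children)
    p = RULE_P.get((lhs, rhs), 0.0)    # unseen rules get probability 0
    return p * math.prod(parse_prob(c) for c in children)

def best_parse(parses):
    """Pick the most probable tree from a collection of candidate parses."""
    return max(parses, key=parse_prob)

t = ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", "planes")))
print(parse_prob(t))
```

Each node's factor is looked up without reference to the rest of the tree, which is exactly the independence assumption the slide points out.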
10
An Example
Can you book TWA flights?
11
An Example: The Probabilities
1.5 × 10^-6
1.7 × 10^-6
Note how small the probabilities are, even with this tiny grammar.
12
Using Probabilities for Language Modeling
Since there are fewer grammar rules than there are word sequences, it can be useful, in language modeling, to use grammar probabilities instead of flat n-gram frequencies.
So the probability of some sentence S is the sum of the probabilities of its possible parses:
P(S) = Σ_{T ∈ τ(S)} P(T)
Contrast with
13
Adding Probabilities to a Parser

Adding probabilities to a top-down parser, e.g., Earley:
This is easy: since we're going top-down, we can choose which rule to prefer.
Adding probabilities to a bottom-up parser:
At each step, build the pieces, then add probabilities to them.
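The bottom-up idea can be sketched with a probabilistic CKY chart. The slide does not name a bottom-up algorithm, so CKY is a choice made here for illustration. The sketch assumes a grammar in Chomsky normal form, and the grammar, its probabilities, and the names LEXICAL, BINARY, and pcky are hypothetical.

```python
from collections import defaultdict

# Hypothetical grammar in Chomsky normal form; probabilities are made up.
LEXICAL = {                      # (A, word): P(A -> word)
    ("NP", "workers"): 0.4,
    ("NP", "sacks"): 0.6,
    ("V", "dumped"): 1.0,
}
BINARY = {                       # (A, B, C): P(A -> B C)
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}

def pcky(words):
    """Return best[(i, j)][A]: the highest probability of A over words[i:j]."""
    n = len(words)
    best = defaultdict(dict)
    # Build the smallest pieces first: one-word spans.
    for i, w in enumerate(words):
        for (a, word), p in LEXICAL.items():
            if word == w:
                best[(i, i + 1)][a] = p
    # Combine adjacent pieces into wider spans, keeping the best score.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    pb = best[(i, k)].get(b)
                    pc = best[(k, j)].get(c)
                    if pb is None or pc is None:
                        continue
                    cand = p * pb * pc
                    if cand > best[(i, j)].get(a, 0.0):
                        best[(i, j)][a] = cand
    return best

chart = pcky(["workers", "dumped", "sacks"])
print(chart[(0, 3)].get("S"))
```

Keeping only the best score per category and span is what stops the chart from growing with the number of parses. Recovering the tree itself would additionally require storing back-pointers, which this sketch omits.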

14
Limitations to Attaching Probabilities Just to Rules
Sometimes it's enough to know that one rule applies more often than another:
Can you book TWA flights?
But often it matters what the context is. Consider:
S → NP VP
NP → Pronoun [.8]
NP → LexNP [.2]
But when the NP is the subject, the true probability of a pronoun is .91.
When the NP is the direct object, the true probability of a pronoun is .34.
15
Often the Probabilities Depend on Lexical Choices
I saw the Statue of Liberty flying over New York.
I saw a plane flying over New York.
Workers dumped sacks into a bin.
Workers dumped sacks of potatoes.
John hit the ball with the bat.
John hit the ball with the autograph.
Visiting relatives can be trying.
Visiting museums can be trying.
There were dogs in houses and cats.
There were dogs in houses and cages.
16
The Dogs in Houses Example
The problem is that both parses use the same rules, so they will get the same probabilities assigned to them.
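A small sketch of why a plain PCFG cannot separate the two attachments. It assumes a simplified rule set in which a bare noun is an NP, and the rule probabilities are made up. The names RULE_P, low, high, and prob are hypothetical.

```python
import math
from collections import Counter

# Hypothetical rule probabilities; values are made up for illustration.
RULE_P = {
    "NP -> NP Conj NP": 0.2,
    "NP -> NP PP": 0.3,
    "NP -> Noun": 0.5,
    "PP -> Prep NP": 1.0,
}

# Rules used by each bracketing of "dogs in houses and X".
# [dogs [in [houses and X]]]: the conjunction sits inside the PP.
low = Counter({"NP -> NP PP": 1, "PP -> Prep NP": 1,
               "NP -> NP Conj NP": 1, "NP -> Noun": 3})
# [[dogs [in houses]] and X]: the conjunction joins the two larger NPs.
high = Counter({"NP -> NP Conj NP": 1, "NP -> NP PP": 1,
                "PP -> Prep NP": 1, "NP -> Noun": 3})

def prob(rule_counts):
    """Product of rule probabilities, each raised to its usage count."""
    return math.prod(RULE_P[r] ** k for r, k in sorted(rule_counts.items()))

print(low == high, prob(low) == prob(high))
```

Because the two trees use the same rules the same number of times, no assignment of rule probabilities can prefer one over the other. Only the words themselves (cats versus cages) can break the tie.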
17
The Fix: Use the Lexicon
The lexicon is an approximation to a knowledge base. It will let us treat "into" and "of" differently with respect to dumping without any clue what dumping means or what "into" and "of" mean.
Note the difference between this approach and subcategorization rules, e.g.,
dump SUBCAT NP SUBCAT LOCATION
Subcategorization rules specify requirements, not preferences.
18
Lexicalized Trees
Key idea: Each constituent has a HEAD word.
19
Adding Lexical Items to the Rules
VP(dumped) → VBD(dumped) NP(sacks) PP(into) [3 × 10^-10]
VP(dumped) → VBD(dumped) NP(cats) PP(into) [8 × 10^-10]
VP(dumped) → VBD(dumped) NP(hats) PP(into) [4 × 10^-10]
VP(dumped) → VBD(dumped) NP(sacks) PP(above) [1 × 10^-12]
We need fewer numbers than we would for N-gram frequencies:
The workers dumped sacks of potatoes into a bin.
The workers dumped sacks of onions into a bin.
The workers dumped all the sacks of potatoes into a bin.
But there are still too many, and most will be 0 in any given corpus.
20
Collapsing These Cases
Instead of caring about specific rules like
VP(dumped) → VBD(dumped) NP(sacks) PP(into) [3 × 10^-10]
or about very general rules like
VP → VBD NP PP
we'll do something partway in between:
VP(dumped) → VBD NP PP [p(r(n) | n, h(n))]
21
Computing Probabilities of Heads

We'll let the probability of some node n having head h depend on two factors:
the syntactic category of the node, and
the head of the node's mother (h(m(n))).
So we will compute
P(h(n) = word_i | n, h(m(n)))

[Figure: three head dependencies with probabilities p1, p2, p3: PP(into) under VP(dumped), PP(of) under VP(dumped), and PP(of) under NP(sacks).]
So now we've got probabilistic subcat information.
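Both conditional probabilities can be estimated as relative frequencies from a parsed corpus. The sketch below assumes maximum-likelihood estimation without smoothing. The counts are hypothetical, chosen only so the ratios are easy to read, and the names rule_counts, head_counts, p_rule, and p_head are illustrative.

```python
from collections import Counter

# Hypothetical counts, not real corpus data.
# (category, head, rhs): times a `category` headed by `head` expanded as `rhs`
rule_counts = Counter({
    ("VP", "dumped", ("VBD", "NP", "PP")): 6,
    ("VP", "dumped", ("VBD", "PP")): 3,
})
# (head, category, mother_head): times a `category` node whose mother is
# headed by `mother_head` had head word `head`
head_counts = Counter({
    ("into", "PP", "dumped"): 2,
    ("of", "PP", "dumped"): 1,
    ("in", "PP", "dumped"): 6,
})

def p_rule(rhs, category, head):
    """Estimate p(r(n) | n, h(n)) as a relative frequency."""
    total = sum(c for (cat, h, _), c in rule_counts.items()
                if (cat, h) == (category, head))
    return rule_counts[(category, head, rhs)] / total if total else 0.0

def p_head(head, category, mother_head):
    """Estimate p(h(n) | n, h(m(n))) as a relative frequency."""
    total = sum(c for (_, cat, m), c in head_counts.items()
                if (cat, m) == (category, mother_head))
    return head_counts[(head, category, mother_head)] / total if total else 0.0

print(p_rule(("VBD", "NP", "PP"), "VP", "dumped"))
print(p_head("into", "PP", "dumped"))
```

Unseen combinations come out as 0 here. A practical system would smooth or back off to less specific statistics, which this sketch leaves out.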
22
Revised Rule for Probability of a Parse
Our initial rule:
P(T) = ∏_{n ∈ T} p(r(n))
where p(r(n)) means the probability that rule r will apply to expand the nonterminal n.
Our new rule:
P(T) = ∏_{n ∈ T} p(r(n) | n, h(n)) × p(h(n) | n, h(m(n)))
that is, the probability of choosing this rule given the nonterminal and its head, times the probability that this node has head h given the nonterminal and the head of its mother.
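The revised rule can be sketched as a recursive product over a head-annotated tree. Nodes are assumed to be (category, head, children) triples. Two simplifications are assumptions of this sketch rather than anything stated on the slide: the root has no mother, so its head term is skipped, and childless nodes apply no rule. The tables reuse .67 and .22 from the dumped sacks example, and every other number and name is hypothetical.

```python
import math

# (category, head, rhs) -> p(r(n) | n, h(n))
P_RULE = {("VP", "dumped", ("VBD", "NP", "PP")): 0.67}
# (head, category, mother_head) -> p(h(n) | n, h(m(n)))
P_HEAD = {
    ("dumped", "VBD", "dumped"): 1.0,   # made up
    ("sacks", "NP", "dumped"): 0.5,     # made up
    ("into", "PP", "dumped"): 0.22,
}

def lex_parse_prob(node, p_rule, p_head, mother_head=None):
    """Product over nodes of the rule term times the head term."""
    category, head, children = node
    p = 1.0
    if mother_head is not None:   # assumption: no head term at the root
        p *= p_head.get((head, category, mother_head), 0.0)
    if children:                  # assumption: childless nodes apply no rule
        rhs = tuple(child[0] for child in children)
        p *= p_rule.get((category, head, rhs), 0.0)
    return p * math.prod(
        lex_parse_prob(child, p_rule, p_head, head) for child in children
    )

tree = ("VP", "dumped", [
    ("VBD", "dumped", []),
    ("NP", "sacks", []),
    ("PP", "into", []),
])
print(lex_parse_prob(tree, P_RULE, P_HEAD))
```

Passing each node's head down as mother_head is what lets a PP headed by "into" be scored differently under "dumped" than under "sacks".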
23
So We Can Solve the Dumped Sacks Problem
From the Brown corpus:
p(VP → VBD NP PP | VP, dumped) = .67
p(VP → VBD NP | VP, dumped) = 0
p(into | PP, dumped) = .22
p(into | PP, sacks) = 0
So, the contribution of this part of the parse to the total scores for the two candidates is:
dumped into: .67 × .22 = .147
sacks into: 0 × 0 = 0
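The arithmetic on this slide can be checked directly. The four probabilities below are the ones given on the slide. The dictionary layout and the names vp_attach and np_attach are illustrative.

```python
# Values as given on the slide.
p_rule = {
    ("VP -> VBD NP PP", "dumped"): 0.67,
    ("VP -> VBD NP", "dumped"): 0.0,
}
p_head = {
    ("into", "PP", "dumped"): 0.22,
    ("into", "PP", "sacks"): 0.0,
}

# "dumped into": the PP attaches to the VP, so its mother's head is "dumped".
vp_attach = p_rule[("VP -> VBD NP PP", "dumped")] * p_head[("into", "PP", "dumped")]
# "sacks into": the PP attaches inside the NP, so its mother's head is "sacks".
np_attach = p_rule[("VP -> VBD NP", "dumped")] * p_head[("into", "PP", "sacks")]

print(round(vp_attach, 3), np_attach)   # prints: 0.147 0.0
```

The VP attachment wins because both of its factors are nonzero, while the NP attachment is ruled out by either factor alone.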
24
It's Mostly About Semantics, But It's Also About Psychology

What do people do?
People have limited memory for processing
language.
So we should consider two aspects of language skill:
competence (what could we in principle do?), and
performance (what do we actually do, including mistakes?)

25
Garden Path Sentences

Are people deterministic parsers?
Consider garden path sentences such as:
The horse raced past the barn fell.
The complex houses married and single students and their families.
I told the boy the dog bit Sue would help him.

26
Embedding Limitations
There are limits to the theoretical ability to apply recursion in grammar rules:
The Republicans who the senator who she voted for chastised were trying to cut all benefits for veterans.
Tom figured that that Susan wanted to take the cat out bothered Betsy out. (Church)
Harold heard that John told the teacher that Bill said that Sam thought that Mike threw the first punch yesterday. (Church)
27
Building Deterministic Parsers

What if we impose performance constraints on our
parsers? Will they work?
Require that the parser be deterministic. At any
point, it must simply choose the best parse given
what has come so far and, perhaps, some limited
number of lookahead constituents (Marcus allowed
3).
Limit the amount of memory that the parser may
use. This effectively makes the parser an FSM,
in fact a deterministic FSM.