Title: Syntax
1 Syntax
- Sudeshna Sarkar
- 25 Aug 2008
2 Top-Down and Bottom-Up
- Top-down
- Only searches for trees that can be answers (i.e. Ss)
- But also suggests trees that are not consistent with any of the words
- Bottom-up
- Only forms trees consistent with the words
- But suggests trees that make no sense globally
3 Problems
- Even with the best filtering, backtracking methods are doomed if they don't address certain problems
- Ambiguity
- Shared subproblems
4 Ambiguity
5 Shared Sub-Problems
- No matter what kind of search (top-down or bottom-up or mixed) we choose,
- we don't want to unnecessarily redo work we've already done.
6 Shared Sub-Problems
- Consider
- A flight from Indianapolis to Houston on TWA
7 Shared Sub-Problems
- Assume a top-down parse making bad initial choices on the Nominal rule.
- In particular
- Nominal -> Nominal Noun
- Nominal -> Nominal PP
8 Shared Sub-Problems
9 Shared Sub-Problems
10 Shared Sub-Problems
11 Shared Sub-Problems
12 Parsing
- CKY
- Earley
- Both are dynamic programming solutions that run in O(n^3) time.
- CKY is bottom-up
- Earley is top-down
13 Sample Grammar
14 Dynamic Programming
- DP methods fill tables with partial results and
- Do not do too much avoidable repeated work
- Solve exponential problems in polynomial time (sort of)
- Efficiently store ambiguous structures with shared sub-parts.
15 CKY Parsing
- First we'll limit our grammar to epsilon-free, binary rules (more later)
- Consider the rule A -> B C
- If there is an A in the input then there must be a B followed by a C in the input.
- If the A spans from i to j in the input then there must be some k s.t. i < k < j
- I.e., the B splits from the C someplace.
16 CKY
- So let's build a table so that an A spanning from i to j in the input is placed in cell [i,j] in the table.
- So a non-terminal spanning an entire string will sit in cell [0,n]
- If we build the table bottom-up we'll know that the parts of the A must go from i to k and from k to j
17 CKY
- Meaning that for a rule like A -> B C we should look for a B in [i,k] and a C in [k,j].
- In other words, if we think there might be an A spanning [i,j] in the input AND
- A -> B C is a rule in the grammar THEN
- There must be a B in [i,k] and a C in [k,j] for some i < k < j
18 CKY
- So to fill the table, loop over the cell [i,j] values in some systematic way
- What constraint should we put on that?
- For each cell, loop over the appropriate k values to search for things to add.
19 CKY Table
20 CKY Algorithm
21 CKY Parsing
22 Note
- We arranged the loops to fill the table a column at a time, from left to right, bottom to top.
- This assures us that whenever we're filling a cell, the parts needed to fill it are already in the table (to the left and below), as the sketch below illustrates.
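Below is a minimal Python sketch of that fill order (the n, visited, and assert here are purely illustrative, not part of the CKY algorithm itself); it just demonstrates that when cell [i,j] is reached, every cell [i,k] to its left and every cell [k,j] below it has already been visited.

n = 5
visited = set()
for j in range(1, n + 1):                # columns, left to right
    for i in range(j - 1, -1, -1):       # rows within a column, bottom to top
        for k in range(i + 1, j):        # every split point between i and j
            # the cells we would need are already filled: to the left and below
            assert (i, k) in visited and (k, j) in visited
        visited.add((i, j))
print(len(visited))                      # n*(n+1)/2 = 15 cells for n = 5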
23 Example
24 Other Ways to Do It?
- Are there any other sensible ways to fill the table that still guarantee that the cells we need are already filled?
25 Other Ways to Do It?
26 Sample Grammar
27 Problem
- What if your grammar isn't binary?
- As in the case of the TreeBank grammar?
- Convert it to binary: any arbitrary CFG can be rewritten into Chomsky Normal Form automatically.
- What does this mean?
- The resulting grammar accepts (and rejects) the same set of strings as the original grammar.
- But the resulting derivations (trees) are different.
28 Problem
- More specifically, rules have to be of the form
- A -> B C
- Or
- A -> w
- That is, rules can expand to either 2 non-terminals or to a single terminal.
29 Binarization Intuition
- Eliminate chains of unit productions.
- Introduce new intermediate non-terminals into the grammar that distribute rules with length > 2 over several rules (see the sketch after this list). So
- S -> A B C
- Turns into
- S -> X C
- X -> A B
- Where X is a symbol that doesn't occur anywhere else in the grammar.
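A minimal Python sketch of that step (the function and the fresh symbol names X1, X2, ... are made up for illustration): right-hand sides longer than two symbols are peeled off into new intermediate rules, exactly as in the S -> A B C example.

import itertools

_fresh = itertools.count(1)

def binarize(lhs, rhs):
    """Turn lhs -> rhs (a tuple of symbols) into rules with at most two symbols on the right."""
    rules, rhs = [], list(rhs)
    while len(rhs) > 2:
        new = "X%d" % next(_fresh)              # a symbol that occurs nowhere else in the grammar
        rules.append((new, (rhs[0], rhs[1])))   # X -> first two children
        rhs = [new] + rhs[2:]
    rules.append((lhs, tuple(rhs)))
    return rules

print(binarize("S", ("A", "B", "C")))           # [('X1', ('A', 'B')), ('S', ('X1', 'C'))]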
30 CNF Conversion
31 CKY Algorithm
32 Example
Filling column 5
33 Example
34 Example
35 Example
36 Example
37 END
38 Statistical parsing
- Over the last 12 years statistical parsing has succeeded wonderfully!
- NLP researchers have produced a range of (often free, open source) statistical parsers, which can parse any sentence and often get most of it correct
- These parsers are now a commodity component
- The parsers are still improving year-on-year.
39 Classical NLP Parsing
- Wrote symbolic grammar and lexicon
- S -> NP VP        NN -> interest
- NP -> (DT) NN     NNS -> rates
- NP -> NN NNS      NNS -> raises
- NP -> NNP         VBP -> interest
- VP -> V NP        VBZ -> rates
- ...
- Used proof systems to prove parses from words
- This scaled very badly and didn't give coverage
- Minimal grammar on the "Fed raises" sentence: 36 parses
- Simple 10-rule grammar: 592 parses
- Real-size broad-coverage grammar: millions of parses
40 Classical NLP Parsing: The problem and its solution
- Very constrained grammars attempt to limit unlikely/weird parses for sentences
- But the attempt makes the grammars not robust: many sentences have no parse
- A less constrained grammar can parse more sentences
- But simple sentences end up with ever more parses
- Solution: We need mechanisms that allow us to find the most likely parse(s)
- Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but still quickly find the best parse(s)
41 The rise of annotated data: The Penn Treebank
- ( (S
- (NP-SBJ (DT The) (NN move))
- (VP (VBD followed)
- (NP
- (NP (DT a) (NN round))
- (PP (IN of)
- (NP
- (NP (JJ similar) (NNS increases))
- (PP (IN by)
- (NP (JJ other) (NNS lenders)))
- (PP (IN against)
- (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
- (, ,)
- (S-ADV
- (NP-SBJ (-NONE- ))
- (VP (VBG reflecting)
- (NP
- (NP (DT a) (VBG continuing) (NN decline))
- (PP-LOC (IN in
42 The rise of annotated data
- Going into it, building a treebank seems a lot slower and less useful than building a grammar
- But a treebank gives us many things
- Reusability of the labor
- Broad coverage
- Frequencies and distributional information
- A way to evaluate systems
43 Human parsing
- Humans often do ambiguity maintenance
- Have the police eaten their supper?
- come in and look around.
- taken out and shot.
- But humans also commit early and are garden-pathed
- The man who hunts ducks out on weekends.
- The cotton shirts are made from grows in Mississippi.
- The horse raced past the barn fell.
44 Phrase structure grammars = context-free grammars
- G = (T, N, S, R)
- T is a set of terminals
- N is a set of nonterminals
- For NLP, we usually distinguish out a set P ⊆ N of preterminals, which always rewrite as terminals
- S is the start symbol (one of the nonterminals)
- R is a set of rules/productions of the form X -> γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
- A grammar G generates a language L. (A toy encoding of such a G is sketched below.)
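As a concrete illustration (a toy, assumed encoding, not any standard library class), the tuple G = (T, N, S, R) can be written down directly as data; the handful of rules and words here are borrowed from the Classical NLP Parsing slide.

from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    terminals: frozenset      # T
    nonterminals: frozenset   # N (preterminals such as NN, NNS, VBZ are simply members of N)
    start: str                # S
    rules: tuple              # R: pairs (X, gamma) with X a nonterminal, gamma a tuple of symbols

toy = CFG(
    terminals=frozenset({"interest", "rates"}),
    nonterminals=frozenset({"S", "NP", "VP", "NN", "NNS", "VBZ"}),
    start="S",
    rules=(("S", ("NP", "VP")), ("NP", ("NN", "NNS")), ("VP", ("VBZ", "NP")),
           ("NN", ("interest",)), ("NNS", ("rates",)), ("VBZ", ("rates",))),
)
print(toy.start, len(toy.rules))   # S 6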
45 Probabilistic or stochastic context-free grammars (PCFGs)
- G = (T, N, S, R, P)
- T is a set of terminals
- N is a set of nonterminals
- For NLP, we usually distinguish out a set P ⊆ N of preterminals, which always rewrite as terminals
- S is the start symbol (one of the nonterminals)
- R is a set of rules/productions of the form X -> γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
- P(R) gives the probability of each rule.
- A grammar G generates a language model L.
46 Soundness and completeness
- A parser is sound if every parse it returns is valid/correct
- A parser terminates if it is guaranteed to not go off into an infinite loop
- A parser is complete if for any given grammar and sentence, it is sound, produces every valid parse for that sentence, and terminates
- (For many purposes, we settle for sound but incomplete parsers, e.g., probabilistic parsers that return a k-best list.)
47 Top-down parsing
- Top-down parsing is goal-directed
- A top-down parser starts with a list of constituents to be built. The top-down parser rewrites the goals in the goal list by matching one against the LHS of the grammar rules, and expanding it with the RHS, attempting to match the sentence to be derived.
- If a goal can be rewritten in several ways, then there is a choice of which rule to apply (search problem)
- Can use depth-first or breadth-first search, and goal ordering. (A small sketch follows.)
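A minimal sketch of such a goal-directed parser: a backtracking, depth-first recursive-descent recognizer over a tiny assumed grammar (the rule choice points are exactly the search problem mentioned above); note that it would loop forever on left-recursive rules, one of the problems listed on a later slide.

GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("DT", "N"), ("N",)],
    "VP": [("V", "NP"), ("V",)],
}
LEXICON = {"DT": {"the"}, "N": {"cats", "people"}, "V": {"scratch"}}

def derive(goals, words):
    """Can the goal list be rewritten into exactly `words`? (depth-first, with backtracking)"""
    if not goals:
        return not words                                   # success only if all words are consumed
    goal, rest = goals[0], goals[1:]
    if goal in LEXICON:                                    # preterminal: match one word bottom-up
        return bool(words) and words[0] in LEXICON[goal] and derive(rest, words[1:])
    return any(derive(list(rhs) + rest, words)             # nonterminal: try each rule (choice point)
               for rhs in GRAMMAR.get(goal, []))

print(derive(["S"], "the cats scratch people".split()))    # True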
48 Top-down parsing
49 Bottom-up parsing
- Bottom-up parsing is data-directed
- The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the LHS of the rule.
- Parsing is finished when the goal list contains just the start category.
- If the RHS of several rules match the goal list, then there is a choice of which rule to apply (search problem)
- Can use depth-first or breadth-first search, and goal ordering.
- The standard presentation is as shift-reduce parsing.
50 Shift-reduce parsing: one path
- cats scratch people with claws
- (Each line below shows the stack followed by the remaining input, labelled with the action that produced it.)
- cats scratch people with claws SHIFT
- N scratch people with claws REDUCE
- NP scratch people with claws REDUCE
- NP scratch people with claws SHIFT
- NP V people with claws REDUCE
- NP V people with claws SHIFT
- NP V N with claws REDUCE
- NP V NP with claws REDUCE
- NP V NP with claws SHIFT
- NP V NP P claws REDUCE
- NP V NP P claws SHIFT
- NP V NP P N REDUCE
- NP V NP P NP REDUCE
- NP V NP PP REDUCE
- NP VP REDUCE
- S REDUCE
- What other search paths are there for parsing this sentence? (A sketch that replays the path above follows.)
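A minimal sketch (plain Python; the lexicon and the rules NP -> N, PP -> P NP, VP -> V NP PP, S -> NP VP are assumed from the example) that mechanically replays the path above: SHIFT moves the next word onto the stack, and REDUCE either rewrites the top word as its preterminal or rewrites a matching suffix of the stack as a rule's left-hand side.

LEX = {"cats": "N", "scratch": "V", "people": "N", "with": "P", "claws": "N"}
RULES = [("NP", ("N",)), ("PP", ("P", "NP")), ("VP", ("V", "NP", "PP")), ("S", ("NP", "VP"))]

def shift(stack, buffer):
    return stack + [buffer[0]], buffer[1:]

def reduce_(stack, buffer):
    if stack and stack[-1] in LEX:                   # lexical reduction: word -> preterminal
        return stack[:-1] + [LEX[stack[-1]]], buffer
    for lhs, rhs in RULES:                           # grammar reduction: stack suffix -> LHS
        if tuple(stack[-len(rhs):]) == rhs:
            return stack[:-len(rhs)] + [lhs], buffer
    raise ValueError("nothing to reduce")

stack, buf = [], "cats scratch people with claws".split()
for action in ("SHIFT REDUCE REDUCE SHIFT REDUCE SHIFT REDUCE REDUCE "
               "SHIFT REDUCE SHIFT REDUCE REDUCE REDUCE REDUCE REDUCE").split():
    stack, buf = (shift if action == "SHIFT" else reduce_)(stack, buf)
    print(" ".join(stack + buf), action)             # the new state plus the action that produced it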
51 Problems with top-down parsing
- Left-recursive rules
- A top-down parser will do badly if there are many different rules for the same LHS. Consider if there are 600 rules for S, 599 of which start with NP, but one of which starts with V, and the sentence starts with V.
- Useless work: expands things that are possible top-down but not there
- Top-down parsers do well if there is useful grammar-driven control: search is directed by the grammar
- Top-down is hopeless for rewriting parts of speech (preterminals) with words (terminals). In practice that is always done bottom-up as lexical lookup.
- Repeated work: anywhere there is common substructure
52 Problems with bottom-up parsing
- Unable to deal with empty categories: termination problem, unless rewriting empties as constituents is somehow restricted (but then it's generally incomplete)
- Useless work: locally possible, but globally impossible.
- Inefficient when there is great lexical ambiguity (grammar-driven control might help here)
- Conversely, it is data-directed: it attempts to parse the words that are there.
- Repeated work: anywhere there is common substructure
53 Repeated work
54 Principles for success: take 1
- If you are going to do parsing-as-search with a grammar as is:
- Left-recursive structures must be found, not predicted
- Empty categories must be predicted, not found
- Doing these things doesn't fix the repeated work problem
- Both TD (LL) and BU (LR) parsers can (and frequently do) do work exponential in the sentence length on NLP problems.
55 Principles for success: take 2
- Grammar transformations can fix both left-recursion and epsilon productions (sketched below)
- Then you parse the same language but with different trees
- Linguists tend to hate you
- But this is a misconception: they shouldn't
- You can fix the trees post hoc
- The transform-parse-detransform paradigm
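For instance, the classic transform for direct left recursion rewrites A -> A α | β as A -> β A', A' -> α A' | ε; note that it trades left recursion for an epsilon rule, which then also has to be dealt with. A minimal sketch with illustrative names:

def remove_direct_left_recursion(lhs, rhss):
    """rhss: the right-hand sides of lhs, as tuples. Returns the transformed rule list."""
    rec  = [rhs[1:] for rhs in rhss if rhs and rhs[0] == lhs]    # A -> A alpha
    base = [rhs for rhs in rhss if not rhs or rhs[0] != lhs]     # A -> beta
    if not rec:
        return [(lhs, rhs) for rhs in rhss]
    new = lhs + "'"                                              # fresh symbol
    return ([(lhs, beta + (new,)) for beta in base] +
            [(new, alpha + (new,)) for alpha in rec] +
            [(new, ())])                                         # the epsilon production

# NP -> NP PP | DT NN   becomes   NP -> DT NN NP' ; NP' -> PP NP' ; NP' -> ()
print(remove_direct_left_recursion("NP", [("NP", "PP"), ("DT", "NN")]))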
56 Principles for success: take 3
- Rather than doing parsing-as-search, we do parsing as dynamic programming
- This is the most standard way to do things
- Q.v. CKY parsing, next time
- It solves the problem of doing repeated work
- But there are also other ways of solving the problem of doing repeated work
- Memoization (remembering solved subproblems)
- Also, next time
- Doing graph-search rather than tree-search.
57 Probabilistic or stochastic context-free grammars (PCFGs)
- G = (T, N, S, R, P)
- T is a set of terminals
- N is a set of nonterminals
- For NLP, we usually distinguish out a set P ⊆ N of preterminals, which always rewrite as terminals
- S is the start symbol (one of the nonterminals)
- R is a set of rules/productions of the form X -> γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
- P(R) gives the probability of each rule.
- A grammar G generates a language model L.
58 PCFGs: Notation
- w1n = w1 ... wn : the word sequence from 1 to n (a sentence of length n)
- wab = wa ... wb : the subsequence from a to b
- Njab : the nonterminal Nj dominating (i.e. spanning) wa ... wb
- We'll write P(Ni -> γj) to mean P(Ni -> γj | Ni)
- We'll want to calculate maxt P(t) over parses t of wab
59 The probability of trees and strings
- P(t): the probability of a tree is the product of the probabilities of the rules used to generate it.
- P(w1n): the probability of the string is the sum of the probabilities of the trees which have that string as their yield
- P(w1n) = Σj P(w1n, tj), where tj is a parse of w1n
-        = Σj P(tj)
60 A Simple PCFG (in CNF)
S -> NP VP 1.0
VP -> V NP 0.7
VP -> VP PP 0.3
PP -> P NP 1.0
P -> with 1.0
V -> saw 1.0
NP -> NP PP 0.4
NP -> astronomers 0.1
NP -> ears 0.18
NP -> saw 0.04
NP -> stars 0.18
NP -> telescope 0.1
61 (No Transcript)
62 (No Transcript)
63 Tree and String Probabilities
- w15 = astronomers saw stars with ears
- P(t1) = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18
-       = 0.0009072
- P(t2) = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18
-       = 0.0006804
- P(w15) = P(t1) + P(t2)
-        = 0.0009072 + 0.0006804
-        = 0.0015876
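A quick arithmetic check of those numbers in plain Python (the nine factors are just the rule probabilities read off the two trees):

p_t1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18
p_t2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18
print(round(p_t1, 7), round(p_t2, 7), round(p_t1 + p_t2, 7))   # 0.0009072 0.0006804 0.0015876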
64 Chomsky Normal Form
- All rules are of the form X -> Y Z or X -> w.
- This makes parsing easier/more efficient
65 Treebank binarization
(Pipeline: N-ary Trees in Treebank -> TreeAnnotations.annotateTree -> Binary Trees -> Lexicon and Grammar -> Parsing. TODO: CKY parsing.)
66 An example before binarization
(ROOT (S (NP (N cats)) (VP (V scratch) (NP (N people)) (PP (P with) (NP (N claws))))))
67 After binarization
(ROOT (S (NP (N cats)) (@S->_NP (VP (V scratch) (@VP->_V (NP (N people)) (@VP->_V_NP (PP (P with) (@PP->_P (NP (N claws))))))))))
68 The CKY algorithm (1960/1965)

function CKY(words, grammar) returns most probable parse/prob
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
69 The CKY algorithm (1960/1965)

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
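For concreteness, here is a small runnable Python sketch of the same algorithm, restricted to the binary-plus-lexical PCFG from the earlier "A Simple PCFG (in CNF)" slide (so the unary-handling loops are not needed); variable names mirror the pseudocode above, and the score it finds for S over the whole sentence matches the hand computation on the Tree and String Probabilities slide.

from collections import defaultdict

binary = {  # A -> B C : probability (grammar from the simple PCFG slide)
    ("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0, ("NP", "NP", "PP"): 0.4,
}
lexical = {  # A -> w : probability
    ("P", "with"): 1.0, ("V", "saw"): 1.0, ("NP", "astronomers"): 0.1,
    ("NP", "ears"): 0.18, ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
    ("NP", "telescope"): 0.1,
}

def cky(words):
    n = len(words)
    score = defaultdict(float)   # (begin, end, A) -> best probability
    back = {}                    # (begin, end, A) -> (split, B, C)
    for i, w in enumerate(words):                          # lexical step
        for (A, word), p in lexical.items():
            if word == w:
                score[i, i + 1, A] = p
    for span in range(2, n + 1):                           # binary step
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
    return score, back

def build_tree(back, words, begin, end, A):
    """Follow backpointers to recover the best tree rooted in A over [begin, end)."""
    if (begin, end, A) not in back:
        return (A, words[begin])                           # a lexical cell
    split, B, C = back[begin, end, A]
    return (A, build_tree(back, words, begin, split, B),
               build_tree(back, words, split, end, C))

words = "astronomers saw stars with ears".split()
score, back = cky(words)
print(round(score[0, len(words), "S"], 7))   # 0.0009072, as on the earlier slide
print(build_tree(back, words, 0, len(words), "S"))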
70 CKY chart for "cats scratch walls with claws"
(The empty chart: one cell score[i][j] for each span 0 <= i < j <= 5.)
71 (Lexical step: the first loop of the algorithm fills each diagonal cell [i, i+1] with N -> words[i], P -> words[i], and V -> words[i] and their probabilities.)
72 (Handling unaries then adds NP -> N, @VP->_V -> NP, and @PP->_P -> NP to each diagonal cell.)
73 (First binary step, span 2: cells such as [0,2] get PP -> P @PP->_P and VP -> V @VP->_V, scored as prob = score[begin][split][B] * score[split][end][C] * P(A -> B C); for each A, only the highest-scoring A -> B C is kept.)
74 (Longer spans are filled the same way; each cell keeps the best probability found for each nonterminal, e.g. V -> scratch 0.9285 and NP -> N 0.0859 in cell [1,2].)
75
76 (The completed chart: every cell holds the best score for each nonterminal it can span, with S and ROOT available over the whole sentence in cell [0,5]; finally, buildTree(score, back) is called to recover the best parse.)