Title: David Caley
1. Accurate Parsing
('they worry that air the shows , drink too much , whistle johnny b. goode and watch the other ropes , whistle johnny b. goode and watch closely and suffer through the sale', 2.1730387621600077e-11)
- David Caley
- Thomas Folz-Donahue
- Rob Hall
- Matt Marzilli
2. Accurate Parsing: Our Goal
- Given a grammar
- For a sentence S, return the parse tree with the maximum probability conditioned upon S.
- argmax_{t in T} P(t | S), where T is the set of possible parse trees of sentence S
3. Talking Points
- Using the Penn Treebank
- Reading in n-ary trees
- Finding head-tags within n-ary productions
- Converting to binary trees
- Inducing a CFG grammar
- Probabilistic CYK
- Handling unary rules
- Dealing with unknowns
- Dealing with run times
  - Beam search, limiting depth of unary rules, further optimizations
- Example parses and trees
- Lexicalization attempts
4. Using the Penn Treebank: Our Training Data
- Contains tagged data and n-ary trees from a Wall Street Journal corpus.
- Contains some information unneeded by the parser.
- Questionable tagging
  - (JJ the) ??
- Example
5. Using the Penn Treebank: Handling N-ary Trees

( (S (NP-SBJ-1 (NNS Consumers))
     (VP (MD may)
         (VP (VB want)
             (S (NP-SBJ (-NONE- -1))
                (VP (TO to)
                    (VP (VB move)
                        (NP (PRP their) (NNS telephones))
                        (ADVP-DIR (NP (DT a) (RB little))
                                  (RBR closer)
                                  (PP (TO to)
                                      (NP (DT the) (NN TV) (NN set)))))))))))

Functional tags such as NP-SBJ-1 are ignored; we simply treat this as an NP. -NONE- tags are used for traces; these are ignored as well.
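The label clean-up described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the tuple tree representation and the helper names are assumptions.

```python
import re

def normalize_label(label):
    """Strip functional tags and co-indexing from a Treebank label,
    e.g. 'NP-SBJ-1' -> 'NP', 'ADVP-DIR' -> 'ADVP'."""
    head = re.split(r'[-=]', label)[0]
    return head or label  # guard against labels that start with '-'

def strip_traces(tree):
    """Drop -NONE- trace subtrees and normalize the remaining labels.
    A tree is (label, child, ...) with leaves (tag, word)."""
    label, rest = tree[0], tree[1:]
    if label == '-NONE-':
        return None  # trace: remove entirely
    if len(rest) == 1 and isinstance(rest[0], str):
        return (normalize_label(label), rest[0])  # preterminal (tag, word)
    kept = [c for c in (strip_traces(ch) for ch in rest) if c is not None]
    if not kept:
        return None  # all children were traces
    return (normalize_label(label),) + tuple(kept)
```

For example, the NP-SBJ-1 node in the tree above becomes a plain NP, and the (NP-SBJ (-NONE- -1)) subtree disappears.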
6. Using the Penn Treebank: Head-Tag Finding Algorithm
- For a context-free rule X -> Y1 ... Yn, a function determines which child is the head of the rule.
- In the example above, the head could be any of Y1 ... Yn.
- The head is the most important child tag.
- Head-tag algorithm as outlined in Collins's thesis
- Allows us to determine the head-tags that will be used for later binary tree conversion
7. Using the Penn Treebank: Head-Tag Finding Algorithm
If nothing is found in a list traversal, the head-tag becomes the left-most or right-most element.
8. Using the Penn Treebank: Head-Rule Finding Algorithm
- Rules for NPs are a bit different:
- If the last word is tagged POS, return (last-word)
- Else search from right to left for the first child in the set NN, NNP, NNPS, NNS, NX, POS, JJR
- Else search from left to right for the first child which is an NP
- Else search from right to left for the first child in the set $, ADJP, PRN
- Else do the same with the set CD
- Else do the same with the set JJ, JJS, RB, QP
- Else return the last word
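The NP rule above translates directly into a prioritized sequence of directional scans. A minimal sketch (following the slide's rule list, after Collins's thesis; the function name and list-of-labels interface are assumptions):

```python
def np_head_index(children):
    """Return the index of the head child of an NP.
    `children` is the list of child labels, left to right."""
    if children and children[-1] == 'POS':
        return len(children) - 1
    # Each pass is (tag set, scan order over child indices).
    passes = [
        ({'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}, reversed),  # right-to-left
        ({'NP'}, iter),                                                # left-to-right
        ({'$', 'ADJP', 'PRN'}, reversed),
        ({'CD'}, reversed),
        ({'JJ', 'JJS', 'RB', 'QP'}, reversed),
    ]
    for tags, order in passes:
        for i in order(range(len(children))):
            if children[i] in tags:
                return i
    return len(children) - 1  # fall back to the last word
```

For instance, in `NP -> DT JJ NN` the right-to-left noun scan picks the NN, while `NP -> DT NN POS` returns the POS immediately.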
9. Using the Penn Treebank: Binary Tree Conversion
- Now we put the head-tags to use
- Necessary before the CFG grammar can be used with probabilistic CYK
- A general n-ary rule: R -> L_i L_{i-1} ... L_1 L_0 H R_0 R_1 ... R_{i-1} R_i
- L_i L_{i-1} ... L_1 L_0 H R_0 R_1 ... R_{i-1} | R_i
- On the right side of the head tag H we recursively split off the last element to make a new binary rule (left recursive). On the left side we do the same by removing the first element (right recursive).
- L_i | L_{i-1} ... L_1 L_0 H
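The head-outward splitting scheme above can be sketched as follows. This is one plausible implementation of the described procedure, not the project's code; in particular the `@parent|i-j` naming of intermediate symbols is an assumption.

```python
def binarize(parent, children, head_idx):
    """Binarize one n-ary rule parent -> children around the head.
    Right children are peeled off last-first (left recursive), then
    left children first-first (right recursive). Returns a list of
    binary (lhs, (left_child, right_child)) rules."""
    rules = []
    lo, hi = 0, len(children) - 1
    lhs = parent
    while hi > lo:
        if hi > head_idx:
            lo2, hi2 = lo, hi - 1   # split off the rightmost child
        else:
            lo2, hi2 = lo + 1, hi   # split off the leftmost child
        # The remainder is either a single child or a new intermediate symbol.
        inner = children[lo2] if lo2 == hi2 else '@%s|%d-%d' % (parent, lo2, hi2)
        if hi > head_idx:
            rules.append((lhs, (inner, children[hi])))
        else:
            rules.append((lhs, (children[lo], inner)))
        lhs = inner
        lo, hi = lo2, hi2
    return rules
```

An n-ary rule with k children always yields k - 1 binary rules, and the head child ends up at the bottom of the chain.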
10. Using the Penn Treebank: Grammar Induction Procedure
- Once we have binary trees we can easily identify rules and record their frequency
- Identify every production and save it into a Python dictionary
- Frequencies are cached in a local file and read in on subsequent executions
- No immediate smoothing is done on probabilities; the grammar is later trimmed to help with performance
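The dictionary-based counting step might look like the sketch below, which also converts counts into the relative frequencies P(rhs | lhs) that the parser needs (the file-caching step is omitted; the tuple tree layout is an assumption):

```python
from collections import defaultdict

def induce_grammar(trees):
    """Count every production in a collection of binary trees and
    return a dict mapping (lhs, rhs) -> P(rhs | lhs). Trees are
    (label, child, child) tuples with leaves (tag, word)."""
    counts = defaultdict(lambda: defaultdict(int))

    def walk(node):
        label, rest = node[0], node[1:]
        if len(rest) == 1 and isinstance(rest[0], str):
            counts[label][(rest[0],)] += 1          # lexical rule: tag -> word
            return
        counts[label][tuple(ch[0] for ch in rest)] += 1  # binary rule
        for ch in rest:
            walk(ch)

    for t in trees:
        walk(t)

    grammar = {}
    for lhs, rhss in counts.items():
        total = float(sum(rhss.values()))
        for rhs, n in rhss.items():
            grammar[(lhs, rhs)] = n / total
    return grammar
```

Because probabilities are conditioned on the left-hand side, all rules sharing an lhs sum to 1, which is what probabilistic CYK multiplies together.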
11. Probabilistic CYK: The Parsing Step
- We use a probabilistic CYK implementation to parse with our CFG grammar and assign probabilities to the final parse trees.
- Useful for providing multiple parses and disambiguating sentences
- New concerns:
- Unary rules and their lengths
- Runtime (a result of the incredibly large grammar)
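The core of the parsing step is the usual probabilistic CYK recurrence over a binarized grammar. A minimal sketch (the unary-closure and beam-pruning refinements discussed on later slides are omitted here, and the two dict-based grammar layouts are assumptions):

```python
def cyk_parse(words, lexical, binary, start='S'):
    """Return the probability of the best parse of `words` rooted at
    `start`, or 0.0 if none exists. `lexical[word]` is {tag: prob};
    `binary[(left, right)]` is {parent: prob}."""
    n = len(words)
    # chart[i][j] holds the best probability per label for words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in lexical.get(w, {}).items():
            chart[i][i + 1][tag] = p
            back[i][i + 1][tag] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):           # split point
                for l_lab, lp in chart[i][k].items():
                    for r_lab, rp in chart[k][j].items():
                        for parent, rule_p in binary.get((l_lab, r_lab), {}).items():
                            p = lp * rp * rule_p
                            if p > chart[i][j].get(parent, 0.0):
                                chart[i][j][parent] = p
                                back[i][j][parent] = (k, l_lab, r_lab)
    return chart[0][n].get(start, 0.0)
```

The `back` table records the winning split and children for each cell, which is enough to reconstruct the best tree afterwards.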
12. Probabilistic CYK: Handling Unary Rules within the Grammar
- Unary rules of the form X -> Y or X -> a are ubiquitous in our grammar
- The closure of a constituent is needed to determine all the unary productions that can lead to that constituent.
- Def: Closure(X) = ∪ { Closure(Y) : Y -> X }, i.e. all nonterminals that are reachable, by unary rules, from X.
- We implement this iteratively, maintaining a closed list and limiting depth, to prevent possible infinite recursion
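An iterative closure with a depth limit might look like the sketch below. Instead of an explicit closed list it keeps only probability-improving entries, which equally prevents cycles from looping forever; the `grammar_unary[child] -> {parent: prob}` layout and the default depth are assumptions.

```python
def unary_closure(grammar_unary, symbol, max_depth=3):
    """Return {parent: best_prob} for every nonterminal reachable from
    `symbol` via chains of unary rules of length <= max_depth."""
    closure = {symbol: 1.0}
    frontier = [(symbol, 1.0, 0)]
    while frontier:
        child, prob, depth = frontier.pop()
        if depth >= max_depth:
            continue  # cap chain length, as on the slide
        for parent, p in grammar_unary.get(child, {}).items():
            new_p = prob * p
            # Only keep strictly better entries; this also breaks cycles.
            if new_p > closure.get(parent, 0.0):
                closure[parent] = new_p
                frontier.append((parent, new_p, depth + 1))
    return closure
```

Note that long chains decay geometrically in probability, which is why (as a later slide observes) beam search prunes most of them anyway.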
13. Probabilistic CYK: Dealing with Run Times
- Beam search
- Limit the number of nodes saved in each cell of the CYK dynamic programming table.
- Using beam width k, all generations are kept sorted and the k best are saved for the next iteration
- Experiences with k = 100, 200, 1000?
- list size < k
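The per-cell pruning described above reduces to keeping the k highest-probability entries of each chart cell. A minimal sketch (the dict-per-cell representation is an assumption):

```python
import heapq

def prune_cell(cell, k):
    """Beam-prune one CYK chart cell: keep only the k entries of
    {label: probability} with the highest probability."""
    if len(cell) <= k:
        return cell
    return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))
```

With beam width k, each combine step over a split point touches at most k x k label pairs instead of the full grammar's nonterminal set squared, which is where the speedup comes from.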
14. Probabilistic CYK: Dealing with Run Times
- Another optimization was to remove all production rules with frequency < f_c
- Used f_c = 1, 2
- Also limited the depth when calculating the unary closure of a constituent present in our CYK table
- Extensive unary rules were found to greatly slow down our parser
- Long chains of unary productions also have extremely low probabilities, so they are commonly pruned by beam search anyway
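The frequency cutoff amounts to one dictionary filter over the induced rule counts. A minimal sketch, reading the slide's "frequency < f_c" literally (so f_c = 1 removes nothing and f_c = 2 drops singletons):

```python
def trim_grammar(rule_counts, f_c):
    """Drop every production observed fewer than f_c times.
    `rule_counts` maps a rule (any hashable key) -> count."""
    return {rule: n for rule, n in rule_counts.items() if n >= f_c}
```

Since treebank grammars are dominated by once-seen productions, even f_c = 2 shrinks the grammar substantially.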
15. Probabilistic CYK: Random Sentences and Example Trees
- Some random sentences generated from our grammar, with associated probabilities:
- ('buy jam , cocoa and other war-rationed goodies', 0.0046296296296296294)
- ('cartoonist garry trudeau refused to impose sanctions , including petroleum equipment , which go into semiannual payments , including watches , including three , which the federal government , the same company formed by mrs. yeargin school district would be confidential', 2.9911073159300768e-33)
- ('33 men selling individual copies selling securities at the central plaza hotel die', 7.4942533128815141e-08)
16. Probabilistic CYK: Random Sentences and Example Trees
- ('young people believe criticism is led by south korea', 1.3798001044090654e-11)
- ('the purchasing managers believe the art is the often amusing , often supercilious , even vicious chronicle of bank of the issue yen-support intervention', 7.1905882731776209e-1)
17. Example Parse Trees
Two trees for 'buy jam , cocoa and other war-rationed goodies' (tree indentation lost in extraction): a lexicalized binarized tree rooted at S(buy), with VP(buy) -> VB(buy) NP(jam) and intermediate binarized nodes such as NP(jam)-NP(goodies), NP(jam)-CC(and), and NP(jam)-NP(cocoa); and the corresponding head-annotated unlexicalized tree over the same sentence, with nodes such as S-(VP), VP-(VB)-NP, NP-(NP)-NP, and NP-(NNS)JJ-NNS.
18. Accurate Parsing: Conclusion
- Massive Lexicalized Grammar
- Working Probabilistic Parser
- Future Work
- Handle sparsity
- Smooth Probabilities