Title: CYK Parser
1CYK Parser
- By Carla and Cornelia Kempa
2Overview
                          Top-down       Bottom-up
Non-directional methods   Unger parser   CYK parser
3Cocke-Younger-Kasami method
4Recognition phase
5Example grammar
- Number(s) → Integer | Real
- Integer → Digit | Integer Digit
- Real → Integer Fraction Scale
- Fraction → . Integer
- Scale → e Sign Integer | Empty
- Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Empty → ε
- Sign → + | -
6Example sentence: 32.5e+1
- 1. Concentrate on the substrings of the input sentence
7Building the recognition table
16The sentence 32.5e+1 is in the language
- What problems can we already see in this example?
17Another complication: ε-rules
19The ε-problem
- Shortest substrings of any input sentence:
- ε-substrings
We must compute $R_\varepsilon$, the set of non-terminals that derive ε
$R_\varepsilon$ = {Empty, Scale}
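The computation of $R_\varepsilon$ can be sketched as a fixed-point iteration over the example grammar. This is a minimal Python sketch, not the authors' code: the dictionary layout, the name `nullable_nonterminals`, and the `+` alternative for Sign (assumed to have been lost in transcription) are assumptions made for illustration.

```python
# Example grammar; an empty tuple stands for the ε right-hand side.
# The names and the data layout are hypothetical, chosen for this sketch.
DIGITS = "0123456789"
GRAMMAR = {
    "Number":   [("Integer",), ("Real",)],
    "Integer":  [("Digit",), ("Integer", "Digit")],
    "Real":     [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale":    [("e", "Sign", "Integer"), ("Empty",)],
    "Digit":    [(d,) for d in DIGITS],
    "Empty":    [()],
    "Sign":     [("+",), ("-",)],
}

def nullable_nonterminals(grammar):
    """Return the set of non-terminals that derive the empty string."""
    r_eps = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in r_eps:
                continue
            # An alternative derives ε if every symbol in it already does.
            if any(all(sym in r_eps for sym in rhs) for rhs in alternatives):
                r_eps.add(lhs)
                changed = True
    return r_eps

print(sorted(nullable_nonterminals(GRAMMAR)))  # ['Empty', 'Scale']
```

Empty is found in the first pass and Scale in the second, because Scale only becomes nullable once Empty is known to be.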
20Non-empty substrings of the input sentence
- Input: $z = z_1 z_2 z_3 z_4 \ldots z_n$
- Compute the set of non-terminals that derive the substring of z starting at position i, of length l.
21Terminology (also on the handout)
- i: the index we are starting at
- l: the length of this substring
- $R_{s_{i,l}}$: the set of non-terminals deriving the substring $s_{i,l}$
- $s_{i,0} = \varepsilon$
- Set of non-terminals that derive ε:
- $R_{s_{i,0}} = R_\varepsilon$
22$s_{i,l} = z_i z_{i+1} \ldots z_{i+l-1}$
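As a small illustration of this indexing, the substring $s_{i,l}$ can be written as a Python slice. The helper name `substring` is hypothetical, and the sentence is assumed to be 32.5e+1 with a 1-based position i.

```python
def substring(z, i, l):
    """Return s_{i,l}: the l symbols of z starting at 1-based position i."""
    return z[i - 1 : i - 1 + l]

z = "32.5e+1"
print(substring(z, 2, 4))        # 2.5e
print(substring(z, 1, len(z)))   # 32.5e+1
print(repr(substring(z, 3, 0)))  # ''
```

A length of 0 gives the empty string at any position, which matches $s_{i,0} = \varepsilon$.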
23The set of non-terminals deriving the substring $s_{i,l}$: $R_{s_{i,l}}$
- 1.) Substrings of length 0
- $s_{i,0} = \varepsilon$ and $R_{s_{i,0}} = R_\varepsilon$
- 2.) Short substrings
- 3.) Longer substrings (say $l = j$)
- All the information on substrings with $l < j$ is available
24Check each right-hand side (RHS) in the grammar to see if it derives $s_{i,l}$
$s_{i,l}$ is divided into m segments (possibly empty)
$A_1 \rightarrow$ first segment of $s_{i,l}$, $A_2 \rightarrow$ second segment of $s_{i,l}$, ...
25$A_1 \ldots A_m \rightarrow s_{i,l}$
- So $A_1 \rightarrow$ first part of $s_{i,l}$
- (let's say $A_1$ has to derive a first part of $s_{i,l}$ of length k)
- $A_1 \rightarrow s_{i,k}$
- $A_1$ is in the set $R_{s_{i,k}}$
26$A_1 \ldots A_m \rightarrow s_{i,l}$
- Assuming this, $A_2 \ldots A_m$ has to derive the rest:
- $A_2 \ldots A_m \rightarrow s_{i+k,l-k}$
This is attempted for every k
27Problems with this approach
- 1) Consider $A_2 \ldots A_m$
- m could be 1 and $A_1$ a non-terminal
- → We are dealing with a unit rule
- $A_1$ must derive the whole substring $s_{i,l}$ and thus be a member of $R_{s_{i,l}}$
- But that's the set we are computing right now
28Solution to this problem
- $A_1 \rightarrow s_{i,l}$
- Somewhere along the derivation there must be a first step not using a unit rule
- $A_1 \rightarrow B \rightarrow \ldots \rightarrow C \rightarrow s_{i,l}$
- C is the first non-terminal using a non-unit rule in the derivation
29Solution cont.
- At some stage C is added to $R_{s_{i,l}}$
- If we repeat the process again and again:
- At some point B will be added, and in the next step $A_1$ will be added
- → We have to repeat the process again and again until no new non-terminals are added to $R_{s_{i,l}}$
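The repetition described here is a fixed-point iteration. The sketch below shows it on the chain $A_1 \rightarrow B \rightarrow C$ from the slide, assuming C has already been found by a non-unit rule. The function name and the rule list are hypothetical.

```python
def close_under_unit_rules(found, unit_rules):
    """Add every A that has a unit rule A -> B with B already found,
    repeating until a full pass adds nothing new."""
    result = set(found)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in unit_rules:
            if rhs in result and lhs not in result:
                result.add(lhs)
                changed = True
    return result

# Unit rules A1 -> B and B -> C, written as (left-hand side, right-hand side).
unit_rules = [("A1", "B"), ("B", "C")]
print(sorted(close_under_unit_rules({"C"}, unit_rules)))  # ['A1', 'B', 'C']
```

With the rules in this order, B is added in the first pass and A1 only in the second, which is why a single pass is not enough.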
30Problem 2
- ε-rules
- Consider the case where all but one of the $A_t$ derive ε
- $B \rightarrow A_1 A_2 A_3 A_4 A_5 \ldots A_t$
- B and $A_1, \ldots, A_t$ are non-terminals
- $A_2 \ldots A_t$ derive ε
- So what stays is $B \rightarrow A_1$
- A unit rule
31We have computed all the $R_{s_{i,l}}$
- If S is a member of $R_{s_{1,n}}$, the start symbol derives $z = s_{1,n}$ (the input string)
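Putting the pieces together, the general recognition phase (with ε-rules and unit rules still present) might look like the following Python sketch. It is an illustration under stated assumptions, not the presenters' implementation: the grammar encoding, the names `recognize` and `derives`, and the `+` alternative for Sign are assumptions. Each set $R_{s_{i,l}}$ is recomputed until no new non-terminal is added, and each right-hand side is matched by trying every length k for its first symbol.

```python
DIGITS = "0123456789"
GRAMMAR = {  # hypothetical encoding; () is the ε right-hand side
    "Number":   [("Integer",), ("Real",)],
    "Integer":  [("Digit",), ("Integer", "Digit")],
    "Real":     [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale":    [("e", "Sign", "Integer"), ("Empty",)],
    "Digit":    [(d,) for d in DIGITS],
    "Empty":    [()],
    "Sign":     [("+",), ("-",)],
}

def recognize(grammar, start, z):
    n = len(z)
    R = {}  # R[(i, l)]: non-terminals deriving s_{i,l}; i is 1-based

    def derives(rhs, i, l):
        """Can the symbols in rhs derive s_{i,l}, given the sets known so far?"""
        if not rhs:
            return l == 0
        first, rest = rhs[0], rhs[1:]
        if first not in grammar:  # terminal: must match the next input symbol
            return l >= 1 and z[i - 1] == first and derives(rest, i + 1, l - 1)
        # Non-terminal: try every length k (including 0 and l) for its segment.
        return any(first in R.get((i, k), ()) and derives(rest, i + k, l - k)
                   for k in range(l + 1))

    for l in range(n + 1):                # shortest substrings first
        for i in range(1, n - l + 2):
            R[(i, l)] = set()
            changed = True
            while changed:                # repeat until nothing new is added
                changed = False
                for lhs, alternatives in grammar.items():
                    if lhs not in R[(i, l)] and any(
                            derives(rhs, i, l) for rhs in alternatives):
                        R[(i, l)].add(lhs)
                        changed = True
    return start in R[(1, n)]

print(recognize(GRAMMAR, "Number", "32.5e+1"))  # True
print(recognize(GRAMMAR, "Number", "32.5e1"))   # False
```

The inner `while` loop is what handles unit rules such as Number → Real: Real enters the set in one pass and Number in the next.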
32CYK recognition with a grammar in - form
- What restrictions do we want to have on our grammar?
33Useful restrictions
- No ε-rules
- No unit rules
- Limit the length of the right-hand side of each rule, say to two
- What we get out of this:
- A → a
- A → BC
- Where a is a terminal and A, B, C are non-terminals
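These restrictions are easy to check mechanically. The sketch below assumes a grammar stored as a dictionary from non-terminal to a list of right-hand-side tuples; that layout and the name `is_cnf` are hypothetical.

```python
def is_cnf(grammar):
    """True if every rule is A -> a (one terminal) or A -> B C (two non-terminals)."""
    nonterminals = set(grammar)
    for alternatives in grammar.values():
        for rhs in alternatives:
            single_terminal = len(rhs) == 1 and rhs[0] not in nonterminals
            two_nonterminals = len(rhs) == 2 and all(s in nonterminals for s in rhs)
            if not (single_terminal or two_nonterminals):
                return False
    return True

print(is_cnf({"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}))  # True
print(is_cnf({"S": [("A",)], "A": [("a",)]}))                     # False (unit rule)
print(is_cnf({"S": [()]}))                                        # False (ε-rule)
```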
34Chomsky Normal Form
- (not only to annoy students)
- Perfect grammar for CYK
35How CYK works for a grammar in CNF
- $R_\varepsilon$ is empty
- $R_{s_{i,1}}$ can be read directly from the rules (A → a)
- A rule A → BC can never derive a single terminal
36Procedure
- Iteratively (as before):
- 1) Fill the sets $R_{s_{i,1}}$ directly (all substrings of length 1)
- 2) Process all substrings of length 2
- 3) Process all substrings of length 3
- 4) Continue up to substrings of length l
- For the first step we use the rules of the form A → a
- For all the following steps we have to use the rules of the form A → BC
37CYK and CNF
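A rule A → BC derives $s_{i,l}$ if there is a k such that B derives $s_{i,k}$ and C derives $s_{i+k,l-k}$
The question the CYK parser has to answer is: does such a k exist?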
38Answering this question is easy
- Just try all possibilities
- (no problem, since you are a computer :-))
- k ranges from 1 to l-1
- All the sets $R_{s_{i,k}}$ and $R_{s_{i+k,l-k}}$ have already been computed at this point
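Written out as a formula, the two cases for a grammar in CNF could be stated as follows. This is a reconstruction from the definitions on the preceding slides, using the same symbols, and it assumes the amsmath package.

```latex
\begin{align*}
R_{s_{i,1}} &= \{\, A \mid A \rightarrow z_i \text{ is a rule} \,\} \\
R_{s_{i,l}} &= \{\, A \mid A \rightarrow BC \text{ is a rule and }
   \exists k,\ 1 \le k \le l-1:\
   B \in R_{s_{i,k}},\ C \in R_{s_{i+k,\,l-k}} \,\}, \quad l \ge 2
\end{align*}
```

Both sets on the right-hand side of the second line belong to substrings shorter than l, so they are available when $R_{s_{i,l}}$ is computed.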
39Transform our sample CF grammar into Chomsky Normal Form
- Overview
- 1) Eliminate ε-rules
- 2) Eliminate unit rules
- 3) Remove non-productive non-terminals
- 4) Remove non-reachable non-terminals
- 5) Modify the rest until all grammar rules are of the form A → a, A → BC
40Our number grammar in CNF
- Number(s) → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Number(s) → Integer Digit
- Number(s) → N1 Scale | Integer Fraction
- N1 → Integer Fraction
- Integer → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Integer → Integer Digit
- Fraction → T1 Integer
- T1 → .
- Scale → N2 Integer
- N2 → T2 Sign
- T2 → e
- Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Sign → + | -
41Building the recognition table
- Input:
- Our example grammar in CNF
- Input sentence: 32.5e+1
42Building the recognition table
- 1) Bottom row: read directly from the grammar (rules of the form A → a)
- 2) Check each RHS in the grammar
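The two steps can be sketched in Python for the number grammar in CNF. This is a minimal sketch under stated assumptions: the rule tables, the names `cyk_table` and `recognize`, and the `+` alternative for Sign (taken to be lost in transcription) are not from the slides.

```python
DIGITS = "0123456789"

# Rules A -> a, stored as terminal -> set of non-terminals producing it.
TERMINAL_RULES = {d: {"Number", "Integer", "Digit"} for d in DIGITS}
TERMINAL_RULES.update({".": {"T1"}, "e": {"T2"}, "+": {"Sign"}, "-": {"Sign"}})

# Rules A -> B C, stored as (A, B, C) triples.
BINARY_RULES = [
    ("Number", "Integer", "Digit"),
    ("Number", "N1", "Scale"),
    ("Number", "Integer", "Fraction"),
    ("N1", "Integer", "Fraction"),
    ("Integer", "Integer", "Digit"),
    ("Fraction", "T1", "Integer"),
    ("Scale", "N2", "Integer"),
    ("N2", "T2", "Sign"),
]

def cyk_table(z):
    """Build R[(i, l)] for every non-empty substring s_{i,l} of z (i is 1-based)."""
    n = len(z)
    R = {}
    # Bottom row: substrings of length 1, read directly from the A -> a rules.
    for i in range(1, n + 1):
        R[(i, 1)] = set(TERMINAL_RULES.get(z[i - 1], ()))
    # Longer substrings, shortest first, using only the A -> B C rules.
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):  # every split position
                left, right = R[(i, k)], R[(i + k, l - k)]
                for a, b, c in BINARY_RULES:
                    if b in left and c in right:
                        cell.add(a)
            R[(i, l)] = cell
    return R

def recognize(z, start="Number"):
    """True if the start symbol is in the top cell of the table."""
    return len(z) > 0 and start in cyk_table(z)[(1, len(z))]

print(recognize("32.5e+1"))  # True
print(recognize("32.5e1"))   # False: Scale needs a Sign
```

Each cell is computed in a single pass because it depends only on cells for shorter substrings.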
43Check each RHS of the grammar
- Two ways. Example: 2.5e ($s_{2,4}$)
- 1) Check each RHS, e.g. N1 Scale
- 2) Compute possible right-hand sides from the recognition table
44How this is done
- 1) N1 is not in $R_{s_{2,1}}$ or $R_{s_{2,2}}$
- N1 is a member of $R_{s_{2,3}}$
- But Scale is not a member of $R_{s_{5,1}}$
- 2) $R_{s_{2,4}}$ is the set of non-terminals that have an RHS AB where either
- A in $R_{s_{2,1}}$ and B in $R_{s_{3,3}}$
- or A in $R_{s_{2,2}}$ and B in $R_{s_{4,2}}$
- or A in $R_{s_{2,3}}$ and B in $R_{s_{5,1}}$
- Possible combinations: N1 T2 or Number T2
- In our grammar we do not have such an RHS, so nothing is added to $R_{s_{2,4}}$.
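The second way can be traced in a few lines of Python. The table entries below were worked out by hand from the CNF grammar for the sentence 32.5e+1 and should be read as assumptions, as should the names `candidate_pairs` and `BINARY_RHS`.

```python
# Recognition-table cells needed for s_{2,4} = "2.5e" (hand-computed, 1-based i).
R = {
    (2, 1): {"Number", "Integer", "Digit"},  # "2"
    (2, 2): set(),                           # "2."
    (2, 3): {"Number", "N1"},                # "2.5"
    (3, 3): set(),                           # ".5e"
    (4, 2): set(),                           # "5e"
    (5, 1): {"T2"},                          # "e"
}

# Right-hand sides of the A -> B C rules, mapped to their left-hand sides.
BINARY_RHS = {
    ("Integer", "Digit"): {"Number", "Integer"},
    ("N1", "Scale"): {"Number"},
    ("Integer", "Fraction"): {"Number", "N1"},
    ("T1", "Integer"): {"Fraction"},
    ("N2", "Integer"): {"Scale"},
    ("T2", "Sign"): {"N2"},
}

def candidate_pairs(i, l):
    """All pairs (A, B) with A in R[s_{i,k}] and B in R[s_{i+k,l-k}], 1 <= k < l."""
    pairs = set()
    for k in range(1, l):
        for a in R[(i, k)]:
            for b in R[(i + k, l - k)]:
                pairs.add((a, b))
    return pairs

pairs = candidate_pairs(2, 4)
print(sorted(pairs))  # [('N1', 'T2'), ('Number', 'T2')]

# Keep only the pairs that occur as a right-hand side in the grammar.
cell = set()
for pair in pairs:
    cell |= BINARY_RHS.get(pair, set())
print(cell)  # set()
```

Only the split k = 3 yields any pairs, and neither of them is a right-hand side in the grammar, so the cell for $s_{2,4}$ stays empty.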
45Recognition table
46Recognition table (well-formed substring table)
47Computing $R_{s_{i,l}}$: follow the arrows V and W simultaneously
- A → BC,
- B a member of a set on the V arrow,
- C a member of a set on the W arrow
48Comparison
- This process is much less complicated than the one we saw before
- Why?
49Conclusion
- This process is much less complicated
- Reasons:
- 1) We do not have to repeat the process again and again until no new non-terminals are added to $R_{s_{i,l}}$
- (The substrings we are dealing with are really substrings and cannot be equal to the string we start with)
50Reasons cont.
- 2) We only have to find one place where the substring must be split into two
- A → B C (the split falls between B and C)
51Result of the algorithm we have seen so far
- Complete collection of sets $R_{s_{i,l}}$
- These sets can be organized in a triangular table
52Cost of the CYK algorithm
- Operations dependent on n, the number of input symbols:
- $n(n+1)/2$ substrings to be examined
- For each substring, at most n-1 different k-positions in the worst case
53Cost of the CYK algorithm cont.
- All other operations are independent of n
- → The algorithm works in a time at most proportional to $n^3$
- → That's far more efficient than exhaustive search (time exponential in the length of the input sentence)
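The counts behind this bound can be checked with a short loop. The sketch below counts substrings and split positions for a CNF grammar; the function name `count_work` is hypothetical, and the closed form $(n-1)n(n+1)/6$ for the number of split positions is derived here rather than taken from the slides.

```python
def count_work(n):
    """Count the substrings s_{i,l} with l >= 1 and the split positions k tried."""
    substrings = 0
    splits = 0
    for l in range(1, n + 1):
        for i in range(1, n - l + 2):
            substrings += 1
            splits += l - 1  # k ranges over 1 .. l-1
    return substrings, splits

for n in (7, 14, 28):
    substrings, splits = count_work(n)
    assert substrings == n * (n + 1) // 2
    assert splits == (n - 1) * n * (n + 1) // 6
    print(n, substrings, splits)
# 7 28 56
# 14 105 455
# 28 406 3654
```

Doubling n multiplies the number of split positions by roughly eight, which is the cubic growth stated above.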