Title: CFL Pumping Lemma
1. CFL Pumping Lemma
- Similar to regular-language PL, but you have to
pump two strings in the middle of the string, in
tandem (i.e., the same number of copies of each).
Formally:
- ∀ CFL L
- ∃ integer n
- ∀ z in L with |z| ≥ n
- ∃ u, v, w, x, y with z = uvwxy such that |vwx| ≤ n and |vx| > 0
- ∀ i ≥ 0, uv^i wx^i y is in L.
2. Outline of Proof of PL
- Let there be a Chomsky-normal-form CFG for L with
m variables. Pick n = 2^m.
- Because CNF grammars have bodies of no more than
2 symbols, the parse tree of a string z of length at
least n must have some path with at least m + 1 variables.
- Thus, some variable must appear twice on the path.
- Compare with the DFA argument about a path longer
than the number of states.
- Focus on some path that is as long as any path in
the tree. In this path, we can find a duplication
of some variable A among the bottom m + 1
variables on the path.
- Let the lower A derive w and the upper A derive vwx.
3.
- CNF guarantees us that |vwx| ≤ n and vx ≠ ε.
- By repeatedly replacing the lower A's tree by the
upper A's tree, we see uv^i wx^i y has a parse tree
for all i ≥ 1.
- And replacing the upper by the lower shows the
case i = 0, i.e., uwy is in L.
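5.
- Consider the derivation S ⇒* uAy ⇒* uvAxy ⇒* uvwxy,
so that S ⇒* uv^i wx^i y ∈ L.
[Figure: parse tree with root S; the upper A yields vwx,
the lower A yields w, and u and y lie to the left and
right of the upper A's subtree.]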
6. Another formulation
- A path in a binary tree representing a derivation
for a grammar in CNF either is empty or consists
of a node, one of its descendants, and all nodes
in between.
- The length of a path is the number of nodes it contains.
- The height of a binary tree is the length of its
longest path.
LEMMA: For any h ≥ 1, a binary tree having more
than 2^h leaf nodes must have a height greater
than h + 1.
7.
- The pumping lemma constant for CFLs is 2^(n+1), where n
is the number of variables in the grammar in CNF.
- Why? Because the derivation tree for a
sufficiently long string must have a height of at
least n + 2.
- It has more than 2^n leaf nodes, and therefore its
height is greater than n + 1.
- With n non-terminals, a string must have a tree with
height n + 2 to be sufficiently long for the PL to
be applied.
- A height of n + 1 gets us to the level of the
bottom-most non-leaf nodes (nodes labeled with a
non-terminal).
- Consider a leaf and the n + 1 nodes above it:
since there are only n variables, one must appear
twice.
8.
[Figure: derivation tree with root S and internal nodes
C1 to C9, A, and B for z = (ab)(b)(ab)(b)(a), split as
u v w x y. Annotations: root of vwx; height of this
subtree ≤ n + 2; cannot overlap w; |vx| > 0.]
9. Example
The classic non-CFL
- L = {a^i b^i c^i | i ≥ 0} is not a CFL.
- Suppose it were. Then let n be the PL constant for L.
- Consider z = a^n b^n c^n. We can write z = uvwxy, with
|vwx| ≤ n and |vx| > 0 (i.e., either v or x is
non-empty).
- Two cases to consider:
- Both v and x contain only one type of alphabet
symbol: v does not contain both a's and b's or
both b's and c's, and the same holds for x. But
in this case uv^2 wx^2 y cannot contain equal numbers
of a's, b's, and c's.
- Either v or x contains more than one type of
symbol: in this case uv^2 wx^2 y may contain equal
numbers of a's, b's, and c's, but they won't be in
the correct order.
- One of these cases must occur, and both result in
contradiction. So the assumption that L is a CFL
is false.
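The case analysis above can be checked by brute force for one small value of n. The sketch below is only an illustration, not part of the proof: the helper names (`decompositions`, `pump`, `in_abc`, `pumped_twice_stays_in`) are made up for this example, and n = 4 merely stands in for the unknown pumping-lemma constant.

```python
def decompositions(z, n):
    """Yield every split z = u v w x y with |vwx| <= n and |vx| > 0."""
    for start in range(len(z)):                      # vwx = z[start:end]
        for end in range(start + 1, min(start + n, len(z)) + 1):
            for a in range(start, end + 1):          # v = z[start:a]
                for b in range(a, end + 1):          # w = z[a:b], x = z[b:end]
                    v, x = z[start:a], z[b:end]
                    if v or x:
                        yield z[:start], v, z[a:b], x, z[end:]


def pump(parts, i):
    """Build u v^i w x^i y from a split (u, v, w, x, y)."""
    u, v, w, x, y = parts
    return u + v * i + w + x * i + y


def in_abc(s):
    """Membership test for { a^k b^k c^k | k >= 0 }."""
    k = len(s) // 3
    return s == "a" * k + "b" * k + "c" * k


def pumped_twice_stays_in(z, n, member):
    """Splits of z (if any) for which u v^2 w x^2 y is still a member."""
    return [p for p in decompositions(z, n) if member(pump(p, 2))]


n = 4  # hypothetical stand-in for the pumping-lemma constant
z = "a" * n + "b" * n + "c" * n
print(pumped_twice_stays_in(z, n, in_abc))  # [] : no split survives i = 2
```

An empty result means every admissible split already fails at i = 2, which is what both cases of the argument predict.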
10. Example
- L = {0^(k^2) | k is any integer} is not a CFL.
- Suppose it were. Then let n be the PL constant for L.
- Consider z = 0^(n^2). We can write z = uvwxy, with
|vwx| ≤ n and |vx| > 0.
- Then uvvwxxy should be in L.
- But n^2 < |uvvwxxy| ≤ n^2 + n < (n + 1)^2, so there is
no perfect square that |uvvwxxy| could be.
- By "proof by contradiction," L is not a CFL.
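The length bound n^2 < |uvvwxxy| ≤ n^2 + n < (n + 1)^2 can be sanity-checked numerically. This is a small sketch with made-up helper names (`is_square`, `pumped_lengths`); it only tests a finite range of n and is no substitute for the inequality itself.

```python
from math import isqrt


def is_square(m):
    """True if m is a perfect square."""
    r = isqrt(m)
    return r * r == m


def pumped_lengths(n):
    """Possible lengths of uvvwxxy when |z| = n^2 and 1 <= |vx| <= n."""
    return [n * n + k for k in range(1, n + 1)]


# No candidate length is a perfect square, for every n tried here.
for n in range(1, 200):
    assert not any(is_square(m) for m in pumped_lengths(n))
```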
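11. Example
- L = {ww | w ∈ {0,1}*} is not a CFL.
- Suppose it were. Then let n be the PL constant for L.
- Choosing a string is less obvious for this language.
- Try z = 0^n 1 0^n 1.
- But it can be pumped, for example by dividing it so that
v is the last 0 of the first block, w = 1, and x is the
first 0 of the second block:
uv^i wx^i y = 0^(n-1+i) 1 0^(n-1+i) 1, which is in L.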
12.
- Try another string: 0^n 1^n 0^n 1^n
- It seems to capture more of the "essence" of the
language.
- Use the PL condition that the string can be pumped by
dividing into z = uvwxy, where |vwx| ≤ n.
- vwx must straddle the midpoint of z. Otherwise,
if vwx lies only in the first half of z, pumping up
to uv^2 wx^2 y moves a 1 into the first position of
the second half, so it cannot be of form ww. If it lies
in the second half, a 0 is moved into the last
position of the first half, so it cannot be of form
ww.
- If vwx straddles the midpoint of z, pumping z
down to uwy yields 0^n 1^i 0^j 1^n, where i and j cannot
both be n. This string cannot be of form ww.
Contradiction!
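The difference between the two candidate strings can be seen by brute force for a small n. This is a sketch under stated assumptions: the helper names are invented for this example, n = 3 stands in for the unknown pumping constant, and only the exponents 0 to 4 are tried, so a returned split shows that those exponents give no contradiction rather than proving that all of them work.

```python
def decompositions(z, n):
    """Yield every split z = u v w x y with |vwx| <= n and |vx| > 0."""
    for start in range(len(z)):
        for end in range(start + 1, min(start + n, len(z)) + 1):
            for a in range(start, end + 1):
                for b in range(a, end + 1):
                    v, x = z[start:a], z[b:end]
                    if v or x:
                        yield z[:start], v, z[a:b], x, z[end:]


def pump(parts, i):
    """Build u v^i w x^i y from a split (u, v, w, x, y)."""
    u, v, w, x, y = parts
    return u + v * i + w + x * i + y


def in_ww(s):
    """Membership test for { ww | w in {0,1}* }."""
    half, odd = divmod(len(s), 2)
    return not odd and s[:half] == s[half:]


def pumpable(z, n, member, exponents=range(5)):
    """Return a split that stays in the language for all tested exponents, else None."""
    for parts in decompositions(z, n):
        if all(member(pump(parts, i)) for i in exponents):
            return parts
    return None


n = 3  # hypothetical stand-in for the pumping-lemma constant
first_try = "0" * n + "1" + "0" * n + "1"
second_try = "0" * n + "1" * n + "0" * n + "1" * n
print(pumpable(first_try, n, in_ww))   # some split: this string gives no contradiction
print(pumpable(second_try, n, in_ww))  # None: every split fails at i = 0 or i = 2
```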
13. Example
- L = {a^i b^j c^k | i < j < k} is not a CFL.
- Suppose it were. Then let n be the PL constant for L.
Consider z = a^n b^(n+1) c^(n+2). We can write z = uvwxy,
with |vwx| ≤ n and |vx| > 0, and uv^i wx^i y ∈ L for
every i ≥ 0.
This time we must pump down as well as pump
up. First we consider the case where vx contains
at least one a. Then since |vwx| ≤ n, vx can
contain no c's. Therefore, uv^2 wx^2 y has at least
n + 1 a's and exactly n + 2 c's, which is
impossible for strings in L. If vx contains no
a's, then it must contain either b or c. In this
case, uv^0 wx^0 y = uwy has either fewer than n + 1
b's or fewer than n + 2 c's, but in either case
exactly n a's. This is also impossible for
strings in L.
By "proof by contradiction," L is not a CFL.
14. Example
- L = {a^i b^j c^k | 0 ≤ i ≤ j ≤ k} is not a CFL.
- Suppose it were. Then let n be the PL constant for L.
Consider z = a^n b^n c^n. We can write z = uvwxy, with
|vwx| ≤ n and |vx| > 0, and uv^i wx^i y ∈ L for every
i ≥ 0.
- When both v and x contain only one type of
symbol, v does not contain both a's and b's or both
b's and c's, and the same holds for x. At least one
of the three symbols is then missing from vx, so we must
divide into three sub-cases:
- No a's. Then try pumping down to obtain uv^0 wx^0 y
= uwy. It contains fewer b's or fewer c's than a's.
- No b's. Then either a's or c's must appear in v
or x because both can't be the empty string. If
a's appear, then uv^2 wx^2 y contains more a's than
b's. If c's appear, then uv^0 wx^0 y contains more
b's than c's.
- No c's. The string uv^2 wx^2 y contains more a's or
more b's than c's.
- When either v or x contains more than one type of
symbol, uv^2 wx^2 y will not contain symbols in the
correct order.
By "proof by contradiction," L is not a CFL.
15. Example
- L = {xyx | x, y ∈ {a,b}* and |x| ≥ 1} is not a CFL.
- Suppose it were. Then let n be the PL constant for L.
Let z = a^n b^n a^n b^n. Then z = uvwxy for some u, v,
w, x, and y, satisfying |vx| > 0, |vwx| ≤ n, and
uv^i wx^i y ∈ L for every i ≥ 0.
Suppose that vx contains either only a's from
the first group or only b's from the last group.
Then uv^2 wx^2 y is either a^(n+i) b^n a^n b^n or
a^n b^n a^n b^(n+i) for some 0 < i ≤ n, and in neither
case can this string be in the form xyx for any x with
|x| > 0. Otherwise, vx contains either a b from the
first group or an a from the second. In this case
uv^0 wx^0 y is either a^i b^j a^k b^n or a^n b^i a^j b^k,
where in either case i and k are positive and j < n.
Neither of these strings can be in the form required for
L either.
By "proof by contradiction," L is not a CFL.
16. Closure Properties of CFLs
- CFLs are closed under union, concatenation, and
Kleene star.
- CFLs are closed under reversal.
- As for regular languages, the proofs are by
construction.
17. If L1 and L2 are CFLs, then L1 ∪ L2 is a CFL
Start with CFGs G1 = (V1, Σ, S1, P1) and G2 =
(V2, Σ, S2, P2)
- Construct Gu = (Vu, Σ, Su, Pu) generating L1 ∪ L2
- rename elements of V2 if necessary so that V1 ∩ V2 = ∅
- define Vu = V1 ∪ V2 ∪ {Su}, where Su ∉ V1 ∪ V2
- define Pu = P1 ∪ P2 ∪ {Su → S1 | S2}
If L1 and L2 are CFLs, then L1L2 is a CFL
- Construct Gc = (Vc, Σ, Sc, Pc) generating L1L2
- rename elements of V2 so that V1 ∩ V2 = ∅
- define Vc = V1 ∪ V2 ∪ {Sc}, where Sc ∉ V1 ∪ V2
- define Pc = P1 ∪ P2 ∪ {Sc → S1S2}
If L1 is a CFL, then L1* is a CFL
- Construct G = (V, Σ, S, P) generating L1*
- define V = V1 ∪ {S}, where S ∉ V1
- define P = P1 ∪ {S → S1S | ε}
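The three constructions translate almost directly into code. The following is a minimal sketch, assuming a made-up `CFG` record (variables, start symbol, productions as tuples of symbols, with `()` for ε) and a made-up fresh-name helper. It requires the variable sets to be disjoint already instead of renaming V2, and it assumes the fresh start symbol does not clash with a terminal.

```python
from typing import NamedTuple


class CFG(NamedTuple):
    variables: frozenset
    start: str
    productions: dict  # variable -> list of bodies; a body is a tuple of symbols, () is epsilon


def fresh_start(used, base="S0"):
    """Pick a start-symbol name that is not in `used` (naming scheme is arbitrary)."""
    name = base
    while name in used:
        name += "'"
    return name


def union(g1, g2):
    """Su -> S1 | S2"""
    assert g1.variables.isdisjoint(g2.variables), "rename V2 apart from V1 first"
    s = fresh_start(g1.variables | g2.variables)
    prods = {**g1.productions, **g2.productions, s: [(g1.start,), (g2.start,)]}
    return CFG(g1.variables | g2.variables | {s}, s, prods)


def concat(g1, g2):
    """Sc -> S1 S2"""
    assert g1.variables.isdisjoint(g2.variables), "rename V2 apart from V1 first"
    s = fresh_start(g1.variables | g2.variables)
    prods = {**g1.productions, **g2.productions, s: [(g1.start, g2.start)]}
    return CFG(g1.variables | g2.variables | {s}, s, prods)


def star(g1):
    """S -> S1 S | epsilon"""
    s = fresh_start(g1.variables)
    prods = {**g1.productions, s: [(g1.start, s), ()]}
    return CFG(g1.variables | {s}, s, prods)


# Example grammars (illustrative only): A generates a^n b^n, B generates c^n, n >= 1.
g1 = CFG(frozenset({"A"}), "A", {"A": [("a", "A", "b"), ("a", "b")]})
g2 = CFG(frozenset({"B"}), "B", {"B": [("c", "B"), ("c",)]})
print(union(g1, g2).productions["S0"])   # [('A',), ('B',)]
print(concat(g1, g2).productions["S0"])  # [('A', 'B')]
print(star(g1).productions["S0"])        # [('A', 'S0'), ()]
```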
18. Non-closure Under Intersection
- The following language L = {0^i 1^j 2^k 3^l | i = k and
j = l} is not a CFL.
- Intuitively, you need a variable and productions
like A → 0A2 | 02 to generate the matching 0's and
2's, while you need another variable to generate
matching 1's and 3's. But these variables would
have to generate strings that did not interleave.
- However, the simpler language {0^i 1^j 2^k 3^l | i = k}
is a CFL.
- A grammar:
- S → S3 | A
- A → 0A2 | B
- B → 1B | ε
- Likewise the CFL {0^i 1^j 2^k 3^l | j = l}.
- Their intersection is L.
19. Closure under Complement?
- The complements of some CFLs are also CFLs
- Example: {a^n b^n | n ≥ 0}
- Its complement can be accepted by a PDA
- swap the accepting and non-accepting states of a
deterministic PDA that recognizes {a^n b^n | n ≥ 0}.
20. Non-closure of CFL's Under Complement
- But not always!
- The complement of the non-CFL L = {0^i 1^j 2^k 3^l | i = k
and j = l} is a CFL.
- Here is a PDA P recognizing it:
- Non-deterministically choose whether to check i ≠ k
or j ≠ l.
- The non-deterministic PDA checks one or the other,
but is capable of checking either one.
- Say we want to check i ≠ k.
- As long as 0's come in, count them on the stack.
- Ignore 1's.
- Pop the stack for each 2.
- As long as we have not just exposed the
bottom-of-stack marker when the first 3 comes in,
accept, and keep accepting as long as 3's come
in.
- But we also have to accept, and keep accepting,
as soon as we see that the input is not in
L(0*1*2*3*).
21. Another Example
- If L1 and L2 are CFLs and the CFLs were
closed under complementation, then the complements of L1
and L2 would be CFLs, and so would the complement of
(complement of L1) ∪ (complement of L2), because the
CFLs are closed under union.
- But by De Morgan's law, that language is L1 ∩ L2.
- It would then follow that CFLs are closed under
intersection. But we have already seen that they
are not!
22. Testing Emptiness of a CFL
- As for regular languages, we really take a
representation of some language and ask whether
it represents ∅.
- In this case, the representation can be a CFG or
PDA.
- Our choice, since there are algorithms to convert
one to the other.
- The test: use a CFG and check if the start symbol is
useless.
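Since the start symbol is always reachable, it is useless exactly when it derives no terminal string. A minimal sketch of that check follows; the function names and the dictionary encoding of productions are assumptions made for this example.

```python
def generating_variables(productions, variables):
    """Variables that derive at least one terminal string (fixed-point iteration).

    productions: variable -> list of bodies, each body a tuple of symbols.
    Symbols that are not in `variables` are treated as terminals.
    """
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            for body in bodies:
                # A body works if every symbol is a terminal or already generating.
                if all(sym in generating or sym not in variables for sym in body):
                    generating.add(head)
                    changed = True
                    break
    return generating


def is_empty(productions, variables, start):
    """L(G) is empty exactly when the start symbol generates nothing."""
    return start not in generating_variables(productions, variables)


variables = {"S", "A", "B"}
g_empty = {"S": [("A", "B")], "A": [("a",)], "B": [("B", "B")]}
g_nonempty = {"S": [("A", "B"), ("a",)], "A": [("a",)], "B": [("B", "B")]}
print(is_empty(g_empty, variables, "S"))     # True: B never terminates
print(is_empty(g_nonempty, variables, "S"))  # False: S -> a
```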
23. Testing Finiteness of a CFL
- Let L be a CFL. Then there is some pumping lemma
constant n for L.
- Test all strings of length between n and 2n - 1
for membership (as in next slide).
- If there is any such string, it can be pumped,
and the language is infinite.
- If there is no such string, then n - 1 is an
upper limit on the length of strings, so the
language is finite.
- Trick: If there were a string z = uvwxy of length
2n or longer, you can find a shorter string uwy
in L, but it's at most n shorter (why?). Thus, if
there are any strings of length 2n or more, you
can repeatedly cut out vx to get, eventually, a
string whose length is in the range n to 2n - 1.
24. Testing Membership of a String in a CFL
- Simulating a PDA for L on string w doesn't quite
work, because the PDA can grow its stack
indefinitely on ε input, and we never finish,
even if the PDA is deterministic.
- There is an O(n^3) algorithm (n = length of w)
that uses a "dynamic programming" technique.
- Called the Cocke-Younger-Kasami (CYK) algorithm.
25. CYK Algorithm
- Start with a CNF grammar for L.
- Build a two-dimensional table:
- Row = length of a substring of w.
- Column = beginning position of the substring.
- Entry in row i and column j = set of variables
that generate the substring of w beginning at
position j and extending for i positions.
- These entries are denoted X_{j,i+j-1}, i.e., the
subscripts are the first and last positions of
the string represented, so the first row is
X_11, X_22, ..., X_nn, the second row is X_12, X_23,
..., X_{n-1,n}, and so on.
26. Table
- The horizontal axis corresponds to the positions
of the string w = a_1 a_2 ... a_n.
- Table entry X_ij is the set of non-terminals A
such that A ⇒* a_i a_{i+1} ... a_j.
- We are particularly interested in whether S is in
X_1n, because that is the same as saying S ⇒* w (that
is, w is in L).
27.
- Basis (row 1): X_ii = the set of variables A such
that A → a is a production, and a is the symbol
at position i of w.
- The grammar is in CNF, therefore the only way to
derive a terminal is with a production of the
form A → a, so X_ii is the set of non-terminals A
such that A → a_i is a production of G.
- Induction: Suppose we want to compute X_ij, which
is in row j - i + 1.
- We can derive a_i a_{i+1} ... a_j from A if there is a
production A → BC, B derives some prefix of
a_i a_{i+1} ... a_j, and C derives the rest.
- Thus, we must ask if there is any value of k such
that:
- i ≤ k < j.
- B is in X_ik.
- C is in X_{k+1,j}.
28. Example
- We'll use the algorithm to determine if the
string w = aabbb is in the language generated by
the grammar
- S → AB
- A → BB | a
- B → AB | b
- Note that w_11 = a, so X_11 is the set of all
variables that immediately derive a, that is X_11
= {A}. Since w_22 = a, we also have X_22 = {A}, and
so on to get
- X_11 = {A}, X_22 = {A}, X_33 = {B}, X_44 = {B},
X_55 = {B}
29.
- Compute X_12: since X_11 = {A} and X_22 = {A}, X_12
consists of all variables on the left side of a
production whose right side is AA. None, so X_12
is empty.
- Next X_23: since X_22 = {A} and X_33 = {B}, the
required right side is AB, thus X_23 = {S,B}.
- Rest is easy:
- X_12 = ∅, X_23 = {S,B}, X_34 = {A}, X_45 = {A},
- X_13 = {S,B}, X_24 = {A}, X_35 = {S,B},
- X_14 = {A}, X_25 = {S,B},
- X_15 = {S,B}
- Since S is in X_15, w ∈ L(G)
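The basis and induction steps can be written as a short table-filling routine. This is a sketch, not a reference implementation: the function name `cyk_table` and the two dictionaries encoding the CNF grammar are assumptions for this example, and the input word is assumed to be non-empty.

```python
def cyk_table(word, unit, binary):
    """Fill the CYK table for a non-empty word.

    unit:   terminal a -> set of variables A with a production A -> a
    binary: (B, C)     -> set of variables A with a production A -> BC
    Returns X with X[(i, j)] = variables deriving word positions i..j (1-based, inclusive).
    """
    n = len(word)
    X = {}
    for i in range(1, n + 1):                      # basis: row 1
        X[(i, i)] = set(unit.get(word[i - 1], ()))
    for length in range(2, n + 1):                 # induction: rows 2..n
        for i in range(1, n - length + 2):
            j = i + length - 1
            cell = set()
            for k in range(i, j):                  # split point, i <= k < j
                for B in X[(i, k)]:
                    for C in X[(k + 1, j)]:
                        cell |= binary.get((B, C), set())
            X[(i, j)] = cell
    return X


# Grammar from the example: S -> AB, A -> BB | a, B -> AB | b
unit = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S", "B"}, ("B", "B"): {"A"}}
X = cyk_table("aabbb", unit, binary)
print("S" in X[(1, 5)])  # True, so aabbb is in L(G)
```

The three nested loops over length, start position, and split point are where the O(n^3) running time comes from.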
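30.
S → AB    A → BB | a    B → AB | b
31.
CYK table for w = aabbb under the grammar above
(row = substring length, column = start position):
row 5:  {S,B}
row 4:  {A}    {S,B}
row 3:  {S,B}  {A}    {S,B}
row 2:  ∅      {S,B}  {A}    {A}
row 1:  {A}    {A}    {B}    {B}    {B}
w:      a      a      b      b      b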
32. Another Example
- X → aXb | ab
- Step 1: put into CNF
- Apply the CYK algorithm to aaabbb
33.
X → aXb | ab
34. Another Example
S → AB | BC    A → BA | a    B → CC | b    C → AB | a
Test for string baaba
35. CYK as a Parsing Algorithm
- The applicability of the CYK algorithm as a parser is
limited by the computational requirements needed
to find a derivation.
- For an input string of length n, (n^2 + n)/2 sets
need to be constructed to complete the dynamic
programming table.
- Each of these sets may require the consideration
of several decompositions of the associated
substring.
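To use CYK as a parser rather than a recognizer, each table entry can remember one decomposition that produced it. The sketch below is one possible way to do this, not necessarily how the slides intend it. The names `cyk_parse` and `tree_yield` and the grammar encoding are assumptions for this example; the grammar and test string are the ones from the "Test for string baaba" slide.

```python
def cyk_parse(word, unit, binary, start="S"):
    """Return one parse tree for `word` as nested tuples, or None if word is not derivable.

    unit:   terminal a -> set of variables A with A -> a
    binary: (B, C)     -> set of variables A with A -> BC
    """
    n = len(word)
    back = {}  # (i, j, A) -> terminal (if i == j) or (k, B, C) for A -> BC split after k
    for i in range(1, n + 1):
        for A in unit.get(word[i - 1], ()):
            back[(i, i, A)] = word[i - 1]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                for (B, C), heads in binary.items():
                    if (i, k, B) in back and (k + 1, j, C) in back:
                        for A in heads:
                            back.setdefault((i, j, A), (k, B, C))  # keep first split found

    def build(i, j, A):
        entry = back[(i, j, A)]
        if i == j:
            return (A, entry)
        k, B, C = entry
        return (A, build(i, k, B), build(k + 1, j, C))

    return build(1, n, start) if (1, n, start) in back else None


def tree_yield(tree):
    """Concatenate the leaves of a tree produced by cyk_parse."""
    return tree[1] if len(tree) == 2 else tree_yield(tree[1]) + tree_yield(tree[2])


# Grammar: S -> AB | BC, A -> BA | a, B -> CC | b, C -> AB | a
unit = {"a": {"A", "C"}, "b": {"B"}}
binary = {("A", "B"): {"S", "C"}, ("B", "C"): {"S"},
          ("B", "A"): {"A"}, ("C", "C"): {"B"}}
tree = cyk_parse("baaba", unit, binary)
print(tree is not None and tree_yield(tree) == "baaba")  # True
```

Only one back-pointer per entry is stored, so an ambiguous string yields just one of its parse trees.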
36. Preview of Undecidable CFL Problems
- Is a given CFG ambiguous?
- Is a given CFL inherently ambiguous?
- Is the intersection of two CFLs empty?
- Are two CFLs the same?
- Is a given CFL equal to Σ*, where Σ is the
alphabet of the language?
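37. The Chomsky Hierarchy
[Figure: nested language classes, each paired with the
machine that recognizes it, from outermost to innermost]
- Recursively Enumerable Languages: Turing Machine
- Context Sensitive Languages: Linear Bounded Automata
- Context Free Languages: Push Down Automata
- Regular Languages: Finite Automata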
38. Context-Sensitive Grammars
- The next grammar type, more powerful than CFGs,
is a "somewhat restricted" grammar.
- A grammar is context-sensitive if all
productions are of the form x → y, where x, y are
in (V ∪ T)^+ and |x| ≤ |y|.
- Fundamental property:
- the grammar is non-contracting, i.e., the length of
successive sentential forms can never decrease.
- Why "context-sensitive"?
- All productions can be rewritten in a normal form
xAy → xvy.
- Effectively, "A can be replaced by v only in the
context of a preceding x and a following y."
39. Example
- CSG for {a^n b^n c^n | n ≥ 1}:
S → abc | aAbc
Ab → bA
Ac → Bbcc
bB → Bb
aB → aa | aaA
- Try to derive a^3 b^3 c^3:
S ⇒ aAbc ⇒ abAc ⇒ abBbcc ⇒ aBbbcc ⇒
aaAbbcc ⇒ aabAbcc ⇒ aabbAcc ⇒ aabbBbccc ⇒
aabBbbccc ⇒ aaBbbbccc ⇒ aaabbbccc
A and B are "messengers": an A is created on the
left, travels to the right to the first c, and
creates another b and c. It then sends B back to
create the corresponding a. This is similar to the way
one would program a TM to accept the language.
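Because the grammar is non-contracting, a derivation of a target string never passes through a longer sentential form, so a breadth-first search over forms of bounded length finds a derivation whenever one exists. The sketch below uses that idea; the function name `derive` and the rule encoding are made up for this example.

```python
from collections import deque

# Productions of the CSG above, as (left side, right side) pairs.
RULES = [("S", "abc"), ("S", "aAbc"), ("Ab", "bA"), ("Ac", "Bbcc"),
         ("bB", "Bb"), ("aB", "aa"), ("aB", "aaA")]


def derive(target, rules=RULES, start="S"):
    """Breadth-first search for a derivation start =>* target.

    Returns the list of sentential forms, or None if target is not derivable.
    Forms longer than the target are pruned, which is safe only because
    no rule shortens a sentential form.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            path = []
            while form is not None:
                path.append(form)
                form = parent[form]
            return path[::-1]
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:                       # try every occurrence of lhs
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= len(target) and new not in parent:
                    parent[new] = form
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return None


print(" => ".join(derive("aaabbbccc")))
print(derive("aabbc"))  # None: not of the form a^n b^n c^n
```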
40. Linear-Bounded Automata
- A limited TM in which tape use is restricted
- Use only the part of the tape occupied by the input
- I.e., it has an unbounded tape, but the amount that
can be used is a function of the input
- Restrict the usable part of the tape to exactly the
cells taken by the input
- An LBA is assumed to be nondeterministic
41. Relation between CSLs and LBAs
- If a language L is accepted by some linear
bounded automaton, then there is a
context-sensitive grammar that generates L.
- Every sentential form in a derivation of w from a
CSG has length at most |w|, because any CSG G is
non-contracting.