Title: CSE 3813 Introduction to Formal Languages and Automata
1CSE 3813 Introduction to Formal Languages and Automata
- Chapter 5
- Context-free Languages
- These class notes are based on material from our
textbook, An Introduction to Formal Languages and
Automata, 4th ed., by Peter Linz, published by
Jones and Bartlett Publishers, Inc., Sudbury, MA,
2006. They are intended for classroom use only
and are not a substitute for reading the textbook.
2Context-free Grammars
Definition 5.1 A grammar G = (V, T, S, P) is
said to be context-free if all production rules
in P have the form A → x, where A ∈ V and
x ∈ (V ∪ T)*. A language L is said to be
context-free iff there is a context-free grammar
G such that L = L(G).
3Context-free Grammars
Context-free means that the left side of each
grammar rule consists of a single variable.
You can imagine other kinds of grammatical
rules where this condition does not hold. For
example: 1Z1 → 101. In this rule, the variable Z
goes to 0 only in the context of a 1 on its left
and a 1 on its right. This is a
context-sensitive rule.
4Non-regular languages
- There are non-regular languages that can be
generated by context-free grammars.
- The language L = {a^n b^n : n ≥ 0}
- is generated by the grammar
- S → aSb | λ
- The language L = {w : n_a(w) = n_b(w)}
- is generated by the grammar
- S → SS | λ | aSb | bSa
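The first grammar on this slide can be tried out with a short sketch. It assumes the grammar is S → aSb | λ, that uppercase letters are variables, lowercase letters are terminals, and the empty string stands for λ. The function name `generate` and the `max_len` cut-off are illustrative choices, not part of the notes. The sketch only terminates for grammars such as this one, where every sentential form contains a bounded number of variables; a rule like S → SS would need an extra bound.

```python
from collections import deque

def generate(productions, start, max_len):
    """Return all terminal strings with at most max_len symbols
    that can be derived from start by leftmost derivations."""
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if the form is all terminals.
        idx = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if idx is None:
            results.add(form)
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # Terminals never disappear, so forms with too many can be dropped.
            terminals = sum(1 for ch in new_form if ch.islower())
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results

# Assumed grammar: S -> aSb | lambda
anbn = generate({"S": ["aSb", ""]}, "S", 4)
print(sorted(anbn, key=len))
```

Every string produced has the form a^n b^n, which is what the slide claims for this grammar.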
5Example of a Context-free Grammar
The grammar G = ({S}, {a, b}, S, P), with
production rules S → aSa | bSb | λ, is
context-free. This grammar is linear (that is,
there is at most a single variable on the
right-hand side of every rule), but is neither
right-linear nor left-linear (the variable is not
always the rightmost or leftmost character on the
right-hand side of the rule), so it is not
regular.
6Example of a Context-free Grammar
Given the grammar G = ({S}, {a, b}, S, P), with
production rules S → aSa | bSb | λ, a typical
derivation in this grammar might be S ⇒ aSa ⇒
aaSaa ⇒ aabSbaa ⇒ aabbaa. The language generated
by this grammar is L(G) = {ww^R : w ∈ {a, b}*}.
7Palindromes
- Palindromes are strings which are spelled the
same way backwards and forwards.
- The language of palindromes, PAL, is not regular.
- For any two different strings x, y of equal length,
a string z can be found which distinguishes them.
- For x, y which are different strings and |x| = |y|,
if z = x^R (the reverse of x) is appended to each,
then xz is accepted and yz is not accepted.
- Therefore any FA accepting PAL would need an
infinite number of states, so PAL is not
regular.
8Example of a Non-linear Context-free Grammar
Consider the grammar G = ({S}, {a, b}, S, P),
with production rules S → aSa | SS | λ. This
grammar is context-free. Why? Is this grammar
linear? Why or why not?
9Regular vs. context-free
Are regular languages context-free? Yes,
because context-free means that there is a single
variable on the left side of each grammar rule.
All regular languages are generated by grammars
that have a single variable on the left side of
each grammar rule. But, as we have seen, not
all context-free languages are regular (for
example, {a^n b^n : n ≥ 0}). So the regular
languages are a proper subset of the class of
context-free languages.
10Derivation
Given the grammar S → aaSB | λ, B →
bB | b, the string aab can be derived in
different ways: S ⇒ aaSB ⇒ aaB ⇒ aab and
S ⇒ aaSB ⇒ aaSb ⇒ aab.
11Parse tree
Both derivations on the previous slide correspond
to the following parse (or derivation) tree.
The tree structure shows the rule that is applied
to each nonterminal, without showing the order of
rule applications. Each internal node of the
tree corresponds to a nonterminal, and the leaves
of the derivation tree represent the string of
terminals.
12Derivation
In the derivation S ⇒ aaSB ⇒ aaB ⇒ aab, after the
first step S ⇒ aaSB we replaced S with λ, and then
replaced B with b. We moved from left to right,
replacing the leftmost variable at each step. This
is called a leftmost derivation. Similarly, the
derivation S ⇒ aaSB ⇒ aaSb ⇒ aab is called a
rightmost derivation.
13Leftmost (rightmost) derivation
Definition 5.2 In a leftmost derivation, the
leftmost nonterminal is replaced at each step.
In a rightmost derivation, the rightmost
nonterminal is replaced at each step. Many
derivations are neither leftmost nor
rightmost. If there is a single parse tree,
there is also a single leftmost derivation.
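A small sketch can make the difference between the two derivation orders concrete. It assumes the grammar S → aaSB | λ, B → bB | b, with uppercase letters as variables and the empty string standing for λ. The helper `derive` is a hypothetical name introduced only for this illustration.

```python
def derive(start, steps, leftmost=True):
    """Apply each (variable, right-hand side) step to the leftmost or
    rightmost variable and return the list of sentential forms."""
    form = start
    forms = [form]
    for var, rhs in steps:
        positions = [i for i, ch in enumerate(form) if ch.isupper()]
        if not positions:
            raise ValueError("no variable left to replace")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != var:
            raise ValueError(f"expected {var} at position {i}, found {form[i]}")
        form = form[:i] + rhs + form[i + 1:]
        forms.append(form)
    return forms

# Same rules, applied in a different order.
left = derive("S", [("S", "aaSB"), ("S", ""), ("B", "b")], leftmost=True)
right = derive("S", [("S", "aaSB"), ("B", "b"), ("S", "")], leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both runs use the same three rules and end in aab; only the order of application differs, which is why they share one parse tree.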
14Parse (derivation) trees
Definition 5.3 Let G = (V, T, S, P) be a
context-free grammar. An ordered tree is a
derivation tree for G iff it has the following
properties:
1. The root is labeled S.
2. Every leaf has a label from T ∪ {λ}.
3. Every interior vertex (not a leaf) has a label from V.
4. If a vertex has label A ∈ V, and its children are
labeled (from left to right) a1, a2, ..., an, then
P must contain a production of the form A →
a1a2...an.
5. A leaf labeled λ has no siblings;
that is, a vertex with a child labeled λ can have
no other children.
15Parse (derivation) trees
A partial derivation tree is one in which
property 1 does not necessarily hold and in which
property 2 is replaced by:
2a. Every leaf has a label from V ∪ T ∪ {λ}.
The yield of the tree is
the string of symbols in the order they are
encountered when the tree is traversed in a
depth-first manner, always taking the leftmost
unexplored branch.
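The depth-first, leftmost-branch-first traversal that defines the yield can be sketched as follows. The tree encoding (a pair of label and child list, with the empty string standing for λ) and the name `tree_yield` are assumptions made for this illustration; the example tree is for the string aab in the grammar S → aaSB | λ, B → bB | b.

```python
def tree_yield(node):
    """Return the yield of a tree given as (label, children).

    A leaf has an empty child list. The label "" stands for lambda
    and contributes nothing to the yield.
    """
    label, children = node
    if not children:
        return label
    # Visit children left to right, which is the leftmost-branch-first order.
    return "".join(tree_yield(child) for child in children)

# Derivation tree: S -> aaSB, S -> lambda, B -> b
full_tree = ("S", [("a", []), ("a", []), ("S", [("", [])]), ("B", [("b", [])])])

# Partial derivation tree: the leaves S and B are still variables.
partial_tree = ("S", [("a", []), ("a", []), ("S", []), ("B", [])])

print(tree_yield(full_tree))
print(tree_yield(partial_tree))
```

The first tree yields a string of terminals, while the partial tree yields a string that still contains variables.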
16Parse (derivation) trees
A partial derivation tree yields a sentential
form of the grammar G that the tree is associated
with. A derivation tree yields a sentence of the
grammar G that the tree is associated with.
17Parse (derivation) trees
Theorem 5.1 Let G = (V, T, S, P) be a
context-free grammar. Then for every w ∈ L(G)
there exists a derivation tree of G whose yield
is w. Conversely, the yield of any derivation
tree of G is in L(G). If t_G is any partial
derivation tree for G whose root is labeled S,
then the yield of t_G is a sentential form of
G. Any w ∈ L(G) has a leftmost and a rightmost
derivation. The leftmost derivation is obtained
by always expanding the leftmost variable in the
derivation tree at each step, and similarly for
the rightmost derivation.
18Ambiguity
A grammar is ambiguous if there is a string
with two possible parse trees. (A string has
more than one parse tree if and only if it has
more than one leftmost derivation.) English can
be ambiguous. Example: "Disabled fly to see
Carter."
19Example
V = {S}, T = {+, *, (, ), 0, 1}
P: S → S + S | S * S | (S) | 1 | 0
The string 0 * 0 + 1 has
two different parse trees. The derivation begins
like this: S. What is the leftmost
variable? S. What can we replace it with? S + S
or S * S or (S) or 1 or 0. Pick one
of these at random, say S + S.
20 S → S + S | S * S | (S) | 1 | 0
Here is the parse tree:
            S
          / | \
         S  +  S
       / | \   |
      S  *  S  1
      |     |
      0     0
Our string is 0 * 0 + 1. This parse corresponds to:
compute 0 * 0 first, then add it to 1, which
equals 1.
21Example
S → S + S | S * S | (S) | 1 | 0
But there is
another, different parse tree that also generates
the string 0 * 0 + 1. The derivation begins like
this: S. What is the leftmost variable? S. What
can we replace it with? S + S or S * S or
(S) or 1 or 0. Pick another one of these
at random, say S * S.
22 S → S + S | S * S | (S) | 1 | 0
Here is the parse tree:
        S
      / | \
     S  *  S
     |   / | \
     0  S  +  S
        |     |
        0     1
Our string is still 0 * 0 + 1,
but this parse corresponds to: take 0, and then
multiply it by the sum of 0 + 1, which equals 0.
23 S → S + S | S * S | (S) | 1 | 0
We can clearly indicate that the addition is to
be done first. Here is the parse tree:
        S
      / | \
     S  *  S
     |   / | \
     0  (  S  )
         / | \
        S  +  S
        |     |
        0     1
Our string is now 0 * (0 + 1).
This parse corresponds to: take 0, and
then multiply it by the sum of 0 + 1, which
equals 0.
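The effect of the two parse trees can be checked with a short sketch. It reads the string on these slides as 0 * 0 + 1 with the rules S → S + S | S * S | (S) | 1 | 0; the operators are an assumption recovered from the slide text ("add it to 1", "multiply it by the sum"). The nested-tuple encoding of trees and the function names are illustrative only.

```python
def evaluate(tree):
    """Evaluate a parse tree given as nested tuples.

    A leaf is the string "0" or "1"; an interior node is
    (left, operator, right) with operator "+" or "*".
    """
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) * evaluate(right)

def flatten(tree):
    """Return the yield of the tree, ignoring how it is grouped."""
    if isinstance(tree, str):
        return tree
    left, op, right = tree
    return f"{flatten(left)} {op} {flatten(right)}"

tree_a = (("0", "*", "0"), "+", "1")   # root rule S -> S + S
tree_b = ("0", "*", ("0", "+", "1"))   # root rule S -> S * S

print(flatten(tree_a), "=", evaluate(tree_a))
print(flatten(tree_b), "=", evaluate(tree_b))
```

Both trees have the same yield but evaluate to different values, which is the practical cost of an ambiguous grammar.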
24Equivalent grammars
Here is a non-ambiguous grammar that
generates the same language:
S → S + A | A
A → A * B | B
B → (S) | 1 | 0
Two grammars that
generate the same language are said to be
equivalent. To make parsing easier, we prefer
grammars that are not ambiguous.
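As a rough illustration of why the unambiguous grammar is easier to parse, here is a sketch of a small evaluator that follows it rule by rule. It assumes the grammar is S → S + A | A, A → A * B | B, B → (S) | 1 | 0. The left-recursive rules are handled with loops, which is a common implementation choice and not something stated in the notes; all function names are hypothetical.

```python
def parse_expr(tokens, pos=0):
    """S -> S + A | A, handled iteratively as A (+ A)*."""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs
    return value, pos

def parse_term(tokens, pos):
    """A -> A * B | B, handled iteratively as B (* B)*."""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        rhs, pos = parse_factor(tokens, pos + 1)
        value *= rhs
    return value, pos

def parse_factor(tokens, pos):
    """B -> (S) | 1 | 0."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok in ("0", "1"):
        return int(tok), pos + 1
    if tok == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def evaluate_text(text):
    """Evaluate a whole expression; every character is one token."""
    tokens = [ch for ch in text if not ch.isspace()]
    value, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return value

print(evaluate_text("0 * 0 + 1"))
print(evaluate_text("0 * (0 + 1)"))
```

Because multiplication is introduced one level below addition, 0 * 0 + 1 has only one reading here: the product is formed first.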
25Ambiguous grammars and equivalent grammars
There is no general algorithm for determining
whether a given CFG is ambiguous. There is no
general algorithm for determining whether a given
CFG is equivalent to another CFG.
26Dangling else
x = 3
if x > 2 then if x > 4 then x = 1 else x = 5
What value does x have at the
end?
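The two possible readings of the statement can be written out explicitly. This sketch assumes the statement is "if x > 2 then if x > 4 then x = 1 else x = 5" with x starting at 3; the function names are illustrative. Python's indentation forces a choice, so each reading becomes its own function.

```python
def else_binds_to_inner_if(x):
    # if x > 2 then (if x > 4 then x = 1 else x = 5)
    if x > 2:
        if x > 4:
            x = 1
        else:
            x = 5
    return x

def else_binds_to_outer_if(x):
    # if x > 2 then (if x > 4 then x = 1) else x = 5
    if x > 2:
        if x > 4:
            x = 1
    else:
        x = 5
    return x

print(else_binds_to_inner_if(3))
print(else_binds_to_outer_if(3))
```

With x = 3 the first reading ends with x = 5 and the second leaves x = 3, so the grammar has to settle which if the else belongs to.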
27 Ambiguous grammar:
<statement> ::= IF <expression> THEN <statement>
              | IF <expression> THEN <statement> ELSE <statement>
              | <otherstatement>
Unambiguous grammar:
<statement> ::= <st1> | <st2>
<st1> ::= IF <expression> THEN <st1> ELSE <st1>
        | <otherstatement>
<st2> ::= IF <expression> THEN <statement>
        | IF <expression> THEN <st1> ELSE <st2>
28Ambiguous grammars
Definition 5.6 If L is a context-free language
for which there exists an unambiguous grammar, then
L is said to be unambiguous. If every grammar that
generates L is ambiguous, then the language is
called inherently ambiguous. Example: L =
{a^n b^n c^m} ∪ {a^n b^m c^m}, with n and m
non-negative, is inherently ambiguous. See p. 144
in your textbook for discussion.
29Homework Exercise
Show that the following grammar is ambiguous:
S → AB | aaB
A → a | Aa
B → b
Construct an
equivalent grammar that is unambiguous.
30Parsing
- In practical applications, it is usually not
enough to decide whether a string belongs to a
language. It is also important to know how to
derive the string from the grammar.
- Parsing uncovers the syntactical structure of a
string, which is represented by a parse tree.
(The syntactical structure is important for
assigning semantics to the string -- for example,
if it is a program.)
31Parsing
Let G be a context-free grammar for C. Let the
string w be a C program. One thing a compiler
does - in particular, the part of the compiler
called the parser - is determine whether w is a
syntactically correct C program. It also
constructs a parse tree for the program that is
used in code generation. There are many
sophisticated and efficient algorithms for
parsing. You may study them in more
advanced classes (for example, on compilers).
32The Decision question for CFLs
If a string w belongs to L(G) generated by a CFG,
can we always decide that it does belong to
L(G)? Yes. Just do top-down parsing, in which
we list all the sentential forms that can be
generated in one step, two steps, three steps,
etc. This is a type of exhaustive search
parsing. Eventually, w will be generated. What
if w does not belong to L(G)? Can we always
decide that it doesn't? Not unless we restrict
the kinds of rules we can have in our grammar.
Suppose we ask if w = aab is a string in L(G).
If we have λ-rules, such as B → λ, in G, we might
have a sentential form like aabB^4000 and still be
able to end up with aab.
33The Decision question for CFLs
What we need to do is to restrict the kinds of
rules in our CFGs so that each rule, when it is
applied, is guaranteed to either increase the
length of the sentential form generated or to
increase the number of terminals in the
sentential form. That means that we don't want
rules of the following two forms in our CFGs:
A → λ
A → B
If we have a CFG that lacks these
kinds of rules, then as soon as a sentential form
is generated that is longer than our string, w,
we can abandon any attempt to generate w from
this sentential form.
34The Decision question for CFLs
If the grammar does not have these two kinds of
rules, then, in a finite number of steps,
applying our exhaustive search parsing technique
to G will generate all possible sentential forms
of G with a length ≤ |w|. If w has not been
generated by this point, then w is not a string
in the language, and we can stop generating
sentential forms.
35The Decision question for CFLs
Consider the grammar G = ({S}, {a, b}, S, P),
where P is S → SS | aSb | bSa | ab | ba. Looking
at the production rules, it is easy to see that
the length of the sentential form produced by the
application of any rule grows by at least one
symbol during each derivation step. Thus, in ≤
|w| derivation steps, G will either
produce a string of all terminals, which may be
compared directly to w, or a sentential form too
long to be capable of producing w. Hence, given
any w ∈ {a, b}+, the exhaustive search parsing
technique will always terminate in a finite
number of steps.
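The exhaustive search described on these slides can be sketched in a few lines. This is a minimal illustration, not the textbook's algorithm verbatim: it assumes the grammar has no rules of the form A → λ or A → B, uses uppercase letters for variables, restricts itself to leftmost derivations, and also remembers forms it has already seen (an added shortcut). The name `exhaustive_parse` is hypothetical.

```python
def exhaustive_parse(productions, start, w):
    """Return a leftmost derivation of w as a list of sentential forms,
    or None if w cannot be generated.

    Assumes no rules A -> lambda or A -> B, so any sentential form
    longer than w can be abandoned.
    """
    frontier = [[start]]      # derivations found in the current round
    seen = {start}
    while frontier:
        next_frontier = []
        for derivation in frontier:
            form = derivation[-1]
            idx = next((i for i, ch in enumerate(form) if ch.isupper()), None)
            if idx is None:
                if form == w:
                    return derivation
                continue      # all terminals, but not w
            for rhs in productions[form[idx]]:
                new_form = form[:idx] + rhs + form[idx + 1:]
                # Forms longer than w can never shrink back to w.
                if len(new_form) <= len(w) and new_form not in seen:
                    seen.add(new_form)
                    next_frontier.append(derivation + [new_form])
        frontier = next_frontier
    return None

rules = {"S": ["SS", "aSb", "bSa", "ab", "ba"]}
print(exhaustive_parse(rules, "S", "abab"))
print(exhaustive_parse(rules, "S", "aab"))
```

The search stops on its own for aab because only finitely many sentential forms have length at most 3, which is the termination argument made above.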
36The Decision question for CFLs
Theorem 5.2 Assume that G = (V, T, S, P) is a
context-free grammar with no rules of the form A
→ λ or A → B, where A, B ∈ V. Then the
exhaustive search parsing technique can be made
into an algorithm which, for any w ∈ Σ*, either
produces a parsing for w or tells us that no
parsing is possible.
37The Decision question for CFLs
Since we don't know ahead of time which
derivation sequences to try, we have to try all
of the possible applications of rules which
result in one of two conditions: a string of
all terminals of length |w|, or a sentential
form of length |w| + 1. The application of any
one rule must result in either replacing a
variable with one or more terminals, or
increasing the length of a sentential form by one
or more characters. The worst case scenario is
applying |w| rules that increase the length of a
sentential form to |w|, and then applying |w|
rules that replace each variable with a terminal
symbol, and ending up with a string of |w|
terminals that doesn't match w. This takes 2|w|
operations.
38The Decision question for CFLs
How many sentential forms will we have to
examine? Restricting ourselves to leftmost
derivations, it is obvious that, with |P|
production rules, applying each rule one time to
S gives us |P| sentential forms. Example: Given
the 5 production rules S → SS | aSb | bSa | ab |
ba, one round of leftmost derivations produces 5
sentential forms:
S ⇒ SS
S ⇒ aSb
S ⇒ bSa
S ⇒ ab
S ⇒ ba
39The Decision question for CFLs
The second round of leftmost derivations produces
15 sentential forms:
SS ⇒ SSS, SS ⇒ aSbS, SS ⇒ bSaS, SS ⇒ abS, SS ⇒ baS
aSb ⇒ aSSb, aSb ⇒ aaSbb, aSb ⇒ abSab, aSb ⇒ aabb, aSb ⇒ abab
bSa ⇒ bSSa, bSa ⇒ baSba, bSa ⇒ bbSaa, bSa ⇒ baba, bSa ⇒ bbaa
ab and ba don't produce any new sentential forms, since
they consist of all terminals. If they had
contained variables, then the second round of
leftmost derivations would have produced 25, or
|P|^2, sentential forms. Similarly, the third
round of leftmost derivations can produce |P|^3
sentential forms.
40The Decision question for CFLs
We know from our worst case scenario that we
never have to run through more than 2|w| rounds
of rule applications in any one derivation
sequence before being able to stop the
derivation. Therefore, the total number of
sentential forms that we may have to generate to
decide whether string w belongs to L(G) generated
by grammar G = (V, T, S, P) is at most
|P| + |P|^2 + ... + |P|^(2|w|).
Unfortunately, this means that
the work we might have to do to answer the
decision question for CFGs could grow
exponentially with the length of the string.
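To get a feel for how quickly this bound grows, the sum can be evaluated directly. The sketch below assumes the bound is |P| + |P|^2 + ... + |P|^(2|w|), with |P| the number of production rules; the function name is made up for the example.

```python
def sentential_form_bound(num_rules, w_len):
    """Upper bound |P| + |P|^2 + ... + |P|^(2|w|) on the number of
    sentential forms that exhaustive search may generate."""
    return sum(num_rules ** k for k in range(1, 2 * w_len + 1))

# Five production rules, as in the example grammar.
for length in (1, 2, 3, 4):
    print(length, sentential_form_bound(5, length))
```

Each extra symbol in w multiplies the bound by roughly |P|^2, so even short strings give very large worst-case counts.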
41The Decision question for CFLs
It can be shown that some more efficient parsing
techniques for CFGs exist. Theorem 5.3 For
every context-free grammar there exists an
algorithm that parses any w ∈ L(G) in a number of
steps proportional to |w|^3. Your textbook does
not offer a proof for this theorem. Anyway, what
is needed is a linear-time parsing algorithm for
CFGs. Such an algorithm exists for some special
cases of CFGs but not for the class of CFGs in
general.
42S-grammars
Definition 5.4 A context-free grammar G = (V,
T, S, P) is said to be a simple grammar or
s-grammar if all of its productions are of the
form A → ax, where A ∈ V, a ∈ T, x ∈ V*, and
any pair (A, a) occurs at most once in
P. Example: The following grammar is an
s-grammar: S → aS | bSS | c. The following
grammar is not an s-grammar. Why not? S → aS |
bSS | aSS | c
43S-grammars
If G is an s-grammar, then any string w in L(G)
can be parsed with an effort proportional to |w|.
44S-grammars
Let's consider the grammar expressed by the
following production rules: S → aS | bSS |
c. Since G is an s-grammar, all rules have the
form A → ax. Assume that w = abcc. Due to the
restrictive condition that any pair (A, a) may
occur at most once in P, we know immediately
which production rule must have generated the a
in abcc: the rule S → aS. Similarly, there is
only one way to produce the b and the two c's.
So we can parse w in no more than |w| steps.
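The linear-time argument can be sketched as a one-pass parser. It assumes the s-grammar S → aS | bSS | c, stored as a table from (variable, terminal) pairs to the string of variables x in the unique rule A → ax. The table layout, the stack of pending variables, and the name `parse_s_grammar` are illustrative choices, not taken from the textbook.

```python
def parse_s_grammar(rules, start, w):
    """Parse w with an s-grammar using one step per input symbol.

    rules maps (variable, terminal) to the string of variables x in the
    unique production variable -> terminal x. Returns the list of
    productions used, or None if w is not in the language.
    """
    stack = [start]      # variables still to be expanded, top at the end
    used = []
    for ch in w:
        if not stack:
            return None              # input left over, nothing to expand
        var = stack.pop()
        x = rules.get((var, ch))
        if x is None:
            return None              # no production var -> ch x
        used.append(f"{var} -> {ch}{x}")
        stack.extend(reversed(x))    # leftmost variable of x ends up on top
    return used if not stack else None

# Assumed grammar: S -> aS | bSS | c
s_rules = {("S", "a"): "S", ("S", "b"): "SS", ("S", "c"): ""}
print(parse_s_grammar(s_rules, "S", "abcc"))
print(parse_s_grammar(s_rules, "S", "abc"))
```

Each input symbol triggers exactly one table lookup, so the number of steps is |w|, and no backtracking is ever needed because the pair (A, a) selects at most one rule.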
45Exercise
Let G be the grammar
S → abSc | A
A → cAd | cd
1) Give a derivation of ababccddcc.
2) Build the parse tree for the derivation of (1).
3) Use set notation to define L(G).
46Programming languages
- Programming languages are context-free, but not
regular
- Programming languages have the following features
that require infinite stack memory:
  - matching parentheses in algebraic expressions
  - nested if .. then .. else statements, and nested
loops
  - block structure
47Programming languages
- Programming languages are often defined using a
convention for specifying grammars called
Backus-Naur form, or BNF.
- Example:
- <expression> ::= <term> | <expression> + <term>
48Programming languages
- Backus-Naur form is very similar to the standard
CFG grammar form, but variables are listed within
angular brackets, ::= is used instead of →, and
{X} is used to mean 0 or more occurrences of X.
The | is still used to mean or.
- Pascal's if statement:
- <if-statement> ::= if <expression> <then-clause>
<else-clause>
49Programming languages
- S-grammars are not sufficiently powerful to
handle all the syntactic features of a typical
programming language.
- LL grammars and LR grammars (see next chapter)
are normally used for specifying programming
languages. They are more complicated than
s-grammars, but still permit parsing in linear
time.
- Some aspects of programming languages (i.e.,
semantics) cannot be handled by context-free
grammars.