Title: Chapter 1 Introduction
Chapter 1 Introduction
- History of the compiler
- Description of programs related to compilers
- The compiling (translation) process
- Major data structures of a compiler
What and Why Compilers?
- Compilers: computer programs that translate one language to another, from a source language (the input) to a target language (the output).
- Source language: a high-level language such as C.
- Target language: object code or machine code (machine instructions).
- Purposes of learning compilers
- 1. Basic knowledge: theoretical techniques (automata theory, data structures, discrete mathematics, machine architecture, assembly language).
- 2. Tools and practical experience to design and program an actual compiler.
- Additional uses of compiling techniques: developing command interpreters and interface programs.
- TINY: the language used for the discussion in the text.
- C-Minus: a small but sufficiently complex subset of C. It is more extensive than TINY and suitable for a class project.
1.1 A brief history of compilers
- 1. In the late 1940s, the stored-program computer was invented by John von Neumann. Programs were written in machine language, such as that of the Intel 8x86 in IBM PCs. For example, c7 06 0000 0002 means "move the number 2 to the location 0000".
- 2. In assembly language, numeric codes were replaced by symbolic forms, such as MOV x, 2. An assembler translates the symbolic codes and memory locations of assembly language into the corresponding numeric codes.
- Defects of assembly language:
- Difficult to read, write, and understand.
- Dependent on the particular machine.
- 3. The FORTRAN language and its compiler were developed between 1954 and 1957 by a team at IBM led by John Backus. This was the first compiler.
- 4. The structure of natural language was studied by Noam Chomsky: the classification of languages according to the complexity of their grammars and the power of the algorithms needed to recognize them.
- Four levels of grammars: type 0, type 1, type 2, type 3.
- Type 0: unrestricted grammar, recognized by Turing machines.
- Type 1: context-sensitive grammar.
- Type 2: context-free grammar, the most useful for programming languages.
- Type 3: right-linear grammar, regular expressions and finite automata.
- 5. Parsing problems were studied in the 1960s and 1970s.
- Code improvement techniques (optimization techniques) improve compiler efficiency.
- Compiler-compilers (parser generators) automate only one part of the compilation process.
- YACC was written in 1975 by Steve Johnson for the UNIX system; Lex was written in 1975 by Mike Lesk.
- 6. Recent advances in compiler design:
- Application of more sophisticated algorithms for inferring and/or simplifying the information contained in a program (along with the development of more sophisticated programming languages that allow this kind of analysis).
- Development of standard windowing environments (interactive development environments, IDEs).
1.2 Programs related to compilers
- 1. Interpreters: another kind of language translator. An interpreter executes the source program immediately. The choice between an interpreter and a compiler depends on the language in use and the situation: interpreters are common for BASIC, LISP, and so on, while compilers give faster execution.
- 2. Assemblers
- A translator that translates assembly language into object code.
- 3. Linkers
- Collect code separately compiled or assembled in different object files into a single file.
- Connect the code for standard library functions.
- Connect resources supplied by the operating system of the computer.
- 4. Loaders
- Relocatable code: code whose addresses are not completely fixed.
- Loaders resolve all relocatable addresses relative to the starting address.
- 5. Preprocessors
- Preprocessors delete comments, include other files, and perform macro substitutions.
- 6. Editors
- Some produce a standard file format (structure-based editors).
- 7. Debuggers
- Determine execution errors in a compiled program.
- 8. Profilers
- Collect statistics on the behavior of an object program during execution, such as the number of times each procedure is called and the percentage of execution time spent in each procedure.
- 9. Project managers
- Coordinate the files being worked on by different people.
- sccs (source code control system) and rcs (revision control system) are project manager programs on Unix systems.
2.4 From Regular Expressions to DFAs
- An algorithm translating a regular expression into a DFA via an NFA:
- regular expression → NFA → DFA → program
- 2.4.1 From a regular expression to an NFA
- The idea of Thompson's construction:
- Use ε-transitions to glue together the machines of each piece of a regular expression to form a machine that corresponds to the whole expression.
- 1. Basic regular expressions
- A basic regular expression is of the form a, ε, or φ.
- (Figures: the basic NFAs for the single character a and for ε.)
- 2. Concatenation: rs
- We connect the accepting state of the machine of r to the start state of the machine of s by an ε-transition.
- The new machine has the start state of the machine of r as its start state and the accepting state of the machine of s as its accepting state.
- Clearly, this machine accepts L(rs) = L(r)L(s) and so corresponds to the regular expression rs.
- 3. Choice among alternatives: r|s
- We add a new start state and a new accepting state and connect them as shown using ε-transitions.
- Clearly, this machine accepts the language L(r|s) = L(r) ∪ L(s).
- 4. Repetition: construct a machine that corresponds to r*
- Given a machine that corresponds to r, we add two new states, a start state and an accepting state.
- The repetition in this machine is afforded by the new ε-transition from the accepting state of the machine of r back to its start state.
- We must also draw an ε-transition from the new start state to the new accepting state.
- This construction is not unique; simplifications are possible in many cases.
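The four constructions above can be sketched directly in code. This is an illustrative sketch, not the text's implementation: the names (NFA, symbol, concat, choice, star, accepts) are ours, transitions are stored as (source, label, target) triples, and ε is represented by None.

```python
EPS = None  # stands for an epsilon-transition label

class NFA:
    def __init__(self):
        self.trans = []   # list of (src, label, dst) triples
        self.n = 0        # number of states allocated so far
    def new_state(self):
        self.n += 1
        return self.n - 1

def symbol(nfa, a):
    """Basic machine for a single character a."""
    s, f = nfa.new_state(), nfa.new_state()
    nfa.trans.append((s, a, f))
    return s, f

def concat(nfa, m1, m2):
    """rs: glue accepting state of m1 to start state of m2 by an epsilon-move."""
    nfa.trans.append((m1[1], EPS, m2[0]))
    return m1[0], m2[1]

def choice(nfa, m1, m2):
    """r|s: new start and accepting states joined to both machines by epsilon-moves."""
    s, f = nfa.new_state(), nfa.new_state()
    for m in (m1, m2):
        nfa.trans.append((s, EPS, m[0]))
        nfa.trans.append((m[1], EPS, f))
    return s, f

def star(nfa, m):
    """r*: loop back from accept to start, plus a skip edge for the empty string."""
    s, f = nfa.new_state(), nfa.new_state()
    nfa.trans.append((s, EPS, m[0]))     # enter the machine
    nfa.trans.append((m[1], EPS, m[0]))  # repeat
    nfa.trans.append((m[1], EPS, f))     # leave
    nfa.trans.append((s, EPS, f))        # accept the empty string
    return s, f

def accepts(nfa, machine, text):
    """Naive NFA simulation: follow epsilon-closures between characters."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for (src, lab, dst) in nfa.trans:
                if src == q and lab is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    cur = closure({machine[0]})
    for ch in text:
        cur = closure({d for (s2, lab, d) in nfa.trans if s2 in cur and lab == ch})
    return machine[1] in cur

# Build a machine for (a|b)c* and try a few strings.
n = NFA()
m = concat(n, choice(n, symbol(n, 'a'), symbol(n, 'b')), star(n, symbol(n, 'c')))
print(accepts(n, m, "accc"), accepts(n, m, "b"), accepts(n, m, "ca"))
```

Each combinator returns only a (start, accept) pair, mirroring the fact that every Thompson machine has exactly one start and one accepting state.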
2.4.2 From an NFA to a DFA
- Given an arbitrary NFA, construct an equivalent DFA (i.e., one that accepts precisely the same strings).
- We need methods for:
- (1) Eliminating ε-transitions.
- ε-closure: the set of all states reachable by ε-transitions from a state or set of states.
- (2) Eliminating multiple transitions from a state on a single input character: keeping track of the set of states that are reachable by matching a single character.
- Both these processes lead us to consider sets of states instead of single states. Thus, it is not surprising that the states of the DFA we construct are sets of states of the original NFA.
- The algorithm is called the subset construction.
The ε-closure of a set of states
- The ε-closure of a single state s is the set of states reachable by a series of zero or more ε-transitions, and we write this set as s̄.
- Example 2.14: the NFA for the regular expression a*, with states 1 to 4.
- The ε-closures of the single states are 1̄ = {1,2,4}, 2̄ = {2}, 3̄ = {2,3,4}, and 4̄ = {4}.
- The ε-closure of a set of states is the union of the ε-closures of each individual state.
- For example, in the NFA of Example 2.14:
- ε-closure({1,3}) = 1̄ ∪ 3̄ = {1,2,4} ∪ {2,3,4} = {1,2,3,4}
The Subset Construction
- (1) Compute the ε-closure of the start state of M; this becomes the start state of the new DFA.
- (2) For this set, and for each subsequent set, we compute transitions on each character a as follows.
- Given a set S of states and a character a in the alphabet, compute the set
- S_a = { t | for some s in S there is a transition from s to t on a }.
- Then compute the ε-closure of S_a.
- This defines a new state in the subset construction, together with a new transition from S to the ε-closure of S_a on a.
- (3) Continue with this process until no new states or transitions are created.
- (4) Mark as accepting those states constructed in this manner that contain an accepting state of M.
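The steps above can be sketched as a worklist algorithm. The encoding is ours (ε labelled as None, DFA states as frozensets of NFA states); the example NFA is the four-state machine for a* from Example 2.14.

```python
EPS = None
nfa_trans = {
    (1, EPS): {2, 4},   # epsilon-moves out of the start state
    (2, 'a'): {3},
    (3, EPS): {2, 4},   # loop back or leave
}
nfa_start, nfa_accept = 1, 4

def eps_closure(states):
    """All states reachable by zero or more epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for dst in nfa_trans.get((q, EPS), ()):
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def subset_construction(alphabet):
    start = eps_closure({nfa_start})          # step (1)
    dfa_states, worklist, dfa_trans = {start}, [start], {}
    while worklist:                           # step (3): until nothing new
        S = worklist.pop()
        for a in alphabet:                    # step (2)
            Sa = set()                        # S_a: targets of a-moves out of S
            for s in S:
                Sa |= nfa_trans.get((s, a), set())
            T = eps_closure(Sa)
            if not T:
                continue
            dfa_trans[(S, a)] = T
            if T not in dfa_states:
                dfa_states.add(T)
                worklist.append(T)
    # step (4): accepting DFA states contain an accepting NFA state
    accepting = {S for S in dfa_states if nfa_accept in S}
    return start, dfa_states, dfa_trans, accepting

start, states, trans, accepting = subset_construction({'a'})
print(sorted(sorted(s) for s in states))
```

For this machine the construction yields exactly two DFA states, {1,2,4} and {2,3,4}, both accepting.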
2.4.3 Simulating an NFA using the subset construction
- As mentioned in Section 2.3, NFAs can be implemented in similar ways to DFAs, except that since NFAs are nondeterministic, there are potentially many different sequences of transitions that must be tried. A program that simulates an NFA must store up transitions that have not yet been tried and backtrack to them on failure.
- Another way of simulating an NFA is to use the subset construction, but instead of constructing all the states of the associated DFA, we construct only the state at each point that is indicated by the next input character.
2.4.4 Minimizing the number of states in a DFA
- The process we have described of deriving a DFA algorithmically from a regular expression has the unfortunate property that the resulting DFA may be more complex than necessary.
- For example, in Example 2.15 we derived a two-state DFA for the regular expression a*, whereas a DFA with a single (accepting) state will do as well.
- An important result from automata theory states that, given any DFA, there is an equivalent DFA containing a minimum number of states, and that this minimum-state DFA is unique (except for renaming of states).
- It is also possible to obtain this minimum-state DFA directly from any given DFA.
- The algorithm is as follows:
- (1) It begins with the most optimistic assumption possible: it creates two sets, one consisting of all the accepting states and the other consisting of all the non-accepting states.
- (2) Given this partition of the states of the original DFA, consider the transitions on each character a of the alphabet.
- If all accepting states have transitions on a to accepting states, then this defines an a-transition from the new accepting state (the set of all the old accepting states) to itself.
set of all the old accepting states) to itself. - If all accepting states have transitions on a to
non-accepting states, then this defines an
a-transition from the new accepting state to the
new non-accepting state (the set of all the old
non-accepting stales). - On the other hand, if there are two accepting
states s and t that have transitions on a that
land in different sets, then no a-transition can
be defined for this grouping of the states. We
say that a distinguishes the states s and t - We must also consider error transitions to an
error state that is non-accepting. If there are
accepting states s and t such that s has an
a-transition to another accepting state, while t
has no a-transition at all (i.e., an error
transition), then a distinguishes s and t.
- (3) If any further sets are split, we must return and repeat the process from the beginning. This process continues until either all sets contain only one element (in which case we have shown the original DFA to be minimal) or no further splitting of sets occurs.
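Steps (1) through (3) can be sketched as a naive partition-refinement loop. This is an illustrative sketch (efficient Hopcroft-style bookkeeping is omitted); the function names and the example DFA are ours, and a missing entry in the transition map models an error transition.

```python
def minimize(states, alphabet, delta, accepting):
    """delta maps (state, char) -> state; missing entries are error moves."""
    # (1) optimistic start: accepting vs. non-accepting states
    parts = [set(accepting), set(states) - set(accepting)]
    parts = [p for p in parts if p]
    changed = True
    while changed:                      # (3) repeat until no set splits
        changed = False
        for p in list(parts):
            for a in alphabet:
                # group the states of p by which set their a-transition lands in
                def target_part(s):
                    t = delta.get((s, a))        # None models the error state
                    for i, q in enumerate(parts):
                        if t in q:
                            return i
                    return -1                    # error transition
                groups = {}
                for s in p:
                    groups.setdefault(target_part(s), set()).add(s)
                if len(groups) > 1:              # (2) a distinguishes states in p
                    parts.remove(p)
                    parts.extend(groups.values())
                    changed = True
                    break
            if changed:
                break
    return parts

# Three-state DFA for a*, states 1,2,3 all accepting: they collapse into one set.
delta = {(1, 'a'): 2, (2, 'a'): 3, (3, 'a'): 3}
parts = minimize({1, 2, 3}, {'a'}, delta, {1, 2, 3})
print(parts)
```

Each final set of the partition becomes one state of the minimum-state DFA.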
Chapter 3 Context-Free Grammars and Parsing
- Main content
- 1. Study the theory of context-free grammars
- 2. A general description of the parsing process
- 3. Study the basic theory of context-free grammars (CFGs)
- Parsing is the task of syntax analysis.
- The goal is to determine the syntax, or structure, of a program.
- The syntax is defined by the grammar rules of a context-free grammar.
- The rules of a context-free grammar are recursive.
- The basic data structure of syntax analysis is the parse tree or syntax tree.
- The syntactic structure of a language must now also be recursive.
3.1 The parsing process
- The parser may be viewed as a function that takes as its input the sequence of tokens produced by the scanner and produces as its output the syntax tree:
- sequence of tokens → parser → syntax tree
- (1) Usually the sequence of tokens is not an explicit input parameter; rather, the parser calls a scanner procedure such as getToken to fetch the next token from the input as it is needed during the parsing process. Thus, the parsing step of the compiler reduces to a call to the parser as follows:
- syntaxTree = parse()
- (2) In a single-pass compiler the parser will incorporate all the other phases of a compiler, including code generation, so no explicit syntax tree needs to be constructed (the parser steps themselves represent the syntax tree implicitly). Thus a call
- parse()
- suffices.
- (3) More commonly, a compiler will be multi-pass, in which case the further passes will use the syntax tree as their input.
- The structure of the syntax tree is heavily dependent on the particular syntactic structure of the language. This tree is usually defined as a dynamic data structure, in which each node consists of a record whose fields include the attributes needed for the remainder of the compilation process (i.e., not just those computed by the parser).
- One problem that is more difficult for the parser than the scanner is the treatment of errors.
- (1) Errors in the scanner: generate an error token and consume the offending character.
- (2) Errors in the parser: the parser must not only report an error message, it must also recover from the error and continue parsing (to find as many errors as possible). Sometimes a parser may perform error repair.
- One particularly important aspect of error recovery is the reporting of meaningful error messages and the resumption of parsing as close to the actual error as possible.
3.2 Context-free grammars
- A context-free grammar is a specification for the syntactic structure of a programming language. Such a specification is very similar to the specification of the lexical structure of a language using regular expressions, except that a context-free grammar involves recursive rules.
- For example:
- exp → exp op exp | ( exp ) | number
- op → + | - | *
3.2.1 Comparison to regular expression notation
- The context-free grammar:
- exp → exp op exp | ( exp ) | number
- op → + | - | *
- The regular expressions:
- number = digit digit*
- digit = 0|1|2|3|4|5|6|7|8|9
- In basic regular expression rules:
- (1) Three operations: choice (given by the vertical bar metasymbol), concatenation (with no metasymbol), and repetition (given by the asterisk metasymbol).
- (2) The equal sign represents the definition of a name for a regular expression.
- (3) Names are written in italics to distinguish them from sequences of actual characters.
- In grammar rules:
- (1) Names are written in italics (but now in a different font, so we can tell them from names for regular expressions).
- (2) The vertical bar still appears as the metasymbol for choice.
- (3) Concatenation is also used as a standard operation.
- (4) There is no metasymbol for repetition (like the * of regular expressions).
- (5) The arrow symbol → is used instead of equality to express the definitions of names.
- (6) Grammar rules use regular expressions as components.
- The notation was developed by John Backus and adapted by Peter Naur for the Algol60 report. Thus, grammar rules in this form are usually said to be in Backus-Naur form, or BNF.
3.2.2 Specification of context-free grammar rules
- Grammar rules are defined over an alphabet, or set of symbols. The symbols are usually tokens representing strings of characters.
- We will use the regular expressions themselves to represent the tokens.
- (1) In the case where a token is a fixed symbol, as in the reserved word while or a special symbol, we write the string itself in the code font used in Chapter 2.
- (2) In the case of tokens such as identifiers and numbers, which represent more than one string, we use code font in italics, just as though the token is a name for a regular expression.
Given an alphabet, a context-free grammar rule in BNF consists of a string of symbols.
- (1) The first symbol is a name for a structure.
- (2) The second symbol is the metasymbol "→".
- (3) This symbol is followed by a string of symbols, each of which is either a symbol from the alphabet, a name for a structure, or the metasymbol "|".
- In informal terms, a grammar rule in BNF is interpreted as follows:
- (1) The rule defines the structure whose name is to the left of the arrow.
- (2) The structure is defined to consist of one of the choices on the right-hand side, separated by the vertical bars.
- (3) The sequences of symbols and structure names within each choice define the layout of the structure.
3.2.3 Derivations and the language defined by a grammar
- How grammar rules determine a "language," or set of legal strings of tokens.
- For example, (34-3)*42 corresponds to the legal string of seven tokens ( number - number ) * number.
- (34-3*42 is not a legal expression, because there is a left parenthesis that is not matched by a right parenthesis, and the second choice in the grammar rule for an exp requires that parentheses be generated in pairs.
- Derivation
- Grammar rules determine the legal strings of token symbols by means of derivations.
- A derivation is a sequence of replacements of structure names by choices on the right-hand sides of grammar rules.
- A derivation begins with a single structure name and ends with a string of token symbols.
- At each step in a derivation, a single replacement is made using one choice from a grammar rule.
- exp → exp op exp | ( exp ) | number
- op → + | - | *
- Figure 3.1: a derivation
- (1) exp => exp op exp                    [exp → exp op exp]
- (2)     => exp op number                 [exp → number]
- (3)     => exp * number                  [op → *]
- (4)     => ( exp ) * number              [exp → ( exp )]
- (5)     => ( exp op exp ) * number       [exp → exp op exp]
- (6)     => ( exp op number ) * number    [exp → number]
- (7)     => ( exp - number ) * number     [op → -]
- (8)     => ( number - number ) * number  [exp → number]
- Derivation steps use a different arrow (=>) from the arrow metasymbol (→) in the grammar rules, because grammar rules define, while derivation steps construct by replacement.
- L(G) = { s | exp =>* s }
- G represents the expression grammar.
- s represents an arbitrary string of token symbols (sometimes called a sentence).
- The symbols =>* stand for a derivation consisting of a sequence of replacements as described earlier. (The asterisk is used to indicate a sequence of steps, much as it indicates repetition in regular expressions.)
- Grammar rules are sometimes called productions because they "produce" the strings in L(G) via derivations.
- The set of all strings of token symbols obtained by derivations from the exp symbol is the language defined by the grammar of expressions.
Recursive grammar rules
- Recursion: the grammar rule A → A a | a
- or the grammar rule A → a A | a
- generates the language { a^n | n an integer ≥ 1 } (the set of all strings of one or more a's), which is the same language as that generated by the regular expression a+.
- The string aaaa can be generated by the first grammar rule with the derivation
- A => Aa => Aaa => Aaaa => aaaa
- Left recursive: the nonterminal A appears as the first symbol on the right-hand side of the rule defining A.
- Right recursive: the nonterminal A appears as the last symbol on the right-hand side of the rule defining A.
- Consider a rule of the form A → A α | β, where α and β represent arbitrary strings and β does not begin with A. This rule generates all strings of the form β, βα, βαα, βααα, ... (all strings beginning with a β, followed by zero or more α's). Thus, this grammar rule is equivalent in its effect to the regular expression βα*. Similarly, the right recursive grammar rule A → α A | β (where β does not end in A) generates all strings β, αβ, ααβ, αααβ, ....
3.3 Parse trees and abstract syntax trees
Parse trees
- Derivations do not uniquely represent the structure of the strings they construct. In general, there are many derivations for the same string.
- A parse tree corresponding to a derivation is a labeled tree in which the interior nodes are labeled by nonterminals, the leaf nodes are labeled by terminals, and the children of each internal node represent the replacement of the associated nonterminal in one step of the derivation.
- A parse tree corresponds in general to many derivations, all of which represent the same basic structure for the parsed string of terminals.
- It is possible to distinguish particular derivations that are uniquely associated with the parse tree.
- A leftmost derivation: a derivation in which the leftmost nonterminal is replaced at each step in the derivation. It corresponds to the preorder numbering of the internal nodes of its associated parse tree.
- A rightmost derivation: a derivation in which the rightmost nonterminal is replaced at each step in the derivation. It corresponds to the postorder numbering of the internal nodes of its associated parse tree.
- Abstract syntax trees, or syntax trees
- 1. Such trees represent abstractions of the actual source code token sequences, and the token sequences cannot be recovered from them (unlike parse trees). Nevertheless, they contain all the information needed for translation, in a more efficient form than parse trees.
- 2. A parse tree is a representation for the structure of ordinary syntax, called concrete syntax when comparing it to abstract syntax.
- 3. Abstract syntax can be given a formal definition using a BNF-like notation, just like concrete syntax.
3.5 Extended notations: EBNF and syntax diagrams
- 3.5.1 EBNF notation
- BNF notation is sometimes extended to include special notations for repetitive and optional constructs. These extensions comprise a notation that is called extended BNF, or EBNF.
- Repetition
- A → A α | β (left recursive), and
- A → α A | β (right recursive)
- where α and β are arbitrary strings of terminals and non-terminals, and in the first rule β does not begin with A, and in the second β does not end with A.
- We could use the same notation for repetition that regular expressions use, namely the asterisk (also called Kleene closure in regular expressions):
- A → β α*
- and
- A → α* β
- EBNF opts to use curly brackets { ... } to express repetition (thus making clear the extent of the string to be repeated), and we write
- A → β { α }
- and
- A → { α } β
- The problem with any repetition notation is that it obscures how the parse tree is to be constructed, but, as we have seen, we often do not care.
Optional constructs in EBNF
- Optional constructs are indicated by surrounding them with square brackets [ ... ].
- The grammar rules for if-statements with optional else-parts (Examples 3.4 and 3.6) would be written as follows in EBNF:
- statement → if-stmt | other
- if-stmt → if ( exp ) statement [ else statement ]
- exp → 0 | 1
- and stmt-sequence → stmt ; stmt-sequence | stmt is written as
- stmt-sequence → stmt [ ; stmt-sequence ]
3.5.2 Syntax diagrams
- Syntax diagrams: graphical representations for visually representing EBNF rules.
- An example: consider the grammar rule factor → ( exp ) | number and its syntax diagram.
- (1) Boxes represent terminals and non-terminals.
- (2) Arrowed lines represent sequencing and choices.
- (3) A non-terminal label on each diagram names the grammar rule defining that non-terminal.
- (4) A round or oval box is used to indicate terminals in a diagram.
- (5) A square or rectangular box is used to indicate non-terminals.
3.6 Formal properties of context-free languages
3.6.1 A formal definition of context-free languages
- Definition: A context-free grammar consists of the following:
- 1. A set T of terminals.
- 2. A set N of non-terminals (disjoint from T).
- 3. A set P of productions, or grammar rules, of the form A → α, where A is an element of N and α is an element of (T ∪ N)* (a possibly empty sequence of terminals and non-terminals).
- 4. A start symbol S from the set N.
- Let G be a grammar as defined above; we write G = (T, N, P, S).
- A derivation step over G is of the form
- α A γ => α β γ
- where α and γ are elements of (T ∪ N)*, and A → β is in P.
- The set of symbols: the union T ∪ N of the sets of terminals and non-terminals.
- A sentential form: a string α in (T ∪ N)*.
- α =>* β if and only if there is a sequence of zero or more derivation steps (n ≥ 0)
- α1 => α2 => ... => αn-1 => αn
- such that α = α1 and β = αn. (If n = 0, then α = β.)
- A derivation over the grammar G is of the form
- S =>* w
- where w ∈ T* (i.e., w is a string of terminals only, called a sentence), and S is the start symbol of G.
- The language generated by G, written L(G), is defined as the set
- L(G) = { w ∈ T* | there exists a derivation S =>* w of G }.
- That is, L(G) is the set of sentences derivable from S.
- A leftmost derivation S =>*lm w is a derivation in which each derivation step
- α A γ => α β γ
- is such that α ∈ T*; that is, α consists only of terminals.
- A rightmost derivation is one in which each derivation step α A γ => α β γ has the property that γ ∈ T*.
- A parse tree over the grammar G is a rooted labeled tree with the following properties:
- 1. Each node is labeled with a terminal, a non-terminal, or ε.
- 2. The root node is labeled with the start symbol S.
- 3. Each leaf node is labeled with a terminal or with ε.
- 4. Each non-leaf node is labeled with a non-terminal.
- 5. If a node with label A ∈ N has n children with labels X1, X2, ..., Xn (which may be terminals or non-terminals), then A → X1 X2 ... Xn ∈ P (a production of the grammar).
- Each derivation gives rise to a parse tree.
- In general, many derivations may give rise to the same parse tree.
- Each parse tree has a unique leftmost and a unique rightmost derivation that give rise to it.
- The leftmost derivation corresponds to a preorder traversal of the parse tree.
- The rightmost derivation corresponds to the reverse of a postorder traversal of the parse tree.
- A set of strings L is said to be a context-free language if there is a context-free grammar G such that L = L(G).
- A grammar G is ambiguous if there exists a string w ∈ L(G) such that w has two distinct parse trees (or leftmost or rightmost derivations).
- There is a sense, however, in which equality of left- and right-hand sides in a grammar rule still holds, but the defining process of the language that results from this view is different.
- Consider, for example, the following grammar rule, which is extracted (in simplified form) from our simple expression grammar:
- exp → exp + exp | number
- A non-terminal name like exp defines a set of strings of terminals, called E (which is the language of the grammar if the non-terminal is the start symbol).
- Let N be the set of natural numbers (corresponding to the regular expression name number).
- Then the given grammar rule can be interpreted as the set equation
- E = (E + E) ∪ N
- This is a recursive equation for the set E:
- E = N ∪ (N+N) ∪ (N+N+N) ∪ (N+N+N+N) ∪ ...
3.6.3 The Chomsky hierarchy and the limits of syntax as context-free rules
- Type 0: unrestricted grammar, equivalent to Turing machines.
- Type 1: context-sensitive grammar.
- Type 2: context-free grammar, equivalent to pushdown automata.
- Type 3: regular grammar, equivalent to finite automata.
- The language classes they construct are also referred to as the Chomsky hierarchy, after Noam Chomsky, who pioneered their use to describe natural languages.
- These grammars represent distinct levels of computational power.
Chapter 4 Top-Down Parsing
- OUTLINE
- Top-Down Parsing
- A top-down parser parses an input string of tokens by tracing out the steps in a leftmost derivation. The implied traversal of the parse tree is a preorder traversal and, thus, occurs from the root to the leaves.
- Example: the string number + number corresponds to the parse tree
- exp
- exp op exp
- number + number
- The above parse tree corresponds to the leftmost derivation
- (1) exp => exp op exp
- (2)     => number op exp
- (3)     => number + exp
- (4)     => number + number
- Two forms of top-down parsers:
- Predictive parsers: attempt to predict the next construction in the input string using one or more look-ahead tokens.
- Backtracking parsers: try different possibilities for a parse of the input, backing up an arbitrary amount in the input if one possibility fails. Backtracking parsers are more powerful but much slower, making them unsuitable for practical compilers.
- Two kinds of top-down parsing algorithms:
- Recursive-descent parsing: quite versatile and suitable for a handwritten parser.
- LL(1) parsing: the first L refers to the fact that it processes the input from Left to right; the second L refers to the fact that it traces out a Leftmost derivation for the input string; the number 1 means that it uses only one symbol of input to predict the direction of the parse.
- Look-ahead sets
- First and Follow sets are required by both recursive-descent parsing and LL(1) parsing.
4.1 TOP-DOWN PARSING BY RECURSIVE-DESCENT
4.1.1 The Basic Method of Recursive-Descent
- The idea of recursive-descent parsing:
- We view the grammar rule for a non-terminal A as a definition for a procedure to recognize an A.
- The right-hand side of the grammar rule for A specifies the structure of the code for this procedure.
- The first example
- The expression grammar:
- expr → expr addop term | term
- addop → + | -
- term → term mulop factor | factor
- mulop → *
- factor → ( expr ) | number
- A recursive-descent procedure that recognizes a factor is as follows (in pseudo-code):
- procedure factor
- begin
-   case token of
-     ( : match( ( ) ; expr ; match( ) ) ;
-     number : match(number) ;
-     else error ;
-   end case ;
- end factor
4.1.2 Repetition and Choice: Using EBNF
- Consider the exp rule in the grammar for simple arithmetic expressions in BNF.
- A question: whether the left associativity implied by the curly brackets (and explicit in the original BNF) can still be maintained within this code.
- A working simple calculator in C code.
- Notes
- Construction of the syntax tree
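The text's working calculator is in C; the following is our own Python rendering of the same idea, a sketch rather than the book's code. The EBNF rule exp → term { addop term } becomes a loop, which keeps subtraction left associative: 8-3-2 is evaluated as (8-3)-2.

```python
import re

def parse(src):
    # crude scanner standing in for getToken: numbers and operator symbols
    tokens = re.findall(r"\d+|[()+*-]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def match(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r}")
        pos += 1
        return tok

    def exp():                      # exp -> term { (+|-) term }
        value = term()
        while peek() in ('+', '-'):
            if match() == '+':      # consume the operator, then fold left
                value += term()
            else:
                value -= term()
        return value

    def term():                     # term -> factor { * factor }
        value = factor()
        while peek() == '*':
            match('*')
            value *= factor()
        return value

    def factor():                   # factor -> ( exp ) | number
        if peek() == '(':
            match('(')
            value = exp()
            match(')')
            return value
        tok = match()
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok)

    value = exp()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return value

print(parse("(3+4)*2-1"))   # prints 13
```

Because the loop folds each new term into the running value as soon as it is parsed, the associativity question raised above is answered: the `{ ... }` loop preserves left associativity.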
4.1.3 Further Decision Problems
- The recursive-descent method is quite powerful and adequate to construct a complete parser, but we need more formal methods to deal with complex situations:
- (1) It may be difficult to convert a grammar in BNF into EBNF form.
- (2) It is difficult to decide when to use the choice A → α and when to use the choice A → β if both α and β begin with non-terminals. (First sets.)
- (3) It may be necessary to know what tokens can legally follow the non-terminal A when writing the code for an ε-production A → ε. (Follow sets.)
- (4) Computing the First and Follow sets is also required to detect errors as early as possible. Given an input such as )3-2), the parser would otherwise descend from exp to term to factor before an error is reported.
4.2 LL(1) PARSING
4.2.1 The Basic Method of LL(1) Parsing
- An example: a simple grammar for the strings of balanced parentheses:
- S → ( S ) S | ε
- The following table shows the actions of a top-down parser given this grammar and the string ( ):
- Step  Parsing stack  Input  Action
- 1     S              ( )    S → ( S ) S
- 2     S ) S (        ( )    match
- 3     S ) S          )      S → ε
- 4     S )            )      match
- 5     S                     S → ε
- 6                           accept
- (The parsing stack is written with its top to the right.)
- A top-down parser begins by pushing the start symbol onto the stack.
- It accepts an input string if, after a series of actions, the stack and the input become empty.
The two actions
- Generate: replace a non-terminal A at the top of the stack by a string α (in reverse) using a grammar rule A → α.
- Match: match a token on top of the stack with the next input token.
- The list of generating actions in the above table:
- S => ( S ) S    [S → ( S ) S]
-   => ( ) S      [S → ε]
-   => ( )        [S → ε]
- This corresponds precisely to the steps in a leftmost derivation of the string ( ), which is the characteristic of top-down parsing.
- Constructing a parse tree: add node construction actions as each non-terminal or terminal is pushed onto the stack.
4.2.2 The LL(1) Parsing Table and Algorithm
4.2.3 Left Recursion Removal and Left Factoring
- Two standard techniques for the repetition and choice problems:
- Left recursion removal
- exp → exp addop term | term
- (in recursive-descent parsing, EBNF: exp → term { addop term })
- Left factoring
- if-stmt → if ( exp ) statement | if ( exp ) statement else statement
- (in recursive-descent parsing, EBNF: if-stmt → if ( exp ) statement [ else statement ])
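The generate/match trace for the balanced-parentheses grammar in Section 4.2.1 can be reproduced with a tiny table-driven parser. This is an illustrative sketch (the table encoding and the `$` end marker convention are ours, not the text's code); the grammar is S → ( S ) S | ε.

```python
TABLE = {
    ('S', '('): ['(', 'S', ')', 'S'],   # S -> ( S ) S
    ('S', ')'): [],                     # S -> empty
    ('S', '$'): [],                     # S -> empty
}

def ll1_parse(tokens):
    tokens = list(tokens) + ['$']       # '$' marks end of input
    stack = ['$', 'S']                  # push the start symbol; top is stack[-1]
    i = 0
    actions = []
    while stack[-1] != '$':
        top, look = stack[-1], tokens[i]
        if top == look:                 # match a terminal
            stack.pop()
            i += 1
            actions.append(f"match {look}")
        elif (top, look) in TABLE:      # generate: replace the non-terminal,
            stack.pop()                 # pushing the right-hand side in reverse
            stack.extend(reversed(TABLE[(top, look)]))
            actions.append(f"{top} -> {''.join(TABLE[(top, look)]) or 'empty'}")
        else:
            return False, actions       # no table entry: reject
    return tokens[i] == '$', actions    # accept only if input is also exhausted

ok, actions = ll1_parse("()")
print(ok)
for a in actions:
    print(a)
```

Run on `"()"`, the recorded actions follow the six-step table exactly: one generate with S → ( S ) S, two matches, and two applications of S → ε before acceptance.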
- (1) Left Recursion Removal
- Left recursion is commonly used to make operations left associative, as in the simple expression grammar, where
- exp → exp addop term | term
- Immediate left recursion: the left recursion occurs only within the production of a single non-terminal:
- exp → exp + term | exp - term | term
- Algorithm for general left recursion removal:
- for i := 1 to m do
-   for j := 1 to i-1 do
-     replace each grammar rule choice of the form Ai → Aj β by the rule
-     Ai → α1 β | α2 β | ... | αk β,
-     where Aj → α1 | α2 | ... | αk is the current rule for Aj
- Explanation:
- (1) Pick an arbitrary order for all the non-terminals, say A1, ..., Am.
- (2) The inner loop eliminates all rule choices of the form Ai → Aj γ with j < i.
- (3) Every step in such a loop can only increase the index of the leading non-terminal, so the original index cannot be reached again.
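The general algorithm relies on removing immediate left recursion for each non-terminal. That base transformation, A → A α1 | ... | A αk | β1 | ... | βm becoming A → β1 A' | ... and A' → α1 A' | ... | ε, can be sketched as follows (the primed-name convention and the list-of-token-lists grammar encoding are ours):

```python
def remove_immediate_left_recursion(name, choices):
    """choices: list of right-hand sides as token lists,
    e.g. [['exp', 'addop', 'term'], ['term']] for exp."""
    recursive = [c[1:] for c in choices if c and c[0] == name]   # the alphas
    others = [c for c in choices if not c or c[0] != name]       # the betas
    if not recursive:
        return {name: choices}          # nothing to do
    prime = name + "'"
    return {
        # A -> beta_1 A' | ... | beta_m A'
        name: [c + [prime] for c in others],
        # A' -> alpha_1 A' | ... | alpha_k A' | empty ([] is the empty choice)
        prime: [c + [prime] for c in recursive] + [[]],
    }

# exp -> exp addop term | term  becomes
# exp -> term exp'  and  exp' -> addop term exp' | empty
g = remove_immediate_left_recursion('exp', [['exp', 'addop', 'term'], ['term']])
print(g)
```

The resulting right-recursive rules generate the same language while being usable by a predictive parser.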
- (2) Left Factoring
- Left factoring is required when two or more grammar rule choices share a common prefix string, as in the rule
- A → α β | α γ
- Example:
- stmt-sequence → stmt ; stmt-sequence | stmt
- stmt → s
- An LL(1) parser cannot distinguish between the production choices in such a situation. The solution in this simple case is to factor the α out on the left and rewrite the rule as two rules:
- A → α A'
- A' → β | γ
71- Algorithm for left factoring a grammar
- While there are changes to the grammar do
-   For each non-terminal A do
-     Let α be a prefix of maximal length that is
shared by two or more production choices for A
-     If α ≠ ε then
-       Let A → α1 | α2 | … | αn be all the
production choices for A,
-       and suppose that α1, α2, …, αk share α, so
that A → αβ1 | αβ2 | … | αβk | αk+1 | … | αn, the
βj's share no common prefix, and αk+1, …, αn do
not share α
-       Replace the rule A → α1 | α2 | … | αn by
the rules
-       A → αA' | αk+1 | … | αn
-       A' → β1 | β2 | … | βk
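One pass of this algorithm can be sketched in Python; as before, the tuple representation and the primed name A' are our own conventions for the sketch:

```python
def common_prefix(a, b):
    """Longest common prefix of two tuples of symbols."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a[:i]

def left_factor_once(nonterminal, choices):
    """One left-factoring step: find a maximal-length prefix alpha shared
    by two or more choices and factor it out into a fresh A'."""
    best = ()
    for i in range(len(choices)):
        for j in range(i + 1, len(choices)):
            p = common_prefix(choices[i], choices[j])
            if len(p) > len(best):
                best = p
    if not best:                    # alpha = epsilon: nothing to factor
        return {nonterminal: choices}
    shared = [c[len(best):] for c in choices if c[:len(best)] == best]
    rest = [c for c in choices if c[:len(best)] != best]
    new_nt = nonterminal + "'"
    return {nonterminal: [best + (new_nt,)] + rest, new_nt: shared}
```

On stmt-sequence → stmt ; stmt-sequence | stmt this produces stmt-sequence → stmt stmt-sequence' and stmt-sequence' → ; stmt-sequence | ε. The outer "while there are changes" loop of the slide would repeat this step until no common prefix remains.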
72- 4.2.4 Syntax Tree Construction in LL(1) Parsing
- It is more difficult for LL(1) parsing to adapt
to syntax tree construction.
- (1) The structure of the syntax tree can be
obscured by left factoring and left recursion
removal
- (2) The parsing stack represents only
predicted structure, not structure that has
actually been seen.
-
- The solution
- (1) An extra stack is used to keep track of
syntax tree nodes, and
- (2) action markers are placed in the parsing
stack to indicate when and what actions on the
tree stack should occur.
73- How to compute the arithmetic value of an
expression
- (1) Use a separate stack to store the
intermediate values of the computation, called
the value stack
- (2) Schedule two operations on that stack
- a push of a number, and
- the addition of two numbers.
- (3) PUSH can be performed by the match
procedure, and
- (4) ADDITION should be scheduled on the stack,
by pushing a special symbol (such as #) on the
parsing stack.
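A minimal Python sketch of this scheme, for sums of numbers with the assumed grammar E → n E', E' → + n E' | ε; the marker symbol '#', the state names, and the token representation are our own choices, not the text's:

```python
def evaluate(tokens):
    """LL(1)-style evaluation of n + n + ... + n using a value stack.
    The parsing stack holds grammar symbols plus the action marker '#',
    which schedules the addition of the top two values."""
    parse = ["E"]          # parsing stack, top at the end of the list
    values = []            # the value stack
    pos = 0
    while parse:
        top = parse.pop()
        if top == "E":                              # generate E -> n E'
            parse += ["E'", "n"]
        elif top == "E'":
            if pos < len(tokens) and tokens[pos] == "+":
                parse += ["E'", "#", "n", "+"]      # generate E' -> + n # E'
            # otherwise generate E' -> epsilon: push nothing
        elif top == "#":                            # the scheduled addition
            b, a = values.pop(), values.pop()
            values.append(a + b)
        elif top == "n":                            # match a number, PUSH its value
            values.append(int(tokens[pos]))
            pos += 1
        else:                                       # match a token such as '+'
            assert top == tokens[pos]
            pos += 1
    return values.pop()
```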
744.3 FIRST AND FOLLOW SETS
- The LL(1) parsing algorithm is based on the LL(1)
parsing table - The LL(1) parsing table construction involves the
First and Follow sets. -
754.3.1 First Sets
- Definition
- Let X be a grammar symbol (a terminal or
non-terminal) or ε. Then First(X) is a set of
terminals or ε, which is defined as follows
- 1. If X is a terminal or ε, then First(X) = {X}
- 2. If X is a non-terminal, then for each
production choice X → X1X2…Xn,
- First(X) contains First(X1) - {ε}.
- If also for some i < n, all the sets
First(X1), …, First(Xi) contain ε, then First(X)
contains First(Xi+1) - {ε}.
- If all the sets First(X1), …, First(Xn) contain
ε, then First(X) contains ε.
76- Let α be a string of terminals and
non-terminals, α = X1X2…Xn. First(α) is defined as
follows
- 1. First(α) contains First(X1) - {ε}
- 2. For each i = 2, …, n, if for all k = 1, …, i-1,
First(Xk) contains ε, then First(α)
contains First(Xi) - {ε}.
- 3. If all the sets First(X1), …, First(Xn) contain
ε, then First(α) contains ε.
77- Algorithm for computing First(A) for all
non-terminals A
- For all non-terminals A do First(A) := {}
- While there are changes to any First(A) do
-   For each production choice A → X1X2…Xn do
-     k := 1; Continue := true
-     While Continue = true and k <= n do
-       Add First(Xk) - {ε} to First(A)
-       If ε is not in First(Xk) then
Continue := false
-       k := k + 1
-     If Continue = true then add ε to First(A)
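The fixed-point algorithm above translates directly to Python. The grammar representation here (non-terminal mapped to a list of choice tuples, the empty tuple for an ε-production, and the string "eps" standing in for ε) is an assumption of this sketch:

```python
EPS = "eps"  # stands in for the empty pseudotoken epsilon

def first_sets(grammar, terminals):
    """Fixed-point computation of First(A) for every non-terminal,
    following the slide's algorithm. `grammar` maps a non-terminal to a
    list of production choices (tuples of symbols)."""
    first = {A: set() for A in grammar}

    def first_of(X):
        return {X} if X in terminals else first[X]

    changed = True
    while changed:
        changed = False
        for A, choices in grammar.items():
            for choice in choices:
                add = set()
                all_contain_eps = True
                for X in choice:            # k = 1 .. n while Continue holds
                    fx = first_of(X)
                    add |= fx - {EPS}
                    if EPS not in fx:
                        all_contain_eps = False
                        break
                if all_contain_eps:         # every First(Xk) contained epsilon
                    add.add(EPS)
                if not add <= first[A]:
                    first[A] |= add
                    changed = True
    return first
```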
78- 4.3.2 Follow Sets
- Definition
- Given a non-terminal A, the set Follow(A) is
defined as follows.
- (1) If A is the start symbol, then $ is in
Follow(A).
- (2) If there is a production B → αAγ, then
First(γ) - {ε} is in Follow(A).
- (3) If there is a production B → αAγ such
that ε is in First(γ), then
Follow(A) contains Follow(B).
79- Note The symbol $ is used to mark the end of
the input.
- The empty pseudotoken ε is never an element of a
follow set.
- Follow sets are defined only for non-terminals.
- Follow sets work "on the right" in productions,
while First sets work "on the left" in productions.
- Given a grammar rule A → αB, Follow(B) will
contain Follow(A); this is the opposite of the
situation for First sets: if A → Bα, First(A)
contains First(B), except possibly for ε.
80- Algorithm for the computation of Follow sets
- Follow(start-symbol) := {$}
- For all non-terminals A ≠ start-symbol do
Follow(A) := {}
- While there are changes to any Follow sets do
-   For each production A → X1X2…Xn do
-     For each Xi that is a non-terminal do
-       Add First(Xi+1Xi+2…Xn) - {ε} to Follow(Xi)
-       If ε is in First(Xi+1Xi+2…Xn) then
-         Add Follow(A) to Follow(Xi)
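This algorithm, too, can be sketched in Python. To keep the sketch short it takes the already-computed First sets as a parameter (for example from a First-set pass); the representation conventions are ours:

```python
EPS, END = "eps", "$"  # epsilon pseudotoken and end-of-input marker

def follow_sets(grammar, terminals, start, first):
    """Fixed-point computation of Follow(A), following the slide's
    algorithm. `first` maps each non-terminal to its First set."""
    def first_of_string(symbols):
        result = set()
        for X in symbols:
            fx = {X} if X in terminals else first[X]
            result |= fx - {EPS}
            if EPS not in fx:
                return result
        result.add(EPS)     # every symbol in the string can derive epsilon
        return result

    follow = {A: set() for A in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for A, choices in grammar.items():
            for choice in choices:
                for i, X in enumerate(choice):
                    if X not in grammar:
                        continue    # Follow is defined only for non-terminals
                    tail = first_of_string(choice[i + 1:])
                    add = tail - {EPS}
                    if EPS in tail:
                        add |= follow[A]
                    if not add <= follow[X]:
                        follow[X] |= add
                        changed = True
    return follow
```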
81- 4.3.3 Constructing LL(1) Parsing Tables
- The table-constructing rules, which have been
mentioned
- (1) If A → α is a production choice, and there is
a derivation α ⇒* aβ, where a is a token, then add
A → α to the table entry M[A, a]
- (2) If A → α is a production choice, and there
are derivations α ⇒* ε and S$ ⇒* βAaγ, where S is
the start symbol and a is a token (or $), then add
A → α to the table entry M[A, a]
- Clearly, the token a in rule (1) is in First(α),
and the token a of rule (2) is in Follow(A). Thus
we can obtain the following algorithmic
construction of the LL(1) parsing table
82- Repeat the following two steps for each
non-terminal A and production choice A → α.
- 1. For each token a in First(α), add A → α to
the entry M[A, a].
- 2. If ε is in First(α), for each element a of
Follow(A) (a token or $), add A → α to M[A, a].
-
- Theorem
- A grammar in BNF is LL(1) if the following
conditions are satisfied.
- 1. For every production A → α1 | α2 | … | αn,
First(αi) ∩ First(αj) is empty for all i and j,
1 ≤ i, j ≤ n, i ≠ j.
- 2. For every non-terminal A such that First(A)
contains ε, First(A) ∩ Follow(A) is empty.
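The two construction steps can be sketched as a Python function that takes precomputed First and Follow sets; a conflict (two productions landing in the same entry) means the grammar is not LL(1). The representation conventions are again our own:

```python
EPS, END = "eps", "$"

def ll1_table(grammar, terminals, first, follow):
    """Build the LL(1) parsing table M[A, a] from precomputed First and
    Follow sets, raising an error on a conflict."""
    def first_of_string(symbols):
        result = set()
        for X in symbols:
            fx = {X} if X in terminals else first[X]
            result |= fx - {EPS}
            if EPS not in fx:
                return result
        result.add(EPS)
        return result

    table = {}
    for A, choices in grammar.items():
        for choice in choices:
            f = first_of_string(choice)
            lookaheads = f - {EPS}      # step 1: tokens in First(alpha)
            if EPS in f:                # step 2: use Follow(A)
                lookaheads |= follow[A]
            for a in lookaheads:
                if (A, a) in table:
                    raise ValueError("not LL(1): conflict at M[%s, %s]" % (A, a))
                table[(A, a)] = choice
    return table
```

For the balanced-parentheses grammar S → (S)S | ε this produces M[S, (] = (S)S and M[S, )] = M[S, $] = ε, with no conflicts.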
834.3.4 Extending the Lookahead LL(k) Parsers
- LL(1) parsing The first L refers to the fact
that it processes the input from left to right;
the second L refers to the fact that it traces
out a leftmost derivation for the input string;
the number 1 means that it uses only one symbol
of input to predict the direction of the parse.
84- The LL(1) parsing method can be extended to k
symbols of lookahead.
- Definitions
- Firstk(α) = { wk | α ⇒* w }, where wk is the
first k tokens of the string w if the length of w
> k, and otherwise is the same as w.
- Followk(A) = { wk | S ⇒* αAw }, where wk is the
first k tokens of the string w if the length of w
> k, and otherwise is the same as w.
-
- LL(k) parsing table
- The construction can be performed as that of
LL(1).
85- The complications in LL(k) parsing
- (1) The parsing table becomes larger
- (2) The parsing table itself does not express
the complete power of LL(k), because the follow
strings do not occur in all contexts.
- Another kind of parsing, distinguished from
LL(k), is called Strong LL(k) parsing or SLL(k)
parsing.
-
- LL(k) and SLL(k) parsers are uncommon,
- (1) partially because of the added complexity, and
- (2) primarily because a grammar that fails to be
LL(1) is in practice likely not to be LL(k) for
any k.
86Chapter 5 Bottom-Up Parsing
- 5.1 OVERVIEW OF BOTTOM-UP PARSING
- A bottom-up parser uses an explicit stack to
perform a parse
- The parsing stack contains tokens, nonterminals,
as well as some extra state information
- The stack is empty at the beginning of a
bottom-up parse,
- and will contain the start symbol at the end of a
successful parse.
- A schematic for bottom-up parsing
- $              InputString$     ...
- $StartSymbol   $                accept
- where the parsing stack is on the left,
- the input is in the center, and
- the actions of the parser are on the right.
87- A bottom-up parser has two possible actions
(besides "accept")
- 1. Shift a terminal from the front of the input
to the top of the stack.
- 2. Reduce a string α at the top of the stack to a
nonterminal A, given the BNF choice A → α.
- A bottom-up parser is thus sometimes called a
shift-reduce parser.
- One further feature of bottom-up parsers is that
grammars are always augmented with a new start
symbol
- S' → S
88- Example 5.1 The augmented grammar for balanced
parentheses
- S' → S
- S → (S)S | ε
- A bottom-up parse of the string ( ) using this
grammar is given in Table 5.1.
89- The handle of a right sentential form
- A string, together with
- the position in the right sentential form where
it occurs, and
- the production used to reduce it.
- Determining the next handle is the main task of a
shift-reduce parser.
90- Note
- The string of a handle forms a complete
right-hand side of one production; the rightmost
position of the handle string is at the top of
the stack
- To be the handle, it is not enough for the string
at the top of the stack to match the right-hand
side of a production.
- Indeed, if an ε-production is available for
reduction, as in Example 5.1, then its right-hand
side (the empty string) is always at the top of
the stack.
- Reductions occur only when the resulting string
is indeed a right sentential form.
- For example, in step 3 of Table 5.1 a reduction
by S → ε could be performed, but the resulting
string ( S S ) is not a right sentential form, and
thus ε is not the handle at this position in the
sentential form ( S ).
915.2 FINITE AUTOMATA OF LR(0) ITEMS AND LR(0)
PARSING
- 5.2.1 LR(0) Items
- An LR(0) item of a context-free grammar is
- a production choice with a distinguished position
in its right-hand side. We will indicate this
distinguished position by a period.
- Example
- If A → α is a production choice, and if β and γ
are any two strings of symbols (including the
empty string ε) such that βγ = α,
- then A → β.γ is an LR(0) item.
- These are called LR(0) items because they contain
no explicit reference to lookahead.
925.2.2 Finite Automata of Items
- The LR(0) items can be used as the states of a
finite automaton that maintains information about
the parsing stack and the progress of a
shift-reduce parse.
- This will start out as a nondeterministic finite
automaton.
- From this NFA of LR(0) items we can construct the
DFA of sets of LR(0) items using the subset
construction of Chapter 2.
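The two operations at the heart of this subset construction, the ε-closure of a set of items and the transition on a grammar symbol, can be sketched in Python. Representing an item as a (lhs, rhs, dot-position) triple is our choice for this sketch:

```python
def closure(items, grammar):
    """Epsilon-closure of a set of LR(0) items: whenever the dot sits
    before a non-terminal B, add every initial item B -> .choice.
    `grammar` maps a non-terminal to its list of rhs tuples."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in grammar:
                for choice in grammar[rhs[dot]]:
                    item = (rhs[dot], choice, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

def goto(items, X, grammar):
    """DFA transition: move the dot over X, then take the closure."""
    moved = {(lhs, rhs, dot + 1) for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved, grammar)
```

For the grammar S → (S)S | ε, the closure of { S' → .S } adds the two initial items of S, giving the three-item start state of the DFA.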
93- The transitions of the NFA of LR(0) items
- Consider the item A → α.γ, and
- suppose γ begins with the symbol X, which may be
either a token or a nonterminal,
- so that the item can be written as A → α.Xη.
- There is a transition on the symbol X from the
state represented by this item to the state
represented by the item A → αX.η.
- In graphical form we write this as
- A → α.Xη  --X-->  A → αX.η
- (1) If X is a token, then this transition
corresponds to a shift of X from the input to the
top of the stack during a parse. (2) If X is a
nonterminal, such a transition will still
correspond to the pushing of X onto the stack
during a parse.
94- The start state of the NFA should correspond to
the initial state of the parser
- The stack is empty, and
- we are about to recognize an S, where S is the
start symbol of the grammar.
- Thus, any initial item S → .α constructed from a
production choice for S could serve as a start
state.
- Unfortunately, there may be many such production
choices for S.
- The solution is to augment the grammar by a
single production S' → S, where S' is a new
nonterminal.
- S' then becomes the start symbol of the augmented
grammar, and the initial item S' → .S becomes the
start state of the NFA.
- This is the reason we have augmented the grammars
of the previous examples.
- The NFA will in fact have no accepting states at
all
- The purpose of the NFA is to keep track of the
state of a parse, not to recognize strings
- The parser itself will decide when to accept, and
the NFA need not contain that information.
95- Example 5.5 In Example 5.3 we listed the eight
LR(0) items of the grammar of Example 5.1.
- The NFA therefore has eight states; it is shown
in Figure 5.1.
- Note that every item in the figure with a dot
before the nonterminal S has an ε-transition to
every initial item of S.
- S' → .S
- S' → S.
- S → .(S)S
- S → (.S)S
- S → (S.)S
- S → (S).S
- S → (S)S.
- S → .
965.2.3 The LR(0) Parsing Algorithm
- Since the algorithm depends on keeping track of
the current state in the DFA of sets of items,
- we must modify the parsing stack to store not
only symbols but also state numbers.
- We do this by pushing the new state number onto
the parsing stack after each push of a symbol.
Parsing stack                          input
97Definition The LR(0) parsing algorithm.
- Let s be the current state (at the top of the
parsing stack). Then actions are defined as
follows
- 1. If state s contains any item of the form A →
α.Xβ, where X is a terminal, then the action is to
shift the current input token onto the stack.
- (1) If this token is X, and state s contains item
A → α.Xβ, then the new state to be pushed on the
stack is the state containing the item A → αX.β.
- (2) If this token is not X for some item in state
s of the form just described, an error is
declared.
98- 2. If state s contains any complete item (an
item of the form A → γ.), then the action is to
reduce by the rule A → γ.
- A reduction by the rule S' → S, where S' is the
start symbol, is equivalent to acceptance,
provided the input is empty, and an error if the
input is not empty.
- In all other cases, the new state is computed as
follows
- Remove the string γ and all of its corresponding
states from the parsing stack
- Correspondingly, back up in the DFA to the state
from which the construction of γ began
- Again, by the construction of the DFA, this state
must contain an item of the form B → α.Aβ.
- Push A onto the stack, and push (as the new
state) the state containing the item B → αA.β.
99- Example 5.9 Consider the grammar
1025.3 SLR(1) Parsing
- Definition The LR(0) parsing algorithm.
- Let s be the current state (at the top of the
parsing stack). Then actions are defined as
follows
- 1. If state s contains any item of the form A →
α.Xβ, where X is a terminal, then the action is to
shift the current input token onto the stack.
- (1) If this token is X, and state s contains item
A → α.Xβ, then the new state to be pushed on the
stack is the state containing the item A → αX.β.
- (2) If this token is not X for some item in state
s of the form just described, an error is
declared.
103- 2. If state s contains any complete item (an
item of the form A → γ.), then the action is to
reduce by the rule A → γ.
- (1) A reduction by the rule S' → S, where S' is
the start symbol, is equivalent to acceptance,
provided the input is empty, and an error if the
input is not empty.
- (2) In all other cases, the new state is computed
as follows
- Remove the string γ and all of its corresponding
states from the parsing stack
- Correspondingly, back up in the DFA to the state
from which the construction of γ began
- Again, by the construction of the DFA, this state
must contain an item of the form B → α.Aβ.
- Push A onto the stack, and push (as the new
state) the state containing the item B → αA.β.
1055.3.1 The SLR(1) Parsing Algorithm
- Definition of the SLR(1) parsing algorithm.
- Let s be the current state; actions are defined
as follows.
- 1. If state s contains any item of the form A → α.Xβ,
- where X is a terminal, and X is the next token in
the input string,
- then the action is to shift the current input
token onto the stack, and
- push the new state containing the item A → αX.β.
- and the next token in the inupt string is in
Follow(A), - then to reduce by the rule A ? ?.
- (1) A reduction by the rule S' ?S, is equivalent
to acceptance - this will happen only if the next input token is
. - (2) In all other cases, Remove the string? and
acorresponding states from the parsing stack.
Correspondingly, back up in the DFA to the state
from which the construction of ? began. - This state must contain an item of the form B ?
aAß. - 3. Push A onto the stack, and the state
containing the item B ? aAß. 3. If the next
input token is such that neither of the above two
cases applies, an error is declared
107- A grammar is an SLR(1) grammar if the
application of the above SLR(1) parsing rules
results in no ambiguity.
- A grammar is SLR(1) if and only if, for any
state s, the following two conditions are
satisfied
- 1. For any item A → α.Xβ in s with X a terminal,
- there is no complete item B → γ. in s with X in
Follow(B).
- 2. For any two complete items A → α. and B → β.
in s,
- Follow(A) ∩ Follow(B) is empty.
108- 5.3.2 Disambiguating Rules for Parsing Conflicts
- Parsing conflicts in SLR(1) parsing can be of
two kinds
- shift-reduce conflicts and reduce-reduce
conflicts.
- In the case of shift-reduce conflicts, there is a
natural disambiguating rule, which is to always
prefer the shift over the reduce. Most
shift-reduce parsers therefore automatically
resolve shift-reduce conflicts by preferring the
shift over the reduce.
- The case of reduce-reduce conflicts is more
difficult; such conflicts often (but not always)
indicate an error in the design of the grammar.
109- 5.3.3 Limits of SLR(1) Parsing Power
- Example 5.13 Consider the following grammar
rules for statements.
- stmt → call-stmt | assign-stmt
- call-stmt → identifier
- assign-stmt → var := exp
- var → var [ exp ] | identifier
- exp → var | number
- We simplify this situation to the following
grammar, without changing the basic situation
- S → id | V := E
- V → id
- E → V | n
- To show how this grammar results in a parsing
conflict in SLR(1) parsing, consider the start
state of the DFA of sets of items
- S' → .S
- S → .id
- S → .V := E
- V → .id
- This state has a shift transition on id to the
state
- S → id.
- V → id.
1105.3.4 SLR(k) Grammars
- As with other parsing algorithms, the SLR(1)
parsing algorithm can be extended to SLR(k)
parsing, where parsing actions are based on k > 1
symbols of lookahead.
- Using the sets Firstk and Followk as defined in
the previous chapter, an SLR(k) parser uses the
following two rules
- 1. If state s contains an item of the form A →
α.Xβ (X a token), and Xw ∈ Firstk(Xβ) are the
next k tokens in the input string, then the
action is to shift the current input token onto
the stack, and the new state to be pushed on the
stack is the state containing the item A → αX.β.
- 2. If state s contains the complete item A → α.,
and w ∈ Followk(A) are the next k tokens in the
input string, then the action is to reduce by the
rule A → α.
1115.4 General LR(1) and LALR(1) Parsing
- 5.4.1 Finite Automata of LR(1) Items
- The SLR(1) method
- applies lookaheads after the construction of the
DFA of LR(0) items
- the construction of that DFA ignores lookaheads.
- The general LR(1) method
- uses a new DFA with the lookaheads built into
its construction
- The DFA items are an extension of LR(0) items
- LR(1) items include a single lookahead token in
each item
- a pair consisting of an LR(0) item and a
lookahead token.
- We write LR(1) items using square brackets as
- [A → α.β, a]
- where A → α.β is an LR(0) item and a is a token
(the lookahead).
112- The definition of the transitions between
LR(1) items
- Similar to the LR(0) transitions, except for
keeping track of lookaheads
- As with LR(0) items, ε-transitions are included,
and to build a DFA whose states are sets of it