Title: Compiler Construction
1Compiler Construction
2Lexical Analysis
- "Get next token" is a command sent from the parser to the lexical analyzer.
- On receipt of the command, the lexical analyzer scans the input until it determines the next token, and returns it.
3Other jobs of the lexical analyzer
- We also want the lexer to
  - Strip out comments and white space from the source code.
  - Correlate parser errors with the source code location (the parser doesn't know what line of the file it's at, but the lexer does).
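As a rough illustration of these two jobs, a hand-written lexer might skip white space and comments while counting newlines. This is only a sketch: it assumes C-style /* ... */ comments and a NUL-terminated input buffer, and the name skip_ws_and_comments is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: starting at src[pos], skip blanks, tabs, newlines
   and C-style block comments. Returns the index of the next significant
   character and increments *line for every newline skipped. */
size_t skip_ws_and_comments(const char *src, size_t pos, int *line)
{
    for (;;) {
        char c = src[pos];
        if (c == '\n') {
            (*line)++;
            pos++;
        } else if (c == ' ' || c == '\t' || c == '\r') {
            pos++;
        } else if (c == '/' && src[pos + 1] == '*') {
            pos += 2;                        /* skip the opening delimiter */
            while (src[pos] != '\0' &&
                   !(src[pos] == '*' && src[pos + 1] == '/')) {
                if (src[pos] == '\n')
                    (*line)++;               /* comments may span lines */
                pos++;
            }
            if (src[pos] != '\0')
                pos += 2;                    /* skip the closing delimiter */
        } else {
            return pos;                      /* significant character or end of input */
        }
    }
}
```

Because the lexer keeps the line counter, it can attach the current line number to any error the parser reports.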
4Tokens, patterns, and lexemes
- A TOKEN is a set of strings over the source alphabet.
- A PATTERN is a rule that describes that set.
- A LEXEME is a sequence of characters matching that pattern.
- E.g. in Pascal, for the statement
  const pi = 3.1416;
  the substring pi is a lexeme for the token "identifier".
5Example tokens, lexemes, patterns
6Tokens
- Together, the complete set of tokens forms the set of terminal symbols used in the grammar for the parser.
- In most languages, the tokens fall into these categories:
  - Keywords
  - Operators
  - Identifiers
  - Constants
  - Literal strings
  - Punctuation
- Usually the token is represented as an integer.
- The lexer and parser just agree on which integers are used for each token.
7Token attributes
- If there is more than one lexeme for a token, we have to save additional information about the token.
- Example: the token number matches lexemes 10 and 20.
- Code generation needs the actual number, not just the token.
- With each token, we associate ATTRIBUTES. Normally just a pointer into the symbol table.
8Example attributes
- For C source code
  E = M * C * C
- We have token/attribute pairs
  <ID, ptr to symbol table entry for E>
  <Assign_op, NULL>
  <ID, ptr to symbol table entry for M>
  <Mult_op, NULL>
  <ID, ptr to symbol table entry for C>
  <Mult_op, NULL>
  <ID, ptr to symbol table entry for C>
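One possible way to represent these pairs in C is shown below. It is a sketch under assumptions: the token code names and values, and the one-field symbol table entry, are invented for illustration and are not taken from the slides.

```c
#include <stddef.h>

/* Token codes: the lexer and parser only need to agree on the integers.
   These names and values are illustrative assumptions. */
enum token_code { TOK_ID = 256, TOK_ASSIGN_OP, TOK_MULT_OP };

/* Hypothetical symbol table entry; a real one would also hold type, scope, etc. */
struct symbol { const char *name; };

/* A token together with its attribute (NULL when the token has none). */
struct token {
    enum token_code code;
    struct symbol  *attr;
};

static struct symbol sym_E = { "E" }, sym_M = { "M" }, sym_C = { "C" };

/* The token/attribute stream for  E = M * C * C  */
static const struct token example_stream[] = {
    { TOK_ID,        &sym_E },
    { TOK_ASSIGN_OP, NULL   },
    { TOK_ID,        &sym_M },
    { TOK_MULT_OP,   NULL   },
    { TOK_ID,        &sym_C },
    { TOK_MULT_OP,   NULL   },
    { TOK_ID,        &sym_C },
};
```

Both occurrences of C carry the same attribute because they refer to the same symbol table entry.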
9Lexical errors
- When errors occur, we could just crash.
- It is better to print an error message and then continue.
- Possible techniques to continue on error:
  - Delete a character
  - Insert a missing character
  - Replace an incorrect character by a correct character
  - Transpose adjacent characters
10Token specification
- REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification.
- Every pattern specifies a set of strings, so an RE names a set of strings.
- Definitions
  - The ALPHABET (often written Σ) is the set of legal input symbols.
  - A STRING over some alphabet Σ is a finite sequence of symbols from Σ.
  - The LENGTH of string s is written |s|.
  - The EMPTY STRING is a special 0-length string denoted ε.
11More definitions strings and substrings
- A PREFIX of s is formed by removing 0 or more trailing symbols of s.
- A SUFFIX of s is formed by removing 0 or more leading symbols of s.
- A SUBSTRING of s is formed by deleting a prefix and a suffix from s.
- A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix, suffix, or substring of s but with x ≠ s.
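These definitions translate almost directly into code. The following sketch uses NUL-terminated C strings and standard library calls only; the function names are chosen here for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* x is a prefix of s if s starts with x (x may be empty or equal to s). */
bool is_prefix(const char *x, const char *s)
{
    return strncmp(x, s, strlen(x)) == 0;
}

/* x is a suffix of s if s ends with x. */
bool is_suffix(const char *x, const char *s)
{
    size_t nx = strlen(x), ns = strlen(s);
    return nx <= ns && strcmp(s + (ns - nx), x) == 0;
}

/* x is a substring of s if it occurs somewhere inside s. */
bool is_substring(const char *x, const char *s)
{
    return strstr(s, x) != NULL;
}

/* Proper: nonempty and different from s, following the definition above. */
bool is_proper_prefix(const char *x, const char *s)
{
    return x[0] != '\0' && strcmp(x, s) != 0 && is_prefix(x, s);
}
```

For s = banana, ban is a proper prefix, nana is a suffix, and nan is a substring; banana itself is a prefix but not a proper one.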
12More definitions
- A LANGUAGE is a set of strings over a fixed alphabet Σ.
- Example languages
  - Ø (the empty set)
  - { ε }
  - { a, aa, aaa, aaaa }
- The CONCATENATION of two strings x and y is written xy.
- String EXPONENTIATION is written s^i, where s^0 = ε and s^i = s^(i-1)s for i > 0.
13Operations on languages
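- We often want to perform operations on sets of strings (languages). The important ones are:
  - The UNION of L and M: L ∪ M = { s | s is in L OR s is in M }
  - The CONCATENATION of L and M: LM = { st | s is in L and t is in M }
  - The KLEENE CLOSURE of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (zero or more concatenations of L)
  - The POSITIVE CLOSURE of L: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (one or more concatenations of L)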
14Regular expressions
- REs let us precisely define a set of strings.
- For C identifiers, we might use ( letter | _ ) ( letter | digit | _ )*
- Parentheses are for grouping, | means OR, and * means Kleene closure.
- Every RE r defines a language L(r).
15Regular expressions
- Here are the rules for writing REs over an alphabet Σ:
  - ε is an RE denoting { ε }, the language containing only the empty string.
  - If a is in Σ, then a is a RE denoting { a }.
  - If r and s are REs denoting L(r) and L(s), then
    - (r)|(s) is a RE denoting L(r) ∪ L(s)
    - (r)(s) is a RE denoting L(r)L(s)
    - (r)* is a RE denoting (L(r))*
    - (r) is a RE denoting L(r)
16Additional conventions
- To avoid too many parentheses, we assume
  - * has the highest precedence, and is left associative.
  - Concatenation has the 2nd highest precedence, and is left associative.
  - | has the lowest precedence and is left associative.
17Example REs
- a | b
- ( a | b ) ( a | b )
- a*
- ( a | b )*
- a | a*b
18Equivalence of REs
19Regular definitions
- To make our REs simpler, we can give names to subexpressions. A REGULAR DEFINITION is a sequence
  d1 -> r1
  d2 -> r2
  ...
  dn -> rn
20Regular definitions
- Example for identifiers in C
  letter -> A | B | ... | Z | a | b | ... | z
  digit -> 0 | 1 | ... | 9
  id -> ( letter | _ ) ( letter | digit | _ )*
- Example for numbers in Pascal
  digit -> 0 | 1 | ... | 9
  digits -> digit digit*
  optional_fraction -> . digits | ε
  optional_exponent -> ( E ( + | - | ε ) digits ) | ε
  num -> digits optional_fraction optional_exponent
21Notational shorthand
- To simplify our REs, we can use a few shortcuts:
  1. + means one or more instances of: a+ (ab)+
  2. ? means zero or one instance of: optional_fraction -> ( . digits )?
  3. [] creates a character class: [A-Za-z][A-Za-z0-9]*
- You can prove that these shortcuts do not increase the representational power of REs, but they are convenient.
22Token recognition
- We now know how to specify the tokens for our language. But how do we write a program to recognize them?
  if -> if
  then -> then
  else -> else
  relop -> < | <= | = | <> | > | >=
  id -> letter ( letter | digit )*
  num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
23Token recognition
- We also want to strip whitespace, so we need definitions
  delim -> blank | tab | newline
  ws -> delim+
24Attribute values
25Transition diagrams
- Transition diagrams are also called finite automata.
- We have a collection of STATES drawn as nodes in a graph.
- TRANSITIONS between states are represented by directed edges in the graph.
- Each transition leaving a state s is labeled with a set of input characters that can occur after state s.
- For now, the transitions must be DETERMINISTIC.
- Each transition diagram has a single START state and a set of TERMINAL STATES.
- The label OTHER on an edge indicates all possible inputs not handled by the other transitions.
- Usually, when we recognize OTHER, we need to put it back in the source stream since it is part of the next token. This action is denoted with a * next to the corresponding state.
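A transition diagram can be coded by hand as a switch on the current state. The sketch below does this for relop -> < | <= | = | <> | > | >=. The state numbering, the enum names, and the function scan_relop are assumptions made for illustration, since the diagram itself is not reproduced here.

```c
#include <stddef.h>

enum relop { NOT_RELOP, LT, LE, EQ, NE, GT, GE };

/* Hand-coded transition diagram for relational operators.
   Scans from src[*pos]; on success advances *pos past the lexeme,
   otherwise leaves *pos unchanged and returns NOT_RELOP. */
enum relop scan_relop(const char *src, size_t *pos)
{
    size_t p = *pos;
    int state = 0;                          /* start state */

    for (;;) {
        char c = src[p++];                  /* consume one input character */
        switch (state) {
        case 0:
            if (c == '<')      state = 1;
            else if (c == '>') state = 2;
            else if (c == '=') { *pos = p; return EQ; }
            else               return NOT_RELOP;
            break;
        case 1:                             /* seen '<' */
            if (c == '=') { *pos = p; return LE; }
            if (c == '>') { *pos = p; return NE; }
            *pos = p - 1;                   /* OTHER: put the character back */
            return LT;
        case 2:                             /* seen '>' */
            if (c == '=') { *pos = p; return GE; }
            *pos = p - 1;                   /* OTHER: put the character back */
            return GT;
        }
    }
}
```

The two `p - 1` assignments are the retraction step that the * marks in the diagram: the character that ended the token is left in the input for the next call.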
26Automated lexical analyzer generation
- Next time we discuss Lex and how it does its job
- Given a set of regular expressions, produce C
code to recognize the tokens.
27Lexical Analysis
28Lexical Analysis Example
29Lexical Analysis With Lex
30Lexical analysis with Lex
31Lex source program format
- The Lex program has three sections, separated by %%:
  declarations
  %%
  translation rules
  %%
  auxiliary code
32Declarations section
- Code between %{ and %} is inserted directly into lex.yy.c. It should contain
  - Manifest constants (#define for each token)
  - Global variables, function declarations, typedefs
- Outside %{ and %}, REGULAR DEFINITIONS are declared. Examples:
  delim [ \t\n]
  ws {delim}+
  letter [A-Za-z]
- Each definition is a name followed by a pattern. Declared names can be used in later patterns, if surrounded by { }.
33Translation rules section
- Translation rules take the form
  p1 { action1 }
  p2 { action2 }
  ...
  pn { actionn }
- Where pi is a regular expression and actioni is a C program fragment to be executed whenever pi is recognized in the input stream.
- In regular expressions, references to regular definitions must be enclosed in { } to distinguish them from the corresponding character sequences.
34Auxiliary procedures
- Arbitrary C code can be placed in this section, e.g. functions to manipulate the symbol table.
- See the complete example lex specification attached.
35Special characters
- Some characters have special meaning to Lex.
  - . in a RE stands for ANY character (in Lex, any character except newline)
  - * stands for Kleene closure
  - + stands for positive closure
  - ? stands for 0-or-1 instance of
  - - produces a character range (e.g. in [A-Z])
- When you want to use these characters literally in a RE, they must be escaped
  - e.g. in the RE {digit}+(\.{digit}+)? the . is escaped with \
36Lex interface to yacc
- The yacc parser calls a function yylex() produced by lex.
- yylex() returns the next token it finds in the input stream.
- yacc expects the token's attribute, if any, to be returned via the global variable yylval.
- The declaration of yylval is up to you (the compiler writer). In our example, we use a union, since we have a few different kinds of attributes.
37Lookahead in Lex
- Sometimes, we don't know until looking ahead several characters what the next token is. Recognition of the DO keyword in Fortran is a famous example.
  - DO5I=1.25 assigns the value 1.25 to DO5I
  - DO5I=1,25 is a DO loop
- Lex handles long-term lookahead with r1/r2, which matches r1 only if it is followed by r2:
  DO/({letter}|{digit})*=({letter}|{digit})*,
  Recognize keyword DO (if it's followed by letters and digits, =, more letters and digits, followed by a ,)
38Finite Automata for Lexical Analysis
39Automatic lexical analyzer generation
- How do Lex and similar tools do their job?
- Lex translates regular expressions into transition diagrams.
- Then it translates the transition diagrams into C code to recognize tokens in the input stream.
- There are many possible algorithms.
- The simplest algorithm is RE -> NFA -> DFA -> C code.
40Finite automata (FAs) and regular languages
- A RECOGNIZER takes language L and string x as input, and responds YES if x ∈ L, or NO otherwise.
- The finite automaton (FA) is one class of recognizer.
- A FA is DETERMINISTIC if there is only one possible transition for each <state,input> pair.
- A FA is NONDETERMINISTIC if there is more than one possible transition for some <state,input> pair.
- BUT both DFAs and NFAs recognize the same class of languages: REGULAR languages, or the class of languages that can be written as regular expressions.
41NFAs
- A NFA is a 5-tuple < S, Σ, move, s0, F >
  - S is the set of STATES in the automaton.
  - Σ is the INPUT CHARACTER SET.
  - move( s, c ) = S' is the TRANSITION FUNCTION, specifying which states S' ⊆ S the automaton can move to on seeing input c while in state s.
  - s0 is the START STATE.
  - F is the set of FINAL, or ACCEPTING STATES.
42NFA example
The NFA (diagram not shown) has a move() function (table not shown)
- and recognizes the language L = (a|b)*abb
- (the set of all strings of a's and b's ending with abb)
43The language defined by a NFA
- An NFA ACCEPTS string x iff there exists a path from s0 to an accepting state, such that the edge labels along the path spell out x.
- The LANGUAGE DEFINED BY a NFA N, written L(N), is the set of strings it accepts.
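The acceptance condition can be checked without backtracking by tracking every state the NFA might be in. The sketch below assumes the usual four-state NFA for (a|b)*abb with no ε-transitions (0 -a-> {0,1}, 0 -b-> {0}, 1 -b-> {2}, 2 -b-> {3}, state 3 accepting). That machine is an assumption, since the slide's diagram is not reproduced here.

```c
#include <stdbool.h>

/* State sets are bitmasks: bit i set means "the NFA may be in state i". */
typedef unsigned set_t;

/* move(s, c) for the assumed 4-state NFA for (a|b)*abb. */
static set_t nfa_move(int s, char c)
{
    switch (s) {
    case 0:  return c == 'a' ? 0x3u : c == 'b' ? 0x1u : 0u;
    case 1:  return c == 'b' ? 0x4u : 0u;
    case 2:  return c == 'b' ? 0x8u : 0u;
    default: return 0u;                     /* state 3 has no outgoing edges */
    }
}

/* Accept x iff some path from state 0 to state 3 spells out x. */
bool nfa_accepts(const char *x)
{
    set_t current = 1u << 0;                /* { s0 } */
    for (; *x != '\0'; x++) {
        set_t next = 0;
        for (int s = 0; s < 4; s++)
            if (current & (1u << s))
                next |= nfa_move(s, *x);    /* follow every possible edge */
        current = next;
    }
    return (current & (1u << 3)) != 0;      /* did any path reach an accepting state? */
}
```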
44Another NFA example
45Deterministic FAs (DFAs)
- The DFA is a special case of the NFA in which
  - No state has an ε-transition
  - No state has more than one edge leaving it for the same input character.
- The benefit of DFAs is that they are simple to simulate: there is only one choice for the machine's state after each input symbol.
46Algorithm to simulate a DFA
- Inputs: string x terminated by EOF; DFA D = < S, Σ, move, s0, F >
- Outputs: YES if D accepts x; NO otherwise
- Method:
  s = s0
  c = nextchar
  while ( c != EOF ) {
      s = move( s, c )
      c = nextchar
  }
  if ( s ∈ F ) return YES
  else return NO
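The method above maps onto a table-driven loop. This sketch assumes a four-state DFA for (a|b)*abb with state 3 accepting. The table is a standard one for this language rather than a copy of the slide's diagram, and the end of the C string stands in for EOF.

```c
#include <stdbool.h>

/* Assumed transition table for a DFA accepting (a|b)*abb:
   rows are states 0..3, columns are inputs 'a' and 'b'; state 3 accepts. */
static const int dfa_move[4][2] = {
    { 1, 0 },
    { 1, 2 },
    { 1, 3 },
    { 1, 0 },
};

bool dfa_accepts(const char *x)
{
    int s = 0;                              /* s = s0 */
    for (; *x != '\0'; x++) {               /* '\0' plays the role of EOF */
        if (*x != 'a' && *x != 'b')
            return false;                   /* symbol outside the alphabet */
        s = dfa_move[s][*x == 'b'];         /* s = move(s, c) */
    }
    return s == 3;                          /* is s in F? */
}
```

Each input character costs one table lookup, which is why generated lexers prefer DFAs.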
47DFA example
- This DFA accepts L = (a|b)*abb
48RE -> DFA
- Now we know how to simulate DFAs.
- If we can convert our REs into a DFA, we can automatically generate lexical analyzers.
- BUT it is not easy to convert REs directly into a DFA.
- Instead, we will convert our REs to a NFA, then convert the NFA to a DFA.
49Converting a NFA to a DFA
50NFA -> DFA
- NFAs are ambiguous: we don't know what state a NFA is in after observing each input.
- The simplest conversion method is to have the DFA track the SUBSET of states the NFA MIGHT be in.
- We need three functions for the construction:
  - ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone.
  - ε-closure(T): the set of NFA states reachable from some state s ∈ T on ε-transitions alone.
  - move(T,a): the set of NFA states to which there is a transition on input a from some NFA state s ∈ T.
51Subset construction algorithm
- Inputs: a NFA N = < SN, Σ, tranN, n0, FN >
- Outputs: a DFA D = < SD, Σ, tranD, d0, FD >
- Method:
  add a state d0 to SD corresponding to ε-closure(n0)
  while there is an unexpanded state di ∈ SD
      for each input symbol a ∈ Σ
          dj = ε-closure(move(di,a))
          if dj ∉ SD,
              add dj to SD
          tranD( di, a ) = dj
  FD is the set of states in SD that contain at least one state of FN
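A compact way to try the subset construction is to store each set of NFA states as a bitmask. The sketch below assumes an 11-state NFA with ε-transitions for (a|b)*abb (states 0..10, state 10 accepting), as commonly produced by the RE -> NFA construction. The tables, array names, and the MAX_DFA limit are illustrative assumptions.

```c
#include <stdint.h>

#define N_NFA   11                /* NFA states 0..10 */
#define N_SYM   2                 /* alphabet { a, b }: column 0 = a, column 1 = b */
#define MAX_DFA 64

typedef uint32_t set_t;           /* bit i set <=> NFA state i is in the set */
#define BIT(i) ((set_t)1 << (i))

/* Assumed NFA for (a|b)*abb: eps[s] holds the e-successors of s,
   delta[s][a] the successors of s on symbol a. State 10 accepts. */
static const set_t eps[N_NFA] = {
    [0] = BIT(1) | BIT(7), [1] = BIT(2) | BIT(4),
    [3] = BIT(6), [5] = BIT(6), [6] = BIT(1) | BIT(7),
};
static const set_t delta[N_NFA][N_SYM] = {
    [2] = { BIT(3), 0 }, [4] = { 0, BIT(5) },
    [7] = { BIT(8), 0 }, [8] = { 0, BIT(9) }, [9] = { 0, BIT(10) },
};

/* e-closure(T): keep adding e-successors until nothing changes. */
static set_t eps_closure(set_t T)
{
    set_t prev;
    do {
        prev = T;
        for (int s = 0; s < N_NFA; s++)
            if (T & BIT(s))
                T |= eps[s];
    } while (T != prev);
    return T;
}

/* move(T, a): states reachable from some s in T on input symbol a. */
static set_t move_set(set_t T, int a)
{
    set_t out = 0;
    for (int s = 0; s < N_NFA; s++)
        if (T & BIT(s))
            out |= delta[s][a];
    return out;
}

set_t dstates[MAX_DFA];           /* SD: each DFA state is a set of NFA states */
int   dtran[MAX_DFA][N_SYM];      /* tranD */
int   n_dstates;

/* Fills dstates/dtran; returns the number of DFA states, or -1 on overflow. */
int subset_construction(void)
{
    n_dstates = 0;
    dstates[n_dstates++] = eps_closure(BIT(0));        /* d0 */
    for (int i = 0; i < n_dstates; i++) {              /* i = next unexpanded state */
        for (int a = 0; a < N_SYM; a++) {
            set_t dj = eps_closure(move_set(dstates[i], a));
            int j = 0;
            while (j < n_dstates && dstates[j] != dj)
                j++;                                   /* is dj already in SD? */
            if (j == n_dstates) {
                if (n_dstates == MAX_DFA)
                    return -1;
                dstates[n_dstates++] = dj;             /* no: add it */
            }
            dtran[i][a] = j;
        }
    }
    return n_dstates;
}
```

A DFA state i is accepting when `dstates[i] & BIT(10)` is nonzero, that is, when the set contains the NFA's accepting state.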
52Examples convert these NFAs
a)
b)
53Converting a RE to a NFA
54RE -> NFA
- The construction is bottom up.
- Construct NFAs to recognize ε and each element a ∈ Σ.
- Recursively expand those NFAs for alternation, concatenation, and Kleene closure.
- Every step introduces at most two additional NFA states.
- Therefore the NFA is at most twice as large as the regular expression.
55RE -> NFA algorithm
- Inputs: A RE r over alphabet Σ
- Outputs: A NFA N accepting L(r)
- Method: Parse r.
  If r = ε, then N is (diagram not shown)
  If r = a ∈ Σ, then N is (diagram not shown)
  If r = s | t, construct N(s) for s and N(t) for t, then N is (diagram not shown)
56RE -> NFA algorithm
  If r = st, construct N(s) for s and N(t) for t, then N is (diagram not shown)
  If r = s*, construct N(s) for s, then N is (diagram not shown)
  If r = ( s ), construct N(s), then let N be N(s).
57Example
- Use the NFA construction algorithm to build a NFA for r = (a|b)*abb
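The bottom-up construction can be sketched in C by building small NFA fragments and gluing them together. This is an illustration under assumptions, not the slides' exact diagrams. Concatenation here links the two fragments with an ε-edge instead of merging states, so the state count may differ from a hand-drawn answer. All names are hypothetical, and there is no check against exceeding MAX_STATES.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_STATES 64             /* sketch: no overflow check */
enum { EPS = -1 };

/* Each state has at most two outgoing edges sharing one label
   (a character code, or EPS). -1 marks a missing edge. */
struct state { int label, out1, out2; };
struct frag  { int start, accept; };        /* an NFA N(r) under construction */

static struct state nfa[MAX_STATES];
static int n_states;

static int new_state(void)
{
    nfa[n_states] = (struct state){ EPS, -1, -1 };
    return n_states++;
}

static void edge(int from, int label, int to)
{
    nfa[from].label = label;
    if (nfa[from].out1 < 0) nfa[from].out1 = to;
    else                    nfa[from].out2 = to;
}

static struct frag new_frag(void)
{
    struct frag f;
    f.start  = new_state();
    f.accept = new_state();
    return f;
}

struct frag nfa_sym(char a)                 /* r = a */
{
    struct frag f = new_frag();
    edge(f.start, (unsigned char)a, f.accept);
    return f;
}

struct frag nfa_alt(struct frag s, struct frag t)   /* r = s | t */
{
    struct frag f = new_frag();
    edge(f.start, EPS, s.start);
    edge(f.start, EPS, t.start);
    edge(s.accept, EPS, f.accept);
    edge(t.accept, EPS, f.accept);
    return f;
}

struct frag nfa_cat(struct frag s, struct frag t)   /* r = st (linked by an e-edge) */
{
    edge(s.accept, EPS, t.start);
    return (struct frag){ s.start, t.accept };
}

struct frag nfa_star(struct frag s)         /* r = s* */
{
    struct frag f = new_frag();
    edge(f.start, EPS, s.start);
    edge(f.start, EPS, f.accept);           /* zero repetitions */
    edge(s.accept, EPS, s.start);           /* repeat */
    edge(s.accept, EPS, f.accept);
    return f;
}

static uint64_t closure(uint64_t T)         /* e-closure over bitmask sets */
{
    uint64_t prev;
    do {
        prev = T;
        for (int s = 0; s < n_states; s++)
            if ((T >> s & 1) && nfa[s].label == EPS) {
                if (nfa[s].out1 >= 0) T |= (uint64_t)1 << nfa[s].out1;
                if (nfa[s].out2 >= 0) T |= (uint64_t)1 << nfa[s].out2;
            }
    } while (T != prev);
    return T;
}

bool nfa_match(struct frag n, const char *x)
{
    uint64_t cur = closure((uint64_t)1 << n.start);
    for (; *x != '\0'; x++) {
        uint64_t next = 0;
        for (int s = 0; s < n_states; s++)
            if ((cur >> s & 1) && nfa[s].label == (unsigned char)*x)
                next |= (uint64_t)1 << nfa[s].out1;
        cur = closure(next);
    }
    return cur >> n.accept & 1;
}

struct frag build_example(void)             /* r = (a|b)*abb */
{
    n_states = 0;
    struct frag r = nfa_star(nfa_alt(nfa_sym('a'), nfa_sym('b')));
    r = nfa_cat(r, nfa_sym('a'));
    r = nfa_cat(r, nfa_sym('b'));
    return nfa_cat(r, nfa_sym('b'));
}
```

Every constructor allocates at most two new states, which is the bound the construction relies on.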