Title: Lexical Analysis
1Lexical Analysis
- Compiler
- Baojian Hua
- bjhua_at_ustc.edu.cn
2Compiler
source program -> compiler -> target program
3Front and Back Ends
source program -> front end -> IR -> back end -> target program
4Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR
5Lexical Analyzer
- The lexical analyzer translates the source program into a stream of lexical tokens
- Source program
- stream of characters
- vary from language to language (ASCII or Unicode, or ...)
- Lexical token
- compiler internal data structure that represents the occurrence of a terminal symbol
- vary from compiler to compiler
6Conceptually
character sequence -> lexical analyzer -> token sequence
7Example
- Recall the min-ML language in code3
- prog -> decs
- decs -> dec; decs
-       |
- dec  -> val id = exp
-       | val _ = printInt exp
- exp  -> id
-       | num
-       | exp + exp
-       | true
-       | false
-       | if (exp) then exp else exp
-       | (exp)
8Example
val x = 3; val y = 4; val z = if (2) then
(x) else y; val _ = printInt z;
-> lexical analysis ->
VAL IDENT(x) ASSIGN INT(3) SEMICOLON VAL
IDENT(y) ASSIGN INT(4) SEMICOLON VAL IDENT(z)
ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN
IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON VAL
UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON EOF
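The token stream above can be reproduced with a few lines of code. What follows is a minimal sketch in Python rather than the course's own lexer. It assumes that ASSIGN is written `=` and SEMICOLON is written `;` (inferred from the token names), and the names `tokenize`, `KEYWORDS`, and `SYMBOLS` are made up for illustration.

```python
import re

# Assumed concrete syntax: "=" for ASSIGN and ";" for SEMICOLON.
KEYWORDS = {"val": "VAL", "if": "IF", "then": "THEN", "else": "ELSE",
            "printInt": "PRINTINT", "true": "TRUE", "false": "FALSE"}
SYMBOLS = {"=": "ASSIGN", ";": "SEMICOLON", "(": "LPAREN",
           ")": "RPAREN", "_": "UNDERSCORE"}

# One token per match: optional whitespace, then a word, a number, or a symbol.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<word>[A-Za-z][A-Za-z0-9_]*)|(?P<num>[0-9]+)|(?P<sym>[=;()_]))")

def tokenize(src):
    """Turn a character sequence into a list of token names, ending with EOF."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise ValueError(f"illegal character at position {pos}")
        if m.group("word"):
            word = m.group("word")
            # Keywords are looked up first; anything else is an identifier.
            tokens.append(KEYWORDS.get(word, f"IDENT({word})"))
        elif m.group("num"):
            tokens.append(f"INT({m.group('num')})")
        else:
            tokens.append(SYMBOLS[m.group("sym")])
        pos = m.end()
    tokens.append("EOF")
    return tokens
```

Calling `tokenize` on the example program yields the same sequence of token names as the slide. Writing even this small lexer by hand already involves several ad hoc decisions, which is the motivation for generating lexers from a declarative specification.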
9Lexer Implementation
- Options
- Write a lexer by hand from scratch
- boring, error-prone, and too much work
- see the Dragon Book, Sec. 3.4
- Automatic lexer generator
- Quick and easy
10Lexer Implementation
declarative specification -> lexical analyzer
11Regular Expressions
- How to specify a lexer?
- Develop another language
- Regular expressions
- What's a lexer-generator?
- Another compiler
12Basic Definitions
- Alphabet: the char set (say ASCII or Unicode)
- String: a finite sequence of chars from the alphabet
- Language: a set of strings
- finite or infinite
- say the C language
13Regular Expression (RE)
- Construction by induction
- each c \in alphabet
- a denotes {a}
- empty \eps
- \eps denotes the set holding only the empty string
- for M and N, then M|N
- (a|b) denotes {a, b}
- for M and N, then MN
- (a|b)(c|d) denotes {ac, ad, bc, bd}
- for M, then M* (Kleene closure)
- (a|b)* denotes {\eps, a, aa, b, ab, abb, baa, ...}
14Regular Expression
e -> \eps | c | e | e | e e | e*
15Example
- C's identifier
- starts with a letter (_ counts as a letter)
- followed by zero or more of letter or digit
- (...) (...)
- (_|a|b|...|z|A|B|...|Z) (...)
- (_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)
- (_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)*
- It's really error-prone and tedious
16Syntax Sugar
- More syntax sugar
- [a-z]: a|b|...|z
- e+: one or more of e
- e?: zero or one of e
- "a*": a* itself (a quoted string stands for its literal characters)
- e{i, j}: at least i and at most j repetitions of e
- .: any char except \n
- All these can be translated into core RE
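To make the translation into core RE concrete, the sketch below pairs a few sugared patterns with hand-written core equivalents and checks that each pair accepts the same strings. It uses Python's `re` module as a stand-in for the lexer's RE notation, which is an assumption since the surface syntax differs between tools. The helper `same_language` and the list `PAIRS` are made up for this example, and the check is a bounded test over short strings, not a proof.

```python
import re
from itertools import product

# (sugared form, core form using only |, concatenation, * and the empty RE)
PAIRS = [
    (r"[a-c]",  r"(a|b|c)"),    # character range
    (r"(ab)+",  r"(ab)(ab)*"),  # one or more
    (r"(ab)?",  r"((ab)|)"),    # zero or one
    (r"a{2,3}", r"aa(a|)"),     # bounded repetition
]

def same_language(sugar, core, alphabet="abc", max_len=6):
    """Compare two patterns on every string over `alphabet` up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(sugar, s)) != bool(re.fullmatch(core, s)):
                return False
    return True
```

For instance, `all(same_language(s, c) for s, c in PAIRS)` holds, while `same_language("a+", "a*")` does not, because only `a*` accepts the empty string.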
17Example Revisited
- C's identifier
- starts with a letter (_ counts as a letter)
- followed by zero or more of letter or digit
- (...) (...)
- (_|a|b|...|z|A|B|...|Z) (...)
- (_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)*
- [_a-zA-Z][_a-zA-Z0-9]*
- What about the keyword if?
18Ambiguous Rule
- A single RE is not ambiguous
- But in a language, there may be many REs?
- [_a-zA-Z][_a-zA-Z0-9]*
- if
- So, for a string, which RE to match?
19Ambiguous Rule
- Two conventions
- Longest match: The regular expression that matches the longest string takes precedence.
- Rule priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
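The two conventions fit in a few lines of code. This is a minimal sketch, assuming just two rules (the keyword `if` and C-style identifiers); the names `RULES` and `next_token` are hypothetical. Each rule's own match is found with Python's `re`, which is enough for these two patterns.

```python
import re

# Rules in priority order: on a tie, the earlier rule wins.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
]

def next_token(text, pos=0):
    """Return (rule name, lexeme) chosen by longest match, then rule priority."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best, so when
        # two rules match the same string the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

With these rules, `next_token("if")` picks IF because both rules match two characters and IF is listed first, while `next_token("iffy")` picks ID because the identifier rule matches a longer string.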
20Lexer Generator History
- Lexical analysis was once a performance bottleneck
- certainly not true today!
- As a result, early research investigated methods for efficient lexical analysis
- While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use
21History: A long-standing goal
- In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (a.k.a. compiler-compiler)
declarative compiler specification -> compiler
22History: Unix and C
- In the mid-1960s at Bell Labs, Ritchie and others were developing Unix
- A key part of this project was the development of C and a compiler for it
- Johnson, in 1968, proposed the use of finite state machines for lexical analysis and developed Lex [CACM 11(12), 1968]
- read the accompanying paper on course page
- Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers
23The Lex tool
- The original Lex generated lexers written in C (C in C)
- Today every major language has its own lex tool(s)
- sml-lex, ocamllex, JLex, Clex, ...
- Our topic next
- sml-lex
- concepts and techniques apply to other tools
24SML-Lex Specification
- Lexical specification consists of 3 parts (yet another programming language)
User Declarations (plain SML types, values, functions)
%%
SML-LEX Definitions (RE abbreviations, special stuff)
%%
Rules (association of REs with tokens) (each token will be represented in plain SML)
25User Declarations
- User Declarations
- User can define various values that are available to the action fragments.
- Two values must be defined in this section
- type lexresult
- type of the value returned by each rule action.
- fun eof ()
- called by lexer when end of input stream is reached. (EOF)
26SML-LEX Definitions
- SML-LEX Definitions
- User can define regular expression abbreviations
- Define multiple lexers to work together. Each is given a unique name.
digits = [0-9]+;
letter = [a-zA-Z];
%s lex1 lex2 lex3;
27Rules
- Rules
- A rule consists of a pattern and an action
- Pattern is a regular expression.
- Action is a fragment of ordinary SML code.
- Longest match & rule priority used for disambiguation
- Rules may be prefixed with the list of lexers that are allowed to use this rule.
<lexerList> regularExp => (action);
28Rules
- Rule actions can use any value defined in the User Declarations section, including
- type lexresult
- type of value returned by each rule action
- val eof : unit -> lexresult
- called by lexer when end of input stream reached
- special variables
- yytext: input substring matched by regular expression
- yypos: file position of the beginning of matched string
- continue (): doesn't return a token; recursively calls the lexer
29Example 1
- (* A language called Toy *)
- prog -> word prog
-      ->
- word -> symbol
-      -> number
- symbol -> [_a-zA-Z][_0-9a-zA-Z]*
- number -> [0-9]+
30Example 1
- (* Lexer Toy, see the accompanying code for detail *)
- datatype token = Symbol of string * int
-                | Number of string * int
- exception End
- type lexresult = unit
- fun eof () = raise End
- fun output x = ...
- %%
- letter = [_a-zA-Z];
- digit = [0-9];
- ld = {letter}|{digit};
- symbol = {letter}{ld}*;
- number = {digit}+;
- %%
- <INITIAL>{symbol} =>(output (Symbol(yytext, yypos)));
- <INITIAL>{number} =>(output (Number(yytext, yypos)));
31Example 2
- (* Expression Language
-  * C-style comment, i.e. /* */
-  *)
- prog -> stms
- stms -> stm; stms
-      ->
- stm -> id = e
-     -> print e
- e -> id
-   -> num
-   -> e bop e
-   -> (e)
- bop -> + | - | * | /
32Sample Program
33Example 2
- (* All terminals *)
- prog -> stms
- stms -> stm; stms
-      ->
- stm -> id = e
-     -> print e
- e -> id
-   -> num
-   -> e bop e
-   -> (e)
- bop -> + | - | * | /
34Example 2 in Lex
- (* Expression language, see the accompanying code
-  * for detail.
-  * Part 1: user code
-  *)
- datatype token
-   = Id of string * int
-   | Number of string * int
-   | Print of string * int
-   | Plus of string * int
-   (* all other stuffs *)
- exception End
- type lexresult = unit
- fun eof () = raise End
- fun output x = ...
35Example 2 in Lex, cont
- (* Expression language, see the accompanying code
-  * for detail.
-  * Part 2: lex definition
-  *)
- %%
- letter = [_a-zA-Z];
- digit = [0-9];
- ld = {letter}|{digit};
- sym = {letter}{ld}*;
- num = {digit}+;
- ws = [\ \t];
- nl = [\n];
36Example 2 in Lex, cont
- (* Expression language, see the accompanying code
-  * for detail.
-  * Part 3: rules
-  *)
- %%
- <INITIAL>{ws} =>(continue ());
- <INITIAL>{nl} =>(continue ());
- <INITIAL>"+" =>(output (Plus (yytext, yypos)));
- <INITIAL>"-" =>(output (Minus (yytext, yypos)));
- <INITIAL>"*" =>(output (Times (yytext, yypos)));
- <INITIAL>"/" =>(output (Divide (yytext, yypos)));
- <INITIAL>"(" =>(output (Lparen (yytext, yypos)));
- <INITIAL>")" =>(output (Rparen (yytext, yypos)));
- <INITIAL>"=" =>(output (Assign (yytext, yypos)));
- <INITIAL>";" =>(output (Semi (yytext, yypos)));
37Example 2 in Lex, cont
- (* Expression language, see the accompanying code
-  * for detail.
-  * Part 3: rules, cont
-  *)
- <INITIAL>"print" =>(output (Print(yytext, yypos)));
- <INITIAL>{sym} =>(output (Id (yytext, yypos)));
- <INITIAL>{num} =>(output (Number(yytext, yypos)));
- <INITIAL>"/*" => (YYBEGIN COMMENT; continue ());
- <COMMENT>"*/" => (YYBEGIN INITIAL; continue ());
- <COMMENT>{nl} => (continue ());
- <COMMENT>. => (continue ());
- <INITIAL>. => (error ());
38Lex Implementation
- Lex accepts regular expressions (along with others)
- So SML-lex is a compiler from RE to a lexer
- Internal
- RE -> NFA -> DFA -> table-driven algo
39Finite-state Automata (FA)
M = (Σ, S, q0, F, δ)
Σ: input alphabet
S: state set
q0: initial state
F: final states
δ: transition function
40Transition functions
- DFA
- δ: S × Σ → S
- NFA
- δ: S × Σ → ℘(S)
41DFA example
- Which strings of a's and b's are accepted?
- Transition function
- (q0,a)→q1, (q0,b)→q0,
- (q1,a)→q2, (q1,b)→q1,
- (q2,a)→q2, (q2,b)→q2
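Running a DFA is a single loop over the input with one table lookup per character. The sketch below encodes the transition function listed above as a Python dictionary. The set of final states is not recoverable from the text, so the code assumes that q2 is the only final state; `run_dfa` and `accepts` are illustrative names.

```python
# Transition function from the slide: (state, char) -> next state.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}

def run_dfa(text, start="q0"):
    """Follow one transition per input character and return the last state."""
    state = start
    for ch in text:
        state = DELTA[(state, ch)]  # KeyError for characters outside {a, b}
    return state

def accepts(text, finals=frozenset({"q2"})):
    """Assumption: q2 is the only final state (the figure is not in the text)."""
    return run_dfa(text) in finals
```

Under that assumption the automaton accepts exactly the strings that contain at least two a's: each a moves one state to the right until q2, and b never changes the state.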
42NFA example
- Transition function
- (q0,a)→{q0,q1}, (q0,b)→{q1}, (q1,a)→∅, (q1,b)→{q0,q1}
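An NFA can be simulated by tracking the set of states it might be in after each character. The sketch below does this for the transition function above. The final states of this example are not given in the text, so the function only reports the reachable set; `reachable` is a hypothetical name.

```python
# Transition function from the slide: (state, char) -> set of next states.
DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q1"},
    ("q1", "a"): set(),
    ("q1", "b"): {"q0", "q1"},
}

def reachable(text, start="q0"):
    """Return every state the NFA may be in after reading `text`."""
    current = {start}
    for ch in text:
        nxt = set()
        for q in current:
            # Union the targets of all states we might currently be in.
            nxt |= DELTA.get((q, ch), set())
        current = nxt
    return current
```

A string is accepted when the returned set contains at least one final state. An empty result, as for the input `ba`, means every possible run got stuck.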
43RE -> NFA: Thompson algorithm
- Break RE down to atoms
- construct small NFAs directly for atoms
- inductively construct larger NFAs from small NFAs
- Easy to implement
- a small recursion algorithm
44RE -> NFA: Thompson algorithm
- e -> ε
-   -> c
-   -> e1 e2
-   -> e1 | e2
-   -> e1*
[Figure: NFA fragments for ε (a single ε edge), c (a single c edge), and e1 e2 (e1 joined to e2 by an ε edge)]
45RE -> NFA: Thompson algorithm
- e -> ε
-   -> c
-   -> e1 e2
-   -> e1 | e2
-   -> e1*
[Figure: NFA fragments for e1 | e2 (ε edges fan out to e1 and e2 and rejoin) and e1* (ε edges loop around e1)]
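The construction is a short recursion over the five RE forms. The sketch below is one possible rendering in Python, not the course's implementation. The tuple encoding of REs (`("eps",)`, `("chr", c)`, `("cat", e1, e2)`, `("alt", e1, e2)`, `("star", e1)`) and the helper names are assumptions, and `matches` simulates the resulting NFA only so that the construction can be exercised.

```python
from itertools import count

EPS = None  # label used for ε edges

def thompson(e, edges, fresh):
    """Add an NFA fragment for `e` to `edges`; return its (start, accept) states."""
    kind = e[0]
    if kind in ("eps", "chr"):                 # atoms: one edge
        s, a = next(fresh), next(fresh)
        edges.append((s, EPS if kind == "eps" else e[1], a))
        return s, a
    if kind == "cat":                          # e1 e2: link e1's accept to e2's start
        s1, a1 = thompson(e[1], edges, fresh)
        s2, a2 = thompson(e[2], edges, fresh)
        edges.append((a1, EPS, s2))
        return s1, a2
    if kind == "alt":                          # e1 | e2: fork and rejoin with ε edges
        s, a = next(fresh), next(fresh)
        for sub in e[1:]:
            si, ai = thompson(sub, edges, fresh)
            edges.extend([(s, EPS, si), (ai, EPS, a)])
        return s, a
    if kind == "star":                         # e1*: skip edge plus loop-back edge
        s, a = next(fresh), next(fresh)
        s1, a1 = thompson(e[1], edges, fresh)
        edges.extend([(s, EPS, s1), (a1, EPS, s1), (a1, EPS, a), (s, EPS, a)])
        return s, a
    raise ValueError(f"unknown regex node: {kind}")

def closure(states, edges):
    """All states reachable from `states` through ε edges only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for p, label, r in edges:
            if p == q and label is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def matches(e, text):
    """Build the NFA for `e` and simulate it on `text`."""
    edges = []
    start, accept = thompson(e, edges, count())
    current = closure({start}, edges)
    for ch in text:
        moved = {r for p, label, r in edges if p in current and label == ch}
        current = closure(moved, edges)
    return accept in current
```

For example, `("star", ("alt", ("chr", "a"), ("chr", "b")))` encodes (a|b)*, and `matches` accepts `abba` but rejects `abc` for it. Every fragment has exactly one start and one accept state, which is what keeps the inductive cases this small.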
46Example
- %%
- letter = [_a-zA-Z];
- digit = [0-9];
- id = {letter}({letter}|{digit})*;
- %%
- <INITIAL>"if" => (IF (yytext, yypos));
- <INITIAL>{id} => (Id (yytext, yypos));
- (* Equivalent to
-  *   "if" | {id}
-  *)
47Example
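- <INITIAL>"if" => (IF (yytext, yypos));
- <INITIAL>{id} => (Id (yytext, yypos));
[Figure: the combined NFA, with ε edges from a common start state into the NFA for "if" (edges i, then f) and the NFA for {id}]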
48NFA -> DFA: Subset construction algorithm
- (* subset construction: workList algorithm *)
- q0 <- ε-closure (n0)
- Q <- {q0}
- workList <- {q0}
- while (workList != \phi)
-   remove q from workList
-   foreach (character c)
-     t <- ε-closure (move (q, c))
-     D[q, c] <- t
-     if (t \not\in Q)
-       add t to Q and workList
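The worklist algorithm translates almost line by line into code. The sketch below runs it on a small NFA for the keyword `if` versus identifiers. The state numbering 0 to 8 and the edge list are assumptions, and the alphabet is collapsed into four symbol classes: `i`, `f`, `l` for any other letter or underscore, and `d` for a digit. Unlike the pseudocode, the sketch does not record the empty set as a DFA state; a missing entry in `D` plays the role of the error state.

```python
from collections import deque

EPS = ""  # label for ε edges
ALPHABET = "ifld"  # i, f, l = other letter or "_", d = digit (assumed classes)

# Assumed NFA: state -> list of (label, target).
NFA = {
    0: [(EPS, 1), (EPS, 5)],
    1: [("i", 2)],
    2: [(EPS, 3)],
    3: [("f", 4)],                       # state 4 accepts IF
    5: [(c, 6) for c in "ifl"],
    6: [(EPS, 7)],
    7: [(c, 7) for c in "ifld"] + [(EPS, 8)],  # state 8 accepts Id
}

def eps_closure(states):
    """States reachable from `states` using ε edges only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for label, r in NFA.get(q, []):
            if label == EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def move(states, c):
    """States reachable from `states` by one edge labelled c."""
    return {r for q in states for label, r in NFA.get(q, []) if label == c}

def subset_construction(n0):
    """Return (Q, D): the DFA states (sets of NFA states) and the transitions."""
    q0 = eps_closure({n0})
    Q, D = [q0], {}
    work = deque([q0])
    while work:
        q = work.popleft()
        for c in ALPHABET:
            t = eps_closure(move(q, c))
            if not t:
                continue          # no move on c: leave D[q, c] undefined
            D[q, c] = t
            if t not in Q:
                Q.append(t)
                work.append(t)
    return Q, D
```

Calling `subset_construction(0)` produces five DFA states for this NFA: {0, 1, 5}, {2, 3, 6, 7, 8}, {6, 7, 8}, {4, 7, 8}, and {7, 8}.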
49NFA -> DFA: ε-closure
- (* ε-closure: fixpoint algorithm *)
- (* Dragon Fig 3.33 gives a DFS-like algorithm.
-  * Here we give a recursive version. (Simpler)
-  *)
- X <- \phi
- fun eps (t) =
-   X <- X ∪ {t}
-   foreach (s \in one-eps(t))
-     if (s \not\in X)
-       then eps (s)
50NFA -> DFA: ε-closure
- (* ε-closure: fixpoint algorithm *)
- (* Dragon Fig 3.33 gives a DFS-like algorithm.
-  * Here we give a recursive version. (Simpler)
-  *)
- fun ε-closure (T) =
-   X <- T
-   foreach (t \in T)
-     X <- X ∪ eps(t)
51NFA -> DFA: ε-closure
- (* ε-closure: fixpoint algorithm *)
- (* A BFS-like algorithm. *)
- X <- \phi
- fun ε-closure (T) =
-   Q <- T
-   X <- T
-   while (Q not empty)
-     q <- deQueue (Q)
-     foreach (s \in one-eps(q))
-       if (s \not\in X)
-         enQueue (Q, s)
-         X <- X ∪ {s}
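The BFS-like version maps directly onto a queue and a visited set. A minimal sketch follows; the table `ONE_EPS` (the single-step ε successors of each state) describes a made-up NFA and is an assumption, as is the name `e_closure`.

```python
from collections import deque

# ONE_EPS[q] lists the states reachable from q by a single ε edge
# (hypothetical NFA, chosen only for illustration).
ONE_EPS = {0: [1, 5], 2: [3], 6: [7], 7: [8]}

def e_closure(T):
    """Return all states reachable from the set T through ε edges."""
    X = set(T)       # result so far; also serves as the visited set
    Q = deque(T)     # states whose ε successors are still unexplored
    while Q:
        q = Q.popleft()
        for s in ONE_EPS.get(q, []):
            if s not in X:
                X.add(s)      # mark before enqueueing so s is queued once
                Q.append(s)
    return X
```

Each state enters the queue at most once, so the loop terminates even when the ε edges form a cycle.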
52Example
- <INITIAL>"if" => (IF (yytext, yypos));
- <INITIAL>{id} => (Id (yytext, yypos));
[Figure: NFA for the two rules with numbered states; edge labels ε, i, f, [_a-zA-Z], [_a-zA-Z0-9]]
53Example
- q0 = {0, 1, 5}   Q = {q0}
- D[q0, i] = {2, 3, 6, 7, 8}   Q ∪= q1
- D[q0, _] = {6, 7, 8}   Q ∪= q2
- D[q1, f] = {4, 7, 8}   Q ∪= q3
[Figure: the NFA for the two rules, and the partial DFA so far: q0 -i-> q1, q0 -_-> q2, q1 -f-> q3]
54Example
- D[q1, _] = {7, 8}   Q ∪= q4
- D[q2, _] = {7, 8}   Q
- D[q3, _] = {7, 8}   Q
- D[q4, _] = {7, 8}   Q
[Figure: the NFA for the two rules, and the DFA extended with _ edges from q1, q2, q3, and q4 into q4]
55Example
- q0 = {0, 1, 5}   q1 = {2, 3, 6, 7, 8}
- q2 = {6, 7, 8}   q3 = {4, 7, 8}   q4 = {7, 8}
[Figure: the NFA for the two rules, and the complete DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
57Table-driven Algorithm
- Conceptually, an FA is a directed graph
- Pragmatically, many different strategies to encode an FA
- Matrix (adjacency matrix)
- sml-lex
- Array of list (adjacency list)
- Hash table
- Jump table (switch statements)
- flex
- Balance between time and space
58Example
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

state\char   i     f     letter-i-f   other
q0           q1    q2    q2           error
q1           q4    q3    q4           error
q2           q4    q4    q4           error
q3           q4    q4    q4           error
q4           q4    q4    q4           error

state    q0   q1   q2   q3   q4
action        Id   Id   IF   Id

[Figure: the DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
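The transition and action tables can drive a scanner directly. The sketch below stores the matrix as nested dictionaries and returns the longest match by remembering the last accepting state it passed through. The `digit` column is an assumption added so that identifiers may contain digits, following the ld edges of the DFA; `char_class` and `scan` are hypothetical names.

```python
# Shared row: every character class leads to q4.
TO_Q4 = {"i": "q4", "f": "q4", "letter": "q4", "digit": "q4"}

# state -> {char class -> next state}; a missing entry means "error".
TABLE = {
    "q0": {"i": "q1", "f": "q2", "letter": "q2"},
    "q1": {**TO_Q4, "f": "q3"},
    "q2": TO_Q4,
    "q3": TO_Q4,
    "q4": TO_Q4,
}
ACTION = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

def char_class(c):
    """Map a character to its column in the table."""
    if c in ("i", "f"):
        return c
    if c == "_" or (c.isascii() and c.isalpha()):
        return "letter"
    if c.isascii() and c.isdigit():
        return "digit"      # assumed column, not shown in the slide's table
    return "other"

def scan(text, pos=0):
    """Longest match starting at pos: return (action, lexeme) or None."""
    state, last, i = "q0", None, pos
    while i < len(text):
        nxt = TABLE[state].get(char_class(text[i]))
        if nxt is None:     # error entry: stop and fall back to `last`
            break
        state, i = nxt, i + 1
        if state in ACTION:
            last = (ACTION[state], text[pos:i])
    return last
```

For example, `scan("if")` reports the IF action while `scan("if1")` reports an identifier, because reading one more character moves the automaton from q3 to q4.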
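59DFA Minimization: Hopcroft's Algorithm
[Figure: the DFA before minimization: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
state    q0   q1   q2   q3   q4
action        Id   Id   IF   Id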
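61DFA Minimization: Hopcroft's Algorithm
[Figure: the minimized DFA, with q2 and q4 merged: q0 -i-> q1, q0 -(letter-i)-> {q2, q4}, q1 -f-> q3, q1 -(ld-f)-> {q2, q4}, q3 -ld-> {q2, q4}, {q2, q4} -ld-> {q2, q4}]
state    q0   q1   {q2, q4}   q3
action        Id   Id         IF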
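The merge of q2 and q4 can be reproduced with a short partition-refinement loop. The sketch below is not Hopcroft's worklist algorithm. It uses the simpler Moore-style refinement, which repeatedly splits blocks until nothing changes and reaches the same minimal partition, only less efficiently. The DFA encoding, including the `digit` column, is an assumption carried over from the table-driven example, and `minimize` is a hypothetical name.

```python
STATES = ["q0", "q1", "q2", "q3", "q4"]
ALPHABET = ["i", "f", "letter", "digit"]
DELTA = {("q0", "i"): "q1", ("q0", "f"): "q2", ("q0", "letter"): "q2",
         ("q1", "f"): "q3"}
# Every remaining move out of q1..q4 goes to q4.
for q in STATES[1:]:
    for c in ALPHABET:
        DELTA.setdefault((q, c), "q4")
ACTION = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

def minimize(states, alphabet, delta, action):
    """Moore-style partition refinement; returns a set of frozenset blocks."""
    # Initial partition: states with the same action share a block
    # (non-accepting states have action None).
    ids = {}
    block = {q: ids.setdefault(action.get(q), len(ids)) for q in states}
    while True:
        ids, new_block = {}, {}
        for q in states:
            # Signature: own block plus the block reached on each symbol
            # (-1 stands for a missing transition, i.e. the error state).
            sig = (block[q],) + tuple(
                block.get(delta.get((q, c)), -1) for c in alphabet)
            new_block[q] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break               # no block was split: partition is stable
        block = new_block
    groups = {}
    for q in states:
        groups.setdefault(block[q], set()).add(q)
    return {frozenset(g) for g in groups.values()}
```

On this DFA the first round separates q1 from q2 and q4, because only q1 can reach the IF state on f, and the second round changes nothing. The result is the four blocks {q0}, {q1}, {q2, q4}, and {q3}.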
62Summary
- A Lexer
- input: stream of characters
- output: stream of tokens
- Writing lexers by hand is boring, so we use a lexer generator: ml-lex
- RE -> NFA -> DFA -> table-driven algo
- Moral: don't underestimate your theory classes!
- great application of cool theory developed in mathematics.
- we'll see more cool apps as the course progresses