Title: Lexical Analysis Part 1
1Lexical Analysis, Part 1
2Lexical Analysis: What's to come
- Programs could be built directly from characters, with parse trees going down to the character level
- That would be machine specific, obfuscate parsing, and be cumbersome
- Lexical analysis is a firewall between program representation and parsing actions
- A prior lexical analysis phase obtains tokens, each consisting of a type (e.g., ID) and a value (the lexeme matched)
- In principle: simple transition diagrams (finite state automata) characterize each of the things that can be recognized
- In practice: a program combines the multiple automata definitions into an efficient state machine
3Lexical Phase
- Simple (non-recursive)
- Efficient (special purpose code)
- Portable (ignore character-set and architecture differences)
- Use JavaCC, lex, flex, etc.
- Used in practice with Bison/Yacc, etc.
4Lexical Processing
- Token: a terminal symbol in a grammar. At the lexical level this is a symbolic constant, and in print it is represented in bold
- Pattern: the set of matching strings. For a keyword it is a constant. For a variable or value it can be represented by a regular expression
- Lexeme: the character sequence matched by an instance of the token
5Lexical Processing
- Token attributes: a pointer to a symbol-table entry, which may include the lexeme, scope information, etc.
- Languages may have special rules (e.g., PL/1 does not have reserved words and Fortran allows spaces in variable names; both are obscure design choices)
6Lexical Analysis sequences
- Expression
- base * base - 0x4 * height * width
- Token sequence
- name:base operator:times name:base operator:minus hexConstant:4 operator:times name:height operator:times name:width
- Lexical phase returns token and value (yylval, yytext, etc.)
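The token sequence above can be reproduced with a small hand-written scanner. The following is a minimal sketch, assuming Python's `re` module; the names `TOKEN_SPEC`, `OPERATOR_NAMES`, and `tokenize` are hypothetical and chosen only for this illustration, and the patterns cover just the token kinds that appear in the example.

```python
import re

# Hypothetical token names and patterns, chosen only for this sketch.
TOKEN_SPEC = [
    ("hexConstant", r"0[xX][0-9a-fA-F]+"),
    ("name",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",    r"[*\-+/]"),
    ("skip",        r"\s+"),  # white space: recognized, then ignored
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))
OPERATOR_NAMES = {"*": "times", "-": "minus", "+": "plus", "/": "divide"}


def tokenize(text):
    """Return a list of (token type, value) pairs for text."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue
        if kind == "operator":
            tokens.append((kind, OPERATOR_NAMES[lexeme]))
        elif kind == "hexConstant":
            tokens.append((kind, int(lexeme, 16)))  # value of the constant
        else:
            tokens.append((kind, lexeme))           # value is the lexeme
    return tokens


for token in tokenize("base * base - 0x4 * height * width"):
    print(token)
```

Each pair plays the role of the token type and its value; in a lex-generated scanner the value would typically travel through `yylval` or `yytext` instead.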
7Tokens
- Token attributes: a pointer to a symbol-table entry, which may include the lexeme, scope information, etc.
- Formal specification of tokens by regular expressions: define alphabet, strings, languages
8Regular Expression Notation
- a : an ordinary letter from our alphabet
- ε : the empty string
- r1 | r2 : choosing from r1 or r2
- r1r2 : concatenation of r1 and r2
- r* : zero or more times (Kleene closure)
- r+ : one or more times
- r? : zero or one occurrence
- [a-zA-Z] : character class (choice)
- . : period stands for any single character except newline
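Most of this notation can be tried out directly. The sketch below assumes Python's `re` module, whose syntax is close to, but not identical with, the slide notation (for instance, ε has no literal form and is written here as the empty pattern). The helper `in_language` and the table `CASES` are hypothetical names for this illustration.

```python
import re


def in_language(pattern, text):
    """True when the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, strings that should match, strings that should not)
CASES = [
    ("a",        ["a"],            ["", "b", "aa"]),   # ordinary letter
    ("",         [""],             ["a"]),             # ε, the empty string
    ("a|b",      ["a", "b"],       ["ab", ""]),        # choice
    ("ab",       ["ab"],           ["a", "ba"]),       # concatenation
    ("a*",       ["", "a", "aaa"], ["b"]),             # zero or more
    ("a+",       ["a", "aaa"],     [""]),              # one or more
    ("a?",       ["", "a"],        ["aa"]),            # zero or one
    ("[a-zA-Z]", ["q", "Q"],       ["7", "qq"]),       # character class
    (".",        ["x", "7"],       ["\n", ""]),        # any char but newline
]

for pattern, accepted, rejected in CASES:
    assert all(in_language(pattern, s) for s in accepted)
    assert not any(in_language(pattern, s) for s in rejected)
```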
9Semantics of Regular Expressions
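- $L(\varepsilon) = \{\varepsilon\}$
- $L(a) = \{a\}$ for all $a \in \Sigma$
- $L(r_1 \mid r_2) = L(r_1) \cup L(r_2)$
- $L(r_1 r_2) = \{\, xy \mid x \in L(r_1),\ y \in L(r_2) \,\}$
- $L(R^*) = \{\varepsilon\} \cup \{\, x \mid x \in L(R) \,\} \cup \{\, x_1 x_2 \mid x_1, x_2 \in L(R) \,\} \cup \dots \cup \{\, x_1 \dots x_n \mid x_1, \dots, x_n \in L(R) \,\} \cup \dots$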
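These definitions can be mirrored with ordinary sets of strings. The following is a rough sketch, not a regular expression engine: languages are Python sets, and because the closure is infinite, the hypothetical helper `star` only enumerates strings up to a length bound chosen by the caller.

```python
def union(l1, l2):
    """L(r1 | r2): every string in either language."""
    return l1 | l2


def concat(l1, l2):
    """L(r1 r2): every x followed by every y."""
    return {x + y for x in l1 for y in l2}


def star(lang, max_len):
    """Strings of L(R*) whose length is at most max_len."""
    result = {""}        # the empty string is always in the closure
    frontier = {""}
    while frontier:
        # Extend each newly found string by one more element of lang.
        frontier = {x + y for x in frontier for y in lang
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result


print(sorted(union({"a"}, {"b"})))
print(sorted(concat({"a"}, {"b", "c"})))
print(sorted(star({"a", "b"}, 2)))
```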
10For Homework
- Suppose Σ is {a, b}
- What is the regular expression for
- All strings beginning and ending in a?
- All strings with an odd number of a's?
- All strings without two consecutive a's?
- All strings with an odd number of b's followed by an even number of a's?
- What's the description for a Java floating point number?
- What's the description of a variable name in Java?
11Why we care about Regular Expressions
For every regular expression, there is a
deterministic finite-state machine that defines
the same language, and vice versa
12Regular Expressions
- An automaton is a good visual aid
- but is not suitable as a specification (its textual description is too clumsy)
- However, regular expressions are a suitable specification
- a compact way to define a language that can be accepted by an automaton.
13RegExp Use and Construction
- Used as the input to a scanner generator like lex or flex or JavaCC
- define each token, and also
- define white-space, comments, etc.
- these do not correspond to tokens, but must be recognized and ignored.
- An NFA can be constructed from a RegExp via Thompson's Construction
14Thompson's Construction
- There are building blocks for each regular expression operator
- More complex RegExps are constructed by composing smaller building blocks
- Assumes that the NFAs at each step of the construction will have a single accepting state
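The building-block idea can be sketched in a few functions, one per operator. This is a minimal illustration under stated assumptions, not the construction as any particular tool implements it: the class `Fragment`, the helper names, the use of `None` as the ε label, and the choice of (1|0)*1 as the input are all made up for this sketch, and concatenation is done by linking the two fragments with an ε edge.

```python
import itertools

EPS = None                  # label used for ε-transitions in this sketch
_ids = itertools.count()    # source of fresh state numbers


class Fragment:
    """An NFA piece with one start state and one accepting state."""

    def __init__(self, start, accept, edges):
        self.start, self.accept = start, accept
        self.edges = edges  # list of (source, label, destination)


def symbol(a):
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])


def concat(m1, m2):
    # Link m1's accepting state to m2's start state.
    link = [(m1.accept, EPS, m2.start)]
    return Fragment(m1.start, m2.accept, m1.edges + m2.edges + link)


def alternate(m1, m2):
    s, f = next(_ids), next(_ids)
    new = [(s, EPS, m1.start), (s, EPS, m2.start),
           (m1.accept, EPS, f), (m2.accept, EPS, f)]
    return Fragment(s, f, m1.edges + m2.edges + new)


def star(m):
    s, f = next(_ids), next(_ids)
    new = [(s, EPS, m.start), (s, EPS, f),            # enter or skip
           (m.accept, EPS, m.start), (m.accept, EPS, f)]  # repeat or leave
    return Fragment(s, f, m.edges + new)


def states(m):
    return {q for (src, _, dst) in m.edges for q in (src, dst)}


# (1|0)*1 composed from the building blocks
nfa = concat(star(alternate(symbol("1"), symbol("0"))), symbol("1"))
print(len(states(nfa)), "states,", len(nfa.edges), "edges")
```

Every helper returns a fragment with exactly one accepting state, which is what lets the next operator attach to it without inspecting its interior.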
15Regular Expressions to NFA (1)
- For each kind of rexp, define an NFA
- Notation: NFA for rexp M
16Regular Expressions to NFA (2)
17Regular Expressions to NFA (3)
18Others
- What would be the representation for A+ ?
- What would be the representation for A? ?
- What about for [a-z] ?
19Example of RegExp -> NFA conversion
- Consider the regular expression
- (1|0)*1
- The NFA is
[Figure: NFA diagram with states A through J; transitions labeled 1, 0, and ε]
20More Homework Problems
- What is the NFA for the following RE?
- (a(bc)) a
- What is the NFA for the following RE?
- ((ab)c) (a b c)
21Lexical Analyzer
- Can be programmed in a high-level language.
- Can be generated using tools like LEX/Flex
- Integrate these tools with C/C++ or Java code
- In Java there are other tools, JFlex for example
22How can a tool like LEX or JAVACC work?
- Translate regular expressions to Non-deterministic Finite Automata (NFA)
- Easier expressive form than the DFA
- Automata theory tells us how to optimize
- Run the automata
- Simulate NFA, or
- Translate NFA to DFA: a new DFA where each state corresponds to a set of NFA states (see pages 28-29 of Appel for set construction)
- Have DFA move between states in simulation of the NFA's states
23Non-deterministic FA
- An NFA is a finite automaton modified to allow zero, one, or more transitions from a state on the same input symbol
- Easier to express complex patterns as an NFA
- Harder to mechanically simulate an NFA: which transition do we make on an input? (simulate all of them, then confirm that one of them worked)
- DFA and NFA are functionally equivalent.
24DFA with null moves
- The model of NFA can be extended to include transitions on <null> input.
- Change the state without reading any symbol from the input stream.
- ε-closure(q): set of all states reachable from q without reading any input symbol (following the null edges)
25eClosure Operator
- The eClosure operator is defined as eClosure(s) = {s} ∪ {states reachable from s using ε transitions}.
- Example: eClosure(1) = {1, 3}
[Figure: NFA diagram with a start arrow and states 1 through 5; transitions labeled a, b, a/b, and ε]
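The closure operator amounts to a graph search that follows only ε edges. A minimal sketch follows, assuming the ε edges are given as a dictionary from a state to the set of states one ε step away; the function name and the `EPS_EDGES` table are hypothetical, and only the edge from state 1 to state 3 is implied by the example above.

```python
def epsilon_closure(states, eps_edges):
    """States reachable from `states` using only ε-transitions."""
    closure = set(states)       # every state is in its own closure
    stack = list(states)
    while stack:
        q = stack.pop()
        for nxt in eps_edges.get(q, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


# Hypothetical ε-edges: only 1 -ε-> 3 is taken from the example.
EPS_EDGES = {1: {3}}

print(sorted(epsilon_closure({1}, EPS_EDGES)))
print(sorted(epsilon_closure({2}, EPS_EDGES)))
```

The `closure` set doubles as the visited set, so the search terminates even when the ε edges form a cycle.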
26RE to FA
- If we write the expression as an RE (easy for people), how do we turn it into an FA (something a machine can simulate)?
- Use Thompson's Construction
- At most twice as many states as there are symbols and operators in the regular expression.
- Results in an NFA (needs a non-deterministic computer to run most efficiently, hmm.)
27NFA to DFA
- Build super states in a DFA, where each super state represents the set of states that the NFA could move to from a state on a symbol
- ε-closure(q): states that can be arrived at from q with just null transitions
- move(S, a): states that can be reached from states in S on scanning a symbol a (from the input)
- ε-closure(S): states that can be reached with ε transitions from states in S
28NFA to DFA (cont.)
- Subset Construction (alg 3.2)
- Find ε-closure(q0) and add it to FAStates, unmarked
- while (some S in FAStates is unmarked)
- {
-   mark S
-   for each a in alphabet
-     T = ε-closure( move(S, a) )
-     if (T ∉ FAStates)
-       FAStates.include( T ), unmarked
-     FATran[S, a] = T
- }
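The loop above translates fairly directly into code. The following is a sketch under stated assumptions rather than a reference implementation: the NFA is a dictionary from (state, label) to a set of states, the empty string stands for ε, a queue of unmarked sets replaces explicit marking, and the three-state NFA for a*b at the bottom is a made-up example.

```python
from collections import deque

EPS = ""  # key used for ε-transitions in this sketch


def eps_closure(nfa, states):
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for nxt in nfa.get((q, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def move(nfa, states, a):
    result = set()
    for q in states:
        result |= nfa.get((q, a), set())
    return result


def subset_construction(nfa, start, alphabet):
    start_set = frozenset(eps_closure(nfa, {start}))
    fa_states, fa_tran = {start_set}, {}
    unmarked = deque([start_set])
    while unmarked:
        S = unmarked.popleft()              # taking S off the queue marks it
        for a in alphabet:
            T = frozenset(eps_closure(nfa, move(nfa, S, a)))
            if T not in fa_states:
                fa_states.add(T)
                unmarked.append(T)
            fa_tran[(S, a)] = T
    return start_set, fa_states, fa_tran


def dfa_accepts(start_set, fa_tran, nfa_accepting, text):
    state = start_set
    for ch in text:
        state = fa_tran[(state, ch)]
    return bool(state & nfa_accepting)      # accepting if it holds one


# Hypothetical NFA for a*b: 0 -ε-> 1, 1 -a-> 1, 1 -b-> 2 (accepting)
NFA = {(0, EPS): {1}, (1, "a"): {1}, (1, "b"): {2}}
START, STATES, TRAN = subset_construction(NFA, 0, "ab")
print(len(STATES), "DFA states")
```

In this sketch the empty set shows up as an ordinary DFA state and acts as the dead state: once reached, every symbol leads back to it.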
29DFA vs. NFA
- NFA is smaller, O(|r|) space, but takes more time for simulation: O(|r| × |x|) time, even with the nice properties of Thompson's construction (r is the regular expression, x the input string)
- DFA is faster, O(|x|) time, but is not space efficient: O(2^|r|) space
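The time difference shows up in the shape of the two simulation loops. The sketch below is illustrative only: the NFA is a dictionary from (state, label) to a set of states with the empty string standing for ε, the DFA is a dictionary from (state, symbol) to a single state, and both example machines (for strings over {a, b} ending in ab) are hypothetical.

```python
EPS = ""  # key used for ε-transitions in this sketch


def eps_closure(nfa, states):
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for nxt in nfa.get((q, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(nfa, start, accepting, text):
    current = eps_closure(nfa, {start})
    for ch in text:                       # |x| input symbols
        moved = set()
        for q in current:                 # up to all NFA states per symbol
            moved |= nfa.get((q, ch), set())
        current = eps_closure(nfa, moved)
    return bool(current & accepting)


def dfa_accepts(table, start, accepting, text):
    state = start
    for ch in text:                       # one table lookup per symbol
        state = table.get((state, ch))
        if state is None:                 # no transition: reject
            return False
    return state in accepting


# Hypothetical machines for "strings over {a, b} that end in ab".
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
DFA = {("A", "a"): "B", ("A", "b"): "A",
       ("B", "a"): "B", ("B", "b"): "C",
       ("C", "a"): "B", ("C", "b"): "A"}

for text in ["ab", "aab", "abb", "", "bab"]:
    print(repr(text), nfa_accepts(NFA, 0, {2}, text),
          dfa_accepts(DFA, "A", {"C"}, text))
```

The NFA loop tracks a whole set of states per input symbol, while the DFA loop does a single lookup; the DFA pays for that with a table that can grow exponentially in the size of the expression.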
30NFA to DFA
- What is the difference between the two?
- Is there a single DFA for a corresponding NFA?
- Why do we want to do this anyway?
31Subset Construction for NFA -> DFA
- 1. Compute A = eClosure(start)
- 2. Compute the set of states reachable from A on transition a; call this new set S
- 3. Compute eClosure(S); this is the new state, and label it with the next available label
- 4. Continue for all possible transitions from the current state, for all applicable elements of S
- 5. Repeat steps 2-4 for each new state
32Example a cb
[Figure: NFA diagram with states 1 through 6; transitions labeled a, b, c, and ε]
33References
- Compilers: Principles, Techniques, and Tools, Aho, Sethi, Ullman, Chapter 3
- http://www.cs.columbia.edu/lerner/CS4115
- Modern Compiler Implementation in Java, Andrew Appel, Cambridge University Press, 2003