COS 320 Compilers - PowerPoint PPT Presentation

1
COS 320 Compilers
  • David Walker

2
Outline
  • Last Week
  • Introduction to ML
  • Today
  • Lexical Analysis
  • Reading Chapter 2 of Appel

3
The Front End
  • Lexical Analysis: Create a sequence of tokens from a
    stream of characters
  • Syntax Analysis: Create an abstract syntax tree from the
    sequence of tokens
  • Type Checking: Check the program for well-formedness
    constraints

stream of characters
stream of tokens
abstract syntax
Lexer
Parser
Type Checker
4
Lexical Analysis
  • Lexical Analysis: Breaks the stream of ASCII
    characters (the source) into tokens
  • Token: An atomic unit of program syntax
  • i.e., a word as opposed to a sentence
  • Tokens and their types

Type     Characters Recognized    Token
ID       foo, x, listcount        ID(foo), ID(x), ...
REAL     10.45, 3.14, -2.1        REAL(10.45), REAL(3.14), ...
SEMI     ;                        SEMI
LPAREN   (                        LPAREN
NUM      50, 100                  NUM(50), NUM(100)
IF       if                       IF
5
Lexical Analysis Example
x := ( y + 4.0 );
6
Lexical Analysis Example
x := ( y + 4.0 );
ID(x)
Lexical Analysis
7
Lexical Analysis Example
x := ( y + 4.0 );
ID(x) ASSIGN
Lexical Analysis
8
Lexical Analysis Example
x := ( y + 4.0 );
ID(x) ASSIGN LPAREN ID(y) PLUS
REAL(4.0) RPAREN SEMI
Lexical Analysis
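As a sketch of what the lexer is doing here (in Python rather than ML, purely for illustration; the token names mirror the slides, the regex patterns are our assumptions):

```python
import re

# Hypothetical token patterns for the example fragment. Order matters:
# REAL is listed before NUM so "4.0" is not split into NUM(4) DOT NUM(0).
TOKEN_SPEC = [
    ("REAL",   r"[0-9]+\.[0-9]+"),
    ("NUM",    r"[0-9]+"),
    ("ID",     r"[a-z][a-z0-9]*"),
    ("ASSIGN", r":="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PLUS",   r"\+"),
    ("SEMI",   r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (type, text) pairs; characters matching no rule are skipped."""
    for m in MASTER.finditer(source):
        yield (m.lastgroup, m.group())

print(list(tokenize("x := ( y + 4.0 );")))
```

Running this on the slide's input produces exactly the token stream above: ID(x), ASSIGN, LPAREN, ID(y), PLUS, REAL(4.0), RPAREN, SEMI.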
9
Lexer Implementation
  • Implementation Options
  • Write a Lexer from scratch
  • Boring, error-prone and too much work

10
Lexer Implementation
  • Implementation Options
  • Write a Lexer from scratch
  • Boring, error-prone and too much work
  • Use a Lexer Generator
  • Quick and easy. Good for lazy compiler writers.

Lexer Specification
11
Lexer Implementation
  • Implementation Options
  • Write a Lexer from scratch
  • Boring, error-prone and too much work
  • Use a Lexer Generator
  • Quick and easy. Good for lazy compiler writers.

Lexer Specification
Lexer
lexer generator
12
Lexer Implementation
  • Implementation Options
  • Write a Lexer from scratch
  • Boring, error-prone and too much work
  • Use a Lexer Generator
  • Quick and easy. Good for lazy compiler writers.

stream of characters
Lexer Specification
Lexer
lexer generator
stream of tokens
13
  • How do we specify the lexer?
  • Develop another language
  • We'll use a language involving regular
    expressions to specify tokens
  • What is a lexer generator?
  • Another compiler ....

14
Some Definitions
  • We will want to define the language of legal
    tokens our lexer can recognize
  • Alphabet: a collection of symbols (ASCII is an
    alphabet)
  • String: a finite sequence of symbols taken from
    our alphabet
  • Language of legal tokens: a set of strings
  • Language of ML keywords: the set of all strings
    which are ML keywords (FINITE)
  • Language of ML tokens: the set of all strings which
    map to ML tokens (INFINITE)
  • A language can also be a more general set of
    strings
  • e.g., the ML Language: the set of all strings
    representing correct ML programs (INFINITE)

15
Regular Expressions Construction
  • Base Cases
  • For each symbol a in the alphabet, a is an RE denoting
    the set {a}
  • Epsilon (ε) denotes {""}, the set containing only the
    empty string
  • Inductive Cases (M and N are REs)
  • Alternation (M | N) denotes strings in M or N
  • (a | b) = {a, b}
  • Concatenation (M N) denotes strings in M
    concatenated with strings in N
  • (a | b)(a | c) = {aa, ac, ba, bc}
  • Kleene closure (M*) denotes strings formed by any
    number of repetitions of strings in M
  • (a | b)* = {ε, a, b, aa, ab, ba, bb, ...}
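The three constructions can be checked by brute-force enumeration. A small Python sketch (the function names are ours, not standard notation):

```python
from itertools import product

# Each RE denotes a set of strings; model the three combinators directly.
def alternation(M, N):            # M | N : union of the two sets
    return M | N

def concatenation(M, N):          # M N : every m followed by every n
    return {m + n for m in M for n in N}

def kleene(M, max_reps=2):        # M* : truncated at max_reps repetitions
    out = {""}                    # epsilon is always included
    for k in range(1, max_reps + 1):
        out |= {"".join(p) for p in product(M, repeat=k)}
    return out

print(alternation({"a"}, {"b"}))              # {'a', 'b'}
print(concatenation({"a", "b"}, {"a", "c"}))  # {'aa', 'ac', 'ba', 'bc'}
print(sorted(kleene({"a", "b"})))             # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The truncation parameter exists only because Kleene closure denotes an infinite set; the enumerated prefix matches the slide's {ε, a, b, aa, ab, ba, bb, ...}.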

16
Regular Expressions
  • Integers begin with an optional minus sign,
    continue with a sequence of digits
  • Regular Expression:
  • (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
    (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

17
Regular Expressions
  • Integers begin with an optional minus sign,
    continue with a sequence of digits
  • Regular Expression:
  • (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
    (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
  • So writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
    and, even worse, (a | b | c | ...) gets
    tedious...
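With the abbreviations introduced on the next slide, the integer RE collapses to `-?[0-9]+`. A quick Python check (illustrative only; Python's `re` shares these shorthands):

```python
import re

# The integer RE: optional minus sign, then one or more digits.
INT = re.compile(r"-?[0-9]+")

for s in ["42", "-7", "007", "-", "3a"]:
    # fullmatch: the whole string must be an integer token
    print(s, "->", bool(INT.fullmatch(s)))
```

Note that a bare "-" is rejected: the digit part requires at least one digit, matching the "continue with a sequence of digits" requirement.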

18
Regular Expressions (REs)
  • common abbreviations
  • [a-c] == (a | b | c)
  • .  == any character except \n
  • \n == the newline character
  • a+ == one or more repetitions of a
  • a? == zero or one occurrence of a
  • all abbreviations can be defined in terms of the
    standard REs

19
Ambiguous Token Rule Sets
  • A single RE is a completely unambiguous
    specification of a token.
  • We call the association of an RE with a token a
    "rule"
  • To lex an entire programming language, we need
    many rules
  • but ambiguities arise
  • multiple REs or sequences of REs match the same
    string
  • hence many token sequences possible

20
Ambiguous Token Rule Sets
  • Example
  • Identifier tokens: [a-z][a-z0-9]*
  • Sample keyword tokens: if, then, ...
  • How do we tokenize:
  • foobar => ID(foobar) or ID(foo) ID(bar)?
  • if => ID(if) or IF?

21
Ambiguous Token Rule Sets
  • We resolve ambiguities using two conventions
  • Longest match The regular expression that
    matches the longest string takes precedence.
  • Rule Priority The regular expressions
    identifying tokens are written down in sequence.
    If two regular expressions match the same
    (longest) string, the first regular expression in
    the sequence takes precedence.
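These two conventions have a direct operational reading: try every rule at the current position, keep the longest match, and break ties by rule order. A Python sketch (the rule set and names are illustrative, not ml-lex):

```python
import re

# Rules in priority order: the keyword rule comes before the identifier rule.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
]

def next_token(src, pos):
    """Longest match wins; on a tie, the earlier rule wins."""
    best = None
    for i, (name, rx) in enumerate(RULES):
        m = rx.match(src, pos)
        if m:
            key = (len(m.group()), -i)     # longer match, then lower rule index
            if best is None or key > best[0]:
                best = (key, name, m.group())
    return None if best is None else (best[1], best[2])

print(next_token("foobar", 0))  # ('ID', 'foobar') -- longest match
print(next_token("if", 0))      # ('IF', 'if')     -- rule priority
```

On "foobar" the identifier rule matches six characters while the keyword rule matches none, so longest match picks ID; on "if" both rules match two characters and the keyword rule wins by priority.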

22
Ambiguous Token Rule Sets
  • Example
  • Identifier tokens: [a-z][a-z0-9]*
  • Sample keyword tokens: if, then, ...
  • How do we tokenize:
  • foobar => ID(foobar) or ID(foo) ID(bar)?
  • use longest match to disambiguate
  • if => ID(if) or IF?
  • keyword rules have higher priority than the
    identifier rule

23
Lexer Implementation
  • Implementation Options
  • Write Lexer from scratch
  • Boring and error-prone
  • Use Lexical Analyzer Generator
  • Quick and easy
  • ml-lex is a lexical analyzer generator for ML.
  • lex and flex are lexical analyzer generators for
    C.

24
ML-Lex Specification
  • A lexical specification consists of 3 parts:

User Declarations (plain ML types, values, functions)
ML-LEX Definitions (RE abbreviations, special stuff)
Rules (association of REs with tokens; each token
will be represented in plain ML)
25
User Declarations
  • User Declarations
  • The user can define various values that are available
    to the action fragments.
  • Two values must be defined in this section:
  • type lexresult
  • the type of the value returned by each rule action
  • fun eof ()
  • called by the lexer when the end of the input stream
    is reached

26
ML-LEX Definitions
  • ML-LEX Definitions
  • The user can define regular expression abbreviations
  • Define multiple lexers to work together. Each is
    given a unique name.

DIGITS = [0-9];
LETTER = [a-zA-Z];
%s LEX1 LEX2 LEX3;
27
Rules
  • Rules
  • A rule consists of a pattern and an action:
  • The pattern is a regular expression.
  • The action is a fragment of ordinary ML code.
  • Longest match and rule priority are used for
    disambiguation
  • Rules may be prefixed with the list of lexers
    that are allowed to use the rule.

<lexer_list> regular_expression => (action.code);
28
Rules
  • Rule actions can use any value defined in the
    User Declarations section, including
  • type lexresult
  • the type of the value returned by each rule action
  • val eof : unit -> lexresult
  • called by the lexer when the end of the input stream
    is reached
  • special variables:
  • yytext: the input substring matched by the regular
    expression
  • yypos: the file position of the beginning of the
    matched string
  • continue (): doesn't return a token; recursively
    calls the lexer

29
A Simple Lexer
datatype token = Num of int | Id of string | IF
               | THEN | ELSE | EOF
type lexresult = token                 (* mandatory *)
fun eof () = EOF                       (* mandatory *)
fun itos s = case Int.fromString s of
                 SOME x => x
               | NONE => raise Fail "not an int"
%%
NUM = [1-9][0-9]*;
ID  = [a-zA-Z]([a-zA-Z]|{NUM})*;
%%
if     => (IF);
then   => (THEN);
else   => (ELSE);
{NUM}  => (Num (itos yytext));
{ID}   => (Id yytext);
30
Using Multiple Lexers
  • Rules prefixed with a lexer name are matched only
    when that lexer is executing
  • The initial lexer is called INITIAL
  • Enter a new lexer using:
  • YYBEGIN LEXERNAME
  • Aside: it is sometimes useful to process characters
    but not return any token from the lexer. Use:
  • continue ()
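The YYBEGIN idea amounts to a state variable that selects which rules apply. A Python sketch (not ml-lex itself; ML's `(*` / `*)` comment delimiters, non-nested, word tokens only):

```python
import re

WORD = re.compile(r"[a-z]+")

def lex(src):
    """Two-state lexer: INITIAL emits word tokens, COMMENT emits nothing."""
    state, pos, tokens = "INITIAL", 0, []
    while pos < len(src):
        if state == "INITIAL":
            if src.startswith("(*", pos):
                state = "COMMENT"              # YYBEGIN COMMENT
                pos += 2
            else:
                m = WORD.match(src, pos)
                if m:
                    tokens.append(m.group())
                    pos = m.end()
                else:
                    pos += 1                   # skip whitespace etc.
        else:                                  # state == "COMMENT"
            if src.startswith("*)", pos):
                state = "INITIAL"              # YYBEGIN INITIAL
                pos += 2
            else:
                pos += 1                       # consume, like continue ()
    return tokens

print(lex("foo (* a comment *) bar"))  # ['foo', 'bar']
```

The COMMENT branch is exactly the continue () idiom: characters are consumed without producing tokens until the closing delimiter switches the state back.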

31
Using Multiple Lexers
type lexresult = unit                  (* mandatory *)
fun eof () = ()                        (* mandatory *)
%%
%s COMMENT;
%%
<INITIAL> if      => ();
<INITIAL> [a-z]+  => ();
<INITIAL> "(*"    => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)"    => (YYBEGIN INITIAL; continue ());
<COMMENT> \n|.    => (continue ());
32
A (Marginally) More Exciting Lexer
type lexresult = string                       (* mandatory *)
fun eof () = (print "End of file\n"; "EOF")   (* mandatory *)
%%
%s COMMENT;
INT = [1-9][0-9]*;
%%
<INITIAL> if     => ("IF");
<INITIAL> then   => ("THEN");
<INITIAL> {INT}  => ("INT(" ^ yytext ^ ")");
<INITIAL> "(*"   => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)"   => (YYBEGIN INITIAL; continue ());
<COMMENT> \n|.   => (continue ());
33
Implementing ML-Lex
  • By compiling, of course:
  • convert REs into non-deterministic finite
    automata (NDFAs)
  • convert NDFAs into deterministic finite
    automata (DFAs)
  • convert DFAs into a blazingly fast
    table-driven algorithm
  • you did mostly everything but possibly the last
    step in your favorite algorithms class
  • need to deal with disambiguation / rule priority
  • need to deal with multiple lexers
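The NDFA-to-DFA step is the classic subset construction. A Python sketch on the if/identifier rules used in the next slides (state numbering follows the slides):

```python
from collections import defaultdict

# NDFA for:  if => Tok.IF   and   [a-z][a-z0-9]* => Tok.Id
# state 1 --i--> 2, 2 --f--> 3 (Tok.IF);
# state 1 --[a-z]--> 4, 4 --[a-z0-9]--> 4 (Tok.Id)
ALPHA = [chr(c) for c in range(ord("a"), ord("z") + 1)]
DIGITS = [str(d) for d in range(10)]

nfa = defaultdict(set)
for c in ALPHA:
    nfa[(1, c)].add(4)
    nfa[(4, c)].add(4)
for c in DIGITS:
    nfa[(4, c)].add(4)
nfa[(1, "i")].add(2)
nfa[(2, "f")].add(3)

def subset_construction(start):
    """Return the reachable DFA states (each one a set of NDFA states)."""
    seen, work = set(), [frozenset(start)]
    while work:
        S = work.pop()
        if S in seen:
            continue
        seen.add(S)
        for c in ALPHA + DIGITS:
            T = frozenset(t for s in S for t in nfa[(s, c)])
            if T:
                work.append(T)
    return seen

# Yields exactly the DFA states on the slides: 1, {2,4}, {3,4}, 4
print(sorted(sorted(S) for S in subset_construction({1})))
```

Note how reading "i" from state 1 reaches the merged state {2,4}: the DFA tracks every NDFA state the input could be in at once.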

34
Refreshing your memory: RE => NDFA => DFA
  • Lex rules:
  • if => (Tok.IF);
  • [a-z][a-z0-9]* => (Tok.Id);

35
Refreshing your memory: RE => NDFA => DFA
  • Lex rules:
  • if => (Tok.IF);
  • [a-z][a-z0-9]* => (Tok.Id);
  • NDFA:

[Diagram: NDFA. State 1 --i--> state 2 --f--> state 3,
accepting Tok.IF; state 1 --[a-z]--> state 4, accepting
Tok.Id, with an [a-z0-9] self-loop on state 4.]
36
Refreshing your memory: RE => NDFA => DFA
  • Lex rules:
  • if => (Tok.IF);
  • [a-z][a-z0-9]* => (Tok.Id);
  • NDFA / DFA:

[Diagram: the NDFA above next to the DFA built from it.
DFA: state 1 --[a-hj-z]--> state 4 (Tok.Id);
1 --i--> state {2,4} (Tok.Id);
{2,4} --f--> state {3,4} (Tok.IF);
{2,4} --[a-eg-z0-9]--> 4; {3,4} --[a-z0-9]--> 4;
4 has an [a-z0-9] self-loop.]
(State {3,4} could also be Tok.Id; the decision is made by
rule priority.)
37
Table-driven algorithm
  • DFA:

[Diagram: the DFA from the previous slide, with states
1, {2,4}, {3,4}, 4, accepting Tok.Id except {3,4},
which accepts Tok.IF.]
38
Table-driven algorithm
  • DFA (states conveniently renamed):

[Diagram: S1 --i--> S2 (Tok.Id); S1 --[a-hj-z]--> S4
(Tok.Id); S2 --f--> S3 (Tok.IF); S2 --[a-eg-z0-9]--> S4;
S3 --[a-z0-9]--> S4; S4 has an [a-z0-9] self-loop.]
39
Table-driven algorithm
  • DFA and its Transition Table:

[Diagram: the DFA S1-S4 from the previous slide.]

Transition Table (rows: input character; columns: current state):

         S1    S2    S3    S4
  a      S4    S4    S4    S4
  b      S4    S4    S4    S4
  ...
  f      S4    S3    S4    S4
  ...
  i      S2    S4    S4    S4
  ...
  0-9    -     S4    S4    S4
40
Table-driven algorithm
  • DFA and its Transition Table (as on the previous slide)

Final State Table:

         S1    S2        S3        S4
         -     Tok.Id    Tok.IF    Tok.Id
41
Table-driven algorithm
  • DFA and its Transition Table (as on the previous slides)
  • Algorithm:
  • Start in the start state
  • Transition from one state to the next using the
    transition table
  • Every time you reach a potential final state,
    remember it and its position in the stream
  • When no more transitions apply, revert to the last
    final state seen and its position
  • Execute the associated rule code

Final State Table:

         S1    S2        S3        S4
         -     Tok.Id    Tok.IF    Tok.Id
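The algorithm above can be sketched directly over the S1-S4 tables (a Python illustration; the table entries follow the slides):

```python
# Table-driven scanner loop over the DFA S1-S4:
# follow transitions, remember the last accepting state and its position,
# and revert to it when no transition applies (longest match).
TRANS = {}
for c in "abcdefghijklmnopqrstuvwxyz":
    TRANS[("S1", c)] = "S2" if c == "i" else "S4"
    TRANS[("S2", c)] = "S3" if c == "f" else "S4"
    TRANS[("S3", c)] = "S4"
    TRANS[("S4", c)] = "S4"
for d in "0123456789":
    for s in ("S2", "S3", "S4"):    # S1 has no digit transition
        TRANS[(s, d)] = "S4"

FINAL = {"S2": "Tok.Id", "S3": "Tok.IF", "S4": "Tok.Id"}  # S1 is not accepting

def longest_match(src):
    """Return (chars consumed, token) for the longest prefix, or None."""
    state, last = "S1", None
    for i, c in enumerate(src):
        state = TRANS.get((state, c))
        if state is None:
            break                         # no transition: stop scanning
        if state in FINAL:
            last = (i + 1, FINAL[state])  # remember last accepting position
    return last

print(longest_match("if"))      # (2, 'Tok.IF')
print(longest_match("iffy"))    # (4, 'Tok.Id')  -- longest match beats IF
print(longest_match("x1 =y"))   # (2, 'Tok.Id')  -- stops at the space
```

The "iffy" case shows both conventions at work: the scanner passes through the accepting IF state but keeps going, and the longer identifier match wins.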
42
Dealing with Multiple Lexers
  • Lex rules:
  • <INITIAL> if => (Tok.IF);
  • <INITIAL> [a-z][a-z0-9]* => (Tok.Id);
  • <INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
  • <COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
  • <COMMENT> . => (continue ());

43
Dealing with Multiple Lexers
  • Lex rules:
  • <INITIAL> if => (Tok.IF);
  • <INITIAL> [a-z][a-z0-9]* => (Tok.Id);
  • <INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
  • <COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
  • <COMMENT> . => (continue ());

[Diagram: two lexer states. INITIAL --"(*"--> COMMENT;
COMMENT --"*)"--> INITIAL; INITIAL loops on if and
[a-z][a-z0-9]*; COMMENT loops on "." .]
44
Summary
  • A Lexer:
  • input: a stream of characters
  • output: a stream of tokens
  • Writing lexers by hand is boring, so we use a
    lexer generator: ml-lex
  • lexer generators work by converting REs, via
    automata theory, into efficient table-driven
    algorithms
  • Moral: don't underestimate your theory classes!
  • great application of cool theory developed in the '70s
  • we'll see more cool apps as the course progresses