Lexical Analysis

Provided by: robertberr

Learn more at: https://cs.nyu.edu

1
Lexical Analysis
2
The Input

Read string input
Might be sequence of characters (Unix)
Might be sequence of lines (VMS)
Character set
ASCII
ISO Latin-1
ISO 10646 (16-bit Unicode): Ada, Java
Others (EBCDIC, JIS, etc)

3
The Output

A series of tokens: kind, location, name (if any)
Punctuation: ( ) ,
Operators: e.g. + -
Keywords: begin end if while try catch
Identifiers: Square_Root
String literals: "press Enter to continue"
Character literals: 'x'
Numeric literals
Integer: 123
Floating point: 4_5.23e2
Based representation: 16#ac#

4
Free form vs Fixed form

Free form languages (all modern ones)
White space does not matter. Ignore these:
Tabs, spaces, new lines, carriage returns
Only the ordering of tokens is important
Fixed format languages (historical)
Layout is critical
Fortran: statement label in columns 1-5
COBOL: area A and area B
Lexical analyzer must know about layout to find tokens

5
Punctuation: Separators

Typically individual special characters such as ( and .. (two dots)
Sometimes double characters: lexical scanner looks for longest token
(*, /*, -- are comment openers in various languages
Returned just as identity (kind) of token
And perhaps location for error messages and debugging purposes
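
The "longest token" rule can be sketched as follows. This is a minimal illustration, not a production scanner: the `SYMBOLS` set and the function name `longest_symbol` are assumptions made up for the example.

```python
# Hypothetical symbol set; a real language defines its own.
SYMBOLS = {"(", ")", ",", ".", "..", "/", "/*", "-", "--", "(*"}
MAX_LEN = max(len(s) for s in SYMBOLS)


def longest_symbol(text, pos):
    """Return the longest symbol in SYMBOLS that starts at text[pos], or None."""
    # Try the longest candidates first so ".." wins over ".".
    for length in range(MAX_LEN, 0, -1):
        candidate = text[pos:pos + length]
        if len(candidate) == length and candidate in SYMBOLS:
            return candidate
    return None
```

Called on `"a..b"` at position 1, the function prefers the two-character symbol over the single dot.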

6
Operators

Like punctuation
No real difference for lexical analyzer
Typically single or double special chars
Operators: e.g. - < >
Returned as kind of token
And perhaps location

7
Keywords

Reserved identifiers
E.g. BEGIN END in Pascal, if in C, catch in C++
May be distinguished from identifiers
E.g. mode (identifier) vs. MODE (keyword marked by stropping) in Algol-68
Returned as kind of token
With possible location information
Oddity: unreserved keywords in PL/1
IF IF THEN THEN = THEN + 1
Handled as identifiers (parser disambiguates)

8
Identifiers

Rules differ
Length, allowed characters, separators
Need to build a names table
Single entry for all occurrences of Var1
Language may be case insensitive: same entry for VAR1, vAr1, Var1
Typical structure: hash table
Lexical analyzer returns token kind
And key (index) to table entry
Table entry includes location information

9
Organization of names table

Most common structure is hash table
With fixed number of headers
Chain according to hash code
Serial search on one chain
Hash code computed from characters (e.g. sum mod table size).
No hash code is perfect! Expect collisions.
Avoid any arbitrary limits on table or chain size.
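
A names table of this shape might look like the sketch below. The class name `NamesTable`, its method names, and the default of 211 headers are assumptions for illustration. The hash is the simple "sum mod table size" mentioned on the slide, which collides for any two names with the same letters.

```python
class NamesTable:
    """Chained hash table: fixed number of headers, unbounded chains."""

    def __init__(self, num_headers=211, case_sensitive=True):
        self.headers = [[] for _ in range(num_headers)]
        self.entries = []  # entry index -> (spelling as first seen, first location)
        self.case_sensitive = case_sensitive

    def _key(self, name):
        # Case-insensitive languages fold the name before lookup.
        return name if self.case_sensitive else name.lower()

    def _hash(self, key):
        return sum(ord(c) for c in key) % len(self.headers)

    def intern(self, name, location):
        """Return the table index for name, creating an entry on first use."""
        key = self._key(name)
        chain = self.headers[self._hash(key)]
        for stored_key, index in chain:  # serial search on one chain
            if stored_key == key:
                return index
        index = len(self.entries)
        self.entries.append((name, location))
        chain.append((key, index))
        return index
```

Keeping the first spelling in the entry lets diagnostics show the name the way the user wrote it, even when lookup ignores case.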

10
String Literals

Text must be stored
Actual characters are important
Not like identifiers: must preserve casing
Character set issues: uniform internal representation
Table needed
Lexical analyzer returns key into table
May or may not be worth hashing to avoid duplicates

11
Character Literals

Similar issues to string literals
Lexical Analyzer returns
Token kind
Identity of character
Cannot assume character set of host machine; the target's may be different

12
Numeric Literals

Need a table to store numeric value
E.g. 123 = 0123 = 01_23 (Ada)
But cannot use predefined type for values
Because may have different bounds
Floating point representations much more complex
Denormals, correct rounding
Very delicate to compute correct value.
Host / target issues
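
For integer literals, the value can be computed independently of any fixed-width host type. The sketch below is an illustration under assumptions: the function names are invented, only Ada-style integer forms are handled (underscores and `base#digits#`), and exponents and real literals are left out. Python integers are unbounded, so the range check against the target is a separate, explicit step.

```python
def integer_literal_value(text):
    """Value of an Ada-style integer literal such as 123, 01_23 or 16#FF#."""
    text = text.replace("_", "")
    if "#" in text:
        base_text, digits, rest = text.split("#")
        if rest != "":
            raise ValueError("exponents are not handled in this sketch")
        base = int(base_text)
    else:
        base, digits = 10, text
    value = 0
    for ch in digits:
        digit = int(ch, 36)  # '0'-'9' -> 0-9, letters -> 10-35
        if digit >= base:
            raise ValueError(f"digit {ch!r} is out of range for base {base}")
        value = value * base + digit
    return value


def fits_target(value, bits=16):
    """True if value fits a signed integer of the given width on the target."""
    return -(2 ** (bits - 1)) <= value < 2 ** (bits - 1)
```

The three spellings `123`, `0123` and `01_23` all produce the same value, so the table holds one number regardless of how it was written.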

13
Handling Comments

Comments have no effect on program
Can be eliminated by scanner
But may need to be retrieved by tools
Error detection issues
E.g. unclosed comments
Scanner skips over comments and returns next meaningful token
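
A minimal sketch of the skipping step, including the unclosed-comment check. It assumes two comment styles for illustration (Ada-style `--` to end of line and C-style `/* ... */`); the exception and function names are invented for the example.

```python
class UnclosedComment(Exception):
    pass


def skip_blanks_and_comments(text, pos):
    """Return the index of the next meaningful character at or after pos."""
    while pos < len(text):
        if text[pos] in " \t\r\n":
            pos += 1
        elif text.startswith("--", pos):  # runs to end of line
            end = text.find("\n", pos)
            pos = len(text) if end == -1 else end + 1
        elif text.startswith("/*", pos):  # needs an explicit closer
            end = text.find("*/", pos + 2)
            if end == -1:
                raise UnclosedComment(f"comment opened at offset {pos} is never closed")
            pos = end + 2
        else:
            break
    return pos
```

A tool that needs the comments back (a pretty-printer, say) could record the skipped spans instead of discarding them.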

14
Case Equivalence

Some languages are case-insensitive
Pascal, Ada
Some are not
C, Java
Lexical analyzer ignores case if needed
This_Routine = THIS_RouTine
Error analysis may need exact casing
Friendly diagnostics follow user's conventions

15
Performance Issues

Speed
Lexical analysis can become bottleneck
Minimize processing per character
Skip blanks fast
I/O is also an issue (read large blocks)
We compile frequently
Compilation time is important
Especially during development
Communicate with parser through global variables

16
General Approach

Define set of token kinds
An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign etc).
Or a series of integer definitions in more primitive languages
Some tokens carry associated data
E.g. key for identifier table
May be useful to build tree node
For identifiers, literals etc
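
One possible shape for the token kinds and their associated data. The names `Tok` and `Token` and the field layout are assumptions for illustration; the kinds mirror the slide's examples plus an identifier kind.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Tok(Enum):
    INT = auto()
    IF = auto()
    PLUS = auto()
    LEFT_PAREN = auto()
    ASSIGN = auto()
    IDENT = auto()


@dataclass(frozen=True)
class Token:
    kind: Tok
    line: int
    column: int
    key: Optional[int] = None  # index into a names or literal table, if any
```

Punctuation and operators need only the kind and location; identifiers and literals also carry the table key.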

17
Interface to Lexical Analyzer

Either: Convert entire file to a file of tokens
Lexical analyzer is separate phase
Or: Parser calls lexical analyzer to supply next token
This approach avoids extra I/O
Parser builds tree incrementally, using successive tokens as tree nodes

18
Relevant Formalisms

Type 3 (Regular) Grammars
Regular Expressions
Finite State Machines
Equivalent in expressive power
Useful for program construction, even if hand-written

19
Regular Grammars

Regular grammars
Non-terminals (arbitrary names)
Terminals (characters)
Productions limited to the following:
Non-terminal ::= terminal
Non-terminal ::= terminal Non-terminal
Treat character class (e.g. digit) as terminal
Regular grammars cannot count: cannot practically express size limits on identifiers, literals
Cannot express proper nesting (parentheses)

20
Regular Grammars

Grammar for real literals with no exponent:
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
REAL ::= digit REAL1
REAL1 ::= digit REAL1 (arbitrary size)
REAL1 ::= . INTEGER
INTEGER ::= digit INTEGER (arbitrary size)
INTEGER ::= digit
Start symbol is REAL

21
Regular Expressions

Regular expressions (RE) defined by an alphabet (terminal symbols) and three operations
Alternation: RE1 | RE2
Concatenation: RE1 RE2
Repetition: RE* (zero or more REs)
Language of REs = language of regular grammars
Regular expressions are more convenient for some applications

22
Specifying REs in Unix Tools

Single characters: a b c d \x
Alternation: [bcd] [b-z] ab|cd
Any character: . (period)
Match sequence of characters: x* y+
Concatenation: abc[d-q]
Optional RE: [0-9]+(\.[0-9]*)?
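
Python's `re` module accepts the same basic notation as the Unix tools (character classes in brackets, `|`, `*`, `+`, `?`), so the forms can be tried out directly. The sample patterns and strings below are chosen for illustration, and `matches` is a helper invented for the example.

```python
import re


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, sample text, whether the whole text should match)
EXAMPLES = [
    (r"[bcd]", "c", True),       # alternation over single characters
    (r"[b-z]", "a", False),      # range: 'a' is outside b-z
    (r"ab|cd", "cd", True),      # alternation of two sequences
    (r".", "?", True),           # any single character
    (r"x*", "", True),           # zero or more
    (r"y+", "", False),          # one or more
    (r"abc[d-q]", "abce", True), # concatenation with a class
    (r"[0-9]+(\.[0-9]*)?", "3.14", True),  # optional fraction
    (r"[0-9]+(\.[0-9]*)?", "42", True),
    (r"[0-9]+(\.[0-9]*)?", ".5", False),   # leading digit is required
]

RESULTS = [matches(p, t) == expected for p, t, expected in EXAMPLES]
```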

23
Finite State Machines

A language defined by a grammar is a (possibly infinite) set of strings
An automaton is a computation that determines whether a given string belongs to a specified language
A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions)
Simplest automaton: memory is a single number (state)

24
Specifying an FSM

A set of labeled states
Directed arcs between states, labeled with a character
One or more states may be terminal (accepting)
A distinguished state is start
Automaton makes transition from state S1 to S2 if and only if arc from S1 to S2 is labeled with next character in input
Token is legal if automaton stops on terminal state

25
Building FSM from Grammar

One state for each non-terminal
A rule of the form
Nt1 ::= terminal
Generates transition from S1 to final state
A rule of the form
Nt1 ::= terminal Nt2
Generates transition from S1 to S2 on an arc labeled by the terminal
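
The two rules can be applied mechanically. The sketch below builds the arcs for a small grammar for real literals without exponent (REAL, REAL1, INTEGER); the tuple encoding of productions and the function names are assumptions for illustration. Because INTEGER has two productions on the same terminal, the resulting machine is non-deterministic, so the recognizer tracks a set of current states.

```python
FINAL = "FINAL"
DIGITS = set("0123456789")

# (non-terminal, terminal, following non-terminal or None)
PRODUCTIONS = [
    ("REAL", "digit", "REAL1"),
    ("REAL1", "digit", "REAL1"),
    ("REAL1", ".", "INTEGER"),
    ("INTEGER", "digit", "INTEGER"),
    ("INTEGER", "digit", None),
]


def build_fsm(productions):
    """Map (state, terminal) to the set of states reachable on that terminal."""
    arcs = {}
    for non_terminal, terminal, following in productions:
        target = FINAL if following is None else following
        arcs.setdefault((non_terminal, terminal), set()).add(target)
    return arcs


def accepts(arcs, start, text):
    """Follow every possible arc at once; accept if FINAL is reachable at the end."""
    current = {start}
    for ch in text:
        terminal = "digit" if ch in DIGITS else ch
        current = set().union(*(arcs.get((s, terminal), set()) for s in current))
        if not current:
            return False
    return FINAL in current


ARCS = build_fsm(PRODUCTIONS)
```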

26
Graphic representation

[Figure: state diagram with start state S and states Int and id; arcs are labeled digit, letter, and underscore.]
27
Building FSMs from REs

Every RE corresponds to a grammar
For all regular expressions
A natural translation to FSM exists
Alternation often leads to non-deterministic machines

28
Non-Deterministic FSM

A non-deterministic FSM
Has at least one state
With two arcs to two distinct states
Labeled with the same character
Example: from start state, a digit can begin an integer literal or a real literal
Implementation requires backtracking
Nasty!

29
Deterministic FSM

For all states S
For all characters C
There is at most one arc from any state S that is labeled with C
Much easier to implement
No backtracking

30
From NFSM to DFSM

There is an algorithm for converting a
non-deterministic machine to a deterministic one
Result may have exponentially more states
Intuitively: need new states to express uncertainty about token: int or real
Algorithm is efficient in practice (e.g. grep)
Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
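
The standard conversion is the subset construction: each deterministic state stands for the set of non-deterministic states the machine could be in. The sketch below applies it to a small "integer or real" machine; the state names and the dictionary encoding are assumptions for illustration, and empty (epsilon) moves are not handled.

```python
def subset_construction(nfa, start, accepting, alphabet):
    """Return (dfa_transitions, dfa_start, dfa_accepting) for an NFA without empty moves."""
    dfa_start = frozenset([start])
    dfa, pending, seen = {}, [dfa_start], {dfa_start}
    while pending:
        state_set = pending.pop()
        for symbol in alphabet:
            target = frozenset(t for s in state_set for t in nfa.get((s, symbol), ()))
            if not target:
                continue
            dfa[(state_set, symbol)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, dfa_start, dfa_accepting


# From S, a digit may start an integer (Int) or the whole part of a real (Whole).
NFA = {
    ("S", "digit"): {"Int", "Whole"},
    ("Int", "digit"): {"Int"},
    ("Whole", "digit"): {"Whole"},
    ("Whole", "."): {"Point"},
    ("Point", "digit"): {"Frac"},
    ("Frac", "digit"): {"Frac"},
}
DFA, DFA_START, DFA_ACCEPTING = subset_construction(
    NFA, "S", {"Int", "Frac"}, ["digit", "."]
)


def run_dfa(text):
    """Run the deterministic machine; no backtracking is needed."""
    state = DFA_START
    for ch in text:
        symbol = "digit" if "0" <= ch <= "9" else ch
        state = DFA.get((state, symbol))
        if state is None:
            return False
    return state in DFA_ACCEPTING
```

The combined state `{Int, Whole}` is exactly the "uncertainty about token" state: it is accepting because the input so far is a valid integer, and it still has an arc on `.` in case a real follows.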

31
Implementing the Scanner

Three methods
Hand-coded approach
draw DFSM, then implement with loop and case statement
Hybrid approach
define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM
Hand-code resulting DFSM
Automated approach
Use regular grammar as input to lexical scanner generator (e.g. LEX)

32
Hand-coding

Normal coding techniques
Scan over white space and comments till non-blank character found.
Branch depending on first character
If digit, scan numeric literal
If letter, scan identifier or keyword
If operator, check next character (for two-character operators, etc.)
Need table to determine character type efficiently
Return token found
Write aggressive efficient code: gotos, global variables
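
A hand-coded scanner following this outline might look like the sketch below. It is an illustration only: the keyword and operator sets are hypothetical, comments are not handled, numeric literals are plain digit runs, and Python has no goto, so the structure is a branch on the first character.

```python
# Character-class table, built once so classification is a single lookup.
CLASS = {}
for c in "0123456789":
    CLASS[c] = "digit"
for c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
    CLASS[c] = "letter"
for c in " \t\r\n":
    CLASS[c] = "blank"

KEYWORDS = {"begin", "end", "if", "then", "while"}  # hypothetical keyword set
DOUBLE_OPERATORS = {":=", "<=", ">=", "**"}         # hypothetical operator set


def next_token(text, pos):
    """Return (kind, lexeme, new_pos); kind is 'eof' at end of input."""
    n = len(text)
    while pos < n and CLASS.get(text[pos]) == "blank":  # skip blanks fast
        pos += 1
    if pos == n:
        return ("eof", "", pos)
    start = pos
    kind = CLASS.get(text[pos], "special")
    if kind == "digit":
        while pos < n and CLASS.get(text[pos]) == "digit":
            pos += 1
        return ("integer", text[start:pos], pos)
    if kind == "letter":
        while pos < n and (CLASS.get(text[pos]) in ("letter", "digit") or text[pos] == "_"):
            pos += 1
        lexeme = text[start:pos]
        return ("keyword" if lexeme.lower() in KEYWORDS else "identifier", lexeme, pos)
    if text[pos:pos + 2] in DOUBLE_OPERATORS:  # check next character
        return ("operator", text[pos:pos + 2], pos + 2)
    return ("operator", text[pos], pos + 1)


def scan(text):
    """Collect (kind, lexeme) pairs until end of input."""
    tokens, pos = [], 0
    while True:
        kind, lexeme, pos = next_token(text, pos)
        if kind == "eof":
            return tokens
        tokens.append((kind, lexeme))
```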

33
Using grammar and FSM

Start with regular grammar or RE
Typically found in the language reference
Example (Ada):
Chapter 2. Lexical Elements
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
decimal-literal ::= integer [.integer] [exponent]
integer ::= digit {[underline] digit}
exponent ::= E [+] integer | E - integer

34
Using grammar and FSM

Create one state for each non-terminal
Label edges according to productions in grammar
Each state becomes a label in the program
Code for each state is a switch on next character, corresponding to edges out of current state
If no possible transition on next character, then
If state is accepting, return the corresponding token
If state is not accepting, report error

35
Hand-coded version

Each state is encoded as follows:
<<state1>>
   case Next_Character is
      when 'a' => goto state3;
      when 'b' => goto state1;
      when others => End_of_token_processing;
   end case;
<<state2>>
   ...
No explicit mention of state of automaton

36
Translating from FSM to code

A variable holds the current state:
loop
   case State is
      when state1 =>
         <<state1>>
         case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_token_processing;
         end case;
      when state2 => ...
   end case;
end loop;
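
The same loop-and-case shape, written as a runnable sketch. It assumes the Ada rule for an integer, a digit followed by any number of optionally underscore-prefixed digits; the state names and the function name are invented for the example.

```python
DIGITS = set("0123456789")


def scan_integer(text, pos=0):
    """Return the end index of the longest integer starting at text[pos], or None."""
    state = "start"
    last_accept = None
    while True:
        ch = text[pos] if pos < len(text) else ""  # "" marks end of input
        if state == "start":
            if ch in DIGITS:
                state = "digits"
            else:
                break
        elif state == "digits":
            if ch in DIGITS:
                state = "digits"
            elif ch == "_":
                state = "underline"
            else:
                break
        elif state == "underline":
            if ch in DIGITS:
                state = "digits"
            else:
                break
        pos += 1
        if state == "digits":  # the only accepting state
            last_accept = pos
    return last_accept
```

Remembering the last accepting position lets the scanner return the longest valid prefix when the input ends in a non-accepting state, as in `12_`.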

37
Automatic scanner construction

LEX builds a transition table, indexed by state and by character.
Code gets transition from table:
Tab : array (State, Character) of State := ...;
begin
   while More_Input loop
      Curstate := Tab (Curstate, Next_Char);
      if Curstate = Error_State then ...
   end loop;

38
Automatic FSM Generation

Our example, FLEX
See home page for manual in HTML
FLEX is given
A set of regular expressions
Actions associated with each RE
It builds a scanner
Which matches REs and executes actions

39
Flex General Format

Input to Flex is a set of rules:
Regexp   { actions (C statements) }
Regexp   { actions (C statements) }
Flex scans the longest matching Regexp
And executes the corresponding actions

40
An Example of a Flex scanner

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+              { printf ("an integer: %s (%d)\n", yytext, atoi (yytext)); }
{DIGIT}+"."{DIGIT}*   { printf ("a float: %s (%g)\n", yytext, atof (yytext)); }
if|then|begin|end|procedure|function   { printf ("a keyword: %s\n", yytext); }

41
Flex Example (continued)

{ID}              printf ("an identifier: %s\n", yytext);
"+"|"-"|"*"|"/"   printf ("an operator: %s\n", yytext);
"--".*\n          /* eat Ada style comment */
[ \t\n]+          /* eat white space */
.                 printf ("unrecognized character");

42
Assembling the flex program

%{
#include <math.h>   /* for atof */
%}
<<flex text we gave goes here>>
%%
int main (int argc, char **argv)
{
    yyin = fopen (argv[1], "r");   /* scan the file named on the command line */
    yylex ();
    return 0;
}

43
Running flex

flex is an executable program
The input is a lexical grammar as described
The output is C source for a scanner, which is compiled into a running program
For Ada fans
Look at aflex (www.adapower.com)
For C++ fans
flex can run in C++ mode
Generates appropriate classes

44
Choice Between Methods?

Hand written scanners
Typically much faster execution
Easy to write (standard structure)
Preferable for good error recovery
Flex approach
Simple to Use
Easy to modify token language

45
The GNAT Scanner

Hand written (scn.adb/scn.ads)
Each call does
Optimal scan past blanks/comments etc.
Processing based on first character
Call special routines for major classes
Namet.Get_Name for identifier (hashing)
Keywords recognized by special hash
Strings (scn-slit.adb)
complication with "+", "and", etc. (string or operator?)
Numeric literals (scn-nlit.adb)
complication with based literals: 16#FFF#

46
Historical oddities

Because early keypunch machines were unreliable, FORTRAN treats blanks as optional: lexical analysis and parsing are intertwined.
DO10I=1.6 is 3 tokens: identifier, operator, literal
DO10I = 1.6
DO10I=1,6 is 7 tokens: keyword, stmt, id, operator, literal, comma, literal
DO 10 I = 1 , 6
Celebrated NASA failure caused by this bug (?)
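
The look-ahead this forces can be illustrated with a small sketch. It is a simplification under stated assumptions: blanks are removed first, the statement is treated as a DO loop only if a comma follows the `=`, and parentheses, step values, and other statement forms are ignored. The function name and the regular expression are invented for the example.

```python
import re

# DO <label> <variable> = <first> , <last>, with all blanks already removed.
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")


def classify_statement(line):
    """Split a blank-insensitive statement into tokens, deciding DO loop vs assignment."""
    text = line.replace(" ", "").upper()
    match = DO_LOOP.fullmatch(text)
    if match:
        label, variable, first, last = match.groups()
        return ["DO", label, variable, "=", first, ",", last]
    name, _, value = text.partition("=")
    return [name, "=", value]
```

The scanner cannot decide what `DO10I` is until it has looked past the `=` for a comma, which is why tokenizing depends on statement-level structure here.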