Lexical Analysis

Provided by: robertberr
Learn more at: https://cs.nyu.edu

1
Lexical Analysis
2
The Input
  • Read string input
  • Might be sequence of characters (Unix)
  • Might be sequence of lines (VMS)
  • Character set
  • ASCII
  • ISO Latin-1
  • ISO 10646 (16-bit = Unicode): Ada, Java
  • Others (EBCDIC, JIS, etc)

3
The Output
  • A series of tokens: kind, location, name (if any)
  • Punctuation: ( ) ,
  • Operators: -
  • Keywords: begin end if while try catch
  • Identifiers: Square_Root
  • String literals: "press Enter to continue"
  • Character literals: 'x'
  • Numeric literals
  • Integer: 123
  • Floating_point: 4_5.23e2
  • Based representation: 16#ac#

4
Free form vs Fixed form
  • Free form languages (all modern ones)
  • White space does not matter. Ignore these
  • Tabs, spaces, new lines, carriage returns
  • Only the ordering of tokens is important
  • Fixed format languages (historical)
  • Layout is critical
  • Fortran: label in cols 1-5, continuation mark in col 6
  • COBOL: areas A and B
  • Lexical analyzer must know about layout to find
    tokens

5
Punctuation Separators
  • Typically individual special characters such as
    ( .. (two dots)
  • Sometimes double characters: lexical scanner looks for longest token
  • (*, /*, --: comment openers in various languages
  • Returned just as identity (kind) of token
  • And perhaps location for error messages and
    debugging purposes
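
The longest-token rule can be sketched as follows. This is a minimal Python illustration rather than anything taken from the slides; the symbol set `SYMBOLS` and the function name `next_symbol` are assumptions made up for the example.

```python
# Hypothetical set of punctuation and comment-opener tokens, for illustration only.
SYMBOLS = {"(", ")", ",", ".", "..", "-", "--", "/", "/*", "(*"}
MAX_LEN = max(len(s) for s in SYMBOLS)


def next_symbol(text, pos):
    """Return the longest symbol in SYMBOLS that starts at text[pos], or None."""
    for length in range(MAX_LEN, 0, -1):      # try the longest candidate first
        candidate = text[pos:pos + length]
        if candidate in SYMBOLS:
            return candidate
    return None
```

Trying the longest candidate first is what makes `..` come back as one token instead of two periods.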

6
Operators
  • Like punctuation
  • No real difference for lexical analyzer
  • Typically single or double special chars
  • Operators: - <
  • Operations: >
  • Returned as kind of token
  • And perhaps location

7
Keywords
  • Reserved identifiers
  • E.g. BEGIN END in Pascal, if in C, catch in C++
  • May be distinguished from identifiers
  • E.g. mode (bold keyword) vs mode (identifier) in Algol-68
  • Returned as kind of token
  • With possible location information
  • Oddity: unreserved keywords in PL/1
  • IF IF THEN THEN = THEN + 1;
  • Handled as identifiers (parser disambiguates)

8
Identifiers
  • Rules differ
  • Length, allowed characters, separators
  • Need to build a names table
  • Single entry for all occurrences of Var1
  • Language may be case insensitive: same entry for VAR1, vAr1, Var1
  • Typical structure: hash table
  • Lexical analyzer returns token kind
  • And key (index) to table entry
  • Table entry includes location information

9
Organization of names table
  • Most common structure is hash table
  • With fixed number of headers
  • Chain according to hash code
  • Serial search on one chain
  • Hash code computed from characters (e.g. sum mod
    table size).
  • No hash code is perfect! Expect collisions.
  • Avoid any arbitrary limits on table or chain size.
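
The organization described above can be sketched as a small chained hash table. This is an illustrative Python version under stated assumptions: the class name `NamesTable`, the default of 211 headers, and the `case_insensitive` flag are invented for the example and do not come from any particular compiler.

```python
class NamesTable:
    """Chained hash table: fixed number of headers, unbounded chains."""

    def __init__(self, num_headers=211, case_insensitive=False):
        self.headers = [[] for _ in range(num_headers)]  # one chain per header
        self.entries = []                                # key (index) -> stored name
        self.case_insensitive = case_insensitive

    def _hash(self, name):
        # The simple hash mentioned above: sum of character codes mod table size.
        return sum(ord(c) for c in name) % len(self.headers)

    def lookup(self, spelling):
        """Return the key for spelling, adding a new entry on first sight."""
        name = spelling.lower() if self.case_insensitive else spelling
        chain = self.headers[self._hash(name)]
        for key in chain:                                # serial search on one chain
            if self.entries[key] == name:
                return key
        self.entries.append(name)                        # no limit on table size
        chain.append(len(self.entries) - 1)
        return len(self.entries) - 1
```

With this hash, "ab" and "ba" land on the same chain, which is exactly the kind of collision the chain search has to resolve.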

10
String Literals
  • Text must be stored
  • Actual characters are important
  • Unlike identifiers: must preserve casing
  • Character set issues: uniform internal representation
  • Table needed
  • Lexical analyzer returns key into table
  • May or may not be worth hashing to avoid
    duplicates

11
Character Literals
  • Similar issues to string literals
  • Lexical Analyzer returns
  • Token kind
  • Identity of character
  • Cannot assume character set of host machine, may
    be different

12
Numeric Literals
  • Need a table to store numeric value
  • E.g. 123 = 0123 = 01_23 (Ada)
  • But cannot use predefined type for values
  • Because may have different bounds
  • Floating point representations much more complex
  • Denormals, correct rounding
  • Very delicate to compute correct value.
  • Host / target issues

13
Handling Comments
  • Comments have no effect on program
  • Can be eliminated by scanner
  • But may need to be retrieved by tools
  • Error detection issues
  • E.g. unclosed comments
  • Scanner skips over comments and returns next
    meaningful token
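
Skipping comments, including the unclosed-comment check, might look like the sketch below. The two comment styles handled (Ada `--` to end of line and C `/* ... */`) and the function name `skip_comment` are choices made for this example.

```python
def skip_comment(text, pos):
    """If a comment starts at pos, return the position just past it.

    Returns pos unchanged when no comment starts there.
    """
    if text.startswith("--", pos):             # Ada style: runs to end of line
        end = text.find("\n", pos)
        return len(text) if end == -1 else end + 1
    if text.startswith("/*", pos):             # C style: needs an explicit closer
        end = text.find("*/", pos + 2)
        if end == -1:
            raise SyntaxError(f"unclosed comment starting at offset {pos}")
        return end + 2
    return pos
```

Reporting the offset where the comment opened gives a more useful diagnostic than reporting end of file.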

14
Case Equivalence
  • Some languages are case-insensitive
  • Pascal, Ada
  • Some are not
  • C, Java
  • Lexical analyzer ignores case if needed
  • This_Routine = THIS_RouTine
  • Error analysis may need exact casing
  • Friendly diagnostics follow user's conventions

15
Performance Issues
  • Speed
  • Lexical analysis can become bottleneck
  • Minimize processing per character
  • Skip blanks fast
  • I/O is also an issue (read large blocks)
  • We compile frequently
  • Compilation time is important
  • Especially during development
  • Communicate with parser through global variables

16
General Approach
  • Define set of token kinds
  • An enumeration type (tok_int, tok_if, tok_plus,
    tok_left_paren, tok_assign etc).
  • Or a series of integer definitions in more
    primitive languages
  • Some tokens carry associated data
  • E.g. key for identifier table
  • May be useful to build tree node
  • For identifiers, literals etc
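
A token representation along these lines could be sketched as below. The kind names mirror the examples on the slide; `TOK_IDENT`, the `Token` class, and its field names are assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TokenKind(Enum):
    TOK_INT = auto()
    TOK_IF = auto()
    TOK_PLUS = auto()
    TOK_LEFT_PAREN = auto()
    TOK_ASSIGN = auto()
    TOK_IDENT = auto()          # hypothetical kind for identifiers


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    line: int                   # location, for error messages
    column: int
    key: Optional[int] = None   # index into a names or literal table, if any
```

Only tokens that carry data (identifiers, literals) need `key`; punctuation and keywords leave it as `None`.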

17
Interface to Lexical Analyzer
  • Either: convert entire file to a file of tokens
  • Lexical analyzer is separate phase
  • Or: parser calls lexical analyzer to supply next token
  • This approach avoids extra I/O
  • Parser builds tree incrementally, using
    successive tokens as tree nodes
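
The second interface, where the parser pulls one token at a time, maps naturally onto a generator. The toy below is only a sketch: `tokens` splits on white space as a stand-in for real scanning, and `parse_sum` is a made-up parser for inputs of the form `n + n + ...`.

```python
def tokens(text):
    """Yield one white-space-separated lexeme at a time (stand-in for a scanner)."""
    for lexeme in text.split():
        yield lexeme


def parse_sum(text):
    """Toy parser for 'n + n + ...' that asks the scanner for each next token."""
    scanner = tokens(text)
    total = int(next(scanner))
    for operator in scanner:                  # pull the next token on demand
        if operator != "+":
            raise SyntaxError(f"expected '+', got {operator!r}")
        total += int(next(scanner))
    return total
```

No intermediate token file is ever built; each token is produced only when the parser asks for it.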

18
Relevant Formalisms
  • Type 3 (Regular) Grammars
  • Regular Expressions
  • Finite State Machines
  • Equivalent in expressive power
  • Useful for program construction, even if
    hand-written

19
Regular Grammars
  • Regular grammars
  • Non-terminals (arbitrary names)
  • Terminals (characters)
  • Productions limited to the following
  • Non-terminal → terminal
  • Non-terminal → terminal Non-terminal
  • Treat character class (e.g. digit) as terminal
  • Regular grammars cannot count: cannot conveniently express size limits on identifiers, literals
  • Cannot express proper nesting (parentheses)

20
Regular Grammars
  • Grammar for real literals with no exponent
  • digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • REAL → digit REAL1
  • REAL1 → digit REAL1 (arbitrary size)
  • REAL1 → . INTEGER
  • INTEGER → digit INTEGER (arbitrary size)
  • INTEGER → digit
  • Start symbol is REAL

21
Regular Expressions
  • Regular expressions (RE) defined by an alphabet
    (terminal symbols) and three operations
  • Alternation: RE1 | RE2
  • Concatenation: RE1 RE2
  • Repetition: RE* (zero or more REs)
  • Language of REs = language of regular grammars
  • Regular expressions are more convenient for some
    applications

22
Specifying REs in Unix Tools
  • Single characters: a b c d \x
  • Alternation: [bcd] [b-z] ab|cd
  • Any character: . (period)
  • Match sequence of characters: x* y+
  • Concatenation: abc[d-q]
  • Optional RE: [0-9]+(\.[0-9]*)?
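
Patterns of this kind can be tried out with Python's `re` module, which accepts the same basic notation. The table below is an illustrative sketch; the sample strings and the helper name `check` are assumptions chosen for the example.

```python
import re

# (pattern, a string it matches in full, a string it rejects)
EXAMPLES = [
    (r"[bcd]",             "c",    "a"),     # alternation of single characters
    (r"[b-z]",             "q",    "a"),     # character range
    (r"ab|cd",             "cd",   "ad"),    # alternation of sequences
    (r".",                 "x",    "xy"),    # any single character
    (r"x*y+",              "xxy",  "xx"),    # zero or more x, one or more y
    (r"abc[d-q]",          "abcm", "abcz"),  # concatenation
    (r"[0-9]+(\.[0-9]*)?", "12.5", ".5"),    # optional fractional part
]


def check(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None
```

`re.fullmatch` is used so that a pattern has to account for the entire string, which is how a scanner treats a candidate token.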

23
Finite State Machines
  • A language defined by a grammar is a (possibly
    infinite) set of strings
  • An automaton is a computation that determines
    whether a given string belongs to a specified
    language
  • A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions)
  • Simplest automaton: memory is single number (state)

24
Specifying an FSM
  • A set of labeled states
  • Directed arcs between states labeled with
    character
  • One or more states may be terminal (accepting)
  • A distinguished state is start
  • Automaton makes transition from state S1 to S2
  • If and only if arc from S1 to S2 is labeled with
    next character in input
  • Token is legal if automaton stops on terminal
    state

25
Building FSM from Grammar
  • One state for each non-terminal
  • A rule of the form
  • Nt1 → terminal
  • Generates transition from S1 to final state
  • A rule of the form
  • Nt1 → terminal Nt2
  • Generates transition from S1 to S2 on an arc labeled by the terminal
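
Applied to the real-literal grammar from the earlier slide, the construction can be sketched as follows. The data layout (triples in `PRODUCTIONS`, a `FINAL` marker state) and the function names are assumptions for this example. The resulting machine happens to be non-deterministic, because INTEGER has two rules starting with digit, so the sketch tracks a set of current states.

```python
FINAL = "FINAL"

# (left side, terminal, right-side non-terminal or None)
PRODUCTIONS = [
    ("REAL",    "digit", "REAL1"),
    ("REAL1",   "digit", "REAL1"),
    ("REAL1",   ".",     "INTEGER"),
    ("INTEGER", "digit", "INTEGER"),
    ("INTEGER", "digit", None),
]


def build_fsm(productions):
    """Map (state, terminal) to the set of successor states."""
    arcs = {}
    for left, terminal, right in productions:
        target = FINAL if right is None else right
        arcs.setdefault((left, terminal), set()).add(target)
    return arcs


def classify(ch):
    """Treat the character class 'digit' as a single terminal."""
    return "digit" if ch in "0123456789" else ch


def accepts(arcs, start, text):
    """Follow every possible arc at once; accept if FINAL is reachable at the end."""
    current = {start}
    for ch in text:
        current = set().union(*(arcs.get((s, classify(ch)), set()) for s in current))
    return FINAL in current
```

For the input `12.5` the state sets are {REAL}, {REAL1}, {REAL1}, {INTEGER}, {INTEGER, FINAL}, so the literal is accepted.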

26
Graphic representation

digit
digit
S
Int
letter
letter
letter
underscore
digit
id
digit
27
Building FSMs from REs
  • Every RE corresponds to a grammar
  • For all regular expressions
  • A natural translation to FSM exists
  • Alternation often leads to non-deterministic
    machines

28
Non-Deterministic FSM
  • A non-deterministic FSM
  • Has at least one state
  • With two arcs to two distinct states
  • Labeled with the same character
  • Example: from start state, a digit can begin an integer literal or a real literal
  • Direct implementation requires backtracking
  • Nasty

29
Deterministic FSM
  • For all states S
  • For all characters C
  • There is at most one arc from any state S that is
    labeled with C
  • Much easier to implement
  • No backtracking

30
From NFSM to DFSM
  • There is an algorithm for converting a
    non-deterministic machine to a deterministic one
  • Result may have exponentially more states
  • Intuitively: need new states to express uncertainty about token: int or real
  • Algorithm is efficient in practice (e.g. grep)
  • Other algorithms for minimizing number of states
    of FSM, for showing equivalence, etc.
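
One well-known conversion is the subset construction, sketched below for a machine without empty transitions. The example NFSM (states S, I, R1, R2, R3) is made up to mirror the int-or-real ambiguity; the slides do not say which algorithm they have in mind, so treat this as one possible instance.

```python
from collections import deque

# Hypothetical NFSM: from S a digit may start an integer (I) or a real (R1).
NFA = {
    ("S",  "digit"): {"I", "R1"},
    ("I",  "digit"): {"I"},
    ("R1", "digit"): {"R1"},
    ("R1", "."):     {"R2"},
    ("R2", "digit"): {"R3"},
    ("R3", "digit"): {"R3"},
}
NFA_ACCEPTING = {"I", "R3"}


def subset_construction(nfa, start, accepting):
    """Return (dfa_arcs, dfa_start, dfa_accepting); DFSM states are frozensets."""
    symbols = {symbol for (_, symbol) in nfa}
    dfa_start = frozenset({start})
    dfa_arcs, dfa_accepting = {}, set()
    seen, work = {dfa_start}, deque([dfa_start])
    while work:
        subset = work.popleft()
        if subset & accepting:                # any accepting member makes it accepting
            dfa_accepting.add(subset)
        for symbol in symbols:
            target = frozenset(t for s in subset for t in nfa.get((s, symbol), ()))
            if not target:
                continue                      # no arc: the DFSM rejects here
            dfa_arcs[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    return dfa_arcs, dfa_start, dfa_accepting


def run(dfa_arcs, dfa_start, dfa_accepting, text):
    """Run the deterministic machine over text, one arc per character."""
    state = dfa_start
    for ch in text:
        symbol = "digit" if ch in "0123456789" else ch
        state = dfa_arcs.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accepting
```

The combined state {I, R1} is the "uncertain" state: after some digits the machine does not yet know whether it is reading an integer or a real.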

31
Implementing the Scanner
  • Three methods
  • Hand-coded approach
  • draw DFSM, then implement with loop and case
    statement
  • Hybrid approach
  • define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM
  • Hand-code resulting DFSM
  • Automated approach
  • Use regular grammar as input to lexical scanner
    generator (e.g. LEX)

32
Hand-coding
  • Normal coding techniques
  • Scan over white space and comments till non-blank
    character found.
  • Branch depending on first character
  • If digit, scan numeric literal
  • If letter, scan identifier or keyword
  • If operator, check next character (for two-character operators, etc.)
  • Need table to determine character type
    efficiently
  • Return token found
  • Write aggressively efficient code: gotos, global variables
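
The hand-coding recipe above can be sketched as a single loop that branches on the first character. This is an illustrative toy: the keyword set, the operator list, and the token kinds are assumptions, and Python's `str` methods stand in for the character-type table a production scanner would use.

```python
KEYWORDS = {"begin", "end", "if", "while"}            # hypothetical keyword set
TWO_CHAR_OPS = {"<=", ">=", ":=", "=>"}               # hypothetical operators


def scan(text):
    """Return a list of (kind, lexeme) pairs for a toy language."""
    tokens, pos, n = [], 0, len(text)
    while pos < n:
        ch = text[pos]
        if ch in " \t\r\n":                           # skip white space
            pos += 1
        elif ch.isdigit():                            # numeric literal
            start = pos
            while pos < n and text[pos].isdigit():
                pos += 1
            tokens.append(("int", text[start:pos]))
        elif ch.isalpha():                            # identifier or keyword
            start = pos
            while pos < n and (text[pos].isalnum() or text[pos] == "_"):
                pos += 1
            word = text[start:pos]
            tokens.append(("keyword" if word in KEYWORDS else "id", word))
        elif ch in "<>:=+-":                          # operator: check next character
            two = text[pos:pos + 2]
            lexeme = two if two in TWO_CHAR_OPS else ch
            tokens.append(("op", lexeme))
            pos += len(lexeme)
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {pos}")
    return tokens
```

Keywords are scanned as identifiers first and then looked up, which keeps the branch on the first character simple.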

33
Using grammar and FSM
  • Start with regular grammar or RE
  • Typically found in the language reference
  • example (Ada)
  • Chapter 2. Lexical Elements
  • digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • decimal-literal ::= integer [. integer] [exponent]
  • integer ::= digit {[underline] digit}
  • exponent ::= E [+] integer | E - integer

34
Using grammar and FSM
  • Create one state for each non-terminal
  • Label edges according to productions in grammar
  • Each state becomes a label in the program
  • Code for each state is a switch on next
    character, corresponding to edges out of current
    state
  • If no possible transition on next character,
    then
  • If state is accepting, return the corresponding
    token
  • If state is not accepting, report error

35
Hand-coded version
  • Each state is encoded as follows
  • <<state1>> case Next_Character is
        when 'a' => goto state3;
        when 'b' => goto state1;
        when others => End_of_token_processing;
    end case;
  • <<state2>> ...
  • No explicit mention of state of automaton

36
Translating from FSM to code
  • Variable holds current state
  • loop
        case State is
            when state1 =>
                <<state1>> case Next_Character is
                    when 'a' => State := state3;
                    when 'b' => State := state1;
                    when others => End_token_processing;
                end case;
            when state2 => ...
        end case;
  • end loop;

37
Automatic scanner construction
  • LEX builds a transition table, indexed by state
    and by character.
  • Code gets transition from table
  • Tab : array (State, Character) of State := ...;
  • begin
  • while More_Input loop
  • Curstate := Tab (Curstate, Next_Char);
  • if Curstate = Error_State then ...
    end loop;
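
A table-driven loop of this shape can be sketched in Python as below. The table here is written by hand for a two-token machine (integers and identifiers); the state names, `char_class`, and `longest_token` are assumptions for the example and say nothing about the tables LEX actually emits.

```python
ERROR = -1
START, IN_INT, IN_ID = 0, 1, 2


def char_class(ch):
    """Collapse characters into the few classes the table is indexed by."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    if ch == "_":
        return "underscore"
    return "other"


# Transition table indexed by (state, character class); missing entries mean ERROR.
TAB = {
    (START,  "digit"):      IN_INT,
    (START,  "letter"):     IN_ID,
    (IN_INT, "digit"):      IN_INT,
    (IN_ID,  "letter"):     IN_ID,
    (IN_ID,  "digit"):      IN_ID,
    (IN_ID,  "underscore"): IN_ID,
}
ACCEPTING = {IN_INT: "int", IN_ID: "id"}


def longest_token(text, pos=0):
    """Return (kind, lexeme) for the longest token starting at pos, or None."""
    state, end = START, pos
    while end < len(text):
        nxt = TAB.get((state, char_class(text[end])), ERROR)
        if nxt == ERROR:
            break                              # no arc on this character: stop here
        state, end = nxt, end + 1
    kind = ACCEPTING.get(state)
    return (kind, text[pos:end]) if kind else None
```

The loop body is the same for every token language; only the contents of `TAB` and `ACCEPTING` change.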

38
Automatic FSM Generation
  • Our example, FLEX
  • See home page for manual in HTML
  • FLEX is given
  • A set of regular expressions
  • Actions associated with each RE
  • It builds a scanner
  • Which matches REs and executes actions

39
Flex General Format
  • Input to Flex is a set of rules
  • Regexp { actions (C statements) }
  • Regexp { actions (C statements) }
  • Flex scans the longest matching Regexp
  • And executes the corresponding actions

40
An Example of a Flex scanner
  • DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
    %%
    {DIGIT}+    { printf ("an integer: %s (%d)\n", yytext, atoi (yytext)); }
    {DIGIT}+"."{DIGIT}*    { printf ("a float: %s (%g)\n", yytext, atof (yytext)); }
    if|then|begin|end|procedure|function    { printf ("a keyword: %s\n", yytext); }

41
Flex Example (continued)
  • {ID}    printf ("an identifier: %s\n", yytext);
  • "+"|"-"|"*"|"/"    printf ("an operator: %s\n", yytext);
  • "--".*\n    /* eat Ada style comment */
  • [ \t\n]+    /* eat white space */
  • .    printf ("unrecognized character");

42
Assembling the flex program
  • %{
    #include <math.h>
    #include <stdlib.h>    /* for atoi and atof */
    %}
  • <<flex text we gave goes here>>
  • %%
  • main (argc, argv) int argc;
  • char **argv;
  • { yyin = fopen (argv[1], "r");
  • yylex (); }

43
Running flex
  • flex is an executable program
  • The input is lexical grammar as described
  • The output is a C program (lex.yy.c) that is then compiled
  • For Ada fans
  • Look at aflex (www.adapower.com)
  • For C++ fans
  • flex can run in C++ mode
  • Generates appropriate classes

44
Choice Between Methods?
  • Hand written scanners
  • Typically much faster execution
  • Easy to write (standard structure)
  • Preferable for good error recovery
  • Flex approach
  • Simple to use
  • Easy to modify token language

45
The GNAT Scanner
  • Hand written (scn.adb/scn.ads)
  • Each call does
  • Optimal scan past blanks/comments etc.
  • Processing based on first character
  • Call special routines for major classes
  • Namet.Get_Name for identifier (hashing)
  • Keywords recognized by special hash
  • Strings (scn-slit.adb)
  • complication with "+", "and", etc. (string or operator?)
  • Numeric literals (scn-nlit.adb)
  • complication with based literals 16#FFF#

46
Historical oddities
  • Because early keypunch machines were unreliable, FORTRAN treats blanks as optional: lexical analysis and parsing are intertwined.
  • DO10I=1.6: 3 tokens
  • identifier operator literal
  • DO10I = 1.6
  • DO10I=1,6: 7 tokens
  • Keyword stmt id operator literal comma literal
  • DO 10 I = 1 , 6
  • Celebrated NASA failure caused by this bug (?)