Lexical and Syntax Analysis: Chapter 4

Slides: 86
Provided by: BarbaraH173
1
Lexical and Syntax Analysis: Chapter 4
2
  • Compilation
  • Translating from high-level language to machine
    code is organized into several phases or passes.
  • In the early days passes communicated through
    files, but this is no longer necessary.

3
  • Language Specification
  • We must first describe the language in question
    by giving its specification.
  • Syntax
  • Defines symbols (vocabulary)
  • Defines programs (sentences)
  • Semantics
  • Gives meaning to sentences.
  • The formal specifications are often the input to
    tools that build translators automatically.

4
  • Compiler passes

[Figure: compiler passes. Data passed between phases, in order: string of characters, string of tokens, abstract syntax tree, medium-level intermediate code, low-level intermediate code, executable/object code.]
5
  • Compiler passes

source program
front end
Lexical scanner
Parser
semantic analyzer
symbol table manager
error handler
Translator
Optimizer
back end
Final assembly
target program
6
  • Lexical analyzer
  • Also called a scanner or tokenizer
  • Converts stream of characters into a stream of
    tokens
  • Tokens are
  • Keywords such as for, while, and class.
  • Special characters such as +, -, (, and <
  • Variable name occurrences
  • Constant occurrences such as 1, 0, true.

7
Comparison with Lexical Analysis
Phase    Input                    Output
Lexer    Sequence of characters   Sequence of tokens
Parser   Sequence of tokens       Parse tree
8
  • Lexical analyzer
  • The lexical analyzer is usually a subroutine of
    the parser.
  • Each token is a single entity. A numerical code
    is usually assigned to each type of token.

9
  • Lexical analyzer
  • Lexical analyzers perform
  • Line reconstruction
  • delete comments
  • delete white spaces
  • perform text substitution
  • Lexical translation: translation of lexemes ->
    tokens
  • Often additional information is affiliated with a
    token.

10
  • Parser
  • Performs syntax analysis
  • Imposes syntactic structure on a sentence.
  • Parse trees are used to expose the structure.
  • These trees are often not explicitly built
  • Simpler representations of them are often used
  • A parser accepts a string of tokens and builds a
    parse tree representing the program

11
  • Parser
  • The collection of all the programs in a given
    language is usually specified using a list of
    rules known as a context free grammar.

12
Parser
  • A grammar has four components
  • A set of tokens known as terminal symbols
  • A set of variables or non-terminals
  • A set of productions where each production
    consists of a non-terminal, an arrow, and a
    sequence of tokens and/or non-terminals
  • A designation of one of the nonterminals as the
    start symbol.
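
The four components can be written down directly as data. The sketch below is only an illustration: the tiny expression grammar, the dictionary layout, and the helper name is_well_formed are assumptions, not part of the slides.

```python
# A hypothetical grammar for sums of identifiers and numbers.
GRAMMAR = {
    "terminals": {"id", "num", "+"},          # tokens
    "nonterminals": {"expr", "term"},         # variables
    "productions": [                          # (non-terminal, sequence of symbols)
        ("expr", ["expr", "+", "term"]),
        ("expr", ["term"]),
        ("term", ["id"]),
        ("term", ["num"]),
    ],
    "start": "expr",                          # designated start symbol
}

def is_well_formed(grammar):
    """Check that the four components are consistent with each other."""
    terminals = grammar["terminals"]
    nonterminals = grammar["nonterminals"]
    symbols = terminals | nonterminals
    disjoint = not (terminals & nonterminals)
    heads_ok = all(head in nonterminals for head, _ in grammar["productions"])
    bodies_ok = all(sym in symbols
                    for _, body in grammar["productions"] for sym in body)
    return disjoint and heads_ok and bodies_ok and grammar["start"] in nonterminals
```

A grammar whose start symbol is a terminal, for example, fails the check.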

13
Symbol Table Management
  • The symbol table is a data structure used by all
    phases of the compiler to keep track of user
    defined symbols and keywords.
  • During early phases (lexical and syntax analysis)
    symbols are discovered and put into the symbol
    table
  • During later phases symbols are looked up to
    validate their usage.

14
Symbol Table Management
  • Typical symbol table activities
  • add a new name
  • add information for a name
  • access information for a name
  • determine if a name is present in the table
  • remove a name
  • revert to a previous usage for a name (close a
    scope).
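
One common way to support these activities is a stack of dictionaries, one per open scope. The sketch below assumes that design; the class and method names are hypothetical, and "remove a name" is left out for brevity.

```python
class SymbolTable:
    """Stack of dictionaries: one dictionary per open scope."""

    def __init__(self):
        self.scopes = [{}]            # the outermost (global) scope

    def open_scope(self):
        self.scopes.append({})

    def close_scope(self):
        # Dropping the innermost scope reverts names to their previous usage.
        self.scopes.pop()

    def add(self, name, **info):
        # Add a new name (or new information for it) in the current scope.
        self.scopes[-1].setdefault(name, {}).update(info)

    def lookup(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

    def contains(self, name):
        return self.lookup(name) is not None
```

Closing a scope makes an outer declaration of the same name visible again, which is the "revert to a previous usage" activity in the list.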

15
Symbol Table Management
  • Many possible Implementations
  • linear list
  • sorted list
  • hash table
  • tree structure

16
Symbol Table Management
  • Typical information fields
  • print value
  • kind (e.g. reserved, typeid, varid, funcid, etc.)
  • block number/level number
  • type
  • initial value
  • base address
  • etc.

17
  • Abstract Syntax Tree
  • The parse tree is used to recognize the
    components of the program and to check that the
    syntax is correct.
  • As the parser applies productions, it usually
    generates the component of a simpler tree (known
    as Abstract Syntax Tree).
  • The meaning of the component is derived out of
    the way the statement is organized in a subtree.

18
  • Semantic Analyzer
  • The semantic analyzer completes the symbol table
    with information on the characteristics of each
    identifier.
  • The symbol table is usually initialized during
    parsing.
  • One entry is created for each identifier and
    constant.
  • Scope is taken into account. Two different
    variables with the same name will have different
    entries in the symbol table.
  • The semantic analyzer completes the table using
    information from declarations.

19
  • Semantic Analyzer
  • The semantic analyzer does
  • Type checking
  • Flow of control checks
  • Uniqueness checks (identifiers, case labels,
    etc.)
  • One objective is to identify semantic errors
    statically. For example
  • Undeclared identifiers
  • Unreachable statements
  • Identifiers used in the wrong context.
  • Methods called with the wrong number of
    parameters or with parameters of the wrong type.

20
  • Semantic Analyzer
  • Some semantic errors have to be detected at run
    time. The reason is that the information may not
    be available at compile time.
  • Array subscript is out of bounds.
  • Variables are not initialized.
  • Divide by zero.

21
Error Management
  • Errors can occur at all phases in the compiler
  • Invalid input characters, syntax errors, semantic
    errors, etc.
  • Good compilers will attempt to recover from
    errors and continue.

22
  • Translator
  • The lexical scanner, parser, and semantic
    analyzer are collectively known as the front end
    of the compiler.
  • The second part, or back end starts by generating
    low level code from the (possibly optimized) AST.

23
Translator
  • Rather than generate code for a specific
    architecture, most compilers generate
    intermediate language
  • Three address code is popular.
  • Really a flattened tree representation.
  • Simple.
  • Flexible (captures the essence of many target
    architectures).
  • Can be interpreted.
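
To make the "flattened tree" remark concrete, here is a minimal sketch that walks an expression tree and emits one three-address instruction per operator node. The tuple encoding of the tree, the temporary names t1, t2, ... and the function name are assumptions made for the example.

```python
import itertools

def to_three_address(tree):
    """Flatten a nested (op, left, right) tuple into three-address instructions."""
    code = []
    temps = itertools.count(1)

    def emit(node):
        if not isinstance(node, tuple):
            return str(node)              # leaf: a variable name or a constant
        op, left, right = node
        left_name = emit(left)
        right_name = emit(right)
        target = f"t{next(temps)}"        # fresh temporary for this node
        code.append(f"{target} = {left_name} {op} {right_name}")
        return target

    result = emit(tree)
    return code, result
```

For the tree of a + b * 2, the inner multiplication is emitted first and the addition then refers to its temporary.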

24
Translator
  • One way of performing intermediate code
    generation
  • Attach meaning to each node of the AST.
  • The meaning of the sentence = the meaning
    attached to the root of the tree.

25
  • XIL
  • An example of Medium level intermediate language
    is XIL. XIL is used by IBM to compile FORTRAN, C,
    C++, and Pascal for RS/6000.
  • Compilers for Fortran 90 and C have been
    developed using XIL for other machines such as
    Intel 386, Sparc, and S/370.

26
Optimizers
  • Intermediate code is examined and improved.
  • Can be simple
  • changing a = a + 1 to increment a
  • changing 3 * 5 to 15
  • Can be complicated
  • reorganizing data and data accesses for cache
    efficiency
  • Optimization can improve running time by orders
    of magnitude, often also decreasing program size.
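
A simple optimization of the kind listed above is constant folding, where an operator applied to two constants is replaced by its value at compile time (for example 3 * 5 becomes 15). The sketch below assumes expressions are nested (op, left, right) tuples; the function name fold and the operator set are illustrative choices.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Replace operator nodes whose operands are both integer constants."""
    if not isinstance(node, tuple):
        return node                       # leaf: constant or variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)       # compute the result at compile time
    return (op, left, right)
```

Subtrees that mention a variable are left in place, so only the constant parts of an expression shrink.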

27
Code Generation
  • Generation of real executable code for a
    particular target machine.
  • It is completed by the Final Assembly phase
  • Final output can either be
  • assembly language for the target machine
  • object code ready for linking
  • The target machine can be a virtual machine
    (such as the Java Virtual Machine, JVM), and the
    real executable code is virtual code (such as
    Java Bytecode).

28
Compiler Overview
Source Program
  IF (a<b) THEN c=1*d
Lexical Analyzer
Token Sequence
  IF  (  ID a  <  ID b  THEN  ID c  =  CONST 1  *  ID d
Syntax Analyzer
Syntax Tree
  IF_stmt
    cond_expr: a < b
    list
      assign_stmt
        lhs: c
        rhs: 1 * d
Semantic Analyzer
3-Address Code
  GE a, b, L1
  MULT 1, d, c
  L1:
Code Optimizer
Optimized 3-Addr. Code
  GE a, b, L1
  MOV d, c
  L1:
Code Generation
Assembly Code
  loadi R1,a
  cmpi R1,b
  jge L1
  loadi R1,d
  storei R1,c
  L1:
29
Lexical Analysis
30
What is Lexical Analysis?
  • The lexical analyzer deals with small-scale
    language constructs, such as names and numeric
    literals. The syntax analyzer deals with the
    large-scale constructs, such as expressions,
    statements, and program units.
  • - The syntax analysis portion consists of two
    parts
  • 1. A low-level part called a lexical analyzer
    (essentially a pattern matcher).
  • 2. A high-level part called a syntax analyzer,
    or parser.
  • The lexical analyzer collects characters into
    logical groupings and assigns internal codes to
    the groupings according to their structure.

31
Lexical Analyzer in Perspective
32
Lexical Analyzer in Perspective
  • LEXICAL ANALYZER
  • Scan Input
  • Remove white space,
  • Identify Tokens
  • Create Symbol Table
  • Insert Tokens into AST
  • Generate Errors
  • Send Tokens to Parser
  • PARSER
  • Perform Syntax Analysis
  • Actions Dictated by Token Order
  • Update Symbol Table Entries
  • Create Abstract Rep. of Source
  • Generate Errors

33
Lexical analyzers extract lexemes from a given
input string and produce the corresponding tokens.
  • sum = oldsum - value / 100;
  • Token Lexeme
  • IDENT sum
  • ASSIGN_OP =
  • IDENT oldsum
  • SUBTRACT_OP -
  • IDENT value
  • DIVISION_OP /
  • INT_LIT 100
  • SEMICOLON ;
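
A table like this can be produced by a small regular-expression tokenizer. The sketch below assumes the input statement is sum = oldsum - value / 100; and reuses the token names from the table. The regular expressions and the function name tokenize are assumptions for illustration.

```python
import re

# Token names follow the table above; the patterns are assumptions.
TOKEN_SPEC = [
    ("INT_LIT", r"\d+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN_OP", r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),                     # white space is deleted, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (token, lexeme) pairs; raise on a character no pattern matches."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling tokenize("sum = oldsum - value / 100;") yields the eight (token, lexeme) pairs listed in the table, in the same order.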

34
Basic Terminology
  • What are Major Terms for Lexical Analysis?
  • TOKEN
  • A classification for a common set of strings
  • Examples include <Identifier>, <number>, etc.
  • PATTERN
  • The rules which characterize the set of strings
    for a token
  • LEXEME
  • Actual sequence of characters that matches
    pattern and is classified by a token
  • Identifiers: x, count, name, etc.

35
Basic Terminology
36
Token Definitions
Suppose S is the string banana
  Prefix: ban, banana
  Suffix: ana, banana
  Substring: nan, ban, ana, banana
  Subsequence: bnan, nn
37
Token Definitions
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand Notation
  "+" : one or more;  r* = r+ | ε  and  r+ = r r*
  "?" : zero or one;  r? = r | ε
  [range] : set range of characters (replaces "|");
            [A-Z] = A | B | C | … | Z
id → [A-Za-z][A-Za-z0-9]*
38
Token Recognition
Assume Following Tokens: if, then, else, relop, id, num
What language construct are they used for?
Given Tokens, What are Patterns?
Grammar:
  stmt → if expr then stmt | if expr then stmt else stmt | ε
  expr → term relop term | term
  term → id | num
Patterns:
  if → if
  then → then
  else → else
  relop → < | <= | > | >= | = | <>
  id → letter ( letter | digit )*
  num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
What does this represent?
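
The id and num patterns translate almost symbol for symbol into Python regular expressions. The sketch below assumes num means one or more digits, an optional fraction, and an optional exponent written with E; the helper names is_num and is_id are hypothetical.

```python
import re

# num: digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")
# id: letter ( letter | digit )*
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_num(text):
    """True when the whole string is a lexeme of the num pattern."""
    return NUM.fullmatch(text) is not None

def is_id(text):
    """True when the whole string is a lexeme of the id pattern."""
    return IDENT.fullmatch(text) is not None
```

Under this reading, num describes unsigned numeric literals such as 42, 3.14, and 6.02E+23, while strings like "1." or ".5" are rejected because each digit group must be non-empty.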
39
What Else Does Lexical Analyzer Do?
Scan away b, nl, tabs. Can we Define Tokens For
These?
  blank → b
  tab → T
  newline → M
  delim → blank | tab | newline
  ws → delim+
40
Symbol Tables
Note: Each token has a unique token identifier
that defines its category of lexemes.
41
Building a Lexical Analyzer
  • There are three approaches to building a lexical
    analyzer
  • 1. Write a formal description of the token
    patterns of the language using a descriptive
    language. A tool on UNIX systems called lex
    builds the analyzer from such a description.
  • 2. Design a state transition diagram that
    describes the token patterns of the language and
    write a program that implements the diagram.
  • 3. Design a state transition diagram and
    hand-construct a table-driven implementation of
    the state diagram.
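
The third approach can be sketched as a transition table indexed by state and character class. The states, class names, and reserved-word set below are assumptions chosen to recognize names, reserved words, and integer literals; they are not taken from the slides' diagram.

```python
def char_class(ch):
    # str.isalpha / str.isdigit also accept non-ASCII letters and digits.
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# TRANSITIONS[state][character class] -> next state; a missing entry means "stop".
TRANSITIONS = {
    "start": {"letter": "in_name", "digit": "in_int"},
    "in_name": {"letter": "in_name", "digit": "in_name"},
    "in_int": {"digit": "in_int"},
}
ACCEPTING = {"in_name": "IDENT", "in_int": "INT_LIT"}
RESERVED = {"for", "while", "class"}

def next_token(text, pos=0):
    """Scan the longest lexeme starting at pos; return (token, lexeme, new_pos)."""
    state = "start"
    end = pos
    while end < len(text):
        following = TRANSITIONS[state].get(char_class(text[end]))
        if following is None:
            break
        state = following
        end += 1
    if state not in ACCEPTING:
        raise SyntaxError(f"no token starts at offset {pos}")
    lexeme = text[pos:end]
    token = ACCEPTING[state]
    if token == "IDENT" and lexeme in RESERVED:
        token = lexeme.upper()            # reserved words get their own token
    return token, lexeme, end
```

Reserved words are scanned as ordinary names and then looked up in a set, which keeps the table small compared with adding one path per keyword.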

42
Diagrams for Tokens
  • Transition Diagrams (TD) are used to represent
    the tokens
  • Each Transition Diagram has
  • States: represented by circles
  • Actions: represented by arrows between states
  • Start state: beginning of a pattern
    (arrowhead)
  • Final state(s): end of pattern (concentric
    circles)
  • Deterministic: no need to choose between 2
    different actions

43
Example Transition Diagrams
44
State diagram to recognize names, reserved words,
and integer literals
45
Reasons to use BNF to Describe Syntax
  • Provides a clear syntax description
  • The parser can be based directly on the BNF
  • Parsers based on BNF are easy to maintain

46
Reasons to Separate Lexical and Syntax Analysis
  • Simplicity - less complex approaches can be used
    for lexical analysis; separating them simplifies
    the parser
  • Efficiency - separation allows optimization of
    the lexical analyzer
  • Portability - parts of the lexical analyzer may
    not be portable, but the parser always is portable

47
Summary of Lexical Analysis
  • A lexical analyzer is a pattern matcher for
    character strings
  • A lexical analyzer is a front-end for the
    parser
  • Identifies substrings of the source program that
    belong together - lexemes
  • Lexemes match a character pattern, which is
    associated with a lexical category called a token
  • sum is a lexeme; its token may be IDENT

48
Semantic Analysis: Intro to Type Checking
49
The Compiler So Far
  • Lexical analysis
  • Detects inputs with illegal tokens
  • Parsing
  • Detects inputs with ill-formed parse trees
  • Semantic analysis
  • The last front end phase
  • Catches more errors

50
What's Wrong?
  • Example 1
  • int in x;
  • Example 2
  • int i = 12.34;

51
Why a Separate Semantic Analysis?
  • Parsing cannot catch some errors
  • Some language constructs are not context-free
  • Example All used variables must have been
    declared (i.e. scoping)
  • Example A method must be invoked with arguments
    of proper type (i.e. typing)

52
What Does Semantic Analysis Do?
  • Checks of many kinds
  • All identifiers are declared
  • Types
  • Inheritance relationships
  • Classes defined only once
  • Methods in a class defined only once
  • Reserved identifiers are not misused
  • And others . . .
  • The requirements depend on the language

53
Scope
  • Matching identifier declarations with uses
  • Important semantic analysis step in most
    languages

54
Scope (Cont.)
  • The scope of an identifier is the portion of a
    program in which that identifier is accessible
  • The same identifier may refer to different things
    in different parts of the program
  • Different scopes for same name don't overlap
  • An identifier may have restricted scope

55
Static vs. Dynamic Scope
  • Most languages have static scope
  • Scope depends only on the program text, not
    run-time behavior
  • C has static scope
  • A few languages are dynamically scoped
  • Lisp, SNOBOL
  • Current Lisp has changed to mostly static scoping
  • Scope depends on execution of the program

56
Class Definitions
  • Class names can be used before being defined
  • We can't check this property
  • using a symbol table
  • or even in one pass
  • Solution
  • Pass 1: Gather all class names
  • Pass 2: Do the checking
  • Semantic analysis requires multiple passes
  • Probably more than two
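
The two-pass solution can be sketched as follows. The input format, a list of (class name, parent name) pairs in source order, and the function name are assumptions; a real checker would walk the AST instead.

```python
def undefined_classes(classes):
    """classes: list of (class_name, parent_name_or_None) in source order."""
    # Pass 1: gather all class names.
    defined = {name for name, _ in classes}
    # Pass 2: check every reference against the complete set of names.
    return [parent for _, parent in classes
            if parent is not None and parent not in defined]
```

Because the set of names is complete before any check runs, a class that is used before its definition is not reported as an error.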

57
Types
  • What is a type?
  • The notion varies from language to language
  • Consensus
  • A set of values
  • A set of operations on those values
  • Classes are one instantiation of the modern
    notion of type

58
Why Do We Need Type Systems?
  • Consider the assembly language fragment
  • addi r1, r2, r3
  • What are the types of r1, r2, r3?

59
Types and Operations
  • Certain operations are legal for values of each
    type
  • It doesn't make sense to add a function pointer
    and an integer in C
  • It does make sense to add two integers
  • But both have the same assembly language
    implementation!

60
Type Systems
  • A language's type system specifies which
    operations are valid for which types
  • The goal of type checking is to ensure that
    operations are used with the correct types
  • Enforces intended interpretation of values,
    because nothing else will!
  • Type systems provide a concise formalization of
    the semantic checking rules

61
What Can Types do For Us?
  • Can detect certain kinds of errors
  • Memory errors
  • Reading from an invalid pointer, etc.
  • Violation of abstraction boundaries
  • class FileSystem {
        open(x : String) : File { … }
    }
    class Client {
        f(fs : FileSystem) {
            File fdesc <- fs.open("foo")
        }  -- f cannot see inside fdesc !
    }
62
Type Checking Overview
  • Three kinds of languages
  • Statically typed: All or almost all checking of
    types is done as part of compilation (C and Java)
  • Dynamically typed: Almost all checking of types
    is done as part of program execution (Scheme)
  • Untyped: No type checking (machine code)

63
The Type Wars
  • Competing views on static vs. dynamic typing
  • Static typing proponents say
  • Static checking catches many programming errors
    at compile time
  • Avoids overhead of runtime type checks
  • Dynamic typing proponents say
  • Static type systems are restrictive
  • Rapid prototyping easier in a dynamic type system

64
The Type Wars (Cont.)
  • In practice, most code is written in statically
    typed languages with an escape mechanism
  • Unsafe casts in C, Java
  • It's debatable whether this compromise represents
    the best or worst of both worlds

65
Type Checking and Type Inference
  • Type Checking is the process of verifying fully
    typed programs
  • Type Inference is the process of filling in
    missing type information
  • The two are different, but are often used
    interchangeably

66
Rules of Inference
  • We have seen two examples of formal notation
    specifying parts of a compiler
  • Regular expressions (for the lexer)
  • Context-free grammars (for the parser)
  • The appropriate formalism for type checking is
    logical rules of inference

67
Why Rules of Inference?
  • Inference rules have the form
  • If Hypothesis is true, then Conclusion is true
  • Type checking computes via reasoning
  • If E1 and E2 have certain types, then E3 has a
    certain type
  • Rules of inference are a compact notation for
    If-Then statements

68
From English to an Inference Rule
  • The notation is easy to read (with practice)
  • Start with a simplified system and gradually add
    features
  • Building blocks
  • Symbol ∧ is "and"
  • Symbol ⇒ is "if-then"
  • x : T is "x has type T"

69
From English to an Inference Rule (2)
  • If e1 has type Int and e2 has type Int,
    then e1 + e2 has type Int
  • (e1 has type Int ∧ e2 has type Int) ⇒
    e1 + e2 has type Int
  • (e1 : Int ∧ e2 : Int) ⇒ e1 + e2 : Int

70
From English to an Inference Rule (3)
  • The statement
  • (e1 : Int ∧ e2 : Int) ⇒ e1 + e2 : Int
  • is a special case of
  • ( Hypothesis1 ∧ . . . ∧ Hypothesisn ) ⇒
    Conclusion
  • This is an inference rule

71
Notation for Inference Rules
  • By tradition inference rules are written

        ⊢ Hypothesis1 … ⊢ Hypothesisn
        ------------------------------
        ⊢ Conclusion

  • Type rules can also have hypotheses and
    conclusions of the form
  • ⊢ e : T
  • ⊢ means "it is provable that . . ."
72
Two Rules
        i is an integer
        --------------- [Int]
        ⊢ i : Int

        ⊢ e1 : Int   ⊢ e2 : Int
        ----------------------- [Add]
        ⊢ e1 + e2 : Int
73
Two Rules (Cont.)
  • These rules give templates describing how to type
    integers and expressions
  • By filling in the templates, we can produce
    complete typings for expressions

74
Example: 1 + 2

        1 is an integer    2 is an integer
        ---------------    ---------------
        ⊢ 1 : Int          ⊢ 2 : Int
        ----------------------------------
        ⊢ 1 + 2 : Int
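
Filling in the two templates mechanically amounts to a small recursive function with one case per rule. The sketch below assumes integers are Python ints and an addition is the tuple ("+", e1, e2); the function name type_of is hypothetical.

```python
def type_of(expr):
    """Return "Int" if the [Int] and [Add] rules derive it, else raise TypeError."""
    # [Int]: i is an integer  =>  i : Int
    # (bool is excluded because Python's bool is a subclass of int)
    if isinstance(expr, int) and not isinstance(expr, bool):
        return "Int"
    # [Add]: e1 : Int and e2 : Int  =>  e1 + e2 : Int
    if isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "+":
        _, e1, e2 = expr
        if type_of(e1) == "Int" and type_of(e2) == "Int":
            return "Int"
    raise TypeError(f"no typing rule applies to {expr!r}")
```

The chain of recursive calls for ("+", 1, 2) has the same shape as the derivation for 1 + 2: two uses of [Int] feeding one use of [Add].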
75
Soundness
  • A type system is sound if
  • Whenever ⊢ e : T
  • Then e evaluates to a value of type T
  • We only want sound rules
  • But some sound rules are better than others

        i is an integer
        ---------------
        ⊢ i : Object
76
Type Checking Proofs
  • Type checking proves facts ⊢ e : T
  • Proof is on the structure of the AST
  • Proof has the shape of the AST
  • One type rule is used for each kind of AST node
  • In the type rule used for a node e
  • Hypotheses are the proofs of types of e's
    sub-expressions
  • Conclusion is the proof of type of e
  • Types are computed in a bottom-up pass over the
    AST

77
Rules for Constants

        ---------------- [Bool]
        ⊢ false : Bool

        s is a string constant
        ---------------------- [String]
        ⊢ s : String
78
Two More Rules
        ⊢ e : Bool
        -------------- [Not]
        ⊢ not e : Bool

        ⊢ e1 : Bool   ⊢ e2 : T
        --------------------------------- [Loop]
        ⊢ while e1 loop e2 pool : Object
79
A Problem
  • What is the type of a variable reference?
  • The local, structural rule does not carry enough
    information to give x a type.

        x is an identifier
        ------------------ [Var]
        ⊢ x : ?
80
Notes
  • The type environment gives types to the free
    identifiers in the current scope
  • The type environment is passed down the AST from
    the root towards the leaves
  • Types are computed up the AST from the leaves
    towards the root
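
The two directions of flow can be seen in a checker that takes the environment as a parameter and returns the type. This is a sketch under assumptions: the environment is a plain dictionary from identifier to type name, identifiers are Python strings, and only +, not, and constants are covered.

```python
def check(expr, env):
    """Type an expression; env maps free identifiers to their types."""
    if isinstance(expr, bool):
        return "Bool"                     # constants true / false
    if isinstance(expr, int):
        return "Int"
    if isinstance(expr, str):
        if expr not in env:               # a variable needs the environment
            raise TypeError(f"undeclared identifier {expr}")
        return env[expr]
    op, *args = expr
    # The environment is passed down to the sub-expressions...
    arg_types = [check(arg, env) for arg in args]
    # ...and the types come back up towards the root.
    if op == "+" and arg_types == ["Int", "Int"]:
        return "Int"
    if op == "not" and arg_types == ["Bool"]:
        return "Bool"
    raise TypeError(f"ill-typed expression {expr!r}")
```

With this shape, the same expression x + 1 is accepted when the environment gives x the type Int and rejected when it gives x the type Bool.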

81
Expressiveness of Static Type Systems
  • A static type system enables a compiler to detect
    many common programming errors
  • The cost is that some correct programs are
    disallowed
  • Some argue for dynamic type checking instead
  • Others argue for more expressive static type
    checking
  • But more expressive type systems are also more
    complex

82
Dynamic And Static Types
  • The dynamic type of an object is the class C that
    is used in the new C expression that creates
    the object
  • A run-time notion
  • Even languages that are not statically typed have
    the notion of dynamic type
  • The static type of an expression is a notation
    that captures all possible dynamic types the
    expression could take
  • A compile-time notion

83
Dynamic And Static Types
  • The typing rules use very concise notation
  • They are very carefully constructed
  • Virtually any change in a rule either
  • Makes the type system unsound
  • (bad programs are accepted as well typed)
  • Or, makes the type system less usable
  • (perfectly good programs are rejected)
  • But some good programs will be rejected anyway
  • The notion of a good program is undecidable

84
Type Systems
  • Type rules are defined on the structure of
    expressions
  • Types of variables are modeled by an environment
  • Types are a play between flexibility and safety

85
End of Lecture 6