Title: CSC 415: Translators and Compilers
1CSC 415 Translators and Compilers
2Course Outline
- Translators and Compilers
- Language Processors
- Compilation
- Syntactic Analysis
- Contextual Analysis
- Run-Time Organization
- Code Generation
- Interpretation
- Major Programming Project
- Project Definition and Planning
- Implementation
- Weekly Status Reports
- Project Presentation
3Project
- Implement a Compiler for the Programming Language Triangle
- Appendix B Informal Specification of the Programming Language Triangle
- Appendix D Class Diagrams for the Triangle Compiler
- Present Project Plan
- What and How
- Weekly Status Reports
- Work accomplished during the reporting period
- Deliverable progress, as a percentage of completion
- Problem areas
- Planned activities for the next reporting period
4Chapter 1 Introduction to Programming Languages
- Programming Language: A formal notation for expressing algorithms.
- Programming Language Processors: Tools to enter, edit, translate, and interpret programs on machines.
- Machine Code: Basic machine instructions
- Keep track of exact address of each data item and each instruction
- Encode each instruction as a bit string
- Assembly Language: Symbolic names for operations, registers, and addresses.
5Programming Languages
- High Level Languages: Notation similar to familiar mathematical notation
- Expressions: +, -, *, /
- Data Types: truth values, characters, integers, records, arrays
- Control Structures: if, case, while, for
- Declarations: constant values, variables, procedures, functions, types
- Abstraction: separates what is to be performed from how it is to be performed
- Encapsulation (or data abstraction): group together related declarations and selectively hide some
6Programming Languages
- Any system that manipulates programs expressed in some particular programming language
- Editors: enter, modify, and save program text
- Translators and Compilers: A translator translates text from one language to another. A compiler translates a program from a high-level language to a low-level language, preparing it to be run on a machine
- Checks program for syntactic and contextual errors
- Interpreters: Run a program without compilation
- Command languages
- Database query languages
7Programming Languages Specifications
- Syntax
- Form of the program
- Defines symbols
- How phrases are composed
- Contextual constraints
- Scope determine scope of each declaration
- Type
- Semantics
- Meaning of the program
8Representation
- Syntax
- Backus-Naur Form (BNF): context-free grammar
- Terminal symbols (>=, while, ;)
- Non-terminal symbols (Program, Command, Expression, Declaration)
- Start symbol (Program)
- Production rules (define how phrases are composed from terminals and sub-phrases)
- N ::= a | b | ...
- Syntax Tree
- Used to define language in terms of strings and terminal symbols
9Representation
- Semantics
- Abstract Syntax
- Concentrate on phrase structure alone
- Abstract Syntax Tree
10Contextual Constraints
- Scope
- Binding
- Static: determined by language processor
- Dynamic: determined at run-time
- Type
- Statically typed: language processor can detect all type errors
- Dynamically typed: type errors cannot be detected until run-time
Will assume static binding and static typing
11Semantics
- Concerned with meaning of program
- Behavior when run
- Usually specified informally
- Declarative sentences
- Could include side effects
- Correspond to production rules
12Mini-Triangle Syntax
Program ::= single-Command
Command ::= single-Command
          | Command ; single-Command
single-Command ::= V-name := Expression
          | Identifier ( Expression )
          | if Expression then single-Command else single-Command
          | while Expression do single-Command
          | let Declaration in single-Command
          | begin Command end
Expression ::= primary-Expression
          | Expression Operator primary-Expression
13Mini-Triangle Syntax
primary-Expression ::= Integer-Literal
          | V-name
          | Operator primary-Expression
          | ( Expression )
V-name ::= Identifier
Declaration ::= single-Declaration
          | Declaration ; single-Declaration
single-Declaration ::= const Identifier ~ Expression
          | var Identifier : Type-denoter
Type-denoter ::= Identifier
Operator ::= + | - | * | / | < | > | = | \
Identifier ::= Letter | Identifier Letter | Identifier Digit
Integer-Literal ::= Digit | Integer-Literal Digit
Comment ::= ! Graphic* eol
14Syntax Tree: let var y: Integer in y := y + 1
[Figure: syntax tree of the program, rooted at Program, with single-Command, Declaration, single-Declaration, Type-denoter, Expression, primary-Expression, V-name, Identifier, Operator and Integer-Literal nodes above the terminals let, var, y, Integer, in, y, y, 1]
15Representation
- Semantics
- Abstract Syntax
- Concentrate on phrase structure alone
- Abstract Syntax Tree
16Mini-Triangle Abstract Syntax
Each production rule is shown with its label on the right.
Program ::= Command                                        Program
Command ::= V-name := Expression                           AssignCommand
          | Identifier ( Expression )                      CallCommand
          | Command ; Command                              SequentialCommand
          | if Expression then Command else Command        IfCommand
          | while Expression do Command                    WhileCommand
          | let Declaration in Command                     LetCommand
Expression ::= Integer-Literal                             IntegerExpression
          | V-name                                         VnameExpression
          | Operator Expression                            UnaryExpression
          | Expression Operator Expression                 BinaryExpression
17Mini-Triangle Abstract Syntax
Each production rule is shown with its label on the right.
V-name ::= Identifier                                      SimpleVname
Declaration ::= const Identifier ~ Expression              ConstDeclaration
          | var Identifier : Type-denoter                  VarDeclaration
          | Declaration ; Declaration                      SequentialDeclaration
Type-denoter ::= Identifier                                SimpleTypeDenoter
18Abstract Syntax Tree: let var y: Integer in y := y + 1
[Figure: AST of the program, rooted at Program, with LetCommand, VarDeclaration, SimpleTypeDenoter, AssignmentCommand, SimpleVname, BinaryExpression, VnameExpression and IntegerExpression nodes above the Identifier, Operator and Integer-Literal leaves y, Integer, y, y, 1]
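The AST above can be written down as a small set of node classes. The following is a minimal sketch in Python, assuming one class per label of the Mini-Triangle abstract syntax; the field names (spelling, vname, type_denoter and so on) and the helper example_ast are illustrative choices, not the classes of the Triangle compiler.

```python
from dataclasses import dataclass


# Leaf nodes: they carry only their spelling.
@dataclass
class Identifier:
    spelling: str

@dataclass
class IntegerLiteral:
    spelling: str

@dataclass
class Operator:
    spelling: str


# V-name, type-denoter and expression nodes.
@dataclass
class SimpleVname:
    identifier: Identifier

@dataclass
class SimpleTypeDenoter:
    identifier: Identifier

@dataclass
class IntegerExpression:
    literal: IntegerLiteral

@dataclass
class VnameExpression:
    vname: SimpleVname

@dataclass
class BinaryExpression:
    left: object
    operator: Operator
    right: object


# Declaration and command nodes.
@dataclass
class VarDeclaration:
    identifier: Identifier
    type_denoter: SimpleTypeDenoter

@dataclass
class AssignCommand:
    vname: SimpleVname
    expression: object

@dataclass
class LetCommand:
    declaration: object
    command: object

@dataclass
class Program:
    command: object


def example_ast() -> Program:
    """Build the AST of: let var y: Integer in y := y + 1"""
    def y() -> SimpleVname:
        return SimpleVname(Identifier("y"))

    declaration = VarDeclaration(Identifier("y"),
                                 SimpleTypeDenoter(Identifier("Integer")))
    expression = BinaryExpression(VnameExpression(y()),
                                  Operator("+"),
                                  IntegerExpression(IntegerLiteral("1")))
    return Program(LetCommand(declaration, AssignCommand(y(), expression)))
```

Note that the keywords let, var and in, and the punctuation, do not appear anywhere in the tree: only the phrase structure is kept.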
19Mini-Triangle Semantics
- A command C is executed in order to update variables (this includes input and output).
- The assignment command V := E is executed as follows. The expression E is evaluated to yield a value v; then v is assigned to the value-or-variable-name V.
- The call-command I (E) is executed as follows. The expression E is evaluated to yield a value v; then the procedure bound to I is called with v as its argument.
- The sequential command C1 ; C2 is executed as follows. First C1 is executed; then C2 is executed.
- The if-command if E then C1 else C2 is executed as follows. The expression E is evaluated to yield a truth-value t. If t is true, C1 is executed; if t is false, C2 is executed.
- The while-command while E do C is executed as follows. The expression E is evaluated to yield a truth-value t. If t is true, C is executed, and then the while-command is executed again; if t is false, execution of the while-command is completed.
- The let-command let D in C is executed as follows. The declaration D is elaborated to produce bindings b; C is executed in the environment of the let-command overlaid by the bindings b. The bindings b have no effect outside the let-command.
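The informal rules above translate almost line by line into a recursive executor. The following Python sketch assumes a hypothetical tuple encoding of commands and expressions (for example ("assign", name, expr)), treats a declaration as a plain list of variable names initialised to 0, and leaves out the call-command; none of these choices come from the Triangle specification.

```python
def evaluate(expr, env):
    """Evaluate an expression to yield a value."""
    kind = expr[0]
    if kind == "int":                      # Integer-Literal
        return expr[1]
    if kind == "var":                      # V-name
        return env[expr[1]]
    if kind == "binary":                   # Expression Operator Expression
        _, op, left, right = expr
        a, b = evaluate(left, env), evaluate(right, env)
        if op == "+":
            return a + b
        if op == "-":
            return a - b
        if op == "*":
            return a * b
        if op == "<":
            return a < b
        if op == ">":
            return a > b
    raise ValueError(f"unsupported expression: {expr!r}")


def execute(cmd, env):
    """Execute a command, updating the variables held in env."""
    kind = cmd[0]
    if kind == "assign":                   # V := E
        _, name, expr = cmd
        env[name] = evaluate(expr, env)
    elif kind == "seq":                    # C1 ; C2
        execute(cmd[1], env)
        execute(cmd[2], env)
    elif kind == "if":                     # if E then C1 else C2
        _, cond, c1, c2 = cmd
        execute(c1 if evaluate(cond, env) else c2, env)
    elif kind == "while":                  # while E do C
        _, cond, body = cmd
        while evaluate(cond, env):
            execute(body, env)
    elif kind == "let":                    # let D in C
        _, names, body = cmd
        inner = dict(env)                  # environment of the let-command ...
        for name in names:
            inner[name] = 0                # ... overlaid by the new bindings
        execute(body, inner)
        for name in env:                   # the bindings have no effect outside
            if name not in names:
                env[name] = inner[name]
    else:
        raise ValueError(f"unsupported command: {cmd!r}")


# let var i in begin i := 0; while i < 4 do begin total := total + i; i := i + 1 end end
program = ("let", ["i"],
           ("seq", ("assign", "i", ("int", 0)),
                   ("while", ("binary", "<", ("var", "i"), ("int", 4)),
                    ("seq", ("assign", "total",
                             ("binary", "+", ("var", "total"), ("var", "i"))),
                            ("assign", "i",
                             ("binary", "+", ("var", "i"), ("int", 1)))))))
env = {"total": 0}
execute(program, env)   # env is now {"total": 6}; i is not visible outside the let
```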
20Chapter 2 Language Processors
- Translators and Compilers
- Interpreters
- Real and Abstract Machines
- Interpretive Compilers
- Portable Compilers
- Bootstrapping
- Case Study The Triangle Language Processor
21Translators Compilers
- Translator: a program that accepts any text expressed in one language (the translator's source language), and generates a semantically-equivalent text expressed in another language (its target language)
- Chinese-into-English
- Java-into-C
- Java-into-x86
- x86 assembler
22Translators Compilers
- Assembler: translates from an assembly language into the corresponding machine code
- Generates one machine code instruction per source instruction
- Compiler: translates from a high-level language into a low-level language
- Generates several machine-code instructions per source command.
23Translators Compilers
- Disassembler: translates a machine code into the corresponding assembly language
- Decompiler: translates a low-level language into a high-level language
Question: Why would you want a disassembler or decompiler?
24Translators Compilers
- Source Program: the source language text
- Object Program: the target language text
[Diagram: Source Program -> Compiler (Syntax Check, Context Constraints) -> Object Program]
- Object program is semantically equivalent to source program
- If source program is well-formed
25Translators Compilers
- Why would you want a
- Java-into-C translator?
- C-into-Java translator?
- Assembly-language-into-Pascal decompiler?
26Translators Compilers
[Tombstone diagrams]
Program: P = Program Name, L = Implementation Language
Machine: M = Target Machine
Running program P on machine M: for this to work, L must equal M, that is, the implementation language must be the same as the machine language
Translator: S = Source Language, T = Target Language, L = Translator's Implementation Language
An S-into-T translator is itself a program that runs on machine L
27Translators Compilers
- Translating a source program P
- Expressed in language S,
- Using an S-into-T translator
- Running on machine M
28Translators Compilers
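[Tombstone diagram: sort expressed in Java is translated by a Java-into-x86 translator, expressed in x86 and running on an x86 machine, into sort expressed in x86]
- Translating a source program sort
- Expressed in language Java,
- Using a Java-into-x86 translator
- Running on an x86 machine
The object program is running on the same machine as the compiler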
29Translators Compilers
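[Tombstone diagram: sort expressed in Java is translated by a Java-into-PPC translator, expressed in x86 and running on an x86 machine, into sort expressed in PPC, which is then downloaded to a PPC machine]
- Translating a source program sort
- Expressed in language Java,
- Using a Java-into-PPC translator
- Running on an x86 machine
- Downloaded to a PPC machine
Cross Compiler: The object program is running on a different machine than the compiler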
30Translators Compilers
[Tombstone diagram: sort expressed in Java is translated by a Java-into-C translator into sort expressed in C, which is translated by a C-into-x86 compiler into sort expressed in x86; both translators run on an x86 machine]
- Translating a source program sort
- Expressed in language Java,
- Using a Java-into-C translator
- Running on an x86 machine
- Then translating the C program
- Using a C-into-x86 compiler
- Running on an x86 machine
- Into x86 object program
Two-stage Compiler: The source program is translated to another language before being translated into the object program
31Translators Compilers
- Translator Rules
- Can run on machine M only if it is expressed in machine code M
- Source program must be expressed in the translator's source language S
- Object program is expressed in the translator's target language T
- Object program is semantically equivalent to the source program
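The translator rules can be checked mechanically. The sketch below models the tombstone notation in Python, assuming two illustrative record types (Program and Translator) and a hypothetical function translate; it only checks the language and machine labels and does not translate any code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str
    language: str          # language the program is expressed in


@dataclass(frozen=True)
class Translator:
    source: str            # source language S
    target: str            # target language T
    implementation: str    # language L the translator itself is expressed in


def translate(program: Program, translator: Translator, machine: str) -> Program:
    """Apply the translator rules and return the object program."""
    # Rule 1: the translator runs on machine M only if expressed in machine code M.
    if translator.implementation != machine:
        raise ValueError("translator is not expressed in the machine's code")
    # Rule 2: the source program must be expressed in the source language S.
    if program.language != translator.source:
        raise ValueError("program is not expressed in the translator's source language")
    # Rule 3: the object program is expressed in the target language T.
    return Program(program.name, translator.target)


sort = Program("sort", "Java")

# Ordinary compilation: compiler and object program share the x86 machine.
native = translate(sort, Translator("Java", "x86", "x86"), machine="x86")

# Cross compilation: the compiler runs on x86, the object program is for PPC.
cross = translate(sort, Translator("Java", "PPC", "x86"), machine="x86")

# Two-stage compilation: Java into C, then C into x86.
stage1 = translate(sort, Translator("Java", "C", "x86"), machine="x86")
stage2 = translate(stage1, Translator("C", "x86", "x86"), machine="x86")
```

Rule 4 (semantic equivalence of source and object program) is a property of the translator itself and cannot be checked from the labels alone.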
32Interpreters
- Accepts any program (source program) expressed in a particular language (source language) and runs that source program immediately
- Does not translate the source program into object code prior to execution
33Interpreters
[Flowchart: the Interpreter repeats Fetch Instruction -> Analyze Instruction -> Execute Instruction until Program Complete]
- Source program starts to run as soon as the first instruction is analyzed
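The fetch, analyze, execute cycle can be sketched as a loop. The example below is a minimal Python illustration for a made-up stack-machine instruction set (push, add, mul, print, halt); the instruction names and the function run are assumptions for this sketch and do not belong to any real machine.

```python
def run(program):
    """Interpret a list of instructions and return the printed values."""
    stack, output = [], []
    pc = 0                                  # index of the next instruction
    while True:
        instruction = program[pc]           # fetch
        pc += 1
        op, *args = instruction             # analyze
        if op == "halt":                    # program complete
            break
        if op == "push":                    # execute
            stack.append(args[0])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "print":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output


# (2 + 3) * 4
result = run([("push", 2), ("push", 3), ("add",),
              ("push", 4), ("mul",), ("print",), ("halt",)])
# result is [20]
```

Each instruction is analyzed every time it is executed, which is one reason interpretation is slow for instructions inside loops.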
34Interpreters
- When to Use Interpretation
- Interactive mode: want to see results of instruction before entering next instruction
- Only use program once
- Each instruction expected to be executed only once
- Instructions have simple formats
- Disadvantages
- Slow: up to 100 times slower than machine code
35Interpreters
- Examples
- Basic
- Lisp
- Unix Command Language (shell)
- SQL
36Interpreters
An S interpreter expressed in language L
Program P expressed in language S, using an S interpreter, running on machine M
Program graph written in Basic, running on a Basic interpreter, executed on an x86 machine
37Real and Abstract Machines
- Hardware emulation: Using software to execute one set of machine code on another machine
- Can measure everything about the new machine except its speed
- Abstract machine: emulator
- Real machine: actual hardware
An abstract machine is functionally equivalent to a real machine if they both implement the same language L
38Real and Abstract Machines
New Machine Instruction (nmi) interpreter written in C
Compiler to translate C program into M machine code
The nmi interpreter is translated into machine code M using the C compiler
nmi interpreter expressed in machine code M
39Interpretive Compilers
- Combination of compiler and interpreter
- Translate source program into an intermediate language
- It is intermediate in level between the source language and ordinary machine code
- Its instructions have simple formats, and therefore can be analyzed easily and quickly
- Translation from the source language into the intermediate language is easy and fast
An interpretive compiler combines fast compilation with tolerable running speed
40Interpretive Compilers
Java into JVM translator running on machine M
JVM code interpreter running on machine M
A Java program P is first translated into
JVM-code, and then the JVM-code object program is
interpreted
41Portable Compilers
- A program is portable if it can be compiled and run on any machine, without change
- A portable program is more valuable than an unportable one, because its development cost can be spread over more copies
- Portability is measured by the proportion of code that remains unchanged when it is moved to a dissimilar machine
- Language affects portability
- Assembly language: 0% portable
- High level language: approaches 100% portability
42Portable Compilers
- Language Processors
- Valuable and widely used programs
- Typically written in high-level language
- Pascal, C, Java
- Part of language processor is machine dependent
- Code generation part
- Language processor is only about 50% portable
- Compiler that generates intermediate code is more
portable than a compiler that generates machine
code
43Portable Compilers
Java
JVM
Java
Rewrite interpreter in C
44Bootstrapping
- The language processor is used to process itself
- Implementation language is the source language
- Bootstrapping a portable compiler
- A portable compiler can be bootstrapped to make a true compiler, one that generates machine code, by writing an intermediate-language-into-machine-code translator
- Full bootstrap
- Writing the compiler in itself
- Using the latest version to upgrade the next version
- Half bootstrap
- Compiler expressed in itself but targeted for another machine
- Bootstrapping to improve efficiency
- Upgrade the compiler to optimize code generation as well as to improve compile efficiency
45Bootstrapping
Bootstrap an interpretive compiler to generate
machine code
First, write a JVM-coded-into-M translator in Java
Next, compile translator using existing
interpreter
Use translator to translate itself
Two stage Java-into-M compiler
Translate Java-into-JVM-code translator into
machine code
46Bootstrapping
Full bootstrap
v2
v1
Convert the C version of Ada-S into Ada-S version
of Ada-S
Write Ada-S compiler in C
v1
v2
v3
Extend Ada-S compiler to (full) Ada compiler
47Bootstrapping
Half bootstrap
48Bootstrapping
Bootstrap to improve efficiency
49Chapter 3 Compilation
- Phases
- Syntactic Analysis
- Contextual Analysis
- Code Generation
- Passes
- Multi-pass Compilation
- One-pass Compilation
- Compiler Design Issues
- Case Study The Triangle Compiler
50Phases
- Syntactic Analysis
- The source program is parsed to check whether it conforms to the source language's syntax, and to determine its phrase structure
- Contextual Analysis
- The parsed program is analyzed to check whether it conforms to the source language's contextual constraints
- Code Generation
- The checked program is translated to an object program, in accordance with the semantics of the source and target languages
51Phases
Source Program
Syntactic Analysis
Error Report
AST
Contextual Analysis
Error Report
Decorated AST
Code Generation
Object Program
52Syntactic Analysis
- To determine the source program's phrase structure
- Parsing
- Contextual analysis and code generation must know how the program is composed
- Commands, expressions, declarations, ...
- Check for conformance to the source language's syntax
- Construct suitable representation of its phrase structure (AST)
- AST
- Terminal nodes corresponding to identifiers, literals, and operators
- Sub trees representing the phrases of the source program
- Blanks and comments not in AST (no meaning)
- Punctuation and brackets not in AST (only separate and enclose)
53Contextual Analysis
- Analyzes the parsed program
- Scope rules
- Type rules
- Produces decorated AST
- AST with information gathered during contextual analysis
- Each applied occurrence of an identifier is linked to the corresponding declaration
- Each expression is decorated by its type T
54Code Generation
- The final translation of the checked program to an object program
- After syntactic and contextual analysis is completed
- Treatment of identifiers
- Constants
- Binds identifier to value
- Replace each occurrence of identifier with value
- Variables
- Binds identifier to some memory address
- Replace each occurrence of identifier by address
- Target language
- Assembly language
- Machine code
55Passes
- Multi-pass compilation
- Traverses the program or AST several times
- One-pass compilation
- Single traverse of program
- Contextual analysis and code generation are
performed on the fly during syntactic analysis
56Compiler Design Issues
- Speed
- Compiler run time
- Space
- Storage size of the compiler and of the files it generates
- Modularity
- Multi-pass compiler more modular than one-pass compiler
- Flexibility
- Multi-pass compiler is more flexible because it generates an AST that can be traversed in any order by the other phases
- Semantics-preserving transformations
- To optimize code, a multi-pass compiler is needed
- Source language properties
- May restrict compiler choice: some language constructs may require multi-pass compilers
57Simple Triangle Program
! This program is useless
! Except for illustration.
let
   var n: Integer;
   var c: Char
in
   begin
   c := '&';
   n := n + 1
   end
58Abstract Syntax Tree
[Figure: AST of the simple Triangle program, with numbered nodes: Program (1), LetCommand, two SequentialDeclaration nodes, two VarDeclaration nodes with SimpleTypeDenoter children (Integer, Char), and two AssignmentCommand nodes, one assigning a CharacterExpression to c and one assigning the BinaryExpression n + 1 to n]
60Chapter 4 Syntactic Analysis
- Sub-phases of Syntactic Analysis
- Grammars Revisited
- Parsing
- Abstract Syntax Trees
- Scanning
- Case Study Syntactic Analysis in the Triangle
Compiler
61Structure of a Compiler
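[Diagram: compiler pipeline, with a Symbol Table alongside the phases]
- Source code -> Lexical Analyzer -> tokens
- tokens -> Parser and Semantic Analyzer -> parse tree
- parse tree -> Intermediate Code Generation -> intermediate representation
- intermediate representation -> Optimization -> intermediate representation
- intermediate representation -> Assembly Code Generation -> Assembly code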
62Syntactic Analysis
- Main function
- Parse source program to discover its phrase structure
- Recursive-descent parsing
- Constructing an AST
- Scanning to group characters into tokens
63Sub-phases of Syntactic Analysis
- Scanning (or lexical analysis)
- Source program transformed to a stream of tokens
- Identifiers
- Literals
- Operators
- Keywords
- Punctuation
- Comments and blank spaces discarded
- Parsing
- To determine the source program's phrase structure
- Source program is input as a stream of tokens (from the Scanner)
- Treats each token as a terminal symbol
- Representation of phrase structure
- AST
64Lexical Analysis A Simple Example
main() {
    int a, b, c;
    char number[5];
    /* get user inputs */
    a = atoi(gets(number));
    b = atoi(gets(number));
    /* calculate value for c */
    c = 2*(a+b) + a*(a+b);
    /* print results */
    printf("%d", c);
}
- Scan the file character by character and group characters into words and punctuation (tokens), remove white space and comments
- Some tokens for this example
- main
- (
- )
- {
- int
- a
- ,
- b
- ,
- c
- ;
65Creating Tokens: Mini-Triangle Example
[Figure: the Input Converter delivers the character string of let var y: Integer in y := y + 1, one character at a time with spaces marked, to the Scanner, which produces the tokens let, var, Ident. (y), colon, Ident. (Integer), in, Ident. (y), becomes, Ident. (y), op. (+), Intlit. (1), eot]
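The grouping of characters into tokens shown in the figure can be sketched as a hand-written scanner. The Python code below is an illustration only: the token kind names, the KEYWORDS and PUNCTUATION tables and the function scan are assumptions for this sketch, not the Triangle compiler's Scanner class. Keywords are recognised by scanning an identifier and then looking it up in a keyword table.

```python
KEYWORDS = {"begin", "const", "do", "else", "end", "if",
            "in", "let", "then", "var", "while"}
PUNCTUATION = {":": "colon", ";": "semicolon", "~": "is",
               "(": "lparen", ")": "rparen"}
OPERATORS = set("+-*/<>=\\")


def scan(text):
    """Return a list of (kind, spelling) tokens, ending with an eot token."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                          # discard white space
            i += 1
        elif ch == "!":                           # discard comment up to end of line
            while i < len(text) and text[i] != "\n":
                i += 1
        elif ch.isalpha():                        # identifier or keyword
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            word = text[start:i]
            kind = word if word in KEYWORDS else "identifier"
            tokens.append((kind, word))
        elif ch.isdigit():                        # integer literal
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("intliteral", text[start:i]))
        elif text.startswith(":=", i):            # one character of look-ahead
            tokens.append(("becomes", ":="))
            i += 2
        elif ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], ch))
            i += 1
        elif ch in OPERATORS:
            tokens.append(("operator", ch))
            i += 1
        else:                                     # lexical error
            tokens.append(("error", ch))
            i += 1
    tokens.append(("eot", ""))
    return tokens


tokens = scan("let var y: Integer in y := y + 1")
kinds = [kind for kind, _ in tokens]
```

The check for := before : is the simplest case of look-ahead: the scanner must inspect the next character before deciding which token it has found.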
66Tokens in Triangle
// literals, identifiers, operators...
INTLITERAL = 0, "<int>",
CHARLITERAL = 1, "<char>",
IDENTIFIER = 2, "<identifier>",
OPERATOR = 3, "<operator>",
// reserved words - must be in alphabetical order...
ARRAY = 4, "array",
BEGIN = 5, "begin",
CONST = 6, "const",
DO = 7, "do",
ELSE = 8, "else",
END = 9, "end",
FUNC = 10, "func",
IF = 11, "if",
IN = 12, "in",
LET = 13, "let",
OF = 14, "of",
PROC = 15, "proc",
// punctuation...
DOT = 21, ".",
COLON = 22, ":",
SEMICOLON = 23, ";",
COMMA = 24, ",",
BECOMES = 25, ":=",
IS = 26, "~",
// brackets...
LPAREN = 27, "(",
RPAREN = 28, ")",
LBRACKET = 29, "[",
RBRACKET = 30, "]",
LCURLY = 31, "{",
RCURLY = 32, "}",
// special tokens...
EOT = 33, "",
ERROR = 34, "<error>"
67Grammars Revisited
- Context free grammars
- Generates a set of sentences
- Each sentence is a string of terminal symbols
- An unambiguous sentence has a unique phrase structure embodied in its syntax tree
- Develop parsers from context-free grammars
68Regular Expressions
- A regular expression (RE) is a convenient notation for expressing a set of strings of terminal symbols
- Main features
- | separates alternatives
- * indicates that the previous item may be repeated zero or more times
- ( and ) are grouping parentheses
69Regular Expression Basics
- ε: The empty string, a special string of length 0
- Regular expression operations
- | separates alternatives
- * indicates that the previous item may be repeated zero or more times (repetition)
- ( and ) are grouping parentheses
70Regular Expression Basics
- Algebraic Properties
- | is commutative and associative
- r|s = s|r
- r|(s|t) = (r|s)|t
- Concatenation is associative
- (rs)t = r(st)
- Concatenation distributes over |
- r(s|t) = rs|rt
- (s|t)r = sr|tr
- ε is the identity for concatenation
- εr = r
- rε = r
- * is idempotent
- r** = r*
- r* = (r|ε)*
71Regular Expression Basics
- Common Extensions
- r+: one or more of expression r, same as rr*
- r^k: k repetitions of r
- r^3 = rrr
- ~r: the characters not in the expression r
- ~[\t\n]
- [r-z]: range of characters
- [0-9a-z]
- r?: zero or one copy of expression r (used for fields of an expression that are optional)
72Regular Expression Example
- Regular Expression for Representing Months
- Examples of legal inputs
- January represented as 1 or 01
- October represented as 10
- First Try: [0|1|ε][0-9]
- Matches all legal inputs? Yes
- 1, 2, 3, ..., 10, 11, 12, 01, 02, ..., 09
- Matches any illegal inputs? Yes
- 0, 00, 18
73Regular Expression Example
- Regular Expression for Representing Months
- Examples of legal inputs
- January represented as 1 or 01
- October represented as 10
- Second Try: [1-9] | (0[1-9]) | (1[0-2])
- Matches all legal inputs? Yes
- 1, 2, 3, ..., 10, 11, 12, 01, 02, ..., 09
- Matches any illegal inputs? No
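Both attempts can be checked with a regular expression library. The sketch below uses Python's re module; the names FIRST_TRY, SECOND_TRY and matches are illustrative, and the slide notation is transcribed into Python syntax ([0|1|ε] becomes an optional [01]).

```python
import re

# First try: an optional 0 or 1, followed by any digit.
FIRST_TRY = re.compile(r"[01]?[0-9]")
# Second try: 1-9, or 01-09, or 10-12.
SECOND_TRY = re.compile(r"[1-9]|0[1-9]|1[0-2]")


def matches(pattern, text):
    """True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


LEGAL = [str(m) for m in range(1, 13)] + [f"{m:02d}" for m in range(1, 10)]
ILLEGAL = ["0", "00", "18"]

first_accepts_legal = all(matches(FIRST_TRY, s) for s in LEGAL)
first_accepts_illegal = [s for s in ILLEGAL if matches(FIRST_TRY, s)]
second_accepts_legal = all(matches(SECOND_TRY, s) for s in LEGAL)
second_accepts_illegal = [s for s in ILLEGAL if matches(SECOND_TRY, s)]
```

The first pattern accepts every legal month but also 0, 00 and 18; the second accepts the legal months and rejects those three.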
74Regular Expression Example
- Regular Expression for Floating Point Numbers
- Examples of legal inputs
- 1.0, 0.2, 3.14159, -1.0, 2.7e8, 1.0E-6
- Assume that a 0 is required before numbers less than 1, and that extra leading zeros are not prevented, so numbers such as 0011 or 0003.14159 are legal
- Building the regular expression
- Assume
- digit → 0|1|2|3|4|5|6|7|8|9
- Handle simple decimals such as 1.0, 0.2, 3.14159
- digit+.digit+
- Add an optional sign (only minus, no plus)
- (-|ε)digit+.digit+ or -?digit+.digit+
75Regular Expression Example
- Regular Expression for Floating Point Numbers (cont.)
- Building the regular expression (cont.)
- Format for the exponent
- (E|e)(+|-)?(digit+)
- Adding it as an optional expression to the decimal part
- (-|ε)digit+.digit+((E|e)(+|-)?(digit+))?
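The finished expression can be tried out directly. The following sketch transcribes it into Python's re syntax; the names FLOAT and is_float are illustrative. The decimal point must be escaped, because an unescaped dot matches any character in Python regular expressions.

```python
import re

# (-|ε) digit+ . digit+ ( (E|e) (+|-)? digit+ )?
FLOAT = re.compile(r"-?[0-9]+\.[0-9]+([Ee][+-]?[0-9]+)?")


def is_float(text):
    """True if the whole text is a floating point number in this format."""
    return FLOAT.fullmatch(text) is not None


legal = ["1.0", "0.2", "3.14159", "-1.0", "2.7e8", "1.0E-6", "0003.14159"]
rejected = [".5", "1.", "+1.0", "1.0e", "1.0e+", "abc"]
```

All strings in legal are accepted. The strings in rejected fail because a digit is required on both sides of the decimal point, a leading plus sign is not allowed, and an exponent marker must be followed by at least one digit.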
76Extended BNF
- Extended BNF (EBNF)
- Combination of BNF and RE
- N ::= X, where N is a nonterminal symbol and X is an extended RE, i.e., an RE constructed from both terminal and nonterminal symbols
- EBNF
- Right hand side may use |, *, (, )
- Right hand side may contain both terminal and nonterminal symbols
77Example EBNF
- Expression ::= primary-Expression (Operator primary-Expression)*
- primary-Expression ::= Identifier
-                      | ( Expression )
- Identifier ::= a | b | c | d | e
- Operator ::= + | - | * | /
- Generates
- e
- a + b
- a + b + c
- a + (b * c)
- a * (b + c) / d
- a + (b + (c + (d + e)))
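An EBNF grammar of this shape maps directly onto a recursive-descent recognizer: one function per nonterminal, with the * turned into a loop. The Python sketch below assumes single-character tokens and a hypothetical function name recognizes; it only answers yes or no and builds no tree.

```python
def recognizes(text):
    """True if text is a sentence of the example Expression grammar."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def accept(allowed):
        """Consume the current token if it is one of the allowed characters."""
        nonlocal pos
        token = peek()
        if token is not None and token in allowed:
            pos += 1
            return True
        return False

    def parse_primary():
        # primary-Expression ::= Identifier | ( Expression )
        if accept("abcde"):
            return True
        if accept("("):
            return parse_expression() and accept(")")
        return False

    def parse_expression():
        # Expression ::= primary-Expression (Operator primary-Expression)*
        if not parse_primary():
            return False
        while accept("+-*/"):          # the * in the EBNF rule becomes a loop
            if not parse_primary():
                return False
        return True

    return parse_expression() and pos == len(tokens)


accepted = [recognizes(s) for s in
            ["e", "a + b", "a * (b + c) / d", "a - (b - (c - (d - e)))"]]
refused = [recognizes(s) for s in ["", "a +", "(a", "a b", "f"]]
```

Every entry of accepted is True and every entry of refused is False.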
78Grammar Transformations
- Left Factorization
- XY | XZ is equivalent to X(Y | Z)
- Before:
- single-Command ::= V-name := Expression
-                  | if Expression then single-Command
-                  | if Expression then single-Command else single-Command
- After:
- single-Command ::= V-name := Expression
-                  | if Expression then single-Command (ε | else single-Command)
79Grammar Transformations
- Elimination of left recursion
- N ::= X | NY is equivalent to N ::= X(Y)*
- Identifier ::= Letter
-              | Identifier Letter
-              | Identifier Digit
- Identifier ::= Letter
-              | Identifier (Letter | Digit)
- Identifier ::= Letter (Letter | Digit)*
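Once the left recursion is gone, the rule Identifier ::= Letter (Letter | Digit)* can be recognised with a simple loop. The sketch below is a minimal Python illustration; the function name is_identifier is hypothetical and letters are assumed to be the ASCII letters.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def is_identifier(text):
    """Identifier ::= Letter (Letter | Digit)*"""
    if not text or text[0] not in LETTERS:      # the leading Letter
        return False
    for ch in text[1:]:                         # (Letter | Digit)* becomes a loop
        if ch not in LETTERS and ch not in DIGITS:
            return False
    return True
```

The left-recursive form would make a recursive-descent function call itself before consuming any input, so it would never terminate; the transformed rule avoids this.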
80Grammar Transformations
- Substitution of nonterminal symbols
- Given N ::= X, we can substitute each occurrence of N with X
- iff N ::= X is nonrecursive and is the only production rule for N
- Before:
- single-Command ::= for Control-Variable := Expression To-or-Downto Expression do single-Command
-                  | ...
- Control-Variable ::= Identifier
- To-or-Downto ::= to
-                | downto
- After:
- single-Command ::= for Identifier := Expression (to | downto) Expression do single-Command
-                  | ...
81Scanning (Lexical Analysis)
- The purpose of scanning is to recognize tokens in the source program. Or, to group input characters (the source program text) into tokens.
- Difference between parsing and scanning
- Parsing groups terminal symbols, which are tokens, into larger phrases such as expressions and commands and analyzes the tokens for correctness and structure
- Scanning groups individual characters into tokens
84What Does a Scanner Do?
- Handle keywords (reserved words)
- Recognizes identifiers and keywords
- Match explicitly
- Write regular expression for each keyword
- Identifier is any alphanumeric string which is not a keyword
- Match as an identifier, perform lookup
- No special regular expressions for keywords
- When an identifier is found, perform lookup into preloaded keyword table
How does Triangle handle keywords? Discuss in terms of efficiency and ease of coding.
85What Does a Scanner Do?
- Remove white space
- Tabs, spaces, new lines
- Remove comments
- Single line
- -- Ada comment
- Multi-line, start and end delimiters
- { Pascal comment }
- /* C comment */
- Nested
- Runaway comments
- Nonterminated comments can't be detected till end of file
86What Does a Scanner Do?
- Perform look ahead
- Multi-character tokens
- 1..10 vs. 1.10
- ,
- <, <=
- etc
- Challenging input languages
- FORTRAN
- Keywords not reserved
- Blanks are not a delimiter
- Example (comma vs. decimal)
- DO10I=1,5 start of a do loop (equivalent to a C for loop)
- DO10I=1.5 an assignment statement, assignment to variable DO10I
87What Does a Scanner Do?
- Challenging input languages (cont.)
- PL/I, keywords not reserved
- IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
88What Does a Scanner Do?
- Error Handling
- Error token passed to parser which reports the error
- Recovery
- Delete characters from current token which have been read so far, restart scanning at next unread character
- Delete the first character of the current lexeme and resume scanning from next character.
- Examples of lexical errors
- 3.25e bad format for a constant
- Var1 illegal character
- Some errors that are not lexical errors
- Mistyped keywords
- Begim
- Mismatched parenthesis
- Undeclared variables
89Scanner Implementation
- Issues
- Simpler design: parser doesn't have to worry about white space, etc.
- Improve compiler efficiency: allows the construction of a specialized and potentially more efficient processor
- Compiler portability is enhanced: input alphabet peculiarities and other device-specific anomalies can be restricted to the scanner
90Scanner Implementation
- What are the keywords in Triangle?
- How are keywords and identifiers implemented in Triangle?
- Is look ahead implemented in Triangle?
- If so, how?
92Parsing
- Given an unambiguous, context free grammar, parsing is
- Recognition of an input string, i.e., deciding whether or not the input string is a sentence of the grammar
- Parsing of an input string, i.e., recognition of the input string plus determination of its phrase structure. The phrase structure can be represented by a syntax tree, or otherwise.
Unambiguous is necessary so that every sentence of the grammar will form exactly one syntax tree.
93Parsing
- The syntax of programming language constructs is described by context-free grammars.
- Advantages of unambiguous, context-free grammars
- A precise, yet easy-to-understand, syntactic specification of the programming language
- For certain classes of grammars we can automatically construct an efficient parser that determines if a source program is syntactically well formed.
- Imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors.
- Easier to add new constructs to the language if the implementation is based on a grammatical description of the language
94Parsing
- Check the syntax (structure) of a program and create a tree representation of the program
- Programming languages have non-regular constructs
- Nesting
- Recursion
- Context-free grammars are used to express the
syntax for programming languages
95Context-Free Grammars
- Comprised of
- A set of tokens or terminal symbols
- A set of non-terminal symbols
- A set of rules or productions which express the legal relationships between symbols
- A start or goal symbol
- Example
- Rule 1: expr → expr - digit
- Rule 2: expr → expr + digit
- Rule 3: expr → digit
- Rule 4: digit → 0 | 1 | 2 | ... | 9
- Tokens: -, +, 0, 1, 2, ..., 9
- Non-terminals: expr, digit
- Start symbol: expr
96Context-Free Grammars
- Rule 1: expr → expr - digit
- Rule 2: expr → expr + digit
- Rule 3: expr → digit
- Rule 4: digit → 0 | 1 | 2 | ... | 9
Example input: 3 + 8 - 2
97Checking for Correct Syntax
- Given a grammar for a language and a program, how do you know if the syntax of the program is legal?
- A legal program can be derived from the start symbol of the grammar
Grammar must be unambiguous and context-free
98Deriving a String
- The derivation begins with the start symbol
- At each step of a derivation the right hand side of a grammar rule is used to replace a non-terminal symbol
- Continue replacing non-terminals until only terminal symbols remain
expr ⇒ expr - digit          (Rule 1)
     ⇒ expr - 2              (Rule 4)
     ⇒ expr + digit - 2      (Rule 2)
     ⇒ expr + 8 - 2          (Rule 4)
     ⇒ digit + 8 - 2         (Rule 3)
     ⇒ 3 + 8 - 2             (Rule 4)
99Rightmost Derivation
- The rightmost non-terminal is replaced in each step
expr - digit ⇒ expr - 2                  (Rule 4)
expr - 2 ⇒ expr + digit - 2              (Rule 2)
expr + digit - 2 ⇒ expr + 8 - 2          (Rule 4)
expr + 8 - 2 ⇒ digit + 8 - 2             (Rule 3)
digit + 8 - 2 ⇒ 3 + 8 - 2                (Rule 4)
100Leftmost Derivation
- The leftmost non-terminal is replaced in each step
expr - digit ⇒ expr + digit - digit              (Rule 2)
expr + digit - digit ⇒ digit + digit - digit     (Rule 3)
digit + digit - digit ⇒ 3 + digit - digit        (Rule 4)
3 + digit - digit ⇒ 3 + 8 - digit                (Rule 4)
3 + 8 - digit ⇒ 3 + 8 - 2                        (Rule 4)
101Leftmost Derivation
- The leftmost non-terminal is replaced in each step
expr - digit ⇒ expr + digit - digit              (Rule 2)
expr + digit - digit ⇒ digit + digit - digit     (Rule 3)
digit + digit - digit ⇒ 3 + digit - digit        (Rule 4)
3 + digit - digit ⇒ 3 + 8 - digit                (Rule 4)
3 + 8 - digit ⇒ 3 + 8 - 2                        (Rule 4)
[Figure: syntax tree of 3 + 8 - 2 with numbered expr and digit nodes]
102Bottom-Up Parsing
- Parser examines terminal symbols of the input string, in order from left to right
- Reconstructs the syntax tree from the bottom (terminal nodes) up (toward the root node)
- Bottom-up parsing reduces a string w to the start symbol of the grammar.
- At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse.
103Bottom-Up Parsing
- Types of bottom-up parsing algorithms
- Shift-reduce parsing
- At each reduction step a particular sub-string
matching the right side of a production is
replaced by the symbol on the left of that
production, and if the sub-string is chosen
correctly at each step, a rightmost derivation is
traced out in reverse.
- LR(k) parsing
- L is for left-to-right scanning of the input, the
R is for constructing a right-most derivation in
reverse, and the k is for the number of input
symbols of look-ahead that are used in making
parsing decisions.
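A minimal shift-reduce sketch for the digit expression grammar (expr → expr - digit, expr → expr + digit, expr → digit, digit → 0 | … | 9). This is a hand-written illustration, not a generated LR parser: the function name shift_reduce and the way handles are recognized on top of the stack are assumptions made for this small grammar only.

```python
DIGITS = set("0123456789")


def shift_reduce(tokens):
    """Parse a token list bottom-up; return the shift/reduce actions taken."""
    stack, actions = [], []
    rest = list(tokens)
    while True:
        # Reduce whenever a handle sits on top of the stack, otherwise shift.
        if stack and stack[-1] in DIGITS:
            actions.append(f"reduce digit -> {stack[-1]}")
            stack[-1] = "digit"
        elif stack[-3:] in (["expr", "+", "digit"], ["expr", "-", "digit"]):
            actions.append(f"reduce expr -> expr {stack[-2]} digit")
            stack[-3:] = ["expr"]
        elif stack == ["digit"]:
            actions.append("reduce expr -> digit")
            stack[:] = ["expr"]
        elif rest:
            actions.append(f"shift {rest[0]}")
            stack.append(rest.pop(0))
        else:
            break
    if stack != ["expr"]:
        raise SyntaxError(f"cannot reduce {stack} to expr")
    return actions


for action in shift_reduce(list("3+8-2")):
    print(action)
```

Read from last to first, the reduce actions spell out a rightmost derivation of 3 + 8 - 2, which is the "rightmost derivation in reverse" that the R in LR refers to.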
104Bottom-Up Parsing Example: 3 + 8 - 2
105Bottom-Up Parsing Example: 3 + 8 - 2
106Bottom-Up Parsing Example: abbcde
abbcde → aAbcde
[Figure: syntax tree under construction over a b b c d e; the first b is reduced to A]
107Bottom-Up Parsing Example: abbcde
aAbcde → aAde
[Figure: syntax tree under construction over a b b c d e; A b c is reduced to A]
108Bottom-Up Parsing Example: abbcde
aAde → aABe
[Figure: syntax tree under construction over a b b c d e; d is reduced to B]
109Bottom-Up Parsing Example: abbcde
aABe → S
[Figure: complete syntax tree over a b b c d e with root S]
110Bottom-Up Parsing Example: the cat sees a rat.
the cat sees a rat. → the Noun sees a rat.
[Figure: syntax tree under construction; cat is reduced to Noun]
111Bottom-Up Parsing Example: the cat sees a rat.
the Noun sees a rat. → Subject sees a rat.
[Figure: syntax tree under construction; the Noun is reduced to Subject]
112Bottom-Up Parsing Example: the cat sees a rat.
Subject sees a rat. → Subject Verb a rat.
[Figure: syntax tree under construction; sees is reduced to Verb]
113Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb a rat. → Subject Verb a Noun.
[Figure: syntax tree under construction; rat is reduced to Noun]
114Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb a Noun. → Subject Verb Object.
What would happen if we chose Subject → a Noun
instead of Object → a Noun?
[Figure: syntax tree under construction; a Noun is reduced to Object]
115Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb Object. → Sentence
[Figure: complete syntax tree with root Sentence]
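The question about choosing Subject → a Noun can be explored with a small exhaustive search. The sketch below assumes the Micro-English grammar (Sentence → Subject Verb Object . and so on) and tries every possible reduction with backtracking. The names PRODUCTIONS, reductions and reduces_to_sentence are illustrative, and a practical bottom-up parser would not search blindly like this.

```python
PRODUCTIONS = [
    ("Sentence", ["Subject", "Verb", "Object", "."]),
    ("Subject", ["I"]), ("Subject", ["a", "Noun"]), ("Subject", ["the", "Noun"]),
    ("Object", ["me"]), ("Object", ["a", "Noun"]), ("Object", ["the", "Noun"]),
    ("Noun", ["cat"]), ("Noun", ["mat"]), ("Noun", ["rat"]),
    ("Verb", ["like"]), ("Verb", ["is"]), ("Verb", ["see"]), ("Verb", ["sees"]),
]


def reductions(form):
    """Yield every form obtainable from `form` by one reduction step."""
    for lhs, rhs in PRODUCTIONS:
        n = len(rhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == rhs:
                yield form[:i] + [lhs] + form[i + n:]


def reduces_to_sentence(form):
    """Depth-first search over all reduction orders, backing up at dead ends."""
    if form == ["Sentence"]:
        return True
    return any(reduces_to_sentence(nxt) for nxt in reductions(form))


print(reduces_to_sentence("the cat sees a rat .".split()))
# The wrong choice (a Noun reduced to Subject) leaves a form with no further reduction:
print(reduces_to_sentence(["Subject", "Verb", "Subject", "."]))
```

Reducing "a Noun" to Subject yields Subject Verb Subject ., which matches no right-hand side, so the parser would have to back up and choose Object → a Noun instead.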
116Top-Down Parsing
- The parser examines the terminal symbols of the
input string, in order from left to right.
- The parser reconstructs its syntax tree from the
top (root node) down (towards the terminal
nodes).
Top-down parsing is an attempt to find the leftmost
derivation for an input string
117Top-Down Parsers
- General rules for top-down parsers
- Start with just a stub for the root node
- At each step the parser takes the leftmost stub
- If the stub is labeled by terminal symbol t, the
parser connects it to the next input terminal
symbol, which must be t. (If not, the parser has
detected a syntactic error.)
- If the stub is labeled by nonterminal symbol N,
the parser chooses one of the production rules
N → X1 … Xn, and grows branches from the node
labeled by N to new stubs labeled X1, …, Xn (in
order from left to right).
- Parsing succeeds when and if the whole input
string is connected up to the syntax tree.
118Top-Down Parsing
- Two forms
- Backtracking parsers
- Guess which rule to apply, then back up and change
the choice if they cannot proceed
- Predictive parsers
- Predict which rule to apply by using look-ahead
tokens
Backtracking parsers are not very efficient. We
will cover predictive parsers.
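For contrast with the predictive parsers covered next, here is a minimal sketch of a backtracking top-down recognizer. It assumes the Micro-English grammar and a grammar without left recursion (a left-recursive rule would make it loop forever). The names GRAMMAR, match and accepts are illustrative.

```python
GRAMMAR = {
    "Sentence": [["Subject", "Verb", "Object", "."]],
    "Subject": [["I"], ["a", "Noun"], ["the", "Noun"]],
    "Object": [["me"], ["a", "Noun"], ["the", "Noun"]],
    "Noun": [["cat"], ["mat"], ["rat"]],
    "Verb": [["like"], ["is"], ["see"], ["sees"]],
}


def match(symbols, tokens, pos):
    """Yield every input position reachable after matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Nonterminal: guess each rule in turn; a failed guess simply yields
        # nothing, which makes the caller fall through to the next guess.
        for rhs in GRAMMAR[first]:
            yield from match(rhs + rest, tokens, pos)
    elif pos < len(tokens) and tokens[pos] == first:
        # Terminal: it must equal the next input token.
        yield from match(rest, tokens, pos + 1)


def accepts(tokens):
    """True if some sequence of guesses consumes the whole input."""
    return any(end == len(tokens) for end in match(["Sentence"], tokens, 0))


print(accepts("the cat sees a rat .".split()))
```

Every alternative may be tried at every nonterminal, which is where the inefficiency comes from; a predictive parser replaces the guessing loop with a decision based on the look-ahead token.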
119Predictive Parsers
- Many types
- LL(1) parsing
- The first L is for scanning the input from left to right,
the second L is for producing a leftmost derivation, and
the 1 is for using one input symbol of look-ahead
- Table driven, with an explicit stack to maintain
the parse tree
- Recursive-descent parsing
- Uses recursive subroutines to traverse the parse
tree
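A sketch of the table-driven LL(1) approach with an explicit stack. The grammar here is an assumption: the digit expression grammar with its left recursion removed (expr → digit rest, rest → + digit rest | - digit rest | empty), since LL(1) parsing cannot handle left-recursive rules. The table is written by hand and the names TABLE, END and ll1_parse are illustrative.

```python
DIGITS = "0123456789"
END = "$"  # assumed end-of-input marker
NONTERMINALS = {"expr", "rest", "digit"}


def build_table():
    """Map (nonterminal, look-ahead token) to the right-hand side to expand."""
    table = {}
    for d in DIGITS:
        table[("expr", d)] = ["digit", "rest"]
        table[("digit", d)] = [d]
    for op in "+-":
        table[("rest", op)] = [op, "digit", "rest"]
    table[("rest", END)] = []  # rest -> empty
    return table


TABLE = build_table()


def ll1_parse(tokens):
    """Return the expansions performed, which form a leftmost derivation."""
    tokens = list(tokens) + [END]
    stack = [END, "expr"]
    pos, steps = 0, []
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"unexpected {look!r} while expanding {top}")
            steps.append((top, rhs))
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        elif top == look:
            pos += 1  # terminal on the stack matches the input
        else:
            raise SyntaxError(f"expected {top!r}, found {look!r}")
    return steps


for lhs, rhs in ll1_parse("3+8-2"):
    print(lhs, "->", " ".join(rhs) or "(empty)")
```

A single look-ahead token is enough to pick each table entry, so the parser never backs up.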
120Predictive Parsers (Lookahead)
- Lookahead in predictive parsing
- The lookahead token (next token in the input) is
used to determine which rule should be used next
- For example
[Figure: partial parse tree with nodes term and num over the input token 7]
121Predictive Parsers (Lookahead)
[Figure: the parse tree grows as the tokens 3 and - are read; nodes term and num]
122Predictive Parsers (Lookahead)
[Figure: final steps of the parse tree over the tokens 7, 3, - and 2; nodes term and num]
123Recursive-Descent Parsing
- Top-down parsing algorithm
- Consists of a group of methods (programs) parseN,
one for each nonterminal symbol N of the grammar.
- The task of each method parseN is to parse a
single N-phrase
- These parsing methods cooperate to parse complete
sentences
124Recursive-Descent Parsing
[Figure: syntax tree for "the cat sees a rat."; Sentence has stubs Subject, Verb, Object and "."]
- Decide which production rule to apply. There is
only one: rule 1.
- This step creates four stubs.
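125Recursive-Descent Parsing
[Figure: syntax tree under construction for "the cat sees a rat."; nodes Sentence, Subject, Verb, Object, "." and one Noun]
126Recursive-Descent Parsing
[Figure: syntax tree under construction for "the cat sees a rat."; nodes Sentence, Subject, Verb, Object, "." and one Noun]
127Recursive-Descent Parsing
[Figure: syntax tree under construction for "the cat sees a rat."; nodes Sentence, Subject, Verb, Object, "." and one Noun]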
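128Recursive-Descent Parsing
[Figure: syntax tree under construction for "the cat sees a rat."; nodes Sentence, Subject, Verb, Object, "." and two Noun nodes]
129Recursive-Descent Parsing
[Figure: syntax tree under construction for "the cat sees a rat."; nodes Sentence, Subject, Verb, Object, "." and two Noun nodes]
130Recursive-Descent Parsing
[Figure: syntax tree under construction for "the cat sees a rat."; nodes Sentence, Subject, Verb, Object, "." and two Noun nodes]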
131Recursive-Descent Parser for Micro-English
- Sentence → Subject Verb Object .
- Subject → I | a Noun | the Noun
- Object → me | a Noun | the Noun
- Noun → cat | mat | rat
- Verb → like | is | see | sees
- parseSentence
- parseSubject
- parseObject
- parseVerb
- parseNoun
132Recursive-Descent Parser for Micro-English
- Sentence → Subject Verb Object .
- parseSentence
  - parseSubject
  - parseVerb
  - parseObject
  - parseEnd
133Recursive-Descent Parser for Micro-English
- Subject → I | a Noun | the Noun
- parseSubject
  - if input is "I"
    - accept
  - else if input is "a"
    - accept
    - parseNoun
  - else if input is "the"
    - accept
    - parseNoun
  - else error
134Recursive-Descent Parser for Micro-English
- Noun → cat | mat | rat
- parseNoun
  - if input is "cat"
    - accept
  - else if input is "mat"
    - accept
  - else if input is "rat"
    - accept
  - else error
135Recursive-Descent Parser for Micro-English
- Object → me | a Noun | the Noun
- parseObject
  - if input is "me"
    - accept
  - else if input is "a"
    - accept
    - parseNoun
  - else if input is "the"
    - accept
    - parseNoun
  - else error
136Recursive-Descent Parser for Micro-English
- Verb → like | is | see | sees
- parseVerb
  - if input is "like"
    - accept
  - else if input is "is"
    - accept
  - else if input is "see"
    - accept
  - else if input is "sees"
    - accept
  - else error
137Recursive-Descent Parser for Micro-English
- Sentence → Subject Verb Object .
- parseEnd
  - if input is "."
    - accept
  - else error
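The parsing routines for Micro-English can be put together as one small recognizer. This is a sketch under assumptions: the class name MicroEnglishParser, the whitespace-based scanner and the use of Python's SyntaxError for error reporting are illustrative choices, not the course's implementation.

```python
class MicroEnglishParser:
    """Recursive-descent recognizer: one parse_N method per nonterminal N."""

    NOUNS = ("cat", "mat", "rat")
    VERBS = ("like", "is", "see", "sees")

    def __init__(self, text):
        # Assumed scanner: words separated by spaces, "." as its own token.
        self.tokens = text.replace(".", " . ").split()
        self.pos = 0

    @property
    def current(self):
        """The current input token, or None at the end of the input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def accept(self, *expected):
        """Consume the current token if it is one of `expected`, else fail."""
        if self.current not in expected:
            raise SyntaxError(f"expected one of {expected}, found {self.current!r}")
        self.pos += 1

    def parse_sentence(self):   # Sentence -> Subject Verb Object .
        self.parse_subject()
        self.parse_verb()
        self.parse_object()
        self.accept(".")        # the parseEnd step

    def parse_subject(self):    # Subject -> I | a Noun | the Noun
        if self.current == "I":
            self.accept("I")
        else:
            self.accept("a", "the")
            self.parse_noun()

    def parse_object(self):     # Object -> me | a Noun | the Noun
        if self.current == "me":
            self.accept("me")
        else:
            self.accept("a", "the")
            self.parse_noun()

    def parse_noun(self):       # Noun -> cat | mat | rat
        self.accept(*self.NOUNS)

    def parse_verb(self):       # Verb -> like | is | see | sees
        self.accept(*self.VERBS)

    def parse(self):
        self.parse_sentence()
        if self.current is not None:
            raise SyntaxError(f"unexpected {self.current!r} after the sentence")
        return True


print(MicroEnglishParser("the cat sees a rat.").parse())
```

The "a" and "the" branches are merged because both continue with a Noun; the if/else-if chain of the pseudocode would behave the same way.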
138Systematic Development of a Recursive-Descent
Parser
- Given a (suitable) context-free grammar
- Express the grammar in EBNF, with a single
production rule for each nonterminal symbol, and
perform any necessary grammar transformations
- Always eliminate left recursion
- Always left-factorize whenever possible
- Transcribe each EBNF production rule N → X to a
parsing method parseN, whose body is determined
by X
- Make the parser consist of
- A private variable currentToken
- Private parsing methods developed in the previous
step
- Private auxiliary methods accept and acceptIt,
both of which call the scanner
- A public parse method that calls parseS (where S
is the start symbol of the grammar), having first
called the scanner to store the first input token
in currentToken
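A sketch of the resulting parser layout, written in Python for brevity. It assumes the digit expression grammar after left-recursion elimination, expressed in EBNF as expr → digit (("+" | "-") digit)*. The Token and Scanner classes are hypothetical stand-ins for a real scanner, and the underscore prefix plays the role of "private".

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str       # "digit", "op" or "eot" (end of text)
    spelling: str


class Scanner:
    """Hypothetical scanner: one token per non-blank character."""

    def __init__(self, text):
        self.chars = [c for c in text if not c.isspace()]
        self.pos = 0

    def scan(self):
        if self.pos >= len(self.chars):
            return Token("eot", "")
        c = self.chars[self.pos]
        self.pos += 1
        if c in "0123456789":
            return Token("digit", c)
        if c in "+-":
            return Token("op", c)
        raise SyntaxError(f"illegal character {c!r}")


class Parser:
    def __init__(self, scanner):
        self._scanner = scanner
        self._current_token = None          # the currentToken variable

    def _accept_it(self):
        """Unconditionally move on to the next token."""
        self._current_token = self._scanner.scan()

    def _accept(self, kind):
        """Move on only if the current token has the expected kind."""
        if self._current_token.kind != kind:
            raise SyntaxError(f"expected {kind}, found {self._current_token.kind}")
        self._accept_it()

    def _parse_expr(self):                  # expr -> digit (op digit)*
        self._parse_digit()
        while self._current_token.kind == "op":
            self._accept_it()
            self._parse_digit()

    def _parse_digit(self):
        self._accept("digit")

    def parse(self):
        self._current_token = self._scanner.scan()   # load the first token
        self._parse_expr()
        self._accept("eot")                 # nothing may follow the expression
        return True


print(Parser(Scanner("3 + 8 - 2")).parse())
```

The left-recursive rule is replaced by a loop, so the parsing method for expr never calls itself without first consuming a token.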
139Quote of the Week
- "C makes it easy to shoot yourself in the foot;
C++ makes it harder, but when you do, it blows
away your whole leg."
- Bjarne Stroustrup
140Quote of the Week
- Did you really say that?
- Dr. Bjarne Stroustrup:
- Yes, I did say something along the lines of "C
makes it easy to shoot yourself in the foot; C++
makes it harder, but when you do, it blows your
whole leg off." What people tend to miss is that
what I said about C++ is to a varying extent true
for all powerful languages. As you protect people
from simple dangers, they get themselves into new
and less obvious problems. Someone who avoids
the simple problems may simply be heading for a
not-so-simple one. One problem with very
supporting and protective environments is that
the hard problems may be discovered too late or
be too hard to remedy once discovered. Also, a
rare problem is harder to find than a frequent
one because you don't suspect it.
- I also said, "Within C++, there is a much smaller
and cleaner language struggling to get out." For
example, that quote can be found on page 207 of
The Design and Evolution of C++. And no, that
smaller and cleaner language is not Java or C#.
The quote occurs in a section entitled "Beyond
Files and Syntax". I was pointing out that the
C++ semantics is much cleaner than its syntax. I
was thinking of programming styles, libraries and
programming environments that emphasized the
cleaner and more effective practices over archaic
uses focused on the low-level aspects of C.
141Converting EBNF Production Rules to Parsing
Methods
- For production rule N → X
- Convert the production rule to a parsing method
named parseN
  - private void parseN () {
  - parse X
  - }
- Refine parse ε to a dummy statement
- Refine parse t (where t is a terminal symbol) to
accept(t) or acceptIt()
- Refine parse N (where N is a nonterminal symbol)
to a call of the corresponding parsing method
  - parseN();
- Refine parse X Y to
  - {
  - parse X
  - parse Y
  - }
- Refine parse X | Y to
  - switch (currentToken.kind) {
  - cases in starters[X]:
  - parse X
  - break;
  - cases in starters[Y]:
  - parse Y
  - break;
  - }
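The choice between alternatives depends on the starter sets of X and Y. The sketch below computes them for part of the Micro-English grammar, assuming no left recursion and no empty right-hand sides. The Phrase rule is a made-up example of a conflict, and the names starters, choose and alternatives_disjoint are illustrative.

```python
GRAMMAR = {
    "Subject": [["I"], ["a", "Noun"], ["the", "Noun"]],
    "Object": [["me"], ["a", "Noun"], ["the", "Noun"]],
    "Noun": [["cat"], ["mat"], ["rat"]],
    # Hypothetical rule, added only to show overlapping starter sets.
    "Phrase": [["Subject"], ["Object"]],
}


def starters(symbols):
    """Set of terminals that can begin a phrase derived from `symbols`."""
    first = symbols[0]
    if first not in GRAMMAR:            # a terminal starts only itself
        return {first}
    result = set()
    for rhs in GRAMMAR[first]:          # union over all alternatives
        result |= starters(rhs)
    return result


def alternatives_disjoint(nonterminal):
    """True if no token can start two different alternatives."""
    seen = set()
    for rhs in GRAMMAR[nonterminal]:
        s = starters(rhs)
        if seen & s:
            return False
        seen |= s
    return True


def choose(nonterminal, token):
    """Pick the alternative to parse, as the switch statement would."""
    matches = [rhs for rhs in GRAMMAR[nonterminal] if token in starters(rhs)]
    if len(matches) != 1:
        raise ValueError(f"{len(matches)} alternatives of {nonterminal} start with {token!r}")
    return matches[0]


print(choose("Subject", "the"))
print(alternatives_disjoint("Subject"), alternatives_disjoint("Phrase"))
```

For Phrase, both alternatives can start with "a" or "the", so one token of look-ahead cannot decide between them and the grammar would need to be transformed first.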
142Converting EBNF Production Rules to Parsing
Methods
- For X | Y
- Choose parse X only if the current token is one
that can start an X-phrase
- Choose parse Y only if the current token is one
that can start a Y-phrase
- starters[X] and starters[Y] must be disjoint
- For X*
- Choose
- while (currentToken.kind is in starters