Title: Compiler
1 Introduction
2 References
- Textbook
- Compilers: Principles, Techniques, and Tools, Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman, Second Edition, Addison-Wesley, 2007
- References
- Programming Language Processors in Java: Compilers and Interpreters, D.A. Watt and D.F. Brown, Pearson Education Ltd.
- Assessment: 25% coursework, 75% final exam
3 Objectives
- To introduce the principles, techniques, and tools for compiler construction
- To learn what a compiler does and how to build one
4 Course Outline
- 1. Introduction: structure of a compiler
- 2. Lexical Analysis: tokens, regular expressions
- 3. Parsing: context-free grammars, predictive parsing
- 4. Abstract Syntax: semantic actions, abstract parse trees
- 5. Semantic Analysis: symbol tables, bindings, type-checking
- 6. Stack Frames: representation and abstraction
- 7. Intermediate Code: representation trees, translation
5 Why do we need to know about compilers?
- All software is written in a programming language.
- Learning about compilers will teach you a lot about the programming languages you already know.
- Seeing the development of a compiler gives you a feeling for how programs work.
- Compilers are a great example of the interplay between theory and practice.
- Many algorithms and models you will use in compilers are fundamental, and will be useful to you elsewhere:
- automata, regular expressions (lexing)
- context-free grammars, trees (parsing)
- hash tables (symbol table)
- dynamic programming, graph coloring (code generation)
6 Why study compilers?
- Compilers improve programming productivity.
- To enhance understanding of programming languages.
- To gain in-depth knowledge of low-level machine executables.
- To write compilers and interpreters for various programming languages and domain-specific languages. Examples: Java, JavaScript, C, C++, C#, Modula-3, Scheme, ML, Tcl/Tk, database query languages, Mathematica, Matlab, shell command languages, Awk, Perl, your .mailrc file, HTML, TeX, PostScript, Kermit scripts, ...
- To learn interesting compiler theory and algorithms.
- To learn the beauty of programming in a modern programming language.
- To learn how to use compilers well.
- To learn how to write them.
- To illuminate programming language design.
- As an example of a large software system.
- To motivate interest in formal language theory.
7 Computer Organization
Diagram (layers, top to bottom): Application, Compiler, Operating System, Hardware
8 History, Programming Languages
- Machine coding (binary programming, punched holes) (first generation)
- The computer's native language: binary digits (0s, 1s)
- 0100 0001 0110 1110 0100 0001
- 0001 0010 1100 0100 0000 1101
- Programming in machine code
- is very slow,
- is error prone,
- requires a detailed knowledge of the relevant computer architecture,
- makes it difficult to understand other people's code,
- produces code that becomes obsolete if the machine is changed.
- Assembly Language (second generation)
- One-to-one correspondence to machine language
- MOV AX, 5h
- MOV DX, 3h
- ADD A
- Assembler: translates assembly language programs into machine language
9 History, Programming Languages (High-Level Languages)
- Procedural Languages (third generation)
- Instructions translate into machine language instructions
- Uses common words rather than abbreviated mnemonics
- C, C++, Java, Fortran, QuickBasic
- A 3
- B A 2 - 1
- D A / B A5
- Compiler - translates the entire program at once
- Interpreter - translates and executes one source
program statement at a time
10 History, Programming Languages (High-Level Languages)
- Nonprocedural Languages (fourth generation)
- Allows the user to specify the desired result without having to specify the detailed procedures needed for achieving the result.
- Structured Query Language (SQL)
- Natural Language Programming Languages (fifth generation (intelligent) languages)
- Translates natural languages into a structured, machine-readable form
11 High-Level Languages
- Expressions: such as +, -, *, /
- Data Types: simple types (e.g. Boolean, int, float) as well as composite structures (records and arrays), which can be defined by the programmer
- Control Structures: allow programming of selective computation as well as iterative computation
- Declarations: introduce identifiers to denote constant values, variables, procedures, etc.
- Abstraction: separation of concerns, i.e. break a problem up and deal with the sub-problems
- Encapsulation (data abstraction): grouping related items and selectively hiding specific information (e.g. classes)
12 Why high-level languages?
- Understandability (readability)
- Naturalness (languages for different applications)
- Portability (machine-independent)
- Efficient to use (development time)
13 Language Processors
- Editors (to enter text): they can process text based on the logical structure of the text.
- Translator: translates text from one language to another
- Compiler: translates from a high-level language to a low-level language
- Interpreter: takes a text (in a particular language) and runs it immediately
- Assembler: translates from an assembly language into the corresponding machine code. Assembly language is easier to produce as output and easier to debug.
14 Language Processors
- Simulator, Emulator: machine code is interpreted -> machine code, e.g. simulating one processor on an existing processor.
- Preprocessor: extended high-level language -> high-level language. Preprocessors are sometimes called before the actual compilation process, e.g. to remove comments, include the text of other files, and perform macro substitutions (replace shorthand notation with a longer piece of text).
- Natural language translators
- e.g. Chinese -> English
15 Assembler
- The Assembler is responsible for translating the target code (usually assembly code) into executable machine code.
- The assembly code is a mnemonic version of machine code in which
- 1. Names are used instead of binary codes for operations (Code Table).
- 2. Names are used for operands instead of memory locations (Symbol Tables).
- Assembly level programming
- - improves productivity,
- - is less error prone,
- - is somewhat easier to understand,
- - produces code as efficient as machine code,
- but
- - it requires detailed knowledge of a computer architecture,
- - code is machine dependent,
- - code is obsolete when the machine is changed.
- It soon became apparent that programming should be done in a machine-independent language (HLL).
16 Compilers and Interpreters
- Interpreters are another class of translators.
- A compiler translates a program once and for all into the target language (e.g. C).
- An interpreter effectively translates a source program every time it is run (e.g. Basic).
- Compilers and interpreters can be used together as a hybrid (e.g. Java):
- Java is compiled into Java byte code,
- byte code is interpreted by a Java Virtual Machine (JVM).
17 What is a Compiler?
- A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).
Diagram: Source Program -> Compiler -> Target Program; the Compiler also reports Errors.
18 Compiler
- Source programs: many possible source languages, from traditional to application-specific languages.
- Programming languages (high-level)
- Modeling languages
- Document description languages
- Database query languages
- Target programs: another programming language, often the machine language of a particular computer system.
- High-level programming language
- Low-level programming language (assembler or machine code)
- Application-specific target language
- Error messages: essential for program development
19 Do we need Compilers?
- Machines understand only 1s and 0s. High-level languages make it easier for the user to program, but not for the machine to understand.
- Once the programmer has written and edited the program (in an editor), it needs to be translated into machine language (1s and 0s) before it can be executed.
- Compilers are used to do this conversion.
20 Where are compilers used?
- Implementation of programming languages
- C, C++, Java, Lisp, Prolog, SML, Haskell, Ada, Fortran.
- Document processing
- DVI -> PostScript,
- Word documents -> PDF
- Natural language processing
- NL -> database query language -> database commands
- Hardware design
- silicon compilers, CAD data -> machine operations, equipment lists
- Report generation
- CAD data -> list of parts,
- All kinds of input/output translations
- various UNIX text filters, . . .
21 Interpreter
- Given the program source code and the run-time input, interpret the source code directly, i.e. parse and simulate it, statement by statement (syntax-directed interpretation)
- UNIX shells (command line interpreters)
- Early interpreters for BASIC, LISP, APL
- Good for debugging
- Very slow, but OK for small scripts
22 Compiler / Translator and Interpreter
- A translator is used to produce an equivalent program in another language (e.g. from C to Pascal)
- A compiler is a translator that generally takes in a higher-level language (e.g. C) and transforms it into a low-level language (usually object or machine code).
- A compiler/translator produces the entire output code before executing
- An interpreter translates and executes one statement at a time before moving on to the next statement
23 Compiler / Translator and Interpreter
- The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.
- An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
24 Interpreters versus Compilers
What are the tradeoffs between compilation and interpretation?
- Compilers typically offer more advantages when
- programs are deployed in a production setting
- programs are repetitive
- the instructions of the programming language are complex
- Interpreters typically are a better choice when
- we are in a development/testing/debugging stage
- programs are run once and then discarded
- the instructions of the language are simple
- the execution speed is overshadowed by other factors
- e.g. on a web server where communication costs are much higher than execution costs
25 Hybrid compiler / interpreter
26 How does Java work?
A benefit of this arrangement in Java is that bytecodes compiled on one machine can be interpreted on another machine, perhaps across a network.
27 Program execution
- Three phases of execution
- "Compile time"
- 1. Source program -> object program (compiling)
- 2. Linking, loading -> absolute program
- "Run-time"
- 3. Input -> output
- Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine.
- The linker resolves external memory addresses, where the code in one file may refer to a location in another file.
- The loader then puts together all of the executable object files into memory for execution.
28 Loader and Linker
- The machine code generated by the Assembler can be executed only if it is allocated in Main Memory starting from address 0.
- Since this is not possible, the Loader alters the relocatable addresses of the code to place both instructions and data in the right place in Main Memory.
- The starting free address, L, in Main Memory at which the program is allocated is called the Relocation Factor.
- The Loader must
- 1. Add the relocation factor L to each relocatable address.
- 2. Leave each absolute address unaltered, e.g. the addresses of I/O devices.
- The Linker links together the different files/modules of a single program and, possibly, adds library files.
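The two Loader rules can be illustrated with a minimal Python sketch. The `Word` record, the `relocate` function, and the sample addresses are hypothetical names and values chosen for illustration; real object-file formats carry relocation information in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One address field of an object file (hypothetical layout)."""
    address: int
    relocatable: bool  # True: relative to program start; False: absolute


def relocate(words, relocation_factor):
    """Apply the two Loader rules: add L to relocatable addresses, keep absolute ones."""
    return [
        Word(w.address + relocation_factor, w.relocatable) if w.relocatable else w
        for w in words
    ]


program = [
    Word(0, True),        # jump target inside the program
    Word(12, True),       # data item inside the program
    Word(0xFF00, False),  # absolute address, e.g. an I/O device
]
loaded = relocate(program, relocation_factor=4000)  # L = 4000 (made-up value)
print([w.address for w in loaded])
```

With L = 4000 the two relocatable addresses become 4000 and 4012, while the absolute address 0xFF00 is left unchanged.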
29 The phases of a compiler
Diagram: Lexical analyser -> Syntax analyser -> Semantic analyser -> Intermediate code generator -> Code optimizer -> Code generator. The Symbol table manager and the Error Handler interact with all of these phases.
30 Analysis-Synthesis Model of Compilation
- There are two parts of compilation
- Part 1, Analysis: breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
- Part 2, Synthesis: constructs the desired target program from the intermediate representation. It requires the most specialized techniques.
31 Part 1, Analysis of the Source Program
- Analysis consists of three phases
- Lexical Analysis (Linear Analysis or Scanning): the stream of characters is read from left to right and grouped into tokens, which are sequences of characters having a collective meaning.
- Syntax Analysis (Hierarchical Analysis or Parsing): characters or tokens are grouped hierarchically into nested collections with collective meaning.
- Semantic Analysis: certain checks are performed to ensure that the components of a program fit together meaningfully.
32 Lexical Analysis (Linear Analysis / Scanning)
- Input: sequence of characters
- Output: tokens (basic symbols, groups of successive characters which belong together logically).
- Translate the input program, entered as a sequence of characters, into a sequence of words or symbols (tokens). For example, the keyword for should be treated as a single entity, not as a 3-character string.
- position := initial + rate * 60
- The assignment statement would be grouped into the following tokens
- 1. The identifier position
- 2. The assignment symbol :=
- 3. The identifier initial
- 4. The plus sign
- 5. The identifier rate
- 6. The multiplication sign
- 7. The number 60
- Note: the blanks separating the characters of these tokens would normally be eliminated during lexical analysis
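The grouping into seven tokens can be sketched with a small regular-expression scanner in Python. This is only an illustration under stated assumptions: the statement is taken to read `position := initial + rate * 60` (the operator symbols are an assumption about the slide's example), and the token names (`ID`, `ASSIGN`, ...) and the `tokenize` function are hypothetical.

```python
import re

# Token classes as (name, regular expression); order matters for alternatives.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"\s+"),   # blanks are dropped, as the note above says
    ("ERROR", r"."),    # any other character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Group a character sequence into (kind, lexeme) tokens."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(
                f"unexpected character {match.group()!r} at position {match.start()}"
            )
        tokens.append((kind, match.group()))
    return tokens


tokens = tokenize("position := initial + rate * 60")
for kind, lexeme in tokens:
    print(kind, lexeme)
```

The scanner yields the seven tokens listed above, in order, and raises an error for a character that belongs to no token class.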
33 Lexical Analysis
Diagram: Lexical Analysis turns the character sequence "S o m e o n e b r e a k s t h e i c e" into the words "Someone breaks the ice"; in the same way it turns "final := initial + rate * 60" into "id1 := id2 + id3 * 60".
34 Syntax Analysis (Hierarchical Analysis or Parsing)
- Input: sequence of tokens
- Output: parse tree, error messages
- It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Usually, the grammatical phrases of the source program are represented by a parse tree.
- Determine the structure of the program, for example, identify the components of each statement and expression and check for syntax errors.
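As an illustration of grouping tokens into a tree, here is a minimal recursive-descent parser sketch in Python. The grammar (statement -> ID := expression; expression -> term { + term }; term -> factor { * factor }; factor -> ID | NUMBER), the token names, and the tuple-based tree shape are all assumptions made for this example, not the course's definitions.

```python
class Parser:
    """Recursive-descent parser for: ID := expr, with + and * over IDs and numbers."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()} at token {self.pos}")
        lexeme = self.tokens[self.pos][1]
        self.pos += 1
        return lexeme

    def statement(self):
        target = self.expect("ID")
        self.expect("ASSIGN")
        value = self.expression()
        if self.peek() != "EOF":
            raise SyntaxError(f"unexpected {self.peek()} at token {self.pos}")
        return (":=", ("id", target), value)

    def expression(self):
        node = self.term()
        while self.peek() == "PLUS":
            self.expect("PLUS")
            node = ("+", node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() == "TIMES":
            self.expect("TIMES")
            node = ("*", node, self.factor())
        return node

    def factor(self):
        if self.peek() == "NUMBER":
            return ("num", int(self.expect("NUMBER")))
        return ("id", self.expect("ID"))


TOKENS = [
    ("ID", "position"), ("ASSIGN", ":="), ("ID", "initial"),
    ("PLUS", "+"), ("ID", "rate"), ("TIMES", "*"), ("NUMBER", "60"),
]
tree = Parser(TOKENS).statement()
print(tree)
```

Because `term` is called from `expression`, the multiplication binds tighter than the addition, so `rate * 60` becomes a subtree of the `+` node. A token sequence that does not fit the grammar raises `SyntaxError`.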
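35 Syntax Analysis
Diagram: Syntax Analysis turns the words "Someone breaks the ice" into a parse tree whose root, sentence, has the children subject (Someone), verb (breaks), and object (the ice); in the same way it turns "id1 := id2 + id3 * 60" into a tree.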
36 Semantic Analysis
- Input: parse tree + symbol table
- Output: annotated tree (abstract tree with attributes) + symbol table (variables, information on their types, ...)
- Checks the source program for semantic errors and gathers type information for the subsequent code generation phase
- It uses the hierarchical structure determined by the syntax-analysis phase
- Check that the program is reasonable, for example, that it does not include references to undefined variables.
- An important component of semantic analysis is type checking
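A minimal sketch of these two checks (undefined variables and type checking) in Python follows. It assumes a tuple-based tree, a symbol table that is a plain dictionary from names to the types "int" or "real", and an `i2r` node for integer-to-real conversion; all of these, and the function name `check`, are hypothetical choices for illustration.

```python
def check(node, symbols):
    """Return (annotated_node, type); raise on semantic errors."""
    kind = node[0]
    if kind == "num":
        return node, "int"
    if kind == "id":
        name = node[1]
        if name not in symbols:
            raise NameError(f"undeclared variable {name!r}")
        return node, symbols[name]
    # Binary operator or assignment: check both sides first.
    op, left, right = node
    left, left_type = check(left, symbols)
    right, right_type = check(right, symbols)
    if left_type == "real" and right_type == "int":
        right = ("i2r", right)  # insert an explicit int-to-real conversion
    elif left_type == "int" and right_type == "real":
        if op == ":=":
            raise TypeError("cannot assign a real value to an int variable")
        left = ("i2r", left)
    result_type = "real" if "real" in (left_type, right_type) else "int"
    return (op, left, right), result_type


tree = (
    ":=",
    ("id", "position"),
    ("+", ("id", "initial"), ("*", ("id", "rate"), ("num", 60))),
)
symbols = {"position": "real", "initial": "real", "rate": "real"}
annotated, result_type = check(tree, symbols)
print(annotated)
```

With all three variables declared as real, the only change to the tree is that the integer 60 is wrapped in an `i2r` node. Removing `rate` from `symbols` makes the check fail with an undeclared-variable error.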
37 Semantic Analysis
Diagram: Semantic Analysis accepts "Someone plays the piano" (meaningful) and rejects "The piano plays someone" (meaningless).
38 Part 2, Synthesis
- Intermediate Code Generation: produces an internal form, a program for an abstract machine. It should be easy to produce and easy to translate into the target program.
- Code Optimization: attempts to improve the intermediate code, giving an internal form that is hopefully improved. The program can be fixed during the code optimization phase.
- Machine code / assembly code Generation: memory locations are selected for each of the variables used by the program. Intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.
39 Intermediate Code Generation
temp1 := i2r(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
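One way such a temporary-based intermediate code could be produced is a post-order walk over the annotated tree, creating one temporary per operation. The Python sketch below assumes a tuple-based tree with an `i2r` conversion node and writes assignment as `:=`; the tree shape and the function name `gen_tac` are hypothetical.

```python
import itertools


def gen_tac(tree):
    """Translate an assignment tree into a list of three-address instructions."""
    code = []
    counter = itertools.count(1)

    def new_temp():
        return f"temp{next(counter)}"

    def walk(node):
        kind = node[0]
        if kind == "id":
            return node[1]
        if kind == "num":
            return str(node[1])
        if kind == "i2r":
            operand = walk(node[1])
            temp = new_temp()
            code.append(f"{temp} := i2r({operand})")
            return temp
        # Binary operator: translate the operands, then emit the operation.
        op, left, right = node
        left_name = walk(left)
        right_name = walk(right)
        temp = new_temp()
        code.append(f"{temp} := {left_name} {op} {right_name}")
        return temp

    _, target, value = tree
    code.append(f"{walk(target)} := {walk(value)}")
    return code


tree = (
    ":=",
    ("id", "id1"),
    ("+", ("id", "id2"), ("*", ("id", "id3"), ("i2r", ("num", 60)))),
)
tac = gen_tac(tree)
print("\n".join(tac))
```

For this tree the sketch emits four instructions: the conversion of 60, the multiplication, the addition, and the final copy into id1.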
40 Code Optimization
Input to Code Optimization:
temp1 := i2r(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Output of Code Optimization:
temp1 := id3 * 60.0
id1 := id2 + temp1
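The improvement shown on this slide can be approximated by two simple passes: converting the integer constant to a real at compile time, and merging a final copy into the instruction that computed the copied temporary. The Python sketch below is one possible illustration, not the method the slides prescribe. Instructions are hypothetical tuples `(dest, op, arg1, arg2)`, and the merge step assumes the temporary is not used again afterwards. Unlike the slide, the sketch does not renumber temporaries, so the surviving temporary keeps the name temp2.

```python
def optimize(code):
    """Fold i2r on constants, then merge 't := x op y; v := t' into 'v := x op y'."""
    constants = {}  # temporary name -> constant value it is known to hold
    folded = []
    for dest, op, a, b in code:
        a = constants.get(a, a)
        b = constants.get(b, b)
        if op == "i2r" and a.isdigit():
            constants[dest] = str(float(int(a)))  # e.g. "60" -> "60.0"
        else:
            folded.append((dest, op, a, b))

    result = []
    for dest, op, a, b in folded:
        if op == "copy" and result and result[-1][0] == a and a.startswith("temp"):
            _, prev_op, prev_a, prev_b = result.pop()
            result.append((dest, prev_op, prev_a, prev_b))
        else:
            result.append((dest, op, a, b))
    return result


def show(instr):
    """Format one instruction tuple as text."""
    dest, op, a, b = instr
    if op == "copy":
        return f"{dest} := {a}"
    if b is None:
        return f"{dest} := {op}({a})"
    return f"{dest} := {a} {op} {b}"


code = [
    ("temp1", "i2r", "60", None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "copy", "temp3", None),
]
optimized = optimize(code)
print("\n".join(show(i) for i in optimized))
```

The four input instructions shrink to two: a multiplication by the constant 60.0 and an addition stored directly into id1.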
41 Code Generation
Input to the Code Generator:
temp1 := id3 * 60.0
id1 := id2 + temp1
Output of the Code Generator:
MOVF rate, R2
MULF 60, R2
MOVF initial, R1
ADDF R2, R1
MOVF R1, position
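A very naive code generator for this two-instruction example could load the first operand into a free register, apply the operation to it, and store the result unless it is a temporary. The Python sketch below is only an illustration: the tuple instruction format, the `names` mapping from id1..id3 to variable names, the two-register pool, and the function name `generate` are assumptions, and constants are emitted exactly as they are spelled in the intermediate code (so 60.0 rather than 60).

```python
OPCODES = {"*": "MULF", "+": "ADDF"}


def generate(code, names):
    """Translate (dest, op, arg1, arg2) tuples into two-address assembly text."""
    free = ["R1", "R2"]  # pop() hands out R2 first, mirroring the slide
    in_register = {}     # temporary name -> register currently holding it
    asm = []

    def operand(x):
        if x in in_register:
            return in_register[x]
        return names.get(x, x)  # variable name, or the constant itself

    for dest, op, a, b in code:
        reg = free.pop()
        asm.append(f"MOVF {operand(a)}, {reg}")
        asm.append(f"{OPCODES[op]} {operand(b)}, {reg}")
        if b in in_register:
            free.append(in_register.pop(b))  # the temporary is no longer needed
        if dest.startswith("temp"):
            in_register[dest] = reg          # keep the temporary in its register
        else:
            asm.append(f"MOVF {reg}, {names.get(dest, dest)}")
            free.append(reg)
    return asm


code = [
    ("temp1", "*", "id3", "60.0"),
    ("id1", "+", "id2", "temp1"),
]
names = {"id1": "position", "id2": "initial", "id3": "rate"}
asm = generate(code, names)
print("\n".join(asm))
```

The sketch keeps temp1 in register R2 instead of giving it a memory location, which is the kind of variable-to-register assignment the synthesis slide calls crucial. It handles only this simple pattern and runs out of registers on longer programs.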
42 Symbol Table
- Helps the other phases during compilation
- A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly.
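A hash table is a common way to get this fast lookup. The Python sketch below uses a dictionary keyed by identifier name; the record fields (`type`, `offset`), the fixed size of 4 per variable, and the class and method names are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    """Record for one identifier; the fields are its attributes."""
    name: str
    type: str
    offset: int = 0  # storage location relative to the start of the data area


class SymbolTable:
    def __init__(self):
        self._records = {}  # dict = hash table: average O(1) insert and lookup
        self._next_offset = 0

    def declare(self, name, type_, size=4):
        if name in self._records:
            raise ValueError(f"double declaration of {name!r}")
        record = Symbol(name, type_, self._next_offset)
        self._records[name] = record
        self._next_offset += size
        return record

    def lookup(self, name):
        """Return the record for name, or None if it was never declared."""
        return self._records.get(name)


table = SymbolTable()
for identifier in ("position", "initial", "rate"):
    table.declare(identifier, "real")
print(table.lookup("rate"))
```

Each declaration stores a record once; later phases can then retrieve or update the attributes of an identifier by name. A second declaration of the same name is reported as an error.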
43 Error Handler
- Discover an error.
- Write an error message.
- Correct the error (or guess, very difficult!)
- Restart from the error (try to continue)
- Each phase can encounter errors. After detecting an error, a phase must somehow deal with that error so that compilation can proceed, allowing further errors in the source program to be detected.
44 Examples of error messages
- Lexical analysis
- Faulty sequence of characters which does not result in a token,
- e.g. Ö, 5EL, K, string
- Syntax analysis
- Syntax error (e.g. missing semicolon), (4 (y 5) - 12))
- Semantic analysis
- Type conflict, e.g. HEJ5
- Code optimization
- Uninitialized variables, anomaly detection.
- Code generation
- Too large integers, run out of memory.
- Table management
- Double declaration, table overflow.
- A good compiler finds an error at the earliest occasion.
- Usually, some errors are left to run time, e.g. array index out of bounds
45 Inside the Compiler
Diagram (top to bottom):
sequence of characters
-> Lexical Analysis (scanner)
-> sequence of tokens
-> Syntactic Analysis / Parsing (parser)
-> Abstract Syntax Tree (AST)
-> Contextual Analysis / checking Static Semantics (checker)
-> verified / annotated AST
-> Optimization and Code Generation (optimizer, code generator)
46 Language Processing System
Diagram (top to bottom):
skeletal source program
-> Preprocessor: performs macro processing, file inclusion, rational preprocessing, language extension
-> source program
-> Compiler: split into 6 phases. Produces assembly code. Some compilers include the assembler too.
-> target assembly program
-> Assembler: converts mnemonics (assembly code) into object code. Two-pass assembler: 1. denote storage locations for identifiers in the symbol table; 2. translate code into machine code, translate locations into addresses
-> relocatable machine code
-> Loader/Linker: reads the file, placing relocatable addresses into proper locations in memory; links other object and library files with the object code
47 The Phases of a Compiler
48 Compiler Pass
- A compiler often finds it convenient to process the entire source program several times before generating code
- Each of these repetitions is called a pass
- A collection of phases is done only once (single pass) or multiple times (multi pass)
- Single pass: usually requires everything to be defined before being used in the source program
- Multi pass: the compiler may have to keep the entire program representation in memory
- A multi pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases.
49 Single Pass Compiler
A single pass compiler makes a single pass over the source text, parsing, analyzing and generating code all at once.
Dependency diagram of a typical Single Pass Compiler: the Compiler Driver calls the Syntactic Analyzer, which in turn calls the Contextual Analyzer and the Code Generator.
50 Compiler passes
Dependency diagram of a typical Multi Pass Compiler: the Compiler Driver calls the Syntactic Analyzer, the Contextual Analyzer, and the Code Generator.
51 The Phases of a Compiler
- Lexical analyser: decompose statement into tokens
- Syntax analyser: parsing, check order of tokens against the grammar, create Abstract Syntax Tree
- Semantic analyser: type checking, identify operators and operands
- Intermediate code generator: first translation, create temporary sub-result variables
- Code optimizer: improve speed, efficiency
- Code generator: generates final assembly code
- Symbol table manager: stores a record for each identifier and its attributes
- Error Handler: detects errors, reports errors