Comp 104: Operating Systems Concepts - PowerPoint PPT Presentation



1
Comp 104 Operating Systems Concepts
  • Introduction to Compilers

2
Today
  • Compilers
  • Definition
  • Structure
  • Passes
  • Lexical Analysis
  • Symbol table
  • Access methods

3
Compilers
  • Definition
  • A compiler is a program which translates a
    high-level source program into a lower-level
    object program (target)

source prog. → [analysis] → analysed prog. → [synthesis] → object prog.
4
History
  • Late 1940s (post-von Neumann)
  • Programs were written in machine code
  • C7 06 0000 0002 (move the number 2 to location
    0000 (hex))
  • Highly complex, tedious and prone to error
  • Assemblers appeared
  • Machine instructions given as mnemonics
  • MOV X,2 (assuming X names location 0000 (hex))
  • Greatly improved the speed and accuracy of
    writing code
  • But still non-trivial, and non-portable to new
    processors
  • Needed a mathematical notation
  • Fortran appeared between 1954 and 1957
  • X = 2
  • Exploited context-free grammars (Chomsky) and
    finite-state automata

5
Compiler
  • Responsible for converting source code into
    executable code.
  • Analyses the code to determine the functionality
  • Synthesises executable code for a given processor
  • Optimises code to improve performance, or exploit
    specific processor instructions
  • Assumes various data structures
  • Tokens
  • Variables, language keywords, syntactic
    constructs etc
  • Symbol Table
  • Relates user defined entities (variables,
    methods, classes etc) with their associated
    values or internal structures
  • Literal Table
  • Stores constants, strings, etc. Used to reduce
    the size of the resulting code
  • Syntax/Parse Tree
  • The resulting structure formed through the
    analysis of the code
  • Intermediate Code
  • Intermediate representation between different
    phases of the compilation

6
Phases and other tools
  • Interpreters
  • Unlike compilers, code is executed immediately
  • Slow execution; used more for scripting or
    functional languages
  • Assemblers
  • Construct final machine code from
    processor-specific assembly code
  • Often used as last phase of a compilation process
    to produce a binary executable.
  • Linkers
  • Collate separately compiled objects into a
    single file, including shared library objects or
    system calls.
  • Preprocessors
  • Called prior to the compilation process to
    perform macro substitutions
  • E.g. RATFOR preprocessor, or cpp for C code
  • Profilers
  • Collect statistics about the behaviour of a
    program, which can be used to improve the
    performance of the code.

7
Analysis and Synthesis
  • Analysis
  • checks that program constructs are legal and
    meaningful
  • builds up information about objects declared
  • Synthesis
  • takes analysed program and generates code
    necessary for its execution
  • Compilation based on language definition, which
    comprises
  • syntax
  • semantics

8
Compiler Structure
source program (character stream)
  → scanner → tokens
  → parser → IR (parse tree)
  → semantic routines → IR (tuples)
  → optimiser → code generator → target code
(all phases consult the SYMBOL TABLE; IR = Intermediate Representation)
9
Compiler Organisation
  • Each of compiler tasks described previously (in
    Compiler Structure) is a phase
  • Phases can be organised into a number of passes
  • a pass consists of one or more phases acting on
    some representation of the complete program
  • representations produced between source and
    target are Intermediate Representations (IRs)

10
Single Pass Compilers
  • One pass compilers very common because of their
    simplicity
  • No IRs; all phases of the compiler are interleaved
  • Compilation driven by parser
  • Scanner acts as subroutine of parser, returning a
    token on each call
  • As each phrase is recognised by the parser, it
    calls semantic routines to process declarations,
    check for semantic errors and generate code
  • Code not as efficient as multi-pass

11
Multi-Pass Compilers
  • Number of passes depends on number of IRs and on
    any optimisations
  • Multi-pass allows complete separation of phases
  • more modular
  • easier to develop
  • more portable
  • Main forms of IR
  • Abstract Syntax Tree (AST)
  • Intermediate Code (IC)
  • Postfix
  • Tuples
  • Virtual Machine Code

12
Compiler Implementation
  • Compilers often written in HLLs for ease of
    maintenance, portability, etc.
  • e.g. Pascal compiler written in C, runs on
    machine X
  • Problem: we always need both compilers available
  • To alter compiler
  • Make necessary changes
  • Re-compile using C compiler
  • To move to machine Y
  • Re-write code generator to produce code for Y
  • Compile compiler on machine Y (using Ys C
    compiler)

13
Bootstrapping
  • Suppose our compiler is written in the language
    it compiles
  • e.g. C compiler written in C language
  • We can then run compiler through itself!
  • Bootstrapping
  • To alter compiler
  • Make necessary changes
  • Run compiler through itself
  • To move to machine Y
  • Re-write code generator to produce code for Y
  • Run compiler through itself to generate version
    of compiler that will run directly on Y

14
The Scanner (Lexical Analyser)
  • Converts groups of characters into tokens
    (lexemes)
  • tokens usually represented as integers
  • white space and comments are skipped
  • Each token may be accompanied by a value
  • could be a pointer to further information
  • As identifiers encountered, entered into a symbol
    table
  • used to collect info. about declared objects
  • Scanners often hand-coded for efficiency, but may
    be automatically generated (e.g. Lex)

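The scanner loop above can be sketched in a few lines. This is only an illustration: the token names and the input below are invented for the sketch, and a real scanner (hand-coded or Lex-generated) would typically return integer token codes rather than strings.

```python
import re

# Token classes for the sketch; "SKIP" covers white space, which the
# scanner consumes but never returns (comments could be handled the same way).
TOKEN_SPEC = [
    ("NUM",  r"\d+"),           # integer literal
    ("ID",   r"[A-Za-z_]\w*"),  # identifier
    ("OP",   r"[+\-*/=()]"),    # single-character operator tokens
    ("SKIP", r"\s+"),           # white space: skipped
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Yield (token, lexeme) pairs for each group of characters."""
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("count = count + 42")))
```

A parser would call `scan` as a subroutine, pulling one token per call, exactly as described for one-pass compilation.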
15
Example
Source program:
  begin int a; float b; a = 1; b = 1.2; a = b + 1; print (a + 2); end
The scanner returns a TOKEN/VALUE pair for each lexeme; the identifiers
a and b are entered into the symbol table as they are encountered.
16
Symbol Table Access
  • The symbol table is used by most compiler phases
  • Even used post-compilation (debugging)
  • Structure of table and algorithms used can make
    difference between a slow and fast compiler
  • Methods
  • Sequential lookup
  • Binary chop and binary tree
  • Hash addressing
  • Hash chaining

17
Sequential Lookup
  • Table is just a vector of names
  • Search sequentially from beginning
  • If name not found, add to end
  • Advantages
  • Very simple to implement
  • Disadvantages
  • Inefficient
  • For table with N names, requires N/2 comparisons
    on average
  • Can slow down a compiler by a factor of 10 or more

18
Binary Chop
  • Keep names in alphabetical order
  • To find name
  • Compare with middle element to determine which
    half
  • Compare with middle element again to narrow down
    to quarter, etc.
  • Advantage
  • Much more efficient than sequential
  • log2 N - 1 comparisons on average
  • Disadvantage
  • Adding a new name means shifting up every name
    above it

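The binary chop can be sketched with Python's bisect module doing the halving search; the insertion also shows the shifting cost mentioned above (table contents are illustrative):

```python
import bisect

def lookup(table, name):
    """Binary chop: repeatedly halve the search range. Returns index or None."""
    i = bisect.bisect_left(table, name)
    return i if i < len(table) and table[i] == name else None

def add(table, name):
    """Insert keeping alphabetical order -- every name above it shifts up."""
    bisect.insort(table, name)

table = ["begin", "end", "while"]
add(table, "if")            # "while" must shift up to make room
print(table, lookup(table, "end"))
```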
19
Question
  • If the symbol table for a compiler is size 4096,
    how many comparisons on average need to be made
    when performing a lookup using the binary chop
    method?
  • 2
  • 11
  • 12
  • 16
  • 31

Answer: b 11, as there are log2 N - 1 comparisons
on average
20
Binary Tree
  • Each node contains pointer to 2 sub-trees
  • Left sub-tree contains all names < current
  • Right sub-tree has all names > current
  • Advantages
  • In best case, search time can be as good as
    binary chop
  • Adding a new name is simple and efficient
  • Disadvantages
  • Efficiency depends on how balanced the tree is
  • Tree can easily become unbalanced
  • In worst case, method as bad as sequential
    lookup!
  • May need to do costly re-balancing occasionally

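A minimal sketch of the binary-tree table (no rebalancing is attempted, so inserting names in sorted order would degrade it into the chain-shaped worst case noted above):

```python
class Node:
    """One symbol-table entry: a name plus two sub-trees."""
    def __init__(self, name):
        self.name, self.left, self.right = name, None, None

def insert(root, name):
    """Insert name, returning the (possibly new) root."""
    if root is None:
        return Node(name)
    if name < root.name:
        root.left = insert(root.left, name)
    elif name > root.name:
        root.right = insert(root.right, name)
    return root

def contains(root, name):
    """Walk left for smaller names, right for larger ones."""
    while root is not None:
        if name == root.name:
            return True
        root = root.left if name < root.name else root.right
    return False

root = None
for n in ["mid", "apple", "zoo", "kit"]:   # illustrative identifiers
    root = insert(root, n)
print(contains(root, "kit"), contains(root, "cat"))
```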
21
Hash Addressing
  • To determine position in table, apply a hash
    function, returning a hash key
  • Example fn Sum of character codes modulo N,
    where N is table size (prime)
  • Advantages
  • Can be highly efficient
  • Even similar names can generate totally different
    hash keys
  • Disadvantages
  • Requires hash function producing good
    distribution
  • Possibility of collisions
  • May require re-hashing mechanism, possibly
    multiple times

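The slide's example function can be sketched directly. The linear probe on collision is just one simple re-hashing policy among several, and the table size below is an illustrative prime:

```python
N = 211  # table size: a prime, as suggested above

def hash_key(name):
    """Sum of character codes modulo N."""
    return sum(ord(c) for c in name) % N

table = [None] * N

def insert(name):
    """Place name at its hash key, re-hashing linearly on collision."""
    k = hash_key(name)
    while table[k] is not None and table[k] != name:
        k = (k + 1) % N          # collision: try the next slot
    table[k] = name
    return k

# "fred" and "derf" contain the same characters, so this
# function gives them identical keys and they collide
print(insert("fred"), insert("derf"))
```

Note the weakness the slide warns about: any two anagrams collide under this function, so a better distribution usually mixes in character positions as well.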
22
Hash Chaining
  • As before, but link together names having same
    hash key

hash(fred) → (array of pointers) → fred → jim → ...
  • Number of comparisons needed is very small
23
Question
  • Concerning compilation, which of the following is
    NOT a method for symbol table access?
  • Sequential lookup
  • Direct lookup
  • Binary chop
  • Hash addressing
  • Hash chaining

Answer b Direct Lookup
24
Reserved Words
  • Words like for, while, if, etc. are
    reserved words
  • Could use binary chop on a table of reserved
    words first; if not there, search the symbol table
  • Simpler to pre-hash all reserved words into the
    symbol table and use one lookup mechanism

25
Today
  • Parsing
  • Context-free grammar BNF
  • Example The Micro language
  • Parse Tree
  • Abstract syntax tree

26
Parser (Syntax Analyser)
  • Reads tokens and groups them into units as
    specified by the language grammar, i.e. it
    recognises syntactic phrases
  • Parser must produce good errors and be able to
    recover from errors

27
Scanning and Parsing
source file / input stream:  sum = x1 + x2
  → Scanner (regular expressions define tokens) → tokens
  → Parser (BNF rules define grammar elements) → parse tree
28
Syntax
  • Defines the structure of legal statements in the
    language
  • Usually specified formally using a context-free
    grammar (CFG)
  • Notation most widely used is Backus-Naur Form
    (BNF), or extended BNF
  • A CFG is written as a set of rules (productions)
  • In extended BNF
  • {...} means zero or many
  • [...] means zero or one

29
Backus Naur Form
  • Backus Naur Form (BNF) is a standard notation for
    expressing syntax as a set of grammar rules.
  • BNF was developed by Noam Chomsky, John Backus,
    and Peter Naur.
  • First used to describe Algol.
  • BNF can describe any context-free grammar.
  • Fortunately, computer languages are mostly
    context-free.
  • Computer languages remove non-context-free
    meaning by either
  • (a) defining more grammar rules or
  • (b) pushing the problem off to the semantic
    analysis phase.

30
A Context-Free Grammar
  • A grammar is context-free if all the syntax rules
    apply regardless of the symbols before or after
    (the context).
  • Example

(1) sentence => noun-phrase verb-phrase .
(2) noun-phrase => article noun
(3) article => a | the
(4) noun => boy | girl | cat | dog
(5) verb-phrase => verb noun-phrase
(6) verb => sees | pets | bites

Terminal symbols: 'a' 'the' 'boy' 'girl' 'cat' 'dog' 'sees' 'pets' 'bites'
31
A Context-Free Grammar
A sentence that matches the productions (1) - (6)
is valid.
a girl sees a boy a girl sees a girl a girl sees
the dog the dog pets the girl a boy bites the
dog a dog pets the boy ...
To eliminate unwanted sentences without imposing a
context-sensitive grammar, specify semantic
rules: "a boy may not bite a dog"
32
Backus Naur Form
  • Grammar Rules or Productions define symbols.

assignment_stmt ::= id = expression
(the nonterminal symbol being defined, followed by its definition, the
production)

Nonterminal Symbols: anything that is defined on
the left-side of some production.
Terminal Symbols: things that are not defined by
productions. They can be literals, symbols, and
other lexemes of the language defined by lexical
rules, e.g.
  Identifiers: id = [A-Za-z_]\w*
  Delimiters, Operators: e.g. + - * /
33
Backus Naur Form (2)
  • Different notations (same meaning)
  • assignment_stmt ::= id = expression + term
  • <assignment-stmt> => <id> = <expr> + <term>
  • AssignmentStmt → id = expression + term
  • ::=, =>, → mean "consists of" or "defined
    as"
  • Alternatives: | ( "or" )
  • Concatenation

expression => expression + term | expression -
term | term
number => DIGIT number | DIGIT
34
Alternative Example
  • The following BNF syntax is an example of how an
    arithmetic expression might be constructed in a
    simple language
  • Note the recursive nature of the rules

35
Syntax for Arithmetic Expr.
<expression> ::= <term> | <addop> <term> |
<expression> <addop> <term>
<term> ::= <primary> | <term> <multop> <primary>
<primary> ::= <digit> | <letter> | ( <expression>
)
<digit> ::= 0 | 1 | 2 | ... | 9
<letter> ::= a | b | c | ... | y | z
<multop> ::= * | /
<addop> ::= + | -
  • Are the following expressions legal, according to
    this syntax?
  • i) -a
  • ii) bc(3/d)
  • iii) a(c-(4b))
  • iv) 5(9-e)/d

36
BNF rules can be recursive
  • expr => expr + term
  •      | expr - term | term
  • term => term * factor
  •      | term / factor
  •      | factor
  • factor => ( expr ) | ID | NUMBER
  • where the tokens are
  • NUMBER = [0-9]+
  • ID = [A-Za-z_][A-Za-z_0-9]*

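A recursive-descent sketch of this grammar: each nonterminal becomes a function, and the left-recursive rules are turned into loops, the standard transformation for top-down parsing. The nested-tuple parse tree is just one possible representation:

```python
import re

def parse(src):
    # crude scanner for the sketch: numbers, identifiers, operators
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", src) + ["<eof>"]
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():               # factor => ( expr ) | ID | NUMBER
        if peek() == "(":
            advance()
            tree = expr()
            assert advance() == ")", "missing close parenthesis"
            return tree
        return advance()        # an ID or NUMBER leaf

    def term():                 # left recursion on term becomes a loop
        tree = factor()
        while peek() in ("*", "/"):
            tree = (advance(), tree, factor())
        return tree

    def expr():                 # left recursion on expr becomes a loop
        tree = term()
        while peek() in ("+", "-"):
            tree = (advance(), tree, term())
        return tree

    tree = expr()
    assert peek() == "<eof>", "trailing input"
    return tree

print(parse("a + b * (c - 2)"))
```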
37
Uses of Recursion
  • Repetition
  • expr => expr + term
  •      => expr + term + term
  •      => expr + term + term + term
  •      => term + ... + term + term
  • Parser can recursively expand expr each time one
    is found
  • Could lead to arbitrary depth analysis
  • Greatly simplifies implementation

38
Example The Micro Language
  • To illustrate BNF parsing, consider an example
    imaginary language the Micro language
  • 1) A program is of the form begin sequence
    of statements end
  • 2) Only statements allowed are
  • assignment
  • read (list of variables)
  • write (list of expressions)

39
Micro
  • 3) Variables are declared implicitly
  • their type is integer
  • 4) Each statement ends in a semi-colon
  • Only operators are + and -
  • parentheses may be used

40
Micro CFG
  • 1) A program is of the form: begin
    statements end
  • 2) Permissible statements
  • assignment
  • read (list of variables)
  • write (list of expressions)
  • 3) Variables are declared implicitly
  • their type is integer
  • 4) Statements end in a semi-colon
  • 5) Valid operators are + and -, but
    parentheses can be used
  1. <program> ::= begin <stat-list> end
  2. <stat-list> ::= <statement> { <statement> }
  3. <statement> ::= id := <expr> ;
  4. <statement> ::= read ( <id-list> ) ;
  5. <statement> ::= write ( <expr-list> ) ;
  6. <id-list> ::= id { , id }
  7. <expr-list> ::= <expr> { , <expr> }
  8. <expr> ::= <primary> { <addop> <primary> }
  9. <primary> ::= ( <expr> )
  10. <primary> ::= id
  11. <primary> ::= intliteral
  12. <addop> ::= +
  13. <addop> ::= -

41
BNF
  • Items such as ltprogramgt are non-terminals
  • require further expansion
  • Items such as begin are terminals
  • correspond to language tokens
  • Usual to combine productions using | (or)
  • e.g. <primary> ::= ( <expr> ) | id |
    intliteral

42
Parsing
  • Bottom-up
  • Look for patterns in the input which correspond
    to phrases in the grammar
  • Replace patterns of items by phrases, then
    combine these into higher-level phrases, and so
    on
  • Stop when input converted to a single <program>
  • Top-down
  • Assume input is a <program>
  • Search for each of the sub-phrases forming a
    <program>, then for each of the sub-sub-phrases,
    and so on
  • Stop when we reach terminals
  • A program is syntactically correct iff it can be
    derived from the CFG

43
Question
  • Consider the following grammar, where S, A and B
    are non-terminals, and a and b are terminals
  • S ::= AB
  • A ::= a
  • A ::= BaB
  • B ::= bbA
  • Which of the following is FALSE?
  • The length of every string derived from S is
    even.
  • No string derived from S has an odd number of
    consecutive b's.
  • No string derived from S has three consecutive
    a's.
  • No string derived from S has four consecutive
    b's.
  • Every string derived from S has at least as many
    b's as a's.

Answer: c No string derived from S has three
consecutive a's
44
Example
  • Parse begin A := B + (10 - C); end

<program>
begin <stat-list> end (apply rule 1)
begin <statement> end (2)
begin id := <expr> ; end (3)
begin id := <primary> <addop> <primary> ; end (8)
begin id := <primary> + <primary> ; end (12)
...
45
Exercise
  • Complete the previous parse
  • Clue - this is the final line of the parse
  • begin id := id + (intliteral - id) ; end

46
Answer
  • Parse begin A := B + (10 - C); end
  • <program>
  • begin <stat-list> end (apply rule 1)
  • begin <statement> end (2)
  • begin id := <expr> ; end (3)
  • begin id := <primary> <addop> <primary> ; end (8)
  • begin id := <primary> + <primary> ; end (12)
  • begin id := id + <primary> ; end (10)
  • begin id := id + (<expr>) ; end (9)
  • begin id := id + (<primary> <addop> <primary>) ;
    end (8)
  • begin id := id + (<primary> - <primary>) ; end (13)
  • begin id := id + (intliteral - <primary>) ; end (11)
  • begin id := id + (intliteral - id) ; end (10)

47
Parse Tree
  • <program>
      begin <stat-list> end
              <statement>
              id := <expr> ;
                    <primary> <addop> <primary>
                     id         +     ( <expr> )
                                        <primary> <addop> <primary>
                                        intliteral   -      id
  • The parser creates a data structure representing
    how the input is matched to grammar rules.
  • Usually as a tree.
  • Also called syntax tree or derivation tree

48
Expression Grammars
  • For expressions, a CFG can indicate
    associativity and operator precedence, e.g.

<expr> ::= <factor> { <addop> <factor> }
<factor> ::= <primary> { <multop> <primary> }
<primary> ::= ( <expr> ) | id | literal

e.g. A + B * C parses as
<expr>
  <factor>           <addop>   <factor>
  <primary>             +      <primary> <multop> <primary>
   id (A)                       id (B)      *      id (C)
49
Ambiguity
  • A grammar is ambiguous if there is more than one
    parse tree for a valid sentence.
  • Example
  • expr => expr + expr | expr * expr | id
  •      | number
  • How would you parse x + y * z using this rule?

50
Example of Ambiguity
  • Grammar Rules
  • expr => expr + expr | expr * expr | (
    expr ) | NUMBER
  • Expression: 2 + 3 * 4
  • Two possible parse trees

51
Another Example of Ambiguity
  • Grammar rules
  • expr => expr + expr | expr - expr |
    ( expr ) | NUMBER
  • Expression: 2 - 3 - 4
  • Parse trees

52
Ambiguity
  • Ambiguity can lead to inconsistent
    implementations of a language.
  • Ambiguity can cause infinite loops in some
    parsers.
  • Specification of a grammar should be unambiguous!
  • How to resolve ambiguity
  • rewrite grammar rules to remove ambiguity
  • add some additional requirement for parser, such
    as "always use the left-most match first"
  • EBNF (later) helps remove ambiguity

53
Abstract Syntax Tree (AST)
  • More compact form of derivation tree
  • contains just enough info. to drive later
    phases, e.g. Y := 3*X + I

            :=
           /  \
      id (Y)    +
               / \
              *   id (I)
             / \
     const (3)  id (X)

(each node carries a tag and an attribute; id nodes point into the
symbol table)
54
Semantics
  • Specify meaning of language constructs
  • usually defined informally
  • A statement may be syntactically legal but
    semantically meaningless
  • colourless green ideas sleep furiously
  • Semantic errors may be
  • static (detected at compile time), e.g. a = x +
    true
  • dynamic (detected at run time), e.g. array
    subscript out of bounds

55
Question
  • If the array x contains 20 ints, as defined by
    the following declaration
  • int[] x = new int[20]
  • What kind of message would be generated by the
    following line of code?
  • a = 22
  • val = x[a]
  • A Syntax Error.
  • A Static Semantic Error.
  • A Dynamic Semantic Error.
  • A Warning, rather than an error.
  • None of the above.

Answer: c A dynamic semantic error: the value of
a would cause an array out of bounds error
56
Semantics
  • Also needed to generate appropriate code,
    e.g. a = b
  • in Java and C, this means assign b to a
  • in Pascal and Ada, this means compare equality of
    a and b
  • hence, generate different code in each case

57
Semantic Routines
  • 1) Semantic analysis
  • Completes analysis phase of compilation
  • Object descriptors are associated with
    identifiers in symbol table
  • Static semantic error checking performed
  • 2) Semantic synthesis
  • Code generation

58
Object Descriptors
Symbol table entry
(to next entryin chain)
name token descriptor list
link
  • Token tells us what name is
  • e.g. while-token, if-token, identifier, etc.
  • A descriptor contains things like type, address,
    array bounds, etc.
  • Need a list of descriptors because of identifier
    re-use

59
Identifier Re-use
  • Can have code such as
    int x;        // level 1
    main() {
      float x;    // level 2
      ...
    }

symbol table entry for x:  x → (2, float) → (1, integer)
60
Descriptor Lists
  • For efficiency, the most local descriptors are
    kept at the front of the list
  • At the end of a block, all descriptors declared
    in that block must be deleted
  • To aid in this, all descriptors within same block
    may be linked together

61
Attribute Propagation
  • Before code can be generated, semantic attributes
    may need to be propagated through tree
  • Top-down (inherited attributes)
  • declarations processed to build symbol table
  • identifiers looked up in table to attach
    attribute info to nodes
  • Bottom-up (synthesised attributes)
  • determine types of expressions based on operators
    and types of identifiers
  • Propagation can be done at same time as static
    semantic error checking, and often forms next
    pass
  • May also be combined with code generation

62
Example: a = b*c + b*d, where float a, d; int b, c

            =  (float)
           /  \
    a (float)   +  (float)
               /  \
        * (int)    * (float)
        /  \       /  \
  b (int)  c (int) b (int) d (float)

(identifier types are inherited from the SYMBOL TABLE; operator-node
types are synthesised from their operands)
  • Type attribute recorded in extra field of each
    node
  • After propagation, tree is said to be decorated

63
Static Semantic Error Checking
  • With info from attribute propagation, static
    checking often trivial, e.g.
  • type mismatch (compare type attributes)
  • identifier not declared (null descriptor field
    in symbol table)
  • identifier already declared (descriptor with
    current level number already present)

64
Question
  • A BNF grammar includes the following statement
  • <statement> ::= <iden> = ( <expr> )
  • What kind of message would be produced by the
    following line of code?
  • a = (2 + b
  • A Syntax Error.
  • A Static Semantic Error.
  • A Dynamic Semantic Error.
  • A Warning, rather than an error.
  • None of the above.

Answer a A syntax error all the tokens are
valid, but the close parenthesis is missing,
resulting in an error in the grammar
65
Code Generation
  • Often performed by tree-walking the AST:

    GenAssign(node)
      // Gen code for RHS, leaving result in R1
      GenExpr(node.rhs, R1)
      // Calculate addr for LHS
      GenAddr(node.lhs, Addr)
      Gen(STORE, R1, Addr)

    GenExpr(node, reg)
      if (node.type == op)
        GenExpr(node.lhs, reg)
        GenExpr(node.rhs, reg1)
        Gen(node.opcode, reg, reg1)
      ...

66
Abstract Syntax Tree (AST) Again
  • More compact form of derivation tree
  • contains just enough info. to drive later
    phases, e.g. Y := 3*X + I

            :=
           /  \
      id (Y)    +
               / \
              *   id (I)
             / \
     const (3)  id (X)

(each node carries a tag and an attribute; id nodes point into the
symbol table)
67
Tree Walking
  • Walking the decorated tree for Y := 3*X + I
    (all nodes (int)) generates:
    LOAD  R1, 3
    LOAD  R2, X
    MULT  R1, R2
    LOAD  R2, I
    ADD   R1, R2
    STORE R1, Y
  • Advantage of AST is that order of traversal can
    be chosen
  • code generated in one-pass compiler corresponds
    to strictly fixed traversal of tree(hence, code
    not as good)

68
Intermediate Code (IC)
  • Instead of generating target machine code,
    semantic routines may generate IC.
  • can form input to separate code generator (CG)
  • advantage is that all target machine dependencies
    can be limited to CG
  • Postfix
  • e.g. a = b*c + b*d becomes a b c * b d * + =
  • Concise and simple, but not very good for
    generating code unless a stack-based architecture
    is used

69
Postfix
  • In normal algebraic notation the arithmetic
    operator appears between the two operands to
    which it is being applied
  • This is called infix notation
  • example: a / b + c
  • It may require parentheses to specify the desired
    order of operations
  • example: a / (b + c)
  • In postfix (or Reverse Polish) notation the
    operator is placed directly after the two
    operands to which it applies
  • Therefore, in postfix notation the need for
    parentheses is eliminated

70
Operator Precedence
  • To do the conversion from infix to postfix, we
    need to prioritise operators as follows
  • highest priority
  • *, /
  • +, -
  • <, >, ==, ...
  • && (and)
  • || (or)   lowest priority

71
Exercise
  • Convert the following infix expressions into
    postfix
  • a+b/c
  • a+c*(b-d)
  • a+c*b-d

72
Postfix
  • Example 1
  • The infix expression a + b * c
  • Becomes in postfix: a b c * +
  • Example 2
  • The infix expression a * (b + c)
  • Becomes in postfix: a b c + *
  • Example 3
  • The infix expression b * c + 5 * ( 3 + 6 / a )
  • Becomes in postfix: b c * 5 3 6 a / + * +

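One classic way to mechanise this conversion is Dijkstra's shunting-yard algorithm, sketched here using the operator priorities from the previous slide (tokens are assumed to be pre-split; and/or/comparisons are omitted for brevity):

```python
PREC = {"*": 2, "/": 2, "+": 1, "-": 1}

def to_postfix(tokens):
    out, stack = [], []
    for t in tokens:
        if t in PREC:
            # pop waiting operators of equal or higher priority first
            while stack and stack[-1] != "(" and PREC[stack[-1]] >= PREC[t]:
                out.append(stack.pop())
            stack.append(t)
        elif t == "(":
            stack.append(t)
        elif t == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()              # discard the "("
        else:
            out.append(t)            # operand goes straight to the output
    return out + stack[::-1]         # flush the remaining operators

print(to_postfix("b * c + 5 * ( 3 + 6 / a )".split()))
```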
73
Question
  • Which of the following postfix expressions is
    equivalent to the following expression?
  • ab c/d
  • a b c d - /
  • a b - c d /
  • a b c d / -
  • a b c d / -
  • a b c - d /

Answer d a b c d / -
74
Today
  • Code generation
  • Three address code
  • Code optimisation
  • Techniques
  • Classification of optimisations
  • Time of application
  • Area of application

75
Intermediate Code
  • Code can be generated from syntax tree
  • However, this doesn't represent target code very
    well
  • Tree represents constructs such as conditionals
    (if-then-else) or loops (while-do)
  • Target code includes jumps to memory addresses
  • Intermediate code represents a linearisation of
    the syntax tree
  • Postfix is an example of a stack-based
    linearisation
  • Typically related in some way to target
    architecture
  • Good for efficient code
  • Can be exploited by code optimisation routines

76
Three Address Code
  • Reflects the notion of simple operations of the
    form
  • x = y op z
  • Many instructions are of this form
  • Introduces the notion of temporary variables
  • These represent interior nodes in the tree
  • Usually assigned to registers
  • Represents a left-to-right linearization of the
    code
  • Other variants exist, e.g. for unary operations
  • x = -y

77
Three Address Code
  • Consider the arithmetic expression
  • 2*a + (b-3)
  • The corresponding three-address code is
  • t1 = 2 * a
  • t2 = b - 3
  • t3 = t1 + t2

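Generating three-address code from an expression tree is a short post-order walk, with each interior node given a fresh temporary. The tuple tree below is a stand-in for a real AST, used only for this sketch:

```python
def gen_tac(tree):
    """Return a list of three-address instructions for the expression tree."""
    code = []

    def walk(node):
        if isinstance(node, str):
            return node                      # leaf: variable or constant
        op, left, right = node
        l, r = walk(left), walk(right)
        temp = f"t{len(code) + 1}"           # fresh temporary for this node
        code.append(f"{temp} = {l} {op} {r}")
        return temp

    walk(tree)
    return code

# the expression 2*a + (b-3) from the slide
print(gen_tac(("+", ("*", "2", "a"), ("-", "b", "3"))))
```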
78
Example factorial function
  • read x
  • if (0 < x) then
  •   fact = 1
  •   repeat
  •     fact = fact * x
  •     x = x - 1
  •   until x == 0
  •   write fact
  • end

  • read x
  • t1 = x > 0
  • if_false t1 goto L1
  • fact = 1
  • label L2
  • t2 = fact * x
  • fact = t2
  • t3 = x - 1
  • x = t3
  • t4 = x == 0
  • if_false t4 goto L2
  • write fact
  • label L1
  • halt

79
P-Code
  • Was initially a target assembly generated by
    Pascal compilers in the early 70s
  • Format is very similar to assembly
  • designed to work on a hypothetical stack machine
    called a P-machine
  • aim was to aid portability
  • P-code instructions could then be mapped to
    assembly for target platform
  • Simple, abstract version given on the next slide

80
P-Code
  • Consider the arithmetic expression
  • 2*a + (b-3)
  • The corresponding P-code is
  • ldc 2   load constant 2
  • lod a   load value of var a
  • mpi     integer multiplication
  • lod b   load value of var b
  • ldc 3   load constant 3
  • sbi     integer subtraction
  • adi     integer addition

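The P-machine itself is easy to sketch as a small stack interpreter. The five opcodes below are the ones used above; representing the variable environment as a plain dictionary is an assumption made for this sketch:

```python
def run(program, env):
    """Execute P-code on a stack; return the value left on top."""
    stack = []
    for op, *args in program:
        if op == "ldc":                      # load constant
            stack.append(args[0])
        elif op == "lod":                    # load value of a variable
            stack.append(env[args[0]])
        else:                                # mpi / sbi / adi pop two operands
            b, a = stack.pop(), stack.pop()
            stack.append({"mpi": a * b, "sbi": a - b, "adi": a + b}[op])
    return stack.pop()

# 2*a + (b-3) with a = 4, b = 10
prog = [("ldc", 2), ("lod", "a"), ("mpi",),
        ("lod", "b"), ("ldc", 3), ("sbi",), ("adi",)]
print(run(prog, {"a": 4, "b": 10}))
```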
81
Question
  • Which of the following is NOT a form of
    intermediate representation used by compilers?
  • Postfix
  • Tuples
  • Context-free grammar
  • Abstract syntax tree
  • Virtual machine code

Answer c A context-free grammar defines the
language used by the compiler the rest are
intermediate representations
82
Code Optimisation
  • Aim is to improve quality of target code
  • Disadvantages
  • compiler more difficult to write
  • compilation time may double or triple
  • target code often bears little resemblance to
    unoptimised code
  • greater chance of translation errors
  • more difficult to debug programs

83
Optimisation Techniques
  • Constant folding
  • can evaluate expressions involving constants at
    compile-time
  • aim is for the compiler to pre-compute (or
    remove) as many operations as possible
  • a = 3*16 - 2   becomes   LOAD 1, 46
                             STORE 1, a

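Constant folding is a bottom-up walk that evaluates any sub-tree whose operands are all constants; tuple trees again stand in for a real AST in this sketch:

```python
def fold(node):
    """Replace constant sub-trees by their compile-time value."""
    if isinstance(node, tuple):
        op, l, r = node
        l, r = fold(l), fold(r)            # fold children first
        if isinstance(l, int) and isinstance(r, int):
            return {"+": l + r, "-": l - r, "*": l * r}[op]
        return (op, l, r)                  # something non-constant remains
    return node

print(fold(("-", ("*", 3, 16), 2)))   # the a = 3*16 - 2 example: folds to 46
```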
84
Techniques
  • Global register allocation
  • analyse program to determine which variables are
    likely to be used most and allocate these to
    registers
  • good use of registers is a very important feature
    of efficient code
  • aided by architectures that provide an increased
    number of registers

85
Techniques
  • Code deletion
  • identify and delete unreachable or dead code
  • boolean debug = false
    ...
    if (debug)
      ...        // No need to generate code for this

86
Techniques
  • Common sub-expression elimination
  • avoid generating code for unnecessary operations
    by identifying expressions that are repeated
  • a = (b*c/5 + x) - (b*c/5 + y)
  • generate code for b*c/5 only once

87
Exercise
  • Optimise the following
  • a = 100322
  • b = (a-30)*5
  • if (a<b)
  •   screen.println(a)

88
Techniques
  • Code motion out of loops
  • for (int i=0; i < n; i++) {
      x = a + 5;           // loop-invariant code
      Screen.println(x+i);
    }
  • x = a + 5;
    for (int i=0; i < n; i++) {
      Screen.println(x+i);
    }

89
Techniques
  • Strength reduction
  • replace operations by others which are equivalent
    but more efficient, e.g. for a*2:
    LOAD 1, a            LOAD 1, a
    MULT 1, 2    ==>     ADD 1, 1

90
Question
  • What optimisation technique could be applied in
    the following examples?
  • a = b * 2
  • a = a / 2
  • Constant Folding
  • Code Deletion
  • Common Sub-Expression Elimination
  • Strength Reduction
  • Global Register Allocation

Answer: d Both expressions can be reduced by
changing the operator: a = b * 2 can be reduced
to a = b + b; a = a / 2 is a right shift
operation: a = a >> 1
91
Classification of Optimisations
  • Optimisations can be classified according to
    their different characteristics
  • Two useful classifications
  • the period of the compilation process during
    which an optimisation can be applied
  • the area of the program to which the optimisation
    applies

92
Time of Application
  • Optimisations can be performed at virtually every
    stage of the compilation process
  • e.g. constant folding can be performed during
    parsing
  • other optimisations might be applied to target
    code
  • The majority of optimisations are performed
    either during or just after intermediate code
    generation, or during target code generation
  • source-level optimisations do not depend upon
    characteristics of the target machine and can be
    performed earlier
  • target-level optimisations depend upon the target
    architecture
  • sometimes an optimisation can consist of both

93
Target Code Optimisations
  • Optimisations performed on target code are known
    as peephole optimisations
  • scan target code, searching for sequences of
    target code that can be replaced by more
    efficient ones, e.g.
  • LOAD 1, a
    ADD 1, 1     ==>   INC a
    STORE 1, a
  • replacements may introduce further possibilities
  • effective and simple
  • sometimes tacked onto end of one-pass compiler

94
Area of Application
  • Optimisations can be applied to different areas
    of a program
  • Local optimisations those that are applied to
    straight-line segments of code, i.e. with no
    jumps into or out of the sequence
  • easiest optimisations to perform
  • Global optimisations those that extend beyond
    basic blocks but are confined to an individual
    procedure
  • more difficult to perform
  • Inter-procedural optimisations those that extend
    beyond the boundaries of procedures to the entire
    program
  • most difficult optimisations to perform

95
Today
  • Compiler-writing tools
  • Regular expressions
  • Lex
  • Yacc
  • Code generator generators

96
Compiler-Writing Tools
  • Various software tools exist which aid in the
    construction of compilers.
  • Parser generators
  • e.g. yacc
  • Code generator generators
  • Scanner generators
  • e.g. lex
  • The input to lex consists of a definition of each
    token as a regular expression

97
Regular Expressions (REs)
  • Used in many UNIX tools, e.g. awk, grep, sed,
    lex, vi
  • REs specify patterns to be matched against input
    text
  • An RE may be just a string
    cat matches the string cat
  • A full stop matches any single char
    c.t matches cat, cut, cot, etc.

97
98
REs
  • The beginning of a line is specified by ^
  • End of line is specified as $
  • An asterisk means zero or more occurrences of
    the immediately preceding item
    xy*z matches xz, xyz, xyyz, xyyyz, etc.
  • A plus sign means one or more
    xy+z matches xyz, xyyz, etc.
  • A vertical bar means or, e.g.
    x(a|b)y matches xay or xby
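The operators on this slide can be checked with Python's re module, whose syntax agrees with the classic UNIX REs for these metacharacters (the examples are the slide's own):

```python
import re

# . * + | ^ $ demonstrated on the slide's own examples
assert re.search(r"c.t", "cut")           # . matches any single char
assert re.fullmatch(r"xy*z", "xz")        # * : zero or more y's
assert re.fullmatch(r"xy*z", "xyyyz")
assert not re.fullmatch(r"xy+z", "xz")    # + : at least one y
assert re.fullmatch(r"xy+z", "xyyz")
assert re.fullmatch(r"x(a|b)y", "xby")    # | : alternation
assert re.search(r"^add", "add a dog")    # ^ : start of line
assert re.search(r"vark$", "aardvark")    # $ : end of line
print("all patterns behave as described")
```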

98
99
Exercise
  • What will be matched by the pattern a.d in the
    following line of characters?
  • add a dog and aardvark
  • Using the same line of characters what will match
    a.d ?
  • What do we get if we search a file for all
    occurrences of the following patterns?
  • hello
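The first question in this exercise can be checked mechanically. Using Python's re module (an assumption: the lecture's tools are grep-style, but the pattern a.d behaves the same way), re.findall returns the non-overlapping matches in order:

```python
import re

line = "add a dog and aardvark"
# a.d : an 'a', any single character, then a 'd'
print(re.findall(r"a.d", line))  # ['add', 'a d', 'and', 'ard']
```

Note that "a d" spans the space in "a dog", and "ard" comes from the second 'a' of "aardvark": the full stop matches any character, including a space.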

99
100
Exercise
  • Using the same line of characters as before, what
    will be matched by the following?
  • ^and
  • and$
  • What will be matched by
  • 10*1
  • .*

100
101
Character Classes
  • Square brackets denote a character class
    [abc] matches character a, b, or c
  • Can also abbreviate
    [1-6] is equivalent to [123456]
  • Asterisk and plus may be applied to character
    classes, e.g. to define hex numbers in a Java or C
    program 0x[0-9a-fA-F]+
  • Can negate a character class
    [^abc] matches any char except a, b, c
  • Note the different use of ^ (negation inside a
    class, start of line outside)
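Character classes can likewise be checked with Python's re module (the hex-literal pattern is the slide's own example):

```python
import re

assert re.fullmatch(r"[abc]", "b")                # one of a, b, c
assert re.fullmatch(r"[1-6]", "4")                # range abbreviation
assert re.fullmatch(r"0x[0-9a-fA-F]+", "0x1A3f")  # C/Java hex literal
assert not re.fullmatch(r"0x[0-9a-fA-F]+", "0x")  # + needs at least one digit
assert re.fullmatch(r"[^abc]", "z")               # negated class
assert not re.fullmatch(r"[^abc]", "a")
print("character classes behave as described")
```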

101
102
Exercise
  • Which of the following will match
  • [Kk][ai][tl]e
  • Kate  kite  kale  kit ?
  • What matches the following? \t\n

102
103
Lex
  • Input to lex consists of pairs of REs and actions
  • Each RE defines a particular language token
  • Each action is a fragment of C code, to be
    executed if the token is encountered
  • Lex transforms this input to a C function called
    yylex(), which returns a token each time it is
    called
  • The string that matches an RE is placed in an
    array called yytext
  • Extra info about a token can be passed back to
    calling program via a global variable called
    yylval
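The lex model described above, pairs of REs and actions compiled into a token-returning function, can be sketched in a few lines of Python. This is an illustration of the idea only, not real lex: the table contents mirror the lecture's example, but the names (token_defs, tokenize) are assumptions, and real lex prefers the longest match rather than simply trying rules in order as this sketch does.

```python
import re

# Each token is an RE paired with an action turning the matched text into a
# token; a None action means "skip" (like lex's empty whitespace rule).
token_defs = [
    (r"[ \t\n]+", None),                                  # skip whitespace
    (r"while",    lambda s: ("WHILE_SYMB", s)),
    (r"[0-9]+",   lambda s: ("NUMBER", int(s))),          # like yylval = atoi(yytext)
    (r"[a-zA-Z][a-zA-Z0-9]*", lambda s: ("IDENT", s)),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for pattern, action in token_defs:
            m = re.match(pattern, text[pos:])
            if m:
                pos += m.end()
                if action:
                    yield action(m.group())   # m.group() plays the role of yytext
                break
        else:
            raise SyntaxError("no token matches at position %d" % pos)

print(list(tokenize("while x 42")))
# [('WHILE_SYMB', 'while'), ('IDENT', 'x'), ('NUMBER', 42)]
```

Note that the keyword rule must precede the identifier rule here, since "while" also matches the IDENT pattern; lex resolves such ties by rule order too.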

103
104
Lex Example
  • [ \t\n]+  ;
    while     return(WHILE_SYMB);
    for       return(FOR_SYMB);
    ...
    [0-9]+    { /* convert to int */
                yylval = atoi(yytext);
                return(NUMBER); }
    [a-zA-Z][a-zA-Z0-9]*
              { /* find in sym tab */
                yylval = lookup(yytext);
                return(IDENT); }

104
105
Yacc
  • Stands for Yet Another Compiler-Compiler
  • It is a parser generator
  • Parser generators are programs that take as input
    the grammar defining a language and produce as
    output a parser for that language
  • A Yacc parser matches sequences of input tokens
    to the rules of the given grammar

105
106
Yacc
  • The specification file that Yacc takes as input
    consists of three sections
  • Definitions: contains info about the tokens, data
    types and grammar rules required to build the
    parser
  • Rules: contains the rules (in a form of BNF) of
    the grammar, along with actions in C code to be
    executed whenever a given rule is recognised
  • Auxiliary routines: contains any auxiliary
    procedure and function declarations required to
    complete the parser

106
107
Yacc Example
  • Example format of rules
  • assign : IDENT BECOMES expr SEMI
               { /* action for assignment */ }
    while  : WHILE expr DO statement
               { /* action for while stat */ }
  • The parsing procedure produced by Yacc is called
    yyparse()
  • returns an int value: 0 if the parse is
    successful, 1 otherwise

107
108
Error Recovery in Yacc
  • Errors need to be recognised and recovered from;
    Yacc provides error productions as the principal
    way to achieve this
  • Error productions have on their right hand side
    an error pseudotoken
  • These productions identify a context in which
    erroneous tokens can be deleted until tokens are
    encountered that enable the parse to be
    re-started
  • When errors are encountered, appropriate syntax
    error messages will be generated

108
109
Code Generator Generators
  • CGGs remove the burden of deciding what code to
    generate for each construct
  • Implementer produces a formal description of what
    each target machine instruction does
  • The CGG automatically searches the machine
    description to find the instruction(s) that
    produce the desired computation
  • Code almost as good as conventional compiler, but
    generation speed much slower

109
110
Question
  • Lex is a software tool that can be used to aid
    compiler construction. It is an example of which
    of the following?
  • A scanner generator
  • A parser generator
  • A code generator generator
  • A semantic analyser
  • A code debugger

Answer: (a) a scanner generator. Lex is responsible
for identifying tokens using regular expressions.
110