Embedded Systems in Silicon TD5102 Compilers with emphasis on ILP compilation

1
Embedded Systems in Silicon (TD5102)
Compilers, with emphasis on ILP compilation

Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore, 2005/2006
2
Compiling for ILP Architectures
  • Overview
  • Motivation and Goals
  • Measuring and exploiting available parallelism
  • Compiler basics
  • Scheduling for ILP architectures
  • Summary and Conclusions

3
Motivation
  • Performance requirements increase
  • Applications may contain much instruction-level
    parallelism
  • Processors offer lots of hardware concurrency
  • Problem to be solved:
  • how to exploit this concurrency automatically?

4
Goals of code generation
  • High speedup
  • Exploit all the hardware concurrency
  • Extract all application parallelism
  • obey true dependencies only
  • resolve false dependencies by renaming
  • No code rewriting: automatic parallelization
  • however, application tuning may be required
  • Limit code expansion

5
Overview
  • Motivation and Goals
  • Measuring and exploiting available parallelism
  • Compiler basics
  • Scheduling for ILP architectures
  • Summary and Conclusions

6
Measuring and exploiting available parallelism
  • How to measure parallelism within applications?
  • using an existing compiler
  • using trace analysis
  • Track all the real data dependencies (RaWs) of
    instructions in the issue window:
  • register dependences
  • memory dependences
  • Check for correct branch prediction:
  • if prediction correct, continue
  • if wrong, flush schedule and start in next cycle

7
Trace analysis

Program:
  for i = 0..2
    A[i] = i;
  S = X + 3;

Compiled code:
        set r1,0
        set r2,3
        set r3,&A
  Loop: st r1,0(r3)
        add r1,r1,1
        add r3,r3,4
        brne r1,r2,Loop
        add r1,r5,3

Execution trace:
  set r1,0
  set r2,3
  set r3,&A
  st r1,0(r3)
  add r1,r1,1
  add r3,r3,4
  brne r1,r2,Loop
  st r1,0(r3)
  add r1,r1,1
  add r3,r3,4
  brne r1,r2,Loop
  st r1,0(r3)
  add r1,r1,1
  add r3,r3,4
  brne r1,r2,Loop
  add r1,r5,3

How parallel can this code be executed?
8
Trace analysis

Parallel trace:
  cycle 1: set r1,0      set r2,3     set r3,&A
  cycle 2: st r1,0(r3)   add r1,r1,1  add r3,r3,4
  cycle 3: st r1,0(r3)   add r1,r1,1  add r3,r3,4  brne r1,r2,Loop
  cycle 4: st r1,0(r3)   add r1,r1,1  add r3,r3,4  brne r1,r2,Loop
  cycle 5: brne r1,r2,Loop
  cycle 6: add r1,r5,3

Max ILP speedup = L_serial / L_parallel = 16 / 6 = 2.7
9
Ideal Processor
  • Assumptions for ideal/perfect processor:
  • 1. Register renaming: infinite number of
    virtual registers ⇒ all register WAW and WAR
    hazards avoided
  • 2. Branch and jump prediction: perfect ⇒ all
    program instructions are available for execution
  • 3. Memory-address alias analysis: addresses are
    known; a store can be moved before a load,
    provided the addresses are not equal
  • Also:
  • unlimited number of instructions issued per cycle
    (unlimited resources), and
  • unlimited instruction window
  • perfect caches
  • 1-cycle latency for all instructions (including FP *, /)
  • Programs were compiled with the MIPS compiler at
    maximum optimization level

10
Upper Limit to ILP: Ideal Processor

[Chart: IPC achieved on the ideal processor; integer benchmarks
reach 18-60, floating-point benchmarks 75-150]
11
Different effects reduce the exploitable parallelism
  • Reducing window size
  • i.e., the number of instructions to choose from
  • Non-perfect branch prediction
  • perfect (oracle model)
  • dynamic predictor (e.g. a 2-bit prediction table
    with a finite number of entries)
  • static prediction (using profiling)
  • no prediction
  • Restricted number of registers for renaming
  • typical superscalars have O(100) registers
  • Restricted number of other resources, like FUs

12
Different effects reduce the exploitable parallelism
  • Non-perfect alias analysis (memory disambiguation).
    Models to use:
  • perfect
  • inspection: no dependence in cases like
      r1 = 0(r9)      r1 = 0(fp)
      4(r9) = r2      0(gp) = r2
    (different offsets from the same base register;
    stack (fp) versus global (gp) references)
  • a more advanced analysis may disambiguate most
    stack and global references, but not heap references
  • none
  • Important:
  • good branch prediction, 128 registers for renaming,
    alias analysis on stack and global accesses, and,
    for floating-point programs, a large window size

13
Summary
  • Amount of parallelism is limited
  • higher in multimedia applications
  • higher in kernels
  • Trace analysis detects all types of parallelism
  • task-, data- and operation-level
  • Detected parallelism depends on
  • quality of the compiler
  • hardware
  • source-code transformations

14
Overview
  • Motivation and Goals
  • Measuring and exploiting available parallelism
  • Compiler basics
  • Scheduling for ILP architectures
  • Source level transformations
  • Compilation frameworks
  • Summary and Conclusions

15
Compiler basics
  • Overview
  • Compiler trajectory / structure / passes
  • Abstract Syntax Tree (AST)
  • Control Flow Graph (CFG)
  • Data Dependence Graph (DDG)
  • Basic optimizations
  • Register allocation
  • Code selection

16
Compiler basics: trajectory

Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program

(The compiler reports error messages; the loader/linker brings in
library code.)
17
Compiler basics: structure / passes

Source code
  ↓ Parsing: lexical analyzer (token generation),
    syntax and semantic checks, parse tree generation
Intermediate code
  ↓ Code optimization: data flow analysis, local
    optimizations, global optimizations
  ↓ Code generation: code selection, peephole
    optimizations
  ↓ Register allocation: build interference graph,
    graph coloring, spill code insertion, caller/callee
    save and restore code
Sequential code
  ↓ Scheduling and allocation: exploiting ILP
Object code
18
Compiler basics: structure, simple compilation example

Input: position = initial + rate * 60

Lexical analyzer:
  id1 = id2 + id3 * 60

Syntax analyzer: parse tree

Intermediate code generator:
  temp1 = inttoreal(60)
  temp2 = id3 * temp1
  temp3 = id2 + temp2
  id1 = temp3

Code optimizer:
  temp1 = id3 * 60.0
  id1 = id2 + temp1

Code generator:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
19
Compiler basics: structure (SUIF-1 toolkit example)

[Figure: pass structure of the SUIF-1 compiler toolkit]
20
Compiler basics: Abstract Syntax Tree (AST)

C input code:
  if (a > b)
    r = a % b;
  else
    r = b % a;

Parse tree (allows arbitrarily deep nesting):
  Stat: IF
    Cmp '>'
      Var a
      Var b
    Statlist
      Stat: Expr
        Assign
          Var r
          Binop '%'
            Var a
            Var b
    Statlist
      Stat: Expr
        Assign
          Var r
          Binop '%'
            Var b
            Var a
21
Compiler basics: Control Flow Graph (CFG)

C input code:
  if (a > b)
    r = a % b;
  else
    r = b % a;

CFG:
  1: sub t1, a, b
     bgz t1, 2, 3
  2: rem r, a, b
     goto 4
  3: rem r, b, a
     goto 4
  4: ...

A program is a collection of functions; each function is a
collection of basic blocks; each basic block contains a set of
instructions; each instruction consists of several transports, ...
22
Data Dependence Graph (DDG)

Example source:
  a = b + 15;
  c = 3.14 * d;
  e = c / f;

Translation to DDG:

[Figure: loads of b, d and f feed the operators '+' (with constant
15), '*' (with constant 3.14) and '/'; the '*' result (c) also feeds
the division; the results are stored to a, c and e]
23
Compiler basics Basic optimizations
  • Machine independent optimizations
  • Machine dependent optimizations
  • (details are in any good compiler book)

24
Machine independent optimizations
  • Common subexpression elimination
  • Constant folding
  • Copy propagation
  • Dead-code elimination
  • Induction variable elimination
  • Strength reduction (see the sketch below)
  • Algebraic identities
  • commutative expressions
  • associativity: tree height reduction
  • Note: not always allowed (due to limited
    floating-point precision)
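
A minimal C sketch (ours, not from the slides) of what strength
reduction and loop-invariant code motion do to a simple loop:

  /* before optimization */
  void before(int *a, int n, int x, int y) {
      for (int i = 0; i < n; i++)
          a[i] = i * 8 + x * y;    /* multiply, plus loop-invariant x*y */
  }

  /* after: i*8 becomes a running add, x*y is hoisted out of the loop */
  void after(int *a, int n, int x, int y) {
      int inv = x * y;             /* loop-invariant, computed once */
      int t = 0;                   /* running value of i*8 */
      for (int i = 0; i < n; i++) {
          a[i] = t + inv;
          t += 8;                  /* cheap add replaces the multiply */
      }
  }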

25
Machine dependent optimization example
  • What's the optimal implementation of a * 34?
  • Use the multiplier: mul Tb,Ta,34
  • Pro: no thinking required
  • Con: may take many cycles
  • Alternative (34*a = 2*a + 32*a, see the sketch below):
  • SHL Tc, Ta, 1
  • ADD Tb, Tc, Tzero
  • SHL Tc, Tc, 4
  • ADD Tb, Tb, Tc
  • Pro: may take fewer cycles
  • Cons:
  • uses more registers
  • additional instructions (I-cache load / code size)
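
The same decomposition in C, as a hedged illustration (the function
name is ours):

  /* multiply by 34 without a multiplier: 34*a = 32*a + 2*a */
  int times34(int a) {
      return (a << 5) + (a << 1);
  }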

26
Compiler basics: register allocation
  • Register organization
  • Conventions needed for parameter passing and
    register usage across function calls; a MIPS example:

  r31 - r21 : callee-saved registers
  r20 - r11 : caller-saved registers (temporaries)
  r10 - r1  : argument and result transfer
  r0        : hard-wired 0
27
Register allocation using graph coloring
  • Given a set of registers, what is the most
    efficient mapping of registers to program
    variables, in terms of execution time of the program?
  • A variable is defined at a point in the program
    when a value is assigned to it.
  • A variable is used at a point in the program when
    its value is referenced in an expression.
  • The live range of a variable is the execution
    range between definitions and uses of that variable.
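
A small illustration (ours, not the slides' example; input() is a
placeholder) of definitions, uses and the resulting live ranges:

  int a = input();   /* (1) a defined                          */
  int b = a + 1;     /* (2) a used; b defined                  */
  int c = a * 2;     /* (3) last use of a: its live range ends */
  int d = b + c;     /* (4) b and c used; d defined            */
  /* live ranges: a spans (1)-(3), b spans (2)-(4), c spans (3)-(4) */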

28
Register allocation using graph coloring

[Figure: example code fragment and the live ranges of its variables
a, b, c and d]
29
Register allocation using graph coloring

Interference graph of the live ranges, with a valid coloring:
  a: red, b: green, c: blue, d: green

The graph needs 3 colors (chromatic number = 3) ⇒ the program
needs 3 registers.
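
As a sketch of the idea (ours, not the slides' algorithm; the
interference edges below are assumptions chosen to reproduce the
coloring above, and production allocators use Chaitin/Briggs-style
simplification with spilling), greedy coloring of the interference
graph looks like this:

  #include <stdio.h>

  #define V 4   /* live ranges a, b, c, d */

  /* interfere[i][j] = 1 iff live ranges i and j overlap */
  static int interfere[V][V] = {
      /*        a  b  c  d */
      /* a */ { 0, 1, 1, 1 },
      /* b */ { 1, 0, 1, 0 },
      /* c */ { 1, 1, 0, 1 },
      /* d */ { 1, 0, 1, 0 },
  };

  int main(void) {
      int color[V];
      for (int v = 0; v < V; v++) {
          unsigned used = 0;              /* colors taken by neighbors */
          for (int u = 0; u < v; u++)
              if (interfere[v][u])
                  used |= 1u << color[u];
          color[v] = 0;
          while (used & (1u << color[v])) /* pick the lowest free color */
              color[v]++;
          printf("%c -> register r%d\n", 'a' + v, color[v]);
      }
      return 0;
  }

Running it assigns a, b, c, d to registers r0, r1, r2, r1: three
registers, matching the chromatic number above.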
30
Register allocation using graph coloring: spill/reload code

Spill/reload code is needed when there are not enough colors
(registers) to color the interference graph.

Example: only two registers available!
31
Compiler basics: code selection
  • CISC era
  • code size important
  • determine the shortest code sequence
  • many options may exist; found by pattern matching
  • Example (M68020): which instruction implements
    D1 := D1 + M[M[10+A1] + 16*D2 + 20] ?
  • ADD ([10,A1], D2*16, 20), D1
  • RISC era
  • performance important
  • only few possible code sequences
  • new implementations of old architectures optimize
    the RISC part of the instruction set only (e.g.
    i486 / Pentium / M68020)

32
Overview
  • Motivation and Goals
  • Measuring and exploiting available parallelism
  • Compiler basics
  • Scheduling for ILP architectures
  • Source level transformations
  • Compilation frameworks
  • Summary and Conclusions

33
What is scheduling?
  • Time allocation
  • assign instructions or operations to time slots
  • preserve dependences:
  • register dependences
  • memory dependences
  • optimize code with respect to performance, code
    size, power consumption, ...
  • Space allocation
  • satisfy resource constraints:
  • bind operations to FUs
  • bind variables to registers / register files
  • bind transports to buses

34
Why scheduling?
  • Let's look at the execution time:
      T_execution = N_cycles x T_cycle
                  = N_instructions x CPI x T_cycle
  • Scheduling may reduce T_execution
  • Reduce CPI (cycles per instruction):
  • early scheduling of long-latency operations
  • avoid pipeline stalls due to structural, data and
    control hazards
  • allow N_issue > 1 and therefore CPI < 1
  • Reduce N_instructions:
  • compact many operations into each instruction (VLIW)

35
Scheduling data hazards: RaW dependence

Avoiding RaW stalls: the compiler reorders instructions.
Example: avoiding a one-cycle load interlock for the code
  a = b + c; d = e - f
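
The slide's schedules are not reproduced here; a plausible
reconstruction, assuming a MIPS-like load-store ISA with a one-cycle
load-use delay (register names are illustrative):

  Unscheduled:                  Scheduled:
    lw  Rb,b                      lw  Rb,b
    lw  Rc,c                      lw  Rc,c
    add Ra,Rb,Rc   (stall)        lw  Re,e
    sw  Ra,a                      add Ra,Rb,Rc
    lw  Re,e                      lw  Rf,f
    lw  Rf,f                      sw  Ra,a
    sub Rd,Re,Rf   (stall)        sub Rd,Re,Rf
    sw  Rd,d                      sw  Rd,d

The reordered version hides both load delays, saving two stall cycles.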
36
Scheduling control hazards
  • A branch requires 3 actions:
  • compute the new address
  • determine the condition
  • perform the actual branch (if taken): PC := new address

37
Control hazards: what's the penalty?
  • CPI = CPI_ideal + f_branch x P_branch
  • P_branch = N_delayslots x miss_rate
  • Superscalars tend to have a large branch penalty
    P_branch due to:
  • many pipeline stages
  • multiple instructions (or operations) per cycle
  • Note:
  • the lower the CPI, the larger the effect of penalties
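
For instance (illustrative numbers, not from the slides): with
CPI_ideal = 1, f_branch = 0.2, N_delayslots = 3 and miss_rate = 0.1,
we get P_branch = 0.3 and CPI = 1 + 0.2 x 0.3 = 1.06, a 6% loss. A
2-issue machine with CPI_ideal = 0.5 ends up at 0.56, a 12% loss:
the lower the base CPI, the more the same penalty hurts.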

38
What can we do about control hazards and the CPI penalty?
  • Keep the penalty P_branch low:
  • early computation of the new PC
  • early determination of the condition
  • visible delay slots, filled by the compiler (MIPS)
  • branch prediction
  • Reduce control dependencies (control height
    reduction) [Schlansker and Kathail, Micro '95]
  • Remove branches: if-conversion
  • conditional instructions: CMOVE, conditional skip next
  • guarding all instructions: TriMedia

39
Scheduling: conditional instructions
  • Example: CMOVE (supported by the Alpha)

If (A == 0) S = T;    (assume r1 = A, r2 = S, r3 = T)

Object code:
      Bnez r1, L
      Mov  r2, r3
  L:  ....

  • After conversion:

      Cmovz r2, r3, r1
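
In C terms the transformation replaces control flow by a data
dependence (sketch):

  /* with a branch */
  if (A == 0) S = T;

  /* branch-free equivalent, what the Cmovz computes */
  S = (A == 0) ? T : S;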
40
Scheduling: conditional instructions
  • Conditional instructions are useful, however:
  • squashed instructions still take execution time
    and execution resources
  • consequence: long target blocks cannot be
    if-converted
  • the condition has to be known early
  • moving operations across multiple branches
    requires complicated predicates
  • compatibility: change of ISA (instruction set
    architecture)
  • Practice:
  • current superscalars support a limited set of
    conditional instructions
  • CMOVE: Alpha, MIPS, PowerPC, SPARC
  • HP PA: any RR instruction can conditionally
    squash the next instruction
  • Large VLIWs profit from making all instructions
    conditional
  • guarded execution: TriMedia, Intel/HP IA-64, TI C6x

41
Guarded execution

       SLT  r1,r2,r3
       BEQ  r1,r0,else
then:  ADDI r2,r2,1
       ..X..
       J    cont
else:  SUBI r2,r2,1
       ..Y..
cont:  MUL  r4,r2

After IF-conversion:

       SLT  b1,r2,r3
  b1:  ADDI r2,r2,1
 !b1:  SUBI r2,r2,1
  b1:  ..X..
 !b1:  ..Y..
       MUL  r4,r2
42
Scheduling: conditional instructions
  • Full guard support: if-conversion of conditional code
  • Assume:
  • t_branch = branch latency
  • p_branch = branching probability (TRUE path)
  • t_true = execution time of the TRUE branch
  • t_false = execution time of the FALSE branch
  • Execution times of the original and the if-converted
    code on a non-ILP architecture:
  • t_original_code = (1 + p_branch) x t_branch
                    + p_branch x t_true + (1 - p_branch) x t_false
  • t_if_converted_code = t_true + t_false
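
A quick illustration with made-up numbers: take t_branch = 2,
p_branch = 0.5 and t_true = t_false = 4. Then t_original_code =
1.5 x 2 + 0.5 x 4 + 0.5 x 4 = 7 while t_if_converted_code = 8, so
if-conversion loses on a non-ILP machine; on an ILP machine with
sufficient resources t_if_converted = max(4, 4) = 4, a 1.75x
speedup (see the next two slides).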

43
Scheduling: conditional instructions

[Graph: speedup of if-converted code for non-ILP architectures]

Only interesting for short target blocks!
44
Scheduling: conditional instructions

Speedup of if-converted code for ILP architectures with sufficient
resources:
  t_if_converted = max(t_true, t_false)

Much larger area of interest!
45
Scheduling: conditional instructions
  • Full guard support for large ILP architectures
    has a number of advantages:
  • removing unpredictable branches
  • enlarging the scheduling scope
  • enabling software pipelining
  • enhancing code motion when speculation is not allowed
  • resource sharing: even when speculation is
    allowed, guarding may be profitable

46
Scheduling: Overview
  • Transforming a sequential program into a parallel program:

    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write parallel program

47
Scheduling: integer linear programming
  • Integer linear programming (ILP) scheduling method
  • Introduce decision variables: x_{i,j} = 1 if
    operation i is scheduled in cycle j (0 otherwise)
  • Constraints like:
  • limited resources: for every cycle j and operation
    type t,
      Σ_{i of type t} x_{i,j} ≤ M_t
    where M_t = number of resources of type t
  • data dependence constraints
  • timing constraints
  • Problem: too many decision variables

48
List Scheduling
  • Make a dependence graph
  • Determine the minimal schedule length
  • Determine ASAP, ALAP, and slack of each operation
  • Place each operation in the first cycle with
    sufficient resources
  • Note:
  • scheduling order is sequential
  • priority is determined by the heuristic used, e.g. slack

49
Basic Block Scheduling

[Figure: example DDG with LD, MUL, ADD, SUB and NEG operations on
inputs A, B and C, producing results X, y and z. Each operation is
annotated with its <ASAP cycle, ALAP cycle> pair (here <1,1>, <2,2>,
<3,3>, <4,4> on the critical path, and e.g. <1,3>, <2,3>, <1,4>,
<2,4> off it); the difference between the two is the slack.]
50
ASAP and ALAP formulas

Assuming the conventions of the previous figure (source operations
in cycle 1, schedule length L, edge delay delay(u,v)):

  asap(v) = 1                                      if v has no predecessors
          = max { asap(u) + delay(u,v) : (u,v) ∈ E }  otherwise

  alap(v) = L                                      if v has no successors
          = min { alap(w) - delay(v,w) : (v,w) ∈ E }  otherwise

  slack(v) = alap(v) - asap(v)
51
Cycle based list scheduling
proc Schedule (DDG = (V,E))
beginproc
    ready  = { v | ¬∃ (u,v) ∈ E }   // all nodes without predecessor
    ready' = ready                  // all nodes schedulable in current cycle
    sched  = ∅
    current_cycle = 0
    while sched ≠ V do
        for each v ∈ ready' do
            if ¬ResourceConfl(v, current_cycle, sched) then
                cycle(v) = current_cycle
                sched = sched ∪ {v}
            endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E : u ∈ sched }
        ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E :
                       cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
endproc
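
A compact C rendering of the same cycle-based loop, as a sketch under
simplifying assumptions (one class of identical FUs, a dependence
matrix with fixed delays, no priority heuristic such as slack):

  #include <stdio.h>

  #define N 4                     /* number of operations in the DDG */

  static int delay[N][N];         /* delay[u][v] > 0  iff  edge u -> v */
  static int n_preds[N];          /* predecessors not yet scheduled    */
  static int cycle_of[N];         /* resulting cycle of each operation */
  static int done[N];

  static void schedule(int n_fus) {
      int scheduled = 0, cycle = 0;
      while (scheduled < N) {
          int used = 0;                          /* FUs taken this cycle */
          for (int v = 0; v < N && used < n_fus; v++) {
              if (done[v] || n_preds[v] > 0)
                  continue;                      /* not in the ready set */
              int ok = 1;                        /* operands available?  */
              for (int u = 0; u < N; u++)
                  if (delay[u][v] && cycle_of[u] + delay[u][v] > cycle)
                      ok = 0;
              if (!ok)
                  continue;
              cycle_of[v] = cycle;               /* place v in this cycle */
              done[v] = 1; scheduled++; used++;
              for (int w = 0; w < N; w++)        /* release successors */
                  if (delay[v][w])
                      n_preds[w]--;
          }
          cycle++;                               /* move to the next cycle */
      }
  }

  int main(void) {
      /* tiny example: chain op0 -> op1 -> op2 -> op3, 1-cycle delays, 1 FU */
      delay[0][1] = delay[1][2] = delay[2][3] = 1;
      n_preds[1] = n_preds[2] = n_preds[3] = 1;
      schedule(1);
      for (int v = 0; v < N; v++)
          printf("op %d -> cycle %d\n", v, cycle_of[v]);
      return 0;
  }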
52
Problem with basic block scheduling
  • Basic blocks contain on average only about 6
    instructions
  • Unrolling may help for loops
  • Go beyond basic blocks:
    1. Extended basic block scheduling
    2. Software pipelining

53
Extended basic block scheduling: scope

[Figure: partitioning a CFG into scheduling scopes]

54
Extended basic block scheduling: scope

[Figure: partitioning a CFG into scheduling scopes, continued]

55
Extended basic block scheduling: scope

[Table: comparing scheduling scopes]
56
Extended basic block scheduling: code motion
  • Downward code motions?
  • a → B, a → C, a → D, c → D, d → D
  • Upward code motions?
  • c → A, d → A, e → B, e → C, e → A

57
Extended basic block scheduling: code motion

Legend:
  I : basic blocks between the source and destination basic blocks
  D : basic blocks where duplications have to be placed
  M : control flow edges where off-liveness checks have to be performed
  b : source and destination basic blocks

  • SCP (single copy on a path) rule: no path may
    exist between 2 different D blocks

58
Extended basic block scheduling: code motion
  • A dominates B ⇔ A is always executed before B
  • Consequently:
  • A does not dominate B ⇒ code motion from B to A
    requires code duplication
  • B post-dominates A ⇔ B is always executed after A
  • Consequently:
  • B does not post-dominate A ⇒ code motion from B
    to A is speculative

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
59
Scheduling loops

Loop optimizations:

[Figure: CFG with blocks A, B, C and D, where C is the loop body.
Loop unrolling replaces C by several copies of C in sequence; loop
peeling places one or more copies of C in front of the loop.]
60
Scheduling loops
  • Problems with unrolling:
  • exploits only parallelism within sets of n iterations
  • iteration start-up latency
  • code expansion

[Figure: resource utilization over time for basic block scheduling,
basic block scheduling with unrolling, and software pipelining]
61
Software pipelining
  • Software pipelining a loop is:
  • scheduling the loop such that iterations start
    before preceding iterations have finished
  • or:
  • moving operations across the backedge

Overlapping LD-ML-ST iterations:
  cycle 1: LD
  cycle 2: LD ML
  cycle 3: LD ML ST
  cycle 4:    ML ST
  cycle 5:       ST

No overlap: 3 cycles/iteration. Unrolling (3 iterations):
5/3 cycles/iteration. Software pipelining: 1 cycle/iteration.
62
Software pipelining: modulo scheduling

Example: modulo scheduling a loop whose body is
  ld r1,(r2); mul r3,r1,3; sub r4,r3,1; st r4,(r5)

(c) Software pipeline (four overlapped iterations):

  ld  r1,(r2)
  mul r3,r1,3  ld  r1,(r2)                            } Prologue
  sub r4,r3,1  mul r3,r1,3  ld  r1,(r2)
  st  r4,(r5)  sub r4,r3,1  mul r3,r1,3  ld  r1,(r2)  } Kernel
               st  r4,(r5)  sub r4,r3,1  mul r3,r1,3
                            st  r4,(r5)  sub r4,r3,1  } Epilogue
                                         st  r4,(r5)

  • The prologue fills the SW pipeline with iterations
  • The epilogue drains the SW pipeline
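
For reference, the loop body corresponds to C code roughly like this
(pointer stepping omitted, as in the slide's simplified assembly;
src and dst are our names):

  for (int i = 0; i < n; i++) {
      int r1 = *src;        /* ld  r1,(r2) */
      int r3 = r1 * 3;      /* mul r3,r1,3 */
      int r4 = r3 - 1;      /* sub r4,r3,1 */
      *dst = r4;            /* st  r4,(r5) */
  }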

63
Summary and Conclusions
  • Compilation for ILP architectures is getting
    mature and is entering the commercial arena.
  • However:
  • there is a great discrepancy between the available
    and the exploitable parallelism
  • What if you need more parallelism?
  • source-to-source transformations
  • use other algorithms

64
The End
Thanks