Title: Embedded Systems in Silicon TD5102 Compilers with emphasis on ILP compilation
1 Embedded Systems in Silicon TD5102
Compilers with emphasis on ILP compilation
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2 Compiling for ILP Architectures
- Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Summary and Conclusions
3 Motivation
- Performance requirements increase
- Applications may contain much instruction-level parallelism
- Processors offer lots of hardware concurrency
- Problem to be solved:
  - how to exploit this concurrency automatically?
4 Goals of code generation
- High speedup
  - Exploit all the hardware concurrency
  - Extract all application parallelism
    - obey true dependencies only
    - resolve false dependencies by renaming
- No code rewriting: automatic parallelization
  - However, application tuning may be required
- Limit code expansion
5 Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Summary and Conclusions
6 Measuring and exploiting available parallelism
- How to measure parallelism within applications?
  - Using existing compiler
  - Using trace analysis
    - Track all the real data dependencies (RaWs) of instructions from the issue window
      - register dependence
      - memory dependence
    - Check for correct branch prediction
      - if prediction correct, continue
      - if wrong, flush the schedule and start in the next cycle
7 Trace analysis

Program:
    for i = 0..2: A[i] = i
    S = X + 3

Compiled code:
          set  r1,0
          set  r2,3
          set  r3,A
    Loop: st   r1,0(r3)
          add  r1,r1,1
          add  r3,r3,4
          brne r1,r2,Loop
          add  r1,r5,3

Execution trace:
    set  r1,0
    set  r2,3
    set  r3,A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

How parallel can this code be executed?
8 Trace analysis

Parallel trace (one line per cycle):
    set  r1,0          set r2,3       set r3,A
    st   r1,0(r3)      add r1,r1,1    add r3,r3,4
    st   r1,0(r3)      add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
    st   r1,0(r3)      add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
    brne r1,r2,Loop
    add  r1,r5,3

Max ILP speedup = L_serial / L_parallel = 16 / 6 ≈ 2.7
(a small C sketch of this trace-analysis computation follows)
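To make the trace-analysis model concrete, here is a minimal C sketch that replays a trace and assigns each instruction the earliest cycle permitted by its RaW register dependences alone (1-cycle latency, perfect prediction, unlimited resources, as on the next slide). The instruction encoding and the dependence skeleton of the slide-7 trace are illustrative assumptions; because the sketch ignores control and memory dependences entirely, it can report somewhat more ILP than the hand schedule above.

    /* Trace-based ILP measurement: RaW register dependences only. */
    #include <stdio.h>

    #define NREGS 32

    typedef struct { int dst; int src1, src2; } Insn;  /* -1 = unused operand */

    int main(void) {
        /* dependence skeleton of the slide-7 trace (dst, src1, src2) */
        Insn trace[] = {
            {1,-1,-1}, {2,-1,-1}, {3,-1,-1},        /* set r1 / set r2 / set r3 */
            {-1,1,3}, {1,1,-1}, {3,3,-1}, {-1,1,2}, /* st, add r1, add r3, brne */
            {-1,1,3}, {1,1,-1}, {3,3,-1}, {-1,1,2},
            {-1,1,3}, {1,1,-1}, {3,3,-1}, {-1,1,2},
            {1,5,-1},                               /* add r1,r5,3 */
        };
        int n = sizeof trace / sizeof trace[0];
        int ready[NREGS] = {0};   /* cycle in which each register value is available */
        int L_par = 0;

        for (int i = 0; i < n; i++) {
            int c = 0;            /* earliest cycle: after all producers (RaW only) */
            if (trace[i].src1 >= 0 && ready[trace[i].src1] > c) c = ready[trace[i].src1];
            if (trace[i].src2 >= 0 && ready[trace[i].src2] > c) c = ready[trace[i].src2];
            c += 1;               /* 1-cycle latency for every instruction */
            if (trace[i].dst >= 0) ready[trace[i].dst] = c;
            if (c > L_par) L_par = c;
        }
        printf("L_serial = %d, L_parallel = %d, speedup = %.1f\n",
               n, L_par, (double)n / L_par);
        return 0;
    }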
9 Ideal Processor
- Assumptions for ideal/perfect processor:
  - 1. Register renaming: infinite number of virtual registers ⇒ all WAW and WAR hazards avoided
  - 2. Branch and jump prediction: perfect ⇒ all program instructions available for execution
  - 3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal
- Also:
  - unlimited number of instructions issued per cycle (unlimited resources), and
  - unlimited instruction window
  - perfect caches
  - 1-cycle latency for all instructions (including FP *, /)
- Programs were compiled using the MIPS compiler with maximum optimization level
10 Upper Limit to ILP: Ideal Processor
[figure: measured IPC per benchmark; integer programs 18-60 IPC, FP programs 75-150 IPC]
11 Different effects reduce the exploitable parallelism
- Reducing window size
  - i.e., the number of instructions to choose from
- Non-perfect branch prediction
  - perfect (oracle model)
  - dynamic predictor (e.g. 2-bit prediction table with a finite number of entries)
  - static prediction (using profiling)
  - no prediction
- Restricted number of registers for renaming
  - typical superscalars have O(100) registers
- Restricted number of other resources, like FUs
12 Different effects reduce the exploitable parallelism
- Non-perfect alias analysis (memory disambiguation). Models to use:
  - perfect
  - inspection: no dependence in the following cases:

        r1 = 0(r9)         r1 = 0(fp)
        4(r9) = r2         0(gp) = r2

    (same base register with non-overlapping offsets; stack (fp) versus global (gp) accesses)
  - a more advanced analysis may disambiguate most stack and global references, but not the heap references
  - none
- Important:
  - good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and (for floating point) a large window size
13 Summary
- Amount of parallelism is limited
  - higher in multimedia applications
  - higher in kernels
- Trace analysis detects all types of parallelism
  - task, data and operation types
- Detected parallelism depends on
  - quality of the compiler
  - hardware
  - source-code transformations
14 Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Source level transformations
- Compilation frameworks
- Summary and Conclusions
15 Compiler basics
- Overview
- Compiler trajectory / structure / passes
- Abstract Syntax Tree (AST)
- Control Flow Graph (CFG)
- Data Dependence Graph (DDG)
- Basic optimizations
- Register allocation
- Code selection
16 Compiler basics: trajectory

    source program -> preprocessor -> compiler -> assembler -> loader/linker -> object program

(the compiler emits error messages; the loader/linker pulls in library code)
17 Compiler basics: structure / passes

source code
  -> Lexical analyzer: token generation
  -> Parsing: check syntax, check semantics, parse tree generation
intermediate code
  -> Code optimization: data flow analysis, local optimizations, global optimizations
  -> Code generation: code selection, peephole optimizations
  -> Register allocation: build interference graph, graph coloring, spill code insertion, caller/callee save and restore code
sequential code
  -> Scheduling and allocation: exploiting ILP
object code
18 Compiler basics: structure, simple compilation example

Input:                position = initial + rate * 60

Lexical analyzer:     id1 = id2 + id3 * 60

Syntax analyzer:      parse tree for id1 = id2 + id3 * 60

Intermediate code generator:
    temp1 = inttoreal(60)
    temp2 = id3 * temp1
    temp3 = id2 + temp2
    id1 = temp3

Code optimizer:
    temp1 = id3 * 60.0
    id1 = id2 + temp1

Code generator:
    movf id3, r2
    mulf 60.0, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
19 Compiler basics: structure - SUIF-1 toolkit example
20 Compiler basics: Abstract Syntax Tree (AST)

C input code:
    if (a > b) r = a % b;
    else       r = b % a;

Parse tree (nesting can be arbitrarily deep):
    Stat: IF
      Cmp: >
        Var a
        Var b
      Statlist
        Stat: Expr
          Assign
            Var r
            Binop %
              Var a
              Var b
      Statlist
        Stat: Expr
          Assign
            Var r
            Binop %
              Var b
              Var a
21 Compiler basics: Control Flow Graph (CFG)

C input code:
    if (a > b) r = a % b;
    else       r = b % a;

CFG:
    1: sub t1, a, b
       bgz t1, 2, 3
    2: rem r, a, b
       goto 4
    3: rem r, b, a
       goto 4
    4: ...

A program is a collection of functions, each function is a collection of
basic blocks, each basic block contains a set of instructions, and each
instruction consists of several transports, ...
22 Data Dependence Graph (DDG)

Source:
    a = b + 15
    c = 3.14 * d
    e = c / f

Translation to DDG: [figure] loads of b, d and f feed the operators
(+ with constant 15, * with constant 3.14, /); the operator results are
stored to a, c and e, and the value of c also feeds the division that
produces e.
23 Compiler basics: basic optimizations
- Machine independent optimizations
- Machine dependent optimizations
- (details are in any good compiler book)
24 Machine independent optimizations
- Common subexpression elimination
- Constant folding
- Copy propagation
- Dead-code elimination
- Induction variable elimination
- Strength reduction
- Algebraic identities
  - Commutative expressions
  - Associativity: tree height reduction
- Note: not always allowed (due to limited precision)
(a source-level sketch of several of these follows)
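As a source-level illustration (hand-applied; a real compiler performs these on intermediate code), the sketch below shows several of the optimizations above on a toy C function. The function and variable names are made up for the example; both versions compute the same result.

    /* Before: naive code with optimization opportunities. */
    int before(int a, int b, int n) {
        int r = 0;
        for (int i = 0; i < n; i++) {
            int x = a * b;       /* loop-invariant common subexpression   */
            int y = 2 * 3;       /* constant folding candidate            */
            int dead = x + y;    /* dead code: never used                 */
            r += x + y + i * 4;  /* i * 4: strength reduction candidate   */
        }
        return r;
    }

    /* After: the same computation, hand-optimized. */
    int after(int a, int b, int n) {
        int x = a * b;           /* common subexpression hoisted          */
        int r = 0;
        int i4 = 0;              /* induction variable replacing i * 4    */
        for (int i = 0; i < n; i++) {
            r += x + 6 + i4;     /* constant folded, dead code eliminated */
            i4 += 4;             /* strength reduction: multiply -> add   */
        }
        return r;
    }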
25 Machine dependent optimization example
- What's the optimal implementation of a * 34?
- Use the multiplier: mul Tb,Ta,34
  - Pro: no thinking required
  - Con: may take many cycles
- Alternative (34 = 32 + 2):
    SHL Tc, Ta, 1
    ADD Tb, Tc, Tzero
    SHL Tc, Tc, 4
    ADD Tb, Tb, Tc
  - Pro: may take fewer cycles
  - Cons:
    - uses more registers
    - additional instructions (I-cache load / code size)
(C equivalent sketched below)
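The same trick at C source level, assuming the compiler does not already apply it itself: since 34 = 32 + 2, a*34 can be computed with two shifts and an add, mirroring the SHL/ADD sequence above (which builds a<<1, then shifts it 4 more positions to obtain a<<5).

    /* Sketch: strength-reduced multiply by 34 = 32 + 2.
       Profitable only if shifts plus add are cheaper than the target's mul. */
    static inline int mul34(int a) {
        return (a << 5) + (a << 1);   /* a*32 + a*2 */
    }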
26 Compiler basics: register allocation
- Register organization
- Conventions needed for parameter passing and register usage across function calls; a MIPS example:

    r31-r21: callee-saved registers
    r20-r11: caller-saved registers / temporaries
    r10-r1:  argument and result transfer
    r0:      hard-wired 0
27 Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers
to program variables in terms of execution time of the program?

- A variable is defined at a point in the program when a value is assigned to it.
- A variable is used at a point in the program when its value is referenced in an expression.
- The live range of a variable is the execution range between the definitions and uses of the variable.
28 Register allocation using graph coloring

Example: [figure: program fragment with the live ranges of its variables]
29 Register allocation using graph coloring

Interference graph (nodes a, b, c, d)

Coloring: a = red, b = green, c = blue, d = green

The graph needs 3 colors (chromatic number 3) ⇒ the program needs 3 registers
(a greedy-coloring sketch follows)
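A minimal greedy-coloring sketch for this example. The edge set below is an assumption chosen to be consistent with the coloring shown (the interference-graph figure itself is not reproduced here); production allocators use Chaitin/Briggs-style simplification with spilling rather than this fixed visit order.

    /* Greedy coloring of a small interference graph. */
    #include <stdio.h>

    #define N 4  /* variables a, b, c, d */

    int main(void) {
        const char *name = "abcd";
        int adj[N][N] = {0};
        int edges[][2] = {{0,1},{0,2},{1,2},{0,3},{2,3}}; /* assumed interferences */
        for (int e = 0; e < 5; e++)
            adj[edges[e][0]][edges[e][1]] = adj[edges[e][1]][edges[e][0]] = 1;

        int color[N];
        for (int v = 0; v < N; v++) {
            int used = 0;                   /* bitmask of colors of earlier neighbors */
            for (int u = 0; u < v; u++)
                if (adj[v][u]) used |= 1 << color[u];
            int c = 0;
            while (used & (1 << c)) c++;    /* smallest free color = register number */
            color[v] = c;
            printf("%c -> r%d\n", name[v], c);
        }
        return 0;
    }

With these edges the sketch prints a -> r0, b -> r1, c -> r2, d -> r1: three registers, matching the coloring on the slide.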
30 Register allocation using graph coloring

Spill/reload code is needed when there are not enough colors (registers)
to color the interference graph.

Example: only two registers available!
31 Compiler basics: code selection
- CISC era
  - Code size important
  - Determine the shortest sequence of code
    - Many options may exist
  - Pattern matching
    - Example M68020:
      D1 := D1 + M[M[10+A1] + 16*D2 + 20]
      ADD ([10,A1], D2*16, 20), D1
- RISC era
  - Performance important
  - Only few possible code sequences
  - New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020
32 Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Source level transformations
- Compilation frameworks
- Summary and Conclusions
33 What is scheduling?
- Time allocation
  - Assigning instructions or operations to time slots
  - Preserve dependences
    - Register dependences
    - Memory dependences
  - Optimize code with respect to performance / code size / power consumption / ...
- Space allocation
  - Satisfy resource constraints
  - Bind operations to FUs
  - Bind variables to registers / register files
  - Bind transports to buses
34 Why scheduling?
- Let's look at the execution time:

    T_execution = N_cycles x T_cycle = N_instructions x CPI x T_cycle

- Scheduling may reduce T_execution
  - Reduce CPI (cycles per instruction)
    - early scheduling of long-latency operations
    - avoid pipeline stalls due to structural, data and control hazards
    - allow N_issue > 1 and therefore CPI < 1
  - Reduce N_instructions
    - compact many operations into each instruction (VLIW)
35 Scheduling data hazards: RaW dependence

Avoiding RaW stalls: reordering of instructions by the compiler

Example: avoiding a one-cycle load interlock for the code
    a = b + c
    d = e - f
by hoisting a load from the second statement above the add of the first,
so that no operation directly follows the load that produces its operand.
36 Scheduling control hazards
- A branch requires 3 actions:
  - Compute the new address
  - Determine the condition
  - Perform the actual branch (if taken): PC := new address
37 Control hazards: what's the penalty?

    CPI = CPI_ideal + f_branch x P_branch
    P_branch = N_delayslots x miss_rate

- Superscalars tend to have a large branch penalty P_branch due to:
  - many pipeline stages
  - multiple instructions (or operations) per cycle
- Note:
  - the lower the CPI, the larger the effect of penalties (worked example below)
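Worked example (illustrative numbers): with CPI_ideal = 1, f_branch = 0.2, N_delayslots = 3 and miss_rate = 0.1, we get P_branch = 0.3 and CPI = 1 + 0.2 x 0.3 = 1.06, a 6% loss. A 4-issue superscalar with CPI_ideal = 0.25 suffers the same absolute penalty, giving CPI = 0.31, i.e. a 24% loss.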
38 What can we do about control hazards and CPI penalty?
- Keep the penalty P_branch low:
  - Early computation of the new PC
  - Early determination of the condition
  - Visible delay slots, filled by the compiler (MIPS)
  - Branch prediction
- Reduce control dependencies (control height reduction) (Schlansker and Kathail, Micro'95)
- Remove branches: if-conversion
  - Conditional instructions: CMOVE, conditional skip-next
  - Guarding all instructions: TriMedia
39 Scheduling: conditional instructions

Example: CMOVE (supported by Alpha)

C code:  if (A == 0) S = T;    (assume r1 = A, r2 = S, r3 = T)

Object code:
        Bnez r1, L
        Mov  r2, r3
    L:  ...

If-converted:
        Cmovz r2, r3, r1
40 Scheduling: conditional instructions
- Conditional instructions are useful, however:
  - Squashed instructions still take execution time and execution resources
    - Consequence: long target blocks cannot be if-converted
  - Condition has to be known early
  - Moving operations across multiple branches requires complicated predicates
  - Compatibility: change of ISA (instruction set architecture)
- Practice:
  - Current superscalars support a limited set of conditional instructions
    - CMOVE: Alpha, MIPS, PowerPC, SPARC
    - HP PA: any RR instruction can conditionally squash the next instruction
  - Large VLIWs profit from making all instructions conditional
    - guarded execution: TriMedia, Intel/HP IA-64, TI C6x
41 Guarded execution

Before:
          SLT  r1,r2,r3
          BEQ  r1,r0,else
    then: ADDI r2,r2,1
          ..X..
          J    cont
    else: SUBI r2,r2,1
          ..Y..
    cont: MUL  r4,r2

After IF-conversion:
          SLT  b1,r2,r3
     b1:  ADDI r2,r2,1
    !b1:  SUBI r2,r2,1
     b1:  ..X..
    !b1:  ..Y..
          MUL  r4,r2
42 Scheduling: conditional instructions
- Full guard support:
  - If-conversion of conditional code
- Assume:
  - t_branch: branch latency
  - p_branch: branching probability (TRUE path taken)
  - t_true: execution time of the TRUE branch
  - t_false: execution time of the FALSE branch
- Execution times of original and if-converted code for a non-ILP architecture:

    t_original_code = (1 + p_branch) x t_branch + p_branch x t_true + (1 - p_branch) x t_false
    t_if_converted_code = t_true + t_false

(worked example below)
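Worked example (illustrative numbers, using the formulas above): with t_branch = 2, p_branch = 0.5 and t_true = t_false = 2, t_original_code = 1.5 x 2 + 0.5 x 2 + 0.5 x 2 = 5 while t_if_converted_code = 4, a 1.25x speedup. With t_true = t_false = 10, the if-converted code takes 20 cycles versus 13 for the original, a slowdown: on a non-ILP machine if-conversion pays off only for short target blocks (next slide).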
43 Scheduling: conditional instructions

[figure: speedup of if-converted code for non-ILP architectures]

Only interesting for short target blocks!
44 Scheduling: conditional instructions

[figure: speedup of if-converted code for ILP architectures with sufficient resources]

    t_if_converted = max(t_true, t_false)

Much larger area of interest!
45 Scheduling: conditional instructions
- Full guard support for large ILP architectures has a number of advantages:
  - Removing unpredictable branches
  - Enlarging the scheduling scope
  - Enabling software pipelining
  - Enhancing code motion when speculation is not allowed
  - Resource sharing: even when speculation is allowed, guarding may be profitable
46 Scheduling: overview

Transforming a sequential program into a parallel program:

    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write parallel program
47 Scheduling: integer linear programming
- Integer linear programming scheduling method
- Introduce:
  - Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j, 0 otherwise
- Constraints like:
  - Limited resources: for each cycle j and operation type t,
        sum over operations i of type t of x_{i,j} <= M_t
    where M_t is the number of resources of type t
  - Data dependence constraints
  - Timing constraints
- Problem: too many decision variables
(a compact formulation is sketched below)
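For reference, a compact version of a standard time-indexed formulation consistent with the constraints listed above (the exact formulation used in the course is not shown in the text); the objective is typically to minimize the schedule length:

    (1) sum over j of x_{i,j} = 1                              for every operation i
    (2) sum over {i : type(i) = t} of x_{i,j} <= M_t           for every cycle j and type t
    (3) sum_j j*x_{v,j} >= sum_j j*x_{u,j} + delay(u,v)        for every dependence (u,v) in E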
48 List Scheduling
- Make a dependence graph
- Determine the minimal length
- Determine ASAP, ALAP, and slack of each operation
- Place each operation in the first cycle with sufficient resources
- Note:
  - Scheduling order: sequential
  - Priority determined by the heuristic used, e.g. slack
49 Basic Block Scheduling

[figure: example DDG with inputs A, B, C and outputs X, y, z; each operation
(LD, ADD, SUB, NEG, MUL) is annotated <ASAP cycle, ALAP cycle>, e.g. <1,1>,
<2,2>, <3,3>, <4,4> on the critical path and <1,3>, <2,3>, <1,4>, <2,4> for
operations with slack]
50 ASAP and ALAP formulas

    asap(v) = 1 if v has no predecessors,
              otherwise max over edges (u,v) in E of ( asap(u) + delay(u,v) )
    alap(v) = L (the schedule length) if v has no successors,
              otherwise min over edges (v,w) in E of ( alap(w) - delay(v,w) )
    slack(v) = alap(v) - asap(v)
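For the DDG of slide 49 this gives, e.g., slack 2 for the operation annotated <1,3> (it may start anywhere between cycle 1 and cycle 3), while the operations annotated <1,1>, <2,2>, <3,3>, <4,4> have slack 0 and form the critical path; list scheduling typically gives such zero-slack operations the highest priority.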
51 Cycle-based list scheduling

    proc Schedule(DDG = (V,E))
    beginproc
        ready  = { v | ¬∃(u,v) ∈ E }    // all nodes that have no predecessor
        ready' = ready                   // all nodes that can be scheduled in the current cycle
        sched  = ∅
        current_cycle = 0
        while sched ≠ V do
            for each v ∈ ready' do
                if ¬ResourceConfl(v, current_cycle, sched) then
                    cycle(v) = current_cycle
                    sched = sched ∪ { v }
                endif
            endfor
            current_cycle = current_cycle + 1
            ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
            ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
        endwhile
    endproc

(a runnable C sketch follows)
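A runnable C sketch of the cycle-based algorithm above, under simplifying assumptions: the only resource constraint is an issue width, the ready check folds in the delay test, and ready operations are visited in node order rather than by a slack-based priority. The DDG and the sizes are made-up examples.

    /* Cycle-based list scheduling with a single issue-width resource. */
    #include <stdio.h>

    #define N 6            /* number of DDG nodes      */
    #define ISSUE_WIDTH 2  /* operations per cycle     */

    int delay[N][N];       /* delay[u][v] > 0 iff edge u->v exists */
    int cycle[N];          /* assigned cycle per node              */
    int scheduled[N];

    /* v is ready in cycle cur when every predecessor u satisfies
       cycle(u) + delay(u,v) <= cur */
    int ready_at(int v, int cur) {
        for (int u = 0; u < N; u++)
            if (delay[u][v] > 0 && (!scheduled[u] || cycle[u] + delay[u][v] > cur))
                return 0;
        return 1;
    }

    int main(void) {
        /* tiny example DDG: 0->2, 1->2, 2->4, 3->4, 4->5, all with delay 1 */
        int edges[][2] = {{0,2},{1,2},{2,4},{3,4},{4,5}};
        for (int i = 0; i < 5; i++) delay[edges[i][0]][edges[i][1]] = 1;

        int done = 0, cur = 0;
        while (done < N) {
            int issued = 0;
            for (int v = 0; v < N && issued < ISSUE_WIDTH; v++) {
                if (!scheduled[v] && ready_at(v, cur)) {  /* no resource conflict */
                    cycle[v] = cur; scheduled[v] = 1; done++; issued++;
                }
            }
            cur++;   /* advance to the next cycle */
        }
        for (int v = 0; v < N; v++) printf("op %d -> cycle %d\n", v, cycle[v]);
        return 0;
    }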
52 Problem with basic block scheduling
- Basic blocks contain on average only about 6 instructions
- Unrolling may help for loops
- Go beyond basic blocks:
  1. Extended basic block scheduling
  2. Software pipelining
53 Extended basic block scheduling: scope
Partitioning a CFG into scheduling scopes
54 Extended basic block scheduling: scope
Partitioning a CFG into scheduling scopes
55 Extended basic block scheduling: scope
Comparing scheduling scopes
56 Extended basic block scheduling: code motion
- Downward code motions?
  - a -> B, a -> C, a -> D, c -> D, d -> D
- Upward code motions?
  - c -> A, d -> A, e -> B, e -> C, e -> A
57 Extended basic block scheduling: code motion

Legend:
    I: basic blocks between the source and destination basic blocks
    D: basic blocks where duplications have to be placed
    M: control flow edges where off-liveness checks have to be performed
    b: source and destination basic blocks

- SCP (single copy on a path) rule: no path may exist between 2 different D blocks
58 Extended basic block scheduling: code motion
- A dominates B <=> A is always executed before B
  - Consequently:
    - A does not dominate B => code motion from B to A requires code duplication
- B post-dominates A <=> B is always executed after A
  - Consequently:
    - B does not post-dominate A => code motion from B to A is speculative

[figure: example CFG]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
59 Scheduling loops

Loop optimizations:

[figure: CFGs before and after transformation; loop peeling copies the
first iteration(s) ahead of the loop, loop unrolling replicates the loop
body several times inside the loop]
60 Scheduling loops
- Problems with unrolling:
  - Exploits only parallelism within sets of n iterations
  - Iteration start-up latency
  - Code expansion

[figure: resource utilization over time for basic block scheduling,
basic block scheduling with unrolling, and software pipelining]
61 Software pipelining
- Software pipelining a loop is:
  - Scheduling the loop such that iterations start before preceding iterations have finished
- Or:
  - Moving operations across the backedge

Example (loop body LD -> ML -> ST):

    LD
    LD  ML
    LD  ML  ST     <- steady state: one iteration completes per cycle
        ML  ST
            ST

    Sequential execution: 3 cycles/iteration
    Unrolling: 5/3 cycles/iteration
    Software pipelining: 1 cycle/iteration
62 Software pipelining: modulo scheduling

Example: modulo scheduling a loop whose body is

    ld  r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st  r4,(r5)

(c) Software pipeline, one new iteration started per cycle:

    ld
    mul  ld
    sub  mul  ld          } prologue
    st   sub  mul  ld     } kernel (steady state)
         st   sub  mul
              st   sub    } epilogue
                   st

- The prologue fills the SW pipeline with iterations
- The epilogue drains the SW pipeline
(a source-level C sketch follows)
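The same idea at C source level, as a hypothetical hand transformation of the loop above (assuming r2 and r5 walk through arrays a and b, and n >= 1): the load of iteration i+1 is moved across the backedge so it overlaps the multiply/subtract/store of iteration i.

    /* Software-pipelined version of: for (i = 0; i < n; i++) b[i] = a[i]*3 - 1; */
    void sp_loop(const int *a, int *b, int n) {
        int x = a[0];                 /* prologue: first load            */
        for (int i = 0; i < n - 1; i++) {
            int next = a[i + 1];      /* kernel: load for iteration i+1  */
            b[i] = x * 3 - 1;         /* kernel: compute + store for i   */
            x = next;
        }
        b[n - 1] = x * 3 - 1;         /* epilogue: drain last iteration  */
    }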
63 Summary and Conclusions
- Compilation for ILP architectures is getting mature and is entering the commercial arena.
- However:
  - Great discrepancy between available and exploitable parallelism
- What if you need more parallelism?
  - source-to-source transformations
  - use other algorithms
64 The End
Thanks