Title: Embedded Systems in Silicon TD5102 Compilers with emphasis on ILP compilation
1 Embedded Systems in Silicon TD5102
Compilers with emphasis on ILP compilation
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2 Compiling for ILP Architectures
- Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Summary and Conclusions
3 Motivation
- Performance requirements increase
- Applications may contain much instruction-level parallelism
- Processors offer lots of hardware concurrency
- Problem to be solved:
  - how to exploit this concurrency automatically?
4 Goals of code generation
- High speedup
  - Exploit all the hardware concurrency
  - Extract all application parallelism
    - obey true dependencies only
    - resolve false dependencies by renaming
- No code rewriting: automatic parallelization
  - However, application tuning may be required
- Limit code expansion
5 Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Summary and Conclusions
6 Measuring and exploiting available parallelism
- How to measure parallelism within applications?
  - Using existing compiler
  - Using trace analysis
    - Track all the real data dependencies (RaWs) of instructions from the issue window
      - register dependence
      - memory dependence
    - Check for correct branch prediction
      - if prediction correct, continue
      - if wrong, flush the schedule and start in the next cycle
7 Trace analysis

Program:
    for i = 0..2: A[i] = i
    S = X + 3

Compiled code:
          set  r1,0
          set  r2,3
          set  r3,A
    Loop: st   r1,0(r3)
          add  r1,r1,1
          add  r3,r3,4
          brne r1,r2,Loop
          add  r1,r5,3

Execution trace:
    set  r1,0
    set  r2,3
    set  r3,A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

How parallel can this code be executed?
8 Trace analysis

Parallel trace (one line per cycle):
    set  r1,0          set r2,3       set r3,A
    st   r1,0(r3)      add r1,r1,1    add r3,r3,4
    st   r1,0(r3)      add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
    st   r1,0(r3)      add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
    brne r1,r2,Loop
    add  r1,r5,3

Max ILP speedup = L_serial / L_parallel = 16 / 6 ≈ 2.7
(a small C sketch of this trace-analysis computation follows)
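To make the trace-analysis model concrete, here is a minimal C sketch that replays a trace and assigns each instruction the earliest cycle permitted by its RaW register dependences alone (1-cycle latency, perfect prediction, unlimited resources, as on the next slide). The instruction encoding and the dependence skeleton of the slide-7 trace are illustrative assumptions; because the sketch ignores control and memory dependences entirely, it can report somewhat more ILP than the hand schedule above.

    /* Trace-based ILP measurement: RaW register dependences only. */
    #include <stdio.h>

    #define NREGS 32

    typedef struct { int dst; int src1, src2; } Insn;  /* -1 = unused operand */

    int main(void) {
        /* dependence skeleton of the slide-7 trace (dst, src1, src2) */
        Insn trace[] = {
            {1,-1,-1}, {2,-1,-1}, {3,-1,-1},        /* set r1 / set r2 / set r3 */
            {-1,1,3}, {1,1,-1}, {3,3,-1}, {-1,1,2}, /* st, add r1, add r3, brne */
            {-1,1,3}, {1,1,-1}, {3,3,-1}, {-1,1,2},
            {-1,1,3}, {1,1,-1}, {3,3,-1}, {-1,1,2},
            {1,5,-1},                               /* add r1,r5,3 */
        };
        int n = sizeof trace / sizeof trace[0];
        int ready[NREGS] = {0};   /* cycle in which each register value is available */
        int L_par = 0;

        for (int i = 0; i < n; i++) {
            int c = 0;            /* earliest cycle: after all producers (RaW only) */
            if (trace[i].src1 >= 0 && ready[trace[i].src1] > c) c = ready[trace[i].src1];
            if (trace[i].src2 >= 0 && ready[trace[i].src2] > c) c = ready[trace[i].src2];
            c += 1;               /* 1-cycle latency for every instruction */
            if (trace[i].dst >= 0) ready[trace[i].dst] = c;
            if (c > L_par) L_par = c;
        }
        printf("L_serial = %d, L_parallel = %d, speedup = %.1f\n",
               n, L_par, (double)n / L_par);
        return 0;
    }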
9 Ideal Processor
- Assumptions for ideal/perfect processor:
  - 1. Register renaming: infinite number of virtual registers ⇒ all WAW and WAR hazards avoided
  - 2. Branch and jump prediction: perfect ⇒ all program instructions available for execution
  - 3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal
- Also:
  - unlimited number of instructions issued per cycle (unlimited resources), and
  - unlimited instruction window
  - perfect caches
  - 1-cycle latency for all instructions (including FP *, /)
- Programs were compiled using the MIPS compiler with maximum optimization level
10 Upper Limit to ILP: Ideal Processor
[figure: measured IPC per benchmark; integer programs 18-60 IPC, FP programs 75-150 IPC]
11 Different effects reduce the exploitable parallelism
- Reducing window size
  - i.e., the number of instructions to choose from
- Non-perfect branch prediction
  - perfect (oracle model)
  - dynamic predictor (e.g. 2-bit prediction table with a finite number of entries)
  - static prediction (using profiling)
  - no prediction
- Restricted number of registers for renaming
  - typical superscalars have O(100) registers
- Restricted number of other resources, like FUs
12 Different effects reduce the exploitable parallelism
- Non-perfect alias analysis (memory disambiguation). Models to use:
  - perfect
  - inspection: no dependence in the following cases:

        r1 = 0(r9)         r1 = 0(fp)
        4(r9) = r2         0(gp) = r2

    (same base register with non-overlapping offsets; stack (fp) versus global (gp) accesses)
  - a more advanced analysis may disambiguate most stack and global references, but not the heap references
  - none
- Important:
  - good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and (for floating point) a large window size
13 Summary
- Amount of parallelism is limited
  - higher in multimedia applications
  - higher in kernels
- Trace analysis detects all types of parallelism
  - task, data and operation types
- Detected parallelism depends on
  - quality of the compiler
  - hardware
  - source-code transformations
14 Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Source level transformations
- Compilation frameworks
- Summary and Conclusions
15 Compiler basics
- Overview
- Compiler trajectory / structure / passes
- Abstract Syntax Tree (AST)
- Control Flow Graph (CFG)
- Data Dependence Graph (DDG)
- Basic optimizations
- Register allocation
- Code selection
16 Compiler basics: trajectory

    source program -> preprocessor -> compiler -> assembler -> loader/linker -> object program

(the compiler emits error messages; the loader/linker pulls in library code)
17 Compiler basics: structure / passes

source code
  -> Lexical analyzer: token generation
  -> Parsing: check syntax, check semantics, parse tree generation
intermediate code
  -> Code optimization: data flow analysis, local optimizations, global optimizations
  -> Code generation: code selection, peephole optimizations
  -> Register allocation: build interference graph, graph coloring, spill code insertion, caller/callee save and restore code
sequential code
  -> Scheduling and allocation: exploiting ILP
object code
18 Compiler basics: structure, simple compilation example

Input:                position = initial + rate * 60

Lexical analyzer:     id1 = id2 + id3 * 60

Syntax analyzer:      parse tree for id1 = id2 + id3 * 60

Intermediate code generator:
    temp1 = inttoreal(60)
    temp2 = id3 * temp1
    temp3 = id2 + temp2
    id1 = temp3

Code optimizer:
    temp1 = id3 * 60.0
    id1 = id2 + temp1

Code generator:
    movf id3, r2
    mulf 60.0, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
19 Compiler basics: structure - SUIF-1 toolkit example
20 Compiler basics: Abstract Syntax Tree (AST)

C input code:
    if (a > b) r = a % b;
    else       r = b % a;

Parse tree (nesting can be arbitrarily deep):
    Stat: IF
      Cmp: >
        Var a
        Var b
      Statlist
        Stat: Expr
          Assign
            Var r
            Binop %
              Var a
              Var b
      Statlist
        Stat: Expr
          Assign
            Var r
            Binop %
              Var b
              Var a
21 Compiler basics: Control Flow Graph (CFG)

C input code:
    if (a > b) r = a % b;
    else       r = b % a;

CFG:
    1: sub t1, a, b
       bgz t1, 2, 3
    2: rem r, a, b
       goto 4
    3: rem r, b, a
       goto 4
    4: ...

A program is a collection of functions, each function is a collection of
basic blocks, each basic block contains a set of instructions, and each
instruction consists of several transports, ...
22 Data Dependence Graph (DDG)

Source:
    a = b + 15
    c = 3.14 * d
    e = c / f

Translation to DDG: [figure] loads of b, d and f feed the operators
(+ with constant 15, * with constant 3.14, /); the operator results are
stored to a, c and e, and the value of c also feeds the division that
produces e.
23 Compiler basics: basic optimizations
- Machine independent optimizations
- Machine dependent optimizations
- (details are in any good compiler book)
24 Machine independent optimizations
- Common subexpression elimination
- Constant folding
- Copy propagation
- Dead-code elimination
- Induction variable elimination
- Strength reduction
- Algebraic identities
  - Commutative expressions
  - Associativity: tree height reduction
- Note: not always allowed (due to limited precision)
(a source-level sketch of several of these follows)
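As a source-level illustration (hand-applied; a real compiler performs these on intermediate code), the sketch below shows several of the optimizations above on a toy C function. The function and variable names are made up for the example; both versions compute the same result.

    /* Before: naive code with optimization opportunities. */
    int before(int a, int b, int n) {
        int r = 0;
        for (int i = 0; i < n; i++) {
            int x = a * b;       /* loop-invariant common subexpression   */
            int y = 2 * 3;       /* constant folding candidate            */
            int dead = x + y;    /* dead code: never used                 */
            r += x + y + i * 4;  /* i * 4: strength reduction candidate   */
        }
        return r;
    }

    /* After: the same computation, hand-optimized. */
    int after(int a, int b, int n) {
        int x = a * b;           /* common subexpression hoisted          */
        int r = 0;
        int i4 = 0;              /* induction variable replacing i * 4    */
        for (int i = 0; i < n; i++) {
            r += x + 6 + i4;     /* constant folded, dead code eliminated */
            i4 += 4;             /* strength reduction: multiply -> add   */
        }
        return r;
    }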
25 Machine dependent optimization example
- What's the optimal implementation of a * 34?
- Use the multiplier: mul Tb,Ta,34
  - Pro: no thinking required
  - Con: may take many cycles
- Alternative (34 = 32 + 2):
    SHL Tc, Ta, 1
    ADD Tb, Tc, Tzero
    SHL Tc, Tc, 4
    ADD Tb, Tb, Tc
  - Pro: may take fewer cycles
  - Cons:
    - uses more registers
    - additional instructions (I-cache load / code size)
(C equivalent sketched below)
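The same trick at C source level, assuming the compiler does not already apply it itself: since 34 = 32 + 2, a*34 can be computed with two shifts and an add, mirroring the SHL/ADD sequence above (which builds a<<1, then shifts it 4 more positions to obtain a<<5).

    /* Sketch: strength-reduced multiply by 34 = 32 + 2.
       Profitable only if shifts plus add are cheaper than the target's mul. */
    static inline int mul34(int a) {
        return (a << 5) + (a << 1);   /* a*32 + a*2 */
    }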
26 Compiler basics: register allocation
- Register organization
- Conventions needed for parameter passing and register usage across function calls; a MIPS example:

    r31-r21: callee-saved registers
    r20-r11: caller-saved registers / temporaries
    r10-r1:  argument and result transfer
    r0:      hard-wired 0
27 Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers
to program variables in terms of execution time of the program?

- A variable is defined at a point in the program when a value is assigned to it.
- A variable is used at a point in the program when its value is referenced in an expression.
- The live range of a variable is the execution range between the definitions and uses of the variable.
28 Register allocation using graph coloring

Example: [figure: program fragment with the live ranges of its variables]
29 Register allocation using graph coloring

Interference graph (nodes a, b, c, d)

Coloring: a = red, b = green, c = blue, d = green

The graph needs 3 colors (chromatic number 3) ⇒ the program needs 3 registers
(a greedy-coloring sketch follows)
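A minimal greedy-coloring sketch for this example. The edge set below is an assumption chosen to be consistent with the coloring shown (the interference-graph figure itself is not reproduced here); production allocators use Chaitin/Briggs-style simplification with spilling rather than this fixed visit order.

    /* Greedy coloring of a small interference graph. */
    #include <stdio.h>

    #define N 4  /* variables a, b, c, d */

    int main(void) {
        const char *name = "abcd";
        int adj[N][N] = {0};
        int edges[][2] = {{0,1},{0,2},{1,2},{0,3},{2,3}}; /* assumed interferences */
        for (int e = 0; e < 5; e++)
            adj[edges[e][0]][edges[e][1]] = adj[edges[e][1]][edges[e][0]] = 1;

        int color[N];
        for (int v = 0; v < N; v++) {
            int used = 0;                   /* bitmask of colors of earlier neighbors */
            for (int u = 0; u < v; u++)
                if (adj[v][u]) used |= 1 << color[u];
            int c = 0;
            while (used & (1 << c)) c++;    /* smallest free color = register number */
            color[v] = c;
            printf("%c -> r%d\n", name[v], c);
        }
        return 0;
    }

With these edges the sketch prints a -> r0, b -> r1, c -> r2, d -> r1: three registers, matching the coloring on the slide.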
30 Register allocation using graph coloring

Spill/reload code is needed when there are not enough colors (registers)
to color the interference graph.

Example: only two registers available!
31 Compiler basics: code selection
- CISC era
  - Code size important
  - Determine the shortest sequence of code
    - Many options may exist
  - Pattern matching
    - Example M68020:
      D1 := D1 + M[M[10+A1] + 16*D2 + 20]
      ADD ([10,A1], D2*16, 20), D1
- RISC era
  - Performance important
  - Only few possible code sequences
  - New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020
32 Overview
- Motivation and Goals
- Measuring and exploiting available parallelism
- Compiler basics
- Scheduling for ILP architectures
- Source level transformations
- Compilation frameworks
- Summary and Conclusions
33 What is scheduling?
- Time allocation
  - Assigning instructions or operations to time slots
  - Preserve dependences
    - Register dependences
    - Memory dependences
  - Optimize code with respect to performance / code size / power consumption / ...
- Space allocation
  - Satisfy resource constraints
  - Bind operations to FUs
  - Bind variables to registers / register files
  - Bind transports to buses
34 Why scheduling?
- Let's look at the execution time:

    T_execution = N_cycles x T_cycle = N_instructions x CPI x T_cycle

- Scheduling may reduce T_execution
  - Reduce CPI (cycles per instruction)
    - early scheduling of long-latency operations
    - avoid pipeline stalls due to structural, data and control hazards
    - allow N_issue > 1 and therefore CPI < 1
  - Reduce N_instructions
    - compact many operations into each instruction (VLIW)
35 Scheduling data hazards: RaW dependence

Avoiding RaW stalls: reordering of instructions by the compiler

Example: avoiding a one-cycle load interlock for the code
    a = b + c
    d = e - f
by hoisting a load from the second statement above the add of the first,
so that no operation directly follows the load that produces its operand.
36 Scheduling control hazards
- A branch requires 3 actions:
  - Compute the new address
  - Determine the condition
  - Perform the actual branch (if taken): PC := new address
37 Control hazards: what's the penalty?

    CPI = CPI_ideal + f_branch x P_branch
    P_branch = N_delayslots x miss_rate

- Superscalars tend to have a large branch penalty P_branch due to:
  - many pipeline stages
  - multiple instructions (or operations) per cycle
- Note:
  - the lower the CPI, the larger the effect of penalties (worked example below)
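Worked example (illustrative numbers): with CPI_ideal = 1, f_branch = 0.2, N_delayslots = 3 and miss_rate = 0.1, we get P_branch = 0.3 and CPI = 1 + 0.2 x 0.3 = 1.06, a 6% loss. A 4-issue superscalar with CPI_ideal = 0.25 suffers the same absolute penalty, giving CPI = 0.31, i.e. a 24% loss.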
38 What can we do about control hazards and CPI penalty?
- Keep the penalty P_branch low:
  - Early computation of the new PC
  - Early determination of the condition
  - Visible delay slots, filled by the compiler (MIPS)
  - Branch prediction
- Reduce control dependencies (control height reduction) (Schlansker and Kathail, Micro'95)
- Remove branches: if-conversion
  - Conditional instructions: CMOVE, conditional skip-next
  - Guarding all instructions: TriMedia
39 Scheduling: conditional instructions

Example: CMOVE (supported by Alpha)

C code:  if (A == 0) S = T;    (assume r1 = A, r2 = S, r3 = T)

Object code:
        Bnez r1, L
        Mov  r2, r3
    L:  ...

If-converted:
        Cmovz r2, r3, r1
40 Scheduling: conditional instructions
- Conditional instructions are useful, however:
  - Squashed instructions still take execution time and execution resources
    - Consequence: long target blocks cannot be if-converted
  - Condition has to be known early
  - Moving operations across multiple branches requires complicated predicates
  - Compatibility: change of ISA (instruction set architecture)
- Practice:
  - Current superscalars support a limited set of conditional instructions
    - CMOVE: Alpha, MIPS, PowerPC, SPARC
    - HP PA: any RR instruction can conditionally squash the next instruction
  - Large VLIWs profit from making all instructions conditional
    - guarded execution: TriMedia, Intel/HP IA-64, TI C6x
41 Guarded execution

Before:
          SLT  r1,r2,r3
          BEQ  r1,r0,else
    then: ADDI r2,r2,1
          ..X..
          J    cont
    else: SUBI r2,r2,1
          ..Y..
    cont: MUL  r4,r2

After IF-conversion:
          SLT  b1,r2,r3
     b1:  ADDI r2,r2,1
    !b1:  SUBI r2,r2,1
     b1:  ..X..
    !b1:  ..Y..
          MUL  r4,r2
42 Scheduling: conditional instructions
- Full guard support:
  - If-conversion of conditional code
- Assume:
  - t_branch: branch latency
  - p_branch: branching probability (TRUE path taken)
  - t_true: execution time of the TRUE branch
  - t_false: execution time of the FALSE branch
- Execution times of original and if-converted code for a non-ILP architecture:

    t_original_code = (1 + p_branch) x t_branch + p_branch x t_true + (1 - p_branch) x t_false
    t_if_converted_code = t_true + t_false

(worked example below)
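Worked example (illustrative numbers, using the formulas above): with t_branch = 2, p_branch = 0.5 and t_true = t_false = 2, t_original_code = 1.5 x 2 + 0.5 x 2 + 0.5 x 2 = 5 while t_if_converted_code = 4, a 1.25x speedup. With t_true = t_false = 10, the if-converted code takes 20 cycles versus 13 for the original, a slowdown: on a non-ILP machine if-conversion pays off only for short target blocks (next slide).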
43 Scheduling: conditional instructions

[figure: speedup of if-converted code for non-ILP architectures]

Only interesting for short target blocks!
44 Scheduling: conditional instructions

[figure: speedup of if-converted code for ILP architectures with sufficient resources]

    t_if_converted = max(t_true, t_false)

Much larger area of interest!
45 Scheduling: conditional instructions
- Full guard support for large ILP architectures has a number of advantages:
  - Removing unpredictable branches
  - Enlarging the scheduling scope
  - Enabling software pipelining
  - Enhancing code motion when speculation is not allowed
  - Resource sharing: even when speculation is allowed, guarding may be profitable
46 Scheduling: overview

Transforming a sequential program into a parallel program:

    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write parallel program
47 Scheduling: integer linear programming
- Integer linear programming scheduling method
- Introduce:
  - Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j, 0 otherwise
- Constraints like:
  - Limited resources: for each cycle j and operation type t,
        sum over operations i of type t of x_{i,j} <= M_t
    where M_t is the number of resources of type t
  - Data dependence constraints
  - Timing constraints
- Problem: too many decision variables
(a compact formulation is sketched below)
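For reference, a compact version of a standard time-indexed formulation consistent with the constraints listed above (the exact formulation used in the course is not shown in the text); the objective is typically to minimize the schedule length:

    (1) sum over j of x_{i,j} = 1                              for every operation i
    (2) sum over {i : type(i) = t} of x_{i,j} <= M_t           for every cycle j and type t
    (3) sum_j j*x_{v,j} >= sum_j j*x_{u,j} + delay(u,v)        for every dependence (u,v) in E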
48 List Scheduling
- Make a dependence graph
- Determine the minimal length
- Determine ASAP, ALAP, and slack of each operation
- Place each operation in the first cycle with sufficient resources
- Note:
  - Scheduling order: sequential
  - Priority determined by the heuristic used, e.g. slack
49 Basic Block Scheduling

[figure: example DDG with inputs A, B, C and outputs X, y, z; each operation
(LD, ADD, SUB, NEG, MUL) is annotated <ASAP cycle, ALAP cycle>, e.g. <1,1>,
<2,2>, <3,3>, <4,4> on the critical path and <1,3>, <2,3>, <1,4>, <2,4> for
operations with slack]
50 ASAP and ALAP formulas

    asap(v) = 1 if v has no predecessors,
              otherwise max over edges (u,v) in E of ( asap(u) + delay(u,v) )
    alap(v) = L (the schedule length) if v has no successors,
              otherwise min over edges (v,w) in E of ( alap(w) - delay(v,w) )
    slack(v) = alap(v) - asap(v)
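For the DDG of slide 49 this gives, e.g., slack 2 for the operation annotated <1,3> (it may start anywhere between cycle 1 and cycle 3), while the operations annotated <1,1>, <2,2>, <3,3>, <4,4> have slack 0 and form the critical path; list scheduling typically gives such zero-slack operations the highest priority.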
51 Cycle-based list scheduling

    proc Schedule(DDG = (V,E))
    beginproc
        ready  = { v | ¬∃(u,v) ∈ E }    // all nodes that have no predecessor
        ready' = ready                   // all nodes that can be scheduled in the current cycle
        sched  = ∅
        current_cycle = 0
        while sched ≠ V do
            for each v ∈ ready' do
                if ¬ResourceConfl(v, current_cycle, sched) then
                    cycle(v) = current_cycle
                    sched = sched ∪ { v }
                endif
            endfor
            current_cycle = current_cycle + 1
            ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
            ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
        endwhile
    endproc

(a runnable C sketch follows)
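A runnable C sketch of the cycle-based algorithm above, under simplifying assumptions: the only resource constraint is an issue width, the ready check folds in the delay test, and ready operations are visited in node order rather than by a slack-based priority. The DDG and the sizes are made-up examples.

    /* Cycle-based list scheduling with a single issue-width resource. */
    #include <stdio.h>

    #define N 6            /* number of DDG nodes      */
    #define ISSUE_WIDTH 2  /* operations per cycle     */

    int delay[N][N];       /* delay[u][v] > 0 iff edge u->v exists */
    int cycle[N];          /* assigned cycle per node              */
    int scheduled[N];

    /* v is ready in cycle cur when every predecessor u satisfies
       cycle(u) + delay(u,v) <= cur */
    int ready_at(int v, int cur) {
        for (int u = 0; u < N; u++)
            if (delay[u][v] > 0 && (!scheduled[u] || cycle[u] + delay[u][v] > cur))
                return 0;
        return 1;
    }

    int main(void) {
        /* tiny example DDG: 0->2, 1->2, 2->4, 3->4, 4->5, all with delay 1 */
        int edges[][2] = {{0,2},{1,2},{2,4},{3,4},{4,5}};
        for (int i = 0; i < 5; i++) delay[edges[i][0]][edges[i][1]] = 1;

        int done = 0, cur = 0;
        while (done < N) {
            int issued = 0;
            for (int v = 0; v < N && issued < ISSUE_WIDTH; v++) {
                if (!scheduled[v] && ready_at(v, cur)) {  /* no resource conflict */
                    cycle[v] = cur; scheduled[v] = 1; done++; issued++;
                }
            }
            cur++;   /* advance to the next cycle */
        }
        for (int v = 0; v < N; v++) printf("op %d -> cycle %d\n", v, cycle[v]);
        return 0;
    }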
52 Problem with basic block scheduling
- Basic blocks contain on average only about 6 instructions
- Unrolling may help for loops
- Go beyond basic blocks:
  1. Extended basic block scheduling
  2. Software pipelining
53 Extended basic block scheduling: scope
Partitioning a CFG into scheduling scopes
54 Extended basic block scheduling: scope
Partitioning a CFG into scheduling scopes
55 Extended basic block scheduling: scope
Comparing scheduling scopes
56 Extended basic block scheduling: code motion
- Downward code motions?
  - a -> B, a -> C, a -> D, c -> D, d -> D
- Upward code motions?
  - c -> A, d -> A, e -> B, e -> C, e -> A
57 Extended basic block scheduling: code motion

Legend:
    I: basic blocks between the source and destination basic blocks
    D: basic blocks where duplications have to be placed
    M: control flow edges where off-liveness checks have to be performed
    b: source and destination basic blocks

- SCP (single copy on a path) rule: no path may exist between 2 different D blocks
58 Extended basic block scheduling: code motion
- A dominates B <=> A is always executed before B
  - Consequently:
    - A does not dominate B => code motion from B to A requires code duplication
- B post-dominates A <=> B is always executed after A
  - Consequently:
    - B does not post-dominate A => code motion from B to A is speculative

[figure: example CFG]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
59 Scheduling loops

Loop optimizations:

[figure: CFGs before and after transformation; loop peeling copies the
first iteration(s) ahead of the loop, loop unrolling replicates the loop
body several times inside the loop]
60 Scheduling loops
- Problems with unrolling:
  - Exploits only parallelism within sets of n iterations
  - Iteration start-up latency
  - Code expansion

[figure: resource utilization over time for basic block scheduling,
basic block scheduling with unrolling, and software pipelining]
61 Software pipelining
- Software pipelining a loop is:
  - Scheduling the loop such that iterations start before preceding iterations have finished
- Or:
  - Moving operations across the backedge

Example (loop body LD -> ML -> ST):

    LD
    LD  ML
    LD  ML  ST     <- steady state: one iteration completes per cycle
        ML  ST
            ST

    Sequential execution: 3 cycles/iteration
    Unrolling: 5/3 cycles/iteration
    Software pipelining: 1 cycle/iteration
62 Software pipelining: modulo scheduling

Example: modulo scheduling a loop whose body is

    ld  r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st  r4,(r5)

(c) Software pipeline, one new iteration started per cycle:

    ld
    mul  ld
    sub  mul  ld          } prologue
    st   sub  mul  ld     } kernel (steady state)
         st   sub  mul
              st   sub    } epilogue
                   st

- The prologue fills the SW pipeline with iterations
- The epilogue drains the SW pipeline
(a source-level C sketch follows)
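The same idea at C source level, as a hypothetical hand transformation of the loop above (assuming r2 and r5 walk through arrays a and b, and n >= 1): the load of iteration i+1 is moved across the backedge so it overlaps the multiply/subtract/store of iteration i.

    /* Software-pipelined version of: for (i = 0; i < n; i++) b[i] = a[i]*3 - 1; */
    void sp_loop(const int *a, int *b, int n) {
        int x = a[0];                 /* prologue: first load            */
        for (int i = 0; i < n - 1; i++) {
            int next = a[i + 1];      /* kernel: load for iteration i+1  */
            b[i] = x * 3 - 1;         /* kernel: compute + store for i   */
            x = next;
        }
        b[n - 1] = x * 3 - 1;         /* epilogue: drain last iteration  */
    }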
63 Summary and Conclusions
- Compilation for ILP architectures is getting mature and is entering the commercial arena.
- However:
  - Great discrepancy between available and exploitable parallelism
- What if you need more parallelism?
  - source-to-source transformations
  - use other algorithms
64 The End
Thanks