Title: Platform-based Design
1 Platform-based Design
Exploiting ILP: VLIW architectures
- TU/e 5kk70
- Henk Corporaal
- Bart Mesman
2 What are we talking about?
ILP = Instruction Level Parallelism: the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
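As a small illustrative example (not from the slides), these two operations come from one instruction stream but are independent, so an ILP machine can execute them in the same cycle:
  add r1, r2, r3   ; r1 = r2 + r3
  mul r4, r5, r6   ; r4 = r5 * r6, independent of the add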
3 VLIW Topics Overview
- Enhance performance
- What options do you have?
- Instruction Level Parallelism
- Limits on ILP
- VLIW
- Examples
- Clustering
- Code generation
- Hands-on
4 Enhance performance: 4 architecture methods
- (Super)-pipelining
- Powerful instructions
- MD-technique
- multiple data operands per operation
- MO-technique
- multiple operations per instruction
- Multiple instruction issue
5 Architecture methods: Pipelined Execution of Instructions
IF = Instruction Fetch, DC = Instruction Decode, RF = Register Fetch, EX = Execute instruction, WB = Write Result Register
[Figure: pipeline diagram with 4 instructions overlapped across cycles 1-8, one stage per cycle]
Simple 5-stage pipeline
- Purpose of pipelining
- Reduce the number of gate levels in the critical path
- Reduce CPI close to one (instead of a large number for the multicycle machine)
- More efficient hardware
- Problems
- Hazards cause pipeline stalls
- Structural hazards: add more hardware
- Control hazards, branch penalties: use branch prediction
- Data hazards: bypassing required (see the small example below)
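A minimal (hypothetical) data-hazard example for the 5-stage pipeline above; the sub needs r1 one cycle after the add computes it, so the result is forwarded (bypassed) from EX to EX instead of stalling:
  add r1, r2, r3   ; r1 produced in EX
  sub r4, r1, r5   ; r1 consumed in the next EX, so a bypass avoids the stall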
6 Architecture methods: Pipelined Execution of Instructions
- Superpipelining
- Split one or more of the critical pipeline stages
- Superpipelining degree S
S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
where f(Op) is the frequency of operation Op
and lt(Op) is the latency of operation Op
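A small worked example with a hypothetical operation mix: if 80% of the executed operations have latency 1 and 20% have latency 3, then S = 0.8 · 1 + 0.2 · 3 = 1.4.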
7 Architecture methods: Powerful Instructions (1)
- MD-technique
- Multiple data operands per operation
- SIMD: Single Instruction Multiple Data

Vector instruction:
  for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];
or: c = a + 5*b

Assembly:
  set   vl, 64       ; vector length = 64
  ldv   v1, 0(r2)    ; load b
  mulvi v2, v1, 5    ; v2 = 5*b
  ldv   v1, 0(r1)    ; load a
  addv  v3, v1, v2   ; v3 = a + 5*b
  stv   v3, 0(r3)    ; store c
8 Architecture methods: Powerful Instructions (1)
- SIMD computing
- Nodes used for independent operations
- Mesh or hypercube connectivity
- Exploit data locality of e.g. image processing applications
- Dense encoding (few instruction bits needed)
9 Architecture methods: Powerful Instructions (1)
- Sub-word parallelism
- SIMD on restricted scale
- Used for Multi-media instructions
- Examples
- MMX, SSE, Sun VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
- Example: Σ_{i=1..4} |a_i - b_i| (see the sketch below)
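A minimal C sketch (illustrative, not from the slides) of what one such sub-word instruction computes, assuming four 16-bit sub-words packed into a 64-bit register:

  #include <stdint.h>
  #include <stdlib.h>

  /* Scalar reference for a sub-word SAD-style operation:
     sum of absolute differences over four packed 16-bit lanes. */
  uint32_t sad4x16(uint64_t a, uint64_t b) {
      uint32_t sum = 0;
      for (int i = 0; i < 4; i++) {
          int16_t ai = (int16_t)(a >> (16 * i));   /* extract lane i of a */
          int16_t bi = (int16_t)(b >> (16 * i));   /* extract lane i of b */
          sum += (uint32_t)abs(ai - bi);           /* |a_i - b_i| */
      }
      return sum;  /* a multimedia instruction delivers this in one operation */
  }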
10 Architecture methods: Powerful Instructions (2)
- MO-technique: multiple operations per instruction
- Two options
- CISC (Complex Instruction Set Computer)
- VLIW (Very Long Instruction Word)
[Figure: VLIW instruction example, one long instruction with one operation field per function unit]
  FU 1: sub  r8, r5, 3
  FU 2: and  r1, r5, 12
  FU 3: mul  r6, r5, r2
  FU 4: ld   r3, 0(r5)
  FU 5: bnez r5, 13
11 VLIW architecture: central Register File
[Figure: 9 exec units grouped into 3 issue slots, all reading from and writing to one central register file]
Q: How many ports does the register file need for n-issue?
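A worked note (not on the slide; assuming dyadic RISC-style operations): each issue slot needs 2 read ports and 1 write port, so an n-issue machine needs about 2n read + n write = 3n register-file ports.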
12 TriMedia TM32A processor
0.18 micron, area 16.9 mm², 200 MHz (typ), 1.4 W, 7 mW/MHz (MIPS: 0.9 mW/MHz)
13 Architecture methods: Powerful Instructions (2)
VLIW Characteristics
- Only RISC-like operation support
- Short cycle times
- Flexible: can implement any FU mixture
- Extensible
- Tight inter-FU connectivity required
- Large instructions (up to 1000 bits)
- Not binary compatible !!!
- But good compilers exist
14 Architecture methods: Multiple instruction issue (per cycle)
- Who guarantees semantic correctness?
- Which instructions can be executed in parallel?
- User: (s)he specifies multiple instruction streams
- Multi-processor: MIMD (Multiple Instruction Multiple Data)
- HW: run-time detection of ready instructions
- Superscalar
- Compiler: compile into dataflow representation
- Dataflow processors
15 Multiple instruction issue: Three Approaches
Example code:
  a := b + 15
  c := 3.14 * d
  e := c / f
Translation to DDG (Data Dependence Graph)
[Figure: DDG with load nodes for b, d and f, constants 15 and 3.14, operator nodes +, * and /, and store nodes for a, c and e]
16 Instr. / Sequential Code / Dataflow Code
  I1: ld   r1,M(b)       ld M(b)    -> I2
  I2: addi r1,r1,15      addi 15    -> I3
  I3: st   r1,M(a)       st M(a)
  I4: ld   r1,M(d)       ld M(d)    -> I5
  I5: muli r1,r1,3.14    muli 3.14  -> I6, I8
  I6: st   r1,M(c)       st M(c)
  I7: ld   r2,M(f)       ld M(f)    -> I8
  I8: div  r1,r1,r2      div        -> I9
  I9: st   r1,M(e)       st M(e)
- Notes
- An MIMD may execute two streams: (1) I1-I3, (2) I4-I9
- No dependencies between the streams; in practice communication and synchronization are required between streams
- A superscalar issues multiple instructions from the sequential stream
- It must obey dependencies (true and name dependencies)
- Reverse engineering of the DDG is needed at run-time
- Dataflow code is a direct representation of the DDG
17 Multiple Instruction Issue: Dataflow processor
[Figure: dataflow processor organization with Token Matching, Token Store, Instruction Generate, Instruction Store, Result Tokens and Reservation Stations]
18 Instruction Pipeline Overview
[Figure: instruction pipelines compared for CISC, RISC, Superscalar, Superpipelined, Dataflow and VLIW]
19 Four-dimensional representation of the architecture design space <I, O, D, S>
20 Architecture design space
Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures

S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)

Mpar = I · O · D · S (a worked example follows below)
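A hypothetical worked example (numbers are illustrative): a VLIW issuing I = 1 instruction per cycle with O = 5 operations per instruction, D = 1 data operand per operation and superpipelining degree S = 1.2 gives Mpar = 1 · 5 · 1 · 1.2 = 6.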
21 Overview
- Enhance performance architecture methods
- Instruction Level Parallelism
- limits on ILP
- VLIW
- Examples
- Clustering
- Code generation
- Hands-on
22 General organization of an ILP architecture
23 Motivation for ILP
- Increasing VLSI densities; decreasing feature size
- Increasing performance requirements
- New application areas, like
- multi-media (image, audio, video, 3-D)
- intelligent search and filtering engines
- neural, fuzzy, genetic computing
- More functionality
- Use of existing Code (Compatibility)
- Low power: P = α · f · C · Vdd²
24 Low power through parallelism
- Sequential Processor
- Switching capacitance C
- Frequency f
- Voltage V
- P = α · f · C · V²
- Parallel processor (two times the number of units)
- Switching capacitance 2C
- Frequency f/2
- Voltage V' < V
- P = α · (f/2) · 2C · V'² < α · f · C · V²
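A hypothetical numeric check: if halving the frequency allows the supply to drop to V' = 0.7 · V, then P = α · (f/2) · 2C · (0.7 V)² ≈ 0.49 · α · f · C · V², i.e. about half the power at the same throughput.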
25 Measuring and exploiting available ILP
- How much ILP is there in applications?
- How to measure parallelism within applications?
- Using existing compiler
- Using trace analysis
- Track all the real data dependencies (RAWs) of instructions from the issue window (see the sketch below)
- register dependence
- memory dependence
- Check for correct branch prediction
- if prediction correct: continue
- if wrong: flush schedule and start in the next cycle
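A minimal C sketch of this kind of trace analysis (an illustrative assumption, not the actual tool behind the slides): it ASAP-schedules a trace using only true (RAW) register dependences, assuming perfect branch prediction, renaming, memory disambiguation and single-cycle latencies:

  #define NREGS 32

  typedef struct {
      int dst;           /* destination register, -1 if none */
      int src1, src2;    /* source registers, -1 if unused   */
  } TraceOp;

  static int max2(int a, int b) { return a > b ? a : b; }

  /* Returns the parallel schedule length; available ILP = n_ops / length. */
  int asap_length(const TraceOp *trace, int n_ops) {
      int ready[NREGS] = {0};   /* cycle in which each register value is ready */
      int length = 0;
      for (int i = 0; i < n_ops; i++) {
          int cycle = 0;        /* earliest cycle respecting RAW dependences */
          if (trace[i].src1 >= 0) cycle = max2(cycle, ready[trace[i].src1]);
          if (trace[i].src2 >= 0) cycle = max2(cycle, ready[trace[i].src2]);
          if (trace[i].dst  >= 0) ready[trace[i].dst] = cycle + 1;  /* 1-cycle latency */
          length = max2(length, cycle + 1);
      }
      return length;
  }

Because it ignores control and memory dependences, this gives an optimistic upper bound on the available ILP.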
26 Trace analysis
Program:
  for i = 0 .. 2
    A[i] = i
  S = X + 3

Compiled code:
        set  r1,0
        set  r2,3
        set  r3,A
  Loop: st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3

Trace:
  set r1,0
  set r2,3
  set r3,A
  st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
  st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
  st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
  add r1,r5,3

How parallel can this code be executed?
27 Trace analysis
Parallel Trace:
  cycle 1: set r1,0       set r2,3      set r3,A
  cycle 2: st r1,0(r3)    add r1,r1,1   add r3,r3,4
  cycle 3: st r1,0(r3)    add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
  cycle 4: st r1,0(r3)    add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
  cycle 5: brne r1,r2,Loop
  cycle 6: add r1,r5,3

Max ILP = Speedup = L_serial / L_parallel = 16 / 6 = 2.7
28 Ideal Processor
- Assumptions for ideal/perfect processor
- 1. Register renaming: infinite number of virtual registers => all register WAW and WAR hazards avoided
- 2. Branch and jump prediction: perfect => all program instructions available for execution
- 3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal
- Also
- unlimited number of instructions issued per cycle (unlimited resources), and
- unlimited instruction window
- perfect caches
- 1 cycle latency for all instructions (including FP *, /)
- Programs were compiled using the MIPS compiler with maximum optimization level
29 Upper Limit to ILP: Ideal Processor
[Chart: IPC on the ideal processor; Integer: 18 - 60, FP: 75 - 150]
30 Window Size and Branch Impact
- Change from the infinite window: examine 2000 and issue at most 64 instructions per cycle
[Chart: IPC per branch-prediction scheme (Perfect, Tournament, BHT(512), Profile, No prediction); FP: 15 - 45, Integer: 6 - 12]
31 Limiting nr. of Renaming Registers
- Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)
[Chart: IPC vs. number of renaming registers (Infinite, 256, 128, 64, 32); FP: 11 - 45, Integer: 5 - 15]
32 Memory Address Alias Impact
- Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers
[Chart: IPC per alias-analysis model (Perfect, Global/stack perfect, Inspection, None); FP: 4 - 45 (Fortran, no heap), Integer: 4 - 9]
33 Reducing Window Size
- Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many as the window allows
[Chart: IPC vs. window size (Infinite, 256, 128, 64, 32, 16, 8, 4); FP: 8 - 45, Integer: 6 - 12]
34 How to Exceed ILP Limits of This Study?
- WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not through memory
- Unnecessary dependences
- the compiler did not unroll loops, so there is an iteration-variable dependence
- Overcoming the data-flow limit: value prediction, i.e. predicting values and speculating on the prediction
- Address value prediction and speculation: predicts addresses and speculates by reordering loads and stores; could provide better aliasing analysis
35 Conclusions
- Amount of parallelism is limited
- higher in Multi-Media and Signal Processing appl.
- higher in kernels
- Trace analysis detects all types of parallelism
- task, data and operation types
- Detected parallelism depends on
- quality of compiler
- hardware
- source-code transformations
36 Overview
- Enhance performance architecture methods
- Instruction Level Parallelism
- VLIW
- Examples
- C6
- TM
- IA-64 Itanium, ....
- TTA
- Clustering
- Code generation
- Hands-on
37 VLIW concept
[Figure: a VLIW architecture with 7 FUs, one wide instruction register with a field per function unit]
38 VLIW characteristics
- Multiple operations per instruction
- One instruction per cycle issued (at most)
- Compiler is in control
- Only RISC-like operation support
- Short cycle times
- Easier to compile for
- Flexible: can implement any FU mixture
- Extensible / Scalable
- However
- tight inter-FU connectivity required
- not binary compatible !!
- (new long instruction format)
- low code density
39 VelociTI C6x datapath
40 VLIW example: TMS320C62
- TMS320C62 VelociTI Processor
- 8 operations (of 32 bits) per instruction (256 bits)
- Two clusters
- 8 FUs: 4 FUs per cluster (2 multipliers, 6 ALUs)
- 2 x 16 registers
- One bus available to write into the register file of the other cluster
- Flexible addressing modes (like circular addressing)
- Flexible instruction packing
- All instructions conditional
- Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
- 128 KB on-chip RAM
41 VLIW example: Philips TriMedia TM1000
[Figure: TM1000 organization with PC, instruction cache (32 kB), instruction register with 5 issue slots feeding the exec units, data cache (16 kB), and a register file (128 regs, 32 bit, 15 ports)]
FU mix: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
42 Intel EPIC Architecture IA-64
- Explicitly Parallel Instruction Computing (EPIC)
- IA-64 architecture -> Itanium, first realization 2001
- Register model
- 128 64-bit integer registers (each with an extra NaT bit), stacked, rotating
- 128 82-bit floating point, rotating
- 64 1-bit boolean (predicate) registers
- 8 64-bit branch target address registers
- system control registers
- See http://en.wikipedia.org/wiki/Itanium
43 EPIC Architecture IA-64
- Instructions grouped in 128-bit bundles
- 3 x 41-bit instructions
- 5 template bits, indicating type and stop location (3 x 41 + 5 = 128; see the sketch after this list)
- Each 41-bit instruction
- starts with a 4-bit opcode, and
- ends with a 6-bit guard (boolean) register-id
- Supports speculative loads
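A small C sketch of this bundle layout (the exact bit positions are an assumption based on the 5 + 3 x 41 = 128 split above; slot 1 straddles the 64-bit word boundary):

  typedef struct {
      unsigned long long lo;   /* bundle bits  0..63  */
      unsigned long long hi;   /* bundle bits 64..127 */
  } Bundle;

  /* template field: bits 0..4 */
  unsigned bundle_template(Bundle b) {
      return (unsigned)(b.lo & 0x1F);
  }

  /* instruction slot i (0..2): 41 bits starting at bit 5 + 41*i */
  unsigned long long bundle_slot(Bundle b, int i) {
      const unsigned long long mask = (1ULL << 41) - 1;
      int start = 5 + 41 * i;
      if (start + 41 <= 64)                     /* slot 0: bits 5..45   */
          return (b.lo >> start) & mask;
      if (start >= 64)                          /* slot 2: bits 87..127 */
          return (b.hi >> (start - 64)) & mask;
      /* slot 1: bits 46..86, split across the two 64-bit halves */
      return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
  }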
44 Itanium
45 Itanium 2 (McKinley)
46 EPIC Architecture IA-64
- EPIC allows for more binary compatibility than a plain VLIW
- Function unit assignment performed at run-time
- Lock when FU results are not available
- See the other website for more info on IA-64
- www.ics.ele.tue.nl/heco/courses/ACA
- (look at the related material)
47 What are we talking about?
ILP = Instruction Level Parallelism: the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
48 VLIW evaluation
- Strong points of VLIW
- Scalable (add more FUs)
- Flexible (an FU can be almost anything, e.g. multimedia support)
- Weak points
- With N FUs (see the worked note below):
- Bypassing complexity: O(N²)
- Register file complexity: O(N)
- Register file size: O(N²)
- Register file design restricts FU flexibility
- Solution: .................................................. ?
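A quick worked consequence of the O(N²) terms above: doubling the number of FUs from N = 4 to N = 8 roughly quadruples both the bypass complexity and the register file size, which hurts the clock frequency and the cost of a monolithic VLIW.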
49 VLIW evaluation
50 Solution
Mirroring the Programming Paradigm
- TTA: Transport Triggered Architecture
51 Transport Triggered Architecture
General organization of a TTA
[Figure: CPU with instruction fetch and decode units, FU-1 .. FU-5 and a register file connected by a bypassing network; instruction memory and data memory attached]
52 TTA structure: datapath details
[Figure: TTA datapath detail with units connected to the transport buses through sockets; data memory and instruction memory]
53 TTA hardware characteristics
- Modular: building blocks easy to reuse
- Very flexible and scalable
- easy inclusion of Special Function Units (SFUs)
- Very low complexity
- > 50% reduction in register ports
- reduced bypass complexity (no associative matching)
- up to 80% reduction in bypass connectivity
- trivial decoding
- reduced register pressure
- easy register file partitioning (a single port is enough!)
54 TTA software characteristics
That does not look like an improvement!?!
  r1 → add.o1; r2 → add.o2; add.r → r3
[Figure: add FU with operand ports o1 and o2 and result port r]
- More difficult to schedule!
- But extra scheduling optimizations become possible
55 Program TTAs
- How to do data operations?
- 1. Transport of operands to the FU
- Operand move(s)
- Trigger move
- 2. Transport of results from the FU
- Result move(s)

Example: add r3,r1,r2 becomes
  r1 → Oint     // operand move to integer unit
  r2 → Tadd     // trigger move to integer unit
  ...           // addition operation in progress
  Rint → r3     // result move from integer unit

How to do control flow?
  1. Jump:   jump-address → pc
  2. Branch: displacement → pcd
  3. Call:   pc → r; call-address → pcd
56 Scheduling example
[Figure: scheduling example on a TTA datapath with two integer ALUs, a load/store unit, an integer RF and an immediate unit]
57 TTA Instruction format
General MOVE field: g = guard specifier, i = immediate specifier, src = source, dst = destination
58 Programming TTAs
- How to do conditional execution?
- Each move is guarded
- Example
- r1 → cmp.o1   // operand move to compare unit
- r2 → cmp.o2   // trigger move to compare unit
- cmp.r → g     // put result in boolean register g
- g: r3 → r4    // guarded move takes place when r1 == r2
59 Register file port pressure for TTAs
60 Summary of TTA Advantages
- Better usage of transport capacity
- Instead of 3 transports per dyadic operation, about 2 are needed
- Register ports reduced by at least 50%
- Inter-FU connectivity reduced by 50-70%
- No full connectivity required
- Both the transport capacity and the register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
- Flexible: FUs can incorporate arbitrary functionality
- Scalable: FUs, reg. files, etc. can be changed
- FU splitting results in extra exploitable concurrency
- TTAs are easy to design and can have short cycle times
61 TTA automatic DSE (Design Space Exploration)
[Figure: Move framework toolflow in which an optimizer searches the architecture parameters, with feedback from a parametric compiler and a hardware generator, producing parallel object code and a chip; user interaction steers the exploration]
62 Overview
- Enhance performance architecture methods
- Instruction Level Parallelism
- VLIW
- Examples
- C6
- TM
- TTA
- Clustering and Reconfigurable components
- Code generation
- Hands-on
63 Clustered VLIW
- Clustering: splitting up the VLIW data path; the same can be done for the instruction path
64 Clustered VLIW
- Why clustering?
- Timing: faster clock
- Lower cost
- silicon area
- T2M (time-to-market)
- Lower energy
- What's the disadvantage?
65 Fine-grained reconfigurable: Xilinx XC4000 FPGA
Programmable Interconnect
I/O Blocks (IOBs)
Configurable Logic Blocks (CLBs)
66 Coarse-grained reconfigurable: Chameleon CS2000
- Highlights
- 32-bit datapath (ALU/Shift)
- 16x24 Multiplier
- distributed local memory
- fixed timing
67 Recent Coarse-Grain Reconfigurable Architectures
- SmartCell (2009)
- read http://www.hindawi.com/journals/es/2009/518659.html
- RAPID
- NIOS II
- RAW
- PicoChip
- PACT XPP64
- many more ...
68 Hybrid FPGAs: Virtex-II Pro
GHz I/O: up to 16 serial transceivers
Memory blocks
PowerPC
Reconfigurable logic blocks
69 HW or SW reconfigurable?
[Figure: design space plotted as reconfiguration time (1 cycle, loop buffer, context, reset) versus data path granularity (fine .. coarse), with sub-word parallelism as one of the points]
70 Granularity Makes a Difference