Transcript and Presenter's Notes

Title: Platform-based Design


1
Platform-based Design
Exploiting ILP: VLIW architectures
  • TU/e 5kk70
  • Henk Corporaal
  • Bart Mesman

2
What are we talking about?
ILP = Instruction Level Parallelism: the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
3
VLIW Topics Overview
  • Enhance performance
  • What options do you have?
  • Instruction Level Parallelism
  • Limits on ILP
  • VLIW
  • Examples
  • Clustering
  • Code generation
  • Hands-on

4
Enhance performance: 4 architecture methods
  • (Super)-pipelining
  • Powerful instructions
  • MD-technique: multiple data operands per operation
  • MO-technique: multiple operations per instruction
  • Multiple instruction issue

5
Architecture methods: Pipelined Execution of Instructions
IF = Instruction Fetch, DC = Instruction Decode, RF = Register Fetch, EX = Execute instruction, WB = Write Result Register
[Figure: simple 5-stage pipeline; four instructions overlapped across cycles 1-8]
  • Purpose of pipelining:
  • Reduce the number of gate levels in the critical path
  • Reduce CPI close to one (instead of a large number for the multicycle machine)
  • More efficient hardware
  • Problems: hazards cause pipeline stalls
  • Structural hazards: add more hardware
  • Control hazards (branch penalties): use branch prediction
  • Data hazards: bypassing required (an illustrative CPI calculation is sketched below)
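As an illustrative calculation (the numbers are assumed, not from the slides): with an ideal CPI of 1, a 20% branch frequency, a 2-cycle misprediction penalty, and 10% of branches mispredicted, the effective CPI becomes
\[
\text{CPI} = 1 + f_{branch} \cdot p_{miss} \cdot c_{penalty} = 1 + 0.20 \cdot 0.10 \cdot 2 = 1.04
\]
so pipelining keeps the CPI close to one as long as hazards are handled well.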

6
Architecture methods: Pipelined Execution of Instructions
  • Superpipelining
  • Split one or more of the critical pipeline stages
  • Superpipelining degree S:

\[
S(\text{architecture}) = \sum_{Op \in I_{set}} f(Op) \cdot lt(Op)
\]
where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op
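A small worked example (the operation mix and latencies are assumed for illustration, not taken from the slides): with 70% single-cycle ALU operations, 20% loads with latency 2, and 10% multiplies with latency 3,
\[
S = 0.7 \cdot 1 + 0.2 \cdot 2 + 0.1 \cdot 3 = 1.4
\]
i.e. this instruction mix gives a superpipelining degree of 1.4.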
7
Architecture methods: Powerful Instructions (1)
  • MD-technique
  • Multiple data operands per operation
  • SIMD: Single Instruction Multiple Data

Vector instruction:
  for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];
or
  c = a + 5*b
Assembly:
  set vl,64
  ldv v1,0(r2)
  mulvi v2,v1,5
  ldv v1,0(r1)
  addv v3,v1,v2
  stv v3,0(r3)
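When the array length exceeds the maximum vector length, the compiler typically strip-mines the loop into chunks. A minimal C sketch of the idea (VL, the function name, and the element type are illustrative, not from the slides):

  #define VL 64                      /* assumed maximum vector length */

  /* c[i] = a[i] + 5*b[i] for arbitrary n, processed in strips of VL */
  void add_scaled(int n, const int *a, const int *b, int *c) {
      for (int start = 0; start < n; start += VL) {
          int len = (n - start < VL) ? (n - start) : VL;   /* last strip may be partial */
          for (int i = 0; i < len; i++)                    /* the vectorizable body */
              c[start + i] = a[start + i] + 5 * b[start + i];
      }
  }

Each inner strip then maps onto one setting of the vector length register (set vl,len) followed by the vector instructions above.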
8
Architecture methods: Powerful Instructions (1)
  • SIMD computing
  • Nodes used for independent operations
  • Mesh or hypercube connectivity
  • Exploit data locality of e.g. image processing
    applications
  • Dense encoding (few instruction bits needed)

9
Architecture methods: Powerful Instructions (1)
  • Sub-word parallelism
  • SIMD on a restricted scale
  • Used for multimedia instructions
  • Examples: MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, TriMedia II
  • Example operation: \(\sum_{i=1..4} |a_i - b_i|\) (a C sketch follows below)
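A scalar C sketch of this sum-of-absolute-differences kernel (plain C for clarity; a sub-word SIMD instruction would compute all four differences in a single operation on packed 16-bit lanes):

  #include <stdlib.h>   /* abs() */

  /* Sum of absolute differences over four 16-bit elements. */
  int sad4(const short a[4], const short b[4]) {
      int sum = 0;
      for (int i = 0; i < 4; i++)
          sum += abs(a[i] - b[i]);   /* one sub-word lane per iteration */
      return sum;
  }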

10
Architecture methods: Powerful Instructions (2)
  • MO-technique: multiple operations per instruction
  • Two options:
  • CISC (Complex Instruction Set Computer)
  • VLIW (Very Long Instruction Word)

VLIW instruction example (one instruction, one field per functional unit):
  FU 1: sub r8, r5, 3
  FU 2: and r1, r5, 12
  FU 3: mul r6, r5, r2
  FU 4: ld r3, 0(r5)
  FU 5: bnez r5, 13
11
VLIW architecture: central Register File
[Figure: nine exec units, grouped into three issue slots, all reading from and writing to one central register file]
Q: How many ports does the register file need for n-issue?
12
TriMedia TM32A processor
0.18 micron process, area 16.9 mm2, 200 MHz (typ), 1.4 W, i.e. 7 mW/MHz (a MIPS core: 0.9 mW/MHz)
13
Architecture methods: Powerful Instructions (2)
VLIW Characteristics
  • Only RISC-like operation support
  • Short cycle times
  • Flexible: can implement any FU mixture
  • Extensible
  • Tight inter-FU connectivity required
  • Large instructions (up to 1000 bits)
  • Not binary compatible !!!
  • But good compilers exist

14
Architecture methods: Multiple instruction issue (per cycle)
  • Who guarantees semantic correctness, i.e. which instructions can be executed in parallel?
  • User: specifies multiple instruction streams
  • Multi-processor: MIMD (Multiple Instruction Multiple Data)
  • HW: run-time detection of ready instructions
  • Superscalar
  • Compiler: compiles into a dataflow representation
  • Dataflow processors

15
Multiple instruction issue: Three Approaches
Example code:
  a := b + 15
  c := 3.14 * d
  e := c / f
Translation to a DDG (Data Dependence Graph):
[Figure: DDG with three chains: ld b, add 15, st a; ld d, mul 3.14, st c; ld f, then div c/f, st e]
16
  • Generated Code

  Instr.  Sequential Code       Dataflow Code
  I1      ld   r1,M(b)          ld   M(b)    -> I2
  I2      addi r1,r1,15         addi 15      -> I3
  I3      st   r1,M(a)          st   M(a)
  I4      ld   r1,M(d)          ld   M(d)    -> I5
  I5      muli r1,r1,3.14       muli 3.14    -> I6, I8
  I6      st   r1,M(c)          st   M(c)
  I7      ld   r2,M(f)          ld   M(f)    -> I8
  I8      div  r1,r1,r2         div          -> I9
  I9      st   r1,M(e)          st   M(e)
  • Notes:
  • An MIMD may execute two streams: (1) I1-I3, (2) I4-I9
  • There are no dependencies between the streams; in practice, communication and synchronization are required between streams
  • A superscalar issues multiple instructions from the sequential stream
  • It must obey dependencies (true and name dependencies)
  • Reverse engineering of the DDG is needed at run-time
  • Dataflow code is a direct representation of the DDG

17
Multiple Instruction Issue: Dataflow processor
[Figure: dataflow pipeline with token matching, token store, instruction generate, instruction store, result tokens, and reservation stations]
18
Instruction Pipeline Overview
[Figure: instruction pipelines compared for CISC, RISC, superscalar, superpipelined, dataflow, and VLIW]
19
Four-dimensional representation of the architecture design space <I, O, D, S>
20
Architecture design space
Typical values of K (= number of functional units or processor nodes) and <I, O, D, S> for different architectures
\[
S(\text{architecture}) = \sum_{Op \in I_{set}} f(Op) \cdot lt(Op)
\]
\[
M_{par} = I \cdot O \cdot D \cdot S
\]
21
Overview
  • Enhance performance: architecture methods
  • Instruction Level Parallelism
  • limits on ILP
  • VLIW
  • Examples
  • Clustering
  • Code generation
  • Hands-on

22
General organization of an ILP architecture
23
Motivation for ILP
  • Increasing VLSI densities; decreasing feature size
  • Increasing performance requirements
  • New application areas, like:
  • multi-media (image, audio, video, 3-D)
  • intelligent search and filtering engines
  • neural, fuzzy, genetic computing
  • More functionality
  • Use of existing code (compatibility)
  • Low power: P = α·f·C·Vdd²

24
Low power through parallelism
  • Sequential processor:
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • P = α·f·C·V²
  • Parallel processor (two times the number of units):
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V' < V
  • P = α·(f/2)·2C·V'² < α·f·C·V² (a numeric illustration follows below)
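As a numeric illustration (the voltage-scaling factor is assumed, not taken from the slides): if halving the frequency lets the supply voltage drop from V = 1.0 V to V' = 0.7 V, then
\[
\frac{P_{par}}{P_{seq}} = \frac{\alpha\,(f/2)\,(2C)\,V'^2}{\alpha\,f\,C\,V^2} = \left(\frac{V'}{V}\right)^2 = 0.49
\]
i.e. roughly half the power at (ideally) the same throughput.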

25
Measuring and exploiting available ILP
  • How much ILP is there in applications?
  • How to measure the parallelism within applications?
  • Using an existing compiler
  • Using trace analysis
  • Track all real data dependencies (RaWs) of instructions in the issue window
  • register dependences
  • memory dependences
  • Check for correct branch prediction
  • if the prediction is correct: continue
  • if wrong: flush the schedule and start in the next cycle

26
Trace analysis

Program:
  for i := 0..2
    A[i] := i;
  S := X + 3;

Compiled code:
        set r1,0
        set r2,3
        set r3,A
  Loop: st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop
        add r1,r5,3

Trace (sequential execution):
  set r1,0; set r2,3; set r3,A
  st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop
  st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop
  st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop
  add r1,r5,3

How parallel can this code be executed?
27
Trace analysis

Parallel trace (operations on the same line can issue in the same cycle):
  set r1,0        set r2,3        set r3,A
  st r1,0(r3)     add r1,r1,1     add r3,r3,4
  st r1,0(r3)     add r1,r1,1     add r3,r3,4     brne r1,r2,Loop
  st r1,0(r3)     add r1,r1,1     add r3,r3,4     brne r1,r2,Loop
  brne r1,r2,Loop
  add r1,r5,3

Max ILP = Speedup = L_serial / L_parallel = 16 / 6 = 2.7
(a small sketch of this kind of trace scheduling follows below)
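A minimal C sketch of how such a bound can be computed from a trace: place every operation in the earliest cycle after all operations it truly depends on (RaW only), then divide the trace length by the schedule length. The dependence matrix below is a hand-filled toy example, not a real register/memory analyzer:

  #include <stdio.h>

  #define N_OPS 6

  int main(void) {
      /* deps[i][j] = 1 if op i must wait for the result of op j
         (two independent three-op chains in this toy trace)      */
      int deps[N_OPS][N_OPS] = {0};
      deps[1][0] = 1;  deps[2][1] = 1;   /* chain 0 -> 1 -> 2 */
      deps[4][3] = 1;  deps[5][4] = 1;   /* chain 3 -> 4 -> 5 */

      int cycle[N_OPS], length = 0;
      for (int i = 0; i < N_OPS; i++) {
          cycle[i] = 0;
          for (int j = 0; j < i; j++)
              if (deps[i][j] && cycle[j] + 1 > cycle[i])
                  cycle[i] = cycle[j] + 1;       /* earliest legal cycle */
          if (cycle[i] + 1 > length)
              length = cycle[i] + 1;             /* schedule length so far */
      }
      printf("serial %d, parallel %d, max ILP %.2f\n",
             N_OPS, length, (double)N_OPS / length);
      return 0;
  }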
28
Ideal Processor
  • Assumptions for the ideal/perfect processor:
  • 1. Register renaming: infinite number of virtual registers => all register WAW and WAR hazards avoided
  • 2. Branch and jump prediction: perfect => all program instructions are available for execution
  • 3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal
  • Also:
  • unlimited number of instructions issued per cycle (unlimited resources), and
  • unlimited instruction window
  • perfect caches
  • 1-cycle latency for all instructions (including FP * and /)
  • Programs were compiled with the MIPS compiler at the maximum optimization level

29
Upper Limit to ILP: Ideal Processor
[Chart: IPC on the ideal processor; integer programs 18-60, FP programs 75-150]
30
Window Size and Branch Impact
  • Change from an infinite window: examine 2000 instructions and issue at most 64 instructions per cycle

[Chart: IPC with FP 15-45 and integer 6-12, for perfect, tournament, BHT(512), profile, and no branch prediction]
31
Limiting the Number of Renaming Registers
  • Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor (slightly better than the tournament predictor)

[Chart: IPC with FP 11-45 and integer 5-15, for infinite, 256, 128, 64, and 32 renaming registers]
32
Memory Address Alias Impact
  • Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor, 256 renaming registers

[Chart: IPC with FP 4-45 (Fortran, no heap) and integer 4-9, for perfect, global/stack perfect, inspection, and no alias analysis]
33
Reducing Window Size
  • Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as the window holds

[Chart: IPC with FP 8-45 and integer 6-12, for window sizes infinite, 256, 128, 64, 32, 16, 8, and 4]
34
How to Exceed the ILP Limits of This Study?
  • WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory
  • Unnecessary dependences:
  • the compiler did not unroll loops, so an iteration-variable dependence remains (see the unrolling sketch below)
  • Overcoming the dataflow limit: value prediction, i.e. predicting values and speculating on the prediction
  • Address value prediction and speculation: predict addresses and speculate by reordering loads and stores; could provide better aliasing analysis
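A minimal C sketch of loop unrolling as a compiler would apply it, removing most of the loop-variable updates and exposing independent operations to the scheduler (the unroll factor of 4 and the assumption that n is a multiple of 4 are illustrative):

  /* original loop: one increment and one branch per element */
  void scale(int n, int *a) {
      for (int i = 0; i < n; i++)
          a[i] = 2 * a[i];
  }

  /* unrolled by 4: one increment and one branch per four elements */
  void scale_unrolled(int n, int *a) {   /* assumes n % 4 == 0 */
      for (int i = 0; i < n; i += 4) {
          a[i]     = 2 * a[i];
          a[i + 1] = 2 * a[i + 1];
          a[i + 2] = 2 * a[i + 2];
          a[i + 3] = 2 * a[i + 3];
      }
  }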

35
Conclusions
  • The amount of parallelism is limited
  • higher in multi-media and signal-processing applications
  • higher in kernels
  • Trace analysis detects all types of parallelism
  • task, data, and operation types
  • The detected parallelism depends on
  • the quality of the compiler
  • the hardware
  • source-code transformations

36
Overview
  • Enhance performance: architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • IA-64 Itanium, ....
  • TTA
  • Clustering
  • Code generation
  • Hands-on

37
VLIW concept
A VLIW architecture with 7 FUs
[Figure: one long instruction register with one field per function unit]
38
VLIW characteristics
  • Multiple operations per instruction
  • One instruction issued per cycle (at most)
  • Compiler is in control
  • Only RISC-like operation support
  • Short cycle times
  • Easier to compile for
  • Flexible: can implement any FU mixture
  • Extensible / scalable
  • However:
  • tight inter-FU connectivity required
  • not binary compatible !! (new long instruction format)
  • low code density

39
VelociTI C6x datapath
40
VLIW example: TMS320C62
  • TMS320C62 VelociTI Processor
  • 8 operations (of 32 bits each) per instruction (256 bits)
  • Two clusters
  • 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)
  • 2 x 16 registers
  • One bus available to write into the register file of the other cluster
  • Flexible addressing modes (like circular addressing)
  • Flexible instruction packing
  • All instructions conditional
  • Originally 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
  • 128 KB on-chip RAM

41
VLIW example: Philips TriMedia TM1000
[Figure: PC, instruction cache (32 kB), instruction register with 5 issue slots, data cache (16 kB), exec units]
  • Register file: 128 registers, 32 bit, 15 ports
  • Function units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
42
Intel EPIC Architecture IA-64
  • Explicitly Parallel Instruction Computing (EPIC)
  • IA-64 architecture -> Itanium, first realization 2001
  • Register model:
  • 128 64-bit integer registers (plus 1 NaT bit each), register stack, rotating
  • 128 82-bit floating point registers, rotating
  • 64 1-bit boolean (predicate) registers
  • 8 64-bit branch target address registers
  • system control registers
  • See http://en.wikipedia.org/wiki/Itanium

43
EPIC Architecture IA-64
  • Instructions are grouped in 128-bit bundles:
  • 3 x 41-bit instructions
  • 5 template bits, indicating the instruction types and stop locations
  • Each 41-bit instruction:
  • starts with a 4-bit opcode, and
  • ends with a 6-bit guard (boolean) register-id
  • Supports speculative loads
(a field-extraction sketch follows below)
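A small C sketch of unpacking the 128-bit bundle layout described above (the bit ordering used here, template in the 5 least-significant bits followed by three 41-bit slots, is an assumption for illustration):

  #include <stdint.h>

  /* One 128-bit bundle held as two 64-bit halves (lo = least significant). */
  typedef struct { uint64_t lo, hi; } bundle_t;

  unsigned bundle_template(bundle_t b) {
      return (unsigned)(b.lo & 0x1F);            /* 5 template bits */
  }

  uint64_t bundle_slot(bundle_t b, int i) {      /* i = 0, 1, 2 */
      unsigned shift = 5 + 41 * i;               /* bit where slot i starts */
      uint64_t mask  = (1ULL << 41) - 1;
      if (shift + 41 <= 64)                      /* slot lies fully in lo */
          return (b.lo >> shift) & mask;
      if (shift >= 64)                           /* slot lies fully in hi */
          return (b.hi >> (shift - 64)) & mask;
      /* slot straddles the 64-bit boundary */
      return ((b.lo >> shift) | (b.hi << (64 - shift))) & mask;
  }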

44
Itanium
45
Itanium 2 McKinley
46
EPIC Architecture IA-64
  • EPIC allows for more binary compatibility than a plain VLIW
  • Function unit assignment is performed at run-time
  • Lock (stall) when FU results are not yet available
  • See this website for more info on IA-64:
  • www.ics.ele.tue.nl/heco/courses/ACA
  • (look at the related material)

47
What are we talking about?
ILP = Instruction Level Parallelism: the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
48
VLIW evaluation
  • Strong points of VLIW:
  • Scalable (add more FUs)
  • Flexible (an FU can be almost anything, e.g. multimedia support)
  • Weak points, with N FUs:
  • Bypassing complexity: O(N²)
  • Register file complexity: O(N)
  • Register file size: O(N²)
  • The register file design restricts FU flexibility
  • Solution: .................................................. ?

49
VLIW evaluation
50
Solution:
Mirroring the Programming Paradigm
  • TTA: Transport Triggered Architecture

[Figure: small expression graph (arithmetic, compare, and store operations) mapped onto explicit data transports]
51
Transport Triggered Architecture
General organization of a TTA
[Figure: CPU with FU-1 ... FU-5 and a register file connected by a bypassing network; instruction fetch and decode units, instruction memory, and data memory]
52
TTA structure: datapath details
[Figure: function units and register file attached to the transport buses through sockets, with instruction memory and data memory]
53
TTA hardware characteristics
  • Modular: building blocks are easy to reuse
  • Very flexible and scalable
  • easy inclusion of Special Function Units (SFUs)
  • Very low complexity
  • > 50% reduction in register ports
  • reduced bypass complexity (no associative matching)
  • up to 80% reduction in bypass connectivity
  • trivial decoding
  • reduced register pressure
  • easy register file partitioning (a single port is enough!)

54
TTA software characteristics
That does not look like an improvement!?

  r1 -> add.o1;  r2 -> add.o2;  add.r -> r3

  • More difficult to schedule!
  • But: extra scheduling optimizations become possible
55
Programming TTAs
  • How to do data operations?
  • 1. Transport of operands to the FU
  • Operand move(s)
  • Trigger move
  • 2. Transport of results from the FU
  • Result move(s)

Example: add r3,r1,r2 becomes
  r1 -> Oint    // operand move to integer unit
  r2 -> Tadd    // trigger move to integer unit
  ...           // addition operation in progress
  Rint -> r3    // result move from integer unit

How to do control flow?
  1. Jumps:  jump-address -> pc
  2. Branch: displacement -> pcd
  3. Call:   pc -> r; call-address -> pcd
56
Scheduling example
[Figure: moves scheduled onto transport buses connecting two integer ALUs, a load/store unit, an integer RF, and an immediate unit]
57
TTA Instruction format
General MOVE field: g = guard specifier, i = immediate specifier, src = source, dst = destination
58
Programming TTAs
  • How to do conditional execution?
  • Each move is guarded
  • Example (a C reading of this sequence follows below):
  • r1 -> cmp.o1   // operand move to compare unit
  • r2 -> cmp.o2   // trigger move to compare unit
  • cmp.r -> g     // put the result in boolean register g
  • g: r3 -> r4    // guarded move takes place when r1 == r2
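A minimal C sketch of what the guarded move expresses (the function and variable names are illustrative): the guard decides whether the transported value is committed, without a branch.

  /* r4 receives r3 only when the guard (r1 == r2) is true. */
  int guarded_move(int r1, int r2, int r3, int r4) {
      int g = (r1 == r2);      /* cmp.r -> g   */
      return g ? r3 : r4;      /* g: r3 -> r4  */
  }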

59
Register file port pressure for TTAs
60
Summary of TTA Advantages
  • Better usage of the transport capacity
  • Instead of 3 transports per dyadic operation, about 2 are needed
  • register ports reduced by at least 50%
  • Inter-FU connectivity reduced by 50-70%
  • No full connectivity required
  • Both the transport capacity and the register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
  • Flexible: FUs can incorporate arbitrary functionality
  • Scalable: FUs, register files, etc. can be changed
  • FU splitting results in extra exploitable concurrency
  • TTAs are easy to design and can have short cycle times

61
TTA automatic DSE (design space exploration)
[Figure: exploration loop (Move framework) with user interaction: an optimizer proposes architecture parameters, a parametric compiler produces parallel object code and feedback, and a hardware generator produces the chip]
62
Overview
  • Enhance performance: architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering and Reconfigurable components
  • Code generation
  • Hands-on

63
Clustered VLIW
  • Clustering: splitting up the VLIW data path; the same can be done for the instruction path

64
Clustered VLIW
  • Why clustering?
  • Timing: faster clock
  • Lower cost
  • silicon area
  • time-to-market (T2M)
  • Lower energy
  • What's the disadvantage?

65
Fine-grained reconfigurable: Xilinx XC4000 FPGA
[Figure: programmable interconnect, I/O blocks (IOBs), and configurable logic blocks (CLBs)]
66
Coarse-grained reconfigurable: Chameleon CS2000
  • Highlights:
  • 32-bit datapath (ALU/shift)
  • 16x24-bit multiplier
  • distributed local memory
  • fixed timing

67
Recent Coarse Grain Reconfigurable Architectures
  • SmartCell (2009)
  • read http://www.hindawi.com/journals/es/2009/518659.html
  • Montium (reconfigurable VLIW)
  • RAPID
  • NIOS II
  • RAW
  • PicoChip
  • PACT XPP64
  • many more ...

68
Hybrid FPGAs: Virtex II-Pro
[Figure: GHz I/O (up to 16 serial transceivers), memory blocks, PowerPC cores, and reconfigurable logic blocks]
69
HW or SW reconfigurable?
[Figure: reconfiguration time (from 1 cycle, via context and loop buffer, up to reset) plotted against data path granularity (fine, with subword parallelism, to coarse)]
70
Granularity Makes Differences