Title: Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths
1Exploiting Operation Level Parallelism Through
Dynamically Reconfigurable Datapaths
- Zhining Huang, Sharad Malik
- Department of Electrical Engineering
- Princeton University
2Dynamically Reconfigurable Datapaths
- Speed up kernel loops using reconfigurable hardware
[Figure: an application made up of kernel loops (Loop 1, Loop 2, Loop 3) and trivial code, alongside a µP and a reconfigurable datapath]
3Outline
- Application specific programmable platforms
- Methodology overview and architectural model
- Datapath design for kernel loops
- Direct Mapping, Pipelining
- Reconfigurable datapath design
- Case studies
- GSM, MPEG II
- Conclusion
4Application Specific Programmable Platforms
- Why programmable platforms?
- Design cost, time to market
- Different programmable platforms
- Bit level: FPGA based
- Word level: specialized VLIW, coarse grained reconfigurable coprocessors
- Thread level: multiple PEs with on-chip communication networks
5Application Specific Programmable Platforms
(contd.)
[Figure: design space with axes labeled Flexibility and Performance, Power]
- Goal: approach the flexibility of GPPs with the efficiency of ASICs
- Part of the MESCAL project
- Modern Embedded Systems, Compilers, Architectures and Languages
- A disciplined effort for application specific programmable platform development
6Related Research
- Various reconfigurable coprocessors
- Garp [Hauser97], PipeRench [Goldstein99], Pleiades [Wan00]
- Chameleon Systems, Morphics Technology
- General reconfigurable fabrics compiler
- Hardware resource, routing, compiler
- Our approach
- Design automation of the application specific reconfigurable fabrics
- Coarse grained dynamically reconfigurable logic
7Architectural Model
- RISC + coarse grained reconfigurable datapath
- Fixed function units
- Reconfigurable interconnections
[Figure: reconfigurable datapath]
8Methodology Overview
- Designing the application specific reconfigurable datapath.
[Flow diagram: Front End Compilation (Profiling, DA, etc.); Kernel Loop Extraction; Direct Mapping; Hardware Constraint; Performance Estimation; Datapaths; Mapping Algorithm; Reconfigurable Datapath]
9Mapping Kernel Loops from C to Hardware
- Generating a datapath for each kernel loop.
[Flow diagram: IR code after front end compilation; Mapping within basic blocks; Branch merging; Intra-iteration Scheduling; Datapath with maximum operation level parallelism; Inter-iteration Scheduling; Critical Path Detection; Datapath with high computation throughput; FU merging]
10Direct Mapping
- Direct mapping from IR to hardware
- One instruction to one function unit
Cb5
    ld  r1, r2, r12
    ld  r13, r9, r12
    ld  r14, r10, r12
    add r5, r1, 1
    add r11, r5, r13
    lsr r3, r11, 1
    sub r6, r3, r14
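The one-instruction-to-one-function-unit rule can be illustrated with a small sketch. It is not the authors' tool: the tuple encoding of the IR, the function name `direct_map`, and the edge format are assumptions made for illustration. Each instruction becomes one FU, and each register read is wired to the FU that last wrote that register.

```python
# Hypothetical encoding of the Cb5 block: (opcode, dest, src1, src2).
KERNEL = [
    ("ld",  "r1",  "r2",  "r12"),
    ("ld",  "r13", "r9",  "r12"),
    ("ld",  "r14", "r10", "r12"),
    ("add", "r5",  "r1",  1),
    ("add", "r11", "r5",  "r13"),
    ("lsr", "r3",  "r11", 1),
    ("sub", "r6",  "r3",  "r14"),
]


def direct_map(instrs):
    """Map each instruction to one FU and wire producers to consumers.

    Returns (fus, edges). fus[i] is the opcode of FU i; each edge is
    (producer FU, consumer FU, consumer input port).
    """
    fus = []
    edges = []
    last_def = {}  # register name -> index of the FU that last wrote it
    for idx, (op, dest, *srcs) in enumerate(instrs):
        fus.append(op)
        for port, src in enumerate(srcs):
            # Registers never written inside the block (r2, r9, r10, r12)
            # and immediates are live-in values, so they get no edge.
            if src in last_def:
                edges.append((last_def[src], idx, port))
        last_def[dest] = idx
    return fus, edges


fus, edges = direct_map(KERNEL)
for producer, consumer, port in edges:
    print(f"{fus[producer]}#{producer} -> {fus[consumer]}#{consumer} (input {port})")
```

Registers that are only read in the block end up as inputs of the datapath rather than as wires between function units.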
11Direct Mapping (contd.)
- Branch condition transforms
Cb5
    blt r6, 0, cb7
Cb6
    add r19, r19, r6
    jump Cb8
Cb7
    sub r19, r19, r6
Cb8
[Figure: the branch replaced by function units labeled cmp, -, and mux]
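A minimal sketch of the branch condition transform, assuming the usual if-conversion reading of the slide: both arms of the branch are computed unconditionally and a comparator drives a multiplexer that picks the result. The function names are illustrative only.

```python
def branch_version(r19, r6):
    """Control-flow form of Cb5..Cb8: blt selects one of two blocks."""
    if r6 < 0:           # blt r6, 0, cb7
        return r19 - r6  # Cb7: sub r19, r19, r6
    return r19 + r6      # Cb6: add r19, r19, r6


def mux_version(r19, r6):
    """Datapath form: both FUs always compute, cmp drives the mux select."""
    add_out = r19 + r6   # adder FU
    sub_out = r19 - r6   # subtractor FU
    select = r6 < 0      # cmp FU
    return sub_out if select else add_out  # mux


# The two forms agree on every input tried here.
for r19 in (-7, 0, 12):
    for r6 in (-3, 0, 5):
        assert branch_version(r19, r6) == mux_version(r19, r6)
```

The datapath form spends an extra function unit but removes the control dependence, which is what lets the whole loop body be laid out as a single pipeline.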
12Intra-iteration Scheduling
- Schedule FUs into different pipe stages
Cb5
    ld  r1, r2, r12
    ld  r13, r9, r12
    ld  r14, r10, r12
    add r5, r1, 1
    add r11, r5, r13
    lsr r3, r11, 1
    sub r6, r3, r14
    blt r6, 0, cb7
Cb6
    add r19, r19, r6
    jump Cb8
Cb7
    sub r19, r19, r6
Cb8
[Figure: function units (labels include SH, -, cmp, -) placed into pipe stages]
Kernel loop code from GSM
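One simple way to place function units into pipe stages is as-soon-as-possible (ASAP) scheduling. The sketch below is an assumption, not the scheduler used in the talk: it gives every FU exactly one pipe stage and does not chain FUs within a stage. The FU names and the dependence table are a hand-written encoding of the GSM kernel after the branch has been turned into cmp and mux.

```python
# FUs listed producers-first; names are hypothetical labels for the kernel above.
FUS = ["ld1", "ld13", "ld14", "add5", "add11", "lsr3", "sub6",
       "cmp", "add19", "sub19", "mux"]

# Consumer FU -> producer FUs it reads from.
DEPS = {
    "add5":  ["ld1"],
    "add11": ["add5", "ld13"],
    "lsr3":  ["add11"],
    "sub6":  ["lsr3", "ld14"],
    "cmp":   ["sub6"],
    "add19": ["sub6"],
    "sub19": ["sub6"],
    "mux":   ["cmp", "add19", "sub19"],
}


def asap_stages(fus, deps):
    """Give each FU the earliest stage that follows all of its producers."""
    stage = {}
    for fu in fus:
        producers = deps.get(fu, [])
        stage[fu] = 1 + max((stage[p] for p in producers), default=0)
    return stage


stages = asap_stages(FUS, DEPS)
for fu in FUS:
    print(f"p{stages[fu]}: {fu}")
```

Under the one-FU-per-stage assumption the three loads share the first stage and the mux lands in the last one; a real design could pack several short operations into one stage if the clock period allows.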
13Inter-iteration Scheduling
- Pipelining the execution of loop iterations
- Determine the Initiation Interval (II) of a loop datapath
- If no data dependence:
- II = 1 (single copy datapath)
- II = 0 (multiple copies of datapaths)
[Figure: pipelined datapath with stages p1 to p5 and function units <<, x, -]
14Inter-iteration Scheduling (contd.)
- Data dependence from FU i to FU j across loop iterations
- Feedback connection
- II = PipeStage(i) - PipeStage(j) + FU_Delay(j), if II > 0
- Example: II = 5 - 1 + 1 = 5
- Fetch a new loop iteration every 5 cycles
[Figure: pipelined datapath with stages p1 to p5 and function units <<, x, -]
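The feedback rule can be written as a one-line function. This is a sketch under the reading II = PipeStage(i) - PipeStage(j) + FU_Delay(j); the function name and the choice to reject non-positive results are assumptions, since the slide only defines the positive case.

```python
def ii_feedback(stage_i, stage_j, fu_delay_j):
    """Initiation interval forced by a feedback edge from FU i to FU j.

    stage_i: pipe stage of the producing FU i
    stage_j: pipe stage of the consuming FU j in the next iteration
    fu_delay_j: delay of FU j in cycles
    """
    ii = stage_i - stage_j + fu_delay_j
    if ii <= 0:
        # The slide states the formula only for II > 0.
        raise ValueError("formula is only defined for a positive II")
    return ii


# Slide example: producer in stage 5, consumer in stage 1, unit delay.
print(ii_feedback(5, 1, 1))
```

With a producer in the last stage and a consumer in the first, the next iteration cannot start until the previous one has almost drained, which is why the example loop fetches a new iteration only every 5 cycles.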
15Inter-iteration Scheduling (contd.)
- Data dependence on memory access
- No feedback connections needed
- II = ⌈(PipeStage(i) - PipeStage(j) + 1) / k⌉
- k: distance of dependent iterations, from data dependence analysis
[Figure: LD and ST units placed in pipeline stages p1 to p5; the cases shown are labeled II = 4, II = 0, and II = 4]
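A short sketch of the memory-dependence rule, assuming the garbled marks around the formula were ceiling brackets, so that II = ceil((PipeStage(i) - PipeStage(j) + 1) / k). Clamping negative results to 0 is an additional assumption, made here so that a dependence which is already satisfied imposes no constraint.

```python
import math


def ii_memory(stage_i, stage_j, k):
    """Initiation interval forced by a memory dependence of distance k.

    stage_i: pipe stage of the access that must complete first
    stage_j: pipe stage of the dependent access, k iterations later
    k: dependence distance in iterations (k >= 1)
    """
    if k < 1:
        raise ValueError("dependence distance k must be at least 1")
    cycles = stage_i - stage_j + 1
    # Assumption: a non-positive requirement means no constraint (II = 0).
    return max(0, math.ceil(cycles / k))


print(ii_memory(4, 1, 1))  # dependent iterations are adjacent
print(ii_memory(4, 1, 2))  # same stages, but the dependence skips an iteration
```

A larger dependence distance spreads the same stage gap over more iterations, so the constraint on II becomes weaker as k grows.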
16Execution Time Estimation
T = S + II × (N - 1) + O + W (cycles)
- S: total number of pipeline stages of the datapath
- II: initiation interval between the fetch of 2 consecutive iterations
- N: number of loop iterations
- O: configuration overhead
- W: system write back
- Example: T = 5 + 2 × (32 - 1) + 4 = 71
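The estimate is easy to check numerically. The sketch below assumes the formula reads T = S + II × (N - 1) + O + W. The slide's example shows a single trailing term of 4, so the split of that 4 between O and W in the call is an arbitrary assumption.

```python
def execution_cycles(s, ii, n, o, w):
    """Estimated cycles for one kernel loop on the datapath.

    s: number of pipeline stages, ii: initiation interval,
    n: number of loop iterations, o: configuration overhead,
    w: system write back.
    """
    return s + ii * (n - 1) + o + w


# Slide example: S = 5, II = 2, N = 32, and O + W = 4 (split assumed).
print(execution_cycles(s=5, ii=2, n=32, o=4, w=0))
```

For long loops the II × (N - 1) term dominates, so lowering II matters far more than shortening the pipeline.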
17Reconfigurable Datapath Design
- Embed individual datapaths into a single datapath.
- Datapath graph Gi
- Vertices are hardware resources (memories, registers, function units)
- Edges are connections between them
- Construct a single graph G such that each Gi ⊆ G and G has the fewest edges and vertices
- Bipartite matching based algorithm [Huang 2001]
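To make the embedding idea concrete, here is a deliberately naive sketch. It is not the bipartite matching algorithm cited on the slide: it reuses function units greedily by type and makes no attempt to minimize edges. The graph encoding and all names are assumptions for illustration.

```python
def merge_datapaths(g, h):
    """Greedily embed datapath h into a copy of datapath g.

    A datapath is (vertices, edges): vertices maps a vertex name to an FU
    type, edges is a set of (source name, destination name) pairs.
    Returns the merged datapath and the mapping from h's names to merged names.
    """
    g_vertices, g_edges = g
    h_vertices, h_edges = h
    merged_vertices = dict(g_vertices)
    merged_edges = set(g_edges)

    free = {}  # FU type -> vertices of g not yet reused by h
    for name, kind in g_vertices.items():
        free.setdefault(kind, []).append(name)

    mapping = {}
    for name, kind in h_vertices.items():
        if free.get(kind):
            mapping[name] = free[kind].pop(0)   # share an existing FU
        else:
            new_name = f"h_{name}"
            while new_name in merged_vertices:  # avoid name clashes
                new_name += "_"
            merged_vertices[new_name] = kind    # add a new FU
            mapping[name] = new_name

    for src, dst in h_edges:
        merged_edges.add((mapping[src], mapping[dst]))
    return (merged_vertices, merged_edges), mapping


G1 = ({"a1": "add", "s1": "sub"}, {("a1", "s1")})
G2 = ({"x": "add", "y": "mul", "z": "sub"}, {("x", "y"), ("y", "z")})
(merged_vertices, merged_edges), mapping = merge_datapaths(G1, G2)
print(merged_vertices)
print(sorted(merged_edges))
```

Both input graphs are subgraphs of the result by construction. A matching-based method would additionally choose which same-type units to pair so that as many edges as possible coincide.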
18Reconfigurable Datapath
- Merged graph G to reconfigurable datapath
- Vertices to function units
- Edges to reconfigurable interconnects
- By selecting a subset of interconnections, any selected datapath can be generated and executed on the reconfigurable datapath
- Appropriate interconnects in the merged datapath are enabled using configuration bits
19Routing
- Useful interconnections are selected
- Routing box to select between multiple connections
- Configuration contexts
- Configuration bits for routing box
- Control bits for some FUs
- Static register initialization
[Figure: interconnection routing, with routing boxes, function units, and registers]
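The role of configuration bits can be sketched with a toy merged datapath: an adder feeding a subtractor through a routing box. The datapath, the context names, and the one-field context layout are all hypothetical; the point is only that changing the select value changes which kernel the same hardware computes.

```python
def routing_box(inputs, select):
    """Pass through the input chosen by the configuration bits."""
    return inputs[select]


# Hypothetical configuration contexts, one per kernel loop.
CONTEXTS = {
    "loop1": {"sub_in0": 0},  # subtractor reads the adder output
    "loop2": {"sub_in0": 1},  # subtractor reads input a directly
}


def run_datapath(context_name, a, b, c):
    """Evaluate the toy merged datapath under one configuration context."""
    context = CONTEXTS[context_name]
    add_out = a + b                                          # adder FU
    sub_in0 = routing_box([add_out, a], context["sub_in0"])  # routing box
    return sub_in0 - c                                       # subtractor FU


print(run_datapath("loop1", 5, 3, 2))  # computes (a + b) - c
print(run_datapath("loop2", 5, 3, 2))  # computes a - c
```

In this toy case a context is a single select field; a real context would also carry FU control bits and static register values, as the slide lists.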
20Reconfiguration Overhead
- Store configuration contexts of a limited number of kernel loops in distributed RAMs
- Fast context switch for reconfigurable fabrics
- NEC OmniPath [Furuta00], Chameleon Systems
- Reconfiguration overhead
- Read live-in register set
- Write live-out register set
[Figure: reconfiguration controller (RC) with a context address selecting among Context 1 to Context 4]
21Critical Path and Clock Speed
- Critical path in the reconfigurable datapath
- Delay of FU
- Delay of routing box
- Delay of directly connected wires
- Critical path in general processor
- No longer in FU stage
- Branch control, decoding stage
- The clock speed of reconfigurable datapath should
be no less than that for a general processor
22Benchmark Studies
- MPEG
- Overall speedup: 3.57
- 10 kernel loops: 86% of execution time
- Max possible speedup: 7.14
- GSM
- Overall speedup: 2.78
- 10 kernel loops: 81% of execution time
- Max possible speedup: 5.26
[Figure: speed-up chart]
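The maximum speedups on this slide are consistent with Amdahl's law applied to the fraction of time spent in the kernel loops (86% and 81%). The sketch below reproduces them and, as a derived figure that does not appear in the talk, back-solves the kernel-only speedup implied by the reported overall speedups.

```python
def max_speedup(kernel_fraction):
    """Amdahl bound: the kernel loops take zero time."""
    return 1.0 / (1.0 - kernel_fraction)


def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law for a kernel accelerated by kernel_speedup."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def implied_kernel_speedup(kernel_fraction, overall):
    """Solve Amdahl's law for the kernel speedup given the overall speedup."""
    return kernel_fraction / (1.0 / overall - (1.0 - kernel_fraction))


for name, fraction, overall in (("MPEG", 0.86, 3.57), ("GSM", 0.81, 2.78)):
    bound = max_speedup(fraction)
    kernel = implied_kernel_speedup(fraction, overall)
    print(f"{name}: bound {bound:.2f}, implied kernel speedup {kernel:.2f}")
```

Because the remaining 14% and 19% of execution time stay on the processor, even an infinitely fast datapath could not exceed the stated bounds.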
23Datapath Mapping Results
- Significant overlap between datapaths is obtained.
- Configuration bits: MPEG < 500 bits, GSM < 1000 bits
24Speed-up vs. Memory Bandwidth
- Make multiple copies of datapath
- Constraint: number of memory ports
[Figure: two plots, MPEG II Coder and GSM Coder, showing time and speed-up against # of memory ports]
25Clustered VLIW machine?
- Application specific clustered VLIW processor with one instruction per kernel loop
- Reconfiguration contexts as instructions
- Interconnections as application specific bypassing networks
[Figure: blocks labeled Configuration contexts, Mem Port, and FU]
26Reconfigurable Datapath (RD) vs. VLIW
[Figure: execution time charts for MPEG II and GSM]
27Applicable Application Domain
- computation intensive applications
- localized operational parallelism
- a few areas account for most of the execution time
28Conclusion
- A methodology for the design of a dynamically reconfigurable datapath coprocessor
- Kernel loop IR to datapath hardware
- Datapath hardware merged into reconfigurable hardware
- MPEG, GSM benchmark case studies
- Examined reconfigurable datapaths vs. VLIW processors
- Future research
- Increasing the datapath pipelining throughput through FU merging
- Fully automating the process