Title: Compiler-directed Synthesis of Multifunction Loop Accelerators
1 Compiler-directed Synthesis of Multifunction Loop Accelerators
- Kevin Fan, Manjunath Kudlur,
- Hyunchul Park, Scott Mahlke
- Advanced Computer Architecture Laboratory
- University of Michigan
2 Accelerating Streaming Applications
- Streaming applications
- Discrete transformations operating on data stream
- High performance
- Map application to pipeline of accelerators
- Multifunction accelerators reuse hardware
- Improve hardware efficiency
[Figure: an application (Loop 1, Frame Type?, Loop 2, Loop 3, Loop 4, Block 5) mapped to an accelerator pipeline with DRAM: loop accelerator LA1 and multifunction loop accelerators LA2 and LA3]
3 Loop Accelerator Schema
- Hardwired state machine for one or more critical loops
- Order-of-magnitude power and performance improvements over more general designs
4 Single Function Accelerator Design
- Use compiler as architecture synthesis tool
- Parameterized meta-architecture: all loop accelerators have same general organization
- Performance/throughput is input
- Compiler analysis to understand computation and communication requirements
- Hardware-sensitive optimization to reduce cost
5 Flow Diagram
[Figure: synthesis flow from "Application Loop, Desired II" to "Loop Accelerator". Stages: Allocate FUs, Modulo Schedule, Build Datapath, Instantiate Arch, Synthesize. Intermediate artifacts: Abstract Arch, Scheduled Ops, Concrete Arch (FUs and RF), Verilog and Control Signals]
6 FU Allocation
- Given operations in a loop and cost of hardware cells implementing those operations
- Minimize total FU cost while supporting all operations
Example: II = 2; operations: 3 × ADD, 1 × SUB, 2 × LOAD
[Figure: allocated FUs, labeled "-" and "MEM"]
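The allocation problem on this slide can be illustrated with a small brute-force search. This is only a sketch under stated assumptions, not the authors' formulation: the FU library (`FU_TYPES`), its costs, and the function names are hypothetical, and each FU is assumed to execute at most II operations per loop iteration. The operation counts are the slide's example, read as 3 ADD, 1 SUB, and 2 LOAD at II = 2.

```python
from itertools import combinations, product
from math import ceil


def is_feasible(op_counts, fu_counts, fu_types, ii):
    """Hall-style check: every group of op types must fit into the issue
    slots (II per FU) of the FUs that can execute at least one of them."""
    op_names = [op for op, n in op_counts.items() if n > 0]
    for size in range(1, len(op_names) + 1):
        for group in combinations(op_names, size):
            demand = sum(op_counts[op] for op in group)
            slots = ii * sum(
                count
                for fu, count in fu_counts.items()
                if fu_types[fu]["ops"] & set(group)
            )
            if demand > slots:
                return False
    return True


def allocate_fus(op_counts, fu_types, ii):
    """Enumerate FU counts and return (cost, counts) of the cheapest
    feasible mix. Exhaustive, so only suitable for tiny examples."""
    limit = ceil(sum(op_counts.values()) / ii)  # upper bound per FU type
    names = sorted(fu_types)
    best = None
    for counts in product(range(limit + 1), repeat=len(names)):
        fu_counts = dict(zip(names, counts))
        if not is_feasible(op_counts, fu_counts, fu_types, ii):
            continue
        cost = sum(fu_types[fu]["cost"] * n for fu, n in fu_counts.items())
        if best is None or cost < best[0]:
            best = (cost, fu_counts)
    return best


# Hypothetical FU library; the costs are made-up relative units.
FU_TYPES = {
    "ADD": {"ops": {"ADD"}, "cost": 10},
    "SUB": {"ops": {"SUB"}, "cost": 10},
    "ADDSUB": {"ops": {"ADD", "SUB"}, "cost": 12},
    "MEM": {"ops": {"LOAD"}, "cost": 30},
}
OPS = {"ADD": 3, "SUB": 1, "LOAD": 2}

cost, fus = allocate_fus(OPS, FU_TYPES, ii=2)
print(cost, fus)
```

With these made-up costs the search picks one adder, one combined add/subtract unit, and one memory unit. A different cost table would change the mix, which is why cell costs are an input to allocation.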
7 Modulo Scheduling and Datapath Derivation
- Schedule to abstract architecture (FUs)
- Determine register and interconnect requirements from schedule
[Figure: source code example: r1 = Mem[r2]; r3 = r1 (operator lost) 12]
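The two steps on this slide can be sketched as follows. This is a simplified illustration, not the deck's scheduler: it places operations greedily in a modulo reservation table without the backtracking a real iterative modulo scheduler uses, and the `Op` type, the latencies, and the example loop are all assumptions. Register need is approximated by value lifetimes and interconnect by producer-to-consumer FU pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    fu_class: str
    latency: int
    deps: list = field(default_factory=list)


def modulo_schedule(ops, fu_counts, ii):
    """Place ops (given in dependence order) at the earliest time whose
    slot (time mod II) is free on some FU of the right class."""
    by_name = {op.name: op for op in ops}
    mrt = {}     # modulo reservation table: (fu, time mod II) -> op name
    placed = {}  # op name -> (time, fu)
    for op in ops:
        earliest = max(
            (placed[d][0] + by_name[d].latency for d in op.deps), default=0
        )
        slot = None
        for t in range(earliest, earliest + ii):
            for i in range(fu_counts[op.fu_class]):
                fu = f"{op.fu_class}{i}"
                if (fu, t % ii) not in mrt:
                    slot = (t, fu)
                    break
            if slot is not None:
                break
        if slot is None:
            raise RuntimeError(f"no free slot for {op.name} at II={ii}")
        mrt[(slot[1], slot[0] % ii)] = op.name
        placed[op.name] = slot
    return placed


def datapath_requirements(ops, placed):
    """Derive FU-to-FU wires and per-value lifetimes from the schedule."""
    by_name = {op.name: op for op in ops}
    wires, lifetimes = set(), {}
    for op in ops:
        t_use, fu_use = placed[op.name]
        for d in op.deps:
            t_def, fu_def = placed[d]
            ready = t_def + by_name[d].latency
            wires.add((fu_def, fu_use))
            lifetimes[d] = max(lifetimes.get(d, 0), t_use - ready)
    return wires, lifetimes


# Hypothetical loop body: two loads feeding an add, then a subtract.
LOOP = [
    Op("ld1", "MEM", 2),
    Op("ld2", "MEM", 2),
    Op("add", "ALU", 1, ["ld1", "ld2"]),
    Op("sub", "ALU", 1, ["add"]),
]
schedule = modulo_schedule(LOOP, {"MEM": 1, "ALU": 1}, ii=2)
wires, lifetimes = datapath_requirements(LOOP, schedule)
print(schedule, wires, lifetimes)
```

In this example the value from `ld1` has to be held for one cycle before the add consumes it, while the other values are consumed as soon as they are ready. That difference is the kind of information a datapath builder would turn into storage and wiring.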
8 Multifunction Accelerator
- Single hardware accelerator to run multiple loops
- Could place single function accelerators side by side
- Want to exploit potential hardware sharing between loops
  - Function units
  - Registers
  - Interconnect
9 Multifunction Design Strategies
1. Union Method
2. Phase Ordered Method
[Figure: FU diagrams illustrating each strategy]
10 Union Method
Goal: combine FUs and register files to improve hardware sharing.
[Figure: positional union of Accel 1 and Accel 2; FU labels include "-" and "M"]
11 Union Method
- Smart union formulated as ILP problem which minimizes FU and register cost
- Benefit: look at whole design at once
- Limitation: schedules are fixed prior to union phase
- Fast runtime
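The difference between a positional union and a cost-minimizing union can be shown on a toy example. The sketch below swaps the ILP for an exhaustive search over FU pairings, covers FU cost only (no registers or interconnect), and uses a hypothetical cost model in which an FU costs the sum of the operation types it supports. The accelerators and costs are invented for illustration.

```python
from itertools import permutations

# Hypothetical per-operation hardware costs (relative units).
OP_COST = {"ADD": 10, "SUB": 10, "MUL": 40, "LOAD": 30}


def fu_cost(ops):
    """Assumed cost model: an FU costs the sum of the op types it supports."""
    return sum(OP_COST[op] for op in ops)


def union_cost(accel_a, accel_b):
    """Merge FU i of one accelerator with FU i of the other."""
    return sum(fu_cost(a | b) for a, b in zip(accel_a, accel_b))


def pad(accel, n):
    """Pad with empty FUs so both accelerators have n positions."""
    return list(accel) + [set()] * (n - len(accel))


def compare_unions(accel_a, accel_b):
    """Return positional, best, and worst union costs by brute force."""
    n = max(len(accel_a), len(accel_b))
    a, b = pad(accel_a, n), pad(accel_b, n)
    costs = [union_cost(a, perm) for perm in permutations(b)]
    return {
        "positional": union_cost(a, b),
        "best": min(costs),
        "worst": max(costs),
    }


# Two invented single-function accelerators, each a list of FU op sets.
ACCEL_1 = [{"ADD"}, {"MUL"}, {"LOAD"}]
ACCEL_2 = [{"ADD", "SUB"}, {"LOAD"}, {"MUL"}]

separate = sum(fu_cost(f) for f in ACCEL_1 + ACCEL_2)
result = compare_unions(ACCEL_1, ACCEL_2)
print(separate, result)
```

Here the positional union saves little over placing the accelerators side by side, because FUs at the same position mostly support different operations. The best pairing merges like with like. Exhaustive search grows factorially with FU count, which is one reason to use an ILP formulation for realistic designs.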
12 Cost of Union of Accelerators
[Chart: results for the Image Processing, MPEG4, and Signal Processing benchmarks]
- Worst union: 25% average savings
- Positional union: 29% average savings
- Best union: 33% average savings
13 Phase Ordered Method
- Schedule loops in order
- During scheduling, account for hardware from previous loop
- Cost sensitive scheduler attempts to minimize hardware cost increase
[Figure: Loop 1 yields Accel 1; Loop 2 is then scheduled against it to yield Accel 1+2]
14 Cost Sensitive Scheduling
- Different valid scheduling alternatives are not equal
[Figure: two schedule tables (FU1, FU2, FU3 against time) showing placements of LD1 and LD2]
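15 Greedy Cost Sensitive Scheduler
- Select scheduling alternative with minimum cost
- Account for estimated cost of unscheduled ops
[Figure: Loop 1 feeds the Modulo Scheduler, which exchanges scheduling alternatives (Alt_i) and their costs (Cost_i) with the Hardware Cost Modeler; steps numbered 1 to 5]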
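The greedy selection step can be sketched as below. This is a deliberately reduced model, not the deck's scheduler: it ignores data dependences and the estimated cost of unscheduled operations, treats an alternative as an (FU, slot) pair within one II window, and charges a hypothetical cost whenever an FU must gain support for a new operation type. All names and numbers are assumptions.

```python
# Hypothetical cost of adding support for an op type to an existing FU.
OP_COST = {"ADD": 10, "SUB": 10, "MUL": 40, "LOAD": 30}


def incremental_cost(fu_ops, op_type):
    """Stand-in for the hardware cost modeler: no cost if the FU already
    supports the op type, otherwise the cost of adding it."""
    return 0 if op_type in fu_ops else OP_COST[op_type]


def greedy_schedule(loop_ops, existing_fus, ii):
    """For each op, pick the free (FU, slot) alternative with minimum
    incremental hardware cost. Ties go to the lowest slot, then lowest FU."""
    fus = [set(f) for f in existing_fus]  # copy; hardware from earlier loops
    busy = set()                          # occupied (fu index, slot) pairs
    schedule, added_cost = {}, 0
    for name, op_type in loop_ops:
        alternatives = [
            (incremental_cost(fus[i], op_type), slot, i)
            for i in range(len(fus))
            for slot in range(ii)
            if (i, slot) not in busy
        ]
        if not alternatives:
            raise RuntimeError(f"no free slot for {name} at II={ii}")
        cost, slot, i = min(alternatives)
        fus[i].add(op_type)
        busy.add((i, slot))
        schedule[name] = (i, slot)
        added_cost += cost
    return schedule, added_cost, fus


# FUs left by a previously scheduled loop (invented), and a second loop.
ACCEL_1 = [{"ADD"}, {"LOAD"}, {"MUL"}]
LOOP_2 = [("ld1", "LOAD"), ("ld2", "LOAD"), ("a1", "ADD"), ("s1", "SUB")]

schedule, added_cost, merged = greedy_schedule(LOOP_2, ACCEL_1, ii=2)
print(schedule, added_cost, merged)
```

The loads and the add land on FUs that already support them, so only the subtract adds hardware. A cost-unaware scheduler could just as validly place a load on the adder, which would mean adding load support to that FU.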
16 Phase Ordered Method
- Extend conventional iterative modulo scheduler with hardware cost model
- Benefits
  - Scheduler is aware of hardware for all previously scheduled loops
  - Can adjust schedule to improve cost savings
- Limitation: process is localized and greedy; schedules of previous loops are fixed
- Fast runtime
17 Cost Sensitive Scheduling Comparison
[Chart: results for the Image Processing, MPEG4, and Signal Processing benchmarks]
- Greedy scheduling: 41% average savings
- ILP scheduling: 51% average savings
18 Union vs. Phase Ordered Methods
[Chart: results for the Image Processing, MPEG4, and Signal Processing benchmarks]
- Union method: 45% average savings
- Phase ordered method: 41% average savings
19 Conclusion
- Compiler-directed design system
- Multifunction accelerator for hardware reuse
- Two multifunction design methods
  - Smart union of single-function accelerators: 45% average cost savings
  - Phase ordered scheduling: 41% average cost savings
- Overall, 20% to 61% hardware savings from sharing
20 Questions?