Compiler-directed Synthesis of Multifunction Loop Accelerators

Description:

Single hardware accelerator to run multiple loops ... Compiler-directed design system. Multifunction accelerator for hardware reuse. Two multifunction design methods ...

Slides: 21

Provided by: fank

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes

1
Compiler-directed Synthesis of Multifunction Loop Accelerators

Kevin Fan, Manjunath Kudlur,
Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan

2
Accelerating Streaming Applications

Streaming applications
Discrete transformations operating on data stream
High performance
Map application to pipeline of accelerators
Multifunction accelerators reuse hardware
Improve hardware efficiency

[Figure: an application (Loop 1, Frame Type?, Loop 2, Loop 3, Loop 4, Block 5) mapped onto an accelerator pipeline consisting of loop accelerator LA1 and multifunction loop accelerators LA2 and LA3]
3
Loop Accelerator Schema

Hard-wired state machine for one or more critical loops
Order-of-magnitude power and performance improvements over more general designs

4
Single Function Accelerator Design

Use the compiler as an architecture synthesis tool
Parameterized meta-architecture: all loop accelerators have the same general organization
Performance/throughput is an input
Compiler analysis to understand computation and communication requirements
Hardware-sensitive optimization to reduce cost

5
Flow Diagram
[Flow diagram: Application Loop, Desired II -> Allocate FUs -> Abstract Arch (FUs) -> Modulo Schedule -> Scheduled Ops -> Build Datapath -> Concrete Arch (FUs, RF) -> Instantiate Arch -> Verilog, Control Signals -> Synthesize -> Loop Accelerator]
6
FU Allocation

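Given the operations in a loop and the cost of the hardware cells implementing those operations
Minimize total FU cost while supporting all operations

Example: II = 2; loop operations: 3 ADD, 1 SUB, 2 LOAD
[Figure: allocated FUs, including a subtractor (-) and a memory unit (MEM)]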
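The allocation problem on this slide can be illustrated with a small brute-force search. This is only a sketch under stated assumptions, not the method used by the authors: the FU types, their integer costs, and the rule that each FU executes at most II operations per iteration are all hypothetical choices made for the example.

```python
from itertools import combinations, product
from math import ceil


def is_feasible(op_counts, fu_types, alloc, ii):
    """Check that every group of op kinds fits on the FUs able to run them.

    Assumption: each FU instance can start one operation per cycle, so it
    handles at most `ii` operations per loop iteration.
    """
    ops = list(op_counts)
    for r in range(1, len(ops) + 1):
        for subset in combinations(ops, r):
            demand = sum(op_counts[op] for op in subset)
            units = sum(n for fu, n in alloc.items()
                        if fu_types[fu]["ops"] & set(subset))
            if demand > ii * units:
                return False
    return True


def allocate_fus(op_counts, fu_types, ii):
    """Exhaustively search FU counts and return (min_cost, allocation)."""
    max_units = ceil(sum(op_counts.values()) / ii)
    names = list(fu_types)
    best = None
    for counts in product(range(max_units + 1), repeat=len(names)):
        alloc = dict(zip(names, counts))
        cost = sum(fu_types[name]["cost"] * n for name, n in alloc.items())
        if (best is None or cost < best[0]) and is_feasible(
                op_counts, fu_types, alloc, ii):
            best = (cost, alloc)
    return best


# Hypothetical FU library; names and costs are illustrative only.
FU_TYPES = {
    "ADDER": {"ops": {"ADD"}, "cost": 10},
    "SUBTRACTOR": {"ops": {"SUB"}, "cost": 10},
    "ADDSUB": {"ops": {"ADD", "SUB"}, "cost": 13},
    "MEM": {"ops": {"LOAD"}, "cost": 20},
}

# Operation mix from the slide: II = 2 with 3 ADD, 1 SUB, 2 LOAD.
cost, alloc = allocate_fus({"ADD": 3, "SUB": 1, "LOAD": 2}, FU_TYPES, ii=2)
print(cost, alloc)
```

With these made-up costs the search picks one adder, one combined adder/subtractor and one memory unit, since two arithmetic FUs are enough to cover four arithmetic operations at II = 2.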
7
Modulo Scheduling and Datapath Derivation

Schedule to abstract architecture (FUs)
Determine register and interconnect requirements from schedule

[Figure: source code example over registers r1, r2, r3, memory (Mem) and the constant 12]
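To make the scheduling step concrete, here is a minimal sketch of placing operations in a modulo reservation table. It is not the scheduler from the presentation: the function name, the operation tuple format, and the simplifications (operations given in dependence order, no loop-carried dependences, no backtracking) are assumptions made for illustration.

```python
def modulo_schedule(ops, fu_counts, ii):
    """Place ops in a modulo reservation table (MRT).

    ops: list of (name, fu_kind, latency, deps), already in dependence order.
    fu_counts: number of FU instances available per kind.
    An op issued at time t occupies slot t % ii on one FU instance.
    """
    mrt = {}      # (fu_kind, unit, slot) -> op name
    start = {}    # op name -> issue time
    latency_of = {}
    for name, kind, latency, deps in ops:
        # Earliest time at which all producers have finished.
        earliest = max((start[d] + latency_of[d] for d in deps), default=0)
        # Only ii distinct slots exist, so trying ii consecutive times suffices.
        for t in range(earliest, earliest + ii):
            slot = t % ii
            unit = next((u for u in range(fu_counts[kind])
                         if (kind, u, slot) not in mrt), None)
            if unit is not None:
                mrt[(kind, unit, slot)] = name
                start[name] = t
                latency_of[name] = latency
                break
        else:
            raise ValueError(f"no free {kind} slot for {name} at II={ii}")
    return start, mrt


# Hypothetical loop body: two loads feeding two adds, II = 2.
OPS = [
    ("ld1", "MEM", 2, []),
    ("add1", "ADD", 1, ["ld1"]),
    ("ld2", "MEM", 2, []),
    ("add2", "ADD", 1, ["ld2", "add1"]),
]
start, mrt = modulo_schedule(OPS, {"MEM": 1, "ADD": 1}, ii=2)
print(start)
```

Register and interconnect requirements would then be read off the schedule: each value must be held from the cycle its producer finishes until its last consumer issues, and each producer/consumer FU pair needs a connection.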
8
Multifunction Accelerator

Single hardware accelerator to run multiple loops
Could place single-function accelerators side by side
Want to exploit potential hardware sharing between loops:
Function units
Registers
Interconnect

9
Multifunction Design Strategies

1. Union Method
2. Phase-Ordered Method

[Figures: FU block diagrams illustrating each method]
10
Union Method
Goal: combine FUs and register files to improve hardware sharing.
Positional Union

[Figure: positional union of the FUs of Accel 1 and Accel 2 (subtractor "-" and memory units "M")]
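A positional union can be sketched as merging the FUs that sit at the same index in each accelerator. The representation below (each FU as a set of operation kinds it supports) and the example accelerators are assumptions for illustration, not taken from the slides.

```python
from itertools import zip_longest


def positional_union(accelerators):
    """Merge the i-th FU of every accelerator into one multifunction FU.

    Each accelerator is a list of FUs; each FU is a set of supported op kinds.
    Shorter accelerators are padded with empty FUs.
    """
    merged = []
    for column in zip_longest(*accelerators, fillvalue=frozenset()):
        merged.append(frozenset().union(*column))
    return merged


# Hypothetical accelerators: "SUB" is a subtractor, "MEM" a memory unit.
ACCEL_1 = [{"SUB"}, {"MEM"}, {"MEM"}]
ACCEL_2 = [{"MEM"}, {"MEM"}]

for fu in positional_union([ACCEL_1, ACCEL_2]):
    print(sorted(fu))
```

Because FUs are paired purely by position, the result depends on the order in which each accelerator lists its FUs: here the second memory unit is shared, but the first one ends up merged with the subtractor.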
11
Union Method

Smart union: formulated as an ILP problem that minimizes FU and register cost
Benefit: looks at the whole design at once
Limitation: schedules are fixed prior to the union phase
Fast runtime
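The idea of choosing which FUs to merge so that total cost is minimized can be illustrated with an exhaustive search over pairings. This brute-force search is a stand-in for the ILP formulation mentioned above, not a reproduction of it, and it ignores register cost. The per-operation costs and the rule that a merged FU costs the sum of its capabilities are hypothetical.

```python
from itertools import permutations

# Hypothetical cost of supporting each op kind inside an FU.
OP_COST = {"ADD": 10, "SUB": 10, "MEM": 20}


def fu_cost(ops):
    """Assumed cost model: a merged FU pays for every op kind it supports."""
    return sum(OP_COST[op] for op in ops)


def pad(accel, n):
    """Pad an accelerator with empty FUs so both have n positions."""
    return [set(fu) for fu in accel] + [set() for _ in range(n - len(accel))]


def union_cost(accel_a, accel_b):
    """Cost of merging the two accelerators position by position."""
    return sum(fu_cost(a | b) for a, b in zip(accel_a, accel_b))


def best_union(accel_a, accel_b):
    """Try every pairing of FUs and keep the cheapest merged design."""
    n = max(len(accel_a), len(accel_b))
    a, b = pad(accel_a, n), pad(accel_b, n)
    best = min(permutations(b), key=lambda p: union_cost(a, p))
    return [x | y for x, y in zip(a, best)], union_cost(a, best)


ACCEL_A = [{"ADD"}, {"MEM"}]
ACCEL_B = [{"MEM"}, {"ADD"}, {"SUB"}]

merged, cost = best_union(ACCEL_A, ACCEL_B)
positional = union_cost(pad(ACCEL_A, 3), pad(ACCEL_B, 3))
print(merged, cost, positional)
```

In this toy case the positional pairing merges an adder with a memory unit twice, while the best pairing matches like with like. Exhaustive search grows factorially with the number of FUs, so it only serves to show what the optimization is looking for.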

12
Cost of Union of Accelerators
[Chart: benchmark groups Image Processing, MPEG4, Signal Processing]
Worst union: 25% average savings
Positional union: 29% average savings
Best union: 33% average savings
13
Phase Ordered Method

Schedule loops in order
During scheduling, account for hardware from the previous loop
Cost-sensitive scheduler attempts to minimize the hardware cost increase

[Figure: Loop 1 is scheduled to produce Accel 1; Loop 2 is then scheduled against that hardware to produce Accel 1+2]
14
Cost Sensitive Scheduling

Different valid scheduling alternatives are not equal

[Figure: two alternative placements of loads LD1 and LD2 on FU1, FU2 and FU3 across time slots 0 to 2]
15
Greedy Cost Sensitive Scheduler

Select the scheduling alternative with minimum cost
Account for the estimated cost of unscheduled ops

[Figure: Loop 1 feeds the modulo scheduler, which sends each alternative Alt_i to the hardware cost modeler and receives its cost Cost_i]
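The greedy selection described on this slide can be sketched as follows. This is a simplified illustration, not the authors' scheduler: it ignores dependences and the estimated cost of unscheduled ops, and the cost numbers, the `NEW_FU_OVERHEAD` constant and the data layout are all hypothetical.

```python
# Hypothetical cost model: adding a capability to an existing FU costs
# OP_COST[kind]; creating a brand-new FU costs that plus a fixed overhead.
OP_COST = {"ADD": 10, "SUB": 10, "MEM": 20}
NEW_FU_OVERHEAD = 5


def greedy_cost_sensitive(ops, hardware, ii):
    """Place each op on the cheapest free (FU, slot) alternative.

    ops: op kinds of the loop being scheduled (dependences ignored here).
    hardware: FUs left by previously scheduled loops, as sets of op kinds.
    Returns (placements, updated_hardware, added_cost).
    """
    hardware = [set(fu) for fu in hardware]
    busy = set()          # (fu_index, slot) pairs taken by this loop
    placements = []
    added_cost = 0
    for kind in ops:
        alternatives = []
        for f, caps in enumerate(hardware):
            for slot in range(ii):
                if (f, slot) not in busy:
                    cost = 0 if kind in caps else OP_COST[kind]
                    alternatives.append((cost, f, slot))
        # Always allow the fallback of adding a new FU.
        alternatives.append((OP_COST[kind] + NEW_FU_OVERHEAD, len(hardware), 0))
        cost, f, slot = min(alternatives)   # ties go to the lowest FU index
        if f == len(hardware):
            hardware.append(set())
        hardware[f].add(kind)
        busy.add((f, slot))
        placements.append((kind, f, slot))
        added_cost += cost
    return placements, hardware, added_cost


# Accelerator left by a hypothetical first loop: one adder, one memory unit.
ACCEL_1 = [{"ADD"}, {"MEM"}]
placements, accel_12, added = greedy_cost_sensitive(
    ["MEM", "MEM", "ADD", "SUB", "MEM"], ACCEL_1, ii=2)
print(placements, accel_12, added)
```

In this run the two loads and the add reuse existing hardware at no cost, the subtract is folded into the adder, and only the third load forces a new memory unit. Each decision is local, which matches the greedy limitation noted for this method.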
16
Phase Ordered Method

Extend a conventional iterative modulo scheduler with a hardware cost model
Benefits:
Scheduler is aware of hardware for all previously scheduled loops
Can adjust schedule to improve cost savings
Limitation: the process is localized and greedy; schedules of previous loops are fixed
Fast runtime

17
Cost Sensitive Scheduling Comparison
[Chart: benchmark groups Image Processing, MPEG4, Signal Processing]
Greedy scheduling: 41% average savings
ILP scheduling: 51% average savings
18
Union vs. Phase Ordered Methods
[Chart: benchmark groups Image Processing, MPEG4, Signal Processing]
Union method: 45% average savings
Phase-ordered method: 41% average savings
19
Conclusion