Title: Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
1. Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
- Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke, University of Michigan
2. Automated C-to-Gates Solution
- SoC design
  - 10-100 Gops, 200 mW power budget
  - Low-level tools ineffective
- Automated accelerator synthesis for the whole application
  - Correct by construction
  - Increased designer productivity
  - Faster time to market
3. Streaming Applications
- Data streaming through kernels
  - Kernels are tight loops: FIR, Viterbi, DCT
- Coarse-grain dataflow between kernels
  - Sub-blocks of images, network packets
4. System Schema Overview
[Figure: three loop accelerators (LA 1-3) executing kernels 1-5, with task throughput shown along a time axis]
5. Input Specification
- System specification
  - Sequential C program
  - Function with main input/output
  - Local arrays to pass data
  - Sequence of calls to kernels
- Kernel specification
  - Wrapped inside a C function
  - Perfectly nested FOR loop
  - All data access made explicit
    void row_trans(char inp[8][8], char out[8][8]);
    void col_trans(char inp[8][8], char out[8][8]);
    void zigzag_trans(char inp[8][8], char out[8][8]);

    /* Kernel: perfectly nested FOR loop */
    void row_trans(char inp[8][8], char out[8][8])
    {
        for (i = 0; i < 8; i++)
            for (j = 0; j < 8; j++)
                ... inp[i][j] ... out[i][j] ...
    }

    /* System: local arrays pass data between a sequence of kernel calls */
    void dct(char inp[8][8], char out[8][8])
    {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }
6. System-Level Decisions
- Throughput of each LA: initiation interval (II)
- Grouping of loops into a multifunction LA
  - More loops in a single LA → LA occupied for a longer time per task
[Figure: with a throughput of 1 task / 200 cycles, LA 1 is occupied for the full 200 cycles while LA 2 and LA 3 handle other stages]
7. System-Level Decisions (contd.)
- Cost of SRAM buffers for intermediate arrays
  - More buffers → more task overlap → higher performance
[Figure: tmp1 buffer in use by LA 2; adjacent tasks use different buffers]
8. Case Study: Simple Benchmark
[Figure: loop graph of a simple benchmark (trip count 256) mapped onto loop accelerators, including LA 1]
9. Prescribed-Throughput Accelerators
- Traditional behavioral synthesis
  - Directly translates C operators into gates
- Our approach: application-centric architectures
  - Achieve a fixed throughput
  - Maximize hardware sharing
[Figure: the application's operation graph mapped onto a shared datapath architecture]
10. Loop Accelerator Template
- Hardware realization of a modulo-scheduled loop
- Parameterized execution resources, storage, and connectivity
11. Loop Accelerator Design Flow
[Figure: design flow from C code plus a performance (throughput) requirement, through FU allocation, to an abstract architecture of FUs and register files (RF)]
12. Multifunction Accelerator
- Map multiple loops to a single accelerator
- Improve hardware efficiency via reuse
- Opportunities for sharing
  - Disjoint stages (loops 2, 3)
  - Pipeline slack (loops 4, 5)
[Figure: application control flow through loops 1-5, with a "Frame type?" branch selecting loop 2 or loop 3]
13. Union
[Figure: a cost-sensitive modulo scheduler maps loop 1 and loop 2 onto a shared (union) set of FUs]
- 43% average savings over the sum of individual accelerators
- Smart union within 3% of the joint scheduling solution
14. Challenges: Throughput-Enabling Transformations
- Algorithm-level pipeline retiming
- Splitting loops based on tiling
- Co-scheduling adjacent loops
[Figure: a critical loop 2 split into loops 2a and 2b, and loops 3 and 4 co-scheduled, to balance the pipeline]
15. Challenges: Programmable Loop Accelerator
- Support bug fixes and evolving standards
- Accelerate loops not known at design time
- Minimize additional control overhead
[Figure: programmable accelerator datapath, with FUs, local memory, and interconnect driven by control signals generated per II]
16. Challenges: Timing-Aware Synthesis
- Technology scaling and increasing FU counts → rising interconnect cost and wire capacitance
- Strategies to eliminate long wires
  - Preemptive: predict and prevent long wires
  - Reactive: use feedback from the floorplanner
    - Insert a flip-flop on the long path
    - Reschedule with the added latency
[Figure: a long interconnect path among FU1, FU2, and FU3, pipelined by an inserted flip-flop]
17. Challenges: Adaptable Voltage/Frequency Levels
- Allow voltage scaling beyond worst-case margins
- Using shadow latches in the loop accelerator
  - Localized error detection
  - Control is predefined → simple error recovery
[Figure: a flip-flop (D, Q, CLK) augmented with a delayed shadow latch and an error signal; FUs backed by shadow latches and extra queue entries]
18. For More Information
- Visit http://cccp.eecs.umich.edu