Title: Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
1. Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
- Manjunath Kudlur, Kevin Fan, Scott Mahlke
- Advanced Computer Architecture Lab
- University of Michigan
2. Automated C to Gates Solution
- SoC design
- 10-100 Gops, 200 mW power budget
- Low-level tools ineffective
- Automated accelerator synthesis for the whole application
  - Correct by construction
- Increase designer productivity
- Faster time to market
3. Streaming Applications
- Data streaming through kernels
- Kernels are tight loops
  - FIR, Viterbi, DCT
- Coarse-grain dataflow between kernels
  - Sub-blocks of images, network packets
4. Software Overview
[Figure: tool flow — the whole application passes through frontend analyses to produce a loop graph; system-level synthesis then maps the loop graph onto an accelerator pipeline of multifunction accelerators and SRAM buffers]
5. Input Specification
- Sequential C program
- Kernel specification (a fully worked example appears below)
  - Perfectly nested FOR loop
  - Wrapped inside a C function
  - All data accesses made explicit
- System specification
  - Function with the main input/output
  - Local arrays to pass data
  - Sequence of calls to kernels
Kernel specification:

    row_trans(char inp[8][8], char out[8][8]) {
      for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++) {
          . . . = inp[i][j]; out[i][j] = . . .;
        }
    }
    col_trans(char inp[8][8], char out[8][8]) { . . . }
    zigzag_trans(char inp[8][8], char out[8][8]) { . . . }

System specification:

    dct(char inp[8][8], char out[8][8]) {
      char tmp1[8][8], tmp2[8][8];
      row_trans(inp, tmp1);
      col_trans(tmp1, tmp2);
      zigzag_trans(tmp2, out);
    }
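For illustration, a complete kernel in this prescribed form might look as follows. This is a minimal sketch: the loop body is a placeholder copy, not the actual row transform of the DCT.

    /* Hypothetical kernel in the prescribed form: a perfectly nested FOR loop
       wrapped in a C function, with all data accesses made explicit through
       the array parameters. The body is a placeholder, not the real transform. */
    void row_trans(char inp[8][8], char out[8][8])
    {
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++)
                out[i][j] = inp[i][j];    /* real kernel: 1-D transform of row i */
    }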
6. Performance Specification
Input image (1024 x 768)
- High-performance DCT
  - Process one 1024x768 image every 2 ms
  - Given a 400 MHz clock
  - One image every 800,000 cycles
  - One block every 64 cycles
- Low-performance DCT
  - Process one 1024x768 image every 4 ms
  - One block every 128 cycles
[Figure: one task — inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out (output coefficients)]
Performance goal: task throughput, expressed as the number of cycles between successive tasks (see the budget calculation sketched below).
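To make the arithmetic concrete, a small sketch of the budget calculation; the clock rate, image size, and block size are from the slides, and the resulting per-block figure of about 65 cycles is rounded down in the talk to the prescribed 64-cycle (or 128-cycle) task throughput.

    /* Sketch: derive the per-block cycle budget from the image-level goal. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz = 400e6;                       /* 400 MHz clock        */
        const double period_s = 2e-3;                        /* one image every 2 ms */
        const long cycles = (long)(clock_hz * period_s);     /* 800,000 cycles       */
        const long blocks = (1024 / 8) * (768 / 8);          /* 12,288 8x8 blocks    */

        printf("cycles per image: %ld\n", cycles);           /* 800000               */
        printf("cycles per block: %ld\n", cycles / blocks);  /* 65                   */
        return 0;
    }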
7. Building Blocks
[Figure: building blocks — kernels 1-4 connected through SRAM buffers tmp1, tmp2, tmp3; groups of kernels are mapped onto a multifunction loop accelerator (CODES+ISSS '06)]
8. System Schema Overview
[Figure: system schema — kernels 1-5 grouped onto loop accelerators LA 1, LA 2, and LA 3; successive tasks are launched at the prescribed task throughput along the time axis]
9. Cost Components
- Cost of the loop accelerator datapath
  - Cost of FUs, shift registers, muxes, interconnect
- Initiation interval (II)
  - Key parameter that decides LA cost
  - Low II → high performance → high cost
  - Loop execution time = trip count × II
  - Appropriate II chosen to satisfy the task throughput (see the sketch below)
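A minimal sketch of how the II bound follows from the task budget; the trip count and cycle budget in main() are illustrative values, not taken from the benchmarks.

    /* Since loop execution time ~= trip count x II, the largest (and therefore
       cheapest) II that still meets a per-task cycle budget is budget / trip_count. */
    #include <stdio.h>

    int max_feasible_ii(int trip_count, int cycle_budget)
    {
        int ii = cycle_budget / trip_count;   /* largest II with trip_count * II <= budget */
        return ii < 1 ? 1 : ii;               /* II is at least 1                          */
    }

    int main(void)
    {
        printf("chosen II = %d\n", max_feasible_ii(64, 128));   /* prints 2 */
        return 0;
    }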
10. Cost Components (contd.)
- Grouping of loops into a multifunction LA
  - More loops in a single LA → the LA is occupied for a longer time within the current task
[Figure: example schedule at a throughput of 1 task per 200 cycles — LA 1 is occupied for the full 200 cycles; LA 2 and LA 3 are also shown]
11. Cost Components (contd.)
- Cost of SRAM buffers for intermediate arrays
  - More buffers → more task overlap → higher performance (see the double-buffering sketch below)
[Figure: while the tmp1 buffer is in use by LA 2, adjacent tasks use different buffer copies]
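As a software analogy of this buffering idea (not the paper's hardware mechanism), a minimal double-buffering sketch: two copies of tmp1 let the producer of task t and the consumer of task t-1 touch different buffers, so in hardware they could run concurrently. The kernels here are stubs.

    #include <string.h>

    #define NUM_TASKS 4

    static void producer(char in[8][8], char out[8][8]) { memcpy(out, in, 64); }  /* stands in for row_trans */
    static void consumer(char in[8][8], char out[8][8]) { memcpy(out, in, 64); }  /* stands in for col_trans */

    int main(void)
    {
        static char inp[NUM_TASKS][8][8], out[NUM_TASKS][8][8];
        static char tmp1[2][8][8];                   /* two copies of the intermediate buffer */

        for (int t = 0; t < NUM_TASKS; t++) {
            int w = t % 2;                           /* buffer written for task t             */
            producer(inp[t], tmp1[w]);
            if (t > 0)
                consumer(tmp1[1 - w], out[t - 1]);   /* consumes the buffer of task t-1       */
        }
        consumer(tmp1[(NUM_TASKS - 1) % 2], out[NUM_TASKS - 1]);   /* drain the last task     */
        return 0;
    }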
12. ILP Formulation
- Variables
  - II for each loop
  - Which loops are combined into a single LA
  - Number of buffers for each temporary array
- Objective function
  - Cost of LAs + cost of buffers
- Constraints
  - Overall task throughput must be achieved (a schematic form is sketched below)
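A schematic of the formulation in LaTeX, under assumed symbol names (T for the prescribed task period in cycles, TC_l for the trip count of loop l, n_v for the number of buffer copies of array v); the exact constraint set, including how buffer counts enable task overlap, is in the paper.

    \min \sum_{a \in \text{LAs}} \mathrm{Cost}_{\mathrm{LA}}(a)
         + \sum_{v \in \text{arrays}} n_v \cdot \mathrm{Cost}_{\mathrm{buf}}(v)
    \quad \text{s.t.} \quad
    \sum_{\ell \in a} TC_{\ell} \cdot II_{\ell} \le T
    \quad \text{for every accelerator } a

That is, each loop accelerator must finish its share of one task within the task period, with each II encoded through the selection variables of the next slide.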
13. Non-linear LA Cost
[Figure: relative LA cost versus initiation interval; cost decreases non-linearly as II grows from II_min to II_max]
For II_min ≤ II ≤ II_max, the II of a loop is encoded with 0/1 selection variables:
II = 1·II_1 + 2·II_2 + 3·II_3 + . . . + 14·II_14, with II_1 + II_2 + . . . + II_14 = 1 and 0 ≤ II_i ≤ 1
Cost(II) = C_1·II_1 + C_2·II_2 + C_3·II_3 + . . . + C_14·II_14
14. Multifunction Accelerator Cost
[Figure: three ways of combining LA 1-LA 4 into multifunction accelerators, from no hardware sharing to full sharing]
- Worst case, no sharing: cost = sum of the individual LA costs
- Realistic case, some sharing: cost between the sum and the max
- Best case, full sharing: cost = max of the individual LA costs
- Impractical to obtain an accurate cost for all combinations
  - Estimate: C_LA = 0.5 × (SUM_CLA + MAX_CLA) (computed in the sketch below)
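A minimal sketch of this estimate, assuming the per-loop LA costs are already known; the sample costs in main() are hypothetical.

    /* Estimate the cost of a multifunction LA as the midpoint between the
       no-sharing bound (sum) and the full-sharing bound (max). */
    #include <stdio.h>

    double multi_la_cost(const double cost[], int n)
    {
        double sum = 0.0, max = 0.0;
        for (int i = 0; i < n; i++) {
            sum += cost[i];
            if (cost[i] > max) max = cost[i];
        }
        return 0.5 * (sum + max);
    }

    int main(void)
    {
        double cost[] = { 1.0, 2.5, 1.5 };   /* hypothetical per-loop LA costs */
        printf("estimated cost = %.2f\n", multi_la_cost(cost, 3));   /* 0.5 * (5.0 + 2.5) = 3.75 */
        return 0;
    }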
15. Case Study: Simple Benchmark
[Figure: loop graph of a simple benchmark (trip count TC = 256) and its mapping onto loop accelerators]
16. Beamformer
- Beamformer: 10 loops
- Memory cost: 60 to 70%
- Up to 20% cost savings due to hardware sharing in multifunction accelerators
- Systems at lower throughput have over-designed LAs
  - Not profitable to pick a lower-performance LA
- Memory buffer cost is significant
  - A high-performance producer/consumer is better than more buffers
17. Conclusions
- Automated design is realistic for a system of loops
  - Designers can move up the abstraction hierarchy
- Observations
  - Macro-level hardware sharing can achieve significant cost savings
  - Memory cost is significant; need to optimize simultaneously for datapath and memory cost
  - ILP formulation is tractable
    - Solver took less than 1 minute for systems with 30 loops