Transcript and Presenter's Notes

Title: Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines


1
Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
  • Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and
    Scott Mahlke
  • University of Michigan

2
Automated C-to-Gates Solution
  • SoC design: 10-100 Gops, 200 mW power budget
  • Low-level tools are ineffective
  • Automated accelerator synthesis for the whole
    application
  • Correct by construction
  • Increase designer productivity
  • Faster time to market

3
Streaming Applications
  • Data streaming through kernels
  • Kernels are tight loops
  • FIR, Viterbi, DCT
  • Coarse grain dataflow between kernels
  • Sub-blocks of images, network packets

4
System Schema Overview
[Figure: Kernels 1-5 grouped onto loop accelerators LA 1, LA 2, and LA 3;
tasks flow through the accelerators over time at the prescribed task
throughput]
5
Input Specification
  • System specification
    • Function with main input/output
    • Local arrays to pass data
    • Sequence of calls to kernels
    • Sequential C program
  • Kernel specification
    • Perfectly nested FOR loop
    • Wrapped inside a C function
    • All data accesses made explicit

  dct(char inp[8][8], char out[8][8])
  {
    char tmp1[8][8], tmp2[8][8];
    row_trans(inp, tmp1);
    col_trans(tmp1, tmp2);
    zigzag_trans(tmp2, out);
  }

  row_trans(char inp[8][8], char out[8][8])
  {
    for (i = 0; i < 8; i++)
      for (j = 0; j < 8; j++)
        ... inp[i][j] ... out[i][j] ...
  }

  col_trans(char inp[8][8], char out[8][8]) { ... }
  zigzag_trans(char inp[8][8], char out[8][8]) { ... }
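As a concrete illustration of the kernel rules above (perfectly nested FOR
loop, wrapped in a C function, all data accesses explicit), a minimal sketch
of one fully written-out kernel follows; the transpose-style body is an
assumed stand-in, not the actual row_trans computation from the benchmark.

  /* Hypothetical kernel body (illustrative stand-in, not from the slides):
   * a perfectly nested FOR loop inside a C function, with every array
   * access written out explicitly so the synthesis flow can see it. */
  void row_trans(char inp[8][8], char out[8][8])
  {
      int i, j;
      for (i = 0; i < 8; i++)
          for (j = 0; j < 8; j++)
              out[j][i] = inp[i][j];   /* explicit read and write per iteration */
  }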
6
System Level Decisions
  • Throughput of each LA → initiation interval (II)
  • Grouping of loops into a multifunction LA
  • More loops in a single LA → LA occupied for a
    longer time in the current task (sketched below)

[Figure: task pipeline at a throughput of 1 task / 200 cycles; LA 1 is
occupied for 200 cycles per task, overlapping with LA 2 and LA 3]
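A rough sketch of the arithmetic behind this decision (the helper and the
numbers are illustrative assumptions, not from the talk): a loop with
initiation interval II and trip count TC keeps its LA busy for roughly
II x TC cycles, so the loops grouped onto one LA must together fit within
the task-throughput budget.

  #include <stdio.h>

  /* Illustrative only: cycles an LA is busy per task, summed over the
   * loops mapped onto it (II * trip count for each loop). */
  static int la_occupancy(const int ii[], const int tc[], int nloops)
  {
      int cycles = 0;
      for (int i = 0; i < nloops; i++)
          cycles += ii[i] * tc[i];
      return cycles;
  }

  int main(void)
  {
      int ii[] = {2, 3};     /* assumed IIs of two loops grouped on one LA */
      int tc[] = {64, 32};   /* assumed trip counts */
      int budget = 200;      /* target: 1 task per 200 cycles */

      int busy = la_occupancy(ii, tc, 2);
      printf("LA busy %d cycles per task (budget %d cycles)\n", busy, budget);
      return 0;
  }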
7
System Decisions (contd.)
  • Cost of SRAM buffers for intermediate arrays
  • More buffers → more task overlap → higher
    performance (sketched below)

[Figure: double buffering of tmp1; while one copy of the tmp1 buffer is in
use by LA 2, the adjacent task uses a different copy]
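A minimal sketch of the buffer-rotation idea, assuming a simple ping-pong
scheme with two copies of the intermediate array (the helper name and the
fixed factor of two are illustrative, not from the talk):

  void row_trans(char inp[8][8], char out[8][8]);
  void col_trans(char inp[8][8], char out[8][8]);

  /* Illustrative ping-pong buffering: adjacent tasks use different copies
   * of tmp1, so the producer of task t+1 can fill one copy while the
   * consumer of task t still reads the other. */
  #define NBUF 2

  static char tmp1[NBUF][8][8];              /* two copies of the tmp1 buffer */

  void run_task(int task_id, char inp[8][8], char out[8][8])
  {
      char (*buf)[8] = tmp1[task_id % NBUF]; /* even tasks: copy 0, odd: copy 1 */

      row_trans(inp, buf);                   /* producer fills this task's copy */
      col_trans(buf, out);                   /* consumer reads the same copy; the
                                                other copy belongs to the adjacent task */
  }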
8
Case Study: Simple Benchmark
[Figure: loop graph of the benchmark (trip count TC = 256) and its mapping
onto loop accelerators, LA 1 among them]
9
Prescribed Throughput Accelerators
  • Traditional behavioral synthesis
    • Directly translates C operators into gates
  • Our approach: application-centric architectures
    • Achieve a fixed throughput
    • Maximize hardware sharing

[Figure: the application's operation graph is mapped onto a datapath,
turning the application into an application-centric architecture]
10
Loop Accelerator Template
  • Hardware realization of a modulo-scheduled loop
    (II bound sketched below)
  • Parameterized execution resources, storage, and
    connectivity
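The initiation interval such a template must support is bounded below by
resource constraints; the following is a minimal sketch of the standard
resource-constrained II bound from modulo scheduling (general textbook
arithmetic, not a Streamroller-specific formula).

  /* Illustrative: resource-constrained lower bound on the II.  For each FU
   * type, the II must be at least ceil(ops needing that type / FUs of that
   * type); the bound is the maximum over all types. */
  static int res_mii(const int ops_per_type[], const int fus_per_type[],
                     int ntypes)
  {
      int mii = 1;
      for (int t = 0; t < ntypes; t++) {
          int bound = (ops_per_type[t] + fus_per_type[t] - 1) / fus_per_type[t];
          if (bound > mii)
              mii = bound;
      }
      return mii;
  }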

11
Loop Accelerator Design Flow
[Figure: design flow; C code plus a performance (throughput) target drives
FU allocation, producing an abstract architecture of FUs and register files]
12
Multifunction Accelerator
  • Map multiple loops to a single accelerator
  • Improve hardware efficiency via reuse
  • Opportunities for sharing
    • Disjoint stages (loops 2, 3)
    • Pipeline slack (loops 4, 5)

[Figure: application loop graph; after Loop 1, a "Frame Type?" branch
selects between Loop 2 and Loop 3 (disjoint stages), followed by Loops 4
and 5 (pipeline slack)]
13
[Figure: Loop 1 and Loop 2 each go through a cost-sensitive modulo
scheduler to determine their FU requirements, and the resulting datapaths
are unioned into one multifunction accelerator]
  • 43% average savings over the sum of individual accelerators
  • Smart union is within 3% of the joint-scheduling solution
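A toy sketch of where the savings come from (illustrative helper and FU
categories, not the paper's cost model): when two loops share one
accelerator, each FU type only needs the larger of the two loops'
requirements, whereas separate accelerators need their sum.

  #include <stdio.h>

  #define NTYPES 3   /* e.g., adders, multipliers, memory ports (assumed) */

  /* Illustrative only: per-FU-type cost of a shared (union) accelerator
   * versus two separate accelerators (sum). */
  static void union_vs_sum(const int loop1[NTYPES], const int loop2[NTYPES])
  {
      for (int t = 0; t < NTYPES; t++) {
          int shared   = loop1[t] > loop2[t] ? loop1[t] : loop2[t];
          int separate = loop1[t] + loop2[t];
          printf("FU type %d: union needs %d FUs, separate need %d FUs\n",
                 t, shared, separate);
      }
  }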

14
Challenges: Throughput-Enabling Transformations
  • Algorithm-level pipeline retiming
  • Splitting loops based on tiling (sketched below)
  • Co-scheduling adjacent loops
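As a toy illustration of the tiling-based splitting item above (an assumed
example, not one of the benchmark loops): a critical loop's iteration space
is split into two tiles so each half can be placed on its own accelerator
stage and the halves can overlap across tasks.

  /* Illustrative only: Loop 2 split into Loop 2a and Loop 2b by tiling the
   * iteration space, so each half can map to its own accelerator stage. */
  void loop2(int *data, int n)   /* original critical loop */
  {
      for (int i = 0; i < n; i++)
          data[i] = data[i] * 3 + 1;
  }

  void loop2a(int *data, int n)  /* first tile: iterations [0, n/2) */
  {
      for (int i = 0; i < n / 2; i++)
          data[i] = data[i] * 3 + 1;
  }

  void loop2b(int *data, int n)  /* second tile: iterations [n/2, n) */
  {
      for (int i = n / 2; i < n; i++)
          data[i] = data[i] * 3 + 1;
  }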

[Figure: before/after loop graphs; the critical Loop 2 is split into
Loop 2a and Loop 2b, and Loops 3 and 4 are co-scheduled, shifting which
loop is critical]
15
Challenges: Programmable Loop Accelerator
  • Support bug fixes, evolving standards
  • Accelerate loops not known at design time
  • Minimize additional control overhead

[Figure: programmable loop accelerator datapath; FUs, a memory unit, and
local memory connected by an interconnect, driven by a control unit that
issues control signals and cycles with period II]
16
Challenges: Timing-Aware Synthesis
  • Technology scaling and increasing FU counts → rising
    interconnect cost and wire capacitance
  • Strategies to eliminate long wires
    • Preemptive: predict and prevent long wires
    • Reactive: use feedback from the floorplanner

[Figure: reactive strategy; a flip-flop is inserted on the long path
between FUs (FU1, FU2, FU3) and the loop is rescheduled with the added
latency]
17
Challenges: Adaptable Voltage/Frequency Levels
  • Allow voltage scaling beyond worst-case margins
  • Use shadow latches in the loop accelerator
  • Localized error detection
  • Control is predefined → simple error recovery

[Figure: a flip-flop (D, Q, CLK) paired with a shadow latch on a delayed
clock; a mismatch between the two raises an error signal, and extra queue
entries between FUs support recovery]
18
For More Information
  • Visit http://cccp.eecs.umich.edu