Title: Compiler-directed Synthesis of Programmable Loop Accelerators
1. Compiler-directed Synthesis of Programmable Loop Accelerators
- Kevin Fan, Hyunchul Park, Scott Mahlke
- September 25, 2004
- EDCEP Workshop
2. Loop Accelerators
- Hardware implementation of a critical loop nest
- Hardwired state machine
- Digital camera application: 1000x vs. Pentium III
- Multiple accelerators hooked up in a pipeline
- Loop accelerator vs. customized processor
  - 1 block of code vs. multiple blocks
  - Trivial control flow vs. handling generic branches
  - Traditionally state machine vs. instruction driven
3. Programmable Loop Accelerators
- Goals
  - Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
  - Post-programmable: allow changes to the application, to a degree
  - Use the compiler as an architecture synthesis tool
- But
  - Don't build a customized processor
  - Maintain ASIC-level efficiency
4. NPA (Nonprogrammable Accelerator) Synthesis in PICO
5. PICO Frontend
- Goals
  - Exploit loop-level parallelism
  - Map the loop to abstract hardware
  - Manage global memory bandwidth
- Steps
  - Tiling
  - Load/store elimination
  - Iteration mapping
  - Iteration scheduling
  - Virtual processor clustering
Original loop nest:

    for i = 1 to ni
      for j = 1 to nj
        y[i] += w[j] * x[i][j]

After tiling, load/store elimination, and iteration mapping:

    for jt = 1 to 100 step 10
      for t = 0 to 502
        for p = 0 to 1
          (i, j) = function of (t, p)
          if (i > 1) W[t][p] = W[t-5][p]
          else       W[t][p] = w[jt+j]
          if (i > 1 && j < bj) X[t][p] = X[t-4][p+1]
          else                 X[t][p] = x[i][jt+j]
          Y[t][p] += W[t][p] * X[t][p]
6. PICO Backend
- Resource allocation (II, operation graph)
- Synthesize a machine description for a fake, fully connected processor with the allocated resources
7. Reduced VLIW Processor after Modulo Scheduling
8. Data/control-path Synthesis → NPA
9. PICO Methodology: Why It Works
- Systematic design methodology
  1. Parameterized meta-architecture: all NPAs have the same general organization
  2. Performance/throughput is an input
  3. Abstract architecture: we know how to build compilers for this
  4. Mapping mechanism: determine architecture specifics from the schedule for the abstract architecture
10. Direct Generalization of PICO?
- Programmability would require full interconnect between elements
- Back to the meta-architecture!
  - Generalize connectivity to enable post-programmability
  - But stylize it
11. Programmable Loop Accelerator Design Strategy
- Compile for a partially defined architecture
  - Build long-distance communication into the schedule
  - Limit global communication bandwidth
- Proposed meta-architecture
  - Multi-cluster VLIW
  - Explicit inter-cluster transfers (varying latency/bandwidth)
  - Intra-cluster communication is complete
  - Hardware partially defined: expensive units
12. Programmable Loop Accelerator Schema
[Schema figure: accelerator datapath with FUs, shift registers, intra-cluster communication, stream units and stream buffers, SRAM, DRAM, and a control unit; clusters connect through an inter-cluster register file, and accelerators compose into a pipeline of tiled or clustered accelerators.]
13. Flow Diagram
[Flow diagram: assembly code and II → FU allocation (number of clusters, expensive FUs) → partitioning (cheap FUs, FUs assigned to clusters) → modulo scheduling (shift register depth, width, and porting; inter-cluster bandwidth) → loop accelerator.]
14. Sobel Kernel
    for (i = 0; i < N1; i++) {
      for (j = 0; j < N2; j++) {
        int t00, t01, t02, t10, t12, t20, t21, t22;
        int e, e1, e2, e12, e22, tmp;
        t00 = x[i][j];     t01 = x[i][j+1];     t02 = x[i][j+2];
        t10 = x[i+1][j];                        t12 = x[i+1][j+2];
        t20 = x[i+2][j];   t21 = x[i+2][j+1];   t22 = x[i+2][j+2];
        e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
        e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
        e12 = e1 * e1;
        e22 = e2 * e2;
        e = e12 + e22;
        if (e > threshold) tmp = 1; else tmp = 0;
        edge[i][j] = tmp;
      }
    }
15. FU Allocation
- Determine the number of clusters
- Determine the number of expensive FUs
  - MPY, DIV, memory
- Sobel with II=4 (see the sketch below)
  - 41 ops → 3 clusters
  - 2 MPY ops → 1 multiplier
  - 9 memory ops → 3 memory units
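The arithmetic behind these numbers is a ceiling division of operation counts by the II. A minimal C sketch of that calculation, assuming one operation per FU per cycle and the 4-FUs-per-cluster figure from the test-case machines; ceil_div is a helper introduced here, not part of the original flow:

    #include <stdio.h>

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }

    int main(void) {
        const int II = 4;                 /* initiation interval */
        const int total_ops = 41;         /* all ops in the Sobel loop body */
        const int mpy_ops = 2, mem_ops = 9;
        const int fus_per_cluster = 4;    /* assumed, matches the test-case machines */

        int fus      = ceil_div(total_ops, II);         /* 11 FUs needed   */
        int clusters = ceil_div(fus, fus_per_cluster);  /* -> 3 clusters    */
        int mpys     = ceil_div(mpy_ops, II);           /* -> 1 multiplier  */
        int mems     = ceil_div(mem_ops, II);           /* -> 3 memory units */

        printf("clusters=%d, multipliers=%d, memory units=%d\n", clusters, mpys, mems);
        return 0;
    }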
16. Partitioning
- Multi-level approach consists of two phases
  - Coarsening
  - Refinement
- Minimize inter-cluster communication
- Load balance
  - Max of 4 × II operations per cluster
- Take FU allocation into account
  - Restricted number of expensive units
  - Number of cheap units (ADD, logic) determined from the partition
17. Coarsening
- Group highly related operations together
- Pair operations together at each step (see the sketch below)
- Forces the partitioner to consider several operations as a single unit
- Coarsening a Sobel subgraph into 2 groups
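One standard way to implement the pairing step is greedy heavy-edge matching on the dataflow graph. The sketch below illustrates that idea on a hypothetical 6-operation subgraph with made-up edge weights; it is not necessarily the exact heuristic used by the partitioner:

    #include <stdio.h>

    #define N 6   /* number of operations in the hypothetical subgraph */

    int main(void) {
        /* weight[u][v] = strength of the dataflow relation between ops u and v */
        int weight[N][N] = {{0}};
        int match[N];

        weight[0][1] = weight[1][0] = 3;   /* e.g., a LOAD feeding an ADD */
        weight[1][2] = weight[2][1] = 2;
        weight[3][4] = weight[4][3] = 3;
        weight[4][5] = weight[5][4] = 1;

        for (int u = 0; u < N; u++) match[u] = -1;

        /* Pair each unmatched op with its most heavily connected unmatched neighbor. */
        for (int u = 0; u < N; u++) {
            if (match[u] != -1) continue;
            int best = -1;
            for (int v = 0; v < N; v++)
                if (v != u && match[v] == -1 && weight[u][v] > 0 &&
                    (best == -1 || weight[u][v] > weight[u][best]))
                    best = v;
            if (best != -1) { match[u] = best; match[best] = u; }
        }

        for (int u = 0; u < N; u++)
            if (match[u] > u)
                printf("group: op%d + op%d\n", u, match[u]);
        return 0;
    }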
18. Refinement
- Move operations between clusters
- Good moves (scored as sketched below)
  - Reduce inter-cluster communication
  - Improve load balance
  - Reduce hardware cost
    - Reduce the number of expensive units to meet the limit
    - Collect operations of similar bitwidth together
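A move is commonly scored KL/FM-style by how many inter-cluster edges it removes versus creates, subject to the 4 × II load-balance cap from the previous slide. The sketch below shows that gain computation under those assumptions; the data structures and the tiny example graph are hypothetical:

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_OPS 64

    /* Gain of moving op to cluster dst: +1 for every cut edge the move removes,
     * -1 for every new cut edge it creates. */
    static int move_gain(int op, int dst, const int cluster[],
                         bool adj[][MAX_OPS], int n_ops)
    {
        int gain = 0;
        for (int v = 0; v < n_ops; v++) {
            if (v == op || !adj[op][v]) continue;
            if (cluster[v] == cluster[op]) gain--;    /* edge becomes cut   */
            else if (cluster[v] == dst)    gain++;    /* cut edge goes away */
        }
        return gain;
    }

    /* Load-balance check: at most 4*II operations per cluster
     * (4 FUs per cluster, each executing II operations per iteration). */
    static bool fits(int dst, const int cluster[], int n_ops, int II)
    {
        int count = 1;                                /* the op being moved in */
        for (int v = 0; v < n_ops; v++)
            if (cluster[v] == dst) count++;
        return count <= 4 * II;
    }

    int main(void) {
        bool adj[MAX_OPS][MAX_OPS] = {{false}};
        int cluster[MAX_OPS] = {0};
        int n_ops = 3, II = 4;

        adj[0][1] = adj[1][0] = true;                 /* op0 -- op1 */
        adj[1][2] = adj[2][1] = true;                 /* op1 -- op2 */
        cluster[0] = 0; cluster[1] = 0; cluster[2] = 1;

        if (fits(1, cluster, n_ops, II))
            printf("gain of moving op1 to cluster 1: %d\n",
                   move_gain(1, 1, cluster, adj, n_ops));
        return 0;
    }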
19. Partitioning Example
- From Sobel, II=4
- Place MPYs together
- Place each tree of ADD-LOAD-ADDs together
- Cuts 6 edges
20. Modulo Scheduling
- Determines shift register width, depth, and number of read ports
- Sobel, II=4 (see the sketch after the table)

  FU   Cycle   Max result lifetime   Req'd depth   Req'd ports
  0    2       4                     4             1
  1    1       2                     4             2
  1    3       4                     4             2
  2    4       1                     1             1
  3    0       -                     1             1
  3    3       1                     1             1
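The depth column follows from the lifetimes: an FU's shift register must hold its longest-lived result, with a minimum of one stage. A small sketch of that reading of the table, using the Sobel II=4 lifetimes above (the port counts come straight from the table rather than being derived here):

    #include <stdio.h>

    #define N_FUS 4

    int main(void) {
        /* lifetime[fu][k]: lifetime in cycles of the k-th result produced by
           that FU per iteration; 0 = no live result. Values are the Sobel
           II=4 numbers from the table. */
        int lifetime[N_FUS][2] = {
            {4, 0},   /* FU0: one result, live for 4 cycles   */
            {2, 4},   /* FU1: two results, live for 2 and 4   */
            {1, 0},   /* FU2: one result, live for 1 cycle    */
            {0, 1},   /* FU3: first op produces no live value */
        };

        for (int fu = 0; fu < N_FUS; fu++) {
            int depth = 1;                        /* at least one stage */
            for (int k = 0; k < 2; k++)
                if (lifetime[fu][k] > depth)
                    depth = lifetime[fu][k];
            printf("FU%d: required shift register depth = %d\n", fu, depth);
        }
        return 0;
    }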
[Figure: modulo schedule reservation table for FU0-FU3 over cycles 0-3, showing the placement of the ADD and LD operations.]
21. Test Cases
- Sobel and fsed kernels, II=4 designs
- Each machine has 4 clusters with 4 FUs per cluster
[Figure: FU composition of the sobel and fsed machines; each has 4 clusters of 4 FUs, mixing generic units with units labeled M, B, and << (shifters).]
22. Cross-Compile Results
- Computation is localized
  - sobel: 1.5 moves/cycle
  - fsed: 1 move/cycle
- Cross compile
  - Can still achieve II=4
  - More inter-cluster communication
  - May require more units
  - sobel on the fsed machine: 2 moves/cycle
  - fsed on the sobel machine: 3 moves/cycle
23. Concluding Remarks
- Programmable loop accelerator design strategy
  - Meta-architecture with stylized interconnect
  - Systematic compiler-directed design flow
- Costs of programmability
  - Interconnect, inter-cluster communication
  - Control micro-instructions are necessary
- Just scratching the surface of this work
- For more, see the CCCP group webpage
  - http://cccp.eecs.umich.edu