Title: Compiler-directed Synthesis of Programmable Loop Accelerators
1. Compiler-directed Synthesis of Programmable Loop Accelerators
- Kevin Fan, Hyunchul Park, Scott Mahlke
- September 25, 2004
- EDCEP Workshop
2. Loop Accelerators
- Hardware implementation of a critical loop nest
- Hardwired state machine
- Digital camera application: 1000x vs. Pentium III
- Multiple accelerators hooked up in a pipeline
- Loop accelerator vs. customized processor
  - 1 block of code vs. multiple blocks
  - Trivial control flow vs. handling generic branches
  - Traditionally state machine vs. instruction driven
3. Programmable Loop Accelerators
- Goals
  - Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
  - Post-programmable: allow changes to the application, to a degree
  - Use the compiler as an architecture synthesis tool
- But
  - Don't build a customized processor
  - Maintain ASIC-level efficiency
4. NPA (Nonprogrammable Accelerator) Synthesis in PICO
5. PICO Frontend
- Goals
  - Exploit loop-level parallelism
  - Map the loop to abstract hardware
  - Manage global memory bandwidth
- Steps
  - Tiling
  - Load/store elimination
  - Iteration mapping
  - Iteration scheduling
  - Virtual processor clustering
Original loop nest:

    for i = 1 to ni
      for j = 1 to nj
        y[i] += w[j] * x[i][j]

After tiling, load/store elimination, and iteration mapping:

    for jt = 1 to 100 step 10
      for t = 0 to 502
        for p = 0 to 1
          (i, j) = function of (t, p)
          if (i > 1) W[t][p] = W[t-5][p]
          else       W[t][p] = w[jt+j]
          if (i > 1 && j < bj) X[t][p] = X[t-4][p+1]
          else                 X[t][p] = x[i][jt+j]
          Y[t][p] += W[t][p] * X[t][p]
6. PICO Backend
- Resource allocation (II, operation graph)
- Synthesize a machine description for a fake, fully connected processor with the allocated resources
7. Reduced VLIW Processor after Modulo Scheduling
8. Data/control-path Synthesis → NPA
9. PICO Methodology: Why It Works
- Systematic design methodology
  1. Parameterized meta-architecture: all NPAs have the same general organization
  2. Performance/throughput is an input
  3. Abstract architecture: we know how to build compilers for this
  4. Mapping mechanism: determine architecture specifics from the schedule for the abstract architecture
10. Direct Generalization of PICO?
- Programmability would require full interconnect between elements
- Back to the meta-architecture!
  - Generalize connectivity to enable post-programmability
  - But stylize it
11. Programmable Loop Accelerator Design Strategy
- Compile for a partially defined architecture
  - Build long-distance communication into the schedule
  - Limit global communication bandwidth
- Proposed meta-architecture
  - Multi-cluster VLIW
  - Explicit inter-cluster transfers (varying latency/bandwidth)
  - Intra-cluster communication is complete
  - Hardware partially defined: expensive units
12. Programmable Loop Accelerator Schema
[Schema figure: accelerator datapath with FUs, shift registers, intra-cluster communication, stream units and stream buffers, SRAM, DRAM, and a control unit; clusters connect through an inter-cluster register file, and accelerators compose into a pipeline of tiled or clustered accelerators.]
13. Flow Diagram
[Flow diagram: assembly code and II → FU allocation (number of clusters, expensive FUs) → partitioning (cheap FUs, FUs assigned to clusters) → modulo scheduling (shift register depth, width, and porting; inter-cluster bandwidth) → loop accelerator.]
14. Sobel Kernel
    for (i = 0; i < N1; i++) {
      for (j = 0; j < N2; j++) {
        int t00, t01, t02, t10, t12, t20, t21, t22;
        int e, e1, e2, e12, e22, tmp;
        t00 = x[i][j];     t01 = x[i][j+1];     t02 = x[i][j+2];
        t10 = x[i+1][j];                        t12 = x[i+1][j+2];
        t20 = x[i+2][j];   t21 = x[i+2][j+1];   t22 = x[i+2][j+2];
        e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
        e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
        e12 = e1 * e1;
        e22 = e2 * e2;
        e = e12 + e22;
        if (e > threshold) tmp = 1; else tmp = 0;
        edge[i][j] = tmp;
      }
    }
15. FU Allocation
- Determine the number of clusters
- Determine the number of expensive FUs
  - MPY, DIV, memory
- Sobel with II=4 (see the sketch below)
  - 41 ops → 3 clusters
  - 2 MPY ops → 1 multiplier
  - 9 memory ops → 3 memory units
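The arithmetic behind these numbers is a ceiling division of operation counts by the II. A minimal C sketch of that calculation, assuming one operation per FU per cycle and the 4-FUs-per-cluster figure from the test-case machines; ceil_div is a helper introduced here, not part of the original flow:

    #include <stdio.h>

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }

    int main(void) {
        const int II = 4;                 /* initiation interval */
        const int total_ops = 41;         /* all ops in the Sobel loop body */
        const int mpy_ops = 2, mem_ops = 9;
        const int fus_per_cluster = 4;    /* assumed, matches the test-case machines */

        int fus      = ceil_div(total_ops, II);         /* 11 FUs needed   */
        int clusters = ceil_div(fus, fus_per_cluster);  /* -> 3 clusters    */
        int mpys     = ceil_div(mpy_ops, II);           /* -> 1 multiplier  */
        int mems     = ceil_div(mem_ops, II);           /* -> 3 memory units */

        printf("clusters=%d, multipliers=%d, memory units=%d\n", clusters, mpys, mems);
        return 0;
    }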
16. Partitioning
- Multi-level approach consists of two phases
  - Coarsening
  - Refinement
- Minimize inter-cluster communication
- Load balance
  - Max of 4 × II operations per cluster
- Take FU allocation into account
  - Restricted number of expensive units
  - Number of cheap units (ADD, logic) determined from the partition
17. Coarsening
- Group highly related operations together
- Pair operations together at each step (see the sketch below)
- Forces the partitioner to consider several operations as a single unit
- Coarsening a Sobel subgraph into 2 groups
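One standard way to implement the pairing step is greedy heavy-edge matching on the dataflow graph. The sketch below illustrates that idea on a hypothetical 6-operation subgraph with made-up edge weights; it is not necessarily the exact heuristic used by the partitioner:

    #include <stdio.h>

    #define N 6   /* number of operations in the hypothetical subgraph */

    int main(void) {
        /* weight[u][v] = strength of the dataflow relation between ops u and v */
        int weight[N][N] = {{0}};
        int match[N];

        weight[0][1] = weight[1][0] = 3;   /* e.g., a LOAD feeding an ADD */
        weight[1][2] = weight[2][1] = 2;
        weight[3][4] = weight[4][3] = 3;
        weight[4][5] = weight[5][4] = 1;

        for (int u = 0; u < N; u++) match[u] = -1;

        /* Pair each unmatched op with its most heavily connected unmatched neighbor. */
        for (int u = 0; u < N; u++) {
            if (match[u] != -1) continue;
            int best = -1;
            for (int v = 0; v < N; v++)
                if (v != u && match[v] == -1 && weight[u][v] > 0 &&
                    (best == -1 || weight[u][v] > weight[u][best]))
                    best = v;
            if (best != -1) { match[u] = best; match[best] = u; }
        }

        for (int u = 0; u < N; u++)
            if (match[u] > u)
                printf("group: op%d + op%d\n", u, match[u]);
        return 0;
    }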
18. Refinement
- Move operations between clusters
- Good moves (scored as sketched below)
  - Reduce inter-cluster communication
  - Improve load balance
  - Reduce hardware cost
    - Reduce the number of expensive units to meet the limit
    - Collect operations of similar bitwidth together
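A move is commonly scored KL/FM-style by how many inter-cluster edges it removes versus creates, subject to the 4 × II load-balance cap from the previous slide. The sketch below shows that gain computation under those assumptions; the data structures and the tiny example graph are hypothetical:

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_OPS 64

    /* Gain of moving op to cluster dst: +1 for every cut edge the move removes,
     * -1 for every new cut edge it creates. */
    static int move_gain(int op, int dst, const int cluster[],
                         bool adj[][MAX_OPS], int n_ops)
    {
        int gain = 0;
        for (int v = 0; v < n_ops; v++) {
            if (v == op || !adj[op][v]) continue;
            if (cluster[v] == cluster[op]) gain--;    /* edge becomes cut   */
            else if (cluster[v] == dst)    gain++;    /* cut edge goes away */
        }
        return gain;
    }

    /* Load-balance check: at most 4*II operations per cluster
     * (4 FUs per cluster, each executing II operations per iteration). */
    static bool fits(int dst, const int cluster[], int n_ops, int II)
    {
        int count = 1;                                /* the op being moved in */
        for (int v = 0; v < n_ops; v++)
            if (cluster[v] == dst) count++;
        return count <= 4 * II;
    }

    int main(void) {
        bool adj[MAX_OPS][MAX_OPS] = {{false}};
        int cluster[MAX_OPS] = {0};
        int n_ops = 3, II = 4;

        adj[0][1] = adj[1][0] = true;                 /* op0 -- op1 */
        adj[1][2] = adj[2][1] = true;                 /* op1 -- op2 */
        cluster[0] = 0; cluster[1] = 0; cluster[2] = 1;

        if (fits(1, cluster, n_ops, II))
            printf("gain of moving op1 to cluster 1: %d\n",
                   move_gain(1, 1, cluster, adj, n_ops));
        return 0;
    }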
19. Partitioning Example
- From Sobel, II=4
- Place MPYs together
- Place each tree of ADD-LOAD-ADDs together
- Cuts 6 edges
20. Modulo Scheduling
- Determines shift register width, depth, and number of read ports
- Sobel, II=4 (see the sketch after the table)

  FU   Cycle   Max result lifetime   Req'd depth   Req'd ports
  0    2       4                     4             1
  1    1       2                     4             2
  1    3       4                     4             2
  2    4       1                     1             1
  3    0       -                     1             1
  3    3       1                     1             1
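The depth column follows from the lifetimes: an FU's shift register must hold its longest-lived result, with a minimum of one stage. A small sketch of that reading of the table, using the Sobel II=4 lifetimes above (the port counts come straight from the table rather than being derived here):

    #include <stdio.h>

    #define N_FUS 4

    int main(void) {
        /* lifetime[fu][k]: lifetime in cycles of the k-th result produced by
           that FU per iteration; 0 = no live result. Values are the Sobel
           II=4 numbers from the table. */
        int lifetime[N_FUS][2] = {
            {4, 0},   /* FU0: one result, live for 4 cycles   */
            {2, 4},   /* FU1: two results, live for 2 and 4   */
            {1, 0},   /* FU2: one result, live for 1 cycle    */
            {0, 1},   /* FU3: first op produces no live value */
        };

        for (int fu = 0; fu < N_FUS; fu++) {
            int depth = 1;                        /* at least one stage */
            for (int k = 0; k < 2; k++)
                if (lifetime[fu][k] > depth)
                    depth = lifetime[fu][k];
            printf("FU%d: required shift register depth = %d\n", fu, depth);
        }
        return 0;
    }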
[Figure: modulo schedule reservation table for FU0-FU3 over cycles 0-3, showing the placement of the ADD and LD operations.]
21. Test Cases
- Sobel and fsed kernels, II=4 designs
- Each machine has 4 clusters with 4 FUs per cluster
[Figure: FU composition of the sobel and fsed machines; each has 4 clusters of 4 FUs, mixing generic units with units labeled M, B, and << (shifters).]
22. Cross-Compile Results
- Computation is localized
  - sobel: 1.5 moves/cycle
  - fsed: 1 move/cycle
- Cross compile
  - Can still achieve II=4
  - More inter-cluster communication
  - May require more units
  - sobel on the fsed machine: 2 moves/cycle
  - fsed on the sobel machine: 3 moves/cycle
23. Concluding Remarks
- Programmable loop accelerator design strategy
  - Meta-architecture with stylized interconnect
  - Systematic compiler-directed design flow
- Costs of programmability
  - Interconnect, inter-cluster communication
  - Control micro-instructions are necessary
- Just scratching the surface of this work
- For more, see the CCCP group webpage
  - http://cccp.eecs.umich.edu