Title: Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures
1. Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures
- Hyunchul Park, Kevin Fan, Scott Mahlke,
- Taewook Oh, Heeseok Kim, Hong-seok Kim
- University of Michigan
- Samsung Advanced Institute of Technology
October 28, 2008
2. Coarse-Grained Reconfigurable Architecture (CGRA)
- Array of PEs connected in a mesh-like interconnect
- High throughput with a large number of resources
- Distributed hardware offers low cost/power consumption
- High flexibility with dynamic reconfiguration
3. CGRA: Attractive Alternative to ASICs
- Suitable for running multimedia applications for future embedded systems
- High throughput, low power consumption, high flexibility
- Morphosys: 8x8 array with RISC processor
- SiliconHive: hierarchical systolic array
- ADRES: 4x4 array with tightly coupled VLIW
[Figure: Morphosys, SiliconHive, and ADRES, annotated with: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW]
4. Scheduling in CGRA
- Sparse interconnect and distributed register files
- No dedicated routing resources: FUs are used for routing
- Need explicit routing of operands by compiler
[Figure: Conventional VLIW (FUs and RFs around a central RF) vs. CGRA (FUs with distributed RFs)]
5. Scheduling Difficulties
- VLIW: routing is guaranteed by central RF
- CGRA: multiple possible routes
- Compiler is responsible for finding routes
- Routing can easily fail because of other operations
[Figure: routing in VLIW vs. CGRA]
6. Objective of This Work
- Modulo scheduling technique for CGRAs
- Exploit loop-level parallelism by overlapping execution of iterations
- Customized approach based on characteristics of CGRAs
- Achieve fast compile time and good performance
- Huge scheduling space, distributed resources
- Naïve approach can result in either poor solution or long compile time
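As a rough illustration of what overlapping iterations means for resource usage, the sketch below models a modulo reservation table: with an initiation interval II, an operation scheduled at time t occupies its FU in row t mod II, because the same slot is reused by every overlapped iteration. The class and method names are hypothetical and not taken from the slides.

```python
class ModuloReservationTable:
    """Tracks FU occupancy modulo the initiation interval (II)."""

    def __init__(self, ii, num_fus):
        self.ii = ii
        # table[row][fu] holds the op occupying that FU in that modulo row.
        self.table = [[None] * num_fus for _ in range(ii)]

    def try_reserve(self, time, fu, op):
        """Reserve (time mod II, fu) for op; return False on a conflict."""
        row = time % self.ii
        if self.table[row][fu] is not None:
            return False
        self.table[row][fu] = op
        return True


mrt = ModuloReservationTable(ii=2, num_fus=2)
print(mrt.try_reserve(0, 0, "load"))   # True
print(mrt.try_reserve(2, 0, "store"))  # False: time 2 wraps to row 0 on FU 0
print(mrt.try_reserve(2, 1, "store"))  # True
```

A smaller II means more overlap and higher throughput, but fewer distinct rows in the table, so conflicts become more likely.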
7. Traditional Approach: Node-centric
[Figure: schedule grid, time 0-4 × FU 0-4, with producers P1 and P2 placed and candidate slots for consumer C numbered 0-10]
Operations are placed first, then routing is performed. Visit all candidate slots to find the solution.
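The place-then-route loop can be sketched as follows. This is a minimal model under stated assumptions, not the scheduler from the talk: the FUs form a 1-D array in which a value can stay on its FU or move to an adjacent FU each cycle, and `route` and `node_centric_place` are hypothetical names. The routing-call counter makes the cost of visiting every candidate slot visible.

```python
from collections import deque

NUM_FUS, NUM_TIMES = 5, 5


def neighbors(fu):
    """Assumed 1-D interconnect: an FU reaches itself and adjacent FUs."""
    return [f for f in (fu - 1, fu, fu + 1) if 0 <= f < NUM_FUS]


def route(src, dst, busy):
    """BFS from slot src to slot dst; each hop advances time by one cycle."""
    queue, seen = deque([(src, [src])]), {src}
    while queue:
        (t, fu), path = queue.popleft()
        if (t, fu) == dst:
            return path
        for f in neighbors(fu):
            nxt = (t + 1, f)
            if nxt[0] < NUM_TIMES and nxt not in seen and nxt not in busy:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None


def node_centric_place(producer_slot, candidates, busy):
    """Try each candidate consumer slot in turn; one routing call per slot."""
    calls = 0
    for cand in candidates:
        calls += 1
        path = route(producer_slot, cand, busy)
        if path is not None:
            return cand, path, calls
    return None, None, calls


slot, path, calls = node_centric_place((0, 4), [(1, 0), (1, 1), (2, 2)], set())
print(slot, calls)  # (2, 2) 3
```

In this example the first two candidates cannot be reached from the producer in time, yet each still costs a full routing call.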
8. Node-centric: Inefficiency 1
[Figure: schedule grid, time 0-4 × FU 0-4, with P1 and P2 placed and candidate slots for C numbered 0-10]
Routing of edge P1 → C is attempted even for candidate slots that are not reachable.
9. Node-centric: Inefficiency 2
[Figure: schedule grid, time 0-4 × FU 0-4, with P1 and P2 placed and candidate slots for C numbered 0-10]
The same routing that was already performed is repeated.
10. Our Approach: Edge-centric
[Figure: node-centric vs. edge-centric schedule grids, time 0-4 × FU 0-4, with producers P1 and P2 and consumer C]
Start routing without placing the operation. Placement occurs during routing.
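One way to picture routing without a pre-chosen destination is a cost-ordered search that expands outward from the producer and stops at the cheapest slot able to host the consumer. The sketch below uses Dijkstra's algorithm for this; the choice of Dijkstra, the 1-D neighbor model, and all names (`edge_centric_route`, `can_place`, `slot_cost`) are assumptions for illustration, not the method from the talk.

```python
import heapq


def edge_centric_route(producer_slot, can_place, slot_cost, num_fus, num_times):
    """Search outward from the producer; place the consumer where the
    cheapest route ends. slot_cost returns None for an occupied slot."""
    heap = [(0, producer_slot, [producer_slot])]
    best = {producer_slot: 0}
    while heap:
        cost, (t, fu), path = heapq.heappop(heap)
        if cost > best.get((t, fu), float("inf")):
            continue  # stale heap entry
        if (t, fu) != producer_slot and can_place((t, fu)):
            return (t, fu), path, cost  # placement happens here
        for f in (fu - 1, fu, fu + 1):  # assumed 1-D interconnect
            nxt = (t + 1, f)
            if not (0 <= f < num_fus and t + 1 < num_times):
                continue
            step = slot_cost(nxt)
            if step is None:
                continue
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None


only_fu2 = lambda slot: slot[1] == 2   # hypothetical: consumer runs only on FU 2
uniform = lambda slot: 1
print(edge_centric_route((0, 4), only_fu2, uniform, 5, 5))
# ((2, 2), [(0, 4), (1, 3), (2, 2)], 2)

precious = lambda slot: 10 if slot == (2, 2) else 1
print(edge_centric_route((0, 4), only_fu2, precious, 5, 5))
# ((3, 2), [(0, 4), (1, 3), (2, 3), (3, 2)], 3)
```

A single call finds both the route and the placement, and raising the cost of one slot is enough to steer the consumer away from it.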
11. Benefit 1: Fewer Routing Calls
[Figure: node-centric vs. edge-centric schedule grids, time 0-4 × FU 0-4, with producers P1 and P2 and consumer C]
- Node-centric: 11 routing calls for P1 → C
- Edge-centric: 1 routing call for P1 → C
Reduce compile time with fewer routing calls.
12. Benefit 2: Global View
[Figure: node-centric vs. edge-centric schedule grids, time 0-4 × FU 0-4, with producer P, consumer C, and slots numbered 0-2]
- Assume slot 0 is a precious resource (better to save it for later use)
- Node-centric greedily picks slot 1
- Edge-centric can avoid slot 0 by simply assigning a high cost
13. Edge-centric Modulo Scheduling
- It's all about edges
- Scheduling is constructed by routing edges
- Placement is integrated into routing process
- Global perspective for EMS
- Scheduling order of edges
- Prioritize edges to determine scheduling order
- Routing optimization
- Develop contention model for routing resources
14. 1) Edge Prioritization
- Focus on consumers
- Simple edges / High fanout edges
- Height-based priority
- Give high priority to high fanout edges
- Edges scheduled later will likely use extra resources
- Extra resources in simple edges are just being wasted
- Extra resources in high-fanout edges can be helpful
- Other consumers can make use of those
15. Fanout Clustering
- Our approach: the opposite
- Give priority to simple edges
- Operations connected by simple edges form a cluster
- Schedule simple edges within a cluster
- Schedule high-fanout edges when consumers are visited
- 17 of 81 loops in H.264 show better throughput
- Only 1 shows worse throughput
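The clustering step can be sketched with a union-find over the DFG. The sketch assumes that an edge counts as "simple" when its producer has a single consumer and as "high-fanout" when the producer has two or more; that threshold and the function name `cluster_by_simple_edges` are assumptions for illustration.

```python
from collections import defaultdict


def cluster_by_simple_edges(edges, fanout_threshold=2):
    """Split edges into simple/high-fanout and group ops joined by simple edges."""
    consumers = defaultdict(list)
    for src, dst in edges:
        consumers[src].append(dst)

    # Every op starts in its own cluster.
    parent = {op: op for edge in edges for op in edge}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    simple, high_fanout = [], []
    for src, dst in edges:
        if len(consumers[src]) < fanout_threshold:
            simple.append((src, dst))
            parent[find(src)] = find(dst)  # merge the two clusters
        else:
            high_fanout.append((src, dst))

    clusters = defaultdict(set)
    for op in parent:
        clusters[find(op)].add(op)
    return simple, high_fanout, sorted(sorted(c) for c in clusters.values())


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "f"), ("e", "g")]
simple, high_fanout, clusters = cluster_by_simple_edges(edges)
print(clusters)     # [['a', 'b', 'c'], ['d', 'f'], ['e', 'g']]
print(high_fanout)  # [('c', 'd'), ('c', 'e')]
```

Here the two edges leaving `c` are deferred, and each would be routed when its consumer's cluster is visited.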
16. 2) Routing Optimization
- Routing is guided by cost associated with each routing slot
- Intelligent routing cost metrics are important
- Minimize routing resources for current edge
- Static cost: fixed positive cost for each resource
- Minimize routing resources for other edges to prod/cons
- Affinity cost: use common consumer information
- Avoid routing failures for other edges
- Probabilistic cost: predict future resource usage
routing cost = F(static cost, affinity cost, probabilistic cost)
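The slides do not give the form of F. Purely as an illustration, the sketch below assumes a weighted sum of the three components; the weights and the name `routing_cost` are hypothetical.

```python
def routing_cost(static_cost, affinity_cost, probabilistic_cost,
                 w_static=1.0, w_affinity=1.0, w_prob=1.0):
    """Assumed form of F: a weighted sum of the three cost components."""
    return (w_static * static_cost
            + w_affinity * affinity_cost
            + w_prob * probabilistic_cost)


# Hypothetical candidate slots with (static, affinity, probabilistic) costs.
candidates = {
    "slot_a": routing_cost(1, 0, 1.0),
    "slot_b": routing_cost(1, 2, 0.0),
}
best = min(candidates, key=candidates.get)
print(best, candidates[best])  # slot_a 2.0
```

Any monotone combination would serve the same role: the router only needs a single number per slot to compare routes.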
17. Affinity Cost Heuristic
[Figure: schedule grid, time 0-3 × FU 0-3, with operations A, B, and C; candidate FUs annotated with routing cost 2 and routing cost 0]
- Affinity cost: utilize common consumer information
- Affinity value: how close common consumer is in DFG
- Place operations with high affinity close to each other
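The slides do not spell out how the affinity value is computed. One plausible reading of "how close the common consumer is in the DFG" is the hop distance to the nearest operation that both operations feed, sketched below. The distance metric and the function names are assumptions; a smaller distance would correspond to a higher affinity.

```python
from collections import deque


def descendant_depths(dfg, op):
    """BFS depths of all ops reachable from op in the DFG (adjacency dict)."""
    depths, queue = {}, deque([(op, 0)])
    while queue:
        node, d = queue.popleft()
        for succ in dfg.get(node, []):
            if succ not in depths:
                depths[succ] = d + 1
                queue.append((succ, d + 1))
    return depths


def common_consumer_distance(dfg, a, b):
    """Hops until a and b first share a consumer, or None if they never do."""
    da, db = descendant_depths(dfg, a), descendant_depths(dfg, b)
    common = set(da) & set(db)
    if not common:
        return None
    return min(max(da[c], db[c]) for c in common)


dfg = {"A": ["C"], "B": ["C"], "C": ["E"], "D": ["E"]}
print(common_consumer_distance(dfg, "A", "B"))  # 1
print(common_consumer_distance(dfg, "A", "D"))  # 2
```

Under this reading, A and B (distance 1) would be steered toward nearby FUs more strongly than A and D (distance 2).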
18. Probabilistic Cost Heuristic
[Figure: schedule grid, time 0-7 × FU 0-4, with P1 and P2 placed and C1, C2, and ST still to be scheduled]
Three possible routes, all using the same routing slots.
19. Probabilistic Cost Heuristic
[Figure: schedule grid, time 0-7 × FU 0-4, with P1 and P2 placed and C1, C2, and ST still to be scheduled]
Need to consider other unplaced edges/operations:
- Slots that might be used for routing P2 → C2
- Slots that might be used for placing ST
20. Probabilistic Cost Heuristic
[Figure: schedule grid, time 0-7 × FU 0-4, annotated with slot usage probabilities: 0.33 at times 0 and 1; 1.0 at times 2, 3, and 4, plus 0.33 at time 3; 0.5 and 0.5 at times 5 and 6. Two slots are marked X.]
Probabilities of future slot usage are calculated and guide the routing of P1 → C1. The route in the middle is selected.
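Values such as 0.33, 0.5, and 1.0 are what one would get by treating every candidate route of an unplaced edge (or every candidate slot of an unplaced operation) as equally likely. The sketch below computes slot usage probabilities under that assumption; the equal-likelihood model, the example routes, and the function name are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def slot_usage_probability(candidate_routes):
    """Probability that each slot is used, if all routes are equally likely."""
    counts = Counter(slot for route in candidate_routes for slot in set(route))
    total = len(candidate_routes)
    return {slot: Fraction(n, total) for slot, n in counts.items()}


# Three hypothetical routes, as (time, FU) slots, for one unplaced edge.
routes = [
    [(0, 0), (1, 1), (2, 2)],
    [(0, 1), (1, 1), (2, 2)],
    [(0, 2), (1, 2), (2, 2)],
]
probs = slot_usage_probability(routes)
print(probs[(2, 2)])  # 1
print(probs[(1, 1)])  # 2/3
print(probs[(0, 0)])  # 1/3
```

A router could then add a cost proportional to these probabilities, so that slots other edges are likely to need become expensive for the edge being routed.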
21. EMS System Flow
[Figure: EMS flow diagram. Inputs: DFG, CGRA. Stages: Preprocessing, Schedule. Steps shown: fanout clustering, prioritize edges, cost calculation, select target edge, perform routing, place operations, route to others. Output: final schedule.]
22. Experimental Setup
- 214 loops from highly optimized media applications
- H.264, 3D graphics, AAC, MP3
- Target architecture
- 4x4 heterogeneous CGRA (6 memory, 4 multiply)
- Local RF for each PE
- Mesh-plus interconnect: mesh + 2-hop connections
- Compared to 3 other solutions
- IMS: iterative modulo scheduling, no routing optimization
- NMS: same heuristics as EMS, but in a node-centric way
- DRESC: IMEC's simulated annealing
23. Results
- Performance: normalized throughput of loops
- Max throughput is determined by ops in a loop and resources
- Compile time for all 214 loops
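A common way to bound throughput from operation counts and resources is the resource-constrained minimum initiation interval. The sketch below assumes that this is the bound meant here and that normalized throughput is the ratio of this bound to the achieved II; both are assumptions, recurrence constraints are ignored, and the operation counts are made up for illustration.

```python
def resource_min_ii(op_counts, unit_counts):
    """Smallest II that the resources allow, taken over all resource classes."""
    # -(-n // m) is integer ceiling division.
    return max(-(-op_counts[k] // unit_counts[k]) for k in op_counts)


def normalized_throughput(min_ii, achieved_ii):
    """1.0 means the schedule reaches the resource bound."""
    return min_ii / achieved_ii


# 4x4 CGRA: 16 FUs in total, 6 memory-capable, 4 multiply-capable.
units = {"any": 16, "memory": 6, "multiply": 4}
ops = {"any": 40, "memory": 14, "multiply": 5}  # hypothetical loop
mii = resource_min_ii(ops, units)
print(mii)                            # 3
print(normalized_throughput(mii, 4))  # 0.75
```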
24. Conclusion
- EMS is a good match for scheduling in CGRA
- Routing is more important than placement
- Edge-centric approach allows fast compile time
- 18x speedup over simulated annealing
- Intelligent routing cost metrics allow good performance
- 24% improvement over IMS, 98% of the performance of the existing solution
25. Questions?