Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures

Slides: 26

Provided by: fank

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes

1
Edge-centric Modulo Scheduling for
Coarse-Grained Reconfigurable Architectures

Hyunchul Park, Kevin Fan, Scott Mahlke,
Taewook Oh, Heeseok Kim, Hong-seok Kim
University of Michigan
Samsung Advanced Institute of Technology

October 28, 2008
2
Coarse-Grained Reconfigurable Architecture (CGRA)

Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration

3
CGRA: Attractive Alternative to ASICs

Suitable for running multimedia applications on future embedded systems
High throughput, low power consumption, high flexibility
MorphoSys: 8x8 array with RISC processor
SiliconHive: hierarchical systolic array
ADRES: 4x4 array with tightly coupled VLIW

[Figure: MorphoSys, SiliconHive, and ADRES, with reported figures: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW]

4
Scheduling in CGRA

Sparse interconnect and distributed register files
No dedicated routing resources: FUs are used for routing
Need explicit routing of operands by the compiler

[Figure: conventional VLIW (FUs with a central RF) vs. CGRA (FUs with distributed RFs)]

5
Scheduling Difficulties

VLIW: routing is guaranteed by the central RF
CGRA: multiple possible routes
Compiler is responsible for finding routes
Routing can easily be blocked by other operations

[Figure: operand routing in a VLIW vs. in a CGRA]

6
Objective of This Work

Modulo scheduling technique for CGRAs
Exploit loop-level parallelism by overlapping the execution of iterations
Customized approach based on the characteristics of CGRAs
Achieve fast compile time and good performance
Huge scheduling space, distributed resources
A naïve approach can result in either a poor solution or a long compile time

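Overlapping iterations means that a resource used at cycle t by one iteration is used again every II cycles by later iterations, where II is the initiation interval. A common way to track this in modulo schedulers is a modulo reservation table. The sketch below is only an illustration: the class name, its methods, and the four-FU example are assumptions, not details from the presentation.

```python
class ModuloReservationTable:
    """Tracks which (FU, cycle mod II) slots are occupied (illustrative only)."""

    def __init__(self, num_fus, ii):
        self.num_fus = num_fus
        self.ii = ii
        self.slots = {}  # (fu, time % ii) -> operation name

    def is_free(self, fu, time):
        return (fu, time % self.ii) not in self.slots

    def reserve(self, fu, time, op):
        key = (fu, time % self.ii)
        if key in self.slots:
            raise ValueError(f"slot {key} already holds {self.slots[key]}")
        self.slots[key] = op


# Hypothetical example: 4 FUs, a new iteration starts every 2 cycles.
mrt = ModuloReservationTable(num_fus=4, ii=2)
mrt.reserve(fu=0, time=1, op="add")

# Cycle 3 on FU 0 collides with cycle 1 because 3 % 2 == 1:
# a later iteration executes "add" on FU 0 in that cycle.
conflict = not mrt.is_free(fu=0, time=3)
```

A smaller II gives higher throughput but leaves fewer distinct slots per FU, which is why routing on a CGRA competes directly with computation for the same table entries.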
7
Traditional Approach: Node-centric

[Figure: schedule grid (time 0-4 by FU 0-FU 4) with producers P1 and P2 placed and the candidate slots for consumer C numbered 0 to 10]

Operations are placed first, then routing is performed
Visit all candidate slots to find the solution
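As a rough illustration of the node-centric flow described above, the sketch below fixes a candidate slot for the consumer first and only then calls a router from each producer. The linear five-FU array, the one-hop-per-cycle `neighbors` rule, and all function names are assumptions made for this example; the presentation does not specify them.

```python
from collections import deque

NUM_FUS = 5  # assumed: a linear array of 5 FUs


def neighbors(fu):
    """Assumed connectivity: a value can stay on its FU or move one FU left/right."""
    return [f for f in (fu - 1, fu, fu + 1) if 0 <= f < NUM_FUS]


def route(src, dst, occupied):
    """Breadth-first search over free (time, fu) slots; each hop takes one cycle."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        slot = queue.popleft()
        if slot == dst:
            path = []
            while slot is not None:
                path.append(slot)
                slot = parent[slot]
            return path[::-1]
        t, fu = slot
        if t >= dst[0]:
            continue  # cannot reach an earlier or same-cycle slot
        for nxt in neighbors(fu):
            succ = (t + 1, nxt)
            if succ not in occupied and succ not in parent:
                parent[succ] = slot
                queue.append(succ)
    return None


def node_centric_place(producers, candidates, occupied):
    """Try each candidate slot in order; count how often the router is called."""
    calls = 0
    for cand in candidates:  # the placement decision comes first
        used = set(occupied)
        paths = []
        for prod in producers:
            calls += 1
            path = route(prod, cand, used)
            if path is None:
                break  # this candidate fails, move on to the next one
            paths.append(path)
            used.update(path[1:-1])  # intermediate slots now carry this value
        else:
            return cand, paths, calls
    return None, [], calls


# Hypothetical example: P1 on FU 0 and P2 on FU 4, both at time 0.
candidates = [(t, fu) for t in (1, 2, 3) for fu in range(NUM_FUS)]
slot, paths, calls = node_centric_place([(0, 0), (0, 4)], candidates, set())
```

In this example most routing calls are spent on candidate slots that one of the producers can never reach, which is the kind of wasted work the following slides point out.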
8
Node-centric: Inefficiency 1

[Figure: schedule grid (time 0-4 by FU 0-FU 4) showing P1, P2, and the numbered candidate slots 0 to 10 for C]

Routing is attempted to slots that are not reachable by edge P1 -> C
9
Node-centric: Inefficiency 2

[Figure: schedule grid (time 0-4 by FU 0-FU 4) showing P1, P2, and the numbered candidate slots 0 to 10 for C]

The same routing that was already performed is repeated
10
Our Approach: Edge-centric

[Figure: two schedule grids (time 0-4 by FU 0-FU 4) comparing node-centric and edge-centric scheduling of P1, P2, and C]

Start routing without placing the operation
Placement occurs during routing
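The idea of letting placement fall out of routing can be sketched as a single search that expands outward from the producer and stops at the first slot where the consumer may be placed. This is a minimal sketch under stated assumptions: the linear five-FU array, the `can_place` predicate, and the function names are hypothetical, and a plain breadth-first search stands in for the cost-guided router of the presentation.

```python
from collections import deque

NUM_FUS = 5  # assumed: a linear array of 5 FUs


def neighbors(fu):
    """Assumed connectivity: a value can stay on its FU or move one FU left/right."""
    return [f for f in (fu - 1, fu, fu + 1) if 0 <= f < NUM_FUS]


def route_and_place(src, can_place, occupied, max_time):
    """One routing call that starts at the producer slot `src`.

    The first free slot reached that satisfies `can_place` becomes the
    consumer's placement. Returns (placement, path) or (None, None).
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        slot = queue.popleft()
        if slot != src and can_place(slot):
            path = []
            cur = slot
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return slot, path[::-1]
        t, fu = slot
        if t == max_time:
            continue
        for nxt in neighbors(fu):
            succ = (t + 1, nxt)
            if succ not in occupied and succ not in parent:
                parent[succ] = slot
                queue.append(succ)
    return None, None


# Hypothetical example: the consumer can only execute on FU 2,
# and slot (1, 1) is already taken by another operation.
placement, path = route_and_place(
    src=(0, 0),
    can_place=lambda slot: slot[1] == 2,
    occupied={(1, 1)},
    max_time=4,
)
```

Because the search only ever visits slots that are reachable from the producer, no routing effort is spent on unreachable candidates, and the producer is routed once rather than once per candidate.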
11
Benefit 1: Fewer Routing Calls

[Figure: two schedule grids (time 0-4 by FU 0-FU 4), node-centric vs. edge-centric]

Node-centric: 11 routing calls for P1 -> C
Edge-centric: 1 routing call for P1 -> C
Reduce compile time with fewer routing calls
12
Benefit 2: Global View

[Figure: two schedule grids (time 0-4 by FU 0-FU 4), node-centric vs. edge-centric, with producer P, consumer C, and slots numbered 0 to 2]

Assume slot 0 is a precious resource (better to save it for later use)
Node-centric greedily picks slot 1
Edge-centric can avoid slot 0 by simply assigning it a high cost

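Steering a route away from a precious slot by giving it a high cost can be illustrated with a cost-guided search. The sketch below uses Dijkstra's algorithm as a stand-in router; the linear five-FU array, the cost values, and the function names are assumptions for the example, not the presentation's actual router.

```python
import heapq

NUM_FUS = 5  # assumed: a linear array of 5 FUs


def neighbors(fu):
    """Assumed connectivity: a value can stay on its FU or move one FU left/right."""
    return [f for f in (fu - 1, fu, fu + 1) if 0 <= f < NUM_FUS]


def cheapest_route(src, dst, slot_cost, default_cost=1):
    """Dijkstra over (time, fu) slots; entering a slot costs slot_cost[slot]."""
    best = {src: 0}
    parent = {src: None}
    heap = [(0, src)]
    while heap:
        cost, slot = heapq.heappop(heap)
        if slot == dst:
            path = []
            while slot is not None:
                path.append(slot)
                slot = parent[slot]
            return cost, path[::-1]
        if cost > best[slot]:
            continue  # stale heap entry
        t, fu = slot
        if t >= dst[0]:
            continue
        for nxt in neighbors(fu):
            succ = (t + 1, nxt)
            new_cost = cost + slot_cost.get(succ, default_cost)
            if new_cost < best.get(succ, float("inf")):
                best[succ] = new_cost
                parent[succ] = slot
                heapq.heappush(heap, (new_cost, succ))
    return None, None


# Hypothetical example: route from (0, 1) to (2, 1).
plain_cost, plain_path = cheapest_route((0, 1), (2, 1), slot_cost={})

# Mark slot (1, 0) as precious by giving it a high cost.
guided_cost, guided_path = cheapest_route((0, 1), (2, 1), slot_cost={(1, 0): 10})
```

With uniform costs the tie between equally short routes happens to be broken in favor of slot (1, 0); once that slot is made expensive, the router picks an equally short route that leaves it free.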
13
Edge-centric Modulo Scheduling

It's all about edges
The schedule is constructed by routing edges
Placement is integrated into the routing process
Global perspective for EMS
Scheduling order of edges: prioritize edges to determine the scheduling order
Routing optimization: develop a contention model for routing resources

14
1. Edge Prioritization

Focus on consumers
Simple edges / high-fanout edges
Height-based priority
Give high priority to high-fanout edges
Edges scheduled later will likely use extra resources
Extra resources in simple edges are just wasted
Extra resources in high-fanout edges can be helpful: other consumers can make use of them

15
Fanout Clustering

Our approach: the opposite
Give priority to simple edges
Operations connected by simple edges form a cluster
Schedule simple edges within a cluster
Schedule high-fanout edges when consumers are visited
17 of 81 loops in H.264 show better throughput
Only 1 shows worse throughput

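The clustering step above can be sketched with a union-find structure. This sketch assumes that a simple edge is one whose producer has exactly one consumer and that every other edge counts as high-fanout; the slides do not define the terms precisely, and the function name and example graph are hypothetical.

```python
from collections import defaultdict


def fanout_clusters(edges):
    """Group operations linked by simple edges (assumed: producer fanout == 1)."""
    fanout = defaultdict(int)
    for src, _ in edges:
        fanout[src] += 1

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    simple, high_fanout = [], []
    for src, dst in edges:
        find(src)
        find(dst)
        if fanout[src] == 1:
            simple.append((src, dst))
            parent[find(src)] = find(dst)  # merge the two clusters
        else:
            high_fanout.append((src, dst))

    clusters = defaultdict(set)
    for op in list(parent):
        clusters[find(op)].add(op)
    return sorted(sorted(c) for c in clusters.values()), simple, high_fanout


# Hypothetical DFG: a -> b -> c, c feeds both d and e, d -> f.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "f")]
clusters, simple, high_fanout = fanout_clusters(edges)
```

Here the chain a, b, c forms one cluster and d, f another; the two edges leaving c are left for later, when their consumers are visited.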
16
2. Routing Optimization

Routing is guided by a cost associated with each routing slot
Intelligent routing cost metrics are important
Minimize routing resources for the current edge
Static cost: fixed positive cost for each resource
Minimize routing resources for other edges to producers/consumers
Affinity cost: use common consumer information
Avoid routing failures for other edges
Probabilistic cost: predict future resource usage

routing cost = F(static cost, affinity cost, probabilistic cost)
17
Affinity Cost Heuristic

[Figure: schedule grid (time 0-3 by FU 0-FU 3) with operations A, B, and C, and slots labeled "Routing Cost 2" and "Routing Cost 0"]

Affinity cost: utilize common consumer information
Affinity value: how close the common consumer is in the DFG
Place operations with high affinity close to each other

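One way to measure "how close the common consumer is" is the number of DFG edges from two operations to their nearest shared downstream operation. The sketch below computes that distance; it is an assumed proxy for the affinity value, not the formula used in the presentation, and the function names and example graph are hypothetical.

```python
from collections import deque


def distances_from(op, consumers):
    """BFS distance (in DFG edges) from `op` to every operation downstream of it."""
    dist = {op: 0}
    queue = deque([op])
    while queue:
        cur = queue.popleft()
        for nxt in consumers.get(cur, []):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def common_consumer_distance(a, b, consumers):
    """Distance to the nearest common consumer of a and b, or None if there is none."""
    da = distances_from(a, consumers)
    db = distances_from(b, consumers)
    common = (set(da) & set(db)) - {a, b}
    if not common:
        return None
    return min(max(da[c], db[c]) for c in common)


# Hypothetical DFG: A and B both feed C; C and E both feed F; D feeds E.
consumers = {"A": ["C"], "B": ["C"], "C": ["F"], "D": ["E"], "E": ["F"]}

near = common_consumer_distance("A", "B", consumers)  # common consumer C
far = common_consumer_distance("A", "D", consumers)   # common consumer F
```

A smaller distance would correspond to a higher affinity, so A and B would be pulled closer together on the array than A and D.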
18
Probabilistic Cost Heuristic
[Figure: schedule grid (time 0-7 by FU 0-FU 4) with producers P1 and P2, consumers C1 and C2, and ST]

Three possible routes, all using same routing slots

19
Probabilistic Cost Heuristic

[Figure: schedule grid (time 0-7 by FU 0-FU 4) with producers P1 and P2, consumers C1 and C2, and ST]

Need to consider other unplaced edges/operations
Slots that might be used for routing P2 -> C2
Slots that might be used for placing ST

20
Probabilistic Cost Heuristic

[Figure: the same schedule grid annotated with slot usage probabilities of 0.33, 0.5, and 1.0; two slots are marked X]

Probabilities of future slot usage are calculated and guide the routing of P1 -> C1
The route in the middle is selected

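A simple way to arrive at per-slot probabilities like those in the figure is to enumerate the candidate routes of an edge that is not yet scheduled and count how many of them pass through each slot. The sketch below assumes every candidate route is equally likely; that assumption, the function name, and the example routes are illustrative and not taken from the presentation.

```python
from collections import Counter
from fractions import Fraction


def slot_usage_probability(candidate_routes):
    """Fraction of candidate routes that pass through each (time, fu) slot.

    Assumes all candidate routes are equally likely to be chosen.
    """
    counts = Counter(slot for route in candidate_routes for slot in set(route))
    total = len(candidate_routes)
    return {slot: Fraction(n, total) for slot, n in counts.items()}


# Hypothetical example: three candidate routes for an unscheduled edge.
# They differ in the first slot but share the last two.
routes = [
    [(1, 1), (2, 2), (3, 2)],
    [(1, 2), (2, 2), (3, 2)],
    [(1, 3), (2, 2), (3, 2)],
]
prob = slot_usage_probability(routes)

# Slots on every route are certain to be needed; the others only one time in three.
shared = prob[(2, 2)]
optional = prob[(1, 1)]
```

A router could then add a penalty proportional to these probabilities when routing a different edge, so that slots another edge is certain to need are avoided first.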
21
EMS System Flow
[Figure: flowchart with boxes DFG, CGRA, Preprocessing, Fanout clustering, Prioritize edges, Cost calculation, Schedule, Select target edge, Perform routing, Place operations, Route to others, Final schedule]

22
Experimental Setup

214 loops from highly optimized media applications
H.264, 3D graphics, AAC, MP3
Target architecture
4x4 heterogeneous CGRA (6 memory, 4 multiply)
Local RF for each PE
Mesh-plus interconnect: mesh + 2-hop connections
Compared to 3 other solutions
IMS: iterative modulo scheduling, no routing optimization
NMS: same heuristics as EMS, but in a node-centric way
DRESC: IMEC's simulated annealing

23
Results

Performance: normalized throughput of loops
Max throughput is determined by the ops in a loop and the available resources
Compile time for all 214 loops

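The bound that operations and resources place on throughput is commonly expressed as a resource-constrained minimum initiation interval. The sketch below computes it for the 4x4 array with 6 memory and 4 multiply units described in the setup; the loop's operation counts, the achieved II, and the definition of normalized throughput as minimum II divided by achieved II are assumptions for illustration, and recurrence constraints are ignored.

```python
import math


def resource_min_ii(total_ops, total_fus, ops_by_kind, units_by_kind):
    """Smallest II at which the available resources can issue all operations."""
    bounds = [math.ceil(total_ops / total_fus)]
    for kind, n_ops in ops_by_kind.items():
        bounds.append(math.ceil(n_ops / units_by_kind[kind]))
    return max(bounds)


# Hypothetical loop: 40 operations, 14 of them memory and 6 multiply.
min_ii = resource_min_ii(
    total_ops=40,
    total_fus=16,  # 4x4 array
    ops_by_kind={"memory": 14, "multiply": 6},
    units_by_kind={"memory": 6, "multiply": 4},
)

# Assumed definition: throughput relative to the best possible II.
achieved_ii = 4  # hypothetical scheduler result
normalized_throughput = min_ii / achieved_ii
```

In this example both the total operation count and the memory operations limit the loop to an II of 3, so a schedule with an II of 4 reaches 75% of the maximum throughput.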
24
Conclusion

EMS is a good match for scheduling in CGRA
Routing is more important than placement
Edge-centric approach allows fast compile time
18x speedup over simulated annealing
Intelligent routing cost metrics allow good performance
24% improvement over IMS, 98% of the performance of the existing solution

25
Questions?
