Title: EECS 583 Lecture 20 Multicluster Compilation
1 EECS 583 Lecture 20: Multicluster Compilation
- University of Michigan
- March 24, 2004
- Guest speakers today: Michael Chu and Kevin Fan
2 Recap: Traditional VLIW Architectures
- Conventional VLIW
- Target architecture seen so far in class
- Large, centralized register file
- Many functional units connected
- Problems with conventional design
- Longer wires mean longer latencies on RF accesses
- A large number of FUs connected to the register file requires more ports
- Register file access time increases quadratically with the number of ports
[Figure: Conventional Architecture. A single register file (RF) shared by five FUs.]
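To get a feel for the quadratic claim, here is a back-of-the-envelope sketch. It assumes each FU needs three register-file ports (two reads, one write) and that access time scales exactly with the square of the port count. Both the port count per FU and the unit-free cost model are illustrative assumptions, not figures from the lecture.

```python
PORTS_PER_FU = 3  # assumption: 2 read ports + 1 write port per FU


def relative_access_time(num_fus: int) -> int:
    """Relative RF access time under a quadratic-in-ports model."""
    ports = num_fus * PORTS_PER_FU
    return ports ** 2


centralized = relative_access_time(8)  # one RF serving 8 FUs
halved = relative_access_time(4)       # one RF serving only 4 FUs
print(centralized, halved, centralized / halved)  # 576 144 4.0
```

Under this model, halving the number of FUs attached to a register file cuts its relative access time by a factor of four.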
3 Multicluster VLIW Architectures
- Multicluster VLIW
- Solution to the problems with conventional VLIW architecture design
- Decentralize the architecture by splitting the RF and connecting each part to a subset of the FUs
- Clusters communicate through an intercluster communication path
- Problem with multicluster VLIW
- Compilation must now deal with disjoint FU/RF groups and schedule operations accordingly
- Used in commercial processors
- Alpha 21264, TI C6x, etc.
[Figure: Clustered Architecture. Two register files and four FUs, grouped into Cluster 1 and Cluster 2.]
4 Other Multicluster Architecture Designs
- Clusters can be homogeneous or heterogeneous
- Homogeneous means each cluster is identical
- Heterogeneous means the number or types of FUs differ per cluster
- Communication paths can be intercluster buses or cross-cluster FU inputs
[Figure: two panels, "Cross-cluster FU inputs" and "Intercluster Bus". Each panel shows Cluster 1 and Cluster 2, each with its own register file and FUs.]
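The homogeneous/heterogeneous distinction can be made concrete with a small machine description. This is a minimal sketch: the `Cluster` type, the `make_cluster` helper, and the FU type letters (`I` for integer, `M` for memory) are hypothetical names chosen for illustration, not part of any real compiler's machine description format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """One cluster: a register file plus a count of FUs per type."""
    name: str
    fus: tuple  # sorted (fu_type, count) pairs, e.g. (("I", 1), ("M", 1))


def make_cluster(name, **fu_counts):
    return Cluster(name, tuple(sorted(fu_counts.items())))


def is_homogeneous(clusters):
    """True when every cluster has the same FU mix."""
    return len({c.fus for c in clusters}) <= 1


homog = [make_cluster("C1", I=1, M=1), make_cluster("C2", I=1, M=1)]
hetero = [make_cluster("C1", I=1, M=1), make_cluster("C2", I=2)]
print(is_homogeneous(homog), is_homogeneous(hetero))  # True False
```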
5 Multicluster Compilation Basics
- Goal: distribute operations evenly to balance workload while minimizing communication
- When two operations on separate clusters need to communicate, the interconnection network must be used
[Figure: Cluster 1 and Cluster 2, each with its own register file and FUs labeled I and MEM, joined by an interconnection network. Ops shown include >>, LW and an intercluster move.]
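The two competing objectives (balanced load, few intercluster moves) can both be measured for a given assignment. The sketch below is illustrative only: the DFG, the op names, and the rule that one move per value per destination cluster suffices are assumptions made for the example.

```python
def intercluster_moves(edges, cluster_of):
    """Return the (producer, destination_cluster) pairs that need a move.

    Assumption: one move per value per destination cluster suffices,
    so two consumers on the same remote cluster share a single move.
    """
    moves = set()
    for src, dst in edges:
        if cluster_of[src] != cluster_of[dst]:
            moves.add((src, cluster_of[dst]))
    return sorted(moves)


def load_per_cluster(cluster_of):
    """Count how many ops were assigned to each cluster."""
    load = {}
    for cluster in cluster_of.values():
        load[cluster] = load.get(cluster, 0) + 1
    return load


# Hypothetical four-op DFG: a load feeding a shift and an add, then a store.
edges = [("lw", "shift"), ("lw", "add"), ("shift", "store"), ("add", "store")]
cluster_of = {"lw": 1, "shift": 1, "add": 2, "store": 1}

print(intercluster_moves(edges, cluster_of))  # [('add', 1), ('lw', 2)]
print(load_per_cluster(cluster_of))           # {1: 3, 2: 1}
```

Moving `add` back to cluster 1 would remove both moves but leave cluster 2 idle, which is exactly the tension the goal statement describes.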
6 Cluster Assignment
- When do we want to do operation cluster assignment?
- Highly intertwined with scheduling and register allocation
- Assignment to clusters can change how well the code can be scheduled, which changes how well registers can be allocated
- Elcor's model
- Other possible models
- Combine cluster assignment with scheduling
- Combine all three
- Unifying any or all of these three steps can greatly increase complexity
[Figure: phase boxes labeled Cluster Assignment, Scheduling, Register Allocation.]
7 Bottom-Up Greedy (BUG) Algorithm
- First clustering algorithm, introduced for the Multiflow Trace architecture
- Used by Elcor
- Basic idea
- Recursive algorithm
- Go from exit ops to entry ops and pass along good cluster candidates for each op
- Go from entry ops back to exit ops and make final decisions
- Consider ops on the critical path first
8 BUG Algorithm (cont.)
- Given an op and its immediate predecessors and successors, how do we choose a good cluster?
- Op must get its input operands from its predecessors
- Perform some computation
- Send its output to its successors
- Want to pick the cluster such that this process completes soonest (greedy)
- A good choice depends on which clusters the op's predecessors and successors are assigned to
9 Definitions
- Available time: when a source operand is computed
- Arrival time: when a source operand has been moved to the current cluster
- Start time: when all source operands are ready (max of arrival times) and resources are available
- Completion time: when the result has been computed and moved to consumers
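The four definitions chain together naturally as small functions. This is a sketch under stated assumptions: a one-cycle intercluster move (`MOVE_LATENCY`), a one-cycle default op latency, and a single `resource_free_at` value standing in for a real resource model. The function names are illustrative.

```python
MOVE_LATENCY = 1  # assumption: an intercluster move takes one cycle


def arrival_time(avail_time, src_cluster, dst_cluster):
    """When an operand produced on src_cluster is usable on dst_cluster."""
    return avail_time + (MOVE_LATENCY if src_cluster != dst_cluster else 0)


def start_time(operands, cluster, resource_free_at=0):
    """operands: list of (available_time, producing_cluster) pairs."""
    arrivals = [arrival_time(t, c, cluster) for t, c in operands]
    return max(arrivals + [resource_free_at])


def completion_time(operands, cluster, dests, latency=1, resource_free_at=0):
    """When the result is computed and has reached every cluster in dests."""
    done = start_time(operands, cluster, resource_free_at) + latency
    return max([arrival_time(done, cluster, d) for d in dests] or [done])


# One operand, available at cycle 1 on C1; the consumer lives on C1.
operands = [(1, "C1")]
print(start_time(operands, "C1"), completion_time(operands, "C1", ["C1"]))  # 1 2
print(start_time(operands, "C2"), completion_time(operands, "C2", ["C1"]))  # 2 4
```

With these assumptions the printed values agree with the StartTime and CompTime numbers listed for Op3 in the left-path downward pass of the worked example (slide 15): placing the op on C2 costs one move on the way in and another on the way out.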
10 Definitions Illustrated
[Figure: timeline relative to Op 3, showing Ops 1 to 4 and one intercluster move. Labeled points, in time order: AvailableTime(Op2); ArrivalTime(Op2, C1) after the move; AvailableTime(Op1) and ArrivalTime(Op1, C1) at the same point; StartTime(Op3, C1); CompletionTime(Op3, C1, {C1, C2}).]
- Choose a cluster for Op 3 to minimize CompletionTime
11 The Main Function: Assign
- Assign (Op, Dests)
    - for each predecessor Pred of Op    [upward pass: recursive call]
        - Est-clusters = Estimate (Op, Dests)
        - Assign (Pred, Est-clusters)
    - Est-clusters = Estimate (Op, Dests)    [downward pass: actual assignment]
    - Cluster = first cluster in Est-clusters
    - Assign Op to Cluster
    - Mark Cluster's resources busy at StartTime(Op, Cluster)
- Estimate returns a list of Clusters for which CompletionTime(Op, Cluster, Dests) is minimum
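The Assign/Estimate recursion can be sketched as a small runnable class. This is a simplified illustration, not Elcor's or Multiflow's implementation: it assumes one-cycle ops and moves, one issue slot per cluster per cycle, no modeling of the move resources themselves, and it approximates "critical path first" by visiting deeper predecessors first. The class and method names are hypothetical.

```python
MOVE = 1  # assumed intercluster move latency
LAT = 1   # assumed latency of every op


class Bug:
    def __init__(self, preds, clusters, can_run):
        self.preds = preds        # op -> list of predecessor ops
        self.clusters = clusters  # e.g. ["C1", "C2"]
        self.can_run = can_run    # (op, cluster) -> bool
        self.cluster = {}         # final assignment: op -> cluster
        self.avail = {}           # op -> cycle its result is available
        self.busy = set()         # (cluster, cycle) issue slots taken

    def depth(self, op):
        """estart-style depth: longest latency path from an entry op."""
        return max((self.depth(p) + LAT for p in self.preds[op]), default=0)

    def start(self, op, c):
        t = 0
        for p in self.preds[op]:
            if p in self.cluster:  # assigned: real availability plus move
                arrive = self.avail[p] + (MOVE if self.cluster[p] != c else 0)
            else:                  # unassigned: approximate with depth + latency
                arrive = self.depth(p) + LAT
            t = max(t, arrive)
        while (c, t) in self.busy:  # wait for a free issue slot
            t += 1
        return t

    def completion(self, op, c, dests):
        done = self.start(op, c) + LAT
        return max([done + (MOVE if d != c else 0) for d in dests] or [done])

    def estimate(self, op, dests):
        ok = [c for c in self.clusters if self.can_run(op, c)]
        best = min(self.completion(op, c, dests) for c in ok)
        return [c for c in ok if self.completion(op, c, dests) == best]

    def assign(self, op, dests):
        if op in self.cluster:
            return
        # Upward pass: recurse into predecessors, deepest first.
        for p in sorted(self.preds[op], key=self.depth, reverse=True):
            self.assign(p, self.estimate(op, dests))
        # Downward pass: predecessors are now fixed, so commit this op.
        c = self.estimate(op, dests)[0]
        t = self.start(op, c)
        self.cluster[op], self.avail[op] = c, t + LAT
        self.busy.add((c, t))


# Hypothetical five-op DFG: 1 -> 3 -> 5 and 2 -> 4 -> 5; any op runs anywhere.
preds = {1: [], 2: [], 3: [1], 4: [2], 5: [3, 4]}
bug = Bug(preds, ["C1", "C2"], lambda op, c: True)
bug.assign(5, [])
print(bug.cluster)
```

In this run Op 2 ends up on C2 because C1's cycle-0 slot is already taken by Op 1, while the rest of the graph stays on C1. Ties are broken by taking the first cluster in the list, as the pseudocode does.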
12 BUG
- Traverses the DFG in a reverse depth-first-search fashion
- Upward pass
- Predecessors have not been assigned yet
- Use depth (estart) plus latency to approximate each predecessor's AvailableTime
- Estimate a set of good clusters for the current op
- Recursively assign predecessors, with the current set as the predecessors' Dests
- Downward pass
- Make final cluster decisions for ops
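The upward-pass approximation (depth plus latency) is easy to compute on its own. The sketch below assumes a five-op DFG shaped like the lecture's example appears to be (1 -> 3 -> 5 and 2 -> 4 -> 5) with one-cycle ops; the graph shape and the `make_estart` helper are assumptions for illustration.

```python
from functools import lru_cache


def make_estart(preds, latency):
    """Build a memoized estart function for a DAG given as op -> predecessors."""
    @lru_cache(maxsize=None)
    def estart(op):
        return max((estart(p) + latency[p] for p in preds[op]), default=0)
    return estart


preds = {1: (), 2: (), 3: (1,), 4: (2,), 5: (3, 4)}
latency = {op: 1 for op in preds}

estart = make_estart(preds, latency)
approx_avail = {op: estart(op) + latency[op] for op in preds}
print(approx_avail)  # {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
```

The values for Ops 1 and 2 agree with the AvailTime(Op1) = 1 and AvailTime(Op2) = 1 annotations in the example slides. No cluster information is needed, which is why this estimate is usable before any predecessor has been assigned.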
13 Example
- Assume all ops are 1-cycle
- Each cluster can execute one op per cycle
- Cluster 1 can execute any op, cluster 2 can only execute
[Figure: example DFG; surviving labels are C1, C2 and M.]
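14 Example: left path, upward pass
[Figure: left path of the DFG (Ops 1, 3, 5), annotated with candidate cluster C1.]
- AvailTime(Op1) = 1
- CompTime(Op3, C1, {C1}) = 2, CompTime(Op3, C2, {C1}) = 3
- CompTime(Op5, C1, -) = 3
- CompTime(Op1, C1, {C1}) = 1, CompTime(Op1, C2, {C1}) = 2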
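15 Example: left path, downward pass
[Figure: left path of the DFG (Ops 1, 3, 5), with C1 labels as clusters are finalized.]
- ArrivTime(Op1, C1) = 1, ArrivTime(Op1, C2) = 2
- StartTime(Op3, C1) = 1, CompTime(Op3, C1, {C1}) = 2
- StartTime(Op3, C2) = 2, CompTime(Op3, C2, {C1}) = 4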
16 Example: right path, upward pass
[Figure: full DFG (Ops 1 to 5); candidate sets shown are {C1, C2} and C1.]
- AvailTime(Op2) = 1
- StartTime(Op4, C1) = 2, CompTime(Op4, C1, {C1}) = 3
- StartTime(Op4, C2) = 1, CompTime(Op4, C2, {C1}) = 3
- CompTime(Op2, C1, {C1, C2}) = 3, CompTime(Op2, C2, {C1, C2}) = 1
17 Example: right path, downward pass
[Figure: full DFG (Ops 1 to 5); C1 label shown.]
- ArrivTime(Op2, C1) = 2, ArrivTime(Op2, C2) = 1
- StartTime(Op4, C1) = 2, CompTime(Op4, C1, {C1}) = 3
- StartTime(Op4, C2) = 1, CompTime(Op4, C2, {C1}) = 3
- CompTime(Op5, C1, -) = 4
18 Class problem
[Figure: class-problem DFG (some ops marked M), clusters C1 and C2, and a Schedule table.]
19 Problems with BUG
- BUG does a fairly good job of partitioning the DFG, but it can be improved
- Problem 1: Local scope of the DFG
- Has a very narrow view of the DFG
- Doesn't consider the best global clustering
- Problem 2: Scheduler-centric
- Using the scheduler to determine the clustering is slow!
- BUG is not the only solution to cluster assignment
- Many different algorithms exist, using different techniques and different scopes, and occurring at different phases of the compilation process
- No clear-cut winner on the best algorithm for all situations
20 Local Scope
[Figure: one 12-op DFG scheduled two ways, in panels titled "Local scope clustering" and "Global scope clustering". Each panel shows per-cluster cycle columns and the intercluster moves required.]
21 Scheduler-centric Nature
- Cluster assignment during scheduling adds complexity
- Detailed resource model/reservation table is slow!
- Forces local decisions to be made
[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, with slots marked X, shown next to the 12-op DFG.]
22 Region-based Hierarchical Operation Partitioning
- RHOP is one of many advanced clustering techniques
- Code is considered one region at a time
- Weight calculation determines hints for how operations affect the scheduler
- Partitioning uses a multilevel graph partitioner to cluster operations
[Figure: flow from Program to Region, then Weight Calculation, then Graph Partitioning.]
23 Weight Calculation
- Node weights are used to determine approximate resource usage
- Differs depending on how many FUs of each type there are per cluster
- Edge weights are used to determine where best to break the graph
- Where is intercluster communication free or preferred?
[Figure: a cluster (register file with I, F, M and B FUs) and a 14-op DFG in which each node is annotated with (estart, lstart), ranging from (0,0) to (4,4).]
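The (estart, lstart) annotations can be computed with one forward and one backward pass over the DFG, and the slack between them suggests where a cut is cheap. In the sketch below the weighting rule (a high weight on edges joining two zero-slack ops, a low weight elsewhere) and the constants 10 and 1 are assumptions for illustration; the slides do not give the actual formula.

```python
def schedule_bounds(succs, latency=1):
    """Return {op: (estart, lstart)} for a DAG given as op -> successor list."""
    preds = {op: [] for op in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)

    estart, lstart = {}, {}

    def early(op):
        if op not in estart:
            estart[op] = max((early(p) + latency for p in preds[op]), default=0)
        return estart[op]

    length = max(early(op) for op in succs)  # estart of the deepest op

    def late(op):
        if op not in lstart:
            lstart[op] = min((late(s) - latency for s in succs[op]), default=length)
        return lstart[op]

    return {op: (early(op), late(op)) for op in succs}


def edge_weights(succs, bounds, high=10, low=1):
    """Assumed rule: an edge joining two zero-slack ops is expensive to cut."""
    weights = {}
    for u, vs in succs.items():
        for v in vs:
            zero_slack = all(bounds[x][0] == bounds[x][1] for x in (u, v))
            weights[(u, v)] = high if zero_slack else low
    return weights


# Hypothetical DFG: critical chain a -> b -> c -> e, plus a slack op d -> e.
succs = {"a": ["b"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
bounds = schedule_bounds(succs)
print(bounds)   # {'a': (0, 0), 'b': (1, 1), 'c': (2, 2), 'd': (0, 2), 'e': (3, 3)}
print(edge_weights(succs, bounds))
```

Op `d` has two cycles of slack, so the edge from `d` to `e` gets the low weight: an intercluster move there can hide in the slack, which is one way to read "free or preferred".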
24 RHOP - Coarsening
- Coarsening takes highly related operations and groups them together, to be partitioned later
- Groups are formed based on edge weights
- Takes snapshots of how things are coarsened; later steps consider the ops in each group together
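One common way to coarsen by edge weight is greedy heavy-edge matching, sketched below. This is a generic multilevel-partitioning technique used here for illustration; the slides do not say which matching rule RHOP uses, and the stopping target of two groups (one per cluster) is also an assumption.

```python
def coarsen_once(groups, weights):
    """Merge pairs of groups joined by the heaviest edges (greedy matching).

    groups:  list of frozensets of ops
    weights: dict mapping (op_u, op_v) -> edge weight
    """
    owner = {op: g for g in groups for op in g}
    between = {}  # total weight between each unordered pair of groups
    for (u, v), w in weights.items():
        gu, gv = owner[u], owner[v]
        if gu != gv:
            key = frozenset((gu, gv))
            between[key] = between.get(key, 0) + w

    matched, merged = set(), []
    for key, _ in sorted(between.items(), key=lambda kv: -kv[1]):
        gu, gv = tuple(key)
        if gu not in matched and gv not in matched:
            matched.update((gu, gv))
            merged.append(gu | gv)
    merged.extend(g for g in groups if g not in matched)
    return merged


def coarsen(ops, weights, target=2):
    """Return the list of snapshots, from finest to coarsest."""
    snapshots = [[frozenset([op]) for op in ops]]
    while len(snapshots[-1]) > target:
        nxt = coarsen_once(snapshots[-1], weights)
        if len(nxt) == len(snapshots[-1]):
            break  # nothing left to merge
        snapshots.append(nxt)
    return snapshots


# Hypothetical weights: a heavy chain a-b-c-e and a light edge d-e.
weights = {("a", "b"): 10, ("b", "c"): 10, ("c", "e"): 10, ("d", "e"): 1}
snaps = coarsen("abcde", weights)
for level in snaps:
    print(sorted("".join(sorted(g)) for g in level))
```

Keeping every level in `snaps` corresponds to the snapshots mentioned above: a later pass can walk back from the coarsest level to finer ones and reconsider group placement at each.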
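25 RHOP: Scheduling estimate
[Figure: estimated per-cycle load for each cluster. Cluster 1 holds ops 1, 2, 3, 4, 5, 6, 8, 9, 12, 14, with per-cycle weights 2.5, 2.0, 0.5, 0.0, 0.0 (Cluster_wgt1 = 5.0). Cluster 2 holds ops 7, 10, 11, 13, with per-cycle weights 0.0, 0.33, 0.33, 0.0, 0.0 (Cluster_wgt2 = 0.67).]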
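26 RHOP: Checking proposed moves
- Move groups of operations over and see how the move changes the load on the schedule estimate
[Figure: schedule estimate after moving ops 4, 5, 6 and 9 from Cluster 1 to Cluster 2. Cluster 1 now holds ops 1, 2, 3, 8, 12, 14, with per-cycle weights 1.0, 0.0, 0.0, 0.0, 0.0. Cluster 2 holds ops 4, 5, 6, 7, 9, 10, 11, 13, with per-cycle weights 1.33, 2.33, 0.83, 0.0, 0.0. Values shown: SL(before) = 5.0, SL(after) = 4.5, Lgain = 0.5, Egain = -1.0, Mgain = 4.0.]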
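The load numbers on this slide and the previous one can be cross-checked with a few lines. The reading used here is an inference from the figures, not something the slides state: a cluster's weight is the sum of its per-cycle weights, SL is the weight of the most heavily loaded cluster, and Lgain is SL(before) minus SL(after). Egain and Mgain are not reproduced because their formulas are not given. The per-cycle values are the two-decimal numbers printed in the figures, so results are compared after rounding.

```python
def cluster_weight(per_cycle):
    """Total estimated load of one cluster: the sum over cycles."""
    return sum(per_cycle)


def schedule_load(clusters):
    """Assumed reading of SL: the weight of the most loaded cluster."""
    return max(cluster_weight(c) for c in clusters)


before = [
    [2.5, 2.0, 0.5, 0.0, 0.0],     # Cluster 1
    [0.0, 0.33, 0.33, 0.0, 0.0],   # Cluster 2
]
after = [
    [1.0, 0.0, 0.0, 0.0, 0.0],     # Cluster 1 after the move
    [1.33, 2.33, 0.83, 0.0, 0.0],  # Cluster 2 after the move
]

sl_before = schedule_load(before)
sl_after = schedule_load(after)
lgain = sl_before - sl_after
print(round(sl_before, 1), round(sl_after, 1), round(lgain, 1))  # 5.0 4.5 0.5
```

The move relieves Cluster 1 but makes Cluster 2 the bottleneck, so the load gain is only 0.5 even though four ops changed clusters.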
27 RHOP - Refinement