Title: Advancing Supercomputer Performance Through Interconnection Topology Synthesis
1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis
- Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng
- Department of Computer Science and Engineering
- University of California, San Diego
2 Outline
- Introduction
- Design Flow, Formulation and Algorithms
- Example: Blue Gene/L Packaging
  - Overview
  - Models and Constraints
- Experiments
  - Benchmark Instances
  - Generated Instances
- Conclusion and Future Work
3 Interconnection Networks
- Interconnection networks are becoming a more critical factor than computing or memory modules (W. Dally, HPCA 2007 Keynote Speech)
- Popular network topologies
  - Hypercube (SGI Origin2000)
  - 2D torus (Cray X1)
  - 3D torus (Cray T3E and XT3, IBM Blue Gene/L)
  - Crossbar (NEC Earth Simulator)
  - Folded Clos (Cray BlackWidow)
  - Fat tree, flattened butterfly, etc.
4 Our Work
- We propose a design methodology that selects the topology minimizing the average latency
- Design flow is fully automated
- Physical constraints can be specified by users
- Efficient multi-commodity flow algorithm to evaluate each candidate topology
- We demonstrate the efficiency using the Blue Gene/L packaging framework
5 Design Flow
6 Multi-Commodity Flow (MCF)
- Graph G(V,E)
- K commodities, each with a source, a sink, and a demand d(k)
- Each edge e has a capacity u(e)
- Each edge e has a weight w(e)
- Minimum-cost MCF: each commodity k routes d(k) units under the capacity constraints, minimizing $\sum_{e \in E} w(e)\, f(e)$, where f(e) is the total flow routed on edge e
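Written out as a linear program, the minimum-cost MCF problem on this slide can be stated as below. The per-commodity flow variables $f_k(e)$, the terminals $s_k$, $t_k$, and the sets $\delta^+(v)$ / $\delta^-(v)$ (edges leaving / entering node v) are notation assumed here; the slide itself only defines the aggregate flow f(e).

```latex
\begin{aligned}
\min \quad & \sum_{e \in E} w(e)\, f(e) \\
\text{s.t.} \quad
& f(e) = \sum_{k=1}^{K} f_k(e) \le u(e)
  && \forall e \in E \\
& \sum_{e \in \delta^+(v)} f_k(e) - \sum_{e \in \delta^-(v)} f_k(e) =
  \begin{cases}
    d(k)  & v = s_k \\
    -d(k) & v = t_k \\
    0     & \text{otherwise}
  \end{cases}
  && \forall v \in V,\ k = 1, \dots, K \\
& f_k(e) \ge 0
  && \forall e \in E,\ k = 1, \dots, K
\end{aligned}
```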
7 Map Supercomputer Performance Evaluation to MCF Problem
- Nodes: processors
- Edges: interconnection links
- Commodities: communications
- Demands: communication bandwidth (injection rate)
- Flow amount: wire assignments
- Capacity constraints: physical constraints (wires, pins, board dimensions)
- Edge weight: unit latency (unit power)
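The mapping above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: `MCFInstance`, `build_instance`, the default capacity and latency values, and the 8-node ring are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MCFInstance:
    """Hypothetical container mirroring the slide's mapping."""
    nodes: list                                       # processors
    capacity: dict = field(default_factory=dict)      # (u, v) -> link capacity (physical limit)
    weight: dict = field(default_factory=dict)        # (u, v) -> unit latency
    commodities: list = field(default_factory=list)   # (source, sink, demand)


def build_instance(n, links, traffic, link_capacity=1.0, unit_latency=1.0):
    """Turn a topology (links) and a traffic matrix into an MCF instance."""
    inst = MCFInstance(nodes=list(range(n)))
    for u, v in links:
        # Assumption: each physical link is modeled as two directed edges.
        for a, b in ((u, v), (v, u)):
            inst.capacity[(a, b)] = link_capacity
            inst.weight[(a, b)] = unit_latency
    for (s, t), bandwidth in traffic.items():
        if s != t and bandwidth > 0:          # skip self-traffic and empty demands
            inst.commodities.append((s, t, bandwidth))
    return inst


# Hypothetical example: an 8-node ring with two communicating pairs.
ring = [(i, (i + 1) % 8) for i in range(8)]
inst = build_instance(8, ring, {(0, 4): 0.5, (1, 2): 0.25, (3, 3): 1.0})
print(len(inst.capacity), inst.commodities)
```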
8 An Example on Maximum Concurrent Flow
- Two commodities, s1 -> t1 and s2 -> t2, both with demand d(1) = d(2) = 1
- Optimal throughput: 1.5
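For reference, maximum concurrent flow asks for the largest common factor by which all demands can be scaled and still be routed within the edge capacities. The symbols $\lambda$, $f_k(e)$, $s_k$ and $t_k$ are notation assumed here, and the slide's "throughput" is read as this $\lambda$.

```latex
\begin{aligned}
\max \quad & \lambda \\
\text{s.t.} \quad
& \text{each commodity } k \text{ routes } \lambda\, d(k) \text{ units from } s_k \text{ to } t_k, \\
& \sum_{k=1}^{K} f_k(e) \le u(e) \quad \forall e \in E, \\
& f_k(e) \ge 0 \quad \forall e \in E,\ k = 1, \dots, K .
\end{aligned}
```

Under this reading, an optimal throughput of 1.5 with d(1) = d(2) = 1 means both commodities can route 1.5 units at the same time.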
9 Approximation Algorithms
- LP duality theory: for a maximization problem with primal feasible value P, dual feasible value D, and optimal solution OPT, P <= OPT <= D
- Increase P and decrease D iteratively until the duality gap is small enough
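The slides do not say which approximation scheme is used. As one illustration of bracketing OPT between a primal and a dual value, the sketch below follows a Garg-Könemann-style multiplicative-weights scheme for maximum concurrent flow; that choice is an assumption here, not the authors' stated algorithm. All names, the `eps` parameter, and the one-edge test network are hypothetical, and the code assumes every sink is reachable from its source and differs from it.

```python
import heapq
import math


def shortest_path(adj, length, s, t):
    """Dijkstra under the current edge lengths; returns the s-t path as an edge list."""
    dist, prev, heap = {s: 0.0}, {}, [(0.0, s)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, math.inf):
            continue                      # stale heap entry
        if v == t:
            break
        for w in adj.get(v, ()):
            nd = d + length[(v, w)]
            if nd < dist.get(w, math.inf):
                dist[w], prev[w] = nd, v
                heapq.heappush(heap, (nd, w))
    path, v = [], t
    while v != s:
        path.append((prev[v], v))
        v = prev[v]
    return path[::-1]


def max_concurrent_flow(cap, commodities, eps=0.1):
    """Return (lower, upper, flow): feasible throughput, dual bound, per-edge flow."""
    delta = (len(cap) / (1.0 - eps)) ** (-1.0 / eps)
    length = {e: delta / u for e, u in cap.items()}   # dual variables
    adj = {}
    for v, w in cap:
        adj.setdefault(v, []).append(w)
    flow = dict.fromkeys(cap, 0.0)

    def dual_value():
        return sum(cap[e] * length[e] for e in cap)

    def path_length(s, t):
        return sum(length[e] for e in shortest_path(adj, length, s, t))

    phases, upper = 0, math.inf
    while dual_value() < 1.0:
        # Any positive length function certifies OPT <= D(l) / alpha(l).
        alpha = sum(d * path_length(s, t) for s, t, d in commodities)
        upper = min(upper, dual_value() / alpha)
        finished = True
        for s, t, demand in commodities:
            remaining = demand
            while remaining > 1e-12 and dual_value() < 1.0:
                path = shortest_path(adj, length, s, t)
                c = min(remaining, min(cap[e] for e in path))
                for e in path:
                    flow[e] += c                          # primal flow grows
                    length[e] *= 1.0 + eps * c / cap[e]   # congested edges get longer
                remaining -= c
            if remaining > 1e-12:
                finished = False
                break
        if not finished:
            break
        phases += 1                                       # every demand routed once more
    # Scaling by this factor makes the accumulated flow respect all capacities.
    scale = math.log((1.0 + eps) / delta) / math.log(1.0 + eps)
    return phases / scale, upper, {e: f / scale for e, f in flow.items()}


# Hypothetical test network: two unit demands share one unit-capacity edge (optimum 0.5).
cap = {("a", "b"): 1.0}
lower, upper, flow = max_concurrent_flow(cap, [("a", "b", 1.0), ("a", "b", 1.0)])
print(lower, upper)
```

The returned `lower` is always achievable by the scaled flow, and `upper` is a valid dual bound, so the true optimum lies between them; a smaller `eps` narrows the gap at the cost of more iterations.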
10 Blue Gene/L: An Example
11 Assumptions
- We follow the same hierarchical structure: midplane, node card, compute card
- The properties of the boards (dimensions, layers, dielectric) remain unchanged
- We seek better topologies than the existing 3D torus to implement the networks in the midplane
12 Topology Generation
- Generate 8-node 1D topologies and duplicate them to each row and column
- Topologies are isomorph-free and have a maximum degree bound for each node
isomorph-free topologies
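To illustrate what "isomorph-free with a degree bound" means, the sketch below enumerates connected topologies by brute force and keeps one representative per isomorphism class. It is not the authors' generator: the function names and the `max_degree` parameter are hypothetical, and trying every node permutation is only practical for very small node counts, so the example uses 4 and 5 nodes instead of the 8 nodes on the slide.

```python
from itertools import combinations, permutations


def canonical(n, edges):
    """Smallest relabeled edge tuple over all node permutations (brute force)."""
    best = None
    for perm in permutations(range(n)):
        key = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or key < best:
            best = key
    return best


def connected(n, edges):
    """Depth-first search from node 0 over an undirected edge list."""
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for a, b in edges:
            w = b if a == v else a if b == v else None
            if w is not None and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def enumerate_topologies(n, max_degree):
    """One representative per isomorphism class of connected, degree-bounded graphs."""
    pairs = list(combinations(range(n), 2))
    found = set()
    for r in range(n - 1, len(pairs) + 1):      # a connected graph needs >= n-1 links
        for edges in combinations(pairs, r):
            degree = [0] * n
            for u, v in edges:
                degree[u] += 1
                degree[v] += 1
            if max(degree) <= max_degree and connected(n, edges):
                found.add(canonical(n, edges))
    return sorted(found, key=lambda es: (len(es), es))


print(len(enumerate_topologies(4, 2)))   # 2: the 4-node chain and the 4-node ring
```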
13 Node Card Graph Model
[Figure: node card graph model; labels: Horizontal, Strongly Connected, Vertical, Generated Topology]
14 Midplane Graph Model
Coteus et al., "Packaging the Blue Gene/L Supercomputer," IBM J. of Res. and Dev., Vol. 49, pp. 213-248
15 Experiment 1: Benchmark Instances
- NAS Parallel Benchmarks (121/128 processes)
[Figure: experiment flow. Benchmark source code -> (compiled with Intel Trace Collector and Analyzer) -> executable -> (run on multi-processor machines) -> traffic patterns -> (simulated annealing placement) -> task placement -> (our design flow) -> output: best topology]
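The simulated annealing placement step can be sketched as follows. The slides give no details of the cost function, move set, or cooling schedule, so everything here is an assumption: the cost is traffic volume times hop distance, a move swaps two tasks, cooling is geometric, and the 8-node ring distance and traffic pattern are hypothetical.

```python
import math
import random


def ring_hops(a, b, n=8):
    """Hop distance on an n-node ring (assumed target network for this sketch)."""
    return min(abs(a - b), n - abs(a - b))


def placement_cost(place, traffic, dist):
    """Total traffic volume weighted by hop distance; task i sits on node place[i]."""
    return sum(vol * dist(place[i], place[j]) for (i, j), vol in traffic.items())


def anneal_placement(n, traffic, dist, steps=20000, t0=1.0, cooling=0.9995, seed=0):
    rng = random.Random(seed)
    place = list(range(n))                      # start from the identity placement
    cost = placement_cost(place, traffic, dist)
    best, best_cost = place[:], cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        place[i], place[j] = place[j], place[i]            # propose: swap two tasks
        new_cost = placement_cost(place, traffic, dist)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                                # accept the move
            if cost < best_cost:
                best, best_cost = place[:], cost
        else:
            place[i], place[j] = place[j], place[i]        # reject: undo the swap
        temp *= cooling                                    # geometric cooling
    return best, best_cost


# Hypothetical traffic: task i talks to task (i + 3) mod 8 with unit volume.
traffic = {(i, (i + 3) % 8): 1.0 for i in range(8)}
initial_cost = placement_cost(list(range(8)), traffic, ring_hops)
best, best_cost = anneal_placement(8, traffic, ring_hops)
print(initial_cost, best_cost)
```

Because the best placement seen so far is tracked separately, the returned cost is never worse than that of the starting placement.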
16 Benchmarks
[Table: benchmark characteristics]
[Figure: communication pattern of MG]
17 Results
- Optimal: each instance has its own topology
- Aggregate: one topology for all instances
- 3D Torus: the existing 3D torus topology
18 Experiment 2: Generated Instances
- Randomly generated communications
  - Scalar values that represent the bandwidth demand between each pair of nodes
  - More general, time independent
- Control parameters
  - Number of communication demands: O(n) pairs
  - Communication amount: uniform traffic, but varied case by case (different congestion levels)
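A generator along these lines might look like the sketch below. The function name, the `pairs_per_node` constant behind the O(n) pair count, the node count of 64, and the demand amounts are all hypothetical; the slides only state that demands are scalar, that there are O(n) pairs, and that the amount is uniform within a case.

```python
import random


def random_instance(n, pairs_per_node=2, amount=1.0, seed=0):
    """Pick pairs_per_node * n distinct (source, sink) pairs, all with the same demand."""
    rng = random.Random(seed)
    demand = {}
    while len(demand) < pairs_per_node * n:
        s, t = rng.sample(range(n), 2)    # two distinct nodes
        demand[(s, t)] = amount           # uniform amount within one instance
    return demand


# Same communicating pairs (same seed), three assumed congestion levels.
cases = {amount: random_instance(64, amount=amount) for amount in (0.25, 0.5, 1.0)}
print({amount: len(demand) for amount, demand in cases.items()})
```

Reusing the seed keeps the set of communicating pairs fixed, so only the congestion level changes from case to case.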
19 Latency-Throughput Tradeoffs
Distribution: 40 / 50 / 10
20 Topologies with Different Injection Rates
With a larger injection rate, more (red) links are needed across the cut between nodes 4 and 5, in order to reduce the number of hops
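To make the cut and hop-count notions concrete, the sketch below counts the links crossing the cut between nodes 4 and 5 and computes the average hop count for two 8-node 1D topologies. Both topologies are hypothetical examples, not the ones shown on the slide, and the code assumes the topology is connected.

```python
from collections import deque


def average_hops(n, links):
    """Mean shortest-path hop count over all ordered node pairs (nodes 1..n)."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:                       # breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total / (n * (n - 1))


def cut_width(links, left):
    """Number of links with exactly one endpoint in the node set `left`."""
    return sum((u in left) != (v in left) for u, v in links)


chain = [(i, i + 1) for i in range(1, 8)]    # 1-2-...-8
dense = chain + [(1, 5), (4, 8)]             # two extra links across the middle cut
left_half = {1, 2, 3, 4}

print(cut_width(chain, left_half), average_hops(8, chain))   # 1 3.0
print(cut_width(dense, left_half), average_hops(8, dense))
```

The extra links raise the cut width from 1 to 3 and shorten the average path, at the price of more wires through the middle of the board.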
21 Conclusion
- A design flow for interconnection network synthesis
  - Fully automated
  - Explores a large design space
  - Efficient evaluation algorithm
- Future work
  - Power consumption
  - Accurate simulation
22 Q&A