Title: Michael Bender, SUNY Stony Brook
1. Communication-Aware Processor Allocation for Supercomputers
- Michael Bender, SUNY Stony Brook
- David Bunde, University of Illinois at Urbana-Champaign
- Erik Demaine, MIT
- Sandor Fekete, Braunschweig University of Technology
- Vitus Leung, Sandia National Laboratories
- Henk Meijer, Queen's University, Ontario
- Cynthia Phillips, Sandia National Laboratories
Sandia is a multiprogram laboratory operated by
Sandia Corporation, a Lockheed Martin
Company, for the United States Department of
Energy under contract DE-AC04-94AL85000.
2. Computational Plant (Cplant)
- Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components)
- Up to 1500 processors
- Production computing environment
- Our job: improve parallel node allocation on Cplant to optimize performance.
3. The Cplant System
- DEC Alpha processors
- Myrinet interconnect (Sandia modified)
- MPI
- Different sizes/topologies: usually 2D or 3D grid with toroidal wraps
- Ross: 1500 processors, 3D mesh
- Zermatt: 128 processors, 2D mesh
- Alaska: 600 processors, heavily augmented 2D mesh (cannibalized)
- Modified Linux OS (now public domain)
- Four processors/switch (compute, I/O, service nodes)
4. Scheduling Environment
- Users submit jobs to queue (online)
- Users specify number of processors and runtime estimate
- If a job runs past this estimate by 5 min, it is killed
- No preemption, no migration, no multitasking (security)
- Actual runtime depends on set of processors allocated and placement of other jobs
- Goals
- User: minimum response time
- Bureaucracy (GAO): high utilization
5. Scheduler/Allocator Association
- Scheduler and allocator affect each other's performance.
Performance dependencies
6. Scheduler/Allocator Dissociation
[Diagram labels: Job (user, executable, processors, requested time); queue; PBS Scheduler; Node Allocator; Cplant]
- Scheduler enforces policy
- Management sets priorities for access, utilization policy
- Allocator can optimize performance
7. What's a Good Allocation?
[Figure: a good allocation and a bad allocation for a 2D mesh]
- Objective: allocate jobs to processors to minimize network contention ⇒ processor locality.
- Especially important for commodity networks
8. Quantitative Effect of Processor Locality
[Figure: two allocations compared, labeled "2× faster than"; legend: empty processor]
9. Communication Hops on a 2D Grid
[Figure: hop counts on a 2D grid; labels 5 and 4]
- L1 distance = number of hops (switches) between 2 processors on the grid
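The hop count above can be sketched in a few lines. This is a minimal illustration that assumes processors are identified by integer grid coordinates and ignores toroidal wraps; the function name `l1_distance` is hypothetical.

```python
def l1_distance(p, q):
    """Manhattan (L1) distance between two grid coordinates.

    Assumption: coordinates are integer tuples of equal length and the
    mesh has no wraparound links.
    """
    return sum(abs(a - b) for a, b in zip(p, q))


print(l1_distance((0, 0), (3, 2)))        # 3 hops in x plus 2 hops in y
print(l1_distance((1, 1, 1), (2, 3, 1)))  # the same idea on a 3D mesh
```

On a torus, each coordinate difference would instead be the smaller of the direct and the wrapped distance.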
10. Allocation Problem
- Given n available points on grid (some unavailable)
- Find a set of k available points with minimum average (or total) L1 distance.
- Example: green allocation has total distance 3(2) + 3(1) = 9
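To make the objective concrete, here is a small sketch of the total pairwise L1 distance, plus an exhaustive search that is only usable on tiny instances. The T-shaped point set is a hypothetical example with three pairs at distance 2 and three pairs at distance 1; the original figure is not recoverable, so it may not match the slide's picture. All function names are illustrative.

```python
from itertools import combinations


def l1(p, q):
    """L1 distance between two 2D grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def total_pairwise_l1(points):
    """Sum of L1 distances over all unordered pairs of points."""
    return sum(l1(p, q) for p, q in combinations(points, 2))


def best_allocation_bruteforce(free, k):
    """Try every k-subset of the free points (exponential time)."""
    return min(combinations(free, k), key=total_pairwise_l1)


# Hypothetical T-shaped set: 3 pairs at distance 2 and 3 pairs at distance 1.
t_shape = [(0, 0), (1, 0), (0, 1), (-1, 0)]
print(total_pairwise_l1(t_shape))  # 3*2 + 3*1 = 9

free = [(x, y) for x in range(3) for y in range(3)]
print(best_allocation_bruteforce(free, 4))
```

Dividing the total by the number of pairs, k(k-1)/2, gives the average distance, so both objectives pick the same set for a fixed k.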
11. Empirical Correlation
Leung et al., 2002. Related support: Mache and Lo, 1996.
12. Previous Work
- Various: work forcing a convex set
- Insufficient processor utilization
- Mache, Lo, Windisch: MC algorithm
- Krumke et al.: 2-approximation, NP-hard with general metric
- Complexity open for grids
- Dispersion problem (max distance): linear time for fixed k (Fekete and Meijer)
13. Optimal Unconstrained Shape [Bender, Bender, Demaine, Fekete 2004]
Almost a circle but not quite. Only .05 percent
difference in area.
0.650 245 952 951
14. Our Results
- 7/4-approximation (2 - 1/(2d) in d dimensions)
- PTAS: (1+ε)-approximation in time polynomial in n for any fixed ε
- MC is a 4-approximation
- Linear-time exact dynamic program in 1D
- O(n log n) time for k = 3
- Simulations (performance on job streams)
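15. An L1 Ball on a 2D Grid
[Figure: the unit L1 ball is a diamond with vertices (0,1), (1,0), (0,-1), and (-1,0), bounded by the lines x + y = 1, y - x = 1, x - y = 1, and x + y = -1]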
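The diamond shape can be checked by enumerating the grid points within a given L1 radius. This is a minimal sketch; the helper name `l1_ball` and the integer radius are assumptions made for illustration.

```python
def l1_ball(center, radius):
    """Grid points (x, y) with |x - cx| + |y - cy| <= radius."""
    cx, cy = center
    points = []
    for dx in range(-radius, radius + 1):
        reach = radius - abs(dx)  # remaining budget for the y direction
        for dy in range(-reach, reach + 1):
            points.append((cx + dx, cy + dy))
    return points


for r in range(4):
    # An L1 ball of integer radius r holds 2*r*r + 2*r + 1 grid points.
    print(r, len(l1_ball((0, 0), r)))
```

Each column of the diamond is shorter the farther it lies from the center, which is what the four bounding lines express.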
16. Possible Medians of Selected Set
- A median will always share x coordinate with an
available point and y coordinate with a (possibly
different) available point.
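Because an L1 median can be chosen coordinate by coordinate, the candidate medians form the grid spanned by the x coordinates and y coordinates of the available points. The sketch below enumerates them; `candidate_medians` is a hypothetical helper name.

```python
def candidate_medians(free):
    """All (x, y) whose x and y each appear on some available point."""
    xs = sorted({x for x, _ in free})
    ys = sorted({y for _, y in free})
    return [(x, y) for x in xs for y in ys]


free = [(0, 0), (2, 1), (5, 3)]
cands = candidate_medians(free)
print(len(cands))       # 3 distinct x values times 3 distinct y values
print((2, 3) in cands)  # x from (2, 1), y from (5, 3)
```

With n available points this gives at most n squared candidates, and a candidate need not itself be an available point.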
17. Manhattan Median (MM) Algorithm
- For each possible median p
- Pick k free processors closest to p (in L1)
- Compute total pairwise L1 distance
- Return set with the smallest total distance.
- Krumke et al. (1997) previously showed this is a 2-approximation in arbitrary metric spaces.
- We proved it is a 7/4-approximation for L1. This is tight.
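The MM steps listed above can be sketched as follows. This is an unoptimized illustration, not the authors' implementation: it assumes 2D integer coordinates, takes candidate medians from the x and y coordinates of the free processors, and breaks distance ties by input order, which is an arbitrary choice. The name `mm_allocate` is hypothetical.

```python
from itertools import combinations


def l1(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def total_pairwise_l1(points):
    return sum(l1(p, q) for p, q in combinations(points, 2))


def mm_allocate(free, k):
    """Return (processors, total pairwise L1 distance) chosen by MM."""
    if k > len(free):
        raise ValueError("not enough free processors")
    xs = {x for x, _ in free}
    ys = {y for _, y in free}
    best, best_cost = None, float("inf")
    for p in ((x, y) for x in xs for y in ys):  # each possible median
        # The k free processors closest to p (ties: input order, assumed).
        chosen = sorted(free, key=lambda q: l1(p, q))[:k]
        cost = total_pairwise_l1(chosen)
        if cost < best_cost:
            best, best_cost = chosen, cost
    return best, best_cost


# A free 2x2 block plus two distant free processors.
free = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (20, 0)]
alloc, cost = mm_allocate(free, 4)
print(sorted(alloc), cost)
```

In this example MM selects the 2x2 block, whose total pairwise distance is 8. The sketch re-sorts the free list for every candidate, so it is far slower than a tuned version would be.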
18. Lower Bound Instance (7/4)
19. Upper Bound Techniques
- WLOG assume the origin is a median of OPT
- Let M be the k points closest to the origin
- Candidate point set for algorithm MM
- Set returned by MM can only be better
- Compare M to optimal
- Assume M is the worst-case example
20. Upper Bound Techniques
- Transform optimal and M to point placements that have the same performance ratio, but are easy to analyze
- Transform in steps
- Argue the ratio gets worse if we deviate from this form (impossible if M is the worst case)
[Figure: all points of OPT and M at these 5 points]
21. Simulations: Performance on a Job Stream
- We've analyzed a greedy algorithm for placing a single job
- How well does it do for a stream of jobs?
- Consider two types of algorithms
- Situation algorithm: places job stream prefix (system normal/default)
- Decision algorithm: places current job (can be a 1-time override)
22. Simulation Setup
[Diagram labels: job stream; situation algorithm; current allocation; 1-time decision algorithm]
- Job stream from LLNL Cray T3D trace
- 21323 jobs, 256 processors
23. Simulations: Alternative Placement Algorithm MC
- Search in shells from minimum-size region of preferred shape.
- Weight processors by shells
- Return processor set with minimum weight.
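A rough sketch of the shell idea for a 1x1 preferred shape follows. The assumptions here go beyond the slide: shell i around a center is taken to be the ring of processors at Chebyshev distance i, a processor's weight is its shell number, and only free processors are tried as centers. The actual MC algorithm of Mache, Lo, and Windisch may differ in these details, and `mc1x1_allocate` is a hypothetical name.

```python
def chebyshev(p, q):
    """Shell number of q around center p (assumed: square rings)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def mc1x1_allocate(free, k):
    """Return (processors, total shell weight) for the best center."""
    if k > len(free):
        raise ValueError("not enough free processors")
    best, best_weight = None, float("inf")
    for center in free:  # assumption: centers are free processors
        # Fill shells from the inside out until k processors are taken.
        ranked = sorted(free, key=lambda q: chebyshev(center, q))[:k]
        weight = sum(chebyshev(center, q) for q in ranked)
        if weight < best_weight:
            best, best_weight = ranked, weight
    return best, best_weight


free = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (20, 0)]
alloc, weight = mc1x1_allocate(free, 4)
print(sorted(alloc), weight)
```

Unlike MM, the quantity minimized here is the shell weight rather than the pairwise distance, so the two methods can prefer different sets.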
24. Alternative: One-Dimensional Reduction
- Order processors so that
- close in linear order ⇒ close in physical processor graph
- Consider one-dimensional processor allocation
- Pack jobs onto the line (or ring), allowing fragmentation
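Once processors are laid out on a line, allocation becomes a packing problem over free intervals. The sketch below uses a best-fit rule: take the smallest free run that holds the job. The fallback when no run is large enough, taking the first k free slots in line order, is an assumption standing in for whatever fragmentation policy the real allocator uses. Function names are hypothetical, and ring wraparound is ignored.

```python
def free_intervals(is_free):
    """Maximal runs of free slots, as (start, length) pairs."""
    runs, start = [], None
    for i, slot_free in enumerate(is_free):
        if slot_free and start is None:
            start = i
        elif not slot_free and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(is_free) - start))
    return runs


def best_fit_1d(is_free, k):
    """Positions for a k-processor job, or None if it cannot fit."""
    fitting = [run for run in free_intervals(is_free) if run[1] >= k]
    if fitting:
        start, _ = min(fitting, key=lambda run: run[1])  # tightest run
        return list(range(start, start + k))
    free_slots = [i for i, slot_free in enumerate(is_free) if slot_free]
    if len(free_slots) < k:
        return None
    return free_slots[:k]  # assumed fallback: fragment across runs


line = [True, True, False, True, True, True, False, True]
print(best_fit_1d(line, 2))
print(best_fit_1d(line, 4))
```

The first call fits inside the run of length 2 at the start of the line; the second finds no single run of length 4 and therefore fragments.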
25. Hilbert (Space-Filling) Curves
- For 2D and 3D grids
- Previous applications
- I/O efficient and cache-oblivious computation
- Compression (images)
- Domain decomposition
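For a 2D grid, a Hilbert ordering can be produced with the standard iterative index-to-coordinate conversion. The sketch assumes a square grid whose side is a power of two; the function name `hilbert_d2xy` is illustrative, and handling other grid shapes or 3D is not covered here.

```python
def hilbert_d2xy(side, d):
    """Map position d along the Hilbert curve to (x, y) on a side x side grid.

    Assumption: side is a power of two and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate or flip the quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


order = [hilbert_d2xy(8, d) for d in range(64)]
print(order[:4])
```

Consecutive positions on the curve are always neighboring grid cells, which is the locality property the one-dimensional reduction relies on.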
26. Four Algorithms for Simulation
- MM
- MM + incremental improvement (MMInc)
- Hilbert curve with best fit (HilbertBF)
- MC
27. Results
- Ordering in a row consistent with proven approximation performance: MMInc, MM, MC1x1, HilbertBF
- Ordering on diagonal (normal operation) approximately opposite
28. Results
- MM paints into a corner on streams
- But good for single high-priority job
- Thoughts: rectangles pack better than circles
29. New System: Red Storm
- 10,368 AMD Opteron processors at 2 GHz
- 31.2 TB memory, 240 TB disk
- 41.47 TF peak performance
- 3D mesh
30. Impact
- Changed the node allocator on Cplant
- 1D default allocator
- Carried over to Red Storm system software
- 1D algorithms current default
- 2D algorithms implemented on Red Storm
- Awaiting testing for use
- R&D 100 submission (must win internal competition)
31. Questions
- What's the right allocation for a stream (online)?
- Scheduling + allocation
- Simulation issues
- Nondeterminism
- Credit for good placement in timing