1
Communication-Aware Processor Allocation for
Supercomputers
  • Michael Bender, SUNY Stony Brook
  • David Bunde, University of Illinois Urbana
  • Erik Demaine, MIT
  • Sandor Fekete, Braunschweig University of
    Technology
  • Vitus Leung, Sandia National Laboratories
  • Henk Meijer, Queen's University, Ontario
  • Cynthia Phillips, Sandia National Laboratories

Sandia is a multiprogram laboratory operated by
Sandia Corporation, a Lockheed Martin
Company, for the United States Department of
Energy under contract DE-AC04-94AL85000.
2
Computational Plant (Cplant)
  • Commodity-based supercomputers at Sandia National
    Laboratories (off-the-shelf components)
  • Up to 1500 processors
  • Production computing environment
  • Our job: improve parallel node allocation on
    Cplant to optimize performance

3
The Cplant System
  • DEC alpha processors
  • Myrinet interconnect (Sandia modified)
  • MPI
  • Different sizes/topologies, usually a 2D or 3D
    grid with toroidal wraps
  • Ross: 1500 processors, 3D mesh
  • Zermatt: 128 processors, 2D mesh
  • Alaska: 600 processors, heavily-augmented 2D mesh
    (cannibalized)
  • Modified Linux OS (now public domain)
  • Four processors/switch (compute, I/O, service
    nodes)

4
Scheduling Environment
  • Users submit jobs to queue (online)
  • Users specify number of processors and runtime
    estimate
  • If a job runs past this estimate by 5 min, it is
    killed
  • No preemption, no migration, no multitasking
    (security)
  • Actual runtime depends on set of processors
    allocated and placement of other jobs
  • Goals:
  • User: minimum response time
  • Bureaucracy (GAO): high utilization

5
Scheduler/Allocator Association
  • The scheduler and allocator affect each other's
    performance.

Performance dependencies
6
Scheduler/Allocator Dissociation
[Diagram: jobs (user executable, number of processors, requested time) enter a queue; the PBS scheduler selects jobs and hands them to the node allocator, which places them on Cplant]
  • Scheduler enforces policy
  • Management sets priorities for access,
    utilization policy
  • Allocator can optimize performance

7
What's a Good Allocation?
[Figure: good vs. bad allocation on a 2D mesh]
  • Objective: allocate jobs to processors to
    minimize network contention, i.e. maximize processor locality
  • Especially important for commodity networks

8
Quantitative Effect of Processor Locality
[Figure: compact allocation roughly 2x faster; anomaly example runs faster than an allocation with an empty processor]
  • But, speed-up anomaly
9
Communication Hops on a 2D Grid
[Figure: two processor pairs at L1 distances 5 and 4]
  • L1 distance = number of hops (= number of switches)
    between 2 processors on the grid

10
Allocation Problem
  • Given a grid with n available points (other grid
    points unavailable)
  • Find a set of k available points with minimum
    average (or total) pairwise L1 distance
  • Example: green allocation 3(2) + 3(1) = 9
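The objective above can be checked on small instances by brute force. A minimal sketch (function names are illustrative, not from the talk):

```python
from itertools import combinations

def l1(a, b):
    """Manhattan (L1) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_pairwise(points):
    """Total pairwise L1 distance of an allocation."""
    return sum(l1(a, b) for a, b in combinations(points, 2))

def optimal_allocation(free, k):
    """Exhaustively pick the k free points with minimum total pairwise
    L1 distance.  Exponential in k -- only for tiny instances."""
    return min(combinations(free, k), key=total_pairwise)
```

For example, with free processors [(0,0), (0,1), (1,0), (5,5)] and k = 3, the clustered triple wins with total distance 1 + 1 + 2 = 4.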

11
Empirical Correlation
[Figure: empirical correlation between allocation locality and run time (Leung et al., 2002); related support: Mache and Lo, 1996]
12
Previous Work
  • Various work forcing a convex set
  • Insufficient processor utilization
  • Mache, Lo, Windisch: MC algorithm
  • Krumke et al.: 2-approximation; NP-hard with a
    general metric
  • Complexity open for grids
  • Dispersion problem (max distance): linear time for
    fixed k (Fekete and Meijer)

13
Optimal Unconstrained Shape (Bender, Bender, Demaine,
Fekete 2004)
Almost a circle, but not quite: only 0.05 percent
difference in area.
[Figure: optimal shape; constant 0.650245952951]
14
Our Results
  • 7/4-approximation (2 - ... in d dimensions)
  • PTAS: (1+ε)-approximation in polynomial time for
    any fixed ε > 0
  • MC is a 4-approximation
  • Linear-time exact dynamic program in 1D
  • O(n log n) time for k = 3
  • Simulations (performance on job streams)

15
An L1 Ball on a 2D Grid
[Figure: the unit L1 ball is the diamond with vertices (1,0), (0,1), (-1,0), (0,-1), bounded by the lines x + y = 1, y - x = 1, x - y = 1, and x + y = -1]
16
Possible medians of selected set
  • A median will always share its x-coordinate with an
    available point and its y-coordinate with a (possibly
    different) available point.

17
Manhattan Median (MM) Algorithm
  • For each possible median p:
  • Pick the k free processors closest to p (in L1)
  • Compute total pairwise L1 distance
  • Return the set with the smallest total distance
  • Krumke et al. (1997) previously showed this is a
    2-approximation in arbitrary metric spaces.
  • We proved it is a 7/4-approximation for L1; this
    bound is tight.
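The steps above can be sketched directly. This is a hedged illustration, not the production allocator; following the median property from the previous slide, candidate medians are taken as all (x, y) combinations of free-processor coordinates:

```python
from itertools import combinations

def l1(a, b):
    """Manhattan (L1) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mm_allocate(free, k):
    """Manhattan Median: for each candidate median, take the k closest
    free processors; keep the set with smallest total pairwise L1 distance."""
    xs = sorted({x for x, _ in free})
    ys = sorted({y for _, y in free})
    best, best_cost = None, float("inf")
    for mx in xs:
        for my in ys:
            cand = sorted(free, key=lambda p: l1(p, (mx, my)))[:k]
            cost = sum(l1(a, b) for a, b in combinations(cand, 2))
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost
```

On the small instance [(0,0), (0,1), (1,0), (5,5)] with k = 3, the median (0,0) yields the clustered triple with cost 4, which here matches the optimum.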

18
Lower Bound Instance (7/4)
19
Upper Bound Techniques
  • WLOG assume the origin is a median of OPT
  • Let M be the k points closest to the origin
  • Candidate point set for algorithm MM
  • Set returned by MM can only be better
  • Compare M to optimal
  • Assume M is the worst-case example

20
Upper Bound Techniques
  • Transform optimal and M to point placements that
    have the same performance ratio, but are easy to
    analyze
  • Transform in steps
  • Argue the ratio gets worse if we deviate from
    this form (impossible if M is the worst case)

All points of Opt and M at these 5 points
21
Simulations Performance on a Job Stream
  • We've analyzed a greedy algorithm for placing a
    single job
  • How well does it do for a stream of jobs?
  • Consider two types of algorithms:
  • Situation algorithm: places job stream prefix
    (system normal/default)
  • Decision algorithm: places current job (can be a
    one-time override)

22
Simulation Setup
[Diagram: the situation algorithm places the job-stream prefix into the current allocation; a one-time decision algorithm places the current job]
  • Job stream from LLNL Cray T3D trace
  • 21,323 jobs, 256 processors

23
Simulations: Alternative Placement Algorithm (MC)
  • Search in shell from minimum-size region of
    preferred shape.
  • Weight processors by shells
  • Return processor set with minimum weight.
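A rough sketch of the shell idea. This is an illustration only: the real MC grows shells around a minimum-size region of the preferred shape, which is simplified here to square shells (Chebyshev distance) around each candidate anchor:

```python
def mc_allocate(free, k):
    """Simplified MC sketch: for each anchor, weight free processors by
    their square-shell number around the anchor and keep the k-set with
    minimum total weight."""
    def shell(p, q):
        # Chebyshev distance = index of the square shell around q containing p
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    best, best_w = None, float("inf")
    for anchor in free:  # assumption: anchors are the free cells themselves
        cand = sorted(free, key=lambda p: shell(p, anchor))[:k]
        w = sum(shell(p, anchor) for p in cand)
        if w < best_w:
            best, best_w = cand, w
    return best
```

On [(0,0), (0,1), (1,1), (9,9)] with k = 3, every anchor in the cluster scores weight 2, so the clustered triple is returned.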

24
Alternative One-Dimensional Reduction
  • Order processors so that close in linear order
    implies close in the physical processor graph
  • Consider one-dimensional processor allocation
  • Pack jobs onto the line (or ring), allowing
    fragmentation

25
Hilbert (Space-Filling) Curves
  • For 2D and 3D grids
  • Previous applications
  • I/O efficient and cache-oblivious computation
  • Compression (images)
  • Domain decomposition
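The linear order used by the Hilbert best-fit allocator comes from the standard iterative (x, y)-to-index conversion (the classic bit-twiddling form; here `n` is the grid side, assumed a power of two):

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) on an n-by-n grid (n a power of two) to its
    distance along the Hilbert curve.  Consecutive indices are always
    grid-adjacent, which is why the curve preserves locality."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

The mapping is a bijection onto 0..n²-1, and consecutive indices differ by exactly one grid hop, so jobs packed onto contiguous index ranges stay physically close.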

26
Four Algorithms for Simulation
  • MM
  • MM Incremental improvement
  • Hilbert curve with best fit
  • MC

27
Results
  • Ordering within a row is consistent with proven
    approximation performance: MMInc, MM, MC1x1,
    HilbertBF
  • Ordering on the diagonal (normal operation) is
    approximately the opposite

28
Results
  • MM paints itself into a corner on streams
  • But it is good for a single high-priority job
  • Thought: rectangles pack better than circles

29
New System Red Storm
  • 10,368 AMD Opteron processors, 2 GHz
  • 31.2 TB Memory, 240 TB disk
  • 41.47 TF peak performance
  • 3D Mesh

30
Impact
  • Changed the node allocator on Cplant
  • 1D default allocator
  • Carried over to Red Storm system software
  • 1D algorithms current default
  • 2D algorithms implemented on Red Storm
  • Awaiting testing for use
  • R&D 100 submission (must win internal competition)

31
Questions
  • What's the right allocation for a stream
    (online)?
  • Scheduling + allocation
  • Simulation issues
  • Nondeterminism
  • Credit for good placement in timing