Transcript and Presenter's Notes

Title: Sequoia RFP and Benchmarking Status


1
Sequoia RFP and Benchmarking Status
UNCLASSIFIED
  • Scott Futral
  • Mark K. Seager
  • Tom Spelce
  • Lawrence Livermore National Laboratory
  • 2008 SciComp Summer Meeting

2
Overview
  • Sequoia Objectives
  • 25-50x BlueGene/L (367TF/s) on Science Codes
  • 12-24x Purple on Integrated Design Codes
  • Sequoia Procurement Strategy
  • Sequoia is actually a cluster of procurements
  • Risk management pervades everything
  • Sequoia Target Architecture
  • Driven by programmatic requirements and technical
    realities
  • Requires innovation on several fronts

Sequoia will deliver petascale computing for the
mission and push the envelope by 10-100x in
every dimension!
3
By leveraging industry trends, Sequoia will
successfully deliver a petascale UQ engine for
the stockpile
  • Sequoia Production Platform Programmatic Drivers
  • UQ Engine for mission deliverables in the
    2011-2015 timeframe
  • Programmatic drivers require an unprecedented leap
    forward in computing power
  • Program needs both Capability and Capacity
  • 25-50x BGL (367TF/s) for science codes (knob
    removal)
  • 12-24x Purple for capability runs on Purple
    (8,192 MPI tasks UQ Engine)
  • These requirements, taken together with current
    industry trends, drive us to a different target
    architecture than Purple or BGL

4
Predicting stockpile performance drives five
separate classes of petascale calculations
  • Quantifying uncertainty (for all classes of
    simulation)
  • Identifying and modeling missing physics
  • Improving accuracy in material property data
  • Improving models for known physical processes
  • Improving the performance of complex models and
    algorithms in macro-scale simulation codes

Each of these mission drivers requires petascale
computing
5
Sequoia Strategy
  • Two major deliverables
  • Petascale Scaling Dawn Platform in 2009
  • Petascale Sequoia Platform in 2011
  • Lessons learned from previous capability and
    capacity procurements
  • Leverage best-of-breed for platform, file system,
    SAN and storage
  • Major Sequoia procurement is for long term
    platform partnership
  • Three R&D partnerships to incentivize bidders
    toward stretch goals
  • Risk reduction built into the overall strategy
    from day one
  • Drive procurement with a single mandatory peak
    requirement
  • Target peak and sustained performance on marquee
    benchmarks
  • Timescale, budget, technical details as target
    requirements
  • Include TCO factors such as power

6
To Minimize Risk, Dawn Deployment Extends the
Existing Purple and BG/L Integrated Simulation
Environment
  • ASC Dawn is the initial delivery system for
    Sequoia
  • Code development platform and scaling for Sequoia
  • 0.5 petaFLOP/s peak for ASC production usage
  • Target production 2009-2014
  • Dawn Component Scaling
  • Memory B/F: 0.3
  • Memory BW B/F: 1.0
  • Link BW B/F: 2.0
  • Min bisection B/F: 0.001
  • SAN (GB/s per PF/s): 384
  • F is peak FLOP/s

7
Sequoia Target Architecture in Integrated
Simulation Environment Enables a Diverse
Production Workload
  • Diverse usage models drive platform and
    simulation environment requirements
  • Will be a 2D ultra-res and 3D high-res
    Quantification of Uncertainty (UQ) engine
  • 3D Science capability for known unknowns and
    unknown unknowns
  • Peak of 14 petaFLOP/s with option for 20
    petaFLOP/s
  • Target production 2011-2016
  • Sequoia Component Scaling
  • Memory B/F: 0.08
  • Memory BW B/F: 0.2
  • Link BW B/F: 0.1
  • Min bisection B/F: 0.03
  • SAN BW (GB/s per PF/s): 25.6
  • F is peak FLOP/s
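Reading each B/F value as bytes (or bytes/s) per peak
FLOP/s, a back-of-envelope check of what these ratios
imply at the 14 petaFLOP/s target (the Dawn table on
the previous slide reads the same way at 0.5
petaFLOP/s):

    Memory:         14 PF/s × 0.08 B/F ≈ 1.1 PB
    Memory BW:      14 PF/s × 0.2 B/F  ≈ 2.8 PB/s
    Link BW:        14 PF/s × 0.1 B/F  ≈ 1.4 PB/s
    Min bisection:  14 PF/s × 0.03 B/F ≈ 0.4 PB/s
    SAN BW:         14 PF/s × 25.6 GB/s per PF/s ≈ 360 GB/s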

8
Sequoia Targets A Highly Scalable Operating System
  • Light weight kernel on compute node
  • Optimized for scalability and reliability
  • As simple as possible. Full control
  • Extremely low OS noise
  • Direct access to interconnect hardware
  • OS features
  • Linux compatible with OS functions forwarded to
    I/O node OS
  • Support for runtime loading of dynamic libraries
  • Shared memory regions
  • Open source

  • Linux on I/O Node
  • Leverage huge Linux base community
  • Enhance TCP offload, PCIe, I/O
  • Standard file systems: Lustre, NFSv4, etc.
  • Factor to simplify
  • Aggregates N CN for I/O admin
  • Open source

[Diagram: 1-N compute nodes are served by each I/O
node; the I/O-node stack shows FSD, SLURMD, perf
tools, TotalView, Linux/Unix, function-shipped
syscalls, Lustre client, NFSv4, UDP/TCP/IP, and LNet
over the Sequoia ION and interconnect]
9
A light-weight kernel's diminutive-noise environment
is required to scale MPI-based applications to
petascale
[Chart: OS noise ranked by FTQ sample kurtosis, the
figure of merit: Mac OS X ~10^3 and desktop Linux
~10^5 (bad), TLCC/TOSS ~10^10 (better), light-weight
kernel ~10^15 (best)]
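To illustrate the figure of merit, below is a minimal
FTQ-style noise probe sketched in C (an illustration
only, not the actual FTQ benchmark; the quantum
length and sample count are arbitrary choices). It
counts how much work fits into each fixed time
quantum and reports the sample kurtosis of the
counts: a quiet light-weight kernel leaves the counts
nearly identical apart from rare outliers, driving
kurtosis very high, while a noisy OS spreads them out
and lowers it.

    /* Minimal FTQ-style OS noise probe (illustrative sketch only).
     * Count the work completed in each fixed time quantum, then
     * compute the sample kurtosis of those counts. */
    #include <stdio.h>
    #include <math.h>
    #include <time.h>

    #define NSAMPLES   1000
    #define QUANTUM_NS 1000000L        /* 1 ms quantum (arbitrary choice) */

    static double now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void)
    {
        double counts[NSAMPLES];

        /* record how many loop passes fit in each quantum */
        for (int i = 0; i < NSAMPLES; i++) {
            double start = now_ns(), work = 0.0;
            while (now_ns() - start < QUANTUM_NS)
                work += 1.0;
            counts[i] = work;
        }

        /* sample mean, second and fourth central moments, kurtosis */
        double mean = 0.0, m2 = 0.0, m4 = 0.0;
        for (int i = 0; i < NSAMPLES; i++)
            mean += counts[i] / NSAMPLES;
        for (int i = 0; i < NSAMPLES; i++) {
            double d = counts[i] - mean;
            m2 += d * d / NSAMPLES;
            m4 += d * d * d * d / NSAMPLES;
        }
        printf("FTQ sample kurtosis = %g\n",
               m2 > 0.0 ? m4 / (m2 * m2) : INFINITY);
        return 0;
    }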
10
The Livermore Model, developed in 1990, made sense
of the killer-micro revolution
  • Livermore Model: a system perspective for
    distributed-memory machines
  • Interactive and batch usage
  • Debugging and visualization
  • Dynamically support a mix of job sizes
  • Capability runs at a significant portion of
    platform
  • Capacity runs at a small fraction of platform
  • Dynamically support a range of runtimes
  • Short run-time for setup
  • Many months for science runs
  • Many users require predictable progress on jobs
  • Hard allocations for projects
  • Implemented with Moab and SLURM

11
The Livermore Model also leveraged desktop to
teraFLOP/s development by providing a consistent
programming model across multiple types and
generations of platforms
[Diagram: MPI communications connect the nodes;
within each node, OpenMP threads operate on local
memory]
Idea: Provide a consistent programming model for
multiple platform generations and across multiple
vendors! Idea: Incrementally increase functionality
over time! How do we extend this to petascale?
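A minimal sketch of this pattern, assuming only a
generic MPI + OpenMP toolchain (hypothetical example
code, not taken from the deck): MPI carries
communication between tasks, OpenMP threads share
each task's local data, and the same source runs from
a desktop to a cluster.

    /* Livermore Model pattern, sketched: MPI across tasks,
     * OpenMP within a task, local data per task. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local[1024], sum = 0.0;

        /* node-level concurrency: OpenMP threads share this task's data */
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < 1024; i++) {
            local[i] = rank + i * 1e-3;
            sum += local[i];
        }

        /* top-level concurrency: MPI collective across all tasks */
        double global = 0.0;
        MPI_Allreduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d tasks = %g\n", nranks, global);
        MPI_Finalize();
        return 0;
    }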
12
Digression: Scalability Lessons Learned
  • "With this project you only have three things to
    worry about: scalability, scalability, and
    scalability." (Candy Culhane at the first BG/L
    external review)
  • For hardware scalability it is all about MTBF
  • Keep the highly replicated parts as simple as
    possible to get the job done, but not simpler
  • For software scalability it's all about
    reliability
  • Factor and solve allows one to take an impossible
    problem and factor it into two problems
  • One of which is solved
  • One of which is merely difficult
  • For application scalability it is all about MPI
    scalability
  • Low OS noise
  • Scalable collectives
  • Messaging rate

13
Each year we get faster, more processors (Moore's
Law)
  • Historically: Boost single-stream performance via
    more complex chips, first via one big feature,
    then via lots of smaller features.
  • Now: Deliver more cores per chip.
  • The free lunch is over for today's sequential
    apps and many concurrent apps (expect some
    regressions). We need killer apps with lots of
    latent parallelism.
  • A generational advance (greater than OO) is
    necessary to get above the threads-and-locks
    programming model.

[Chart: Intel CPU trends from the 386 and Pentium
through Montecito (sources: Intel, Wikipedia,
K. Olukotun). From Herb Sutter
<hsutter@microsoft.com>]
14
How many cores are you coding for?
[Chart: projected growth in cores per chip, annotated
"You Are Here!" and "How Do We Get to Here?"]
Microprocessor parallelism will increase
exponentially in the next decade
15
How much parallelism will be required to sustain
petaFLOP/s in 2011?
  • Hypothetical low-power machines will feature 1.6M-
    to 6.6M-way parallelism
  • 32-64 cores per processor and up to 2-4 threads
    per core
  • Assume 1-socket nodes and 25.6K nodes
  • A hypothetical petascale system of Intel terascale
    chips yields 1.5M-way parallelism
  • 80 cores per processor
  • Assume 4-socket nodes and 4,608 nodes (32 SUs of
    144 nodes with IBA)
  • Holy cow, this is about 12-48x BlueGene/L!
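These figures follow from multiplying out the assumed
machine parameters (and taking BlueGene/L at its
367 TF/s configuration of roughly 131K cores):

    Low power, low end:   32 cores × 2 threads × 25,600 nodes ≈ 1.6M
    Low power, high end:  64 cores × 4 threads × 25,600 nodes ≈ 6.6M
    Intel terascale:      80 cores × 4 sockets × 4,608 nodes  ≈ 1.5M
    Versus BG/L:          1.6M-6.6M / ~131K ≈ 12-50x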

16
Multicore processors have a non-intuitive impact on
other machine characteristics
  • Memory is the most critical machine
    characteristic
  • ASC applications require >1 GiB/MPI task
  • If we map MPI tasks directly to cores
  • 64 GiB/node on Low Power → 1.6 PiB of memory, and
    that is 4x too expensive, even if we could build
    and power it
  • This drives us to think in terms of fewer MPI
    tasks/node
  • 320 GiB/node on Intel → 1.5 PiB of memory, which
    is also a problem

From Seager Dec 1997 platforms talk
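The totals follow from >1 GiB per MPI task and the
node counts assumed on the previous slide (64 and 320
cores per node respectively):

    Low power:  64 GiB/node  × 25,600 nodes ≈ 1.6 PiB
    Intel:      320 GiB/node × 4,608 nodes  ≈ 1.4 PiB (roughly the 1.5 PiB above)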
17
Multicore processors drive huge network
requirements
  • With multiple MPI tasks/node, short-message
    messaging rate becomes as important as bandwidth
    and latency
  • ASC applications require >2 M msgs/s per MPI task
  • If we map MPI tasks directly to cores, then we
    require
  • Low power system requires 64 MPI tasks/node → 128
    M msgs/s
  • 12.5 clocks/message
  • This may be achievable, but is very high risk
  • Intel requires 80 MPI tasks/socket and 4-socket
    nodes → >640 M msgs/s
  • 5 clocks/message
  • This is not achievable
  • This drives us to think in terms of fewer MPI
    tasks/node
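The aggregate rates are the per-task requirement
times the tasks per node; the clocks-per-message
figures appear to assume core clocks of roughly
1.6 GHz and 3.2 GHz respectively (an inference, not
stated on the slide):

    Low power:  64 tasks/node × 2 M msgs/s ≈ 128 M msgs/s;  1.6 GHz / 128 M msgs/s ≈ 12.5 clocks/msg
    Intel:      320 tasks/node × 2 M msgs/s = 640 M msgs/s;  3.2 GHz / 640 M msgs/s = 5 clocks/msg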

18
Sequoia Target Application Programming Model
Leverages Factor and Simplify to Scale
Applications to O(1M) Parallelism
  • MPI Parallelism at top level
  • Static allocation of MPI tasks to nodes and sets
    of cores/threads
  • Allow for MPI everywhere, just in case
  • Effectively absorb multiple cores/threads in each
    MPI task
  • Support multiple languages
  • C/C++/Fortran03/Python
  • Allow different physics packages to express node
    concurrency in different ways

19
With Careful Use of Node Concurrency We can
Support A Wide Variety of Complex Applications
[Diagram: timeline of one MPI task between MPI_INIT
and MPI_FINALIZE; MAIN on Thread0 interleaves MPI
calls with OpenMP parallel regions (Funct1) and TM/SE
regions (Funct2) executed across Thread0-Thread3,
then Exit]
  • Pthreads born with MAIN
  • Only Thread0 calls functions to nest parallelism
  • Pthreads based MAIN calls OpenMP based Funct1
  • OpenMP Funct1 calls TM/SE based Funct2
  • Funct2 returns to OpenMP based Funct1
  • Funct1 returns to Pthreads based MAIN
  • MPI Tasks on a node are processes (one shown)
    with multiple OS threads (Thread0-3 shown)
  • Thread0 is the main thread; Thread1-3 are helper
    threads that morph from Pthreads to OpenMP workers
    to TM/SE compiler-generated threads via runtime
    support
  • Hardware support to significantly reduce
    overheads for thread repurposing and OpenMP loops
    and locks
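A minimal sketch of mixing threading models inside
one task, assuming nothing beyond stock Pthreads and
OpenMP (the thread repurposing described above relies
on hardware and runtime support that this sketch does
not model):

    /* Pthreads-based MAIN calling an OpenMP-based Funct1 (sketch). */
    #include <pthread.h>
    #include <omp.h>
    #include <stdio.h>

    static void *helper(void *arg)     /* explicit Pthread work in MAIN */
    {
        (void)arg;
        printf("pthread helper running\n");
        return NULL;
    }

    static double funct1(int n)        /* node concurrency via OpenMP */
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; i++)
            sum += i * 0.5;
        return sum;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, helper, NULL);    /* Pthreads phase */
        pthread_join(t, NULL);

        printf("funct1 sum = %g\n", funct1(1000)); /* OpenMP phase */
        return 0;
    }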

20
Sequoia Distributed Software Stack Targets
Familiar Environment for Easy Application Porting
[Diagram: Sequoia node software stack. User space:
application; code development tools; C/C++/Fortran
compilers, Python; parallel math libs; OpenMP,
threads, SE/TM; optimized math libs; MPI2 over ADI;
sockets; C lib/F03 runtime. Kernel space: LWK or
Linux; function-shipped syscalls; Lustre client;
UDP, TCP, IP; LNet; interconnect interface; external
network]
21
Consistent Software Development Tools for
Livermore Model from Desktop and Linux Clusters
to Sequoia
  • GNU build tools
  • Math libs
  • Static analysis tools
  • Compilers (C/C++/Fortran), Python
  • Runtime tools
  • IDEs (Eclipse), GUIs
  • Emulators (for unique HW features)
  • Code steering

Open source, seamless environment spanning desktop,
clusters, and petascale
Vendor and ISV components are negotiable
22
Sequoia Platform Target Performance is a
Combination of Peak and Application Sustained
Performance
  • Peak of the machine is its absolute maximum
    performance
  • FLOP/s = FLoating point OPerations per second
  • Sustained is a weighted average of the five
    marquee benchmark codes' Figures of Merit
  • Four IDC package benchmarks and one science
    workload benchmark from SNL
  • FOMs chosen to mimic grind times and factor out
    scaling issues

BlueGene/L 0.4 PF/s
Purple 0.1 PF/s
23
Sequoia Benchmarks have already incentivized the
industry to work on problems relevant to our
mission needs
  • What's missing?
  • Hydrodynamics
  • Structural mechanics
  • Quantum MD

24
Validation and Benchmark Efforts
  • Platforms
  • Purple (IBM Power5, AIX)
  • BGL (IBM PPC440, LWK)
  • BGP (IBM PPC450, LWK, SMP)
  • ATLAS (AMD Opteron, TOSS)
  • Red Storm (AMD Opteron, Catamount)
  • Franklin (AMD Opteron, CNL)
  • Phoenix (Vector, UNICOS)

25
The strategy for aggregating performance
incentivizes vendors in two ways.
1. Peak (petaFLOP/s)
2. MPI tasks/node < memory per node / 2 GB

awFOM = wFOM_AMG + wFOM_IRS + wFOM_SPhot + wFOM_UMT + wFOM_LAMMPS
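Each wFOM term is presumably a weighted per-benchmark
figure of merit; with weights w_i (not given on this
slide) the aggregate reads:

    awFOM = sum over i of w_i × FOM_i,   i in {AMG, IRS, SPhot, UMT, LAMMPS}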
26
AMG Results
27
AMG message size distribution
An improved messaging rate would significantly
impact AMG communication performance.
28
UMT and SPhot results
29
Observations of messaging rate for UMT indicate that
messaging rate must be included as an interconnect
requirement
Messaging is very bursty, and most messaging
occurs at a high messaging rate.
30
IRS (Implicit Radiation Solver) results
31
IRS load imbalance has two components: compute
and communications
IMBALANCE (MAX / AVG)

PE       Model     Power5    BG/L      Red Storm
512      1.1429    1.521     1.061
1,000    1.1111    1.487     1.092     1.064
2,197    1.0833    1.428     1.080     1.052
4,096    1.0667    1.352     1.067     1.030
8,000    1.0526    1.052
32
Summary
  • Sequoia is a carefully choreographed risk
    mitigation strategy to develop and deliver a huge
    leap forward in computing power to the National
    Stockpile Stewardship Program
  • Sequoia will work for weapons science and
    integrated design codes when delivered because our
    evolutionary approach yields a revolutionary
    advance on multiple fronts
  • The groundwork of system requirements,
    benchmarks, and SOW is in place for the launch of
    a successful procurement competition for Sequoia