Title: Sequoia RFP and Benchmarking Status
1 Sequoia RFP and Benchmarking Status
UNCLASSIFIED
- Scott Futral
- Mark K. Seager
- Tom Spelce
- Lawrence Livermore National Laboratory
- 2008 SciComp Summer Meeting
2 Overview
- Sequoia Objectives
- 25-50x BlueGene/L (367 TF/s) on Science Codes
- 12-24x Purple on Integrated Design Codes
- Sequoia Procurement Strategy
- Sequoia is actually a cluster of procurements
- Risk management pervades everything
- Sequoia Target Architecture
- Driven by programmatic requirements and technical realities
- Requires innovation on several fronts
Sequoia will deliver petascale computing for the mission and pushes the envelope by 10-100x in every dimension!
3 By leveraging industry trends, Sequoia will
successfully deliver a petascale UQ engine for
the stockpile
- Sequoia Production Platform Programmatic Drivers
- UQ Engine for mission deliverables in the 2011-2015 timeframe
- Programmatic drivers require an unprecedented leap forward in computing power
- Program needs both Capability and Capacity
- 25-50x BGL (367 TF/s) for science codes (knob removal)
- 12-24x Purple for capability runs on Purple (8,192 MPI task UQ Engine)
- These requirements, met with current industry trends, drive us to a different target architecture than Purple or BGL
4 Predicting stockpile performance drives five
separate classes of petascale calculations
- Quantifying uncertainty (for all classes of simulation)
- Identifying and modeling missing physics
- Improving accuracy in material property data
- Improving models for known physical processes
- Improving the performance of complex models and
algorithms in macro-scale simulation codes
Each of these mission drivers requires petascale computing
5 Sequoia Strategy
- Two major deliverables
- Petascale Scaling Dawn Platform in 2009
- Petascale Sequoia Platform in 2011
- Lessons learned from previous capability and capacity procurements
- Leverage best-of-breed for platform, file system, SAN, and storage
- Major Sequoia procurement is for a long-term platform partnership
- Three R&D partnerships to incentivize bidders toward stretch goals
- Risk reduction built into the overall strategy from day one
- Drive procurement with a single peak mandatory requirement
- Target Peak + Sustained on marquee benchmarks
- Timescale, budget, and technical details as target requirements
- Include TCO factors such as power
6 To Minimize Risk, Dawn Deployment Extends the
Existing Purple and BG/L Integrated Simulation
Environment
- ASC Dawn is the initial delivery system for Sequoia
- Code development and scaling platform for Sequoia
- 0.5 petaFLOP/s peak for ASC production usage
- Target production 2009-2014
- Dawn Component Scaling
- Memory B:F 0.3
- Memory BW B:F 1.0
- Link BW B:F 2.0
- Min Bisection B:F 0.001
- SAN BW (GB/s per PF/s) 384
- F is peak FLOP/s
7 Sequoia Target Architecture in Integrated
Simulation Environment Enables a Diverse
Production Workload
- Diverse usage models drive platform and simulation environment requirements
- Will be 2D ultra-res and 3D high-res Quantification of Uncertainty engine
- 3D Science capability for known unknowns and unknown unknowns
- Peak of 14 petaFLOP/s with option for 20 petaFLOP/s
- Target production 2011-2016
- Sequoia Component Scaling
- Memory B:F 0.08
- Memory BW B:F 0.2
- Link BW B:F 0.1
- Min Bisection B:F 0.03
- SAN BW (GB/s per PF/s) 25.6
- F is peak FLOP/s
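To make these component-scaling ratios concrete, here is a small illustrative sketch (my own example in C, not part of the RFP) that converts the byte-to-FLOP and SAN-bandwidth ratios from the Dawn and Sequoia slides into absolute capacities, using the 0.5 PF/s Dawn peak and the 20 PF/s Sequoia option quoted above.

    /* Illustrative only: convert the B:F and SAN BW ratios from the Dawn
     * and Sequoia target slides into absolute capacities.
     * "F is peak FLOP/s"; the 20 PF/s Sequoia option is used here. */
    #include <stdio.h>

    int main(void) {
        const double PF = 1.0e15;               /* 1 petaFLOP/s            */
        double dawn_peak    = 0.5  * PF;        /* Dawn: 0.5 PF/s peak     */
        double sequoia_peak = 20.0 * PF;        /* Sequoia option: 20 PF/s */

        /* memory bytes = (B:F ratio) x peak FLOP/s */
        printf("Dawn memory    ~ %.2e bytes\n", 0.3  * dawn_peak);    /* ~1.5e14 (150 TB) */
        printf("Sequoia memory ~ %.2e bytes\n", 0.08 * sequoia_peak); /* ~1.6e15 (1.6 PB) */

        /* SAN bandwidth = (GB/s per PF/s) x peak in PF/s */
        printf("Dawn SAN BW    ~ %.0f GB/s\n", 384.0 * 0.5);          /* ~192 GB/s */
        printf("Sequoia SAN BW ~ %.0f GB/s\n", 25.6  * 20.0);         /* ~512 GB/s */
        return 0;
    }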
8 Sequoia Targets a Highly Scalable Operating System
- Light weight kernel on compute node
- Optimized for scalability and reliability
- As simple as possible. Full control
- Extremely low OS noise
- Direct access to interconnect hardware
- OS features
- Linux-compatible, with OS functions forwarded to the I/O node OS
- Support for runtime loading of dynamic libraries
- Shared memory regions
- Open source
[Figure: 1-N compute nodes (CN) served by each I/O node]
- Linux on I/O Node
- Leverage huge Linux base community
- Enhance TCP offload, PCIe, I/O
- Standard file systems: Lustre, NFSv4, etc.
- Factor to Simplify
- Aggregates N CN for I/O admin
- Open source
[Figure: Sequoia I/O node software stack - Linux/Unix running FSD, SLURMD, performance tools, and TotalView, with function-shipped syscalls, a Lustre client, NFSv4, and UDP/TCP/IP over LNet to the Sequoia ION and interconnect]
9 A light-weight kernel's diminutive-noise environment is required to scale MPI-based applications to petascale
[Figure: OS noise comparison; the figure of merit is FTQ sample kurtosis - Mac OS X ~10^3 and Desktop Linux ~10^5 (bad), TLCC/TOSS ~10^10 (better), Light-Weight Kernel ~10^15 (best)]
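To make the figure of merit concrete, below is a minimal FTQ-style (Fixed Time Quantum) probe, assuming a POSIX environment; it is a sketch of the idea, not the actual FTQ benchmark. In every fixed quantum it counts how many small work units complete; OS noise steals cycles from some quanta, and the kurtosis of the resulting sample is the figure of merit quoted above.

    /* Minimal FTQ-style noise probe (illustrative sketch, not the real FTQ).
     * A quiet kernel yields nearly identical per-quantum counts; interrupts
     * and daemons cause dips that fatten the tails of the distribution. */
    #include <stdio.h>
    #include <time.h>

    #define QUANTUM_NS 1000000LL   /* 1 ms quantum (assumed)   */
    #define SAMPLES    1000        /* number of quanta sampled */

    static long long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void) {
        long long counts[SAMPLES];
        volatile double x = 0.0;
        for (int i = 0; i < SAMPLES; i++) {
            long long n = 0, end = now_ns() + QUANTUM_NS;
            while (now_ns() < end) { x += 1e-9; n++; }  /* fixed work unit */
            counts[i] = n;
        }
        /* Post-process these counts (e.g., compute their kurtosis) to get
         * the noise figure of merit. */
        for (int i = 0; i < SAMPLES; i++) printf("%lld\n", counts[i]);
        return 0;
    }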
10 The Livermore Model developed in 1990 made
sense of the killer-micro revolution
- Livermore Model system perspective for distributed memory machines
- Interactive and batch usage
- Debugging and visualization
- Dynamically support a mix of job sizes
- Capability runs at a significant portion of the platform
- Capacity runs at a small fraction of the platform
- Dynamically support a range of runtimes
- Short run-time for setup
- Many months for science runs
- Many users require predictable progress on jobs
- Hard allocations for projects
- Implemented with Moab and SLURM
11 The Livermore Model also leveraged desktop to
teraFLOP/s development by providing consistent
programming model across multiple types and
generations of platforms
[Figure: the Livermore Model - MPI communication connects nodes, while within each node OpenMP threads operate on local memory]
Idea: Provide a consistent programming model across multiple platform generations and multiple vendors! Idea: Incrementally increase functionality over time! How do we extend this to petascale?
12 Digression: Scalability Lessons Learned
- "With this project you only have three things to worry about: scalability, scalability, and scalability." - Candy Culhane at the first BG/L external review
- For hardware scalability it is all about MTBF (a back-of-envelope sketch follows this list)
- Keep the highly replicated parts as simple as possible to get the job done, but not simpler
- For software scalability it is all about reliability
- Factor and solve allows one to take an impossible problem and factor it into two problems
- One of which is solved
- One of which is merely difficult
- For application scalability it is all about MPI scalability
- Low OS noise
- Scalable collectives
- Messaging rate
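Back-of-envelope sketch of the MTBF point above (the per-node MTBF and node counts below are assumptions for illustration, not Sequoia figures): with N independent, identical parts, system MTBF is roughly the part MTBF divided by N, which is why the highly replicated parts must be kept simple.

    /* Illustrative only: system MTBF ~ part MTBF / N for N independent parts.
     * The 10-year per-node MTBF and the node counts are assumptions. */
    #include <stdio.h>

    int main(void) {
        double part_mtbf_hours = 10.0 * 365 * 24;     /* assume 10-year MTBF per node */
        long   nodes[] = { 1024, 25600, 100000 };
        for (int i = 0; i < 3; i++)
            printf("%6ld nodes -> system MTBF ~ %.1f hours\n",
                   nodes[i], part_mtbf_hours / nodes[i]);
        return 0;
    }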
13 Each year we get faster, more processors - Moore's Law
- Historically: Boost single-stream performance via more complex chips, first via one big feature, then via lots of smaller features.
- Now: Deliver more cores per chip.
- The free lunch is over for today's sequential apps and many concurrent apps (expect some regressions). We need killer apps with lots of latent parallelism.
- A generational advance > OO is necessary to get above the threads-and-locks programming model.
[Figure: Intel CPU trends from the 386 and Pentium through Montecito (sources: Intel, Wikipedia, K. Olukotun); from Herb Sutter <hsutter@microsoft.com>]
14 How many cores are you coding for?
[Figure: projected cores per chip, annotated "You Are Here!" and "How Do We Get to Here?"]
Microprocessor parallelism will increase
exponentially in the next decade
15 How much parallelism will be required to sustain
petaFLOP/s in 2011?
- Hypothetical low-power machines will feature 1.6M to 6.6M way parallelism (arithmetic sketched after this list)
- 32-64 cores per processor and up to 2-4 threads per core
- Assume 1-socket nodes and 25.6K nodes
- Hypothetical Intel terascale-chip petascale system yields 1.5M way parallelism
- 80 cores per processor
- Assume 4-socket nodes and 4,608 nodes (32 SUs of 144 nodes with IBA)
- Holy cow, this is about 12-48x BlueGene/L!
16 Multicore processors have a non-intuitive impact on other machine characteristics
- Memory is the most critical machine characteristic
- ASC applications require >1 GiB/MPI task
- If we map MPI tasks directly to cores:
- 64 GiB/node on Low Power -> 1.6 PiB of memory, and that is 4x too expensive, if we could build and power it
- This drives us to think in terms of fewer MPI tasks/node
- 320 GiB/node on Intel -> 1.5 PiB of memory is also a problem
From Seager Dec 1997 platforms talk
17 Multicore processors drive huge network
requirements
- With multiple MPI tasks/node, short-message messaging rate becomes as important as bandwidth and latency (see the sketch after this list)
- ASC applications require >2 M msgs/s per MPI task
- If we map MPI tasks directly to cores, then we require:
- Low-power system requires 64 MPI tasks/node -> 128 M msgs/s
- 12.5 clocks/message
- This may be achievable, but is very high risk
- Intel requires 80 MPI tasks/socket and 4-socket nodes -> >640 M msgs/s
- 5 clocks/message
- This is not achievable
- This drives us to think in terms of fewer MPI tasks/node
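An illustrative check of the clocks-per-message figures above; the core clock rates below are my assumptions, since the slide states only the resulting ratios:

    /* Illustrative only: clocks/message = core clock / per-node message rate.
     * The 1.6 GHz and 3.2 GHz clock rates are assumptions, not slide data. */
    #include <stdio.h>

    int main(void) {
        /* 64 tasks/node x 2 M msgs/s/task = 128 M msgs/s per low-power node */
        double low_rate  = 64 * 2.0e6,     low_clock  = 1.6e9;   /* ~1.6 GHz assumed */
        /* 80 tasks/socket x 4 sockets x 2 M msgs/s = 640 M msgs/s per Intel node */
        double tera_rate = 80 * 4 * 2.0e6, tera_clock = 3.2e9;   /* ~3.2 GHz assumed */

        printf("low-power: %.1f clocks/message\n", low_clock  / low_rate);   /* 12.5 */
        printf("terascale: %.1f clocks/message\n", tera_clock / tera_rate);  /*  5.0 */
        return 0;
    }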
18 Sequoia Target Application Programming Model
Leverages Factor and Simplify to Scale
Applications to O(1M) Parallelism
- MPI Parallelism at top level
- Static allocation of MPI tasks to nodes and sets of cores/threads
- Allow for MPI everywhere, just in case
- Effectively absorb multiple cores/threads in the MPI task
- Support multiple languages
- C/C++/Fortran03/Python
- Allow different physics packages to express node concurrency in different ways (a minimal hybrid sketch follows this list)
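A minimal sketch of this programming model in C, assuming MPI and OpenMP (function names and loop bounds are illustrative, not from the RFP): MPI carries the top-level parallelism while each MPI task absorbs the node's cores/threads through OpenMP.

    /* Hybrid MPI + OpenMP sketch: MPI at the top level, OpenMP inside each task. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* Request thread support so OpenMP regions can coexist with MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0, global = 0.0;
        /* Node-level concurrency expressed inside the MPI task */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i);

        /* Distributed-memory (top-level) parallelism handled by MPI */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f using %d OpenMP threads per task\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }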
19 With Careful Use of Node Concurrency We Can
Support A Wide Variety of Complex Applications
[Figure: execution timeline of one MPI task between MPI_INIT and MPI_FINALIZE - MAIN on Thread0 interleaves MPI calls with OpenMP regions in Funct1 and TM/SE regions in Funct2, Threads 1-3 join each parallel region, and MAIN exits after MPI_FINALIZE]
- Pthreads born with MAIN
- Only Thread0 calls functions to nest parallelism
- Pthreads based MAIN calls OpenMP based Funct1
- OpenMP Funct1 calls TM/SE based Funct2
- Funct2 returns to OpenMP based Funct1
- Funct1 returns to Pthreads based MAIN
- MPI tasks on a node are processes (one shown) with multiple OS threads (Thread0-3 shown)
- Thread0 is the main thread; Thread1-3 are helper threads that morph from Pthread to OpenMP worker to TM/SE compiler-generated threads via runtime support
- Hardware support to significantly reduce overheads for thread repurposing and OpenMP loops and locks
20 Sequoia Distributed Software Stack Targets a Familiar Environment for Easy Application Porting
[Figure: Sequoia software stack. User space: the application, code development tools, C/C++/Fortran compilers and Python, parallel and optimized math libs, OpenMP/threads/SE-TM, the C lib/F03 runtime, MPI2 over ADI, and sockets. Kernel space: LWK or Linux with function-shipped syscalls, a Lustre client, UDP/TCP/IP, and LNet over the interconnect interface to the external network.]
21 Consistent Software Development Tools for
Livermore Model from Desktop and Linux Clusters
to Sequoia
- Gnu build tools
- Math libs
- Static analysis tools
- Compilers: C/C++/Fortran, Python
- Runtime tools
- IDEs (Eclipse), GUIs
- Emulators (for unique HW features)
- Code steering
Open source, seamless environment from desktop to clusters to petascale; vendor and ISV components are negotiable
22 Sequoia Platform Target Performance is a
Combination of Peak and Application Sustained
Performance
- Peak of the machine is absolute maximum performance
- FLOP/s: FLoating point OPerations per second
- Sustained is a weighted average of the Figures of Merit of five marquee benchmark codes
- Four IDC package benchmarks and one science workload benchmark from SNL
- FOM chosen to mimic grind times and factor out scaling issues
Reference points: BlueGene/L ~0.4 PF/s peak, Purple ~0.1 PF/s peak
23 Sequoia Benchmarks have already incentivized the
industry to work on problems relevant to our
mission needs
- What's missing?
- Hydrodynamics
- Structural mechanics
- Quantum MD
24 Validation and Benchmark Efforts
- Platforms
- Purple (IBM Power5, AIX)
- BGL (IBM PPC440, LWK)
- BGP (IBM PPC450, LWK, SMP)
- ATLAS (AMD Opteron, TOSS)
- Red Storm (AMD Opteron, Catamount)
- Franklin (AMD Opteron, CNL)
- Phoenix (Vector, UNICOS)
25 The strategy for aggregating performance
incentivizes vendors in two ways.
1. Peak (petaFLOP/s)
2. MPI tasks per node < (memory per node) / 2 GB
awFOM = wFOM_AMG + wFOM_IRS + wFOM_SPhot + wFOM_UMT + wFOM_LAMMPS
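As a small sketch of how the aggregate weighted figure of merit combines the five benchmark FOMs (the FOM values and weights below are placeholders, not the RFP's actual weights):

    /* Sketch of the aggregate weighted figure of merit defined above.
     * Weights and FOM values are placeholders; the RFP sets the real ones. */
    #include <stdio.h>

    int main(void) {
        const char *codes[]  = { "AMG", "IRS", "SPhot", "UMT", "LAMMPS" };
        double      fom[]    = { 1.0, 1.0, 1.0, 1.0, 1.0 };  /* measured FOMs (placeholders) */
        double      weight[] = { 0.2, 0.2, 0.2, 0.2, 0.2 };  /* assumed equal weights        */

        double awfom = 0.0;
        for (int i = 0; i < 5; i++) {
            awfom += weight[i] * fom[i];      /* awFOM = sum_i w_i * FOM_i */
            printf("%-6s contributes %f\n", codes[i], weight[i] * fom[i]);
        }
        printf("awFOM = %f\n", awfom);
        return 0;
    }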
26 AMG Results
27 AMG message size distribution
An improved messaging rate would significantly
impact AMG communication performance.
28 UMT and SPhot results
29 Observations of UMT messaging rate indicate that we need messaging rate as an interconnect requirement
Messaging is very bursty, and most messaging
occurs at a high messaging rate.
30 IRS - Implicit Radiation Solver results
31 IRS load imbalance has two components: compute and communications
IMBALANCE (MAX / AVG)
PEs     Model    Power5   BG/L    Red Storm
512     1.1429   1.521    1.061   -
1,000   1.1111   1.487    1.092   1.064
2,197   1.0833   1.428    1.080   1.052
4,096   1.0667   1.352    1.067   1.030
8,000   1.0526   1.052    -       -
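A minimal sketch of how the MAX/AVG imbalance metric in the table can be measured in an MPI code, by timing a phase on every rank and reducing across ranks; the phase being timed is left as a comment and the names are illustrative.

    /* Sketch: measure imbalance = max(phase time) / avg(phase time) across ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double t0 = MPI_Wtime();
        /* ... the compute (or communication) phase being measured ... */
        double t = MPI_Wtime() - t0;

        double tmax, tsum;
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0 && tsum > 0.0)
            printf("imbalance (max/avg) = %f\n", tmax / (tsum / size));

        MPI_Finalize();
        return 0;
    }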
32 Summary
- Sequoia is a carefully choreographed risk mitigation strategy to develop and deliver a huge leap forward in computing power to the National Stockpile Stewardship Program
- Sequoia will work for weapons science and integrated design codes when delivered because of our evolutionary approach that yields a revolutionary advance on multiple fronts
- The groundwork on system requirements, benchmarks, and SOW is in place for the launch of a successful procurement competition for Sequoia