Transcript and Presenter's Notes

Title: Sequoia RFP and Benchmarking Status


1
Sequoia RFP and Benchmarking Status
UNCLASSIFIED
  • Scott Futral
  • Mark K. Seager
  • Tom Spelce
  • Lawrence Livermore National Laboratory
  • 2008 SciComp Summer Meeting

2
Overview
  • Sequoia Objectives
  • 25-50x BlueGene/L (367TF/s) on Science Codes
  • 12-24x Purple on Integrated Design Codes
  • Sequoia Procurement Strategy
  • Sequoia is actually a cluster of procurements
  • Risk management pervades everything
  • Sequoia Target Architecture
  • Driven by programmatic requirements and technical
    realities
  • Requires innovation on several fronts

Sequoia will deliver petascale computing for the
mission and push the envelope by 10-100x in
every dimension!
3
By leveraging industry trends, Sequoia will
successfully deliver a petascale UQ engine for
the stockpile
  • Sequoia Production Platform Programmatic Drivers
  • UQ Engine for mission deliverables in the
    2011-2015 timeframe
  • Programmatic drivers require an unprecedented leap
    forward in computing power
  • Program needs both Capability and Capacity
  • 25-50x BGL (367TF/s) for science codes (knob
    removal)
  • 12-24x Purple for capability runs on Purple
    (8,192 MPI tasks UQ Engine)
  • These requirements, taken together with current
    industry trends, drive us to a different target
    architecture than Purple or BGL

4
Predicting stockpile performance drives five
separate classes of petascale calculations
  • Quantifying uncertainty (for all classes of
    simulation)
  • Identifying and modeling missing physics
  • Improving accuracy in material property data
  • Improving models for known physical processes
  • Improving the performance of complex models and
    algorithms in macro-scale simulation codes

Each of these mission drivers requires petascale
computing
5
Sequoia Strategy
  • Two major deliverables
  • Petascale Scaling Dawn Platform in 2009
  • Petascale Sequoia Platform in 2011
  • Lessons learned from previous capability and
    capacity procurements
  • Leverage best-of-breed for platform, file system,
    SAN and storage
  • Major Sequoia procurement is for long term
    platform partnership
  • Three R&D partnerships to incentivize bidders
    toward stretch goals
  • Risk reduction built into the overall strategy
    from day one
  • Drive procurement with a single mandatory peak
    requirement
  • Target peak and sustained performance on marquee
    benchmarks
  • Timescale, budget, technical details as target
    requirements
  • Include TCO factors such as power

6
To Minimize Risk, Dawn Deployment Extends the
Existing Purple and BG/L Integrated Simulation
Environment
  • ASC Dawn is the initial delivery system for
    Sequoia
  • Code development platform and scaling for Sequoia
  • 0.5 petaFLOP/s peak for ASC production usage
  • Target production 2009-2014
  • Dawn Component Scaling
  • Memory B/F: 0.3
  • Memory BW B/F: 1.0
  • Link BW B/F: 2.0
  • Min bisection B/F: 0.001
  • SAN (GB/s per PF/s): 384
  • F is peak FLOP/s

7
Sequoia Target Architecture in Integrated
Simulation Environment Enables a Diverse
Production Workload
  • Diverse usage models drive platform and
    simulation environment requirements
  • Will be a 2D ultra-res and 3D high-res
    Quantification of Uncertainty (UQ) engine
  • 3D Science capability for known unknowns and
    unknown unknowns
  • Peak of 14 petaFLOP/s with option for 20
    petaFLOP/s
  • Target production 2011-2016
  • Sequoia Component Scaling
  • Memory B/F: 0.08
  • Memory BW B/F: 0.2
  • Link BW B/F: 0.1
  • Min bisection B/F: 0.03
  • SAN BW (GB/s per PF/s): 25.6
  • F is peak FLOP/s
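Reading each B/F value as bytes (or bytes/s) per peak
FLOP/s, a back-of-envelope check of what these ratios
imply at the 14 petaFLOP/s target (the Dawn table on
the previous slide reads the same way at 0.5
petaFLOP/s):

    Memory:         14 PF/s × 0.08 B/F ≈ 1.1 PB
    Memory BW:      14 PF/s × 0.2 B/F  ≈ 2.8 PB/s
    Link BW:        14 PF/s × 0.1 B/F  ≈ 1.4 PB/s
    Min bisection:  14 PF/s × 0.03 B/F ≈ 0.4 PB/s
    SAN BW:         14 PF/s × 25.6 GB/s per PF/s ≈ 360 GB/s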

8
Sequoia Targets A Highly Scalable Operating System
  • Light weight kernel on compute node
  • Optimized for scalability and reliability
  • As simple as possible. Full control
  • Extremely low OS noise
  • Direct access to interconnect hardware
  • OS features
  • Linux compatible with OS functions forwarded to
    I/O node OS
  • Support for runtime loading of dynamic libraries
  • Shared memory regions
  • Open source

  • Linux on I/O Node
  • Leverage huge Linux base community
  • Enhance TCP offload, PCIe, I/O
  • Standard file systems: Lustre, NFSv4, etc.
  • Factor to simplify
  • Aggregates N CN for I/O admin
  • Open source

[Diagram: 1-N compute nodes are served by each I/O
node; the I/O-node stack shows FSD, SLURMD, perf
tools, TotalView, Linux/Unix, function-shipped
syscalls, Lustre client, NFSv4, UDP/TCP/IP, and LNet
over the Sequoia ION and interconnect]
9
A light-weight kernel's diminutive-noise environment
is required to scale MPI-based applications to
petascale
[Chart: OS noise ranked by FTQ sample kurtosis, the
figure of merit: Mac OS X ~10^3 and desktop Linux
~10^5 (bad), TLCC/TOSS ~10^10 (better), light-weight
kernel ~10^15 (best)]
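To illustrate the figure of merit, below is a minimal
FTQ-style noise probe sketched in C (an illustration
only, not the actual FTQ benchmark; the quantum
length and sample count are arbitrary choices). It
counts how much work fits into each fixed time
quantum and reports the sample kurtosis of the
counts: a quiet light-weight kernel leaves the counts
nearly identical apart from rare outliers, driving
kurtosis very high, while a noisy OS spreads them out
and lowers it.

    /* Minimal FTQ-style OS noise probe (illustrative sketch only).
     * Count the work completed in each fixed time quantum, then
     * compute the sample kurtosis of those counts. */
    #include <stdio.h>
    #include <math.h>
    #include <time.h>

    #define NSAMPLES   1000
    #define QUANTUM_NS 1000000L        /* 1 ms quantum (arbitrary choice) */

    static double now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void)
    {
        double counts[NSAMPLES];

        /* record how many loop passes fit in each quantum */
        for (int i = 0; i < NSAMPLES; i++) {
            double start = now_ns(), work = 0.0;
            while (now_ns() - start < QUANTUM_NS)
                work += 1.0;
            counts[i] = work;
        }

        /* sample mean, second and fourth central moments, kurtosis */
        double mean = 0.0, m2 = 0.0, m4 = 0.0;
        for (int i = 0; i < NSAMPLES; i++)
            mean += counts[i] / NSAMPLES;
        for (int i = 0; i < NSAMPLES; i++) {
            double d = counts[i] - mean;
            m2 += d * d / NSAMPLES;
            m4 += d * d * d * d / NSAMPLES;
        }
        printf("FTQ sample kurtosis = %g\n",
               m2 > 0.0 ? m4 / (m2 * m2) : INFINITY);
        return 0;
    }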
10
The Livermore Model, developed in 1990, made sense
of the killer-micro revolution
  • Livermore Model: a system perspective for
    distributed-memory machines
  • Interactive and batch usage
  • Debugging and visualization
  • Dynamically support a mix of job sizes
  • Capability runs at a significant portion of
    platform
  • Capacity runs at a small fraction of platform
  • Dynamically support a range of runtimes
  • Short run-time for setup
  • Many months for science runs
  • Many users require predictable progress on jobs
  • Hard allocations for projects
  • Implemented with Moab and SLURM

11
The Livermore Model also leveraged desktop to
teraFLOP/s development by providing a consistent
programming model across multiple types and
generations of platforms
[Diagram: MPI communications connect the nodes;
within each node, OpenMP threads operate on local
memory]
Idea: Provide a consistent programming model for
multiple platform generations and across multiple
vendors! Idea: Incrementally increase functionality
over time! How do we extend this to petascale?
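A minimal sketch of this pattern, assuming only a
generic MPI + OpenMP toolchain (hypothetical example
code, not taken from the deck): MPI carries
communication between tasks, OpenMP threads share
each task's local data, and the same source runs from
a desktop to a cluster.

    /* Livermore Model pattern, sketched: MPI across tasks,
     * OpenMP within a task, local data per task. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local[1024], sum = 0.0;

        /* node-level concurrency: OpenMP threads share this task's data */
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < 1024; i++) {
            local[i] = rank + i * 1e-3;
            sum += local[i];
        }

        /* top-level concurrency: MPI collective across all tasks */
        double global = 0.0;
        MPI_Allreduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d tasks = %g\n", nranks, global);
        MPI_Finalize();
        return 0;
    }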
12
Digression: Scalability Lessons Learned
  • "With this project you only have three things to
    worry about: scalability, scalability, and
    scalability." (Candy Culhane at the first BG/L
    external review)
  • For hardware scalability it is all about MTBF
  • Keep the highly replicated parts as simple as
    possible to get the job done, but not simpler
  • For software scalability it's all about
    reliability
  • Factor and solve allows one to take an impossible
    problem and factor it into two problems
  • One of which is solved
  • One of which is merely difficult
  • For application scalability it is all about MPI
    scalability
  • Low OS noise
  • Scalable collectives
  • Messaging rate

13
Each year we get faster, more processors (Moore's
Law)
  • Historically: Boost single-stream performance via
    more complex chips, first via one big feature,
    then via lots of smaller features.
  • Now: Deliver more cores per chip.
  • The free lunch is over for today's sequential
    apps and many concurrent apps (expect some
    regressions). We need killer apps with lots of
    latent parallelism.
  • A generational advance (greater than OO) is
    necessary to get above the threads-and-locks
    programming model.

[Chart: Intel CPU trends from the 386 and Pentium
through Montecito (sources: Intel, Wikipedia,
K. Olukotun). From Herb Sutter
<hsutter@microsoft.com>]
14
How many cores are you coding for?
[Chart: projected growth in cores per chip, annotated
"You Are Here!" and "How Do We Get to Here?"]
Microprocessor parallelism will increase
exponentially in the next decade
15
How much parallelism will be required to sustain
petaFLOP/s in 2011?
  • Hypothetical low-power machines will feature 1.6M-
    to 6.6M-way parallelism
  • 32-64 cores per processor and up to 2-4 threads
    per core
  • Assume 1-socket nodes and 25.6K nodes
  • A hypothetical petascale system of Intel terascale
    chips yields 1.5M-way parallelism
  • 80 cores per processor
  • Assume 4-socket nodes and 4,608 nodes (32 SUs of
    144 nodes with IBA)
  • Holy cow, this is about 12-48x BlueGene/L!
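These figures follow from multiplying out the assumed
machine parameters (and taking BlueGene/L at its
367 TF/s configuration of roughly 131K cores):

    Low power, low end:   32 cores × 2 threads × 25,600 nodes ≈ 1.6M
    Low power, high end:  64 cores × 4 threads × 25,600 nodes ≈ 6.6M
    Intel terascale:      80 cores × 4 sockets × 4,608 nodes  ≈ 1.5M
    Versus BG/L:          1.6M-6.6M / ~131K ≈ 12-50x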

16
Multicore processors have a non-intuitive impact on
other machine characteristics
  • Memory is the most critical machine
    characteristic
  • ASC applications require >1 GiB/MPI task
  • If we map MPI tasks directly to cores
  • 64 GiB/node on Low Power → 1.6 PiB of memory, and
    that is 4x too expensive, even if we could build
    and power it
  • This drives us to think in terms of fewer MPI
    tasks/node
  • 320 GiB/node on Intel → 1.5 PiB of memory, which
    is also a problem

From Seager Dec 1997 platforms talk
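The totals follow from >1 GiB per MPI task and the
node counts assumed on the previous slide (64 and 320
cores per node respectively):

    Low power:  64 GiB/node  × 25,600 nodes ≈ 1.6 PiB
    Intel:      320 GiB/node × 4,608 nodes  ≈ 1.4 PiB (roughly the 1.5 PiB above)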
17
Multicore processors drive huge network
requirements
  • With multiple MPI tasks/node, short-message
    messaging rate becomes as important as bandwidth
    and latency
  • ASC applications require >2 M msgs/s per MPI task
  • If we map MPI tasks directly to cores, then we
    require
  • Low power system requires 64 MPI tasks/node → 128
    M msgs/s
  • 12.5 clocks/message
  • This may be achievable, but is very high risk
  • Intel requires 80 MPI tasks/socket and 4-socket
    nodes → >640 M msgs/s
  • 5 clocks/message
  • This is not achievable
  • This drives us to think in terms of fewer MPI
    tasks/node
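The aggregate rates are the per-task requirement
times the tasks per node; the clocks-per-message
figures appear to assume core clocks of roughly
1.6 GHz and 3.2 GHz respectively (an inference, not
stated on the slide):

    Low power:  64 tasks/node × 2 M msgs/s ≈ 128 M msgs/s;  1.6 GHz / 128 M msgs/s ≈ 12.5 clocks/msg
    Intel:      320 tasks/node × 2 M msgs/s = 640 M msgs/s;  3.2 GHz / 640 M msgs/s = 5 clocks/msg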

18
Sequoia Target Application Programming Model
Leverages Factor and Simplify to Scale
Applications to O(1M) Parallelism
  • MPI Parallelism at top level
  • Static allocation of MPI tasks to nodes and sets
    of cores/threads
  • Allow for MPI everywhere, just in case
  • Effectively absorb multiple cores/threads in each
    MPI task
  • Support multiple languages
  • C/C++/Fortran03/Python
  • Allow different physics packages to express node
    concurrency in different ways

19
With Careful Use of Node Concurrency We can
Support A Wide Variety of Complex Applications
[Diagram: timeline of one MPI task between MPI_INIT
and MPI_FINALIZE; MAIN on Thread0 interleaves MPI
calls with OpenMP parallel regions (Funct1) and TM/SE
regions (Funct2) executed across Thread0-Thread3,
then Exit]
  • Pthreads born with MAIN
  • Only Thread0 calls functions to nest parallelism
  • Pthreads based MAIN calls OpenMP based Funct1
  • OpenMP Funct1 calls TM/SE based Funct2
  • Funct2 returns to OpenMP based Funct1
  • Funct1 returns to Pthreads based MAIN
  • MPI Tasks on a node are processes (one shown)
    with multiple OS threads (Thread0-3 shown)
  • Thread0 is the main thread; Thread1-3 are helper
    threads that morph from Pthreads to OpenMP workers
    to TM/SE compiler-generated threads via runtime
    support
  • Hardware support to significantly reduce
    overheads for thread repurposing and OpenMP loops
    and locks
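A minimal sketch of mixing threading models inside
one task, assuming nothing beyond stock Pthreads and
OpenMP (the thread repurposing described above relies
on hardware and runtime support that this sketch does
not model):

    /* Pthreads-based MAIN calling an OpenMP-based Funct1 (sketch). */
    #include <pthread.h>
    #include <omp.h>
    #include <stdio.h>

    static void *helper(void *arg)     /* explicit Pthread work in MAIN */
    {
        (void)arg;
        printf("pthread helper running\n");
        return NULL;
    }

    static double funct1(int n)        /* node concurrency via OpenMP */
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; i++)
            sum += i * 0.5;
        return sum;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, helper, NULL);    /* Pthreads phase */
        pthread_join(t, NULL);

        printf("funct1 sum = %g\n", funct1(1000)); /* OpenMP phase */
        return 0;
    }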

20
Sequoia Distributed Software Stack Targets
Familiar Environment for Easy Application Porting
[Diagram: Sequoia node software stack. User space:
application; code development tools; C/C++/Fortran
compilers, Python; parallel math libs; OpenMP,
threads, SE/TM; optimized math libs; MPI2 over ADI;
sockets; C lib/F03 runtime. Kernel space: LWK or
Linux; function-shipped syscalls; Lustre client;
UDP, TCP, IP; LNet; interconnect interface; external
network]
21
Consistent Software Development Tools for
Livermore Model from Desktop and Linux Clusters
to Sequoia
  • GNU build tools
  • Math libs
  • Static analysis tools
  • Compilers (C/C++/Fortran), Python
  • Runtime tools
  • IDEs (Eclipse), GUIs
  • Emulators (for unique HW features)
  • Code steering

Open source, seamless environment spanning desktop,
clusters, and petascale
Vendor and ISV components are negotiable
22
Sequoia Platform Target Performance is a
Combination of Peak and Application Sustained
Performance
  • Peak of the machine is its absolute maximum
    performance
  • FLOP/s = FLoating point OPerations per second
  • Sustained is a weighted average of the five
    marquee benchmark codes' Figures of Merit
  • Four IDC package benchmarks and one science
    workload benchmark from SNL
  • FOMs chosen to mimic grind times and factor out
    scaling issues

BlueGene/L 0.4 PF/s
Purple 0.1 PF/s
23
Sequoia Benchmarks have already incentivized the
industry to work on problems relevant to our
mission needs
  • What's missing?
  • Hydrodynamics
  • Structural mechanics
  • Quantum MD

24
Validation and Benchmark Efforts
  • Platforms
  • Purple (IBM Power5, AIX)
  • BGL (IBM PPC440, LWK)
  • BGP (IBM PPC450, LWK, SMP)
  • ATLAS (AMD Opteron, TOSS)
  • Red Storm (AMD Opteron, Catamount)
  • Franklin (AMD Opteron, CNL)
  • Phoenix (Vector, UNICOS)

25
The strategy for aggregating performance
incentivizes vendors in two ways.
1. Peak (petaFLOP/s)
2. MPI tasks/node < memory per node / 2 GB

awFOM = wFOM_AMG + wFOM_IRS + wFOM_SPhot + wFOM_UMT + wFOM_LAMMPS
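Each wFOM term is presumably a weighted per-benchmark
figure of merit; with weights w_i (not given on this
slide) the aggregate reads:

    awFOM = sum over i of w_i × FOM_i,   i in {AMG, IRS, SPhot, UMT, LAMMPS}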
26
AMG Results
27
AMG message size distribution
An improved messaging rate would significantly
impact AMG communication performance.
28
UMT and SPhot results
29
Observations of messaging rate for UMT indicate that
messaging rate must be included as an interconnect
requirement
Messaging is very bursty, and most messaging
occurs at a high messaging rate.
30
IRS (Implicit Radiation Solver) results
31
IRS load imbalance has two components: compute
and communications
IMBALANCE (MAX / AVG)

PE       Model     Power5    BG/L      Red Storm
512      1.1429    1.521     1.061
1,000    1.1111    1.487     1.092     1.064
2,197    1.0833    1.428     1.080     1.052
4,096    1.0667    1.352     1.067     1.030
8,000    1.0526    1.052
32
Summary
  • Sequoia is a carefully choreographed risk
    mitigation strategy to develop and deliver a huge
    leap forward in computing power to the National
    Stockpile Stewardship Program
  • Sequoia will work for weapons science and
    integrated design codes when delivered because our
    evolutionary approach yields a revolutionary
    advance on multiple fronts
  • The groundwork of system requirements,
    benchmarks, and SOW is in place for the launch of
    a successful procurement competition for Sequoia