1. Science on Supercomputers: Pushing the (back of the) envelope

Jeffrey P. Gardner
Pittsburgh Supercomputing Center
Carnegie Mellon University
University of Pittsburgh
2. Outline
- History (the past)
  - Characteristics of scientific codes
  - Scientific computing, supercomputers, and the Good Old Days
- Reality (the present)
  - Is there anything super about computers anymore?
  - Why "network" means more net work on your part
- Fantasy (the future)
  - Strategies for turning a huge pile of processors into something scientists can actually use
3. A (very brief) Introduction to Scientific Computing
4. Properties of interesting scientific datasets
- A very large dataset, where
- the calculation is tightly coupled
5. Example Science Application: Cosmology
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the gravitational force on any single particle requires the entire dataset: read-only coupling
(Image: simulation volume, 100 million light years across)
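To make the read-only coupling concrete, here is a minimal serial sketch in C of direct-summation gravity (illustrative only, not Gasoline's actual code): computing the acceleration of one particle reads the position and mass of all N-1 others, so whichever PE owns that particle ultimately needs the entire dataset.

    /* Minimal direct-summation gravity sketch (illustrative, not Gasoline):
     * the force on particle i touches every other particle, which is the
     * read-only coupling described above.  Units chosen so that G = 1. */
    #include <math.h>

    typedef struct { double x, y, z, mass; } Particle;

    void accel_on(const Particle *p, int n, int i, double a[3]) {
        a[0] = a[1] = a[2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + 1e-12;  /* softening */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            a[0] += p[j].mass * dx * inv_r3;
            a[1] += p[j].mass * dy * inv_r3;
            a[2] += p[j].mass * dz * inv_r3;
        }
    }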
6. Example Science Application: Cosmology
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the hydrodynamic forces requires information exchange between particles: read-write coupling
(Image: simulation volume, 100 million light years across)
7. Scientific Computing
- Transaction processing¹
  - A transaction is an information processing operation that cannot be subdivided into smaller operations. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state.²
- Functional definition
  - A transaction is any computational task
    - that cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete, or
    - where any further subdivisions cannot be written in such a way that they are independent of one another.

¹ Term borrowed (and generalized, with apologies) from database management
² From Wikipedia
8. Scientific Computing
- Functional definition
  - A transaction is any computational task that cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the gravitational force on any single particle requires the entire dataset: read-only coupling
9. Scientific Computing
- Functional definition
  - A transaction is any computational task where any further subdivisions cannot be written in such a way that they are independent of one another.
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the hydrodynamic forces requires information exchange between particles: read-write coupling
10. Scientific Computing
- In most business and web applications:
  - A single CPU usually processes many transactions per second
  - Transaction sizes are typically small
11. Scientific Computing
- In many science applications:
  - A single transaction can take CPU hours, days, or years
  - Transaction sizes can be extremely large
12. What Made Computers Super?
- Since the transaction must be memory-resident in order not to be I/O bound, the next bottleneck is memory.
- The original supercomputers differed from ordinary computers in their memory bandwidth and latency characteristics.
13. The Golden Age of Supercomputing
- 1976-1982: The Cray-1 is the most powerful computer in the world
- The Cray-1 is a vector platform
  - i.e. it performs the same operation on many contiguous memory elements in one clock tick (see the loop sketch below)
  - The memory subsystem was optimized to feed data to the processor at its maximum flop rate
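As a concrete picture of what "vectorize" means, consider the canonical SAXPY loop below (plain C, not Cray code): one operation streams over contiguous memory elements, exactly the pattern a vector unit can execute at the machine's full memory bandwidth.

    /* SAXPY: y = a*x + y.  The same operation applied to contiguous
     * elements, the pattern that vector hardware (and modern compilers'
     * auto-vectorizers) exploit. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }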
14. The Golden Age of Supercomputing
- 1985-1989: The Cray-2 is the most powerful computer in the world
- The Cray-2 is also a vector platform
15. Scientists Liked Supercomputers. They were simple to program!
- They were serial machines
  - Caches? We don't need no stinkin' caches!
  - Scalar machines had no memory latency
  - This is as close as you get to an ideal computer
- Vector machines offered substantial performance increases over scalar machines, if you could vectorize your code
16. Triumph of the Masses
- In the 1990s, commercial off-the-shelf (COTS) technology became so cheap that it was no longer cost-effective to produce fully-custom hardware
17. Triumph of the Masses
- Instead of producing faster processors with faster memory, supercomputer companies built machines with lots of processors in them.
(Images: a single-processor Cray-2; a 1024-processor Cray (CRI) T3D)
18. Triumph of the Masses
- These were known as massively parallel platforms, or MPPs.
(Images: a single-processor Cray-2; a 1024-processor Cray T3D)
19. Triumph of the Masses(?)
(Images: a single-processor Cray-2, the world's fastest computer in 1989; a 1024-processor Cray T3D, the world's fastest computer in 1994 (almost))
20. Part II: The Present
- Why "network" means more net work on your part
21. The Social Impact of MPPs
- The transition from serial supercomputers to MPPs actually resulted in far fewer scientists using supercomputers.
- MPPs are really hard to program!
- Developing scientific applications for MPPs became an area of study in its own right: High-Performance Computing (HPC)
22. Characteristics of HPC Codes
- Large dataset
- Data must be distributed across many compute nodes
(Diagram: the CPU memory hierarchy, from processor registers through L1 cache, 2 cycles; L2 cache, 10 cycles; and main memory, 100 cycles, extended to the MPP memory hierarchy; illustrated with an N-body cosmology simulation)
23. What makes computers super anymore?
(Images: Cray T3D in 1994, Cray-built interconnect fabric; PSC Terascale Computing System (TCS) in 2000, custom interconnect fabric by Quadrics; PSC Cray XT3 in 2006, Cray-built interconnect fabric)
24. What makes computers super anymore?
- I would propose the following definition:
  - A supercomputer differs from a pile of workstations in that a supercomputer is optimized to spread a single large transaction across many, many processors.
- In practice, this means that the network interconnect fabric is the principal bottleneck.
25. What makes computers super anymore?
(Image: Google's 30-acre campus in The Dalles, Oregon)
26. Review: Hallmarks of Computing
1956: FORTRAN heralded as the world's first high-level language
1966: Seymour Cray develops the CDC 6600, the first supercomputer
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: Cray-2 marks the end of the Golden Age of supercomputing
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1998: Google Inc. is founded
20??: Google achieves world domination. Scientists still program in a high-level language they call FORTRAN.
27. Review: HPC
- High-Performance Computing (HPC) refers to a type of computation whereby a single, large transaction is spread across 100s to 1000s of processors.
- In general, this kind of computation is sensitive to network bandwidth and latency.
- Therefore, most modern-day supercomputers seek to maximize interconnect bandwidth and minimize interconnect latency within economic limits.
28.
- Naïve algorithm is O(N²)
- Gasoline N-Body Treecode (O(N log N))
  - Began development in 1994 and continues to this day
(Diagram: particles assigned to PEs via a kd-tree, a subset of Binary Space Partitioning tree)
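A hypothetical C layout for a treecode's kd-tree node (illustrative; not Gasoline's actual structure): because every node carries the total mass and center of mass of its contents, distant regions can be approximated by a single pseudo-particle, which is what cuts the naïve O(N²) sum down to O(N log N).

    /* Hypothetical kd-tree node for a gravity treecode (illustrative). */
    typedef struct KDNode {
        double bnd_min[3], bnd_max[3];   /* axis-aligned bounding box   */
        double com[3];                   /* center of mass of contents  */
        double mass;                     /* total mass of contents      */
        int    first, last;              /* particle index range (leaf) */
        struct KDNode *left, *right;     /* children; NULL at a leaf    */
    } KDNode;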
29. Example HPC Application: Cosmological N-Body Simulation
30. Cosmological N-Body Simulation
PROBLEM:
- Everything in the Universe attracts everything else
- Dataset is far too large to replicate in every PE's memory
- Difficult to parallelize
31. Cosmological N-Body Simulation
PROBLEM:
- Everything in the Universe attracts everything else
- Dataset is far too large to replicate in every PE's memory
- Difficult to parallelize
  - Only 1 in 3000 memory fetches can result in an off-processor message being sent!
32. Characteristics of HPC Codes
- Large dataset
- Data must be distributed across many compute nodes
(Diagram: the MPP memory hierarchy, from processor registers through L1 cache, 2 cycles; L2 cache, 10 cycles; and main memory, 100 cycles; illustrated with an N-body cosmology simulation)
33. Features
- Advanced interprocessor data caching (see the sketch after this list)
  - Application data is organized into cache lines
  - Read cache
    - Requests for off-PE data result in fetching of a cache line
    - Cache line is stored locally and used for future requests
  - Write cache
    - Updates to off-PE data are processed locally, then flushed to the remote thread when necessary
  - < 1 in 100,000 off-PE requests actually result in communication.
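A hedged sketch of how such a read cache might look (the names and the direct-mapped policy here are assumptions, not the real Gasoline/MDL implementation): a request for a remote element first checks a local table of previously fetched lines and only communicates on a miss.

    #define LINE_ELEMS 8

    typedef struct {
        long   base;               /* global index of first element; -1 if empty */
        double data[LINE_ELEMS];
    } CacheLine;

    /* Hypothetical hook that would perform the actual off-PE fetch. */
    extern void fetch_remote_line(long base, double *buf);

    double cached_read(CacheLine *table, int nlines, long gidx) {
        long base = gidx - (gidx % LINE_ELEMS);
        CacheLine *ln = &table[(base / LINE_ELEMS) % nlines]; /* direct-mapped */
        if (ln->base != base) {             /* miss: one fetch for the line */
            fetch_remote_line(base, ln->data);
            ln->base = base;
        }
        return ln->data[gidx - base];       /* hit: purely local */
    }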
34. Features
- Load balancing (a sketch follows)
  - The amount of work each particle required for step t is tracked.
  - This information is used to distribute work evenly amongst processors for step t+1.
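A minimal sketch of cost-based balancing (illustrative; Gasoline's actual scheme differs in detail): each particle is weighted by the work it cost during step t, and the ordered particle list is cut so that every PE receives roughly equal total cost for step t+1.

    /* Split n particles (already ordered along the domain) into npe chunks
     * of roughly equal accumulated cost; bounds[k] is the index of the
     * first particle owned by PE k. */
    void balance(const double *cost, int n, int npe, int *bounds) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += cost[i];
        double acc = 0.0, target = total / npe;
        int pe = 0;
        bounds[0] = 0;
        for (int i = 0; i < n && pe < npe - 1; i++) {
            acc += cost[i];
            if (acc >= target * (pe + 1))
                bounds[++pe] = i + 1;
        }
        while (pe < npe - 1) bounds[++pe] = n;   /* degenerate tail */
    }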
35. Performance
- 85% linearity on 512 PEs with pure MPI (Cray XT3)
- 92% linearity on 512 PEs with one-sided comms (Cray T3E Shmem)
- 92% linearity on 2048 PEs on Cray XT3 for optimal problem size (>100,000 particles per processor)
36. Features
- Portability (illustrative signatures follow)
  - Interprocessor communication by high-level requests to the Machine-Dependent Layer (MDL)
  - Only 800 lines of code per architecture
  - MDL is rewritten to take advantage of each parallel architecture (e.g. one-sided communication).
  - Communication: MPI-1, POSIX Threads, SHMEM, Quadrics, more
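The kind of interface this implies, sketched in C (these names and signatures are illustrative assumptions, not the real MDL API): the science code speaks in high-level verbs, and each architecture hides its communication details behind them.

    /* Illustrative MDL-style verbs (hypothetical signatures). */
    typedef struct MDL MDL;

    /* Fetch a read-only view of element 'idx' of remote cache 'cid'
     * on processor 'owner_pe'; may be served from the local cache. */
    void *mdl_acquire(MDL *mdl, int cid, long idx, int owner_pe);

    /* Release a view obtained with mdl_acquire(). */
    void  mdl_release(MDL *mdl, int cid, void *element);

    /* Ask a remote thread to run a registered service. */
    void  mdl_request_service(MDL *mdl, int target_pe, int service_id,
                              void *args, int arg_bytes);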
37. Applications
Galaxy Formation (10 million particles)
38. Applications
Solar System Planet Formation (1 million particles)
39. Applications
Asteroid Collisions (2000 particles)
40. Applications
Piles of Sand (?!) (1000 particles)
41. Summary
- N-Body simulations are difficult to parallelize
  - Gravity says everything interacts with everything else
- GASOLINE achieves high scalability by using several beneficial concepts:
  - Interprocessor data caching for both reads and writes
  - Maximal exploitation of any parallel architecture
  - Load balancing on a per-particle basis
- GASOLINE proved useful for a wide range of applications that simulate particle interactions
  - Flexible client-server architecture aids in porting to new science domains
42. Part III: The Future
- Turning a huge pile of processors into something that scientists can actually use
43. How to turn simulation output into scientific knowledge
Using 300 processors (circa 1996):
- Step 1: Run simulation
44. How to turn simulation output into scientific knowledge
Using 1000 processors (circa 2000):
- Step 1: Run simulation
- Step 2: Analyze simulation on server
45. How to turn simulation output into scientific knowledge
Using 2000 processors (circa 2005):
- Step 1: Run simulation
- Step 2: Analyze simulation on ??? (unhappy scientist)
46. How to turn simulation output into scientific knowledge
Using 100,000 processors? (circa 2012):
- Step 1: Run simulation
- Step 2: Analyze simulation on ???
The NSF has announced that it will be providing $200 million to build and operate a petaflop machine by 2012.
47. Turning TeraFlops into Scientific Understanding
- Problem:
  - The size of simulations is no longer limited by the scalability of the simulation code, but by the scientist's inability to process the resultant data.
48. Turning TeraFlops into Scientific Understanding
- As MPPs increase in processor count, analysis tools must also run on MPPs!
- PROBLEM:
  - Scientists usually write their own analysis programs
  - Parallel programs are hard to write!
- The HPC world is dominated by simulations:
  - Code is often reused for many years by many people
  - Therefore, you can afford to spend lots of time writing the code.
  - Example: Gasoline required 10 FTE-years of development!
49. Turning TeraFlops into Scientific Understanding
- Data analysis implies:
  - Rapidly changing scientific inquiries
  - Much less code reuse
- Data analysis requires rapid algorithm development!
- We need to rethink how we as scientists interact with our data!
50. A Solution(?): N tropy
- Scientists tend to write their own code
  - So give them something that makes it easier for them.
- Build a framework that is
  - Sophisticated enough to take care of all of the parallel bits for you
  - Flexible enough to be used for a large variety of data analysis applications
51. N tropy: A framework for multiprocessor development
- GOAL: Minimize development time for parallel applications.
- GOAL: Enable scientists with no parallel programming background (or time to learn) to still implement their algorithms in parallel by writing only serial code.
- GOAL: Provide seamless scalability from single-processor machines to MPPs, potentially even several MPPs in a computational Grid.
- GOAL: Do not restrict inquiry space.
52. Methodology
- Limited data structures
  - Astronomy deals with point-like data in an N-dimensional parameter space
  - The most efficient methods on this kind of data use trees.
- Limited methods
  - Analysis methods perform a limited number of fundamental operations on these data structures.
53. N tropy Design
- GASOLINE already provides a number of advanced services
- GASOLINE benefits to keep:
  - Flexible client-server scheduling architecture
    - Threads respond to service requests issued by the master.
    - To do a new task, simply add a new service.
  - Portability
    - Interprocessor communication occurs by high-level requests to the Machine-Dependent Layer (MDL), which is rewritten to take advantage of each parallel architecture.
  - Advanced interprocessor data caching
    - < 1 in 100,000 off-PE requests actually result in communication.
54. N tropy Design
- Dynamic load balancing (available now)
  - Workload and processor domain boundaries can be dynamically reallocated as computation progresses.
- Data pre-fetching (to be implemented)
  - Predict and request off-PE data that will be needed for upcoming tree nodes.
55. N tropy Design
- Computing across grid nodes
  - Much more difficult than between nodes on a tightly-coupled parallel machine
    - Network latencies between grid resources are 1000 times higher than between nodes on a single parallel machine.
  - Nodes on far grid resources must be treated differently than the processor next door:
    - Data mirroring or aggressive prefetching
    - Sophisticated workload management, synchronization
56. N tropy Features
- By using N tropy you get a lot of features for free:
  - Tree objects and methods
    - Highly optimized and flexible
  - Automatic parallelization and scalability
    - You only write serial bits of code!
  - Portability
    - Interprocessor communication occurs by high-level requests to the Machine-Dependent Layer (MDL), which is rewritten to take advantage of each parallel architecture.
    - MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS), SGI Altix
57. N tropy Features
- By using N tropy you get a lot of features for free:
  - Collectives
    - AllToAll, AllGather, AllReduce, etc.
  - Automatic reduction variables
    - All of your routines can return scalars to be reduced across all processors (see the MPI sketch after this list)
  - Timers
    - 4 automatic N tropy timers
    - 10 custom timers
  - Automatic communication and I/O statistics
    - Quickly identify bottlenecks
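For contrast, this is what an automatic reduction variable hides: at the MPI level, combining per-PE partial values into a global result is an explicit collective. A minimal self-contained example in standard MPI (compute_partial() is a hypothetical stand-in for a PE's local share of the work):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for one PE's local computation. */
    static double compute_partial(void) { return 1.0; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = compute_partial();
        double global = 0.0;
        /* Sum the per-PE partial values; every rank gets the result. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0) printf("global sum = %g\n", global);
        MPI_Finalize();
        return 0;
    }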
58. Serial Performance
- N tropy vs. an existing serial n-point correlation function calculator
  - N tropy is 6 to 30 times faster in serial!
- Conclusions:
  - Not only does it take much less time to write an application using N tropy,
  - your application may run faster than if you wrote it from scratch!
59. Performance
10 million particles, spatial 3-point correlation function, 3 to 4 Mpc.
This problem is substantially harder than gravity!
3 FTE-months of development time!
60. N tropy: Meaningful Benchmarks
- The purpose of this framework is to minimize development time!
- Development time for:
  - N-point correlation function calculator: 3 months
  - Friends-of-Friends group finder: 3 weeks
  - N-body gravity code: 1 day!
(OK, I cheated a bit and used existing serial N-body code fragments)
61. N tropy Conceptual Schematic
(Diagram: a Computational Steering Layer in C, C++, or Python (Fortran?) sits atop the framework black box, with a possible Web Service Layer (WSDL? SOAP?, at least from Python) connecting to the VO. Key framework components, Tree Services: Dynamic Workload Management, Domain Decomposition/Tree Building, Tree Traversal, Collectives, Parallel I/O. User-supplied pieces: user tree and particle data, user serial I/O routines, user tree traversal routines, user serial collective staging and processing routines.)
62. Summary
Prehistoric times: FORTRAN is heralded as the first high-level language. Scientists run on serial supercomputers. Scientists write many programs for them. Scientists are happy.
Early 1990s (ancient times): MPPs are born. Scientists scratch their heads and figure out how to parallelize their algorithms.
Mid 1990s: Scientists start writing scalable code for MPPs. After much effort, scientists are kind of happy again.
Early 2000s: Scientists no longer run their simulations on the biggest MPPs because they cannot analyze the output. Scientists are seriously bummed.
20??: Google achieves world domination. Scientists still program in a high-level language they call FORTRAN.
63. Summary
- N tropy is an attempt to allow scientists to rapidly develop their analysis codes for a multiprocessor environment.
- Our results so far show that it is worthwhile to invest time developing individual frameworks that are:
  - Serially optimized
  - Scalable
  - Flexible enough to be customized to many different applications, even applications that you do not currently envision.
- Is this a solution for the 100,000-processor world of tomorrow?
64. Pittsburgh Supercomputing Center
- Founded in 1986
- Joint venture between Carnegie Mellon University, the University of Pittsburgh, and Westinghouse Electric Co.
- Funded by several federal agencies as well as private industry.
- Main source of support is the National Science Foundation, Office of Cyberinfrastructure
65. Pittsburgh Supercomputing Center
- PSC is the third largest NSF-sponsored supercomputing center
- BUT we provide over 60% of the computer time used by NSF research
- AND PSC is the only academic supercomputing center in the U.S. to have had the most powerful supercomputer in the world (for unclassified research)
66. Pittsburgh Supercomputing Center
- GOAL: To use cutting-edge computer technology to do science that would not otherwise be possible
67. Conclusions
- Most data analysis in astronomy is done using trees as the fundamental data structure.
- Most operations on these tree structures are functionally identical.
- Based on our studies so far, it appears feasible to construct a general-purpose multiprocessor framework that users can rapidly customize to their needs.
68. Cosmological N-Body Simulation: Timings
- Time required for 1 floating point operation: 0.25 ns
- Time required for 1 memory fetch: 10 ns (40 floats)
- Time required for 1 off-processor fetch: 10 µs (40,000 floats)
- Lesson: an off-processor fetch costs roughly 1000 memory fetches (10 µs / 10 ns), so only 1 in 1000 memory fetches can result in network activity!
69. The Very First "Super Computer"
- 1929: The New York World newspaper coins the term "super computer" when talking about a giant tabulator custom-built by IBM for Columbia University
70. Review: Hallmarks of Computing
1956: FORTRAN heralded as the world's first high-level language
1966: Seymour Cray develops the CDC 6600, the first supercomputer
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: Cray-2 marks the end of the Golden Age of supercomputing
1989: Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC)
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1995: Cray Computer Corporation (CCC) goes bankrupt
1996: Cray Research Inc. acquired by SGI
1998: Google Inc. is founded
20??: Google achieves world domination. Scientists still program in a high-level language they call FORTRAN.
71. The T3D MPP
- 1024 DEC Alpha processors (COTS)
- 128 MB of RAM per processor (COTS)
- Custom-built Cray network fabric
(Image: a 1024-processor Cray T3D in 1994)
72. General characteristics of MPPs
- COTS processors
- COTS memory subsystem
- Linux-based kernel
- Custom networking
  - Why?? Custom networking in MPPs has replaced the custom memory systems of vector machines
(Image: the 2068-processor Cray XT3 at PSC in 2006)
73. Example Science Application: Weather Prediction
Looking for tornadoes (credits: PSC, Center for Analysis and Prediction of Storms)
74. Reasons for being sensitive to communication latency
- A given processor (PE) may touch a very large subsample of the total dataset
  - Example: a self-gravitating system
- PEs must exchange information many times during a single transaction
  - Example: along domain boundaries of a fluid calculation (see the sketch below)
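A hedged sketch of that boundary exchange in standard MPI (illustrative, not taken from any real fluid code): each PE swaps one ghost element with each neighbor every timestep, so a single transaction pays network latency over and over.

    #include <mpi.h>

    /* Exchange ghost cells of a 1-D slab with periodic neighbors.
     * slab[0] and slab[n-1] are ghosts; slab[1..n-2] are interior. */
    void exchange_ghosts(double *slab, int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Send rightmost interior cell right; receive left ghost. */
        MPI_Sendrecv(&slab[n - 2], 1, MPI_DOUBLE, right, 0,
                     &slab[0],     1, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* Send leftmost interior cell left; receive right ghost. */
        MPI_Sendrecv(&slab[1],     1, MPI_DOUBLE, left,  1,
                     &slab[n - 1], 1, MPI_DOUBLE, right, 1,
                     comm, MPI_STATUS_IGNORE);
    }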
75. Features
- Flexible client-server scheduling architecture
  - Threads respond to service requests issued by the master.
  - To do a new task, simply add a new service.
- Computational steering involves trivial serial programming
76. Design
Gasoline functional layout, from serial to parallel layers:
- Computational Steering Layer (serial): executes on the master processor only
- Parallel Management Layer: coordinates execution and data distribution among processors
- Serial Layer: executes independently on all processors (Gravity Calculator, Hydro Calculator)
- Machine-Dependent Layer (MDL): interprocessor communication
77. Cosmological N-Body Simulation
SCIENCE:
1. Simulate how structure in the Universe forms from initial linear density fluctuations
   - Linear fluctuations in the early Universe supplied by cosmological theory.
2. Calculate the non-linear final states of these fluctuations.
3. See if these look anything like the real Universe.
4. No? Go to step 1.