1. Science on Supercomputers: Pushing the (back of the) envelope

Jeffrey P. Gardner
Pittsburgh Supercomputing Center
Carnegie Mellon University
University of Pittsburgh
2. Outline
- History (the past)
  - Characteristics of scientific codes
  - Scientific computing, supercomputers, and the Good Old Days
- Reality (the present)
  - Is there anything super about computers anymore?
  - Why "network" means more net work on your part
- Fantasy (the future)
  - Strategies for turning a huge pile of processors into something scientists can actually use
3. A (very brief) Introduction to Scientific Computing
4. Properties of interesting scientific datasets
- A very large dataset, where
- the calculation is tightly coupled
5. Example Science Application: Cosmology
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the gravitational force on any single particle requires the entire dataset: read-only coupling
(Image: simulation volume, 100 million light years across)
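To make the read-only coupling concrete, here is a minimal serial sketch in C of direct-summation gravity (illustrative only, not Gasoline's actual code): computing the acceleration of one particle reads the position and mass of all N-1 others, so whichever PE owns that particle ultimately needs the entire dataset.

    /* Minimal direct-summation gravity sketch (illustrative, not Gasoline):
     * the force on particle i touches every other particle, which is the
     * read-only coupling described above.  Units chosen so that G = 1. */
    #include <math.h>

    typedef struct { double x, y, z, mass; } Particle;

    void accel_on(const Particle *p, int n, int i, double a[3]) {
        a[0] = a[1] = a[2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + 1e-12;  /* softening */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            a[0] += p[j].mass * dx * inv_r3;
            a[1] += p[j].mass * dy * inv_r3;
            a[2] += p[j].mass * dz * inv_r3;
        }
    }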
6. Example Science Application: Cosmology
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the hydrodynamic forces requires information exchange between particles: read-write coupling
(Image: simulation volume, 100 million light years across)
7. Scientific Computing
- Transaction processing¹
  - A transaction is an information processing operation that cannot be subdivided into smaller operations. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state.²
- Functional definition
  - A transaction is any computational task
    - that cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete, or
    - where any further subdivisions cannot be written in such a way that they are independent of one another.

¹ Term borrowed (and generalized, with apologies) from database management
² From Wikipedia
8. Scientific Computing
- Functional definition
  - A transaction is any computational task that cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the gravitational force on any single particle requires the entire dataset: read-only coupling
9. Scientific Computing
- Functional definition
  - A transaction is any computational task where any further subdivisions cannot be written in such a way that they are independent of one another.
- Cosmological N-Body simulation
  - 100,000,000 particles
  - 1 TB of RAM
- To resolve the hydrodynamic forces requires information exchange between particles: read-write coupling
10. Scientific Computing
- In most business and web applications:
  - A single CPU usually processes many transactions per second
  - Transaction sizes are typically small
11. Scientific Computing
- In many science applications:
  - A single transaction can take CPU hours, days, or years
  - Transaction sizes can be extremely large
12. What Made Computers Super?
- Since the transaction must be memory-resident in order not to be I/O bound, the next bottleneck is memory.
- The original supercomputers differed from ordinary computers in their memory bandwidth and latency characteristics.
13. The Golden Age of Supercomputing
- 1976-1982: The Cray-1 is the most powerful computer in the world
- The Cray-1 is a vector platform
  - i.e. it performs the same operation on many contiguous memory elements in one clock tick (see the loop sketch below)
  - The memory subsystem was optimized to feed data to the processor at its maximum flop rate
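As a concrete picture of what "vectorize" means, consider the canonical SAXPY loop below (plain C, not Cray code): one operation streams over contiguous memory elements, exactly the pattern a vector unit can execute at the machine's full memory bandwidth.

    /* SAXPY: y = a*x + y.  The same operation applied to contiguous
     * elements, the pattern that vector hardware (and modern compilers'
     * auto-vectorizers) exploit. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }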
14. The Golden Age of Supercomputing
- 1985-1989: The Cray-2 is the most powerful computer in the world
- The Cray-2 is also a vector platform
15. Scientists Liked Supercomputers. They were simple to program!
- They were serial machines
  - Caches? We don't need no stinkin' caches!
  - Scalar machines had no memory latency
  - This is as close as you get to an ideal computer
- Vector machines offered substantial performance increases over scalar machines, if you could vectorize your code
16. Triumph of the Masses
- In the 1990s, commercial off-the-shelf (COTS) technology became so cheap that it was no longer cost-effective to produce fully-custom hardware
17. Triumph of the Masses
- Instead of producing faster processors with faster memory, supercomputer companies built machines with lots of processors in them.
(Images: a single-processor Cray-2; a 1024-processor Cray (CRI) T3D)
18. Triumph of the Masses
- These were known as massively parallel platforms, or MPPs.
(Images: a single-processor Cray-2; a 1024-processor Cray T3D)
19. Triumph of the Masses(?)
(Images: a single-processor Cray-2, the world's fastest computer in 1989; a 1024-processor Cray T3D, the world's fastest computer in 1994 (almost))
20. Part II: The Present
- Why "network" means more net work on your part
21. The Social Impact of MPPs
- The transition from serial supercomputers to MPPs actually resulted in far fewer scientists using supercomputers.
- MPPs are really hard to program!
- Developing scientific applications for MPPs became an area of study in its own right: High-Performance Computing (HPC)
22. Characteristics of HPC Codes
- Large dataset
- Data must be distributed across many compute nodes
(Diagram: the CPU memory hierarchy, from processor registers through L1 cache, 2 cycles; L2 cache, 10 cycles; and main memory, 100 cycles, extended to the MPP memory hierarchy; illustrated with an N-body cosmology simulation)
23. What makes computers super anymore?
(Images: Cray T3D in 1994, Cray-built interconnect fabric; PSC Terascale Computing System (TCS) in 2000, custom interconnect fabric by Quadrics; PSC Cray XT3 in 2006, Cray-built interconnect fabric)
24. What makes computers super anymore?
- I would propose the following definition:
  - A supercomputer differs from a pile of workstations in that a supercomputer is optimized to spread a single large transaction across many, many processors.
- In practice, this means that the network interconnect fabric is the principal bottleneck.
25. What makes computers super anymore?
(Image: Google's 30-acre campus in The Dalles, Oregon)
26. Review: Hallmarks of Computing
1956: FORTRAN heralded as the world's first high-level language
1966: Seymour Cray develops the CDC 6600, the first supercomputer
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: Cray-2 marks the end of the Golden Age of supercomputing
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1998: Google Inc. is founded
20??: Google achieves world domination. Scientists still program in a high-level language they call FORTRAN.
27. Review: HPC
- High-Performance Computing (HPC) refers to a type of computation whereby a single, large transaction is spread across 100s to 1000s of processors.
- In general, this kind of computation is sensitive to network bandwidth and latency.
- Therefore, most modern-day supercomputers seek to maximize interconnect bandwidth and minimize interconnect latency within economic limits.
28.
- Naïve algorithm is O(N²)
- Gasoline N-Body Treecode (O(N log N))
  - Began development in 1994 and continues to this day
(Diagram: particles assigned to PEs via a kd-tree, a subset of Binary Space Partitioning tree)
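A hypothetical C layout for a treecode's kd-tree node (illustrative; not Gasoline's actual structure): because every node carries the total mass and center of mass of its contents, distant regions can be approximated by a single pseudo-particle, which is what cuts the naïve O(N²) sum down to O(N log N).

    /* Hypothetical kd-tree node for a gravity treecode (illustrative). */
    typedef struct KDNode {
        double bnd_min[3], bnd_max[3];   /* axis-aligned bounding box   */
        double com[3];                   /* center of mass of contents  */
        double mass;                     /* total mass of contents      */
        int    first, last;              /* particle index range (leaf) */
        struct KDNode *left, *right;     /* children; NULL at a leaf    */
    } KDNode;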
29. Example HPC Application: Cosmological N-Body Simulation
30. Cosmological N-Body Simulation
PROBLEM:
- Everything in the Universe attracts everything else
- Dataset is far too large to replicate in every PE's memory
- Difficult to parallelize
31. Cosmological N-Body Simulation
PROBLEM:
- Everything in the Universe attracts everything else
- Dataset is far too large to replicate in every PE's memory
- Difficult to parallelize
  - Only 1 in 3000 memory fetches can result in an off-processor message being sent!
32. Characteristics of HPC Codes
- Large dataset
- Data must be distributed across many compute nodes
(Diagram: the MPP memory hierarchy, from processor registers through L1 cache, 2 cycles; L2 cache, 10 cycles; and main memory, 100 cycles; illustrated with an N-body cosmology simulation)
33. Features
- Advanced interprocessor data caching (see the sketch after this list)
  - Application data is organized into cache lines
  - Read cache
    - Requests for off-PE data result in fetching of a cache line
    - Cache line is stored locally and used for future requests
  - Write cache
    - Updates to off-PE data are processed locally, then flushed to the remote thread when necessary
  - < 1 in 100,000 off-PE requests actually result in communication.
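A hedged sketch of how such a read cache might look (the names and the direct-mapped policy here are assumptions, not the real Gasoline/MDL implementation): a request for a remote element first checks a local table of previously fetched lines and only communicates on a miss.

    #define LINE_ELEMS 8

    typedef struct {
        long   base;               /* global index of first element; -1 if empty */
        double data[LINE_ELEMS];
    } CacheLine;

    /* Hypothetical hook that would perform the actual off-PE fetch. */
    extern void fetch_remote_line(long base, double *buf);

    double cached_read(CacheLine *table, int nlines, long gidx) {
        long base = gidx - (gidx % LINE_ELEMS);
        CacheLine *ln = &table[(base / LINE_ELEMS) % nlines]; /* direct-mapped */
        if (ln->base != base) {             /* miss: one fetch for the line */
            fetch_remote_line(base, ln->data);
            ln->base = base;
        }
        return ln->data[gidx - base];       /* hit: purely local */
    }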
34. Features
- Load balancing (a sketch follows)
  - The amount of work each particle required for step t is tracked.
  - This information is used to distribute work evenly amongst processors for step t+1.
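A minimal sketch of cost-based balancing (illustrative; Gasoline's actual scheme differs in detail): each particle is weighted by the work it cost during step t, and the ordered particle list is cut so that every PE receives roughly equal total cost for step t+1.

    /* Split n particles (already ordered along the domain) into npe chunks
     * of roughly equal accumulated cost; bounds[k] is the index of the
     * first particle owned by PE k. */
    void balance(const double *cost, int n, int npe, int *bounds) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += cost[i];
        double acc = 0.0, target = total / npe;
        int pe = 0;
        bounds[0] = 0;
        for (int i = 0; i < n && pe < npe - 1; i++) {
            acc += cost[i];
            if (acc >= target * (pe + 1))
                bounds[++pe] = i + 1;
        }
        while (pe < npe - 1) bounds[++pe] = n;   /* degenerate tail */
    }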
35. Performance
- 85% linearity on 512 PEs with pure MPI (Cray XT3)
- 92% linearity on 512 PEs with one-sided comms (Cray T3E Shmem)
- 92% linearity on 2048 PEs on Cray XT3 for optimal problem size (>100,000 particles per processor)
36. Features
- Portability (illustrative signatures follow)
  - Interprocessor communication by high-level requests to the Machine-Dependent Layer (MDL)
  - Only 800 lines of code per architecture
  - MDL is rewritten to take advantage of each parallel architecture (e.g. one-sided communication).
  - Communication: MPI-1, POSIX Threads, SHMEM, Quadrics, more
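The kind of interface this implies, sketched in C (these names and signatures are illustrative assumptions, not the real MDL API): the science code speaks in high-level verbs, and each architecture hides its communication details behind them.

    /* Illustrative MDL-style verbs (hypothetical signatures). */
    typedef struct MDL MDL;

    /* Fetch a read-only view of element 'idx' of remote cache 'cid'
     * on processor 'owner_pe'; may be served from the local cache. */
    void *mdl_acquire(MDL *mdl, int cid, long idx, int owner_pe);

    /* Release a view obtained with mdl_acquire(). */
    void  mdl_release(MDL *mdl, int cid, void *element);

    /* Ask a remote thread to run a registered service. */
    void  mdl_request_service(MDL *mdl, int target_pe, int service_id,
                              void *args, int arg_bytes);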
37. Applications
Galaxy Formation (10 million particles)
38. Applications
Solar System Planet Formation (1 million particles)
39. Applications
Asteroid Collisions (2000 particles)
40. Applications
Piles of Sand (?!) (1000 particles)
41. Summary
- N-Body simulations are difficult to parallelize
  - Gravity says everything interacts with everything else
- GASOLINE achieves high scalability by using several beneficial concepts:
  - Interprocessor data caching for both reads and writes
  - Maximal exploitation of any parallel architecture
  - Load balancing on a per-particle basis
- GASOLINE proved useful for a wide range of applications that simulate particle interactions
  - Flexible client-server architecture aids in porting to new science domains
42. Part III: The Future
- Turning a huge pile of processors into something that scientists can actually use
43. How to turn simulation output into scientific knowledge
Using 300 processors (circa 1996):
- Step 1: Run simulation
44. How to turn simulation output into scientific knowledge
Using 1000 processors (circa 2000):
- Step 1: Run simulation
- Step 2: Analyze simulation on server
45. How to turn simulation output into scientific knowledge
Using 2000 processors (circa 2005):
- Step 1: Run simulation
- Step 2: Analyze simulation on ??? (unhappy scientist)
46. How to turn simulation output into scientific knowledge
Using 100,000 processors? (circa 2012):
- Step 1: Run simulation
- Step 2: Analyze simulation on ???
The NSF has announced that it will be providing $200 million to build and operate a petaflop machine by 2012.
47. Turning TeraFlops into Scientific Understanding
- Problem:
  - The size of simulations is no longer limited by the scalability of the simulation code, but by the scientist's inability to process the resultant data.
48. Turning TeraFlops into Scientific Understanding
- As MPPs increase in processor count, analysis tools must also run on MPPs!
- PROBLEM:
  - Scientists usually write their own analysis programs
  - Parallel programs are hard to write!
- The HPC world is dominated by simulations:
  - Code is often reused for many years by many people
  - Therefore, you can afford to spend lots of time writing the code.
  - Example: Gasoline required 10 FTE-years of development!
49. Turning TeraFlops into Scientific Understanding
- Data analysis implies:
  - Rapidly changing scientific inquiries
  - Much less code reuse
- Data analysis requires rapid algorithm development!
- We need to rethink how we as scientists interact with our data!
50. A Solution(?): N tropy
- Scientists tend to write their own code
  - So give them something that makes it easier for them.
- Build a framework that is
  - Sophisticated enough to take care of all of the parallel bits for you
  - Flexible enough to be used for a large variety of data analysis applications
51. N tropy: A framework for multiprocessor development
- GOAL: Minimize development time for parallel applications.
- GOAL: Enable scientists with no parallel programming background (or time to learn) to still implement their algorithms in parallel by writing only serial code.
- GOAL: Provide seamless scalability from single-processor machines to MPPs, potentially even several MPPs in a computational Grid.
- GOAL: Do not restrict inquiry space.
52. Methodology
- Limited data structures
  - Astronomy deals with point-like data in an N-dimensional parameter space
  - The most efficient methods on this kind of data use trees.
- Limited methods
  - Analysis methods perform a limited number of fundamental operations on these data structures.
53. N tropy Design
- GASOLINE already provides a number of advanced services
- GASOLINE benefits to keep:
  - Flexible client-server scheduling architecture
    - Threads respond to service requests issued by the master.
    - To do a new task, simply add a new service.
  - Portability
    - Interprocessor communication occurs by high-level requests to the Machine-Dependent Layer (MDL), which is rewritten to take advantage of each parallel architecture.
  - Advanced interprocessor data caching
    - < 1 in 100,000 off-PE requests actually result in communication.
54. N tropy Design
- Dynamic load balancing (available now)
  - Workload and processor domain boundaries can be dynamically reallocated as computation progresses.
- Data pre-fetching (to be implemented)
  - Predict and request off-PE data that will be needed for upcoming tree nodes.
55. N tropy Design
- Computing across grid nodes
  - Much more difficult than between nodes on a tightly-coupled parallel machine
    - Network latencies between grid resources are 1000 times higher than between nodes on a single parallel machine.
  - Nodes on far grid resources must be treated differently than the processor next door:
    - Data mirroring or aggressive prefetching
    - Sophisticated workload management, synchronization
56. N tropy Features
- By using N tropy you get a lot of features for free:
  - Tree objects and methods
    - Highly optimized and flexible
  - Automatic parallelization and scalability
    - You only write serial bits of code!
  - Portability
    - Interprocessor communication occurs by high-level requests to the Machine-Dependent Layer (MDL), which is rewritten to take advantage of each parallel architecture.
    - MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS), SGI Altix
57. N tropy Features
- By using N tropy you get a lot of features for free:
  - Collectives
    - AllToAll, AllGather, AllReduce, etc.
  - Automatic reduction variables
    - All of your routines can return scalars to be reduced across all processors (see the MPI sketch after this list)
  - Timers
    - 4 automatic N tropy timers
    - 10 custom timers
  - Automatic communication and I/O statistics
    - Quickly identify bottlenecks
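For contrast, this is what an automatic reduction variable hides: at the MPI level, combining per-PE partial values into a global result is an explicit collective. A minimal self-contained example in standard MPI (compute_partial() is a hypothetical stand-in for a PE's local share of the work):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for one PE's local computation. */
    static double compute_partial(void) { return 1.0; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = compute_partial();
        double global = 0.0;
        /* Sum the per-PE partial values; every rank gets the result. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0) printf("global sum = %g\n", global);
        MPI_Finalize();
        return 0;
    }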
58. Serial Performance
- N tropy vs. an existing serial n-point correlation function calculator
  - N tropy is 6 to 30 times faster in serial!
- Conclusions:
  - Not only does it take much less time to write an application using N tropy,
  - your application may run faster than if you wrote it from scratch!
59. Performance
10 million particles, spatial 3-point correlation function, 3 to 4 Mpc.
This problem is substantially harder than gravity!
3 FTE-months of development time!
60. N tropy: Meaningful Benchmarks
- The purpose of this framework is to minimize development time!
- Development time for:
  - N-point correlation function calculator: 3 months
  - Friends-of-Friends group finder: 3 weeks
  - N-body gravity code: 1 day!
(OK, I cheated a bit and used existing serial N-body code fragments)
61. N tropy Conceptual Schematic
(Diagram: a Computational Steering Layer in C, C++, or Python (Fortran?) sits atop the framework black box, with a possible Web Service Layer (WSDL? SOAP?, at least from Python) connecting to the VO. Key framework components, Tree Services: Dynamic Workload Management, Domain Decomposition/Tree Building, Tree Traversal, Collectives, Parallel I/O. User-supplied pieces: user tree and particle data, user serial I/O routines, user tree traversal routines, user serial collective staging and processing routines.)
62. Summary
Prehistoric times: FORTRAN is heralded as the first high-level language. Scientists run on serial supercomputers. Scientists write many programs for them. Scientists are happy.
Early 1990s (ancient times): MPPs are born. Scientists scratch their heads and figure out how to parallelize their algorithms.
Mid 1990s: Scientists start writing scalable code for MPPs. After much effort, scientists are kind of happy again.
Early 2000s: Scientists no longer run their simulations on the biggest MPPs because they cannot analyze the output. Scientists are seriously bummed.
20??: Google achieves world domination. Scientists still program in a high-level language they call FORTRAN.
63. Summary
- N tropy is an attempt to allow scientists to rapidly develop their analysis codes for a multiprocessor environment.
- Our results so far show that it is worthwhile to invest time developing individual frameworks that are:
  - Serially optimized
  - Scalable
  - Flexible enough to be customized to many different applications, even applications that you do not currently envision.
- Is this a solution for the 100,000-processor world of tomorrow?
64. Pittsburgh Supercomputing Center
- Founded in 1986
- Joint venture between Carnegie Mellon University, the University of Pittsburgh, and Westinghouse Electric Co.
- Funded by several federal agencies as well as private industry.
- Main source of support is the National Science Foundation, Office of Cyberinfrastructure
65. Pittsburgh Supercomputing Center
- PSC is the third largest NSF-sponsored supercomputing center
- BUT we provide over 60% of the computer time used by NSF research
- AND PSC is the only academic supercomputing center in the U.S. to have had the most powerful supercomputer in the world (for unclassified research)
66. Pittsburgh Supercomputing Center
- GOAL: To use cutting-edge computer technology to do science that would not otherwise be possible
67. Conclusions
- Most data analysis in astronomy is done using trees as the fundamental data structure.
- Most operations on these tree structures are functionally identical.
- Based on our studies so far, it appears feasible to construct a general-purpose multiprocessor framework that users can rapidly customize to their needs.
68. Cosmological N-Body Simulation: Timings
- Time required for 1 floating point operation: 0.25 ns
- Time required for 1 memory fetch: 10 ns (40 floats)
- Time required for 1 off-processor fetch: 10 µs (40,000 floats)
- Lesson: an off-processor fetch costs roughly 1000 memory fetches (10 µs / 10 ns), so only 1 in 1000 memory fetches can result in network activity!
69. The Very First "Super Computer"
- 1929: The New York World newspaper coins the term "super computer" when talking about a giant tabulator custom-built by IBM for Columbia University
70. Review: Hallmarks of Computing
1956: FORTRAN heralded as the world's first high-level language
1966: Seymour Cray develops the CDC 6600, the first supercomputer
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: Cray-2 marks the end of the Golden Age of supercomputing
1989: Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC)
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1995: Cray Computer Corporation (CCC) goes bankrupt
1996: Cray Research Inc. acquired by SGI
1998: Google Inc. is founded
20??: Google achieves world domination. Scientists still program in a high-level language they call FORTRAN.
71. The T3D MPP
- 1024 DEC Alpha processors (COTS)
- 128 MB of RAM per processor (COTS)
- Custom-built Cray network fabric
(Image: a 1024-processor Cray T3D in 1994)
72. General characteristics of MPPs
- COTS processors
- COTS memory subsystem
- Linux-based kernel
- Custom networking
  - Why?? Custom networking in MPPs has replaced the custom memory systems of vector machines
(Image: the 2068-processor Cray XT3 at PSC in 2006)
73. Example Science Application: Weather Prediction
Looking for tornadoes (credits: PSC, Center for Analysis and Prediction of Storms)
74. Reasons for being sensitive to communication latency
- A given processor (PE) may touch a very large subsample of the total dataset
  - Example: a self-gravitating system
- PEs must exchange information many times during a single transaction
  - Example: along domain boundaries of a fluid calculation (see the sketch below)
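A hedged sketch of that boundary exchange in standard MPI (illustrative, not taken from any real fluid code): each PE swaps one ghost element with each neighbor every timestep, so a single transaction pays network latency over and over.

    #include <mpi.h>

    /* Exchange ghost cells of a 1-D slab with periodic neighbors.
     * slab[0] and slab[n-1] are ghosts; slab[1..n-2] are interior. */
    void exchange_ghosts(double *slab, int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Send rightmost interior cell right; receive left ghost. */
        MPI_Sendrecv(&slab[n - 2], 1, MPI_DOUBLE, right, 0,
                     &slab[0],     1, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* Send leftmost interior cell left; receive right ghost. */
        MPI_Sendrecv(&slab[1],     1, MPI_DOUBLE, left,  1,
                     &slab[n - 1], 1, MPI_DOUBLE, right, 1,
                     comm, MPI_STATUS_IGNORE);
    }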
75. Features
- Flexible client-server scheduling architecture
  - Threads respond to service requests issued by the master.
  - To do a new task, simply add a new service.
- Computational steering involves trivial serial programming
76. Design
Gasoline functional layout, from serial to parallel layers:
- Computational Steering Layer (serial): executes on the master processor only
- Parallel Management Layer: coordinates execution and data distribution among processors
- Serial Layer: executes independently on all processors (Gravity Calculator, Hydro Calculator)
- Machine-Dependent Layer (MDL): interprocessor communication
77. Cosmological N-Body Simulation
SCIENCE:
1. Simulate how structure in the Universe forms from initial linear density fluctuations
   - Linear fluctuations in the early Universe supplied by cosmological theory.
2. Calculate the non-linear final states of these fluctuations.
3. See if these look anything like the real Universe.
4. No? Go to step 1.