
Transcript and Presenter's Notes

Title: Science on Supercomputers:


1
Science on Supercomputers
  • Pushing the (back of) the envelope

Jeffrey P. Gardner
Pittsburgh Supercomputing Center
Carnegie Mellon University
University of Pittsburgh
2
Outline
  • History (the past)
    • Characteristics of scientific codes
    • Scientific computing, supercomputers, and the Good Old Days
  • Reality (the present)
    • Is there anything super about computers anymore?
    • Why network means more net work on your part
  • Fantasy (the future)
    • Strategies for turning a huge pile of processors into something
      scientists can actually use

3
A (very brief) Introduction to Scientific Computing
4
Properties of interesting scientific datasets
  • Very large dataset, where
    • the calculation is tightly coupled

5
Example Science Application: Cosmology
  • Cosmological N-Body simulation
  • 100,000,000 particles
  • 1 TB of RAM

Resolving the gravitational force on any single particle requires the
entire dataset: read-only coupling.
(Image scale: 100 million light years)
6
Example Science Application: Cosmology
  • Cosmological N-Body simulation
  • 100,000,000 particles
  • 1 TB of RAM

Resolving the hydrodynamic forces requires information exchange between
particles: read-write coupling.
(Image scale: 100 million light years)
7
Scientific Computing
  • Transaction Processing¹
  • A transaction is an information processing operation that cannot be
    subdivided into smaller operations. Each transaction must succeed or
    fail as a complete unit; it cannot remain in an intermediate state.²
  • Functional definition
  • A transaction is any computational task
    • that cannot be easily subdivided, because the overhead in doing so
      would exceed the time required for the non-divided form to complete
    • where any further subdivisions cannot be written in such a way that
      they are independent of one another

¹ Term borrowed (and generalized, with apologies) from database management
² From Wikipedia
8
Scientific Computing
  • Functional definition
  • A transaction is any computational task
    • that cannot be easily subdivided, because the overhead in doing so
      would exceed the time required for the non-divided form to complete
  • Cosmological N-Body simulation
  • 100,000,000 particles
  • 1 TB of RAM

Resolving the gravitational force on any single particle requires the
entire dataset: read-only coupling.
9
Scientific Computing
  • Functional definition
  • A transaction is any computational task
    • where any further subdivisions cannot be written in such a way that
      they are independent of one another
  • Cosmological N-Body simulation
  • 100,000,000 particles
  • 1 TB of RAM

Resolving the hydrodynamic forces requires information exchange between
particles: read-write coupling.
10
Scientific Computing
  • In most business and web applications:
    • A single CPU usually processes many transactions per second
    • Transaction sizes are typically small

11
Scientific Computing
  • In many science applications:
    • A single transaction can take CPU hours, days, or years
    • Transaction sizes can be extremely large

12
What Made Computers Super?
  • Since the transaction must be memory-resident in order not to be
    I/O bound, the next bottleneck is memory.
  • The original supercomputers differed from ordinary computers in their
    memory bandwidth and latency characteristics.

13
The Golden Age of Supercomputing
  • 1976–1982: The Cray-1 is the most powerful computer in the world
  • The Cray-1 is a vector platform,
    • i.e. it performs the same operation on many contiguous memory
      elements in one clock tick (see the loop sketch below)
  • The memory subsystem was optimized to feed data to the processor at
    its maximum flop rate
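For illustration only (not from the original slides), the kind of loop a
vector machine executes as single vector operations is the classic SAXPY
pattern; the hardware streams contiguous elements of x and y through the
vector units instead of issuing one scalar multiply-add at a time:

    /* Illustrative SAXPY loop: y[i] = a*x[i] + y[i].
       A vector machine (or today's vectorizing compiler) performs this
       operation on whole chunks of contiguous elements at once. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }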

14
The Golden Age of Supercomputing
  • 1985–1989: The Cray-2 is the most powerful computer in the world
  • The Cray-2 is also a vector platform

15
Scientists Liked Supercomputers. They were simple to program!
  • They were serial machines
  • Caches? We don't need no stinkin' caches!
  • Scalar machines had no memory latency
  • This is as close as you get to an ideal computer
  • Vector machines offered substantial performance
    increases over scalar machines if you could
    vectorize your code.

16
Triumph of the Masses
  • In the 1990s, commercial off-the-shelf (COTS)
    technology became so cheap, it was no longer
    cost-effective to produce fully-custom hardware

17
Triumph of the Masses
  • Instead of producing faster processors with
    faster memory, supercomputer companies built
    machines with lots of processors in them.

A single processor Cray-2
A 1024-processor Cray (CRI) T3D
18
Triumph of the Masses
  • These were known as massively parallel platforms,
    or MPPs.

A single processor Cray-2
A 1024-processor Cray T3D
19
Triumph of the Masses(?)
A single-processor Cray-2, the world's fastest computer in 1989
A 1024-processor Cray T3D, the world's fastest computer in 1994 (almost)
20
Part II: The Present
  • Why network means more net work on your part

21
The Social Impact of MPPs
  • The transition from serial supercomputers to MPPs
    actually resulted in far fewer scientists using
    supercomputers.
  • MPPs are really hard to program!
  • Developing scientific applications for MPPs
    became an area of study in its own right
  • High Performance Computing (HPC)

22
Characteristics of HPC Codes
  • Large dataset
  • Data must be distributed across many compute nodes

[Figure: the CPU memory hierarchy and its MPP extension:
  Processor registers
  L1 cache: 2 cycles
  L2 cache: 10 cycles
  Main memory: 100 cycles
 shown alongside an image of an N-Body cosmology simulation]
23
What makes computers super anymore?
  • Cray T3D (1994): Cray-built interconnect fabric
  • PSC Terascale Computing System (TCS, 2000): custom interconnect fabric
    by Quadrics
  • PSC Cray XT3 (2006): Cray-built interconnect fabric
24
What makes computers super anymore?
  • I would propose the following definition:
  • A supercomputer differs from a pile of workstations in that
    • a supercomputer is optimized to spread a single large transaction
      across many, many processors.
  • In practice, this means that the network interconnect fabric is
    identified as the principal bottleneck.

25
What makes computers super anymore?
Google's 30-acre campus in The Dalles, Oregon
26
Review: Hallmarks of Computing
  • 1956: FORTRAN heralded as the world's first high-level language
  • 1966: Seymour Cray develops the CDC 6600, the first supercomputer
  • 1972: Seymour Cray founds Cray Research Inc. (CRI)
  • 1976: Cray-1 marks the beginning of the Golden Age of supercomputing
  • 1986: Pittsburgh Supercomputing Center is founded
  • 1989: Cray-2 marks the end of the Golden Age of supercomputing
  • 1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
  • 1998: Google Inc. is founded
  • 20??: Google achieves world domination. Scientists still program in a
    high-level language they call FORTRAN.
27
Review: HPC
  • High-Performance Computing (HPC) refers to a type
    of computation whereby a single, large
    transaction is spread across 100s to 1000s of
    processors.
  • In general, this kind of computation is sensitive
    to network bandwidth and latency.
  • Therefore, most modern-day supercomputers seek
    to maximize interconnect bandwidth and minimize
    interconnect latency within economic limits.

28
  • Naïve algorithm is O(N²)
  • Gasoline N-Body Treecode (O(N log N))
  • Began development in 1994 and continues to this day

[Figure: PE domains on a kd-tree (a subset of Binary Space Partitioning tree)]
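For reference (my sketch, not code from the talk), the naïve O(N²) force
calculation is just a double loop over all particle pairs; a treecode such
as Gasoline replaces the inner loop with a tree walk that approximates
distant groups of particles:

    /* Illustrative direct-summation gravity in units where G = 1:
       O(N^2) pair interactions, with no softening for brevity.
       A treecode replaces the inner loop with an O(log N) tree walk. */
    #include <math.h>

    void accel_direct(int n, const double pos[][3], const double *mass,
                      double acc[][3])
    {
        for (int i = 0; i < n; i++) {
            acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dz = pos[j][2] - pos[i][2];
                double r2 = dx*dx + dy*dy + dz*dz;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                acc[i][0] += mass[j] * dx * inv_r3;
                acc[i][1] += mass[j] * dy * inv_r3;
                acc[i][2] += mass[j] * dz * inv_r3;
            }
        }
    }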
29
Example HPC Application: Cosmological N-Body Simulation
30
Cosmological N-Body Simulation
PROBLEM
  • Everything in the Universe attracts everything
    else
  • Dataset is far too large to replicate in every PE's memory
  • Difficult to parallelize

31
Cosmological N-Body Simulation
PROBLEM
  • Everything in the Universe attracts everything
    else
  • Dataset is far too large to replicate in every PE's memory
  • Difficult to parallelize
  • Only 1 in 3000 memory fetches can result in an
    off-processor message being sent!

32
Characteristics of HPC Codes
  • Large dataset
  • Data must be distributed across many compute nodes

[Figure: the MPP memory hierarchy:
  Processor registers
  L1 cache: 2 cycles
  L2 cache: 10 cycles
  Main memory: 100 cycles
 shown alongside an image of an N-Body cosmology simulation]
33
Features
  • Advanced interprocessor data caching (see the sketch after this list)
  • Application data is organized into cache lines
  • Read cache
  • Requests for off-PE data result in fetching of
    cache line
  • Cache line is stored locally and used for future
    requests
  • Write cache
  • Updates to off-PE data are processed locally,
    then flushed to remote thread when necessary
  • < 1 in 100,000 off-PE requests actually result in communication.
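A minimal sketch of the read-cache idea (names such as fetch_remote_line
and cache_slot are hypothetical placeholders for the communication layer,
not Gasoline's actual API): a request for off-PE data fetches the whole
cache line containing it, and later requests for neighbouring elements are
served locally:

    /* Hypothetical software read cache for off-PE data. */
    #define LINE_SIZE 64                  /* elements per cache line */

    typedef struct {
        long   line_id;                   /* remote line held here (-1 = empty) */
        double data[LINE_SIZE];
    } CacheLine;

    /* Provided elsewhere (placeholders for this sketch): */
    extern CacheLine cache_table[];                 /* local cache slots   */
    extern long      cache_slot(long line_id);      /* maps line to a slot */
    extern void      fetch_remote_line(int owner_pe, long line_id, double *buf);

    double read_offpe(int owner_pe, long global_index)
    {
        long line = global_index / LINE_SIZE;
        CacheLine *slot = &cache_table[cache_slot(line)];
        if (slot->line_id != line) {                 /* miss: one message */
            fetch_remote_line(owner_pe, line, slot->data);
            slot->line_id = line;
        }
        return slot->data[global_index % LINE_SIZE]; /* hit: no message   */
    }

The write cache works the same way in reverse: updates accumulate in the
local line and are flushed to the owning PE when necessary.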

34
Features
  • Load balancing (a sketch follows below)
  • The amount of work each particle required for step t is tracked.
  • This information is used to distribute work evenly amongst processors
    for step t+1.
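One simple way to realize this (my sketch, assuming a one-dimensional
ordering of particles, not Gasoline's actual scheme): give each particle a
cost equal to the work measured at step t, then cut the particle list so
that every PE receives roughly the same total cost for step t+1:

    /* Illustrative cost-based partitioning: assign particles to n_pe
       processors so each gets about the same accumulated work. */
    void balance(int n_particles, const double *cost, int n_pe, int *owner)
    {
        double total = 0.0;
        for (int i = 0; i < n_particles; i++)
            total += cost[i];

        double per_pe = total / n_pe;     /* target work per processor */
        double acc = 0.0;
        int pe = 0;
        for (int i = 0; i < n_particles; i++) {
            owner[i] = pe;
            acc += cost[i];
            if (acc >= per_pe * (pe + 1) && pe < n_pe - 1)
                pe++;                     /* start filling the next PE */
        }
    }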

35
Performance
85% linearity on 512 PEs with pure MPI (Cray XT3)
92% linearity on 512 PEs with one-sided comms (Cray T3E SHMEM)
92% linearity on 2048 PEs on Cray XT3 for optimal problem size
(>100,000 particles per processor)
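For reference, "linearity" here is parallel efficiency in the usual sense
(my formulation, not the slide's):

    $$\text{efficiency}(P) = \frac{\text{speedup}(P)}{P} = \frac{T_1}{P\,T_P}$$

where $T_P$ is the wall-clock time on $P$ PEs; for datasets too large to
fit on one PE, $T_1$ is replaced by a scaled baseline taken from the
smallest run that fits.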
36
Features
  • Portability (see the interface sketch below)
  • Interprocessor communication by high-level requests to the
    Machine-Dependent Layer (MDL)
  • Only 800 lines of code per architecture
  • MDL is rewritten to take advantage of each parallel architecture
    (e.g. one-sided communication).
  • MPI-1, POSIX Threads, SHMEM, Quadrics, more

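To make "high-level requests to the MDL" concrete, here is the flavor of
such an interface (illustrative names, not the actual MDL API): the
application only ever calls routines like these, and each architecture
supplies its own implementation behind them:

    /* Hypothetical machine-independent communication interface, in the
       spirit of the MDL: one small API, many per-architecture back ends
       (MPI-1, POSIX threads, SHMEM, Quadrics, ...). */
    typedef struct mdl_context MDL;       /* opaque per-architecture state */

    void  mdl_initialize(MDL **mdl, int *argc, char ***argv);
    int   mdl_self(MDL *mdl);             /* this PE's id                  */
    int   mdl_threads(MDL *mdl);          /* total number of PEs           */
    void *mdl_acquire(MDL *mdl, int cache_id, int element, int owner_pe);
    void  mdl_release(MDL *mdl, int cache_id, void *element);
    void  mdl_finish(MDL *mdl);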
37
Applications
Galaxy Formation (10 million particles)
38
Applications
Solar System Planet Formation (1 million
particles)

39
Applications
Asteroid Collisions (2000 particles)

40
Applications
Piles of Sand (?!) (1000 particles)

41
Summary
  • N-Body simulations are difficult to parallelize
  • Gravity says everything interacts with
    everything else
  • GASOLINE achieves high scalability by using
    several beneficial concepts
  • Interprocessor data caching for both reads and
    writes
  • Maximal exploitation of any parallel architecture
  • Load balancing on a per-particle basis
  • GASOLINE proved useful for a wide range of
    applications that simulate particle interactions
  • Flexible client-server architecture aids in
    porting to new science domains

42
Part III: The Future
  • Turning a huge pile of processors into something
    that scientists can actually use.

43
How to turn simulation output into scientific
knowledge
Using 300 processors (circa 1996)
Step 1: Run simulation
44
How to turn simulation output into scientific
knowledge
Using 1000 processors (circa 2000)
Step 2: Analyze simulation on server
Step 1: Run simulation
45
How to turn simulation output into scientific
knowledge
Using 2000 processors (circa 2005)
Step 2: Analyze simulation on ??? (crossed out; unhappy scientist)
Step 1: Run simulation
46
How to turn simulation output into scientific
knowledge
Using 100,000 processors? (circa 2012)
Step 2: Analyze simulation on ??? (crossed out)
Step 1: Run simulation
The NSF has announced that it will be providing $200 million to build and
operate a Petaflop machine by 2012.
47
Turning TeraFlops into Scientific Understanding
  • Problem:
  • The size of simulations is no longer limited by the scalability of the
    simulation code, but by the scientists' inability to process the
    resultant data.

48
Turning TeraFlops into Scientific Understanding
  • As MPPs increase in processor count, analysis
    tools must also run on MPPs!
  • PROBLEM
  • Scientists usually write their own analysis
    programs
  • Parallel programs are hard to write!
  • HPC world is dominated by simulations
  • Code is often reused for many years by many
    people
  • Therefore, you can afford to spend lots of time
    writing the code.
  • Example: Gasoline required 10 FTE-years of development!

49
Turning TeraFlops into Scientific Understanding
  • Data analysis implies
  • Rapidly changing scientific inquiries
  • Much less code reuse
  • Data analysis requires rapid algorithm
    development!
  • We need to rethink how we as scientists interact
    with our data!

50
A Solution(?): N tropy
  • Scientists tend to write their own code
  • So give them something that makes that easier for
    them.
  • Build a framework that is
  • Sophisticated enough to take care of all of the
    parallel bits for you
  • Flexible enough to be used for a large variety of
    data analysis applications

51
N tropy: A framework for multiprocessor development
  • GOAL: Minimize development time for parallel applications.
  • GOAL: Enable scientists with no parallel programming background (or
    time to learn) to still implement their algorithms in parallel by
    writing only serial code.
  • GOAL: Provide seamless scalability from single-processor machines to
    MPPs, potentially even several MPPs in a computational Grid.
  • GOAL: Do not restrict inquiry space.

52
Methodology
  • Limited Data Structures
  • Astronomy deals with point-like data in an N-dimensional parameter space
  • The most efficient methods on these kinds of data use trees (see the
    node sketch below).
  • Limited Methods
  • Analysis methods perform a limited number of
    fundamental operations on these data structures.
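As an illustration (my sketch, not N tropy's actual data structure), the
tree node underlying such methods is small and generic; the same node
supports gravity walks, n-point counting, and group finding:

    /* Illustrative kd-tree node for point data in NDIM dimensions.
       A leaf covers a contiguous range of points; an internal node keeps a
       bounding box so a traversal can decide whether to open or skip it. */
    #define NDIM 3

    typedef struct KDNode {
        double lo[NDIM], hi[NDIM];        /* bounding box of this cell     */
        int    first, count;              /* range of points covered       */
        int    split_dim;                 /* dimension split at this node  */
        struct KDNode *left, *right;      /* children; NULL for a leaf     */
    } KDNode;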

53
N tropy: Design
  • GASOLINE already provides a number of advanced services
  • GASOLINE benefits to keep:
  • Flexible client-server scheduling architecture
  • Threads respond to service requests issued by the master.
  • To do a new task, simply add a new service (see the sketch below).
  • Portability
  • Interprocessor communication occurs by high-level
    requests to Machine-Dependent Layer (MDL) which
    is rewritten to take advantage of each parallel
    architecture.
  • Advanced interprocessor data caching
  • < 1 in 100,000 off-PE requests actually result in communication.
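"Simply add a new service" might look like the following (a hypothetical
registration interface, purely illustrative): the master broadcasts a
service ID plus input parameters, and every thread runs the handler
registered under that ID on its local data:

    /* Hypothetical client-server service registration. Each service is a
       serial function the worker threads run on their local piece of the
       data when the master requests it. */
    typedef void (*ServiceFn)(void *local_data, const void *in, void *out);

    extern void register_service(int service_id, ServiceFn fn);    /* workers */
    extern void request_service(int service_id, const void *in,
                                void *out_reduced);                 /* master  */

    /* Example: a new "count particles above a density threshold" service. */
    enum { SRV_COUNT_DENSE = 42 };

    static void count_dense(void *local_data, const void *in, void *out)
    {
        /* loop over local particles, write the local count into *out;
           the framework reduces the per-thread outputs for the master */
    }

    /* Each worker at startup:  register_service(SRV_COUNT_DENSE, count_dense);
       The master when needed:  request_service(SRV_COUNT_DENSE, &threshold, &n); */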

54
N tropy: Design
  • Dynamic load balancing (available now)
  • Workload and processor domain boundaries can be dynamically reallocated
    as the computation progresses.
  • Data pre-fetching (to be implemented)
  • Predict and request off-PE data that will be needed for upcoming tree
    nodes.

55
N tropy: Design
  • Computing across grid nodes
  • Much more difficult than between nodes on a tightly-coupled parallel
    machine
  • Network latencies between grid resources are ~1000 times higher than
    between nodes of a single parallel machine.
  • Nodes on far grid resources must be treated differently than the
    processor next door:
  • Data mirroring or aggressive prefetching
  • Sophisticated workload management, synchronization

56
N tropy: Features
  • By using N tropy you will get a lot of features
    for free
  • Tree objects and methods
  • Highly optimized and flexible
  • Automatic parallelization and scalability
  • You only write serial bits of code!
  • Portability
  • Interprocessor communication occurs by high-level
    requests to Machine-Dependent Layer (MDL) which
    is rewritten to take advantage of each parallel
    architecture.
  • MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS),
    SGI Altix

57
N tropy: Features
  • By using N tropy you will get a lot of features
    for free
  • Collectives
  • AllToAll, AllGather, AllReduce, etc.
  • Automatic reduction variables
  • All of your routines can return scalars to be reduced across all
    processors (the sketch below shows the kind of call this wraps)
  • Timers
  • 4 automatic N tropy timers
  • 10 custom timers
  • Automatic communication and I/O statistics
  • Quickly identify bottlenecks
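For orientation, an "automatic reduction variable" ultimately boils down to
a standard collective like the MPI call below; the framework's point is
that you never have to write it yourself:

    /* Every PE contributes a local scalar; every PE receives the sum. */
    #include <mpi.h>

    double global_sum(double local_value)
    {
        double total = 0.0;
        MPI_Allreduce(&local_value, &total, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
        return total;
    }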

58
Serial Performance
  • N tropy vs. an existing serial n-point correlation function calculator
  • N tropy is 6 to 30 times faster in serial!
  • Conclusions:
  • Not only does it take much less time to write an application using
    N tropy,
  • your application may also run faster than if you wrote it from scratch!

59
Performance
10 million particles, spatial 3-point correlation function, 3→4 Mpc
This problem is substantially harder than gravity!
3 FTE months of development time!
60
N tropy: Meaningful Benchmarks
  • The purpose of this framework is to minimize
    development time!
  • Development time for:
    • N-point correlation function calculator: 3 months
    • Friends-of-Friends group finder: 3 weeks
    • N-body gravity code: 1 day!

(OK, I cheated a bit and used existing serial
N-body code fragments)
61
N tropy: Conceptual Schematic
[Schematic. Key: framework components, tree services, user-supplied pieces.
  • Computational Steering Layer (C, C++, Python, Fortran?), with a Web
    Service Layer (at least from Python): WSDL? SOAP?; VO
  • Framework (black box): dynamic workload management, domain
    decomposition / tree building, tree traversal, collectives, parallel I/O
  • User-supplied: tree and particle data, serial I/O routines, tree
    traversal routines, serial collective staging and processing routines]
62
Summary
  • Prehistoric times: FORTRAN is heralded as the first high-level
    language. Scientists run on serial supercomputers. Scientists write
    many programs for them. Scientists are happy.
  • Early 1990s (ancient times): MPPs are born. Scientists scratch their
    heads and figure out how to parallelize their algorithms.
  • Mid 1990s: Scientists start writing scalable code for MPPs. After much
    effort, scientists are kind of happy again.
  • Early 2000s: Scientists no longer run their simulations on the biggest
    MPPs because they cannot analyze the output. Scientists are seriously
    bummed.
  • 20??: Google achieves world domination. Scientists still program in a
    high-level language they call FORTRAN.
63
Summary
  • N tropy is an attempt to allow scientists to
    rapidly develop their analysis codes for a
    multiprocessor environment.
  • Our results so far show that it is worthwhile to invest time
    developing individual frameworks that are
  • Serially optimized
  • Scalable
  • Flexible enough to be customized to many
    different applications, even applications that
    you do not currently envision.
  • Is this a solution for the 100,000 processor
    world of tomorrow??

64
Pittsburgh Supercomputing Center
  • Founded in 1986
  • Joint venture between Carnegie Mellon University,
    University of Pittsburgh, and Westinghouse
    Electric Co.
  • Funded by several federal agencies as well as
    private industries.
  • The main source of support is the National Science Foundation's Office
    of Cyberinfrastructure

65
Pittsburgh Supercomputing Center
  • PSC is the third-largest NSF-sponsored supercomputing center
  • BUT we provide over 60% of the computer time used by NSF research
  • AND PSC is the only academic supercomputing center in the U.S. to have
    had the most powerful supercomputer in the world (for unclassified
    research)

66
Pittsburgh Supercomputing Center
  • GOAL: To use cutting-edge computer technology to do science that would
    not otherwise be possible

67
Conclusions
  • Most data analysis in astronomy is done using
    trees as the fundamental data structure.
  • Most operations on these tree structures are
    functionally identical.
  • Based on our studies so far, it appears feasible
    to construct a general purpose multiprocessor
    framework that users can rapidly customize to
    their needs.

68
Cosmological N-Body Simulation: Timings
  • Time required for 1 floating-point operation: 0.25 ns
  • Time required for 1 memory fetch: 10 ns (≈ 40 floating-point operations)
  • Time required for 1 off-processor fetch: 10 µs (≈ 40,000 floating-point
    operations)
  • Lesson: Only 1 in 1000 memory fetches can result in network activity!
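Spelled out (using the numbers above), the lesson is just the latency ratio:

    $$\frac{t_\text{off-PE}}{t_\text{mem}} = \frac{10\ \mu\text{s}}{10\ \text{ns}} = 1000,
      \qquad
      \frac{t_\text{mem}}{t_\text{flop}} = \frac{10\ \text{ns}}{0.25\ \text{ns}} = 40 .$$

If more than roughly one memory fetch in a thousand goes off-processor,
the code spends most of its time waiting on the network rather than
computing.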

69
The very first Super Computer
  • 1929: The New York World newspaper coins the term "super computer"
    when writing about a giant tabulator custom-built by IBM for Columbia
    University

70
Review: Hallmarks of Computing
  • 1956: FORTRAN heralded as the world's first high-level language
  • 1966: Seymour Cray develops the CDC 6600, the first supercomputer
  • 1972: Seymour Cray founds Cray Research Inc. (CRI)
  • 1976: Cray-1 marks the beginning of the Golden Age of supercomputing
  • 1986: Pittsburgh Supercomputing Center is founded
  • 1989: Cray-2 marks the end of the Golden Age of supercomputing
  • 1989: Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC)
  • 1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
  • 1995: Cray Computer Corporation (CCC) goes bankrupt
  • 1996: Cray Research Inc. is acquired by SGI
  • 1998: Google Inc. is founded
  • 20??: Google achieves world domination. Scientists still program in a
    high-level language they call FORTRAN.
71
The T3D MPP
  • 1024 DEC Alpha processors (COTS)
  • 128 MB of RAM per processor (COTS)
  • Custom-built Cray network fabric

A 1024-processor Cray T3D in 1994
72
General characteristics of MPPs
  • COTS processors
  • COTS memory subsystem
  • Linux-based kernel
  • Custom networking
  • Why custom networking? Because custom networking in MPPs has replaced
    the custom memory systems of vector machines

The 2068-processor Cray XT3 at PSC in 2006
73
Example Science Applications: Weather Prediction
Looking for tornadoes (credit: PSC, Center for Analysis and Prediction of
Storms)
74
Reasons for being sensitive to communication
latency
  • A given processor (PE) may touch a very large subsample of the total
    dataset
  • Example: a self-gravitating system
  • PEs must exchange information many times during a single transaction
  • Example: along domain boundaries of a fluid calculation

75
Features
  • Flexible client-server scheduling architecture
  • Threads respond to service requests issued by the master.
  • To do a new task, simply add a new service.
  • Computational steering involves trivial serial
    programming

76
Design: Gasoline Functional Layout (serial and parallel layers)
  • Computational Steering Layer (serial): executes on the master
    processor only
  • Parallel Management Layer (parallel): coordinates execution and data
    distribution among processors
  • Serial Layer (serial): executes independently on all processors
    (Gravity Calculator, Hydro Calculator)
  • Machine-Dependent Layer (MDL) (parallel): interprocessor communication
77
Cosmological N-Body Simulation: Science
  • Simulate how structure in the Universe forms from initial linear
    density fluctuations:
  1. Linear fluctuations in the early Universe are supplied by cosmological
     theory.
  2. Calculate the non-linear final states of these fluctuations.
  3. See if these look anything like the real Universe.
  4. No? Go to step 1.