Title: THE PANEL
1THE PANEL
- Thomas Sterling
- California Institute of Technology
- NASA Jet Propulsion Laboratory
- February 23, 1999
22nd Conference on Enabling Technologies for
Peta(fl)ops Computing
- Thomas Sterling
- California Institute of Technology
- NASA Jet Propulsion Laboratory
- February 16, 1999
4Why We Were There
- 5 years later, a 2nd look at a daunting prospect
- 6 in-depth workshops
- 8 point design studies sponsored by NSF
- 2-year study sponsored by DARPA, NSA, and NASA
- Teraflops
- ASCI
- PITAC
- Contending views of an uncertain future in HEC
5Goals
- Conduct an open forum on Petaflops computing
- Examine our new understanding
- issues
- opportunities
- challenges
- directions
- Present results from important research
- Expose contending viewpoints on alternatives
- Frank discussions about the future of HEC research
- Better define this as an interdisciplinary pursuit
- Establish the inter-relationship of academia, industry, and government
6Information Packet
- The BAG
- Enabling Technologies for Petaflops Computing, MIT Press 1995
- PAWS/PetaSoft Proceedings
- PAL Notes/Proceedings
- POWR Workshop Proceedings
- Petaflops II Conference Information Packet
- Petaflops I Group Photograph
- HTMT Technical Note
- Peanut M&Ms
7What are the major findings from this conference?
- Pflops are needed, as soon as possible
- 2010 or before
- Enough chips can be glued together
- not sure about power or cost but that's only money (yours)
- no insight on efficiency
- Algorithms can play an important role in recovering from the sins of architecture
- Adaptive control critical to effective resource management and task allocation/scheduling
- Little Federal will/vision to support systems architecture research
8Open Issues?
- Efficiency
- How much bandwidth is needed and where
- Latency management strategy - roles for
- language
- compiler
- runtime system
- architecture
- True operational requirements of Pflops apps
- New relationship between compiler and runtime
- Impact of exotic strategies
- How to fund research in systems architecture
9Major Obstacles and Areas to be Explored?
- Pflops Applications requirements/characteristics
- bandwidth, locality, granularity, access patterns, ...
- Adaptive resource management policies and mechanisms
- Bandwidth
- Programming model
- Funding
- sufficient
- long term
- multi-agency
10Recommendations for Rapid Deployment of Effective
Pflops?
- No more workshops/conferences, until
- Point design studies in Pflops scaled
- applications
- algorithms
- system software methodologies
- The Other Path - explore it, exploit it
- Sponsor Pflops systems architecture and software research
- We need to build
- Strong focused academic/industry/govt consortium
11Observations
- Longest/shortest 5 years of my professional life
- So much done, so much to do
- Genesis of a wonderful interdisciplinary community
- Failure of federal will, abrogation of responsibility
- ASCI is good
- ASCI is lonely
- HTMT needs intellectual consideration by the community
- Completion of an exciting and important process
12What are the Major Obstacles for sustained
Petaflops-scale Performance for Real Apps?
- Getting a Petaflops computer
13INTEGRATED SMP - WDM
[Figure labels]
- DRAM - 4 GBytes - highly interleaved
- Multi-lambda AON
- Crossbar
- Coherence
- 640 GBytes/sec
- 2nd-level cache, 96 MBytes
- 64 bytes wide
- 160 GBytes/sec
- VLIW/RISC core, 24 GFLOPS, 6 GHz
- ...
14COTS PetaFlop System
[Figure: boxes numbered 1 through 64 arranged around an ALL-OPTICAL SWITCH. Labels: 128 die/box, 4 CPU/die; Multi-Die Multi-Processor; I/O; 10 meters, 50 ns delay]
15COTS PetaFlop System
- 8192 Dies (4 CPU/die-minimum)
- Each Die is 120 GFlops
- 1 PetaFlop Peak
- Power: 8192 x 200 Watts = 1.6 MegaWatts
- Extra Main Memory: >3 MegaWatts (512 TBytes)
- 15.36 TFlops/Rack (128 die)
- 30 KWatts/Rack - thus 64 racks - 30 inch
- Common System I/O
- 2 Level Main Memory
- Optical Interconnect
- OC768 Channels (40 GHz)
- 128 Channels per Die (DWDM)-5.12 THz
- ALL Optical Switching
- Bisection Bandwidth of 50 TBytes/sec
- 15 TFlops/rack x 0.1 bytes/flop/sec x 32 racks
- Rack Bandwidth - 15 TFlops x 0.1 bytes/flop = 12 THz (at 8 bits per byte)
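The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope check, not part of the talk: the inputs are the slide's own numbers, while the function name, the result keys, and the reading of "THz" as terabits per second at 8 bits per byte are assumptions made for illustration.

```python
# Back-of-envelope check of the COTS point-design numbers quoted on the slide.
# Inputs come from the slide; names and the bits-per-byte conversion are assumptions.

def cots_point_design(
    dies=8192,
    gflops_per_die=120.0,
    watts_per_die=200.0,
    dies_per_rack=128,
    channels_per_die=128,
    ghz_per_channel=40.0,   # OC768 channel rate as given on the slide
    bytes_per_flop=0.1,
    bisection_racks=32,
    bits_per_byte=8,        # assumption: "THz" is read as Tbit/s
):
    """Return the derived system figures as a dict of plain numbers."""
    rack_tflops = dies_per_rack * gflops_per_die / 1e3
    return {
        "peak_pflops": dies * gflops_per_die / 1e6,
        "cpu_power_mw": dies * watts_per_die / 1e6,
        "racks": dies // dies_per_rack,
        "rack_tflops": rack_tflops,
        "die_optical_thz": channels_per_die * ghz_per_channel / 1e3,
        "bisection_tbytes_per_s": rack_tflops * bytes_per_flop * bisection_racks,
        "rack_tbits_per_s": rack_tflops * bytes_per_flop * bits_per_byte,
    }


if __name__ == "__main__":
    for name, value in cots_point_design().items():
        print(f"{name:>24}: {value:g}")
```

With the unrounded 15.36 TFlops/rack, the bisection and rack bandwidth come out slightly above the slide's rounded 50 TBytes/sec and 12 THz, which use 15 TFlops/rack.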
16What are the Major Obstacles for sustained
Petaflops-scale Performance for Real Apps?
- Getting a Petaflops computer
- Getting Petaflops Apps
17Applications Areas for Petaflops
- Materials simulations between microscale and macroscale (bulk materials)
- Coupled electro-mechanical simulations of nano-scale structures (micromachines)
- Full plant optimization for complex processes (chemical, manufacturing problems)
- High-resolution reacting flow problems (combustion, chemical mixing, multiphase flow)
- High-realism immersive virtual reality based on real-time radiosity modeling and complex scenes
- Time dependent simulations of complex biomolecules (membranes, synthesis and DNA)
- Multidisciplinary optimization problems combining structures, fluids and geometry
- Modeling of integrated earth systems (ocean, atmosphere, bio-geosphere)
- Improved 4d/6d data assimilation applied to remote sensing and environmental models
- Computational cosmology (particle models, astrophysical fluids and radiation transport)
- Computational testing and simulation to replace weapons testing (stockpile stewardship)
- Simulation of plasma fusion devices for controlled fusion (to optimize future reactors)
- Design of new chemical compounds and synthesis pathways (environmental safety and cost improvements)
- Comprehensive modeling of groundwater and oil reservoirs (contamination and management)
- Modeling of complex transportation, communication and economic systems
18Rational Drug Design
Nanotechnology
Tomographic Reconstruction
Phylogenetic Trees
Biomolecular Dynamics
Neural Networks
Crystallography
Fracture Mechanics
MRI Imaging
Reservoir Modelling
Molecular Modelling
Biosphere/Geosphere
Diffraction Inversion Problems
Distribution Networks
Chemical Dynamics
Atomic Scattering
Electrical Grids
Flow in Porous Media
Pipeline Flows
Data Assimilation
Signal Processing
Condensed Matter Electronic Structure
Plasma Processing
Chemical Reactors
Cloud Physics
Electronic Structure
Boilers
Combustion
Actinide Chemistry
Radiation
Fourier Methods
Graph Theoretic
CVD
Quantum Chemistry
Reaction-Diffusion
Chemical Reactors
Cosmology
Transport
n-body
Astrophysics
Multiphase Flow
Manufacturing Systems
CFD
Basic Algorithms & Numerical Methods
Discrete Events
PDE
Weather and Climate
Air Traffic Control
Military Logistics
Structural Mechanics
Seismic Processing
Population Genetics
Monte Carlo
ODE
Multibody Dynamics
Geophysical Fluids
VLSI Design
Transportation Systems
Aerodynamics
Raster Graphics
Economics
Fields
Orbital Mechanics
Nuclear Structure
Ecosystems
QCD
Pattern Matching
Symbolic Processing
Neutron Transport
Economics Models
Genome Processing
Virtual Reality
Astrophysics
Cryptography
Electromagnetics
Computer Vision
Virtual Prototypes
Intelligent Search
Multimedia Collaboration Tools
Computer Algebra
Databases
Magnet Design
Computational Steering
Scientific Visualization
Data Mining
Automated Deduction
Number Theory
CAD
Intelligent Agents
19Bodega Bay Applications Workshop
- Artificial Intelligence
- Astrophysics
- Climate
- Computational Biology
- Computational Chemistry
- Computational Physics
- Cryptography
- Digital Libraries and Multimedia
- Dynamical Systems
- Economics
- Computational Electromagnetics
- Electronic Device Simulation
- Fluid Dynamics
- Geophysics
- Graph Theory
- Mathematics and Logic
- Medicine
- Multidisciplinary Problems
- Optimization
- Particle-in-cell models
- Real-time/Time critical
- Signal Processing
- Shock physics
- Structural mechanics
- Vision and Geometric Computing
20What are the Major Obstacles for sustained
Petaflops-scale Performance for Real Apps?
- Getting a Petaflops computer
- Getting Petaflops Apps
- Getting Efficiency
21Getting Efficiency
- Overhead
- work to manage program concurrency and resource parallelism
- imposes upper bounds on scalability/granularity
- Latency
- distance in time (cycles) of service access requests
- duration of waiting by operation sequence
- Contention
- time to service from shared resources
- Starvation
- sufficient concurrency to fill available resources
- balance of workloads to engage all resources
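One way to read the four factors above is as an accounting of where time goes. The sketch below is an illustrative model, not something presented in the talk: the `TimeBudget` class, its field names, and `overhead_limited_efficiency` are hypothetical, and the model assumes the four loss terms simply add.

```python
from dataclasses import dataclass


@dataclass
class TimeBudget:
    """Time spent per category, all in the same unit (for example, cycles)."""
    useful: float             # cycles doing application work
    overhead: float = 0.0     # managing concurrency and parallel resources
    latency: float = 0.0      # waiting on remote service requests
    contention: float = 0.0   # waiting on shared resources
    starvation: float = 0.0   # idle for lack of work or balance

    def total(self) -> float:
        return (self.useful + self.overhead + self.latency
                + self.contention + self.starvation)

    def efficiency(self) -> float:
        """Fraction of total time that is useful work (assumes additive losses)."""
        return self.useful / self.total()


def overhead_limited_efficiency(task_work: float, task_overhead: float) -> float:
    """Efficiency when every task of size task_work pays a fixed task_overhead.

    Shows why overhead bounds granularity: as tasks shrink toward the
    overhead cost, efficiency falls toward zero.
    """
    return task_work / (task_work + task_overhead)


if __name__ == "__main__":
    budget = TimeBudget(useful=60, overhead=10, latency=15,
                        contention=5, starvation=10)
    print(f"efficiency: {budget.efficiency():.2f}")
    for work in (1000, 100, 10):
        eff = overhead_limited_efficiency(work, 100)
        print(f"task of {work} cycles, 100 overhead: {eff:.2f}")
```

In this model a task has to be much larger than its management overhead before efficiency approaches 1, which is the sense in which overhead limits how finely work can be divided.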
23[Figure: SIDE VIEW with a 4°K (50 W) stage and a 77°K stage, Fiber/Wire Interconnects; labeled dimensions 0.3 m, 0.5 m, 1 m, 1.4 m, 3 m]
24HTMT Percolation Model
[Figure labels]
- CRYOGENIC AREA
- DMA to CRAM
- Split-Phase Synchronization to SRAM
- start, done
- C-Buffer
- I-Queue, A-Queue, T-Queue, D-Queue
- Parcel Invocation & Termination
- Parcel Assembly & Disassembly
- Parcel Dispatcher & Dispenser
- Re-Use
- Run Time System
- SRAM-PIM
- DMA to DRAM-PIM
25Getting Efficiency
- Contention
- hardware for bandwidth, logic throughput, hardware arbitration
- Latency
- multithreaded processor with hardware context switching
- percolation for proactive prestaging of executables
- PIM-DRAM & PIM-SRAM provide smart data-oriented mechanisms
- Overhead
- hardware context switching
- in PIM smart synchronization and context management
- proactive percolation performed in PIM
- Starvation
- dynamic load balancing
- high speed processor for reduced parallelism
- expose/exploit fine grain parallelism
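To illustrate how a multithreaded processor with cheap context switching can hide latency, the sketch below uses a simple idealized model rather than anything specific to HTMT. It assumes every thread runs for a fixed number of cycles, then waits a fixed latency, and that a switch costs a fixed number of cycles. The function names and parameters are hypothetical.

```python
import math


def utilization(threads: int, run_cycles: float, latency_cycles: float,
                switch_cycles: float = 0.0) -> float:
    """Processor utilization under an idealized multithreading model.

    Each thread computes for run_cycles, then stalls for latency_cycles.
    Below saturation, utilization grows linearly with the thread count;
    above it, only the context-switch cost limits utilization.
    """
    if threads < 1 or run_cycles <= 0:
        raise ValueError("need at least one thread and positive run_cycles")
    saturated = run_cycles / (run_cycles + switch_cycles)
    linear = threads * run_cycles / (run_cycles + switch_cycles + latency_cycles)
    return min(saturated, linear)


def threads_to_saturate(run_cycles: float, latency_cycles: float,
                        switch_cycles: float = 0.0) -> int:
    """Smallest thread count at which the latency is fully overlapped."""
    return math.ceil(1 + latency_cycles / (run_cycles + switch_cycles))


if __name__ == "__main__":
    # Illustrative numbers only: 10 cycles of work per 90-cycle stall.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} threads: utilization {utilization(n, 10, 90):.2f}")
    print("threads to saturate:", threads_to_saturate(10, 90))
```

With a free (hardware) context switch, the model reaches full utilization once enough threads are available to cover the stall. With a nonzero switch cost, the ceiling drops to run_cycles / (run_cycles + switch_cycles), which is one reason to do context switching in hardware.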
26Petaflops I Group Photograph