Title: Gordon Bell
1. ISHPC: International Symposium on High-Performance Computing, 26 May 1999
- Gordon Bell
- http://www.research.microsoft.com/users/gbell
- Microsoft
2. What a difference spending >10X/system and 25 years makes!
- 40 Tflops ESRDC c2002 (artist's view)
- 150 Mflops CDC 7600 / Cray 1, LLNL c1978
3. Supercomputers(t)
- Time | $M | Structure | Example
- 1950 | 1 | mainframes | many...
- 1960 | 3 | instruction parallelism | IBM / CDC mainframe, SMP
- 1970 | 10 | pipelining | 7600 / Cray 1
- 1980 | 30 | vectors | SCI, Crays
- 1990 | 250 | MIMDs: mC, SMP, DSM | Crays/MPP
- 2000 | 1,000 | ASCI, COTS MPP | Grid, Legion
4. Supercomputing: speed at any price, using parallelism
- Intra-processor
  - Memory overlap and instruction lookahead
  - Functional parallelism (2-4)
  - Pipelining (10)
  - SIMD a la ILLIAC: 2D array of 64 PEs vs. vectors
  - Wide instruction word (2-4)
  - MTA (10-20) with parallelization of a stream
- MIMD multiprocessors: parallelization allows programs to stay with ONE stream
  - SMP (4-64)
  - Distributed shared memory SMPs (100)
- MIMD multicomputers force multi-streams
  - Multicomputers aka MPP aka clusters (10K)
  - Grid (100K)
5. Growth in Computational Resources Used for UK Weather Forecasting
- 10^10 in 50 years, i.e. roughly 1.58^50, about 58%/year (see the check below)
[Chart: machine capability from roughly 100 ops/s (Leo, Mercury, KDF9) in 1950 to ~10 Tflops (195, 205, YMP, ...) by 2000]
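Where the 1.58^50 figure comes from: a minimal check (Python), assuming the chart spans 10^10 of total growth over the 50 years 1950-2000, as stated on the slide.

```python
# Annual growth factor implied by 10^10 total growth over 50 years.
total_growth = 1e10
years = 50
annual = total_growth ** (1 / years)     # per-year growth factor
print(f"annual growth factor ~ {annual:.2f}")   # ~1.58, i.e. ~58%/year
```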
6. Talk plan
- The very beginning: build it yourself
- Supercomputing with one computer: the Cray era, 1960-1995
- Supercomputing with many computers: parallel computing, 1987-
- SCI: what was learned?
- Why I gave up on shared memory
- From the humble beginnings
- Petaflops: when, how, how much?
- New ideas: NOW, Legion, Grid, Globus
- Beowulf: build it yourself
7. Supercomputer: old definition(s)
- In the beginning, everyone built their own computer
- Largest computer of the day
- Scientific and engineering apps
- Large government (defense, weather, aero) laboratories and centers are the first buyers
- Price is no object: $3M, $30M, $50M, $150-250M
- Worldwide market: 3-5, xx, or xxx?
8. Supercomputing: new definition
- Was a single, sequential program
- Has become a single, large-scale job/program composed of many programs running in parallel
- Distributed within a room
- Evolving to be distributed within a region and globe
- Cost, effort, and time are extraordinary
- Back to the future: build your own super with shrink-wrap software!
9. Manchester: the first computer. Baby, Mark I, and Atlas
10. von Neumann computers: Rand Johnniac. When laboratories built their own computers
11. Cray, 1925-1996 (see gbell home page)
12. CDC 1604, 6600
13. CDC 7600: pipelining
14. CDC STAR, ETA10: scalar matters
15. Cray 1 #6 from LLNL. Located at The Computer Museum History Center, Moffett Field
16. Cray 1: 150 kW, MG set, heat exchanger
17. Cray XMP/4 proc., c1984
18. A look at the beginning of the new beginning
19. SCI (Strategic Computing Initiative), funded by DARPA and aimed at a teraflops! The era of State computers and many efforts to build high-speed computers led to HPCC: Thinking Machines, Intel supers, Cray T3 series
20. Minisupercomputers: a market whose time never came. Alliant, Convex, Ardent+Stellar = Stardent = 0
21. Cydrome and Multiflow: prelude to wide-word parallelism in Merced
- Minisupers with VLIW attack the market
- Like the minisupers, they are repelled
- It's software, software, and software
- Was it a basically good idea that will now work as Merced?
22. KSR 1: first commercial DSM; NUMA (non-uniform memory access) aka COMA (cache-only memory architecture)
23. Intel's iPSC 1, Touchstone Delta
24. "In Dec. 1995 computers with 1,000 processors will do most of the scientific processing."
- Danny Hillis, 1990 (1 paper or 1 company)
25. The Bell-Hillis Bet: Massive Parallelism in 1995
- Compared TMC vs. world-wide supers on: applications, petaflops/mo., revenue
26. Thinking Machines CM1 and CM5, c1983-1993
27. Bell-Hillis Bet wasn't paid off!
- My goal was not necessarily to just win the bet!
- Hennessy and Patterson were to evaluate what was really happening
- Wanted to understand degree of MPP progress and programmability
28. SCI (c1980s): Strategic Computing Initiative funded
- AT&T/Columbia (Non-Von), BBN Labs, Bell
Labs/Columbia (DADO), CMU Warp (GE Honeywell),
CMU (Production Systems), Encore, ESL, GE (like
connection machine), Georgia Tech, Hughes
(dataflow), IBM (RP3), MIT/Harris, MIT/Motorola
(Dataflow), MIT Lincoln Labs, Princeton (MMMP),
Schlumberger (FAIM-1), SDC/Burroughs, SRI
(Eazyflow), University of Texas, Thinking
Machines (Connection Machine),
29. Those who gave their lives in the search for parallelism
- Alliant, American Supercomputer, Ametek, AMT,
Astronautics, BBN Supercomputer, Biin, CDC, Chen
Systems, CHOPP, Cogent, Convex (now HP), Culler,
Cray Computers, Cydrome, Denelcor, Elxsi, ETA,
E&S Supercomputers, Flexible, Floating Point
Systems, Gould/SEL, IPM, Key, KSR, MasPar,
Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY,
SCS, SDSA, Supertek (now Cray), Suprenum,
Stardent (Ardent+Stellar), Supercomputer Systems
Inc., Synapse, Thinking Machines, Vitec, Vitesse,
Wavetracer.
30. What can we learn from this?
- The classic flow from university research to product development worked
- SCI: ARPA-funded product development failed. No successes. Intel prospered.
- ASCI: DOE-funded product purchases create competition
- First efforts in startups all failed:
  - Too much competition (with each other)
  - Too little time to establish themselves
  - Too little market. No apps to support them
  - Too little cash
- Supercomputing is for the large and rich...
- or is it? Beowulf, shrink-wrap clusters
31. Humble beginning: in 1981, would you have predicted this would be the basis of supers?
32. The Virtuous Economic Cycle that drives the PC industry
- Cycle elements: competition, volume, standards, utility/value, innovation
33. Platform Economics
- Traditional computers: custom or semi-custom, high-tech and high-touch
- New computers: high-tech and no-touch
[Chart: price ($K), volume (K units), and application price vs. computer type (mainframe, WS, browser)]
34. Computer: ops/sec x word length / price
35. Intel's iPSC 1, Touchstone Delta
36. GB with NT, Compaq, HP cluster
37. The Alliance LES NT Supercluster
- "Supercomputer performance at mail-order prices" -- Jim Gray, Microsoft
- Andrew Chien, CS UIUC -> UCSD
- Rob Pennington, NCSA
- Myrinet network, HPVM, Fast Messages
- Microsoft NT OS, MPI API (a minimal MPI sketch follows this list)
- 192 HP 300 MHz + 64 Compaq 333 MHz
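For readers unfamiliar with the MPI API mentioned above, a minimal sketch of the message-passing model such a cluster exposes, written with the mpi4py bindings for brevity (an assumption for illustration; the NT Supercluster's HPVM/Fast Messages stack exposed the same MPI API from C/Fortran).

```python
# Minimal MPI sketch: each rank computes a partial sum, rank 0 gathers the total.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # this process's id, 0..size-1
size = comm.Get_size()        # number of processes in the job

local = sum(range(rank, 1_000_000, size))          # strided partial sum
total = comm.reduce(local, op=MPI.SUM, root=0)     # combine at rank 0

if rank == 0:
    print(f"{size} ranks computed total = {total}")
```

Run with something like `mpiexec -n 4 python sum.py`; the same program scales from a few nodes to hundreds without change, which is the point of the model.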
38. Are we at a new beginning?
- "Now, this is not the end. It is not even the beginning of the end, but it is, perhaps, the end of the beginning." 1999 Salishan HPC Conference, from W. Churchill, 11/10/1942
- "You should not focus NSF CS research on parallelism. I can barely write a correct sequential program." Don Knuth, 1987 (to GBell)
- "I'll give $100 to anyone who can run a program on more than 100 processors." Alan Karp (198x?)
- "I'll give a $2,500 prize for parallelism every year." Gordon Bell (1987)
39. Bell Prize and Future Peak Gflops (t)
[Chart: peak Gflops over time; points include XMP, NCube, CM2, and the Petaflops study target]
40. 1989 Predictions vs. 1999 Observations
- Predicted 1 Tflops PAP in 1995. Actual 1996. Very impressive progress! (RAP < 1 TF)
- More diversity => less software progress!
- Predicted: SIMD, mC (incl. W/S), scalable SMP, DSM, supers would continue as significant
- Got: SIMD disappeared, 2 mC, 1-2 SMP/DSM, 4 supers, 2 mCv with one address space; 1 SMP became larger, and clusters, MTA, workstation clusters, GRID
- $3B (unprofitable?) industry, 10 platforms
- PCs and workstations diverted users
- MPP apps market DID/could NOT materialize
41. U.S. Tax Dollars At Work. How many processors does your center have?
- Intel/Sandia: 9000 Pentium Pro
- LLNL/IBM: 488x8x3 PowerPC (SP2)
- LANL/Cray: 6144 P in DSM clusters
- Maui Supercomputer Center: 512x1 SP2
42. ASCI Blue Mountain: 3.1 Tflops SGI Origin 2000
- 12,000 sq. ft. of floor space
- 1.6 MWatts of power
- 530 tons of cooling
- 384 cabinets to house 6144 CPUs with 1536 GB (32 GB / 128 CPUs) (see the check after this list)
- 48 cabinets for metarouters
- 96 cabinets for 76 TB of RAID disks
- 36 x HIPPI-800 switch cluster interconnect
- 9 cabinets for 36 HIPPI switches
- about 348 miles of fiber cable
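The memory figures above are internally consistent; a one-line check (Python), using only the numbers stated on this slide.

```python
# Consistency check of the ASCI Blue Mountain memory figures.
cpus = 6144
cpus_per_group = 128
gb_per_group = 32                        # GB per 128-CPU group, as stated
groups = cpus // cpus_per_group          # 48 groups
total_gb = groups * gb_per_group         # 48 * 32 = 1536 GB
print(groups, total_gb)                  # -> 48 1536
```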
43. Half of LASL
44. Comments from LLNL program manager
- Lessons Learned with Full-System Mode
- It is harder than you think
- It takes longer than you think
- It requires more people than you can believe
- Just as in the very beginning of computing,
leading edge users are building their own
computers.
45. NEC Supers
46. 40 Tflops Earth Simulator R&D Center, c2002
47. Fujitsu VPP5000 multicomputer (not available in the U.S.)
- Computing nodes: speed 9.6 Gflops vector, 1.2 Gflops scalar; primary memory 4-16 GB; memory bandwidth 76 GB/s (9.6 x 64 Gb/s); inter-processor comm 1.6 GB/s, non-blocking with global addressing among all nodes; I/O 3 GB/s to SCSI, HIPPI, gigabit ethernet, etc. (see the check after this list)
- 1-128 computers deliver 1.22 Tflops
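A quick sanity check of those node and system numbers (Python); the bytes-per-flop ratio is derived here, not stated on the slide.

```python
# Sanity check of the VPP5000 figures above.
node_peak = 9.6                          # Gflops vector per node
nodes = 128
system_peak = nodes * node_peak / 1000   # ~1.23 Tflops, close to the 1.22 quoted
mem_bw = 76                              # GB/s per node
bytes_per_flop = mem_bw / node_peak      # ~8 bytes of memory bandwidth per flop
print(f"{system_peak:.2f} Tflops, {bytes_per_flop:.1f} B/flop")
```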
48. C1999 clusters of computers. It's MPP when processors/cluster > 1000
- Who | SP.pap (Tflops) | SP.P (K) | P.pap (Gflops) | SP.pap/C (Gflops) | SP/C | SMp/C (GB) | SM.s (TB)
- LLNL (IBM) | 3.9 | 5.9 | .66 | 5.3 | 8 | 2.5 | 62
- LANL (SGI) | 3.1 | 6.1 | .5 | 64 | 128 | 32 | 76
- Sandia (Intel) | 2.7 | 9.1 | .3 | .6 | 2 | - | -
- Beowulf | 0.5 | 2.0 | 4 | | | |
- Fujitsu | 1.2 | .13 | 9.6 | 9.6 | 1 | 4-16 |
- NEC | 4.0 | .5 | 8 | 128 | 16 | 128 |
- ESRDC | 40 | 5.12 | 8 | 64 | 8 | 16 |
49. High-performance architecture/program timeline
- 1950 . 1960 . 1970 . 1980 . 1990 . 2000
- Technology: vacuum tubes, transistors, MSI (mini), micro, RISC, nMicro
- Sequential programming ----> (continues throughout)
- SIMD / vector --//--->
- Parallelization --->
- Parallel programming <---------------
- Multicomputers <-- MPP era ------
- Ultracomputers: 10X in size and price! 10x MPP
- In situ resources: 100x in parallelism, NOW, VLSC
- Grid
50. Yes, we are at a new beginning!
- Single jobs, composed of 1000s of quasi-independent programs running in parallel on 1000s of processors (or computers).
- Processors (or computers) of all types are distributed (i.e. connected) in every fashion, from a collection using a single shared memory to globally dispersed computers.
51. Future
52. 2010 component characteristics: 100x improvement @ 60% growth (see the check after this list)
- Chip density: 500 Mtransistors
- Bytes/chip: 8 GB
- On-chip clock: 2.5 GHz
- Inter-system clock: 0.5 GHz
- Disk: 1 TB
- Fiber speed (1 ch): 10 Gbps
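How the 100x and the 60%/year relate: a minimal check (Python), assuming roughly a decade between this talk and 2010.

```python
# 60%/year compounding over ~10 years gives roughly the 100x on this slide.
import math

growth = 1.60
years_for_100x = math.log(100) / math.log(growth)   # ~9.8 years
factor_10yr = growth ** 10                           # ~110x after 10 years
print(f"{years_for_100x:.1f} years for 100x; 10 years -> {factor_10yr:.0f}x")
```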
53. 1999 buyers, users, ISVs, ...?
- Technical: supers dying; DSM (and SMPs) trying
- Mainline user and ISV apps ported to PCs and workstations
- Supers (legacy code) market lives on ...
- Vector apps (e.g. ISVs) ported to parallelized SMPs
- ISVs adopt MPI for a few apps at their own peril
- Leading edge: one-of-a-kind apps on clusters of 16, 256, ... 1000s, built from uni, SMP, or DSM at great expense!
- Commercial: SMP mainframes and minis and clusters are interchangeable (control is the issue)
- Dbase and TP: SMPs compete with mainframes if central control is an issue, else clusters
- Data warehousing may emerge... just a Dbase
- High growth, web and stream servers: clusters have the advantage
54. Application Taxonomy
- Technical:
  - General purpose, non-parallelizable codes (PCs have it!)
  - Vectorizable and //able (supers and all SMPs)
  - Hand-tuned, one-of-a-kind, MPP coarse grain; MPP embarrassingly // (clusters of anything)
  - If real rich then SMP clusters else PC clusters (U.S. only)
- Commercial:
  - Database, Database/TP, web host, stream audio/video
  - If real rich then IBM mainframes or large SMPs else PC clusters
55. C2000 Architecture Taxonomy
- SMP: Xpt SMPs (mainframes), Xpt-SMP+vector, Xpt-multithread (Tera), multi as a component, Xpt-multi hybrid, DSM (commodity-SCI)?, DSM (scalar), DSM (vector); mainline: commodity multis
- Multicomputers aka clusters (MPP when n > 1000 processors): clusters of DSMs (scalar and vector); mainline: clusters of multis
56. Questions that will get answered
- How long will Moore's Law continue?
- MPP (clusters of >1K proc.) vs. SMP (incl. DSM)? How much time and money for programming? How much time and money for execution?
- When, or will, DSM be pervasive?
- Is the issue of processor architecture (scalar, MTA, VLIW/MII, vector) important?
- Commodity vs. proprietary chips?
- Commodity, proprietary, or net interconnections?
- Unix vs. VendorIX vs. NT?
- Commodity vs. proprietary systems?
- Can we find a single, all-pervasive programming model for scalable parallelism to support apps?
- When will computer science teach parallelism?
57. Switching from a long-term belief in SMPs (e.g. DSM, NUMA) to clusters
- 1963-1993: SMP => DSM inevitability, after 30 years of belief in building mPs
- 1993: clusters are inevitable
- 2000: commodity clusters; improved log(p) SMPs => DSM
58. SNAP Systems circa 2000
[Diagram: local and global data comm world linking legacy mainframe/minicomputer servers and terminals, mobile nets and portables, a wide-area global ATM network, person servers (PCs), ATM and Ethernet to PCs, workstations, and servers, scalable computers built from PCs and SANs, telecomputers aka Internet terminals, centralized departmental servers built from PCs, and TC/TV/PC at home (CATV or ATM or satellite)]
- A space, time (bandwidth), generation, and reliability scalable environment
59. Scaling dimensions include
- reliability, including "always up"
- number of nodes
- most cost-effective system built from the best nodes: PCs with NO backplane
- highest throughput: distribute disks to each node versus into a single node
- location within a region or continent
- time-scale, i.e. machine generations
60. Why did I switch to clusters aka multicomputers aka MPP?
- Economics: commodity components give a 10-100x advantage in price/performance
- Backplane-connected processors (incl. DSMs) vs. board-connected processors
- Difficulty of making large SMPs (and DSM)
- Single system image clearly needs more work
- SMPs (and DSMs) fail ALL scalabilities!
  - size and lumpiness
  - reliability
  - cross-generation
  - spatial
- We need a single programming model
- Clusters are the only structure that scales!
61. Technical users have alternatives
- PCs work fine for smaller problems
- Do-it-yourself clusters, e.g. Beowulf, work!
- MPI (lcd?) programming model doesn't exploit shared memory
- ISVs have to use the lcd to survive
- SMPs are expensive
- Clusters required for scalabilities or apps requiring extraordinary performance ... so DSM only adds to the already complex parallelization problem
- Non-U.S. users continue to use vectors
62. Commercial users don't need them
- Highest growth is and will be web servers delivering pages, audio, and video
- Apps are inherently, embarrassingly parallel
- Databases and TP are parallelized and transparent
- A single SMP handles traditional apps
- Clusters required for reliability, scalabilities
63. 2010 architecture
- Not much different; I see no surprises, except at the chip level. Good surprises would drive performance more rapidly
- SmP (m < 16) will be the component for clusters. Most cost-effective systems are made from the best nodes.
- Clusters will be pervasive.
- Interconnection networks (log(p)) continue to be the challenge (see the sketch below)
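To illustrate why log(p) interconnects matter at scale, a small sketch (Python) of how hop/stage count grows in a hypercube- or tree-style network; the node counts are just examples, not from the talk.

```python
# In log(p) networks (hypercubes, fat trees), hops/stages grow only
# logarithmically with node count, which is what lets clusters scale.
import math

for p in (16, 1024, 100_000):          # example cluster sizes
    stages = math.ceil(math.log2(p))   # ~log2(p) switch stages or hops
    print(f"p = {p:>7}: ~{stages} stages")
```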
64. Computer (P-Mp) system alternatives
- Node size: most cost-effective SMPs
  - Now 1-2 on a single board
  - Evolves based on n processors per chip
  - Continued use of single-bus SMP (multi)
- Large SMPs provide a single system image for small systems, but are not cost- or space-efficient for use as a cluster component
- SMPs evolving to weak-coherency DSMs
65. Cluster system alternatives
- System in a room: SAN-connected, e.g. NOW, Beowulf
- System in the building: LAN-connected
- System across the continent or globe: inter-/intra-net connected networks
66. NCSA cluster of 8 x 128-processor SGI Origin
67. Architects' architectures: clusters (aka MPP if p > 1000); clusters => NUMA/DSM iff commodity interconnects supply them
- U.S. vendors: 9 x scalar processors
  - HP, IBM, and Sun minicomputers aka servers to attack mainframes are the basic building blocks
  - SMPs with 100 processors per system
  - surprise: 4-16 processors/chip, MTA?
  - Intel-based desktops and small servers
  - commodity supercomputers a la Beowulf
- Japanese vendors: vector processors
  - NEC continues driving the NUMA approach
  - Fujitsu will evolve to NUMA/DSM
68. 1994 Petaflops Workshop, c2007-2014. Clusters of clusters. Something for everybody
- Design | SMP | Clusters | Active Memory
- Processors | 400 P | 4-40K P | 400K P
- Processor speed | 1 Tflops | 10-100 Gflops | 1 Gflops
- Memory | 400 TB SRAM | 400 TB DRAM | 0.8 TB embedded
- Chips | 250K | 60K-100K | 4K
- Timing | 1 ps/result | 10-100 ps/result | -
- 100 x 10 Gflops threads
- 100,000 x 1 TByte disks => 100 petabytes; 10 failures/day
69. HT-MT: What's 0.55?
70. HT-MT
- Mechanical: cooling and signals
- Chips: design tools, fabrication
- Chips: memory, PIM
- Architecture: MTA on steroids
- Storage: material
71. HTMT heuristics for computer builders
- Mead 11-year rule: time between lab appearance and commercial use
- Requires >2 breakthroughs
- Team's first computer or super
- It's government funded, albeit at a university
72. Global interconnection
- "Our vision ... is a system of millions of hosts in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects."
- Grimshaw, Wulf, et al., "Legion", CACM, Jan. 1997
73. Utilize in situ workstations!
- NOW (Berkeley) set sort record, decrypting
- Grid, Globus, Condor, and other projects
- Need a standard interface and programming model for clusters using commodity platforms and fast switches
- Giga- and tera-bit links and switches allow geo-distributed systems
- Each PC in a computational environment should have an additional 1GB/9GB!
74. Or more parallelism, and use installed machines
- 10,000 nodes in 1999, or a 10x increase
- Assume 100K nodes
- 10 Gflops / 10 GByte / 100 GB nodes, or low-end c2010 PCs (these multiply out to the petaflops target; see the check after this list)
- Communication is the first problem: use the network
- Programming is still the major barrier
- Will any problems fit it?
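The node count and per-node speed above reach the petaflops scale discussed throughout this talk; a one-line check (Python).

```python
# 100K nodes x 10 Gflops/node reaches the 1 Pflops target.
nodes = 100_000
gflops_per_node = 10
total_pflops = nodes * gflops_per_node / 1e6   # Gflops -> Pflops
print(total_pflops)                            # -> 1.0
```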
75. The Grid: Blueprint for a New Computing Infrastructure. Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann, 1999
- Published July 1998
- ISBN 1-55860-475-8
- 22 chapters by expert authors including
- Andrew Chien,
- Jack Dongarra,
- Tom DeFanti,
- Andrew Grimshaw,
- Roch Guerin,
- Ken Kennedy,
- Paul Messina,
- Cliff Neuman,
- Jon Postel,
- Larry Smarr,
- Rick Stevens,
- Charlie Catlett
- John Toole
- and many others
"A source book for the history of the future" -- Vint Cerf
http://www.mkp.com/grids
76. The Grid
- Dependable, consistent, pervasive access to high-end resources
- Dependable: can provide performance and functionality guarantees
- Consistent: uniform interfaces to a wide variety of resources
- Pervasive: ability to plug in from anywhere
77. Alliance Grid Technology Roadmap: it's just not flops or records/sec
78. Summary
- A 1000x increase in PAP has not always been accompanied by RAP, insight, infrastructure, and use. Much remains to be done.
- The PC World Challenge is to provide commodity, clustered parallelism to commercial and technical communities
- Only becomes true if software vendors, e.g. Microsoft, deliver shrink-wrap software
- ISVs must believe that clusters are the future
- Computer science has to get with the program
- Grid etc., using world-wide resources including in situ PCs, is the new idea
79. 2004 Computer Food Chain ???
- Mainframe
- Vector super
- Massively parallel processors
- Networks of workstations/PCs
- (Dave Patterson, UC Berkeley)
80. The end
81. When is a petaflops possible? What price? (Gordon Bell, ACM 1997)
- Moore's Law: 100x. But how fast can the clock tick? Are there any surprises? (See the check after this list.)
- Increase parallelism 10K -> 100K: 10x
- Spend more ($100M -> $500M): 5x
- Centralize center or fast network: 3x
- Commoditization (competition): 3x
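One reading of how the first two factors combine (a minimal check in Python): the 1 Tflops of 1996 noted on slide 40, times Moore's Law and the added parallelism, reaches a petaflops; treating the remaining factors as acting on price and schedule rather than raw speed is my assumption, not stated on the slide.

```python
# Moore's Law (100x) times more parallelism (10x) from a 1 Tflops base.
tflops_1996 = 1
speed_factor = 100 * 10
print(tflops_1996 * speed_factor / 1000, "Pflops")   # -> 1.0 Pflops
```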
82. Processor alternatives
- commodity aka Intel micros
  - Does VLIW work better as a micro than it did as Cydrome and Multiflow minis?
- vector processor
- multiple processors per chip or
- multi-threading
- MLIW? a.k.a. signal processing
- FPGA chip-based special processors
83. Russian Elbrus E2K micro
- Metric | E2K | Merced
- Clock (GHz) | 1.2 | 0.8
- SPEC int/fp | 135/350 | 45/70
- Size (mm2) | 126 | 300
- Power (W) | 35 | 60
- Pin B/W (GB) | 15 |
- Cache (KB) | 64/256 |
- PAP (Gflops) | 10.2 |
- System ship | Q4 2001 |
84. What Is The Processor Architecture? Vectors, or ...?
- Comp. Sci. view:
  - MISC >> CISC
  - language directed
  - RISC
  - super-scalar
  - MTA
  - extra-long instruction word
- Supercomputer view:
  - RISC
  - VCISC (vectors), multiple pipes
85. Observation: CMOS supers replaced ECL in Japan
- 10 Gflops vector units have dual use
  - in traditional mPv supers
  - as the basis for computers in mC
- Software apps are present
- Vector processors out-perform n (n ~ 10) micros for many apps.
- It's memory bandwidth, cache prediction, inter-communication, and overall variation
86. Weather model performance
87. Observation: MPPs 1, Users <1
- MPPs with relatively low-speed micros and lower memory bandwidth ran over supers, but didn't kill 'em.
- Did the U.S. industry enter an abyss?
- Is crying "unfair trade" hypocritical?
- Are U.S. users being denied tools?
- Are users not getting with the program?
- Challenge: we must learn to program clusters...
  - Cache idiosyncrasies
  - Limited memory bandwidth
  - Long inter-communication delays
  - Very large numbers of computers
  - NO two computers are alike => NO apps
88. The Law of Massive Parallelism (mine) is based on application scaling
- There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, work -- but this problem may be unrelated to any other.
- A ... any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time, but it may be completely impractical (see the sketch after this list)
- Challenge to theoreticians and tool builders: how well will, or will, an algorithm run?
- Challenge for software and programmers: can a package be scalable and portable? Are there models?
- Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just total flops?
- Challenge to funders: is the cost justified?
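One standard way to formalize the scaling claim above is Gustafson's scaled speedup; a minimal sketch (Python), with the serial fraction chosen purely for illustration (it is not a figure from the talk).

```python
# Gustafson's scaled speedup: if the problem grows with the machine, speedup on
# p processors is S(p) = p - alpha*(p - 1), where alpha is the serial fraction
# of the scaled run. This is why "large enough" problems can use very large
# networks of computers efficiently.
def scaled_speedup(p, alpha):
    return p - alpha * (p - 1)

alpha = 0.01                       # illustrative 1% serial fraction
for p in (10, 1_000, 100_000):
    s = scaled_speedup(p, alpha)
    print(f"p = {p:>7}: speedup ~ {s:,.0f} ({s / p:.0%} efficiency)")
```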
89. GB's Estimate of Parallelism in Engineering and Scientific Applications
[Chart: log(# apps) vs. granularity / degree of coupling (computation/communication); regions span PCs and WSs, supers, and clusters aka MPPs aka multicomputers (scalable multiprocessors), covering dusty decks for supers and new or scaled-up apps]
- scalar: 60%
- vector: 15%
- vector //: 5%
- one-of >> //: 5%
- embarrassingly (perfectly) parallel: 15%