Title: PowerPoint Presentation
1 Large-Scale Scientific Computing 1946-2006
John G. Zabolitzky
2 Segments of Computation
- 1. Scientific vs. Commercial vs. Consumer vs. Embedded
- Solution of technical/scientific problems like weather, fluid dynamics, nuclear reactor simulation (usually involving many complicated operations on real floating-point numbers), as opposed to commercial problems like accounting, inventory, banking (usually involving characters and few, simple operations on fixed-point numbers). Not considering consumer applications like music, movies, games, web servers, dishwashers, coffeemakers, automotive.
- 2. Large-Scale vs. Small/Medium-Scale
- Looking at the largest problems which can be treated in the current year. Not looking at small-scale problems, e.g. laboratory automation, a student paper, or a small research problem. (M problems, not k problems)
- 3. Mainstream vs. Experimental, Unique, small-market-share machines
- Machines which have had a major influence on science/technology in general on a broad scale.
- 4. What is a Computer?
- Stored-program (not fixed, not external) electronic (not electromechanical) computer
3 First 30 Years Time line 1946-1975: Scalar ("von Neumann") Computing
- 1946 Zuse (electromechanical), ENIAC (wired program), Whirlwind ... early attempts
- 1950 ERA 1101 (Atlas 1)
- 1953 ERA 1103 (Atlas 2), IBM 701 "defense calculator"
- 1957 IBM 709
- 1959 CDC 1604
- 1960 IBM 7090 (709T)
- 1962 IBM 7094
- 1963 CDC 3600
- 1964 CDC 6600
- 1965 IBM /360 family
- 1969 CDC 7600
- 1971 IBM /360-195
4 ERA 1101 (1950)
- Vacuum tubes
- 2 registers (A(48), Q(24))
- 24-bit binary parallel
- Drum memory, 16k words
- 4,400 add/mul/sec
1: arithmetic section; 2: power supply; 3: control section; 4: maintenance section; 5: memory, electronic section; 6: memory, drum section; 7: heat transfer unit; 8, 9: control, paper tape reader/punch
5 ERA 1103 (1953)
- Vacuum tubes
- 2 registers (A(72), Q(36))
- 36-bit binary parallel
- Williams tube memory (CRT tube memory), 1k words
- Drum memory, 16k words
- 4,400 add/mul/sec
6 IBM 701 ("defense calculator") (1953)
- Vacuum tubes
- 2 registers (A(38), Q(36))
- 36-bit binary parallel
- Williams tube memory (CRT tube memory), 2k words
- Drum memory, 8k words
- 4,000 add/mul/sec
7 IBM 709 (1957)
- Vacuum tubes
- 5 registers (A(38), Q(36), 3 index)
- 36-bit binary parallel
- Magnetic core memory, 4/8/32k words
- Drum memory, 8/16k words
- 5,500 add/mul/sec
8 CDC 1604 (1959)
- Discrete transistors
- 8 registers (A(96), Q(48), 6 index)
- 48-bit binary parallel
- Magnetic core memory, 32k words
- 40k add/mul/sec
9 IBM 7090 (1960)
- Discrete transistors
- 5 registers (A(38), Q(36), 3 index)
- 36-bit binary parallel
- Magnetic core memory, 32k words
- 40k add/mul/sec
10 IBM 7094 (1962)
- Discrete transistors
- 9 registers (A(38), Q(36), 7 index)
- 36-bit binary parallel
- Magnetic core memory, 32k words
- 80k add/mul/sec
11 CDC 6600 (1964)
- Discrete transistors
- 32 registers (8 X, 8 A, 8 B, 8 instruction stack)
- 60-bit binary parallel
- Magnetic core memory, 128k words
- 1 MFLOPS
- First fluid-cooled
12 CDC 6600
10 core modules (each 6 kByte), 130 modules total, 2 logic frames
13 Discrete wire mat; vector graphic console
14 "Last week Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers ... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer." -- Thomas Watson, CEO of IBM, 1964
"It seems like Mr. Watson has answered his own question." -- Seymour Cray, Control Data Corporation
16 CDC 7600 (1969)
- The 7600 has a hardware structure similar to that of the 6600 (discrete transistors), with some improvements:
- - 12-word instruction stack (was 8 words), total of 36 "registers"
- - 275 nsec small core memory cycle time (64 kW; was 1000 nsec, 128 kW), large core 512 kW
- - 36 MHz clock (was 10 MHz)
- - more consistently pipelined functional units
- - faster peripheral processors
17 IBM /360-195 (1971)
- Integrated circuits
- 20 registers (16 GP, 4 FP)
- 32/64-bit binary parallel
- Magnetic core memory, 1 Mword max, 756 nsec
- Silicon cache, 32 kByte (4 kword), 54 nsec
- Model 195: hidden registers in CPU to overcome /360 limitations
18 Compiled by Erich Strohmaier
19 Second 30 Years Time line 1976-2006: Vector and Parallel Computing
- 1976 Cray-1 first successful vector computer (50 MFLOPS)
- 1982 Cray X-MP first multiple-processor shared-memory vector computer
- 1985 Cray-2 large memory (256 MW = 2 GByte)
- 1988 Cray Y-MP first to break 1 GFLOPS barrier
- 1993 Cray T3D first successful massively parallel machine, 3D-Torus
- 16 x 1 GFLOPS < 512 x 0.150 GFLOPS ≈ 76 GFLOPS
- 1995 Cray T3E most widely sold MPP machine, breaks 1 TFLOPS barrier
- 1700 x 1.2 GFLOPS ≈ 2 TFLOPS
- 2004 IBM Blue Gene/L world performance leader (development started 1999)
- IBM today has dominant market share (> 50%)
- leadership recovered after 40 years of CDC/Cray dominance
- same interconnect structure as Cray T3D/T3E (3D-Torus)
- 2006 lowest-power processors (64k x 5 GFLOPS = 320 TFLOPS)
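The peak figures quoted for the parallel machines in this timeline are simply processor count times per-processor peak rate. A minimal sketch of that arithmetic follows; the function name and the dictionary are illustrative, and reading "64k" as 64,000 processors (rather than 65,536) is an assumption made so that the result matches the 320 TFLOPS on the slide.

```python
def aggregate_peak_gflops(processors: int, gflops_per_processor: float) -> float:
    """Theoretical peak: every processor running at its own peak rate."""
    return processors * gflops_per_processor


# (processor count, per-processor peak in GFLOPS), figures taken from the timeline.
# Assumption: "64k" is read as 64,000 processors, not 65,536.
SYSTEMS = {
    "Cray T3D (1993)": (512, 0.150),
    "Cray T3E (1995)": (1700, 1.2),
    "IBM Blue Gene/L (2006)": (64_000, 5.0),
}

for name, (processors, per_processor) in SYSTEMS.items():
    total = aggregate_peak_gflops(processors, per_processor)
    print(f"{name}: {processors} x {per_processor} GFLOPS = {total:,.1f} GFLOPS")
```

These are theoretical peaks only; the sustained rate on a real application depends on how well the problem parallelizes.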
20 Seymour Cray, Cray-1 (1976)
Single processor, 80/160 MFLOPS peak, 1 Mword = 8 MByte
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis
21 MUCH larger working set
- 8 vector registers, 64 words each
- 8 scalar registers
- 8 address registers
- large instruction buffer
Performance Features
- vector processing: one operation affects 64 vector elements, streamed through a functional unit
- small vector startup time
- chaining between vector ops
- large, fast semiconductor memory
- requires vectorization effort
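The "vectorization effort" mentioned above amounts to restructuring loops so that they operate on strips of at most 64 elements, one vector register's worth at a time. The sketch below imitates that strip-mining in plain Python for the operation a*x + y. It is only an illustration of the idea, not Cray code; the names `VECTOR_LENGTH`, `strip_count`, and `vector_multiply_add` are hypothetical.

```python
VECTOR_LENGTH = 64  # elements per vector register, as stated on the slide


def strip_count(n: int) -> int:
    """Number of register-sized strips needed to cover n elements."""
    return (n + VECTOR_LENGTH - 1) // VECTOR_LENGTH


def vector_multiply_add(a: float, x: list[float], y: list[float]) -> list[float]:
    """Compute a * x[i] + y[i], processing at most VECTOR_LENGTH elements per strip."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    result: list[float] = []
    for start in range(0, len(x), VECTOR_LENGTH):
        # "Load" one strip of each operand into a vector register.
        vx = x[start:start + VECTOR_LENGTH]
        vy = y[start:start + VECTOR_LENGTH]
        # One vector multiply followed by one vector add. With chaining, the
        # hardware can feed multiply results into the adder as they appear,
        # instead of waiting for the whole strip to finish.
        products = [a * xi for xi in vx]
        result.extend(p + yi for p, yi in zip(products, vy))
    return result


x = [1.0] * 130
y = [3.0] * 130
z = vector_multiply_add(2.0, x, y)
print(len(z), strip_count(len(x)))
```

For 130 elements the loop runs three strips (64 + 64 + 2), so the short last strip pays the same startup cost as a full one, which is why a small vector startup time matters.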
22 Cray X-MP (1982)
4 processors, 800 MFLOPS, 16 Mword = 128 MByte
23 Cray-2 (1985)
4 processors, 1200 MFLOPS, 256 Mword = 2 GByte
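The memory sizes on the Cray-1, X-MP, and Cray-2 slides are given both in words and in bytes, and the pairs all agree with 8 bytes per word. A small sketch of the conversion, assuming 64-bit words (the slides do not state the word length explicitly) and binary prefixes; the function name is illustrative.

```python
BYTES_PER_WORD = 8  # assumption: 64-bit words, implied by the slides' word/byte pairs
MEGA = 2**20
GIGA = 2**30


def words_to_bytes(words: int) -> int:
    """Convert a memory size in words to bytes."""
    return words * BYTES_PER_WORD


# Memory sizes in Mword as quoted on the slides.
for name, mwords in [("Cray-1", 1), ("Cray X-MP", 16), ("Cray-2", 256)]:
    mbytes = words_to_bytes(mwords * MEGA) // MEGA
    print(f"{name}: {mwords} Mword = {mbytes} MByte")
```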
24 Minnesota Supercomputer Center, Minneapolis, 1986
CDC Cyber 205, Cray-2 (4), Cray-2 (1)
25 Cray Y-MP (1988)
8/16 processors, 1-16 GFLOPS, 16 Mword - 1 Gword = 128 MByte - 8 GByte
26 Cray T3D (1993)
- First widely successful massively parallel system
- 512 x 0.15 GFLOPS ≈ 76 GFLOPS
- 4 Gword = 32 GByte distributed memory
- 3D Torus interconnect
- MPP requires massive software effort
27 Cray T3E (1995)
- Most successful massively parallel system in the 1990s
- 2048 x 1200 MFLOPS ≈ 2.4 TFLOPS max. (8 cabinets)
- 64 Gword, 256 GByte distributed memory (large end of config.)
- 3D Torus interconnect
- 3 cabinets: 768 processors
28 Cache not always useful
Latency, congestion not discussed here
29 From Thomas Lippert, FZJ
30 From Thomas Lippert, FZJ
31 From Thomas Lippert, FZJ
1 MW 1 M/year !!
32 After 40 years (1964-2004) of CDC/Cray (vector) dominance, IBM has regained the market leadership.
Low-power technology is the key to success
- high density -> fast communication
- low utility cost, low building cost
Scalar -> Vector -> Parallel: increasing burden on programmer to obtain performance/efficiency