Title: Jim Gray
1. Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
Jim Gray, 310 Filbert, SF CA 94133, Gray@Microsoft.com
2. MetaMessage: Technology Ratios Are Important
- If everything gets faster & cheaper at the same rate, THEN nothing really changes.
- Things getting MUCH BETTER (10^4x in 25 years):
- communication speed & cost
- processor speed & cost (PAP)
- storage size & cost
- Things getting a little better (10x in 25 years):
- storage latency & bandwidth
- real application performance (RAP)
- Things staying about the same:
- speed of light (more or less constant)
- people (10x more expensive)
3. Consequent Message
- Processing and storage are WONDERFULLY cheaper
- Storage latencies not much improved
- Must get performance (RAP) via:
- Pipeline parallelism (mask latency) and
- Partition parallelism (bandwidth and mask latency)
- Scaleable hardware/software architecture:
- Scaleable commodity network / interconnect
- Commodity hardware (processors, disks, memory)
- Commodity software (OS, PL, Apps)
- Scaleability thru automatic parallel programming:
- Manage & program as a single system
- Mask faults
4. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together
5. Moore's Law: Exponential Change Means Continual Rejuvenation
- XXX doubles every 18 months: a 60% increase per year
- Microprocessor speeds
- CMOS chip density (memory chips)
- Magnetic disk density
- Communications bandwidth
- WAN bandwidth approaching LANs
- Exponential growth:
- The past does not matter
- 10x here, 10x there, soon you're talking REAL change
6. Moore's Law for Memory
Will Moore's Law continue to hold?
7. Moore's Law for Memory
[Chart: capacity with 64Mb DRAMs; memory capacity and the number of chips required, at a memory price of $50/chip, vs. year (1970-2000) across DRAM generations from 1Kbit to 256Mb; capacities run from 128KB up to 8GB, with the 640KB DOS limit marked.]
8. Trends: Storage Got Cheaper
- $/byte got 10^4 better
- $/access got 10^3 better
- capacity grew 10^3
- latency down 10x
- bandwidth up 10x
9. Partition Parallelism Gives Bandwidth
- Parallelism: use many little devices in parallel
- Solves the bandwidth problem
- Beware of the media myth
- Beware of the access time myth
At 10 MB/s it takes 1.2 days to scan a terabyte; 1,000x parallelism scans it in about 2 minutes.
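A quick back-of-the-envelope check of the scan figures above (a sketch in Python; the 1 TB size is the terabyte scale the rest of the talk uses):

```python
# Scan-time arithmetic: one 10 MB/s stream vs. 1,000 partitions read in parallel.
TB = 10**12                 # bytes
RATE = 10 * 10**6           # 10 MB/s from a single device

single_stream = TB / RATE                   # seconds
parallel_1000 = single_stream / 1000        # 1,000-way partitioned scan

print(f"single stream : {single_stream / 86400:.1f} days")   # ~1.2 days
print(f"1,000 streams : {parallel_1000 / 60:.1f} minutes")   # ~1.7 minutes
```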
10. Partitioned Data Has Natural Parallelism
Split a SQL table across many disks, memories, and processors.
Partition and/or replicate data to get parallel disk access.
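A minimal sketch of that idea (the table, key column, and partition count are hypothetical): hash-partition the rows so each disk or processor scans only its own fragment.

```python
# Hash-partition a table's rows across N partitions (e.g. one per disk).
from zlib import crc32

N = 4
partitions = [[] for _ in range(N)]

def insert(row):
    # the partition is chosen by hashing the key column
    partitions[crc32(row["name"].encode()) % N].append(row)

for row in [{"name": "David", "addr": "NY"},
            {"name": "Mike",  "addr": "Berk"},
            {"name": "Won",   "addr": "Austin"}]:
    insert(row)

# a "parallel scan" is now N independent scans, one per partition
for i, part in enumerate(partitions):
    print(f"partition {i}: {part}")
```

Range partitioning works the same way, with the key compared against split points instead of hashed.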
11. Today's Storage Hierarchy: Speed & Capacity vs. Cost Tradeoffs
[Two charts over access times from 10^-9 to 10^3 seconds: price ($/MB) vs. speed and size (bytes) vs. speed, for cache, main memory, secondary (disc), online tape, nearline tape, and offline tape.]
12. Trends: Application Storage Demand Grew
- The New World:
- billions of objects
- big objects (1MB)
- The Old World:
- millions of objects
- 100-byte objects
Paperless office, Library of Congress online, all information online: entertainment, publishing, business. Information Network, Knowledge Navigator, Information at Your Fingertips.
[Figure: an old-world People table (Name, Address: David/NY, Mike/Berk, Won/Austin) beside a new-world People table that adds Picture, Voice, and Papers columns.]
13. Good News: Electronic Storage Ratios Beat Paper
- File cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft2) $180; total $700, about 3 cents/sheet
- Disk: an 8 GB disk is $4,000; ASCII: 4 million pages at 0.1 cents/sheet (30x cheaper); image: 200 k pages at 2 cents/sheet (similar to paper)
- Store everything on disk
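The per-sheet figures follow from simple division over the prices above (a re-derivation sketch):

```python
# Cost per sheet/page: a paper file cabinet vs. an 8 GB disk.
paper_total = 250 + 250 + 180                   # cabinet + paper + floor space, $
print(f"paper : {100 * paper_total / 24_000:.1f} cents/sheet")   # ~3 cents

disk = 4_000                                    # 8 GB disk, $
print(f"ASCII : {100 * disk / 4_000_000:.2f} cents/page")        # ~0.1 cents
print(f"image : {100 * disk / 200_000:.1f} cents/page")          # ~2 cents
```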
14. What's a Terabyte?
1 Terabyte =
1,000,000,000 business letters (150 miles of bookshelf)
100,000,000 book pages (15 miles of bookshelf)
50,000,000 FAX images (7 miles of bookshelf)
10,000,000 TV pictures, mpeg (10 days of video)
4,000 LandSat images
The Library of Congress (in ASCII) is 25 TB.
1980: $200 M of disc (10,000 discs); $5 M of tape silo (10,000 tapes)
1994: $1 M of magnetic disc (120 discs); $500 K of optical disc robot (250 platters); $50 K of tape silo (50 tapes)
Terror Byte!!
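The object sizes implied by those counts are just 10^12 bytes divided by the number of objects (a sketch, not part of the slide):

```python
# Implied bytes per object behind the "what's a terabyte" equivalences.
TB = 10**12
for what, count in [("business letter", 1_000_000_000),
                    ("book page",         100_000_000),
                    ("FAX image",          50_000_000),
                    ("MPEG TV picture",    10_000_000),
                    ("LandSat image",           4_000)]:
    print(f"{what:16s} ~ {TB // count:>12,} bytes each")
```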
15. Standard Storage Metrics
- Capacity:
- RAM: MB and $/MB; today at 10MB and $100/MB
- Disk: GB and $/GB; today at 5GB and $500/GB
- Tape: TB and $/TB; today at 0.1TB and $50k/TB (nearline)
- Access time (latency):
- RAM: 100 ns
- Disk: 10 ms
- Tape: 30 second pick, 30 second position
- Transfer rate:
- RAM: 1 GB/s
- Disk: 5 MB/s ... arrays can go to 1 GB/s
- Tape: 5 MB/s ... arrays can go to 100 MB/s
16. New Storage Metrics: KOXs, MOXs, GOXs, SCANs?
- KOX: how many kilobyte objects served per second
- the file server / transaction processing metric
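MOX, GOX, and SCANs (used on the tape-farm slides that follow) are the analogous rates for megabyte objects, gigabyte objects, and full scans of the device. A hedged sketch of how such metrics fall out of a device's access time, transfer rate, and capacity (the device numbers below are illustrative assumptions, not figures from the talk):

```python
# KOX / MOX / GOX: 1 KB / 1 MB / 1 GB objects served per second; SCANS per day.
def storage_metrics(access_s, mb_per_s, capacity_gb):
    serve = lambda size_mb: 1.0 / (access_s + size_mb / mb_per_s)
    kox, mox, gox = serve(0.001), serve(1.0), serve(1000.0)
    scans_per_day = 86_400 * mb_per_s / (capacity_gb * 1000.0)
    return kox, mox, gox, scans_per_day

print(storage_metrics(0.010, 5,  5))    # a 5 GB disk: 10 ms access, 5 MB/s
print(storage_metrics(60.0,  5, 20))    # a 20 GB tape: ~1 minute pick + position
```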
17. Trends: Storage Bandwidth Improved Little
[Charts, 1960-2000: transfer rates for RAM, disk, and tape (B/s) and for LANs & WANs (b/s) improved little, while processor speeds (instructions/s) sped up dramatically.]
18. Tape & Optical: Beware of the Media Myth
Optical is cheap: $200/platter, 2 GB/platter => $100/GB (5x cheaper than disc).
Tape is cheap: $30/tape, 20 GB/tape => $1.5/GB (700x cheaper than disc).
19. Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot ($10 k ... $3 m) with 10 ... 1000 tapes (at 20GB each) => $20/GB ... $200/GB (5x ... 50x cheaper than disc).
Optical needs a robot ($100 k) with 100 platters = 200GB (TODAY) => $550/GB (same price as disc).
Robots have poor access times. Not good for the Library of Congress (25TB). Data motel: data checks in but it never checks out!
20. The Access Time Myth
- The myth: seek or pick time dominates
- Reality: (1) queueing dominates
- (2) transfer dominates for BLOBs
- (3) disk seeks are often short
- Implication: many cheap servers are better than one fast, expensive server
- shorter queues
- parallel transfer
- lower cost/access and cost/byte
- This is now obvious for disk arrays
- It will become obvious for tape arrays
[Pie charts: a request's time split among wait (queueing), seek, rotate, and transfer.]
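A small M/M/1-style sketch of the queueing point: for a fixed per-access service time, response time is service/(1 - utilization), so spreading the same request stream over many cheap disks collapses the queues (the numbers are assumptions for illustration):

```python
# Mean M/M/1 response time = service / (1 - utilization).
service_ms = 12.0          # one disk access: seek + rotate + transfer (assumed)
offered_load = 0.9         # utilization if a single disk served all requests

for n_disks in (1, 4, 16):
    util = offered_load / n_disks
    resp = service_ms / (1 - util)
    print(f"{n_disks:2d} disk(s): utilization {util:.2f} -> response {resp:6.1f} ms")
```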
21. The Disk Farm on a Card
- The 100GB disc card (a 14-inch card): an array of discs
- Can be used as:
- 100 discs
- 1 striped disc
- 10 fault-tolerant discs
- ... etc
- LOTS of accesses/second and bandwidth
Life is cheap, it's the accessories that cost ya. Processors are cheap, it's the peripherals that cost ya.
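A minimal striping sketch (hypothetical stripe size): a logical byte offset maps round-robin to a (disc, offset) pair, so one big sequential read fans out across all the discs on the card.

```python
# Round-robin striping of a logical address space across a farm of discs.
N_DISCS = 100
STRIPE = 64 * 1024                        # bytes per stripe unit (assumed)

def locate(byte_offset):
    stripe_no = byte_offset // STRIPE
    disc = stripe_no % N_DISCS
    disc_offset = (stripe_no // N_DISCS) * STRIPE + byte_offset % STRIPE
    return disc, disc_offset

print(locate(0))                          # (0, 0)
print(locate(10 * STRIPE + 100))          # (10, 100)
print(locate(100 * STRIPE))               # wraps back to disc 0, next stripe row
```

Spending some of the discs on parity or mirror copies gives the "10 fault-tolerant discs" configuration above.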
22. Tertiary Storage: Tape Farms, Not Mainframe Silos
[Figure: a $10K robot with 10 tapes holds 200 GB at 6 MB/s and $50/GB; it scans in 10 hours and delivers about 30 MOX and 15 GOX. Many independent tape robots (like a disc farm): 100 robots cost $1M, hold 20TB at $50/GB, and deliver 3K MOX, 1.5K GOX, and 2.5 scans.]
23. The Metrics: Disk and Tape Farms Win
Data Motel: data checks in, but it never checks out.
[Log-scale bar chart comparing GB/K$, KOX, MOX, GOX, and SCANS/day (values from 0.01 to 1,000,000) for a 1000x disc farm, a 100x DLT tape farm, and an STC tape robot with 6,000 tapes and 8 readers.]
24. Accesses per $ (3-year life)
[Bar chart: KOX, MOX, and GOX per $ and SCANS per k$ over a 3-year life, for a 1000x disc farm, an STC tape robot (6,000 tapes, 16 readers), and a 100x DLT tape farm; values span 0.1 up to 540,000.]
25. Summary (of storage)
- Capacity and cost are improving fast (100x per decade)
- Accesses are getting larger (MOX, GOX, SCANS)
- BUT latencies and bandwidth are not improving much (3x per decade)
- How to deal with this?
- Bandwidth: use partitioned parallel access (disk & tape farms)
- Latency: pipeline data up the storage hierarchy (next section)
26. Interesting Storage Ratios
- Disk is back to 100x cheaper than RAM
- Nearline tape is only 10x cheaper than disk, and the gap is closing!
[Chart, 1960-2000: the ratios RAM $/MB : Disk $/MB and Disk $/MB : Nearline Tape $/MB, on a scale from 100:1 down to 1:1; annotations: "Disk & DRAM look good", "??? Why bother with tape".]
27. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together
28. Microprocessor Speeds Went Up Fast
- Clock rates went from 10KHz to 300MHz
- Processors are now 4-issue
- SPECInt92 fits in cache, so it tracks CPU speed
- Peak Advertised Performance (PAP) is 1.2 BIPS
- Real Application Performance (RAP) is 60 MIPS
- Similar curves for:
- DEC VAX & Alpha
- HP/PA
- IBM R6000 / PowerPC
- MIPS / SGI
- SUN
29. System SPECint vs. Price
[Scatter chart of SPECint vs. price ($) for 486@66 PCs, Pentium, Compaq (up to 16 processors), SGI L and XL, SUN 1000 and 2000, NCR 3525, 3555, and 3600 AP, Tricord ES 5K, and HP 9000.]
30. Micros Live Under the Super Curve
- Supercomputer GFLOPS went up:
- uni-processor 20x in 20 years
- SMP 600x in 20 years
- Microprocessor SPECint went up:
- CAGR between 40% and 70%
- Microprocessors meet supers:
- same clock speeds soon
- FUTURE:
- modest uniprocessor speedups
- must use multiple processors
- (or maybe 1 chip is different?)
[Chart: workstation SPECint vs. time, 1985-1995 (MicroVax, Sun) plus the Intel clock 1979-1995; compound annual growth rates of 42%, 45%, and 70% are marked.]
31. PAP vs. RAP: Max Memory Performance 10x Better
- PAP: Peak Advertised Performance
- 300MHz x 4-issue = 1.2 BIPS
- RAP: Real Application Performance on Memory Intensive Applications (MIA, i.e. commercial workloads)
- 2-4% L2 cache misses: 40 MIPS to 80 MIPS
- MIA uniprocessor RAP improved 50x in 30 years:
- CDC 6600 @ 1.4 MIPS in 1964
- Alpha @ 70 MIPS in 1994
- Microprocessors have been growing up under the memory barrier
- Mainframes have been at the memory barrier
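The PAP-to-RAP gap can be reproduced with one line of CPI arithmetic (the miss rate and miss penalty below are assumptions chosen to land in the slide's 40-80 MIPS range):

```python
# Peak vs. real MIPS for a memory-intensive application (assumed numbers).
clock_mhz, issue_width = 300, 4
pap_mips = clock_mhz * issue_width                  # 1200 "advertised" MIPS

miss_rate   = 0.03      # ~3% of instructions miss the L2 cache (assumed)
miss_cycles = 150       # DRAM access measured in CPU cycles (assumed)

cpi = 1 / issue_width + miss_rate * miss_cycles     # cycles per instruction
print(f"PAP = {pap_mips} MIPS, RAP ~ {clock_mhz / cpi:.0f} MIPS")
```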
32. Growing Up Under the Super Curve
- Cray, IBM, and Amdahl are the fastest possible (at that time, for N megabucks)
- They have GREAT! memory and IO
- Commodity systems are growing up under the super memory cloud
- Near the limit
- Interesting times ahead: use parallelism to get speedup
[Chart: Datamation sort, CPU time only.]
33. Thesis: Performance = Storage Accesses, not Instructions Executed
- In the old days we counted instructions and I/Os
- Now we count memory references
- Processors wait most of the time
[Chart: where the time goes; clock ticks used by the AlphaSort components.]
70 MIPS: real apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.
34. Storage Latency: How Far Away is the Data?
35. The Pico Processor
1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM. Event-horizon on chip. VM reincarnated. Multi-program cache. On-chip SMP.
Terror Bytes!
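One way to read "how far away is the data" is to count each latency in clock ticks; a sketch assuming a present-day 300 MHz clock and the round-number latencies from the storage-metrics slide:

```python
# Distance to data, measured in processor clock ticks.
clock_s = 1 / 300e6                      # one tick of a 300 MHz processor
levels = [("on-chip cache", 2 * clock_s),
          ("main memory",   100e-9),
          ("disk",          10e-3),
          ("nearline tape", 60.0)]
for name, latency in levels:
    print(f"{name:14s} ~ {latency / clock_s:>16,.0f} clocks away")
```

As clock rates climb, every level of the hierarchy moves further away in clock terms, which is how the pico processor ends up 10^6 clocks from bulk RAM.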
36. Masking Memory Latency
- Microprocessors got 10,000x faster & cheaper
- Main memories got 10x faster
- So... how do we get more work from memory?
- cache memory to hide latency (reuse data)
- wide memory for bandwidth
- pipelined memory access to hide latency
- SMP threads for partitioned memory access
37. DataFlow Programming: Prefetch & Postwrite Hide Latency
- Can't wait for the data to arrive (2,000 years!)
- Need a memory that gets the data in advance (100 MB/s)
- Solution:
- pipeline data to/from the processor
- pipe data from the source (tape, disc, RAM ...) to the CPU cache
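A toy prefetch pipeline (hypothetical blocks, standard library only): a producer thread keeps the next buffers in flight while the consumer computes, so source latency overlaps with useful work.

```python
import threading, queue

def producer(blocks, pipe):
    for block in blocks:                 # stands in for reads from tape/disc/RAM
        pipe.put(block)
    pipe.put(None)                       # end-of-stream marker

def consumer(pipe):
    total = 0
    while True:
        block = pipe.get()
        if block is None:
            return total
        total += sum(block)              # "compute" overlaps the next prefetch

pipe = queue.Queue(maxsize=2)            # two buffers in flight: double buffering
blocks = [list(range(1000)) for _ in range(10)]
feeder = threading.Thread(target=producer, args=(blocks, pipe))
feeder.start()
print(consumer(pipe))
feeder.join()
```

Postwrite is the mirror image: completed buffers are handed to a writer so the processor never stalls on the store.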
38. Parallel Execution Masks Latency
- Processors are pushing on the memory barrier
- MIA RAP << PAP, so learn from the FLOPS world:
- Pipeline: mask latency
- Partition: increase bandwidth
- Overlap computation with latency
39. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together
40. Thesis: Many Little Beat Few Big
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance?
41. Clusters: Connecting Many Little
[Figure: cluster nodes, each with a CPU, 5 GB of RAM, and a 50 GB disc.]
Future servers are CLUSTERS of processors and discs. Distributed database techniques make clusters work.
42. Success Stories: OLTP
- Transaction processing, client/server, and file serving have natural parallelism:
- lots of clients
- lots of small, independent requests
- Near-linear scaleup:
- supports > 10 k clients
- Examples:
- Oracle/Rdb scales to 3.7k tpsA on a 5x4 Alpha cluster
- Tandem scales to 21k tpmC on a 1x110 Tandem cluster
- Shared nothing scales best
[Chart: throughput vs. number of CPUs, from 2 to 110 CPUs, reaching 21k tpmC.]
43. Success Stories: Decision Support
- Relational databases are uniform streams of data:
- allows pipelining (much like vector processing)
- allows partitioning (by range or hash)
- Relational operators are closed under composition:
- the output of one operator can be streamed to the next operator
- Get linear scaleup on SMP and shared-nothing systems
- (Teradata, Tandem, Oracle, Informix, ...)
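Closure under composition is easy to see with streaming operators. A tiny generator sketch (hypothetical rows): each operator consumes the previous one's output a row at a time, so the whole query runs as one pipeline with no intermediate tables.

```python
# Relational operators as composable row streams.
def scan(table):
    yield from table

def select(rows, pred):
    return (r for r in rows if pred(r))

def project(rows, cols):
    return ({c: r[c] for c in cols} for r in rows)

people = [{"name": "David", "addr": "NY"},
          {"name": "Mike",  "addr": "Berk"},
          {"name": "Won",   "addr": "Austin"}]

# output of one operator feeds the next: project(select(scan(...)))
pipeline = project(select(scan(people), lambda r: r["addr"] != "NY"), ["name"])
for row in pipeline:
    print(row)
```

Partition parallelism is the other half: run one such pipeline per data partition and merge the results.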
44. Scaleables: Uneconomic So Far
- A slice is a processor, memory, and a few disks.
- The slice price of scaleables so far carries a 5x to 10x markup:
- Teradata: $70K for an Intel 486, 32MB, 4 disks
- Tandem: $100k for a MipsCo R4000, 64MB, 4 disks
- Intel: $75k for an i860, 32MB, 2 disks
- TMC: $75k for a SPARC 3, 32MB, 2 disks
- IBM/SP2: $100k for an R6000, 64MB, 8 disks
- A Compaq slice price is less than $10k
- What is the problem?
- proprietary interconnect
- proprietary packaging
- proprietary software (vendorIX)
45. Network Trends & Challenge
- Bandwidth UP 10^4, price went DOWN
- Speed-of-light and distance unchanged
- Software got worse
- Standard fast nets:
- ATM
- PCI
- Myrinet
- Tnet
- The HOPE:
- commodity net
- good software
- Then clusters become a SNAP! commodity at $10k/slice
46. Great Debate: Shared What?
- Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup (Sequent, SGI, Sun)
- Shared Disk: (VMScluster, Sysplex)
- Shared Nothing (network): hard to program, easy to build, easy to scaleup (Tandem, Teradata, SP2)
The winner will be a synthesis of these ideas; distributed shared memory (DASH, Encore) blurs the distinction.
47. Architectural Issues
- Hardware will be parallel
- What is the programming model?
- Can you hide locality? No, locality is critical
- If you build an SMP, you must still program it as shared-nothing
- Will users learn to program in parallel?
- No, successful products give automatic parallelism
- With 100s of computers, what about management?
- Administration costs $2.5k/year/PC (lowest estimate)
- The cluster must be:
- as easy to manage as a single system (it is a single system)
- faults diagnosed & masked automatically
- Message-based computation model:
- transactions
- checkpoint / restart
48. SNAP Business Issues
- Use commodity components (software & hardware):
- Intel won; compatibility is important
- ATM will probably win the LAN & WAN, but not the CAN
- NT will probably win (UNIX is too fragmented)
- SQL is winning parallel data access
- What else?
- Automatic parallel programming:
- the key to scaleability
- desktop to glass house
- Automatic management:
- the key to economics
- Palmtops and mobile may be differentiated
49. SNAP Systems circa 2000
[Diagram of the local & global data comm world: mobile nets and portables; legacy mainframe & minicomputer servers & terminals; a wide-area global ATM network; ATM & Ethernet linking PCs, workstations, and servers; person servers (PCs); scalable computers built from PCs on a CAN; centralized & departmental servers built from PCs; and TV/PC homes (CATV or ATM or satellite).]
- A space, time (bandwidth), and generation scalable environment
50. The SNAP Software Challenge
- Cluster network OS
- Automatic administration
- Automatic data placement
- Automatic parallel programming
- Parallel query optimization
- Parallel concepts, algorithms, tools
- Execution techniques: load balance, checkpoint/restart, ...
51. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together (Scaleable Networks and Platforms):
- Build clusters of commodity processors & storage
- Commodity interconnect is key (the S of PMS)
- Traditional interconnects give $100k/slice
- Commodity cluster operating system is key
- Fault isolation and tolerance is key
- Automatic parallel programming is key