Title: Jim Gray
1. Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
Jim Gray, 310 Filbert, SF CA 94133, Gray@Microsoft.com
2. MetaMessage: Technology Ratios Are Important
- If everything gets faster & cheaper at the same rate, THEN nothing really changes.
- Things getting MUCH BETTER (10^4x in 25 years):
- communication speed & cost
- processor speed & cost (PAP)
- storage size & cost
- Things getting a little better (10x in 25 years):
- storage latency & bandwidth
- real application performance (RAP)
- Things staying about the same:
- speed of light (more or less constant)
- people (10x more expensive)
3. Consequent Message
- Processing and storage are WONDERFULLY cheaper
- Storage latencies not much improved
- Must get performance (RAP) via:
- Pipeline parallelism (mask latency) and
- Partition parallelism (bandwidth and mask latency)
- Scaleable hardware/software architecture:
- Scaleable commodity network / interconnect
- Commodity hardware (processors, disks, memory)
- Commodity software (OS, PL, Apps)
- Scaleability thru automatic parallel programming:
- Manage & program as a single system
- Mask faults
4. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together
5. Moore's Law: Exponential Change Means Continual Rejuvenation
- XXX doubles every 18 months: a 60% increase per year
- Microprocessor speeds
- CMOS chip density (memory chips)
- Magnetic disk density
- Communications bandwidth
- WAN bandwidth approaching LANs
- Exponential growth:
- The past does not matter
- 10x here, 10x there, soon you're talking REAL change
6. Moore's Law for Memory
Will Moore's Law continue to hold?
7. Moore's Law for Memory
[Chart: capacity with 64Mb DRAMs; memory capacity and the number of chips required, at a memory price of $50/chip, vs. year (1970-2000) across DRAM generations from 1Kbit to 256Mb; capacities run from 128KB up to 8GB, with the 640KB DOS limit marked.]
8. Trends: Storage Got Cheaper
- $/byte got 10^4 better
- $/access got 10^3 better
- capacity grew 10^3
- latency down 10x
- bandwidth up 10x
9. Partition Parallelism Gives Bandwidth
- Parallelism: use many little devices in parallel
- Solves the bandwidth problem
- Beware of the media myth
- Beware of the access time myth
At 10 MB/s it takes 1.2 days to scan a terabyte; 1,000x parallelism scans it in about 2 minutes.
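A quick back-of-the-envelope check of the scan figures above (a sketch in Python; the 1 TB size is the terabyte scale the rest of the talk uses):

```python
# Scan-time arithmetic: one 10 MB/s stream vs. 1,000 partitions read in parallel.
TB = 10**12                 # bytes
RATE = 10 * 10**6           # 10 MB/s from a single device

single_stream = TB / RATE                   # seconds
parallel_1000 = single_stream / 1000        # 1,000-way partitioned scan

print(f"single stream : {single_stream / 86400:.1f} days")   # ~1.2 days
print(f"1,000 streams : {parallel_1000 / 60:.1f} minutes")   # ~1.7 minutes
```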
10. Partitioned Data Has Natural Parallelism
Split a SQL table across many disks, memories, and processors.
Partition and/or replicate data to get parallel disk access.
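A minimal sketch of that idea (the table, key column, and partition count are hypothetical): hash-partition the rows so each disk or processor scans only its own fragment.

```python
# Hash-partition a table's rows across N partitions (e.g. one per disk).
from zlib import crc32

N = 4
partitions = [[] for _ in range(N)]

def insert(row):
    # the partition is chosen by hashing the key column
    partitions[crc32(row["name"].encode()) % N].append(row)

for row in [{"name": "David", "addr": "NY"},
            {"name": "Mike",  "addr": "Berk"},
            {"name": "Won",   "addr": "Austin"}]:
    insert(row)

# a "parallel scan" is now N independent scans, one per partition
for i, part in enumerate(partitions):
    print(f"partition {i}: {part}")
```

Range partitioning works the same way, with the key compared against split points instead of hashed.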
11. Today's Storage Hierarchy: Speed & Capacity vs. Cost Tradeoffs
[Two charts over access times from 10^-9 to 10^3 seconds: price ($/MB) vs. speed and size (bytes) vs. speed, for cache, main memory, secondary (disc), online tape, nearline tape, and offline tape.]
12. Trends: Application Storage Demand Grew
- The New World:
- billions of objects
- big objects (1MB)
- The Old World:
- millions of objects
- 100-byte objects
Paperless office, Library of Congress online, all information online: entertainment, publishing, business. Information Network, Knowledge Navigator, Information at Your Fingertips.
[Figure: an old-world People table (Name, Address: David/NY, Mike/Berk, Won/Austin) beside a new-world People table that adds Picture, Voice, and Papers columns.]
13. Good News: Electronic Storage Ratios Beat Paper
- File cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft2) $180; total $700, about 3 cents/sheet
- Disk: an 8 GB disk is $4,000; ASCII: 4 million pages at 0.1 cents/sheet (30x cheaper); image: 200 k pages at 2 cents/sheet (similar to paper)
- Store everything on disk
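The per-sheet figures follow from simple division over the prices above (a re-derivation sketch):

```python
# Cost per sheet/page: a paper file cabinet vs. an 8 GB disk.
paper_total = 250 + 250 + 180                   # cabinet + paper + floor space, $
print(f"paper : {100 * paper_total / 24_000:.1f} cents/sheet")   # ~3 cents

disk = 4_000                                    # 8 GB disk, $
print(f"ASCII : {100 * disk / 4_000_000:.2f} cents/page")        # ~0.1 cents
print(f"image : {100 * disk / 200_000:.1f} cents/page")          # ~2 cents
```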
14. What's a Terabyte?
1 Terabyte =
1,000,000,000 business letters (150 miles of bookshelf)
100,000,000 book pages (15 miles of bookshelf)
50,000,000 FAX images (7 miles of bookshelf)
10,000,000 TV pictures, mpeg (10 days of video)
4,000 LandSat images
The Library of Congress (in ASCII) is 25 TB.
1980: $200 M of disc (10,000 discs); $5 M of tape silo (10,000 tapes)
1994: $1 M of magnetic disc (120 discs); $500 K of optical disc robot (250 platters); $50 K of tape silo (50 tapes)
Terror Byte!!
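The object sizes implied by those counts are just 10^12 bytes divided by the number of objects (a sketch, not part of the slide):

```python
# Implied bytes per object behind the "what's a terabyte" equivalences.
TB = 10**12
for what, count in [("business letter", 1_000_000_000),
                    ("book page",         100_000_000),
                    ("FAX image",          50_000_000),
                    ("MPEG TV picture",    10_000_000),
                    ("LandSat image",           4_000)]:
    print(f"{what:16s} ~ {TB // count:>12,} bytes each")
```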
15. Standard Storage Metrics
- Capacity:
- RAM: MB and $/MB; today at 10MB and $100/MB
- Disk: GB and $/GB; today at 5GB and $500/GB
- Tape: TB and $/TB; today at 0.1TB and $50k/TB (nearline)
- Access time (latency):
- RAM: 100 ns
- Disk: 10 ms
- Tape: 30 second pick, 30 second position
- Transfer rate:
- RAM: 1 GB/s
- Disk: 5 MB/s ... arrays can go to 1 GB/s
- Tape: 5 MB/s ... arrays can go to 100 MB/s
16. New Storage Metrics: KOXs, MOXs, GOXs, SCANs?
- KOX: how many kilobyte objects served per second
- the file server / transaction processing metric
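MOX, GOX, and SCANs (used on the tape-farm slides that follow) are the analogous rates for megabyte objects, gigabyte objects, and full scans of the device. A hedged sketch of how such metrics fall out of a device's access time, transfer rate, and capacity (the device numbers below are illustrative assumptions, not figures from the talk):

```python
# KOX / MOX / GOX: 1 KB / 1 MB / 1 GB objects served per second; SCANS per day.
def storage_metrics(access_s, mb_per_s, capacity_gb):
    serve = lambda size_mb: 1.0 / (access_s + size_mb / mb_per_s)
    kox, mox, gox = serve(0.001), serve(1.0), serve(1000.0)
    scans_per_day = 86_400 * mb_per_s / (capacity_gb * 1000.0)
    return kox, mox, gox, scans_per_day

print(storage_metrics(0.010, 5,  5))    # a 5 GB disk: 10 ms access, 5 MB/s
print(storage_metrics(60.0,  5, 20))    # a 20 GB tape: ~1 minute pick + position
```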
17. Trends: Storage Bandwidth Improved Little
[Charts, 1960-2000: transfer rates for RAM, disk, and tape (B/s) and for LANs & WANs (b/s) improved little, while processor speeds (instructions/s) sped up dramatically.]
18. Tape & Optical: Beware of the Media Myth
Optical is cheap: $200/platter, 2 GB/platter => $100/GB (5x cheaper than disc).
Tape is cheap: $30/tape, 20 GB/tape => $1.5/GB (700x cheaper than disc).
19. Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot ($10 k ... $3 m) with 10 ... 1000 tapes (at 20GB each) => $20/GB ... $200/GB (5x ... 50x cheaper than disc).
Optical needs a robot ($100 k) with 100 platters = 200GB (TODAY) => $550/GB (same price as disc).
Robots have poor access times. Not good for the Library of Congress (25TB). Data motel: data checks in but it never checks out!
20. The Access Time Myth
- The myth: seek or pick time dominates
- Reality: (1) queueing dominates
- (2) transfer dominates for BLOBs
- (3) disk seeks are often short
- Implication: many cheap servers are better than one fast, expensive server
- shorter queues
- parallel transfer
- lower cost/access and cost/byte
- This is now obvious for disk arrays
- It will become obvious for tape arrays
[Pie charts: a request's time split among wait (queueing), seek, rotate, and transfer.]
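A small M/M/1-style sketch of the queueing point: for a fixed per-access service time, response time is service/(1 - utilization), so spreading the same request stream over many cheap disks collapses the queues (the numbers are assumptions for illustration):

```python
# Mean M/M/1 response time = service / (1 - utilization).
service_ms = 12.0          # one disk access: seek + rotate + transfer (assumed)
offered_load = 0.9         # utilization if a single disk served all requests

for n_disks in (1, 4, 16):
    util = offered_load / n_disks
    resp = service_ms / (1 - util)
    print(f"{n_disks:2d} disk(s): utilization {util:.2f} -> response {resp:6.1f} ms")
```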
21. The Disk Farm on a Card
- The 100GB disc card (a 14-inch card): an array of discs
- Can be used as:
- 100 discs
- 1 striped disc
- 10 fault-tolerant discs
- ... etc
- LOTS of accesses/second and bandwidth
Life is cheap, it's the accessories that cost ya. Processors are cheap, it's the peripherals that cost ya.
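A minimal striping sketch (hypothetical stripe size): a logical byte offset maps round-robin to a (disc, offset) pair, so one big sequential read fans out across all the discs on the card.

```python
# Round-robin striping of a logical address space across a farm of discs.
N_DISCS = 100
STRIPE = 64 * 1024                        # bytes per stripe unit (assumed)

def locate(byte_offset):
    stripe_no = byte_offset // STRIPE
    disc = stripe_no % N_DISCS
    disc_offset = (stripe_no // N_DISCS) * STRIPE + byte_offset % STRIPE
    return disc, disc_offset

print(locate(0))                          # (0, 0)
print(locate(10 * STRIPE + 100))          # (10, 100)
print(locate(100 * STRIPE))               # wraps back to disc 0, next stripe row
```

Spending some of the discs on parity or mirror copies gives the "10 fault-tolerant discs" configuration above.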
22. Tertiary Storage: Tape Farms, Not Mainframe Silos
[Figure: a $10K robot with 10 tapes holds 200 GB at 6 MB/s and $50/GB; it scans in 10 hours and delivers about 30 MOX and 15 GOX. Many independent tape robots (like a disc farm): 100 robots cost $1M, hold 20TB at $50/GB, and deliver 3K MOX, 1.5K GOX, and 2.5 scans.]
23. The Metrics: Disk and Tape Farms Win
Data Motel: data checks in, but it never checks out.
[Log-scale bar chart comparing GB/K$, KOX, MOX, GOX, and SCANS/day (values from 0.01 to 1,000,000) for a 1000x disc farm, a 100x DLT tape farm, and an STC tape robot with 6,000 tapes and 8 readers.]
24. Accesses per $ (3-year life)
[Bar chart: KOX, MOX, and GOX per $ and SCANS per k$ over a 3-year life, for a 1000x disc farm, an STC tape robot (6,000 tapes, 16 readers), and a 100x DLT tape farm; values span 0.1 up to 540,000.]
25. Summary (of storage)
- Capacity and cost are improving fast (100x per decade)
- Accesses are getting larger (MOX, GOX, SCANS)
- BUT latencies and bandwidth are not improving much (3x per decade)
- How to deal with this?
- Bandwidth: use partitioned parallel access (disk & tape farms)
- Latency: pipeline data up the storage hierarchy (next section)
26. Interesting Storage Ratios
- Disk is back to 100x cheaper than RAM
- Nearline tape is only 10x cheaper than disk, and the gap is closing!
[Chart, 1960-2000: the ratios RAM $/MB : Disk $/MB and Disk $/MB : Nearline Tape $/MB, on a scale from 100:1 down to 1:1; annotations: "Disk & DRAM look good", "??? Why bother with tape".]
27. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together
28. Microprocessor Speeds Went Up Fast
- Clock rates went from 10KHz to 300MHz
- Processors are now 4-issue
- SPECInt92 fits in cache, so it tracks CPU speed
- Peak Advertised Performance (PAP) is 1.2 BIPS
- Real Application Performance (RAP) is 60 MIPS
- Similar curves for:
- DEC VAX & Alpha
- HP/PA
- IBM R6000 / PowerPC
- MIPS / SGI
- SUN
29. System SPECint vs. Price
[Scatter chart of SPECint vs. price ($) for 486@66 PCs, Pentium, Compaq (up to 16 processors), SGI L and XL, SUN 1000 and 2000, NCR 3525, 3555, and 3600 AP, Tricord ES 5K, and HP 9000.]
30. Micros Live Under the Super Curve
- Supercomputer GFLOPS went up:
- uni-processor 20x in 20 years
- SMP 600x in 20 years
- Microprocessor SPECint went up:
- CAGR between 40% and 70%
- Microprocessors meet supers:
- same clock speeds soon
- FUTURE:
- modest uniprocessor speedups
- must use multiple processors
- (or maybe 1 chip is different?)
[Chart: workstation SPECint vs. time, 1985-1995 (MicroVax, Sun) plus the Intel clock 1979-1995; compound annual growth rates of 42%, 45%, and 70% are marked.]
31. PAP vs. RAP: Max Memory Performance 10x Better
- PAP: Peak Advertised Performance
- 300MHz x 4-issue = 1.2 BIPS
- RAP: Real Application Performance on Memory Intensive Applications (MIA, i.e. commercial workloads)
- 2-4% L2 cache misses: 40 MIPS to 80 MIPS
- MIA uniprocessor RAP improved 50x in 30 years:
- CDC 6600 @ 1.4 MIPS in 1964
- Alpha @ 70 MIPS in 1994
- Microprocessors have been growing up under the memory barrier
- Mainframes have been at the memory barrier
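The PAP-to-RAP gap can be reproduced with one line of CPI arithmetic (the miss rate and miss penalty below are assumptions chosen to land in the slide's 40-80 MIPS range):

```python
# Peak vs. real MIPS for a memory-intensive application (assumed numbers).
clock_mhz, issue_width = 300, 4
pap_mips = clock_mhz * issue_width                  # 1200 "advertised" MIPS

miss_rate   = 0.03      # ~3% of instructions miss the L2 cache (assumed)
miss_cycles = 150       # DRAM access measured in CPU cycles (assumed)

cpi = 1 / issue_width + miss_rate * miss_cycles     # cycles per instruction
print(f"PAP = {pap_mips} MIPS, RAP ~ {clock_mhz / cpi:.0f} MIPS")
```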
32. Growing Up Under the Super Curve
- Cray, IBM, and Amdahl are the fastest possible (at that time, for N megabucks)
- They have GREAT! memory and IO
- Commodity systems are growing up under the super memory cloud
- Near the limit
- Interesting times ahead: use parallelism to get speedup
[Chart: Datamation sort, CPU time only.]
33. Thesis: Performance = Storage Accesses, not Instructions Executed
- In the old days we counted instructions and I/Os
- Now we count memory references
- Processors wait most of the time
[Chart: where the time goes; clock ticks used by the AlphaSort components.]
70 MIPS: real apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.
34. Storage Latency: How Far Away is the Data?
35. The Pico Processor
1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM. Event-horizon on chip. VM reincarnated. Multi-program cache. On-chip SMP.
Terror Bytes!
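One way to read "how far away is the data" is to count each latency in clock ticks; a sketch assuming a present-day 300 MHz clock and the round-number latencies from the storage-metrics slide:

```python
# Distance to data, measured in processor clock ticks.
clock_s = 1 / 300e6                      # one tick of a 300 MHz processor
levels = [("on-chip cache", 2 * clock_s),
          ("main memory",   100e-9),
          ("disk",          10e-3),
          ("nearline tape", 60.0)]
for name, latency in levels:
    print(f"{name:14s} ~ {latency / clock_s:>16,.0f} clocks away")
```

As clock rates climb, every level of the hierarchy moves further away in clock terms, which is how the pico processor ends up 10^6 clocks from bulk RAM.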
36. Masking Memory Latency
- Microprocessors got 10,000x faster & cheaper
- Main memories got 10x faster
- So... how do we get more work from memory?
- cache memory to hide latency (reuse data)
- wide memory for bandwidth
- pipelined memory access to hide latency
- SMP threads for partitioned memory access
37. DataFlow Programming: Prefetch & Postwrite Hide Latency
- Can't wait for the data to arrive (2,000 years!)
- Need a memory that gets the data in advance (100 MB/s)
- Solution:
- pipeline data to/from the processor
- pipe data from the source (tape, disc, RAM ...) to the CPU cache
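A toy prefetch pipeline (hypothetical blocks, standard library only): a producer thread keeps the next buffers in flight while the consumer computes, so source latency overlaps with useful work.

```python
import threading, queue

def producer(blocks, pipe):
    for block in blocks:                 # stands in for reads from tape/disc/RAM
        pipe.put(block)
    pipe.put(None)                       # end-of-stream marker

def consumer(pipe):
    total = 0
    while True:
        block = pipe.get()
        if block is None:
            return total
        total += sum(block)              # "compute" overlaps the next prefetch

pipe = queue.Queue(maxsize=2)            # two buffers in flight: double buffering
blocks = [list(range(1000)) for _ in range(10)]
feeder = threading.Thread(target=producer, args=(blocks, pipe))
feeder.start()
print(consumer(pipe))
feeder.join()
```

Postwrite is the mirror image: completed buffers are handed to a writer so the processor never stalls on the store.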
38. Parallel Execution Masks Latency
- Processors are pushing on the memory barrier
- MIA RAP << PAP, so learn from the FLOPS world:
- Pipeline: mask latency
- Partition: increase bandwidth
- Overlap computation with latency
39. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together
40. Thesis: Many Little Beat Few Big
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance?
41. Clusters: Connecting Many Little
[Figure: cluster nodes, each with a CPU, 5 GB of RAM, and a 50 GB disc.]
Future servers are CLUSTERS of processors and discs. Distributed database techniques make clusters work.
42. Success Stories: OLTP
- Transaction processing, client/server, and file serving have natural parallelism:
- lots of clients
- lots of small, independent requests
- Near-linear scaleup:
- supports > 10 k clients
- Examples:
- Oracle/Rdb scales to 3.7k tpsA on a 5x4 Alpha cluster
- Tandem scales to 21k tpmC on a 1x110 Tandem cluster
- Shared nothing scales best
[Chart: throughput vs. number of CPUs, from 2 to 110 CPUs, reaching 21k tpmC.]
43. Success Stories: Decision Support
- Relational databases are uniform streams of data:
- allows pipelining (much like vector processing)
- allows partitioning (by range or hash)
- Relational operators are closed under composition:
- the output of one operator can be streamed to the next operator
- Get linear scaleup on SMP and shared-nothing systems
- (Teradata, Tandem, Oracle, Informix, ...)
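Closure under composition is easy to see with streaming operators. A tiny generator sketch (hypothetical rows): each operator consumes the previous one's output a row at a time, so the whole query runs as one pipeline with no intermediate tables.

```python
# Relational operators as composable row streams.
def scan(table):
    yield from table

def select(rows, pred):
    return (r for r in rows if pred(r))

def project(rows, cols):
    return ({c: r[c] for c in cols} for r in rows)

people = [{"name": "David", "addr": "NY"},
          {"name": "Mike",  "addr": "Berk"},
          {"name": "Won",   "addr": "Austin"}]

# output of one operator feeds the next: project(select(scan(...)))
pipeline = project(select(scan(people), lambda r: r["addr"] != "NY"), ["name"])
for row in pipeline:
    print(row)
```

Partition parallelism is the other half: run one such pipeline per data partition and merge the results.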
44. Scaleables: Uneconomic So Far
- A slice is a processor, memory, and a few disks.
- The slice price of scaleables so far carries a 5x to 10x markup:
- Teradata: $70K for an Intel 486, 32MB, 4 disks
- Tandem: $100k for a MipsCo R4000, 64MB, 4 disks
- Intel: $75k for an i860, 32MB, 2 disks
- TMC: $75k for a SPARC 3, 32MB, 2 disks
- IBM/SP2: $100k for an R6000, 64MB, 8 disks
- A Compaq slice price is less than $10k
- What is the problem?
- proprietary interconnect
- proprietary packaging
- proprietary software (vendorIX)
45. Network Trends & Challenge
- Bandwidth UP 10^4, price went DOWN
- Speed-of-light and distance unchanged
- Software got worse
- Standard fast nets:
- ATM
- PCI
- Myrinet
- Tnet
- The HOPE:
- commodity net
- good software
- Then clusters become a SNAP! commodity at $10k/slice
46. Great Debate: Shared What?
- Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup (Sequent, SGI, Sun)
- Shared Disk: (VMScluster, Sysplex)
- Shared Nothing (network): hard to program, easy to build, easy to scaleup (Tandem, Teradata, SP2)
The winner will be a synthesis of these ideas; distributed shared memory (DASH, Encore) blurs the distinction.
47. Architectural Issues
- Hardware will be parallel
- What is the programming model?
- Can you hide locality? No, locality is critical
- If you build an SMP, you must still program it as shared-nothing
- Will users learn to program in parallel?
- No, successful products give automatic parallelism
- With 100s of computers, what about management?
- Administration costs $2.5k/year/PC (lowest estimate)
- The cluster must be:
- as easy to manage as a single system (it is a single system)
- faults diagnosed & masked automatically
- Message-based computation model:
- transactions
- checkpoint / restart
48. SNAP Business Issues
- Use commodity components (software & hardware):
- Intel won; compatibility is important
- ATM will probably win the LAN & WAN, but not the CAN
- NT will probably win (UNIX is too fragmented)
- SQL is winning parallel data access
- What else?
- Automatic parallel programming:
- the key to scaleability
- desktop to glass house
- Automatic management:
- the key to economics
- Palmtops and mobile may be differentiated
49. SNAP Systems circa 2000
[Diagram of the local & global data comm world: mobile nets and portables; legacy mainframe & minicomputer servers & terminals; a wide-area global ATM network; ATM & Ethernet linking PCs, workstations, and servers; person servers (PCs); scalable computers built from PCs on a CAN; centralized & departmental servers built from PCs; and TV/PC homes (CATV or ATM or satellite).]
- A space, time (bandwidth), and generation scalable environment
50. The SNAP Software Challenge
- Cluster network OS
- Automatic administration
- Automatic data placement
- Automatic parallel programming
- Parallel query optimization
- Parallel concepts, algorithms, tools
- Execution techniques: load balance, checkpoint/restart, ...
51. Outline
- Storage trends force pipeline & partition parallelism
- Lots of bytes & bandwidth per dollar
- Lots of latency
- Processor trends force pipeline & partition parallelism
- Lots of MIPS per dollar
- Lots of processors
- Putting it together (Scaleable Networks and Platforms):
- Build clusters of commodity processors & storage
- Commodity interconnect is key (the S of PMS)
- Traditional interconnects give $100k/slice
- Commodity cluster operating system is key
- Fault isolation and tolerance is key
- Automatic parallel programming is key